lp-arch - VADA
Low Power Architecture Design
1999. 8. 2
Prof. Jun-Dong Cho (조준동), SungKyunKwan University
http://vada.skku.ac.kr
Architectural-level Synthesis
• Translate HDL models into sequencing graphs.
• Behavioral-level optimization:
– Optimize abstract models independently of the implementation parameters.
• Architectural synthesis and optimization:
– Create the macroscopic structure: data-path and control unit.
– Consider area and delay information of the implementation.
• Hardware compilation (see the sketch after this list):
– Compile the HDL model into a sequencing graph.
– Optimize the sequencing graph.
– Generate the gate-level interconnection for a cell library.
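A toy sketch of the sequencing-graph data structure these steps operate on, assuming a plain DAG of operation nodes between NOP source/sink vertices; the example expression and the ASAP labeling pass are illustrative, not taken from the slides.

```python
# Toy sequencing graph for x = (a + b) * (c + d): operation nodes plus NOP
# source/sink, edges = data dependencies. ASAP scheduling labels each node
# with its earliest control step.

graph = {
    "nop_src": ["add1", "add2"],
    "add1": ["mul"],
    "add2": ["mul"],
    "mul": ["nop_sink"],
    "nop_sink": [],
}

def asap(graph):
    step = {"nop_src": 0}
    for node in graph:                    # insertion order is topological here
        for succ in graph[node]:
            step[succ] = max(step.get(succ, 0), step[node] + 1)
    return step

print(asap(graph))
# {'nop_src': 0, 'add1': 1, 'add2': 1, 'mul': 2, 'nop_sink': 3}
```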
Architecture-Level Solutions
• Architecture-driven voltage scaling: choose a more parallel architecture. Lowering Vdd reduces energy but increases delay (see the sketch after this list).
• Regularity: minimizes the power consumed in the control hardware and the interconnection network.
• Modularity: exploits data locality through distributed processing units, memories, and control.
– Spatial locality: an algorithm can be partitioned into natural clusters based on connectivity.
– Temporal locality: short average lifetimes of variables (less temporary storage; data referenced in the recent past has a high probability of future accesses).
• Few memory references: references to memories are expensive in terms of power.
• Precompute the physical capacitance of the interconnect and the switching activity (number of bus accesses).
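A minimal numeric sketch of architecture-driven voltage scaling, assuming the first-order models P = C·f·Vdd² and delay ∝ Vdd/(Vdd − Vt)²; the threshold voltage, the 2.2x capacitance overhead for duplication and muxing, and the 5 V reference supply are illustrative assumptions, not slide data.

```python
# Architecture-driven voltage scaling sketch (assumed first-order models):
# P = C * f * Vdd^2 (dynamic power), delay ~ Vdd / (Vdd - Vt)^2.

VT = 0.8  # threshold voltage in volts (illustrative assumption)

def delay(vdd):
    """Relative gate delay at supply voltage vdd (first-order model)."""
    return vdd / (vdd - VT) ** 2

def power(c, f, vdd):
    """First-order dynamic power: P = C * f * Vdd^2."""
    return c * f * vdd ** 2

# Reference datapath: capacitance C, frequency f, Vdd = 5 V.
C, F, VDD = 1.0, 1.0, 5.0
p_ref = power(C, F, VDD)

# Parallel datapath: duplicated hardware (~2.2x capacitance, assumed), each
# copy clocked at f/2, so Vdd can drop until the slower gates still meet the
# original throughput: delay(v) <= 2 * delay(VDD).
v = VDD
while delay(v) <= 2 * delay(VDD):
    v -= 0.01
v += 0.01  # back up to the last feasible voltage
p_par = power(2.2 * C, F / 2, v)

print(f"Vdd can scale to ~{v:.2f} V; power ratio {p_par / p_ref:.2f}")
```

Under these assumptions the parallel version meets the original throughput at roughly 3.2 V, cutting power by more than half despite the duplicated hardware.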
Power Measure of P
Architecture Trade-off
Reference Data Path
Parallel Data Path
Pipelined Data Path
A Simple Data Path, Result
Uni-processor Implementation
Multi-Processor Implementation
Datapath Parallelization
FIR Parallelization
Mahesh Mehendale, Sunil D. Sherlekar, G. Venkatesh, "Low-Power Realization of FIR Filters on Programmable DSPs," IEEE Transactions on Very Large Scale Integration (VLSI) Systems, Vol. 6, No. 4, December 1998.
Memory Parallelization
To first order, P = C · f/2 · Vdd².
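A hedged worked example of what this first-order model says about splitting one memory into two interleaved banks; the 5 V → 3.3 V scaling enabled by the relaxed access time is an illustrative assumption.

```latex
% Per-access capacitance C is assumed unchanged; each bank runs at half
% the access rate, and the slower cycle lets the supply voltage drop:
\[
P_{\text{single}} = C \cdot \tfrac{f}{2} \cdot V_{dd}^{2}, \qquad
P_{\text{dual}}   = 2 \cdot C \cdot \tfrac{f/2}{2} \cdot V_{dd,\text{low}}^{2}
                  = C \cdot \tfrac{f}{2} \cdot V_{dd,\text{low}}^{2}
\]
\[
\frac{P_{\text{dual}}}{P_{\text{single}}}
  = \left(\frac{V_{dd,\text{low}}}{V_{dd}}\right)^{2}
  = \left(\frac{3.3}{5.0}\right)^{2} \approx 0.44
\]
```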
VLIW Architecture
VLIW - cont.
• The compiler takes responsibility for finding the operations that can be issued in parallel and creating a single very long instruction containing these operations.
• VLIW instruction decoding is easier than superscalar instruction decoding due to the fixed format and the absence of instruction dependencies.
• The fixed format may, however, limit which combinations of operations are possible.
• Intel P6: CISC instructions are combined on chip into a set of micro-operations (i.e., a long instruction word) that can be executed in parallel.
• As power becomes a major issue in the design of fast microprocessors, the simpler architecture is the better one.
• VLIW architectures, being simpler than N-issue superscalar machines, can be considered promising for achieving high speed and low power simultaneously.
Synchronous vs. Asynchronous
• Synchronous system: a signal path starts from a clocked flip-flop, passes through combinational gates, and ends at another clocked flip-flop. The clock signals do not participate in computation but are required for synchronization. As technology advances, systems tend to get bigger and bigger, and the delay on the clock wires can no longer be ignored; clock skew is thus becoming a bottleneck for many system designers. Many gates switch unnecessarily just because they are connected to the clock, not because they have to process new inputs, and the biggest gate is the clock driver itself, which must switch constantly.
• Asynchronous system (self-timed): an input signal (request) starts the computation on a module, and an output signal (acknowledge) signifies the completion of the computation and the availability of the requested data. Asynchronous systems can potentially respond to transitions on any of their inputs at any time, since they have no clock with which to sample their inputs.
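A minimal trace of the request/acknowledge discipline just described, modeled as a four-phase (return-to-zero) handshake; the function names and the four-phase choice are illustrative assumptions (two-phase signaling is equally possible).

```python
# Four-phase request/acknowledge handshake sketch: the module starts on the
# request, detects its own completion, and answers with an acknowledge; no
# clock samples the inputs.

def module_compute(x):
    return x * x + 1            # the self-timed module's computation

def four_phase_transfer(x):
    trace = ["req=1  sender: input valid, start computing"]
    result = module_compute(x)  # module works; completion is self-detected
    trace += ["ack=1  module: result available",
              "req=0  sender: result consumed, request withdrawn",
              "ack=0  module: ready for the next request"]
    return result, trace

result, trace = four_phase_transfer(6)
print(result)                   # -> 37
print("\n".join(trace))
```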
Asynchronous - Cont.
• More difficult to implement, requiring explicit synchronization between communicating blocks without clocks.
• If a signal feeds directly into conventional gate-level circuitry, invalid logic levels can propagate throughout the system.
• Glitches, which are filtered out by the clock in synchronous designs, may cause an asynchronous design to malfunction.
• Asynchronous designs are not widely used because designers cannot find the supporting design tools and methodologies they need.
• The DCC error corrector of the compact cassette player saves 80% of the power compared to its synchronous counterpart.
• Offers more architectural options/freedom: encourages distributed, localized control and offers more freedom to adapt the supply voltage.
• S. Furber, M. Edwards, "Asynchronous Design Methodologies," 1993.
Asynchronous design with adaptive scaling of the supply voltage
(a) Synchronous system; (b) asynchronous system with adaptive scaling of the supply voltage.
Asynchronous Pipeline
PIPELINED SELF-TIMED MICROPROCESSOR
Hazard-free Circuits
6% more logic
Through WAVE PIPELINING
Wave-pipelining on FPGA
• Problems with conventional pipelining (see the timing sketch after this list):
– Balanced partitioning
– Delay-element overhead
– Tclk > Tmax - Tmin + clock skew + setup/hold time
– Increased area, power, and overall latency
– Clock distribution problem
• Wave pipelining = high throughput without such overhead = ideal pipelining
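A quick numeric check of the clock-period constraint above, using the max/min path delays reported for the wave-pipelined multiplier in the experimental-results table later in this section; the 3 ns skew + setup/hold margin is an illustrative assumption.

```python
# Wave-pipelining clock constraint: Tclk > Tmax - Tmin + skew + setup/hold.
# Path delays are from the FPGA experiment table in this section; the 3 ns
# margin for skew + setup/hold is an illustrative assumption.

t_max = 68.969   # ns, max path delay of the wave-pipelined multiplier
t_min = 52.356   # ns, min path delay
margin = 3.0     # ns, assumed clock skew + setup/hold

t_clk = t_max - t_min + margin
print(f"Tclk > {t_clk:.3f} ns  ->  fmax < {1e3 / t_clk:.1f} MHz")
# ~19.6 ns -> ~51 MHz, consistent with the 50 MHz reported in the table.
```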
FPGA on WavePipeline
• LUT delay is similar across different logic functions.
• Equal path delays can therefore be constructed.
• FPGA element delays (wire, LUT, interconnection)
• Powerful layout editor
• Fast design cycle
WP advantages
• Area efficient - no registers, clock distribution network, or clock buffers needed.
• Low power dissipation
• Higher throughput
• Low latency
Disadvantages
• Degraded performance in certain cases
• Difficult to achieve the sharp rise and fall times of synchronous designs
• Layout is critical for balancing the delays
• Parameter variation - power-supply and temperature dependence
Experimental Results

                  No pipeline   Conventional pipeline     Wave pipeline
Registers         0             286                       28
Max. path delay   74.188 ns     12.730 ns                 68.969 ns
Min. path delay   -             9.0 ns                    52.356 ns
Max. frequency    13.5 MHz      78.6 MHz                  50 MHz
CLB count         49            143                       148
Latency           75 ns         169 ns (13 clk)           80 ns
Power             19.6 mW/MHz   76.8 mW/MHz + clock drv.  64.8 mW/MHz

By 이재형, SKKU
Observation
• The WP multiplier gained little in power consumption because many extra LUTs had to be added to balance the delays.
• Using dedicated delay elements on the FPGA to adjust delay, rather than LUTs or net delays, would be more effective.
• Also, designing a multiplier whose paths all have the same number of logic levels would make WP implementation easy and should yield large power and area gains over the pipelined structure.
VON NEUMANN VERSUS HARVARD
Power vs. Area of a Micro-coded Microprocessor
At 1.5 V and a 10 MHz clock rate, instruction and data memory accesses account for 47% of the total power consumption.
Memory Architecture
Exploiting Locality for Low-Power Design
• A spatially local cluster: a group of algorithm operations that are tightly connected to each other in the flow-graph representation.
• Two nodes are tightly connected on the flow graph if the shortest distance between them, in terms of the number of edges traversed, is low.
• Power consumption (mW) in the maximally time-shared and fully parallel versions of the QMF sub-band coder filter: an improvement by a factor of 10.5 at the expense of a 20% increase in area.
• The interconnect elements (buses, multiplexers, and buffers) consume 43% and 28% of the total power in the time-shared and parallel versions, respectively.
Cascade filter layouts
(a) Non-local implementation from Hyper; (b) local implementation from Hyper-LP.
Frequency Multipliers and Dividers
Low Power DSP
• Most of the execution time is spent in DO-loops:
– VSELP vocoder: 83.4%
– 2D 8x8 DCT: 98.3%
– LPC computation: 98.0%
• Power minimization of the DO-loops ==> power minimization of the DSP
VSELP: Vector Sum Excited Linear Prediction
LPC: Linear Prediction Coding
Low Power DSP
• Instruction buffer (or cache)
– Exploits locality.
– Reduces accesses to program memory.
• Decoded Instruction Buffer (see the sketch after this list)
– Stores the decoding result of a loop's first iteration in a RAM and reuses it.
– Eliminates the fetch/decode steps.
– 30-40% power saving.
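A behavioral sketch of the Decoded Instruction Buffer idea; the unit energy costs and the loop body are illustrative assumptions, and only fetch/decode energy is modeled, which is why the saving printed here is far larger than the 30-40% total-power figure quoted above.

```python
# Decoded Instruction Buffer (DIB) sketch: decode a loop body once, then
# replay the stored decoded signals on every later iteration.

FETCH_COST, DECODE_COST = 1.0, 1.0   # illustrative unit energies

def run_loop(body, iterations, use_dib):
    energy, dib = 0.0, {}
    for _ in range(iterations):
        for pc, instr in enumerate(body):
            if use_dib and pc in dib:
                decoded = dib[pc]            # replay: no fetch, no decode
            else:
                energy += FETCH_COST         # fetch from program memory
                decoded = ("decoded", instr) # decode stage
                energy += DECODE_COST
                if use_dib:
                    dib[pc] = decoded        # fill DIB on the first iteration
    return energy

body = ["mac r0, r1, r2", "add r3, r3, #1", "cmp r3, r4", "bne loop"]
e_plain = run_loop(body, iterations=100, use_dib=False)
e_dib = run_loop(body, iterations=100, use_dib=True)
print(f"fetch/decode energy saved: {1 - e_dib / e_plain:.1%}")  # -> 99.0%
```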
Stage-Skip Pipeline
• The power saving is achieved by stopping the instruction fetch and decode stages of the processor during loop execution, except for the loop's first iteration.
• DIB = Decoded Instruction Buffer
• About 40% power saving on a DSP or RISC processor.
Stage-Skip Pipeline
• Selector: selects the output from either the instruction decoder or the DIB.
• The decoded instruction signals for a loop are temporarily stored in the DIB and reused in each iteration of the loop.
• The power wasted in the conventional pipeline is saved in this pipeline by stopping instruction fetching and decoding for each loop execution.
Stage-Skip Pipeline
The majority of execution cycles in signal-processing programs are used for loop execution: a 40% reduction in power with a 2% increase in area.
Optimizing Power Using Transformations
• Input flowgraph → output flowgraph
• Local transformation primitives: associativity, distributivity, retiming, common sub-expression
• Global transformation primitives: retiming, pipelining, look-ahead, associativity
• Search mechanism: rejectionless simulated annealing, steepest descent, heuristics
• Power estimation
Data-flow based transformations
• Tree-height reduction.
• Constant and variable propagation.
• Common sub-expression elimination.
• Code motion.
• Dead-code elimination.
• The application of algebraic laws such as commutativity, distributivity, and associativity.
• Most of the parallelism in an algorithm is embodied in its loops: loop jamming, partial and complete loop unrolling, strength reduction, loop retiming, and software pipelining.
• Retiming: maximizes resource utilization.
Tree-height reduction
• Example of tree-height reduction using commutativity and associativity.
• Example of tree-height reduction using distributivity.
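Since the figures are not reproduced here, a small executable sketch of both cases; the specific expressions (a chained sum and a*(b*c*d + e)) are illustrative textbook assumptions, not necessarily the slide's own.

```python
# Tree-height reduction sketch: rebalancing shortens the critical path.
# Expressions are nested tuples ("op", left, right); leaves are strings.

def height(e):
    return 0 if isinstance(e, str) else 1 + max(height(e[1]), height(e[2]))

# Commutativity/associativity: ((a + b) + c) + d  ->  (a + b) + (c + d)
chain    = ("+", ("+", ("+", "a", "b"), "c"), "d")
balanced = ("+", ("+", "a", "b"), ("+", "c", "d"))
print(height(chain), "->", height(balanced))    # 3 -> 2

# Distributivity: a * (b*c*d + e)  ->  a*b*c*d + a*e
before = ("*", "a", ("+", ("*", "b", ("*", "c", "d")), "e"))
after  = ("+", ("*", ("*", "a", "b"), ("*", "c", "d")), ("*", "a", "e"))
print(height(before), "->", height(after))      # 4 -> 3
```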
Sub-expression elimination
• Logic expressions:
– Performed by logic optimization.
– Kernel-based methods.
• Arithmetic expressions:
– Search for isomorphic patterns in the parse trees.
– Example (see the sketch after this list):
  a = x + y; b = a + 1; c = x + y;
  becomes
  a = x + y; b = a + 1; c = a;
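A minimal sketch of the arithmetic case above, assuming a value-numbering-style pass over (variable, expression) pairs; the statement representation is an illustrative assumption.

```python
# Common sub-expression elimination sketch: reuse the first variable that
# computed an identical right-hand side (the slide's a=x+y; c=x+y example).

def cse(statements):
    seen, out = {}, []                   # rhs pattern -> defining variable
    for var, rhs in statements:
        if rhs in seen:
            out.append((var, seen[rhs])) # replace recomputation with a copy
        else:
            seen[rhs] = var
            out.append((var, rhs))
    return out

prog = [("a", ("+", "x", "y")), ("b", ("+", "a", "1")), ("c", ("+", "x", "y"))]
for var, rhs in cse(prog):
    print(var, "=", rhs)
# a = ('+', 'x', 'y') ; b = ('+', 'a', '1') ; c = a
```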
Examples of other transformations
• Dead-code elimination:
– a = x; b = x + 1; c = 2 * x;
– a = x; can be removed if a is not referenced.
• Operator-strength reduction (checked in the sketch after this list):
– a = x²; b = 3 * x;
– a = x * x; t = x << 1; b = x + t;
• Code motion:
– for (i = 1; i < a * b) { }
– t = a * b; for (i = 1; i < t) { }
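A quick executable check of the strength-reduction identities above; the value range is arbitrary.

```python
# Verify the slide's strength-reduction rewrites over a range of values.
for x in range(-5, 6):
    assert x ** 2 == x * x        # square -> one multiply instead of pow
    assert 3 * x == x + (x << 1)  # 3*x -> shift-and-add
print("strength-reduction identities hold")
```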
Control-flow based transformations
• Model expansion:
– Expand a subroutine to flatten the hierarchy.
– Useful to expand the scope of other optimization techniques.
– Problematic when the routine is called more than once.
– Example:
  x = a + b; y = a * b; z = foo(x, y);
  foo(p, q) { t = q - p; return t; }
  By expanding foo:
  x = a + b; y = a * b; z = y - x;
• Conditional expansion:
– Transform a conditional into parallel execution with the test at the end.
– Useful when the test depends on late signals.
– May preclude hardware sharing.
– Always useful for logic expressions.
– Example:
  y = ab; if (a) x = b + d; else x = bd;
  can be expanded to x = a(b + d) + a'bd and simplified to:
  y = ab; x = y + d(a + b);
Strength reduction
[Dataflow graphs: X² + AX + B is rewritten by factoring as X(X + A) + B, trading a multiplication for a chained multiply-add; a second, larger graph applies the same rewrite to a higher-degree polynomial with an additional coefficient C.]
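The figures' point can be shown executably; a minimal Horner's-rule sketch that counts multiplications (the coefficient values are arbitrary, and the generic evaluator below charges a multiply even for the leading coefficient 1):

```python
# Horner strength reduction: factoring X^2 + A*X + B into X*(X + A) + B saves
# one multiplication per degree; the generic evaluators show the same saving.

def poly_naive(coeffs, x):
    """Evaluate sum(coeffs[i] * x**i) term by term; count multiplications."""
    total, mults = 0, 0
    for i, c in enumerate(coeffs):
        term = c
        for _ in range(i):
            term *= x; mults += 1
        total += term
    return total, mults

def poly_horner(coeffs, x):
    """Evaluate via Horner's rule: ((c_n * x + c_{n-1}) * x + ...)."""
    total, mults = 0, 0
    for c in reversed(coeffs):
        total = total * x + c; mults += 1
    return total, mults - 1              # first multiply is by zero

coeffs = [7, 3, 1]                       # B + A*x + x^2, i.e. x^2 + 3x + 7
print(poly_naive(coeffs, 2), poly_horner(coeffs, 2))  # (17, 3) vs (17, 2)
```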
Strength Reduction
DIGLOG multiplier
C_mult(n) ≈ 253·n², C_add(n) ≈ 214·n, where n = word length in bits
A = 2^j + A_R, B = 2^k + B_R
A × B = (2^j + A_R)(2^k + B_R) = 2^(j+k) + 2^j·B_R + 2^k·A_R + A_R·B_R

                     1st iter.   2nd iter.   3rd iter.
Worst-case error     -25%        -6%         -1.6%
Prob. of error < 1%  10%         70%         99.8%

With an 8-by-8 multiplier, the exact result can be obtained in at most seven iteration steps (worst case).
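A minimal sketch of the iteration implied by the expansion above: each step computes the three shift-and-add terms exactly and recurses on the dropped A_R·B_R product; the operand values and bit widths are illustrative.

```python
# DIGLOG-style iterative multiply: with A = 2^j + A_R and B = 2^k + B_R,
# A*B = 2^(j+k) + 2^j*B_R + 2^k*A_R + A_R*B_R. Each iteration keeps the
# shift-add terms exactly and recurses on the residual A_R*B_R.

def diglog_mult(a, b, iters):
    if a == 0 or b == 0:
        return 0
    if iters == 0:
        return 0                       # drop the residual product entirely
    j, k = a.bit_length() - 1, b.bit_length() - 1
    ar, br = a - (1 << j), b - (1 << k)
    approx = (1 << (j + k)) + (br << j) + (ar << k)
    return approx + diglog_mult(ar, br, iters - 1)

a, b = 181, 203                        # arbitrary 8-bit operands
for it in (1, 2, 3):
    est = diglog_mult(a, b, it)
    print(it, est, f"error {100 * (est - a * b) / (a * b):+.2f}%")
# Error shrinks toward zero; exact within 7 steps for 8x8 operands.
```

For these operands the errors come out near -10.8%, -0.6%, and -0.04%, inside the worst-case bounds in the table above.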
Logarithmic Number System
L_x = log2 |x|
L_(A·B) = L_A + L_B,  L_(A/B) = L_A - L_B
L_(A²) = L_A << 1 (i.e., 2·L_A),  L_(√A) = L_A >> 1 (i.e., L_A/2)
--> significant strength reduction
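A small floating-point sketch of these identities (real LNS hardware keeps L_x in fixed point, where the ×2 and /2 become the shifts noted above):

```python
import math

# LNS sketch: multiply/divide/square/sqrt become add/subtract/double/halve
# on the log-domain representation L_x = log2|x|.

def to_lns(x):
    return math.log2(abs(x))

def from_lns(l):
    return 2.0 ** l

a, b = 13.0, 5.0
la, lb = to_lns(a), to_lns(b)
print(from_lns(la + lb))   # a * b    -> 65.0
print(from_lns(la - lb))   # a / b    -> 2.6
print(from_lns(2 * la))    # a ** 2   -> 169.0 (a left shift in fixed point)
print(from_lns(la / 2))    # sqrt(a)  -> 3.605... (a right shift in fixed point)
```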
Switching Activity Reduction
(a) Average activity in a multiplier as a function of the constant value; (b) parallel and serial implementations of an adder tree.
Pipelining
Associativity Transformation
Interlaced Accumulation Programming for Low Power
Two's complement implementation of an accumulator
Sign magnitude implementation of an accumulator
Number representation trade-off for arithmetic
Signal statistics for the sign-magnitude implementation of the accumulator datapath, assuming random inputs
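These accumulator slides compare number representations by figure only; a hedged sketch of the effect they illustrate, assuming a 16-bit register and small zero-mean inputs (in two's complement, every sign change of the running sum flips all the sign-extension bits, while sign-magnitude leaves the high bits mostly idle):

```python
import random

# Count bit toggles in an accumulator register under two representations.
# Small zero-mean inputs make the running sum cross zero often, flipping
# all sign-extension bits in two's complement.

WIDTH = 16
MASK = (1 << WIDTH) - 1

def twos(v):
    return v & MASK

def signmag(v):
    return (abs(v) & (MASK >> 1)) | ((1 << (WIDTH - 1)) if v < 0 else 0)

def toggles(encode, samples):
    acc, prev, count = 0, encode(0), 0
    for s in samples:
        acc += s
        cur = encode(acc)
        count += bin(prev ^ cur).count("1")
        prev = cur
    return count

random.seed(0)
samples = [random.randint(-3, 3) for _ in range(10000)]  # zero-mean inputs
print("two's complement:", toggles(twos, samples))
print("sign-magnitude:  ", toggles(signmag, samples))
```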