Achieving over 50% system speedup with
custom instructions and multi-threading.
Kaiming Ho
Fraunhofer IIS
kaiming.ho@iis.fraunhofer.de
June 3rd, 2014
Overview
• Introduction and system description
• Motivation for work
• Optimization approach
– using user defined instructions (UDI)
– using multi-threading (MT)
• Results
• Concluding remarks
Video Encoder System
(Overview)
[Block diagram: video in (1080p30), dedicated hardware, DDR memory, memory, MIPS processor running s/w, ethernet out (1000Mbps)]
Example encoded bytes: ff 4c ff 51 00 2f 00 00 07 80 00 04 38 00 ff 93 f3 b6 ...
ethernet out carries:
- encoded byte stream (IP/UDP/RTP)
- statistics (IP/UDP)
Overview of software
• Main software is partitioned into three parts
– Each part must finish before the next starts
from h/w -> PART1 (rate optimization) -> PART2 (codestream formation) -> PART3 (output to network) -> DONE
• Timestamps are added to measure how long each part takes.
Add up time for all three parts for performance metric.
– convert absolute time to frames/sec. (33.33ms -> 30fps)
• s/w also instrumented to count instructions.
– can calculate instr./cycle (IPC)
• h/w delivers input at 30 fps. Analyze rate at which s/w is done.
– visualize in GUI
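The performance metric described above (sum of the three part times, converted to frames/sec.) can be sketched as follows. This is a minimal illustration only: the struct and function names are hypothetical and do not come from the original software.

```c
#include <assert.h>

/* Hypothetical per-frame timing record (names are illustrative). */
typedef struct {
    double part1_ms;  /* rate optimization */
    double part2_ms;  /* codestream formation */
    double part3_ms;  /* output to network */
} frame_times_t;

/* Performance metric: total time for all three parts of one frame. */
double frame_total_ms(const frame_times_t *t)
{
    return t->part1_ms + t->part2_ms + t->part3_ms;
}

/* Convert an absolute frame time to frames/sec. (33.33ms -> 30fps). */
double ms_to_fps(double ms)
{
    assert(ms > 0.0);
    return 1000.0 / ms;
}

/* Returns 1 if s/w finishes frames at least as fast as h/w delivers them. */
int meets_budget(double total_ms, double target_fps)
{
    return ms_to_fps(total_ms) >= target_fps;
}
```

With this conversion, the 83.72ms measured before optimization corresponds to about 11.9 fps, well short of the 30 fps input rate.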
Visualization GUI
[Figure: GUI view of performance before all optimizations]
Optimization approach
1. Identify functional hot-spots which can be replaced
by user-defined custom instructions (UDI).
– base instruction-set is extended
– One custom instruction replaces many instructions from
the base-ISA.
– Highest impact when
• # instructions replaced is high
• function is called often.
2. Use multi-threading (MT) to run all three parts
simultaneously.
– stalls in execution pipeline reduce instructions/cycle (IPC).
– when one thread stalls, attempt to schedule an instruction
from another thread.
– increases effective IPC.
Using User-defined instructions (UDI)
• MIPS UDI allows complex functions to be implemented
in a single custom instruction.
– ISA is extended to include new custom instructions
– Fully supported in compiler tool-chain.
• Instructions take the form:
reg_result = custom_udi(reg_src1, reg_src2);
– Two 32-bit source operands (both optional) and one 32-bit
result (also optional).
– Typical RISC style.
– Instructions can be pure (no side-effects), or can update
internal state.
• Instructions are likely domain specific.
UDI Examples (1)
• Bit accumulation, with zero-stuffing.
– hard for 32-bit processor to do.
• <n> bits are pushed into an accumulator.
• When eight 1’s in a row occur, an extra “0” is added.
• data is popped out 16/32-bits at a time.
[Figure: accumulator state (bit contents) as operations are applied]
Example sequence:
    bitwr_push    0x1f2, 10
    bitwr_push    0xfd, 8
    bitwr_getlen  r10          (r10 <= 19)
    bitwr_pop16   r11          (r11 <= 0xecff)
    bitwr_push    0x17ffd, 18
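A software model helps show what the bit-write UDIs replace. The sketch below is an assumption-laden illustration, not the actual instruction semantics. The slides do not state the bit order, the accumulator width, or where exactly the run of 1's is counted. Here bits are stored MSB-first, the run is counted across push boundaries, and the accumulator holds at most 64 bits. The function names mirror the mnemonics on the slide, but the C signatures are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Model of the accumulator state (layout is an assumption). */
typedef struct {
    uint64_t bits;  /* low `len` bits are valid; oldest bit is the most significant */
    unsigned len;   /* number of valid bits */
    unsigned ones;  /* length of the current run of consecutive 1 bits */
} bitwr_t;

/* Append a single bit to the accumulator. */
static void bitwr_put1(bitwr_t *w, unsigned bit)
{
    assert(w->len < 64);
    w->bits = (w->bits << 1) | (bit & 1u);
    w->len++;
}

/* Push the low n bits of value, MSB first, stuffing a 0 after eight 1's. */
void bitwr_push(bitwr_t *w, uint32_t value, unsigned n)
{
    assert(n <= 32);
    for (unsigned i = n; i-- > 0; ) {
        unsigned bit = (value >> i) & 1u;
        bitwr_put1(w, bit);
        w->ones = bit ? w->ones + 1 : 0;
        if (w->ones == 8) {       /* eight 1's in a row: add an extra 0 */
            bitwr_put1(w, 0);
            w->ones = 0;
        }
    }
}

/* Number of bits currently held, including stuffed bits. */
unsigned bitwr_getlen(const bitwr_t *w)
{
    return w->len;
}

/* Remove and return the oldest 16 bits. */
uint16_t bitwr_pop16(bitwr_t *w)
{
    assert(w->len >= 16);
    w->len -= 16;
    uint16_t out = (uint16_t)(w->bits >> w->len);
    w->bits &= w->len ? ((UINT64_C(1) << w->len) - 1) : 0;
    return out;
}
```

Even in this simplified form, every pushed bit costs a shift, a compare and a conditional branch, which is why the slide calls the operation hard for a 32-bit processor.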
UDI Examples (2)
• FIFO pointer management.
– not domain specific. Could find use in multiple applications.
• Internal state:
struct {
    unsigned *ring_start;
    unsigned *ring_end;
    unsigned *wr_ptr;
    unsigned *rd_ptr;
} FIFO_PTR;
[Figure: ring buffer with ring_start, rd_ptr, wr_ptr and ring_end marked]
• s/w writes one word at a time
– check for buffer full
– handle wraparound
unsigned *FIFO_PTR_INC_WP() {
    unsigned *retval, *next_wp;
    next_wp = retval = FIFO_PTR.wr_ptr;
    // increment and wrap
    next_wp += 1;
    if (next_wp == FIFO_PTR.ring_end)
        next_wp = FIFO_PTR.ring_start;
    // check for full
    if (next_wp == FIFO_PTR.rd_ptr)
        return NULL;
    FIFO_PTR.wr_ptr = next_wp;
    return retval;
}
Usage:
    ptr = FIFO_PTR_INC_WP();
    if (ptr)
        *ptr = data;
    PC: bfc059fc    UDI   r3                 // inc_wp
    PC: bfc05a00    BEQZ  r3, 0xbfc05ac8
    PC: bfc05a04    NOP
    PC: bfc05a08    SW    r3, 0(r3)
FIFO_PTR_INC_WP() reduced to one atomic UDI
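For reference, the write-pointer function can be exercised on a host together with a read-side counterpart. The sketch below is a software model only. FIFO_PTR_INIT and FIFO_PTR_INC_RP are hypothetical: the slides name an "inc rp" UDI but do not show its C equivalent, so its behaviour here (return the slot to read, or NULL when empty) is an assumption.

```c
#include <assert.h>
#include <stddef.h>

/* FIFO pointer state, as on the slide. */
static struct {
    unsigned *ring_start;
    unsigned *ring_end;   /* assumed: one past the last usable word */
    unsigned *wr_ptr;
    unsigned *rd_ptr;
} FIFO_PTR;

/* Hypothetical helper: bind the state to a caller-provided buffer. */
void FIFO_PTR_INIT(unsigned *buf, size_t words)
{
    assert(words >= 2);
    FIFO_PTR.ring_start = buf;
    FIFO_PTR.ring_end = buf + words;
    FIFO_PTR.wr_ptr = FIFO_PTR.rd_ptr = buf;
}

/* Write side, same logic as the slide: NULL means the buffer is full. */
unsigned *FIFO_PTR_INC_WP(void)
{
    unsigned *retval, *next_wp;
    next_wp = retval = FIFO_PTR.wr_ptr;
    next_wp += 1;                          /* increment and wrap */
    if (next_wp == FIFO_PTR.ring_end)
        next_wp = FIFO_PTR.ring_start;
    if (next_wp == FIFO_PTR.rd_ptr)        /* check for full */
        return NULL;
    FIFO_PTR.wr_ptr = next_wp;
    return retval;
}

/* Assumed read side: NULL means the buffer is empty. */
unsigned *FIFO_PTR_INC_RP(void)
{
    unsigned *retval = FIFO_PTR.rd_ptr;
    if (retval == FIFO_PTR.wr_ptr)         /* check for empty */
        return NULL;
    unsigned *next_rp = retval + 1;        /* increment and wrap */
    if (next_rp == FIFO_PTR.ring_end)
        next_rp = FIFO_PTR.ring_start;
    FIFO_PTR.rd_ptr = next_rp;
    return retval;
}
```

Note that with this full/empty convention a ring of N words stores at most N-1 words, since wr_ptr == rd_ptr is reserved to mean empty.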
UDI savings
[Figure: instr. count and cycle count plot, with two marked regions: 13 instr, 38 cyc. and 34 instr, 57 cyc.]
• Two UDI replace 47 standard instructions, taking 95 cycles.
• UDI does not stall.
• Amount saved is dependent on input.
• # standard instructions variable.
• With UDI, always 2 instructions.

UDI name              cycles saved   instr. saved   freq. of use
                      (per use)      (per use)      (per frame)
BIT WRITE (push)      46-161         29-77          20889
BIT WRITE (get_len)   46-108         24-48          4185
BIT WRITE (pop)       31-82          16-42          3288
FIFO PTR (inc wp)     39-101         22-46          3288
FIFO PTR (inc rp)     16             8              9

overall speedup: 22% (BIT WRITE), 1.9% (FIFO PTR)
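The per-use savings and the per-frame frequencies can be combined into a rough per-frame bound. The sketch below assumes the table reads as five rows (push, get_len, pop, inc wp, inc rp) with columns cycles saved, instr. saved and freq. of use. The type and function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* One row of the UDI savings table (cycle columns only). */
typedef struct {
    const char *name;
    unsigned cyc_min, cyc_max;    /* cycles saved per use */
    unsigned uses_per_frame;      /* freq. of use per frame */
} udi_row_t;

static const udi_row_t UDI_TABLE[] = {
    { "BIT WRITE (push)",    46, 161, 20889 },
    { "BIT WRITE (get_len)", 46, 108,  4185 },
    { "BIT WRITE (pop)",     31,  82,  3288 },
    { "FIFO PTR (inc wp)",   39, 101,  3288 },
    { "FIFO PTR (inc rp)",   16,  16,     9 },
};
#define UDI_ROWS (sizeof UDI_TABLE / sizeof UDI_TABLE[0])

/* Sum (cycles saved per use) x (uses per frame) over all rows. */
void cycles_saved_per_frame(const udi_row_t *rows, size_t n,
                            unsigned long *lo, unsigned long *hi)
{
    assert(rows != NULL && lo != NULL && hi != NULL);
    *lo = *hi = 0;
    for (size_t i = 0; i < n; i++) {
        *lo += (unsigned long)rows[i].cyc_min * rows[i].uses_per_frame;
        *hi += (unsigned long)rows[i].cyc_max * rows[i].uses_per_frame;
    }
}
```

Under these assumptions the bound is 1,383,708 to 4,416,957 cycles saved per frame. The push instruction dominates because it is used far more often than the others, which matches the "function is called often" criterion for high-impact UDIs.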
Performance gain from UDI
83.72ms (before)
62.76ms (after)
Savings: 20.96ms (25%)
multi-threading (1)
• instructions/cycle (IPC) is a measure of efficiency in CPU
execution pipeline.
– stalls due to cache misses, multi-cycle instructions, branch
penalties, etc… decrease IPC.
• A CPU working in multi-threaded mode attempts to
schedule instructions from a different thread when one
stalls.
– increases effective IPC
• Programs with low IPC in single-threaded mode benefit
most from multi-threading.
Representative execution statistics of our program gathered in the lab:
    part1:    3056 cyc,    1587 instr.
    part2: 4597034 cyc, 1954337 instr.
    part3: 2454570 cyc,  816940 instr.
    total: 7054660 cyc, 2772864 instr.
    avg. IPC is 0.393
avg. IPC is low!! Expect MT to have significant impact
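The average IPC quoted above follows directly from the per-part counters. A minimal sketch of the calculation, with hypothetical type and function names:

```c
#include <assert.h>
#include <stdint.h>

/* Cycle and instruction counters for one part of the program. */
typedef struct {
    uint64_t cycles;
    uint64_t instr;
} part_stats_t;

/* Instructions per cycle for one set of counters. */
double ipc(part_stats_t s)
{
    assert(s.cycles > 0);
    return (double)s.instr / (double)s.cycles;
}

/* Add up the counters of n parts. */
part_stats_t stats_total(const part_stats_t *parts, int n)
{
    part_stats_t sum = { 0, 0 };
    for (int i = 0; i < n; i++) {
        sum.cycles += parts[i].cycles;
        sum.instr  += parts[i].instr;
    }
    return sum;
}
```

Feeding in the three parts from the slide gives 2772864 / 7054660, or about 0.393. The average is computed from the totals, so it is weighted by cycle count and is dominated by part2 and part3.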
multi-threading (2)
• Execution of our program (in ST) over time is shown
below.
[Figure: ST timeline. For each of frame1, frame2 and frame3, part1, part2 and part3 run back to back, plotted against the 30fps frame boundaries.]
– Too slow. The 30fps time budget is overrun.
• With MT, each part runs in its own thread, and the threads are
interleaved.
– overall effect is better performance.
Multi-threading and IRQ handling
• Traditional ST programs get interrupted when external IRQs are
asserted.
– the ‘normal’ program is suspended while the IRQ handler runs.
• When MT programs are architected the same way, ALL threads are
interrupted when IRQ occurs.
– On IRQ, CPU goes to exception level and MT is effectively turned off.
– very inefficient. When IRQ handler stalls, cycles are wasted.
• Our program takes many interrupts. (175k / sec.)
• Different approach:
– IRQ handler is given its own thread.
– Assertion of an IRQ does not cause a CPU interrupt. Instead, it wakes up the
thread with the IRQ handler.
– When IRQ handler runs, it is scheduled simultaneously with other
threads in the system.
– No IRQ overhead.
– CPU never goes to exception level.
Performance gain from MT
[Figure: GUI view comparing ST and MT, labeled 45%]
Original performance: 83.72ms
With UDI and MT: 43.37ms
Discussion of Results

ST/noUDI (111MHz): 86.6ms. IPC 42.42%
         cyc.         instr.     IPC
    p1:  2*1126       1130       50.17%
    p2:  2*3301726    2967154    44.95%
    p3:  2*1508140    1114213    37.01%

MT/noUDI (111MHz): 68.6ms.
         cyc.         instr.     IPC
    p1:  2*1458       1125       38.58%
    p2:  2*3745384    2967201    39.61%
    p3:  2*1508443    1080524    35.89%

ST/UDI (111MHz): 65.4ms. IPC 39.39%
         cyc.         instr.     IPC
    p1:  2*1126       1130       50.17%
    p2:  2*2118058    1741554    41.17%
    p3:  2*1508130    1114291    37.01%

MT/UDI (111MHz): 43.8ms.
         cyc.         instr.     IPC
    p1:  2*1973       1125       28.50%
    p2:  2*2435277    1741563    35.76%
    p3:  2*1508548    1078515    35.83%

ST/UDI/rate_alloc (111MHz): 89.5ms. IPC 35.22%
         cyc.         instr.     IPC
    p1:  2*1339904    639013     23.84%
    p2:  2*2115672    1741554    41.17%
    p3:  2*1508090    1113907    37.00%

MT/UDI/rate_alloc (111MHz): 57.3ms. (34/30/32)
         cyc.         instr.     IPC
    p1:  2*1531915    639041     19.25%
    p2:  2*3187194    1741574    27.34%
    p3:  2*2249536    1057951    23.56%

Gains:
    ST/noUDI -> MT/noUDI: 26%
    ST/UDI -> MT/UDI: 49%
    ST/UDI/rate_alloc -> MT/UDI/rate_alloc: 56%
    ST/noUDI -> MT/UDI: 98%

• Adding UDI decreases #instr. and IPC.
– custom instructions are part of multiplier pipeline.
• When MT is used, same # instr. takes longer.
– IPC of individual threads lower
– Overall IPC (performance) is higher.
• adding extra processing with memory accesses and FPU decreases IPC.
• effect of MT is enhanced.
• lower IPC in ST means greater gain from ST->MT
• Frequency of CPU does not matter
– Our application is not I/O or memory bound.
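The gain percentages on this slide appear to be consistent with the frame times when gain is computed as (time before / time after) - 1. That reading is an inference from the numbers, not something the slides state. The 25% quoted earlier for UDI alone instead matches time saved relative to the original time. A small sketch of both conventions, with hypothetical function names:

```c
#include <assert.h>

/* Gain expressed as extra throughput: before/after - 1, in percent. */
double speedup_pct(double before_ms, double after_ms)
{
    assert(before_ms > 0.0 && after_ms > 0.0);
    return (before_ms / after_ms - 1.0) * 100.0;
}

/* Gain expressed as time saved relative to the original time, in percent. */
double time_saved_pct(double before_ms, double after_ms)
{
    assert(before_ms > 0.0);
    return (before_ms - after_ms) / before_ms * 100.0;
}
```

For example, speedup_pct(86.6, 68.6) is about 26.2 and speedup_pct(65.4, 43.8) is about 49.3, while time_saved_pct(83.72, 62.76) is about 25.0. Keeping the two conventions apart matters when comparing the UDI and MT results side by side.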
Concluding Remarks
• Over 50% improvement in performance was obtained by
using two simple techniques:
– Use of custom user-defined instructions (UDI)
– Use of multi-threading (MT) technology.
• UDI reduces the number of instructions executed.
Consistently saves 20-25%.
– Easy to implement compared to dedicated h/w design.
– man-weeks of work vs. man-years.
• Benefit of MT is more variable.
– Between 26-49% has been measured.
– depends on operating point. Image complexity. IPC of application.
– Heavily loaded systems benefit more.
– memory or I/O bound applications benefit more
Achieving over 50% system speedup with
custom instructions and multi-threading
THANK YOU!!!