Achieving over 50% system speedup with custom instructions and multi-threading

Kaiming Ho
Fraunhofer IIS
kaiming.ho@iis.fraunhofer.de
June 3rd, 2014

Overview
• Introduction and system description
• Motivation for work
• Optimization approach
  – using user defined instructions (UDI)
  – using multi-threading (MT)
• Results
• Concluding remarks

Video Encoder System (Overview)
[Block diagram. Components: video in (1080p30); dedicated hardware; DDR memory; memory; MIPS processor running s/w; ethernet out (1000Mbps) carrying the encoded byte stream (IP/UDP/RTP) and statistics (IP/UDP). Sample of the byte stream: ff 4c ff 51 00 2f 00 00 07 80 00 04 38 00 ff 93 f3 b6 ...]

Overview of software
• Main software is partitioned into three parts
  – Each part must finish before the next starts
  – Flow: from h/w -> PART1 (rate optimization) -> PART2 (codestream formation) -> PART3 (output to network) -> DONE
• Timestamps are added to measure how long each part takes. The times of all three parts are added up to give the performance metric.
  – convert absolute time to frames/sec. (33.33ms -> 30fps)
• s/w is also instrumented to count instructions.
  – can calculate instr./cycle (IPC)
• h/w delivers input at 30 fps. Analyze the rate at which s/w is done.
  – visualize in GUI

Visualization GUI
[Screenshot: performance before all optimizations]

Optimization approach
1. Identify functional hot-spots which can be replaced by user-defined custom instructions (UDI).
  – base instruction-set is extended
  – One custom instruction replaces many instructions from the base-ISA.
  – Highest impact when
    • # instructions replaced is high
    • function is called often.
2. Use multi-threading (MT) to run all three parts simultaneously.
  – stalls in execution pipeline reduce instructions/cycle (IPC).
  – when one thread stalls, attempt to schedule an instruction from another thread.
  – increases effective IPC.

Using User-defined instructions (UDI)
• MIPS UDI allows complex functions to be implemented in a single custom instruction.
  – ISA is extended to include new custom instructions
  – Fully supported in compiler tool-chain.
• Instructions take the form:
    reg_result = custom_udi(reg_src1, reg_src2);
  – Two 32-bit source operands (both optional) and one 32-bit result (also optional).
  – Typical RISC style.
  – Instructions can be pure (no side-effects), or can update internal state.
• Instructions are likely domain specific.

UDI Examples (1)
• Bit accumulation, with zero-stuffing.
  – hard for 32-bit processor to do.
• <n> bits are pushed into an accumulator.
• When eight 1’s in a row occur, an extra “0” is added.
• data is popped out 16/32-bits at a time.
[Figure: accumulator state, shown as a row of bits]
    bitwr_push   0x1f2, 10
    bitwr_push   0xfd, 8
    bitwr_getlen r10        (r10 <= 19)
    bitwr_pop16  r11        (r11 <= 0xecff)
    bitwr_push   0x17ffd, 18

UDI Examples (2)
• FIFO pointer management.
  – not domain specific. Could find use in multiple applications.
• Internal state:
    struct {
        unsigned *ring_start;
        unsigned *ring_end;
        unsigned *wr_ptr;
        unsigned *rd_ptr;
    } FIFO_PTR;
  [Figure: ring buffer with ring_start, rd_ptr, wr_ptr and ring_end marked]
• s/w writes one word at a time
  – check for buffer full
  – handle wraparound

    /* Software equivalent of the inc_wp UDI: returns the slot to write,
       or NULL if the FIFO is full. */
    unsigned *FIFO_PTR_INC_WP()
    {
        unsigned *retval, *next_wp;
        next_wp = retval = FIFO_PTR.wr_ptr;
        // increment and wrap
        next_wp += 1;
        if (next_wp == FIFO_PTR.ring_end)
            next_wp = FIFO_PTR.ring_start;
        // check for full
        if (next_wp == FIFO_PTR.rd_ptr)
            return NULL;
        FIFO_PTR.wr_ptr = next_wp;
        return retval;
    }

Usage:
    ptr = FIFO_PTR_INC_WP();
    if (ptr) *ptr = data;

Resulting code:
    PC: bfc059fc  UDI   r3              // inc_wp
    PC: bfc05a00  BEQZ  r3, 0xbfc05ac8
    PC: bfc05a04  NOP
    PC: bfc05a08  SW    r3, 0(r3)

FIFO_PTR_INC_WP() reduced to one atomic UDI

UDI savings
[Figure: instr. count and cycle count traces, with two replaced code segments annotated "13 instr, 38 cyc." and "34 instr, 57 cyc."]
• Two UDIs replace 47 standard instructions, which took 95 cycles.
• UDI does not stall.
• Amount saved is dependent on input.
  • # standard instructions is variable.
  • With UDI, always 2 instructions.

UDI name               cycles saved   instr. saved   freq. of use
                       (per use)      (per use)      (per frame)
BIT WRITE (push)       46-161         29-77          20889
BIT WRITE (get_len)    46-108         24-48          4185
BIT WRITE (pop)        31-82          16-42          3288
FIFO PTR (inc wp)      39-101         22-46          3288
FIFO PTR (inc rp)      16             8              9

overall speedup: 22% and 1.9% (most likely the BIT WRITE and FIFO PTR groups respectively)

Performance gain from UDI
[Screenshot: GUI before and after UDI]
83.72ms (before)
62.76ms (after)
Savings: 20.96ms (25%)

multi-threading (1)
• instructions/cycle (IPC) is a measure of efficiency in the CPU execution pipeline.
  – stalls due to cache misses, multi-cycle instructions, branch penalties, etc. decrease IPC.
• A CPU working in multi-threaded mode attempts to schedule instructions from a different thread when one stalls.
  – increases effective IPC
• Programs with low IPC in single-threaded mode benefit most from multi-threading.

Representative execution statistics of our program gathered in the lab:
    part1:    3056 cyc,    1587 instr.
    part2: 4597034 cyc, 1954337 instr.
    part3: 2454570 cyc,  816940 instr.
    total: 7054660 cyc, 2772864 instr.
    avg. IPC is 0.393
avg. IPC is low!! Expect MT to have significant impact

multi-threading (2)
• Execution of our program (in ST) over time is shown below.
[Timeline: frame1, frame2 and frame3 each run part1, part2, part3 in sequence, against 30fps frame markers]
  – Too slow. The 30fps time budget is overrun.
• With MT, each part runs in its own thread, and the threads are interleaved.
  – overall effect is better performance.

Multi-threading and IRQ handling
• Traditional ST programs get interrupted when external IRQs are asserted.
  – the 'normal' program is suspended while the IRQ handler runs.
• When MT programs are architected the same way, ALL threads are interrupted when an IRQ occurs.
  – On IRQ, CPU goes to exception level and MT is effectively turned off.
  – very inefficient. When IRQ handler stalls, cycles are wasted.
• Our program takes many interrupts. (175k / sec.)
• Different approach:
  – IRQ handler is given its own thread.
  – Assertion of IRQ does not cause a CPU interrupt.
    Instead, it wakes up the thread that runs the IRQ handler.
  – When IRQ handler runs, it is scheduled simultaneously with other threads in the system.
  – No IRQ overhead.
  – CPU never goes to exception level.

Performance gain from MT
[Screenshot: GUI, ST vs. MT, labelled 45%]
Original performance: 83.72ms
With UDI and MT: 43.37ms

Discussion of Results

ST/noUDI (111MHz): 86.6ms. IPC 42.42%
         cyc.         instr.     IPC
    p1:  2*1126       1130       50.17%
    p2:  2*3301726    2967154    44.95%
    p3:  2*1508140    1114213    37.01%

MT/noUDI (111MHz): 68.6ms.
         cyc.         instr.     IPC
    p1:  2*1458       1125       38.58%
    p2:  2*3745384    2967201    39.61%
    p3:  2*1508443    1080524    35.89%

ST/UDI (111MHz): 65.4ms. IPC 39.39%
         cyc.         instr.     IPC
    p1:  2*1126       1130       50.17%
    p2:  2*2118058    1741554    41.17%
    p3:  2*1508130    1114291    37.01%

MT/UDI (111MHz): 43.8ms.
         cyc.         instr.     IPC
    p1:  2*1973       1125       28.50%
    p2:  2*2435277    1741563    35.76%
    p3:  2*1508548    1078515    35.83%

ST/UDI/rate_alloc (111MHz): 89.5ms. IPC 35.22%
         cyc.         instr.     IPC
    p1:  2*1339904    639013     23.84%
    p2:  2*2115672    1741554    41.17%
    p3:  2*1508090    1113907    37.00%

MT/UDI/rate_alloc (111MHz): 57.3ms. (34/30/32)
         cyc.         instr.     IPC
    p1:  2*1531915    639041     19.25%
    p2:  2*3187194    1741574    27.34%
    p3:  2*2249536    1057951    23.56%

Speedups marked between configurations:
    ST/noUDI -> MT/noUDI:                      26%
    ST/UDI -> MT/UDI:                          49%
    ST/UDI/rate_alloc -> MT/UDI/rate_alloc:    56%
    ST/noUDI -> MT/UDI:                        98%

• Adding UDI decreases #instr. and IPC.
  – custom instructions are part of multiplier pipeline.
• When MT is used, same # instr. takes longer.
  – IPC of individual threads is lower
  – Overall IPC (performance) is higher.
• adding extra processing with memory accesses and FPU decreases IPC.
• effect of MT is enhanced.
• lower IPC in ST means greater gain from ST->MT
• Frequency of CPU does not matter
  – Our application is not I/O or memory bound.

Concluding Remarks
• Over 50% improvement in performance was obtained by using two simple techniques:
  – Use of custom user-defined instructions (UDI)
  – Use of multi-threading (MT) technology.
• UDI reduces the number of instructions executed. Consistently saves 20-25%.
  – Easy to implement compared to dedicated h/w design.
  – man-weeks of work vs. man-years.
• Benefit of MT is more variable.
  – Between 26-49% has been measured.
  – depends on operating point. Image complexity. IPC of application.
  – Heavily loaded systems benefit more.
  – memory or I/O bound applications benefit more

THANK YOU!!!
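
Backup: reproducing the metrics

The performance numbers quoted in the slides follow from three simple calculations: converting per-frame processing time to frames/sec., dividing instruction count by cycle count to get IPC, and comparing two frame times. The sketch below is only an illustration of that arithmetic. The function names are hypothetical and do not come from the original software. It also assumes that the deck uses two percentage conventions: "savings" as the fraction of time removed, and the speedup labels on the results slide as the ratio of frame times minus one.

```c
#include <assert.h>
#include <stdint.h>

/* Frames per second achievable for a given per-frame processing time.
   Example from the slides: 33.33 ms per frame corresponds to 30 fps. */
double frames_per_second(double frame_time_ms)
{
    assert(frame_time_ms > 0.0);
    return 1000.0 / frame_time_ms;
}

/* Instructions per cycle (IPC) from the two instrumentation counters. */
double instr_per_cycle(uint64_t instructions, uint64_t cycles)
{
    assert(cycles > 0);
    return (double)instructions / (double)cycles;
}

/* Fraction of processing time removed, e.g. 0.25 for "Savings: 25%".
   Assumed convention, see the note above. */
double time_saved_fraction(double before_ms, double after_ms)
{
    assert(before_ms > 0.0);
    return (before_ms - after_ms) / before_ms;
}

/* Throughput gain, e.g. 0.98 for a "98%" speedup label.
   Assumed convention, see the note above. */
double speedup_fraction(double before_ms, double after_ms)
{
    assert(after_ms > 0.0);
    return before_ms / after_ms - 1.0;
}
```

With the figures from the slides, instr_per_cycle(2772864, 7054660) evaluates to roughly 0.393, time_saved_fraction(83.72, 62.76) to roughly 0.25, and speedup_fraction(86.6, 43.8) to roughly 0.98. Keeping the two percentage conventions in separate functions avoids mixing a 49% time saving with a 98% speedup when comparing configurations.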