Implementation of the Convolution
Operation on General Purpose
Processors
Ernest Jamro
AGH Technical University
Kraków, Poland
Contents
1. Introduction.
2. Pentium and Itanium processors:
• pipelining
• superscalar
• Very Long Instruction Word (VLIW)
• Single Instruction Multiple Data (SIMD).
3. Convolvers implemented in FPGAs.
4. Conclusions.
E. Jamro, K. Wiatr; Implementation of Convolution Operation on General Purpose Processors
Mathematical operation
$$ b_{y+N/2,\,x+M/2} = \frac{1}{D} \sum_{i=0}^{N-1} \sum_{j=0}^{M-1} w_{i,j}\, a_{y+i,\,x+j} $$
where:
$N$, $M$: the size of the convolution kernel (usually odd numbers),
$a_{y,x}$: an input,
$b_{y,x}$: an output,
$w_{i,j}$: a coefficient of the convolution,
$D$: a common denominator, $D = 2^n$.
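Because D is a power of two, the final division can be done with a shift. The sketch below is only an illustration: the helper name `div_round_pow2` is hypothetical, and it assumes a non-negative accumulator, since right-shifting negative values is implementation-defined in C.

```c
#include <assert.h>

/* Divide a non-negative accumulator by D = 2^n, rounding to nearest.
 * Hypothetical helper, not taken from the slides. */
static int div_round_pow2(int sum, int n)
{
    assert(sum >= 0 && n >= 1 && n < 30);
    int D = 1 << n;             /* common denominator D = 2^n */
    return (sum + D / 2) >> n;  /* equals (sum + D/2) / D for sum >= 0 */
}
```

For example, with n = 4 (D = 16) an accumulator of 104 gives 7, whereas plain truncating division would give 6.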
For image: 512 × 512, 25 frames/s, for convolution kernel 3 × 3:
$$ L_M = N_X \cdot N_Y \cdot N_F \cdot N \cdot M = 58\,982\,400 \ \text{multiplies/s} $$
$$ L_A = N_X \cdot N_Y \cdot N_F \cdot (N \cdot M - 1) = 52\,428\,800 \ \text{additions/s} $$
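The two operation counts can be checked with a few lines of C. This is a small sketch; the function names are hypothetical and the parameters follow the symbols N_X, N_Y, N_F, N and M used above.

```c
#include <assert.h>

/* Multiplications per second: L_M = N_X * N_Y * N_F * N * M */
static long long multiplies_per_second(int nx, int ny, int nf, int n, int m)
{
    assert(nx > 0 && ny > 0 && nf > 0 && n > 0 && m > 0);
    return (long long)nx * ny * nf * n * m;
}

/* Additions per second: L_A = N_X * N_Y * N_F * (N * M - 1) */
static long long additions_per_second(int nx, int ny, int nf, int n, int m)
{
    assert(nx > 0 && ny > 0 && nf > 0 && n > 0 && m > 0);
    return (long long)nx * ny * nf * (n * m - 1);
}
```

Calling both with (512, 512, 25, 3, 3) reproduces the figures 58 982 400 and 52 428 800.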
C-language program
For every output pixel:
    int sum = D/2;                 // accumulation result, initially D/2 to minimise the division rounding error
    for (int i = 0; i < N; i++)    // vertical convolution
    {
        for (int j = 0; j < M; j++)   // horizontal convolution
            sum += *pw++ * *pa++;     // the kernel of the convolution
        pa += Nx - M;                 // pa will point to the first pixel in the next line
    }
    sum /= D;                      // division by the common denominator
    *pb = (BYTE) sum;              // conversion from int (4 bytes) to a 1-byte variable, save the result
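The per-pixel fragment above can be wrapped into a complete routine. The following is a minimal sketch under stated assumptions: the function name `convolve`, the row-major image layout, the untouched border pixels and the clamping to 0..255 are choices of this example, not something the slides specify.

```c
#include <assert.h>

typedef unsigned char BYTE;

/* Convolve an Nx-by-Ny image a with an N-by-M kernel w and write b.
 * Border pixels, where the window does not fit, are left untouched. */
static void convolve(const BYTE *a, BYTE *b, int Nx, int Ny,
                     const int *w, int N, int M, int D)
{
    assert(N <= Ny && M <= Nx && D > 0);
    for (int y = 0; y + N <= Ny; y++) {
        for (int x = 0; x + M <= Nx; x++) {
            const int  *pw = w;               /* first coefficient w[0][0] */
            const BYTE *pa = a + y * Nx + x;  /* top-left pixel of the window */
            int sum = D / 2;                  /* rounding offset */
            for (int i = 0; i < N; i++) {     /* vertical convolution */
                for (int j = 0; j < M; j++)   /* horizontal convolution */
                    sum += *pw++ * *pa++;
                pa += Nx - M;                 /* first window pixel in the next line */
            }
            sum /= D;
            if (sum < 0)   sum = 0;           /* clamping is an addition of this sketch */
            if (sum > 255) sum = 255;
            b[(y + N / 2) * Nx + (x + M / 2)] = (BYTE)sum;
        }
    }
}
```

The output index follows the formula: the result for the window whose top-left pixel is (y, x) is stored at (y + N/2, x + M/2).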
Implementation on different Pentium processors
Pipelining in Pentium 75MHz
[Figure: pipeline timing diagram. Rows: instruction fetching, instruction decoding (stage 1), instruction decoding (stage 2), execution, registers updating. Columns: clock cycles T_N to T_N+4. Entries: instructions M+1 to M+4 and a branch instruction moving through the stages.]
Loop unrolling
    sum = D/2;               // initialisation
    sum += *pa++ * *pw++;
    sum += *pa++ * *pw++;
    sum += *pa * *pw++;
    pa += Nx - 2;            // go to the first window pixel in the next line
    sum += *pa++ * *pw++;
    sum += *pa++ * *pw++;
    ...
    sum /= D;
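Written out in full for a 3 × 3 kernel, the unrolled version might look like the sketch below. The function name `conv3x3_unrolled` and its parameters are hypothetical; Nx is assumed to be the number of pixels per image line.

```c
#include <assert.h>

typedef unsigned char BYTE;

/* One output pixel of a 3x3 convolution with both loops fully unrolled.
 * pa points at the top-left pixel of the window, pw at w[0][0]. */
static int conv3x3_unrolled(const BYTE *pa, const int *pw, int Nx, int D)
{
    assert(Nx >= 3 && D > 0);
    int sum = D / 2;            /* rounding offset */
    sum += *pa++ * *pw++;       /* row 0 */
    sum += *pa++ * *pw++;
    sum += *pa   * *pw++;
    pa += Nx - 2;               /* first window pixel in the next line */
    sum += *pa++ * *pw++;       /* row 1 */
    sum += *pa++ * *pw++;
    sum += *pa   * *pw++;
    pa += Nx - 2;
    sum += *pa++ * *pw++;       /* row 2 */
    sum += *pa++ * *pw++;
    sum += *pa   * *pw;
    return sum / D;             /* division by the common denominator */
}
```

No loop counters or conditional branches remain, which is the source of the speed-up on processors where branches are expensive.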
Loop Unrolling
[Figure: bar chart of the relative calculation time after loop unrolling, t_LU / t_stand (vertical axis 0 to 0.8), for the processors 486-DX4-100, P75, P166, P300 and Athlon 800MHz.]
DSP solution, e.g. Texas Instruments TMS320C80
Hardware Loop Control
Special registers:
• Loop Counter - the number of branches to the start of the loop
• Loop End - points to the last instruction in the loop
• Loop Start - points to the first instruction in the loop
• Loop Reload
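How such registers remove the loop overhead can be modelled in software. The sketch below is an assumption-laden illustration, not the TMS320C80 specification: the struct `LoopRegs`, the functions `next_pc` and `count_body_executions`, and in particular the reading of Loop Reload as the value copied back into the counter when the loop finishes are all hypothetical.

```c
#include <assert.h>

/* Hypothetical model of zero-overhead loop registers. */
typedef struct {
    int loop_start;    /* address of the first instruction in the loop */
    int loop_end;      /* address of the last instruction in the loop */
    int loop_counter;  /* remaining branches back to loop_start */
    int loop_reload;   /* assumed: value copied into loop_counter on exit */
} LoopRegs;

/* Return the address of the instruction that follows pc. */
static int next_pc(LoopRegs *r, int pc)
{
    assert(r->loop_start <= r->loop_end);
    if (pc != r->loop_end)
        return pc + 1;                 /* ordinary sequential fetch */
    if (r->loop_counter > 0) {
        r->loop_counter--;             /* one branch consumed */
        return r->loop_start;          /* jump back, no branch instruction needed */
    }
    r->loop_counter = r->loop_reload;  /* re-arm the counter for the next use */
    return pc + 1;                     /* loop finished, fall through */
}

/* Run from pc until stop and count instructions executed inside the loop. */
static int count_body_executions(LoopRegs r, int pc, int stop)
{
    int executed = 0;
    while (pc < stop) {
        if (pc >= r.loop_start && pc <= r.loop_end)
            executed++;
        pc = next_pc(&r, pc);
    }
    return executed;
}
```

With a three-instruction body and a counter of 2, the body runs three times (the first pass plus two branches back), and no instruction slot is spent on decrementing or testing the counter.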
Superscalar architecture
Instruction Level Parallelism
Non-optimised
// pixel 3 (top-right pixel in the convolution window)
xor  edx, edx                   // clear edx
mov  dl, byte ptr [ecx+2]       // load data a (pixel 3)
imul edx, dword ptr [edi+8h]    // multiply: pixel3 * w[0][2]
add  eax, edx                   // accumulate the result of the multiplication
Optimised
xor  edx, edx                   // start of calculation for pixel 3: clear
imul ebx, dword ptr [edi+4]     // pixel 2: ebx = pel2 * w[0][1]
mov  dl, byte ptr [ecx+2]       // pixel 3: dl = pel3
add  eax, ebx                   // end of calculation for pixel 2: eax += ebx
imul edx, dword ptr [edi+8]     // pixel 3: edx = pel3 * w[0][2]
xor  ebx, ebx                   // start of calculation for pixel 4: clear
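The optimised listing interleaves the work for neighbouring pixels so that consecutive instructions do not depend on each other. A rough C analogue, given here only as a sketch, keeps two independent accumulators; the function name `dot_interleaved` is hypothetical, and whether a particular compiler and processor actually pair the operations is not guaranteed.

```c
#include <assert.h>

typedef unsigned char BYTE;

/* Dot product of n pixels and n coefficients using two independent
 * accumulators, so that neighbouring multiply-adds form separate
 * dependency chains. */
static int dot_interleaved(const BYTE *a, const int *w, int n)
{
    assert(n >= 0);
    int acc0 = 0, acc1 = 0;
    int k = 0;
    for (; k + 1 < n; k += 2) {
        acc0 += a[k] * w[k];          /* chain 0 */
        acc1 += a[k + 1] * w[k + 1];  /* chain 1, independent of chain 0 */
    }
    if (k < n)
        acc0 += a[k] * w[k];          /* odd element left over */
    return acc0 + acc1;
}
```

The result is identical to a single-accumulator loop; only the order of the additions changes.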
Superscalar (continued)
[Figure: bar chart of the number of instructions executed in a clock cycle (vertical axis 0 to 2.5) for the processors 486-DX4-100, P75, P166, P300 and Athlon 800MHz. Series: Normal, Optimised, Multiplierless, Multiplierless optimised.]
DSP TMS320C80
Parallel Processor (VLIW)
Execution Units
1. Data Unit (DU).
2. Address Unit (AU) local and global.
3. Program Flow Control Unit (PFCU).
Convolution operations
1. multiply (executed by the DU)
2. accumulate (executed by the DU)
3. load the coefficient (executed by the AU)
4. increment coefficient pointer (executed by the AU)
5. load the input pixel (executed by the AU)
6. increment input pixel pointer (executed by the AU)
7. control the loop (executed by the PFCU)
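The seven operations listed above can be made explicit in C by splitting the usual one-line multiply-accumulate into separate statements. This is a sketch only: the function name `mac_steps` is hypothetical, and on a general purpose processor these statements run as ordinary sequential code, whereas the slide describes them as spread over the DU, AU and PFCU.

```c
#include <assert.h>

typedef unsigned char BYTE;

/* Multiply-accumulate loop with each of the seven operations written as
 * its own statement; the unit named in each comment is the one the
 * list above assigns to that operation. */
static int mac_steps(const BYTE *pa, const int *pw, int count)
{
    assert(count >= 0);
    int sum = 0;
    while (count > 0) {               /* 7. control the loop (PFCU) */
        int coeff = *pw;              /* 3. load the coefficient (AU) */
        pw++;                         /* 4. increment coefficient pointer (AU) */
        int pixel = *pa;              /* 5. load the input pixel (AU) */
        pa++;                         /* 6. increment input pixel pointer (AU) */
        int product = coeff * pixel;  /* 1. multiply (DU) */
        sum += product;               /* 2. accumulate (DU) */
        count--;                      /* 7. loop bookkeeping (PFCU) */
    }
    return sum;
}
```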
VLIW processor
Crusoe processor compatible with Pentium
[Figure: a 128-bit instruction word holding four operations, FADD, ADD, LD and BRCC, issued to the Floating Point Unit, the Integer ALU Unit, the Load/Store Unit and the Branch Unit respectively.]
Itanium microprocessor
• Greater number of registers (R0-R127 general purpose registers).
• VLIW-like architecture: each 128-bit bundle contains 3 instructions, which lets the processor dispatch instructions with simple instruction decoding.
• Stops define which instructions can be executed in parallel, which simplifies grouping instructions for parallel execution.
• Control and data speculation (included at compile time).
Simultaneous Multithreading
Time-switched multithreading:
Cycle  Unit A  Unit B  Unit C
0      P0 I0   P0 I1
1      -       P0 Memory Access
2      P0 Memory Access
3      P0 I2   -       P0 I3
4      -       P1 I0   P1 I1
Simultaneous multithreading:
[Table: cycles 0 to 4 on Unit A, Unit B and Unit C; instructions of processes P0 and P1 are issued side by side in the same cycles, so all three units are occupied in every cycle.]
Disadvantages:
• Larger cache size
• Operating system support
Single Instruction stream
Multiple Data stream (SIMD)
MMX Coprocessor
Multiply:
MM0 = [S, T, U, V], MM1 = [W, X, Y, Z]
result: MM0 = [S×W, T×X, U×Y, V×Z]
Multiply & accumulate:
MM0 = [S, T, U, V], MM1 = [W, X, Y, Z]
result: MM0 = [S×W + T×X, U×Y + V×Z]
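The two packed operations can be modelled in portable scalar C. This is a sketch, not MMX code: the function names are hypothetical, the lanes are assumed to be signed 16-bit values, and the plain multiply keeps full 32-bit products for clarity, whereas a real packed multiply stores only part of each product. The multiply & accumulate form matches the behaviour of the MMX `PMADDWD` instruction, which adds adjacent pairs of products into two 32-bit results.

```c
#include <assert.h>
#include <stdint.h>

/* Scalar model of a packed multiply: four lanes multiplied pairwise. */
static void packed_multiply(const int16_t mm0[4], const int16_t mm1[4],
                            int32_t out[4])
{
    assert(mm0 && mm1 && out);
    for (int k = 0; k < 4; k++)
        out[k] = (int32_t)mm0[k] * mm1[k];   /* S*W, T*X, U*Y, V*Z */
}

/* Scalar model of a packed multiply & accumulate: adjacent products
 * are summed, giving two 32-bit results. */
static void packed_multiply_add(const int16_t mm0[4], const int16_t mm1[4],
                                int32_t out[2])
{
    assert(mm0 && mm1 && out);
    out[0] = (int32_t)mm0[0] * mm1[0] + (int32_t)mm0[1] * mm1[1]; /* S*W + T*X */
    out[1] = (int32_t)mm0[2] * mm1[2] + (int32_t)mm0[3] * mm1[3]; /* U*Y + V*Z */
}
```

For a 3 × 3 kernel, one such multiply & accumulate covers four of the nine pixel-coefficient products of a window.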
SIMD
[Figure: bar chart of the relative calculation time for standard and MMX processor, t_MMX / t_stand (vertical axis 0 to 0.8), for the processors P166, P300 and Athlon 800MHz.]
Different processor speeds
[Figure: bar chart of the calculation time [ms] for a 512 × 512 image and convolution kernel 3 × 3 (integer unit), vertical axis 0 to 450 ms, for the processors 486-DX4-100, P75, P166, P300 and Athlon 800MHz.]
Comparison of different microprocessors & DSPs
[Figure: bar chart of the calculation time [ms] (vertical axis 0 to 50 ms) for P300 MMX, P300, P75, DSP56000 @ 100MHz and TMS320C80 @ 50MHz.]
Dedicated VLSI Processors
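[Figure: 3 × 3 convolver. The input stream passes through two line buffers and chains of z^-1 delay registers, so that the nine pixels a_{y,x} to a_{y+2,x+2} are available at the same time. Each pixel is paired with its coefficient w_{0,0} to w_{2,2}, and an adder combines the products into the output b_{y+1,x+1}.]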
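The line-buffer structure above can be imitated in software to see how a new result becomes available for every incoming pixel. The following is a behavioural sketch only: the type `Convolver`, the functions `convolver_init` and `convolver_push`, and the limit `MAX_WIDTH` are hypothetical, the coefficients are passed as a flat array of nine values, and the division by D is left out.

```c
#include <assert.h>
#include <string.h>

typedef unsigned char BYTE;

#define MAX_WIDTH 1024            /* hypothetical limit of this sketch */

typedef struct {
    int  width;                   /* pixels per image line */
    BYTE line[2][MAX_WIDTH];      /* two line buffers: the two previous rows */
    BYTE win[3][3];               /* window registers (the z^-1 delays) */
    int  x, y;                    /* position of the incoming pixel */
} Convolver;

static void convolver_init(Convolver *c, int width)
{
    assert(width >= 3 && width <= MAX_WIDTH);
    memset(c, 0, sizeof *c);
    c->width = width;
}

/* Push one pixel in raster order. Returns 1 and stores the weighted sum
 * in *out once a full 3x3 window is available, otherwise returns 0. */
static int convolver_push(Convolver *c, BYTE pixel, const int w[9], int *out)
{
    /* Column entering the window: two buffered rows plus the new pixel. */
    BYTE col[3] = { c->line[0][c->x], c->line[1][c->x], pixel };

    for (int i = 0; i < 3; i++) {          /* shift the window by one pixel */
        c->win[i][0] = c->win[i][1];
        c->win[i][1] = c->win[i][2];
        c->win[i][2] = col[i];
    }
    c->line[0][c->x] = col[1];             /* middle row becomes the oldest row */
    c->line[1][c->x] = pixel;              /* new row becomes the middle row */

    int valid = (c->y >= 2 && c->x >= 2);
    if (valid) {
        int sum = 0;
        for (int i = 0; i < 3; i++)
            for (int j = 0; j < 3; j++)
                sum += w[i * 3 + j] * c->win[i][j];
        *out = sum;
    }
    if (++c->x == c->width) {              /* end of line: move to the next row */
        c->x = 0;
        c->y++;
    }
    return valid;
}
```

Each pixel is read from the input exactly once; in hardware the nine multiplications of the inner loops would run in parallel rather than one after another.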
FPGAs
Similar design scheme to that for ASICs, but:
• Quick time-to-market (simpler designing and testing)
• Flexible design & dynamic reprogramming
Available resources:
• Memory blocks (for line buffers)
• Dedicated carry logic (for arithmetic units)
• Built-in 18×18 multipliers (Virtex II)
• Design automation tools and cores
Conclusions:
Suggestions for improving microprocessor performance
• Improve instruction decoding and dispatching (by including compile-time information, VLIW-like architecture)
• Introduce Simultaneous Multithreading
• Enlarge data format in SIMD
Will microprocessor speed grow?
• Clock frequency doubles every five years ...
...but the speed of light never changes
(Moore meets Einstein?)
• Saturated architecture of the microprocessors
– pipelining
– instruction level parallelism
– branch prediction & speculative execution
– compile-time information
Solution?
Microprocessor
MMX (SIMD) coprocessor
FPGA-like coprocessor
?
The rest of the image is not shown because of insufficient computation power