Implementation of the Convolution Operation on General Purpose Processors

Ernest Jamro

AGH Technical University

Kraków, Poland

Contents

1. Introduction.

2. Pentium and Itanium processors:

• pipelining

• superscalar

• Very Long Instruction Word (VLIW)

• Single Instruction Multiple Data (SIMD).

3. Convolvers implemented in FPGAs.

4. Conclusions.

E. Jamro, K. Wiatr; Implementation of Convolution Operation on General Purpose Processors

Mathematical operation

$$b_{y+N/2,\,x+M/2} = \frac{1}{D} \sum_{i=0}^{N-1} \sum_{j=0}^{M-1} w_{i,j} \, a_{y+i,\,x+j}$$

where:

N, M - the size of the convolution kernel (usually odd numbers),

a_{y,x} - an input,

b_{y,x} - an output,

w_{i,j} - a coefficient of the convolution,

D - a common denominator, D = 2^n.

For a 512 × 512 image at 25 frames/s and a 3 × 3 convolution kernel:

$L_M = N_X \cdot N_Y \cdot N_F \cdot N \cdot M = 58\,982\,400$ multiplies/s

$L_A = N_X \cdot N_Y \cdot N_F \cdot (N \cdot M - 1) = 52\,428\,800$ additions/s
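The two figures above can be checked with a few lines of C. This is only a sketch of the arithmetic: the function names are made up for illustration, and the parameters nx, ny, nf, n, m stand for N_X, N_Y, N_F, N and M.

```c
#include <assert.h>
#include <stdint.h>

/* Multiplications per second: one per kernel coefficient per output pixel. */
uint64_t multiplies_per_second(uint64_t nx, uint64_t ny, uint64_t nf,
                               uint64_t n, uint64_t m)
{
    return nx * ny * nf * n * m;
}

/* Additions per second: summing N*M products takes N*M - 1 additions. */
uint64_t additions_per_second(uint64_t nx, uint64_t ny, uint64_t nf,
                              uint64_t n, uint64_t m)
{
    assert(n * m >= 1);   /* the kernel has at least one coefficient */
    return nx * ny * nf * (n * m - 1);
}
```

Calling both functions with (512, 512, 25, 3, 3) gives the multiply and addition rates quoted on the slide.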

C-language program

For every output pixel:

    int sum = D/2;                   // accumulation result, initially D/2 to minimise the division rounding error
    for (int i = 0; i < N; i++)      // vertical convolution
    {
        for (int j = 0; j < M; j++)  // horizontal convolution
            sum += *pw++ * *pa++;    // the kernel of the convolution
        pa += Nx - M;                // pa now points to the first pixel in the next line
    }
    sum /= D;                        // division by the common denominator
    *pb = (BYTE) sum;                // conversion from int (4 bytes) to a 1-byte variable; save the result
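The fragment above handles one output pixel. A minimal sketch of how it might be wrapped to process a whole image follows. The function name convolve, the row-major layout of a, b and w, and the choice to leave border pixels untouched are assumptions made for this example, not taken from the slides. As on the slide, the result is simply cast to BYTE, so the coefficients are assumed to be non-negative with a sum no larger than D.

```c
#include <assert.h>
#include <stddef.h>

typedef unsigned char BYTE;

/* Convolve an Nx-by-Ny image a with an N-by-M kernel w (both row-major)
   and write the result to b. Output pixels that a full window cannot
   cover (the image border) are left untouched. */
void convolve(const BYTE *a, BYTE *b, int Nx, int Ny,
              const int *w, int N, int M, int D)
{
    assert(D > 0 && N >= 1 && M >= 1 && N <= Ny && M <= Nx);
    for (int y = 0; y + N <= Ny; y++) {
        for (int x = 0; x + M <= Nx; x++) {
            const int *pw = w;                        /* first coefficient */
            const BYTE *pa = a + (size_t)y * Nx + x;  /* top-left pixel of the window */
            int sum = D / 2;                          /* rounding offset */
            for (int i = 0; i < N; i++) {             /* vertical convolution */
                for (int j = 0; j < M; j++)           /* horizontal convolution */
                    sum += *pw++ * *pa++;
                pa += Nx - M;                         /* same column, next image line */
            }
            sum /= D;                                 /* common denominator */
            /* store at the centre of the window: b[y + N/2][x + M/2] */
            b[(size_t)(y + N / 2) * Nx + (x + M / 2)] = (BYTE)sum;
        }
    }
}
```

The output index follows the formula given earlier, where the result of the window starting at (y, x) is written to position (y + N/2, x + M/2).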

Implementation on different Pentium processors

Pipelining in Pentium 75MHz

[Figure: pipeline diagram over clock cycles T_N to T_N+4. Pipeline stages: instruction fetching, instruction decoding (stage 1), instruction decoding (stage 2), execution, registers updating. The diagram shows instructions M+1 to M+4 and a branch instruction moving through the stages.]

Loop unrolling

    sum = D/2;                 // initialisation
    sum += *pa++ * *pw++;
    sum += *pa++ * *pw++;
    sum += *pa   * *pw++;
    pa += Nx - (M - 1);        // go to the next line (pa has already advanced M-1 pixels in this line)
    sum += *pa++ * *pw++;
    sum += *pa++ * *pw++;
    ...
    sum /= D;
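For reference, a fully unrolled 3 × 3 version might look like the sketch below. The function name convolve3x3_unrolled is hypothetical, and Nx is assumed to be the image line length in pixels, so that after reading three pixels with two increments the pointer must advance by Nx - 2 to reach the same column in the next line.

```c
#include <assert.h>

typedef unsigned char BYTE;

/* One output pixel of a 3x3 convolution with both loops unrolled.
   pa points to the top-left pixel of the window, pw to the nine
   coefficients in row-major order, Nx is the image line length. */
BYTE convolve3x3_unrolled(const BYTE *pa, const int *pw, int Nx, int D)
{
    assert(D > 0 && Nx >= 3);
    int sum = D / 2;            /* initialisation with the rounding offset */

    sum += *pa++ * *pw++;       /* line 0 */
    sum += *pa++ * *pw++;
    sum += *pa   * *pw++;
    pa += Nx - 2;               /* go to the next line */

    sum += *pa++ * *pw++;       /* line 1 */
    sum += *pa++ * *pw++;
    sum += *pa   * *pw++;
    pa += Nx - 2;               /* go to the next line */

    sum += *pa++ * *pw++;       /* line 2 */
    sum += *pa++ * *pw++;
    sum += *pa   * *pw;

    sum /= D;                   /* common denominator */
    return (BYTE)sum;
}
```

Unrolling removes the loop counters, the comparisons and the conditional branches, which is what relieves the pipeline; the arithmetic itself is unchanged.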

Loop Unrolling

[Chart: relative calculation time after loop unrolling, t_LU/t_stand, for 486-DX4-100, P75, P166, P300 and Athlon 800MHz; vertical axis from 0 to 0.8.]

DSP solution e.g. Texas Instruments TMS320C80

Hardware Loop Control

Special registers:

• Loop Counter - the number of branches to the start of the loop

• Loop End - points to the last instruction in the loop

• Loop Start - points to the first instruction in the loop

• Loop Reload

Superscalar architecture

Instruction Level Parallelism

Non-optimised

    // pixel 3 (top-right pixel in the convolution window)
    xor  edx, edx                  // clear edx
    mov  dl, byte ptr [ecx+2]      // load data a (pixel 3)
    imul edx, dword ptr [edi+8h]   // multiply: pixel3 * w[0][2]
    add  eax, edx                  // accumulate the result of the multiplication

Optimised

    xor  edx, edx                  // start of calculation for pixel 3: clear
    imul ebx, dword ptr [edi+4]    // pixel 2: ebx = pel2*w[0][1]
    mov  dl, byte ptr [ecx+2]      // pixel 3: dl = pel3
    add  eax, ebx                  // end of calculation for pixel 2: eax += ebx
    imul edx, dword ptr [edi+8]    // pixel 3: edx = pel3*w[0][2]
    xor  ebx, ebx                  // start of calculation for pixel 4: clear
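The optimised listing interleaves the instructions of neighbouring pixels so that consecutive instructions do not depend on each other. A rough C-level analogue of the same idea, not the authors' code, is to keep two independent accumulators so that the multiply for one pixel does not have to wait for the addition of the previous one. The function name dot9_two_chains is hypothetical, and a modern compiler may well perform this scheduling on its own.

```c
#include <assert.h>
#include <stddef.h>

typedef unsigned char BYTE;

/* Sum of nine pixel*coefficient products using two independent
   dependency chains (acc0 and acc1) that a superscalar core can
   work on in parallel. */
int dot9_two_chains(const BYTE *a, const int *w)
{
    assert(a != NULL && w != NULL);
    int acc0 = 0;                      /* chain for even-numbered pixels */
    int acc1 = 0;                      /* chain for odd-numbered pixels */
    for (int k = 0; k < 8; k += 2) {
        acc0 += a[k] * w[k];           /* independent of the line below */
        acc1 += a[k + 1] * w[k + 1];
    }
    acc0 += a[8] * w[8];               /* ninth product */
    return acc0 + acc1;                /* merge the two chains */
}
```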

Superscalar (continued)

[Chart: number of instructions executed in a clock cycle for the Normal, Optimised, Multiplierless and Multiplierless optimised variants on 486-DX4-100, P75, P166, P300 and Athlon 800MHz; vertical axis from 0 to 2.5.]

DSP TMS320C80

Parallel Processor (VLIW)

Execution Units

1. Data Unit (DU).

2. Address Unit (AU) local and global.

3. Program Flow Control Unit (PFCU).

Convolution operations

1. multiply (executed by the DU)

2. accumulate (executed by the DU)

3. load the coefficient (executed by the AU)

4. increment coefficient pointer (executed by the AU)

5. load the input pixel (executed by the AU)

6. increment input pixel pointer (executed by the AU)

7. control the loop (executed by the PFCU)

VLIW processor

Crusoe processor compatible with Pentium

[Figure: a 128-bit instruction word holding four operations (FADD, ADD, LD, BRCC), issued to the Floating Point Unit, the Integer ALU Unit, the Load/Save Unit and the Branch Unit.]

Itanium microprocessor

• Greater number of registers (R0-R127 - general purpose registers).

• VLIW-like architecture - each 128-bit bundle contains 3 instructions, which lets the processor dispatch instructions with simple instruction decoding.

• Stops define which instructions can be executed in parallel, which simplifies grouping instructions for parallel execution.

• Control and data speculation (included at compile time).

Simultaneous Multithreading

Cycle  Unit A  Unit B  Unit C
0      P0 I0   P0 I1
1      -       P0 Memory Access
2      P0 Memory Access
3      P0 I2   -       P0 I3
4      -       P1 I0   P1 I1

Time switched multithreading

Cycle  Unit A  Unit B  Unit C
0      P0 I0   P0 I1   P1 I0
1      P1 I1   P1 I2   P1 I3
2      P1 I4   P0 I3   P1 I5
3      P0 I4   P0 I5   P1 I6
4      P1 I7   P0 I6   P0 I7

Simultaneous Multithreading

Disadvantages:

• Larger cache size

• Operating system support

Single Instruction stream, Multiple Data stream (SIMD)

MMX Coprocessor

Multiply:

MM0 = [S, T, U, V]

MM1 = [W, X, Y, Z]

MM0 = [S×W, T×X, U×Y, V×Z]

Multiply & accumulate:

MM0 = [S, T, U, V]

MM1 = [W, X, Y, Z]

MM0 = [S×W + T×X, U×Y + V×Z]
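The two packed operations can be emulated in portable scalar C to make their lane-by-lane behaviour explicit. This is only an illustrative sketch: the function names are hypothetical, the registers are modelled as arrays of four 16-bit lanes, and the behaviour is assumed to match the MMX instructions PMULLW (keep the low 16 bits of each product) and PMADDWD (add adjacent 32-bit products), which the diagrams appear to depict.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Packed multiply: four independent 16-bit lanes, each result truncated
   to the low 16 bits of the product (raw bits, hence uint16_t). */
void packed_multiply(const int16_t mm0[4], const int16_t mm1[4],
                     uint16_t out[4])
{
    assert(mm0 != NULL && mm1 != NULL && out != NULL);
    for (int k = 0; k < 4; k++)
        out[k] = (uint16_t)((int32_t)mm0[k] * mm1[k]);
}

/* Packed multiply and accumulate: products of adjacent lanes are added,
   giving two 32-bit results. The sum overflows only in the corner case
   where all four inputs of a pair are -32768, which is ignored here. */
void packed_multiply_add(const int16_t mm0[4], const int16_t mm1[4],
                         int32_t out[2])
{
    assert(mm0 != NULL && mm1 != NULL && out != NULL);
    for (int k = 0; k < 2; k++)
        out[k] = (int32_t)mm0[2 * k] * mm1[2 * k]
               + (int32_t)mm0[2 * k + 1] * mm1[2 * k + 1];
}
```

With MM0 = [1, 2, 3, 4] and MM1 = [5, 6, 7, 8], the multiply gives [5, 12, 21, 32] and the multiply and accumulate gives [17, 53]. On real hardware each function body is a single instruction, which is where the speed-up for convolution comes from.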

SIMD

[Chart: relative calculation time for standard and MMX processors, t_MMX/t_stand, for P166, P300 and Athlon 800MHz; vertical axis from 0 to 0.8.]

Different processor speeds

[Chart: calculation time [ms] for a 512 × 512 image and a 3 × 3 convolution kernel (integer unit) on 486-DX4-100, P75, P166, P300 and Athlon 800MHz; vertical axis from 0 to 450 ms.]

Comparison of different microprocessors & DSPs

[Chart: calculation time [ms] for P300 MMX, P300, P75, DSP56000 @ 100MHz and TMS320C80 @ 50MHz; vertical axis from 0 to 50 ms.]

Dedicated VLSI Processors

[Figure: block diagram of a 3 × 3 convolver. The input pixel stream passes through two line buffers, so that three image lines are available at the same time. In each line, z^-1 delay registers provide the pixels a_{y+i,x}, a_{y+i,x+1} and a_{y+i,x+2}. Each pixel a_{y+i,x+j} is multiplied by its coefficient w_{i,j} (i, j = 0..2), and an adder sums the nine products to give the output b_{y+1,x+1}.]
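The line-buffer structure of such a convolver can be modelled in software, one input pixel per call, as a rough sketch of the data flow. Everything here is an assumption made for illustration: the type Convolver, the functions convolver_init and convolver_push, the limit MAX_NX and the fixed 3 × 3 window are not taken from the slides.

```c
#include <assert.h>
#include <string.h>

typedef unsigned char BYTE;

enum { MAX_NX = 1024 };          /* longest image line this model accepts */

typedef struct {
    int Nx;                      /* image line length in pixels */
    BYTE line[2][MAX_NX];        /* line buffers holding the two previous lines */
    BYTE win[3][3];              /* window registers (the z^-1 delays) */
    int x, y;                    /* position of the next incoming pixel */
} Convolver;

void convolver_init(Convolver *c, int Nx)
{
    assert(Nx >= 3 && Nx <= MAX_NX);
    memset(c, 0, sizeof *c);
    c->Nx = Nx;
}

/* Feed one pixel in raster order. Returns 1 and stores a result in *out
   once a full 3x3 window of valid pixels is available, otherwise 0. */
int convolver_push(Convolver *c, BYTE pixel, const int w[3][3], int D,
                   BYTE *out)
{
    assert(D > 0);
    for (int i = 0; i < 3; i++) {            /* shift the window one column */
        c->win[i][0] = c->win[i][1];
        c->win[i][1] = c->win[i][2];
    }
    c->win[0][2] = c->line[0][c->x];         /* pixel from two lines above */
    c->win[1][2] = c->line[1][c->x];         /* pixel from one line above */
    c->win[2][2] = pixel;                    /* current pixel */
    c->line[0][c->x] = c->line[1][c->x];     /* age the line buffers */
    c->line[1][c->x] = pixel;

    int valid = (c->y >= 2 && c->x >= 2);    /* window lies fully inside the image */
    if (valid) {
        int sum = D / 2;                     /* rounding offset */
        for (int i = 0; i < 3; i++)
            for (int j = 0; j < 3; j++)
                sum += w[i][j] * c->win[i][j];
        *out = (BYTE)(sum / D);
    }
    if (++c->x == c->Nx) {                   /* advance the raster position */
        c->x = 0;
        c->y++;
    }
    return valid;
}
```

In hardware the nine multiplications and the additions all happen in the same clock cycle; the software model only mirrors how the line buffers and delay registers present the 3 × 3 window while each input pixel is read exactly once.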

FPGAs

Similar design scheme as for ASICs but:

• Quick time-to-market (simpler designing and testing)

• Flexible design and dynamic reprogramming

Available resources:

• Memory blocks (for line buffers)

• Dedicated carry logic (for arithmetic units)

• Built-in 18x18 multipliers (Virtex II)

• Design Automation Tools and Cores

Conclusions:

Suggestions for improving microprocessor performance

• Improve instruction decoding and dispatching (by including compile-time information, VLIW-like architecture)

• Introduce Simultaneous Multithreading

• Enlarge data format in SIMD

Will microprocessor speed grow?

• Clock frequency doubles every five years ...

...but the speed of light never changes

(Moore meets Einstein?)

• Saturated architecture of the microprocessors

– pipelining

– instruction level parallelism

– branch prediction & speculative execution

– compile-time information

Solution?

Microprocessor

MMX (SIMD) coprocessor

FPGA-like coprocessor

Thank you for your attention

?

The rest of the image is not shown because of insufficient computation power
