Implementation of convolution Comparison of microprocessors

Implementation of the Convolution Operation on General Purpose Processors




















Ernest Jamro

AGH Technical University

Kraków, Poland

Contents

1. Introduction.

2. Pentium and Itanium processors:

• pipelining

• superscalar

• Very Long Instruction Word (VLIW)

• Single Instruction Multiple Data (SIMD).

3. Convolvers implemented in FPGAs.

4. Conclusions.

E. Jamro, K. Wiatr; Implementation of Convolution Operation on General Purpose Processors


Mathematical operation

$$ b_{y+N/2,\,x+M/2} = \frac{1}{D} \sum_{i=0}^{N-1} \sum_{j=0}^{M-1} w_{i,j} \cdot a_{y+i,\,x+j} $$

where:

N, M – the size of the convolution kernel (usually odd numbers),

$a_{y,x}$ – an input,

$b_{y,x}$ – an output,

$w_{i,j}$ – a coefficient of the convolution,

D – a common denominator, $D = 2^n$.

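For a 512 × 512 image at 25 frames/s and a 3 × 3 convolution kernel ($N_X \times N_Y$ is the image size, $N_F$ the frame rate):

$$ L_M = N_X \cdot N_Y \cdot N_F \cdot N \cdot M = 58\,982\,400 \ \text{multiplies/s} $$

$$ L_A = N_X \cdot N_Y \cdot N_F \cdot (N \cdot M - 1) = 52\,428\,800 \ \text{additions/s} $$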
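The two figures above can be reproduced with a few lines of C. This is only a sketch of the arithmetic; the function and parameter names are illustrative and do not come from the slides.

```c
#include <stdint.h>

/* Multiplications per second: one per kernel coefficient per output pixel.
 * nx, ny: image size; nf: frames per second; n, m: kernel size. */
uint64_t multiplies_per_second(uint64_t nx, uint64_t ny, uint64_t nf,
                               uint64_t n, uint64_t m)
{
    return nx * ny * nf * n * m;
}

/* Additions per second: summing n*m products takes n*m - 1 additions. */
uint64_t additions_per_second(uint64_t nx, uint64_t ny, uint64_t nf,
                              uint64_t n, uint64_t m)
{
    return nx * ny * nf * (n * m - 1);
}
```

Calling both functions with (512, 512, 25, 3, 3) gives 58 982 400 and 52 428 800 respectively.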



C-language program

For every output pixel:

    int sum = D/2;                    // accumulation result, initially D/2 to minimise the division rounding error
    for (int i = 0; i < N; i++) {     // vertical convolution
        for (int j = 0; j < M; j++)   // horizontal convolution
            sum += *pw++ * *pa++;     // the kernel of the convolution
        pa += Nx - M;                 // pa now points to the first pixel of the window in the next line (Nx: image width)
    }
    sum /= D;                         // division by the common denominator
    *pb = (BYTE) sum;                 // conversion from int (4 bytes) to a 1-byte variable; save the result
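For reference, the per-pixel fragment can be wrapped into a complete function roughly as follows. This is a minimal sketch: the function name, the parameter list, the clamping to 0..255 and the decision to leave border pixels untouched are assumptions added here, not part of the slide.

```c
#include <stdint.h>

typedef uint8_t BYTE;

/* Convolve the nx-by-ny 8-bit image `a` with the n-by-m integer kernel `w`
 * (stored row by row) and write the result to `b`.
 * Border pixels of `b`, where the window does not fit, are not written. */
void convolve(const BYTE *a, BYTE *b, int nx, int ny,
              const int *w, int n, int m, int d)
{
    for (int y = 0; y + n <= ny; y++) {
        for (int x = 0; x + m <= nx; x++) {
            const BYTE *pa = a + y * nx + x;  /* top-left pixel of the window */
            const int  *pw = w;               /* first kernel coefficient */
            int sum = d / 2;                  /* rounding offset */
            for (int i = 0; i < n; i++) {
                for (int j = 0; j < m; j++)
                    sum += *pw++ * *pa++;
                pa += nx - m;                 /* next image line, first window column */
            }
            sum /= d;
            if (sum < 0)   sum = 0;           /* clamping: an assumption of this sketch */
            if (sum > 255) sum = 255;
            b[(y + n / 2) * nx + (x + m / 2)] = (BYTE)sum;
        }
    }
}
```

The output index follows the formula for b: the result of the window whose top-left corner is (y, x) is stored at (y + N/2, x + M/2), using integer division.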



Implementation on different Pentium processors

Pipelining in Pentium 75MHz

[Figure: pipeline timing diagram over the clock cycles T_N to T_N+4. Pipeline stages: instruction fetching, instruction decoding (stage 1), instruction decoding (stage 2), execution, registers updating. The diagram traces instructions M+1 to M+4 and a branch instruction through the stages.]



Loop unrolling

    sum = D/2;                 // initialisation
    sum += *pa++ * *pw++;
    sum += *pa++ * *pw++;
    sum += *pa * *pw++;
    pa += Nx - 2;              // go to the next line (pa has advanced by two pixels; Nx: image width)
    sum += *pa++ * *pw++;
    sum += *pa++ * *pw++;
    ...
    sum /= D;
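Written out in full for a 3 × 3 kernel, the unrolled computation of one output pixel might look like the sketch below. The function name and parameters are illustrative assumptions; the kernel is assumed to be normalised so that the result fits into one byte.

```c
#include <stdint.h>

/* One output pixel of a 3x3 convolution with all loops unrolled.
 * pa: top-left pixel of the window, pw: nine coefficients stored row by row,
 * nx: image width, d: common denominator. */
uint8_t conv3x3_unrolled(const uint8_t *pa, const int *pw, int nx, int d)
{
    int sum = d / 2;           /* initialisation with the rounding offset */

    sum += *pa++ * *pw++;      /* first window line */
    sum += *pa++ * *pw++;
    sum += *pa   * *pw++;
    pa += nx - 2;              /* pa advanced twice, so step back to column 0 of the next line */

    sum += *pa++ * *pw++;      /* second window line */
    sum += *pa++ * *pw++;
    sum += *pa   * *pw++;
    pa += nx - 2;

    sum += *pa++ * *pw++;      /* third window line */
    sum += *pa++ * *pw++;
    sum += *pa   * *pw;

    sum /= d;
    return (uint8_t)sum;
}
```

No loop counter is incremented or tested and no branch is taken, which is where the saving on a pipelined processor comes from.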



Loop Unrolling

[Figure: bar chart of the relative calculation time after loop unrolling, t_LU/t_stand (axis 0 to 0.8), for the 486-DX4-100, P75, P166, P300 and Athlon 800MHz.]



DSP solution e.g. Texas Instruments TMS320C80

Hardware Loop Control

Special registers:

• Loop Counter - number of branches to the start of the loop

• Loop End - points to the last instruction in the loop

• Loop Start - points to the first instruction in the loop

• Loop Reload
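The way such registers remove the loop overhead can be modelled in a few lines of C. This is a conceptual model only, not the exact TMS320C80 behaviour: in particular, treating Loop Reload as the value copied back into Loop Counter when the loop finishes is an assumption of this sketch, and all names are hypothetical.

```c
#include <stdint.h>

typedef struct {
    uint32_t loop_start;    /* address of the first instruction in the loop */
    uint32_t loop_end;      /* address of the last instruction in the loop */
    uint32_t loop_counter;  /* remaining branches back to loop_start */
    uint32_t loop_reload;   /* assumed: value restored into loop_counter at loop exit */
} LoopRegs;

/* Address of the instruction that follows the one at `pc`.
 * The comparison and the branch cost no instruction of their own. */
uint32_t next_pc(LoopRegs *r, uint32_t pc)
{
    if (pc == r->loop_end) {
        if (r->loop_counter > 0) {
            r->loop_counter--;
            return r->loop_start;          /* branch back to the loop start */
        }
        r->loop_counter = r->loop_reload;  /* loop finished: re-arm the counter */
    }
    return pc + 1;                         /* sequential execution */
}
```

With a three-instruction body and Loop Counter set to 2, the body runs three times and only the nine body instructions are executed.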

Superscalar architecture

Instruction Level Parallelism

Non-optimised

    // pixel 3 (top-right pixel in the convolution window)
    xor  edx, edx                   // clear edx
    mov  dl, byte ptr [ecx+2]       // load data a (pixel 3)
    imul edx, dword ptr [edi+8h]    // multiply: pixel3 * w[0][2]
    add  eax, edx                   // accumulate the result of the multiplication

Optimised

    xor  edx, edx                   // start of calculation for pixel 3: clear
    imul ebx, dword ptr [edi+4]     // pixel 2: ebx = pel2*w[0][1]
    mov  dl, byte ptr [ecx+2]       // pixel 3: dl = pel3
    add  eax, ebx                   // end of calculation for pixel 2: eax += ebx
    imul edx, dword ptr [edi+8]     // pixel 3: edx = pel3*w[0][2]
    xor  ebx, ebx                   // start of calculation for pixel 4: clear
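The optimised listing interleaves instructions that belong to neighbouring pixels, so that consecutive instructions do not depend on each other. A rough C-level analogue, given here only as an illustration, is to split the accumulation into two independent chains. The function names are hypothetical, and a modern compiler or out-of-order core may perform this reordering by itself.

```c
#include <stddef.h>

/* One dependency chain: every addition waits for the previous one. */
int dot_single(const unsigned char *a, const int *w, size_t n)
{
    int sum = 0;
    for (size_t k = 0; k < n; k++)
        sum += a[k] * w[k];
    return sum;
}

/* Two independent chains: the work on even and odd elements can overlap
 * in a superscalar processor because neither chain reads the other's result. */
int dot_interleaved(const unsigned char *a, const int *w, size_t n)
{
    int even = 0, odd = 0;
    size_t k = 0;
    for (; k + 1 < n; k += 2) {
        even += a[k] * w[k];
        odd  += a[k + 1] * w[k + 1];
    }
    if (k < n)                 /* leftover element when n is odd */
        even += a[k] * w[k];
    return even + odd;
}
```

Both functions return the same value; only the dependency structure differs.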



Superscalar (continued)

[Figure: bar chart of the number of instructions executed in a clock cycle (axis 0 to 2.5) for four program variants (normal, optimised, multiplierless, multiplierless optimised) on the 486-DX4-100, P75, P166, P300 and Athlon 800MHz.]



DSP TMS320C80

Parallel Processor (VLIW)

Execution Units

1. Data Unit (DU).

2. Address Unit (AU) local and global.

3. Program Flow Control Unit (PFCU).

Convolution operations

1. multiply (executed by the DU)

2. accumulate (executed by the DU)

3. load the coefficient (executed by the AU)

4. increment coefficient pointer (executed by the AU)

5. load the input pixel (executed by the AU)

6. increment input pixel pointer (executed by the AU)

7. control the loop (executed by the PFCU)
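The seven operations listed above are what a single multiply-accumulate statement in a C loop expands to. The sketch below spells them out one per line and tags each with the unit named on the slide; the function itself is illustrative and is not TMS320C80 code.

```c
/* Multiply-accumulate over `count` pixel/coefficient pairs,
 * written so that each elementary operation is visible. */
int mac_loop(const unsigned char *pa, const int *pw, int count)
{
    int sum = 0;
    for (int k = 0; k < count; k++) {  /* 7. control the loop (PFCU) */
        int coeff = *pw;               /* 3. load the coefficient (AU) */
        pw++;                          /* 4. increment coefficient pointer (AU) */
        int pixel = *pa;               /* 5. load the input pixel (AU) */
        pa++;                          /* 6. increment input pixel pointer (AU) */
        int product = coeff * pixel;   /* 1. multiply (DU) */
        sum += product;                /* 2. accumulate (DU) */
    }
    return sum;
}
```

On a sequential processor these are separate steps; the point of the parallel architecture is that operations assigned to different units can be issued together.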



VLIW processor

Crusoe processor compatible with Pentium

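[Figure: a 128-bit VLIW instruction word holding four operations (FADD, ADD, LD, BRCC), which are dispatched to the Floating Point Unit, the Integer ALU Unit, the Load/Save Unit and the Branch Unit respectively.]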


Itanium microprocessor

• Greater number of registers (R0-R127 - general purpose registers).

• VLIW-like architecture - each 128-bit bundle contains 3 instructions, which lets the processor dispatch instructions with simple instruction decoding.

• Stops define which instructions can be executed in parallel, which simplifies grouping instructions for parallel execution.

• Control and data speculation (included at compile time).
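Why fixed-size bundles make decoding simple can be seen from a small sketch. It assumes the IA-64 bundle layout of a 5-bit template in the lowest bits followed by three 41-bit instruction slots; the type and function names are hypothetical, and the encoder exists only to make the example self-checking.

```c
#include <stdint.h>

#define SLOT_MASK ((1ULL << 41) - 1)   /* 41-bit instruction slot */

typedef struct { uint64_t lo, hi; } Bundle;                 /* 128 bits */
typedef struct { unsigned tmpl; uint64_t slot[3]; } Decoded;

/* Pack a template and three slots into a bundle (assumed layout:
 * bits 0-4 template, 5-45 slot 0, 46-86 slot 1, 87-127 slot 2). */
Bundle bundle_encode(unsigned tmpl, uint64_t s0, uint64_t s1, uint64_t s2)
{
    Bundle b;
    b.lo = (tmpl & 0x1Fu) | ((s0 & SLOT_MASK) << 5) | ((s1 & SLOT_MASK) << 46);
    b.hi = ((s1 & SLOT_MASK) >> 18) | ((s2 & SLOT_MASK) << 23);
    return b;
}

/* Splitting a bundle needs only shifts and masks at fixed positions. */
Decoded bundle_decode(Bundle b)
{
    Decoded d;
    d.tmpl    = (unsigned)(b.lo & 0x1F);
    d.slot[0] = (b.lo >> 5) & SLOT_MASK;
    d.slot[1] = ((b.lo >> 46) | (b.hi << 18)) & SLOT_MASK;  /* straddles both words */
    d.slot[2] = (b.hi >> 23) & SLOT_MASK;
    return d;
}
```

Because every field sits at a fixed bit position, the three instructions can be extracted in parallel, without first determining instruction lengths as variable-length encodings require.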



Simultaneous Multithreading

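[Table: time-switched multithreading, cycles 0 to 4 on Units A, B and C. Process P0 issues I0 and I1 in cycle 0, waits on a memory access in cycles 1 and 2, and issues I2 and I3 in cycle 3; process P1 starts issuing (I0, I1) only in cycle 4. Several unit slots stay empty.]

[Table: simultaneous multithreading, cycles 0 to 4 on Units A, B and C. Instructions of P0 and P1 (in the range I0 to I7) are issued interleaved, and every one of the 15 unit slots holds an instruction.]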

Disadvantages:

• Larger cache size

• Operating system support



Single Instruction stream Multiple Data stream (SIMD)

MMX Coprocessor

Multiply

    MM0        =  [ S   | T   | U   | V   ]
    MM1        =  [ W   | X   | Y   | Z   ]
    MM0 (result) = [ S×W | T×X | U×Y | V×Z ]

Multiply & accumulate

    MM0        =  [ S | T | U | V ]
    MM1        =  [ W | X | Y | Z ]
    MM0 (result) = [ S×W + T×X | U×Y + V×Z ]
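The multiply & accumulate operation can be modelled in portable C as below. This is a scalar model of the data flow rather than MMX code: it assumes four signed 16-bit elements per register and two 32-bit results, which is the behaviour of the MMX instruction PMADDWD, and the function name is hypothetical.

```c
#include <stdint.h>

/* Scalar model of a packed multiply & accumulate:
 * four 16-bit products are formed and adjacent pairs are summed
 * into two 32-bit results. The single overflow case of the real
 * instruction (all four inputs of a pair equal to -32768) is ignored. */
void packed_madd(const int16_t a[4], const int16_t b[4], int32_t out[2])
{
    out[0] = (int32_t)a[0] * b[0] + (int32_t)a[1] * b[1];   /* S*W + T*X */
    out[1] = (int32_t)a[2] * b[2] + (int32_t)a[3] * b[3];   /* U*Y + V*Z */
}
```

For a convolution, one register would hold four pixels and the other four coefficients, so a single instruction performs four multiplications and two additions.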



SIMD

[Figure: bar chart of the relative calculation time of the MMX version against the standard version, t_MMX/t_stand (axis 0 to 0.8), for the P166, P300 and Athlon 800MHz.]



Different processor speeds

[Figure: bar chart of the calculation time in ms (axis 0 to 450) for a 512 × 512 image and a 3 × 3 convolution kernel (integer unit) on the 486-DX4-100, P75, P166, P300 and Athlon 800MHz.]



Comparison of different microprocessors & DSPs

[Figure: bar chart of the calculation time in ms (axis 0 to 50) for the P300 MMX, P300, P75, DSP56000 @ 100MHz and TMS320C80 @ 50MHz.]

Dedicated VLSI Processors

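[Figure: block diagram of a 3 × 3 hardware convolver. The input stream passes through two line buffers and chains of z^-1 delay registers, so that the nine pixels a(y,x) to a(y+2,x+2) are available simultaneously. Each pixel a(y+i,x+j) is multiplied by its coefficient w(i,j), and an adder sums the nine products into the output b(y+1,x+1).]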
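The data flow of such a convolver can be imitated in software to see how two line buffers are enough. The following C model is a sketch under several assumptions: a fixed line length `IMG_W`, a 3 × 3 kernel, clamping of the result to 0..255, and hypothetical names throughout. It is not a description of any particular VLSI device.

```c
#include <stdint.h>
#include <string.h>

#define IMG_W 8                    /* assumed image line length */

typedef struct {
    uint8_t line0[IMG_W];          /* line buffer: previous image line */
    uint8_t line1[IMG_W];          /* line buffer: the line before that */
    uint8_t win[3][3];             /* window registers (the z^-1 chains) */
    int row, col;                  /* position of the incoming pixel */
} Convolver;

void conv_init(Convolver *c) { memset(c, 0, sizeof *c); }

/* Feed one pixel in raster order. Returns 1 and stores the result in *out
 * once the window is completely filled with valid pixels, 0 otherwise. */
int conv_push(Convolver *c, uint8_t pixel, const int w[3][3], int d, uint8_t *out)
{
    int x = c->col;

    for (int i = 0; i < 3; i++) {  /* shift the window one column to the left */
        c->win[i][0] = c->win[i][1];
        c->win[i][1] = c->win[i][2];
    }
    c->win[0][2] = c->line1[x];    /* oldest line, a(y,   x+2) */
    c->win[1][2] = c->line0[x];    /* middle line, a(y+1, x+2) */
    c->win[2][2] = pixel;          /* newest line, a(y+2, x+2) */

    c->line1[x] = c->line0[x];     /* advance both line buffers */
    c->line0[x] = pixel;

    int valid = (c->row >= 2 && x >= 2);
    if (valid) {
        int sum = d / 2;           /* rounding offset */
        for (int i = 0; i < 3; i++)
            for (int j = 0; j < 3; j++)
                sum += w[i][j] * c->win[i][j];
        sum /= d;
        if (sum < 0)   sum = 0;    /* clamping: an assumption of this sketch */
        if (sum > 255) sum = 255;
        *out = (uint8_t)sum;
    }

    if (++c->col == IMG_W) {       /* end of line */
        c->col = 0;
        c->row++;
    }
    return valid;
}
```

Each input pixel is read exactly once and, after the initial latency of two lines and two pixels, one output pixel is produced per input pixel. In hardware the nine multiplications of the inner loops are carried out in parallel.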



FPGAs

Similar design scheme as for ASICs but:

• Quick time-to-market (simpler designing and testing)

• Flexible design & dynamic reprogramming

Available resources:

• Memory blocks (for line buffers)

• Dedicated carry logic (for arithmetic units)

• Built-in 18x18 multipliers (Virtex II)

• Design Automation Tools and Cores



Conclusions:

Suggestions for improving microprocessor performance

• Improve instruction decoding and dispatching (by including compile-time information, VLIW-like architecture)

• Introduce Simultaneous Multithreading

• Enlarge data format in SIMD



Will microprocessor speed grow?

• Clock frequency doubles every five years ...

...but the speed of light never changes

(Moore meets Einstein?)

• Saturated architecture of the microprocessors

– pipelining

– instruction level parallelism

– branch prediction & speculative execution

– compile-time information



Solution?

Microprocessor

MMX (SIMD) coprocessor

FPGA-like coprocessor



Thank you for your attention

?

The rest of the image is not shown because of insufficient computation power
