2015-11-24
Lecture 8: SIMD Architectures
 Vector processors
 Array processors
 Cray supercomputers
 Multimedia extensions
Zebo Peng, IDA, LiTH
TDTS 08 – Lecture 8
Introduction
 Manipulation of arrays or vectors is a common operation in scientific and engineering applications.
 Typical operations on array-oriented data include:
   Processing one or more vectors to produce a scalar result.
   Combining two vectors to produce a third one.
   Combining a scalar and a vector to generate a vector.
   A combination of the above three operations.
 Two architectures suitable for vector processing are:
   Pipelined vector processors
   • Implemented in many supercomputers
   Parallel array processors
 They are architectures for data parallelism:
   The user or the compiler does the difficult work of finding the parallelism, so the hardware doesn't have to.
Exploiting Parallelism
There are three major categories of parallelism:
 Instruction-level parallelism (ILP)
   Multiple instructions from one instruction stream are executed simultaneously.
 Thread-level parallelism (TLP)
   Multiple instruction streams are executed simultaneously.
 Data parallelism (DP)
   The same operation is performed simultaneously on arrays of data elements.
Vector Processor Architecture
[Block diagram: an instruction fetch and decode unit dispatches scalar instructions to the scalar unit (scalar registers + scalar functional units) and vector instructions to the vector unit (vector registers + vector functional units); both units access memory.]
Vector Unit Operation
[Diagram: vector registers feed a pipelined ALU, which writes results back to the vector registers; data is streamed between the vector registers and the memory system.]
Vector Processors
 A vector processor operates on an entire vector with one single instruction.
 Strictly speaking, vector processors are not parallel processors:
   There are not several CPUs in a vector processor running in parallel.
   They only behave like SIMD computers.
   They are SISD processors with vector instructions executed in a pipelined manner.
 A vector processor has vector registers, each of which can usually store 64 to 128 values.
 Examples of vector instructions:
   Load a vector from memory into a vector register;
   Store a vector into memory;
   Arithmetic and logic operations between vectors; and
   Operations between vectors and scalars.
 Programmers are allowed to use vector operations directly, and the compiler translates them into vector instructions at the machine level.
Vector Unit
 A vector unit consists of a pipelined functional unit, which performs ALU operations on vectors in a pipeline.
 It also contains several registers:
   A set of general-purpose vector registers, each of length s (e.g., 128);
   A vector length register VL (a scalar value), which stores the length l (0 ≤ l ≤ s) of the currently processed vector(s);
   A mask register M, which stores a set of l bits, one for each element in a vector, interpreted as Boolean values:
   • Vector instructions can be executed in masked mode, so that vector elements corresponding to a false value in M are ignored.
[Figure: a vector register VR1 with the vector length register VL = 8 and a mask register M holding one Boolean bit per element, e.g., M = 1 0 1 0 1 1 1 1 0 1 1 0.]
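The masked mode described above can be sketched in plain Python (a minimal model, not real vector-unit behavior; the function name `masked_add` and the list-based registers are illustrative assumptions):

```python
def masked_add(v1, v2, mask, vl):
    """Element-by-element add of v1 and v2 under mask register M,
    for the first vl elements (vl models the VL register)."""
    result = list(v1)            # masked-out elements keep their old value
    for i in range(vl):          # VL limits how many elements are processed
        if mask[i]:              # M selects which elements participate
            result[i] = v1[i] + v2[i]
    return result

A = [1, 2, 3, 4, 5, 6, 7, 8]
B = [10, 10, 10, 10, 10, 10, 10, 10]
M = [1, 0, 1, 0, 1, 1, 1, 1]
print(masked_add(A, B, M, vl=8))   # positions 1 and 3 are masked out
```

With this mask, elements at positions 1 and 3 stay unchanged, giving [11, 2, 13, 4, 15, 16, 17, 18].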
Vector Program Example
 Consider an element-by-element addition of two N-element vectors A and B to create the sum vector C.
 On an SISD machine, this computation will be implemented as:

      for i = 0 to N-1 do
        C[i] := A[i] + B[i];

      Loop: LOAD  R1, A[i]
            LOAD  R2, B[i]
            ADD   R1, R2
            STO   R1, C[i]
            ADD   i, #1
            BRA   Loop, i <= 128

 This execution has (for N = 128):
   128*6 = 768 instruction fetches;
   128 additions; and
   128 conditional branches.
 In general, there will be N*K instruction fetches (assuming that K instructions are needed for each iteration) and N additions.
 There will also be N conditional branches (if loop unrolling is not used).
Vector Program Example (Cont’d)
 In a vector computer, we need only one statement:

      C[0:N-1] ← A[0:N-1] + B[0:N-1];

 Vector code:

      LOAD_V  V1, A
      LOAD_V  V2, B
      ADD_V   V3, V1, V2
      STO_V   C, V3

 This execution has (for N = 128):
   4 instruction fetches [SISD: 768];
   128 additions;
   0 branches [SISD: 128].
 N additions will still be performed, but now in a pipelined fashion.
 There will only be K' instruction fetches (e.g., Load A, Load B, Add_vector, Write C; K' = 4).
 No conditional branch is needed.
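The two styles can be contrasted in plain Python (a sketch only; the list comprehension stands in for the single whole-vector statement, with the per-element "loop" hidden inside it the way it is hidden inside the vector unit):

```python
N = 128
A = list(range(N))              # A = 0, 1, ..., 127
B = list(range(N, 2 * N))       # B = 128, 129, ..., 255

# SISD style: one scalar add per iteration, plus a branch each time around.
C_scalar = [0] * N
for i in range(N):
    C_scalar[i] = A[i] + B[i]

# Vector style: the whole operation is one statement,
# modeling C[0:N-1] <- A[0:N-1] + B[0:N-1].
C_vector = [a + b for a, b in zip(A, B)]

assert C_scalar == C_vector     # same result, very different instruction counts
```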
Features of Vector Processors
 Advantages:
   Quick fetch and decode of a single instruction for multiple operations.
   The instruction provides a regular source of data, which arrives at each cycle and can be processed in a pipelined fashion.
   The compiler generates code to fully utilize both the vector unit and the scalar unit.
 Memory-to-memory operation mode:
   No vector registers are needed.
   It can process very long vectors, but the setup time is large.
   It appeared in the 70's and died in the 80's (memory bottleneck).
 Register-to-register operations are more popular now:
   Operations are performed on values stored in the vector registers.
 Vector processors are usually part of a supercomputer or a mainframe.
IBM 3090 with Vector Facility
 Similar to a superscalar computer, except that the parallelism is mainly due to vector computation.
 Little impact on software.
 Vector processors execute vector instructions.
Lecture 8: SIMD Architectures
 Vector processors
 Array processors
 Cray supercomputers
 Multimedia extensions
Array Processors
 Built with N identical processing elements (PEs) and a number of memory modules.
   All PEs are under the control of a single control unit.
   They execute instructions in lock-step mode.
 Processing units and memory elements communicate with each other through an interconnection network.
   Different topologies can be used, e.g., crossbar.
 The complexity of the control unit is at the same level as in a uniprocessor system.
 The control unit is usually itself a computer with its own high-speed registers, local memory and ALU.
 The main memory is the collection of the memory modules.
Global Memory Organization
[Diagram: a control unit issues an instruction stream (IS) to processing elements PE1..PEn, which are connected through an interconnection network to the shared-memory modules M1..Mk and to the I/O system.]
Array Processor Classification
 Processing element complexity:
   Single-bit processors
   • e.g., Connection Machine CM-2: 65,536 PEs connected by a hypercube network (Thinking Machines Co.).
   Multi-bit processors
   • e.g., ILLIAC IV (64-bit) and MasPar MP-1 (32-bit).
 Processor-memory interconnection:
   Dedicated memory organization
   • ILLIAC IV, CM-2, MP-1
   Global memory organization
   • Bulk Synchronous Parallel (BSP) computer
Dedicated Memory Organization
[Diagram: the control unit (with its own memory Mcont) issues an instruction stream (IS) to PE1..PEn; each PEi has its own dedicated memory module Mi. The PEs communicate through an interconnection network, which also connects to the I/O system.]
Features of Array Processors
 Control and scalar-type instructions are executed in the control unit.
 Vector instructions are performed in the processing elements.
   Each vector element is mapped to a PE.
 Data organization and detection of parallelism in a program are major issues when using such an architecture.
 Operations like C(i) = A(i) × B(i), 1 ≤ i ≤ n, can be executed in parallel, if the elements of the arrays A and B are distributed properly among the processors/memory modules.
   Ex.: PEi is assigned the task of computing C(i).
   In the ideal case, the number of PEs equals the vector dimension.
An Example
To compute

      Y = Σ (i = 1 to N) A(i) × B(i)

Assuming:
 A dedicated memory organization.
 Elements of A and B are properly and perfectly distributed among the processors (the compiler can help here).
We have:
 The product terms are computed in parallel.
 Additions can be done in log2 N iterations in a pair-wise manner.
 Speedup factor (assuming that addition and multiplication take the same time):

      S = (2N - 1) / (1 + log2 N)

      N |  32 |  64 | 128 | 256 | 512 | 1024
      S | 10.5|  18 |  32 |  57 | 102 |  186
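The computation pattern and the speedup figures can be checked with a short sketch (plain Python standing in for the PEs; `tree_sum` and `speedup` are illustrative names, not part of any real machine):

```python
import math

def tree_sum(values):
    """Pairwise reduction: the list is halved each step,
    so N values are summed in log2(N) steps."""
    steps = 0
    while len(values) > 1:
        values = [values[i] + values[i + 1] for i in range(0, len(values), 2)]
        steps += 1
    return values[0], steps

def speedup(n):
    # Sequential: 2N - 1 operations (N multiplies + N - 1 adds).
    # Parallel: 1 multiply step + log2(N) pairwise addition steps.
    return (2 * n - 1) / (1 + math.log2(n))

N = 128
A = list(range(1, N + 1))
B = [2] * N
products = [a * b for a, b in zip(A, B)]   # all N multiplications "in parallel"
y, steps = tree_sum(products)
print(y, steps)             # 16512 in 7 addition steps (log2 128 = 7)
print(round(speedup(N)))    # 32, matching the table entry for N = 128
```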
ILLIAC IV
 ILLIAC IV is a classical example of an array processor.
 A typical SIMD computer for array processing.
 64 processing elements (PEs), each with its local memory.
 One single control unit (CU).
 The CU can access all memory.
 The PEs can access local memory and communicate with neighbors.
 The CU reads the program and broadcasts instructions to the PEs.
ILLIAC IV Architecture
[Figure: ILLIAC IV architecture block diagram.]
Lecture 8: SIMD Architectures
 Vector processors
 Array processors
 Cray supercomputers
 Multimedia extensions
Cray X1: Parallel Vector Machine
Cray combines several technologies in the X1 machine:
 12.8 GFLOPS high-performance vector processors.
 Shared caches.
 • 4-processor nodes sharing a 2 MB cache, and up to 64 GB of memory.
 Multi-streaming vector processing.
 Multiple-node architecture.
Cray X1: Building Block
 MSP: Multi-Streaming vector Processor.
   Formed by 4 SSPs (each a 2-pipe vector processor).
   Computations are balanced across the SSPs.
   The compiler will try to vectorize/parallelize across the MSP, achieving "streaming."
[Diagram: an MSP built from four custom SSP blocks, each with a scalar unit (S) and two vector pipes (V); 12.8 GFLOPS (64-bit) or 25.6 GFLOPS (32-bit) per MSP. The SSPs share a 2 MB cache (four 0.5 MB banks), with 51 GB/s load and 25-41 GB/s store bandwidth to local memory and the network. Figure source: J. Levesque, Cray.]
Cray X1: Node
[Diagram: a node contains 16 processors (P), each with its own cache ($), connected to 16 memory controllers (M) with memory modules (mem), plus I/O links.]
 Shared memory.
 32 network links and four I/O links per node.
 A Cray X1 machine consists of 32 such nodes.
Cray X1: Parallelism
 Many levels of parallelism:
   Within a processor: vectorization.
   Within an MSP: streaming.
   Within a node: shared memory.
   Across nodes: message passing.
 Some are automated by the compiler, others require work by the programmer:
   This is a common trend.
   The more complex the architecture, the more difficult it is for the programmer to exploit it.
 Hard to fit this machine into a simple taxonomy!
   Locally: SIMD (vector processing).
   Globally: MIMD.
Most Powerful Supercomputer
 Tianhe-2, located at the National Supercomputing Center in Guangzhou, China.
   Performance: 33.86 petaFLOPS (10^15 FLOPS).
   Peak rate: 54.9 petaFLOPS.
   Total memory size: 1,375 terabytes (10^12 bytes).
   Power consumption: 17.6 MW.
   Huge number of microprocessors (16,000 compute nodes, with a total of 3,120,000 cores, Intel Xeon Phi).
   Cost is estimated at $390 million.
 [Triolith: NSC in Linköping, 407 teraFLOPS, ranked 122nd in the world.]

For comparison, a typical high-end PC:
   Performance: 20 gigaFLOPS.
   No. of cores: 8.
   Clock rate: 3.5 GHz.
   Memory: 16 GB.
Growth of Supercomputer Performance
[Figure: TOP500 performance over time.] The y-axis shows performance in GFLOPS. The red line denotes the fastest supercomputer; the yellow line, no. 500; and the dark blue line, the total combined performance of the supercomputers on the TOP500 list.
Lecture 8: SIMD Architectures
 Vector processors
 Array processors
 Cray supercomputers
 Multimedia extensions
Multimedia Extensions
How do we extend general-purpose microprocessors so that they can handle multimedia applications efficiently?
Analysis of the need:
 Video and audio applications very often deal with large arrays of small data types (8 or 16 bits).
 Such applications exhibit a large potential for SIMD (vector) parallelism.
   Data parallelism.
Solutions:
 General-purpose microprocessors are equipped with special instructions to exploit this parallelism.
 The specialized multimedia instructions perform vector computations on bytes, half-words, or words.
Special Instructions
 Conventional instruction sets have been extended to improve performance with multimedia applications:
   MMX for the Intel x86 family;
   VIS for UltraSparc;
   MDMX for MIPS; and
   MAX-2 for Hewlett-Packard PA-RISC.
 The Pentium line provides 57 MMX instructions, which treat data in a SIMD fashion to improve the performance of:
   Computer-aided design;
   Internet applications;
   Computer visualization;
   Video games; and
   Speech recognition.
Implementation
The basic idea: sub-word execution.
 Use the entire width of a processor data path (e.g., 64 bits) when processing small data (8, 12, or 16 bits).
 With a word size of 64 bits, an adder can be used to implement eight 8-bit additions in parallel.

[Figure: a 64-bit register R1 holding eight 8-bit sub-words a7..a0, with all eight incremented (+1) in parallel by a single instruction.]

 MMX technology allows a single instruction to work on multiple pieces of data.
 Consequently we have practically a kind of SIMD parallelism, at a reduced scale and with very low cost.
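The sub-word trick can be sketched in Python by packing eight 8-bit lanes into one 64-bit integer (an assumed bit layout for illustration; note that a plain 64-bit add is only safe when no lane overflows — real MMX hardware handles overflow per lane, as the next slides point out):

```python
LANES = 8
LANE_BITS = 8
MASK64 = (1 << 64) - 1

def pack(bytes_):
    """Pack eight 8-bit values a0..a7 into one 64-bit word (a0 in the low lane)."""
    word = 0
    for i, b in enumerate(bytes_):
        word |= (b & 0xFF) << (i * LANE_BITS)
    return word

def unpack(word):
    """Split a 64-bit word back into its eight 8-bit lanes."""
    return [(word >> (i * LANE_BITS)) & 0xFF for i in range(LANES)]

# One 64-bit addition increments all eight lanes at once:
ONES = pack([1] * LANES)                 # 0x0101010101010101
r1 = pack([10, 20, 30, 40, 50, 60, 70, 80])
r1 = (r1 + ONES) & MASK64                # single add, eight results
print(unpack(r1))                        # [11, 21, 31, 41, 51, 61, 71, 81]
```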
Packed Data Types
 Three packed data types are defined for parallel operations: packed byte, packed word, and packed double word.

      Packed byte:        q7 | q6 | q5 | q4 | q3 | q2 | q1 | q0
      Packed word:        q3 | q2 | q1 | q0
      Packed double word: q1 | q0
      Quad word:          q0
      (each layout is 64 bits wide)
SIMD Arithmetic Examples
ADD R3 ← R1, R2  (packed byte addition):

      R1:  a7    a6    a5    a4    a3    a2    a1    a0
      R2:  b7    b6    b5    b4    b3    b2    b1    b0
      R3: a7+b7 a6+b6 a5+b5 a4+b4 a3+b3 a2+b2 a1+b1 a0+b0

Hardware support is needed to check for sub-word execution overflow!

MULADD R3 ← R1, R2  (packed multiply-add):

      R1:  a7   a6   a5   a4   a3   a2   a1   a0
      R2:  b7   b6   b5   b4   b3   b2   b1   b0
      R3: (a6×b6)+(a7×b7)  (a4×b4)+(a5×b5)  (a2×b2)+(a3×b3)  (a0×b0)+(a1×b1)
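The MULADD semantics above — adjacent sub-word products summed pairwise, halving the element count — can be modeled in a few lines (a sketch of the behavior shown in the diagram; the function name `muladd` is illustrative):

```python
def muladd(r1, r2):
    """Pairwise multiply-add over packed operands:
    returns [(a0*b0)+(a1*b1), (a2*b2)+(a3*b3), ...]."""
    return [r1[i] * r2[i] + r1[i + 1] * r2[i + 1]
            for i in range(0, len(r1), 2)]

a = [1, 2, 3, 4, 5, 6, 7, 8]     # a0..a7
b = [1, 1, 1, 1, 2, 2, 2, 2]     # b0..b7
print(muladd(a, b))              # [3, 7, 22, 30]
```

This mirrors instructions such as the x86 packed multiply-and-add, where eight 16-bit products collapse into four 32-bit sums in a single operation.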
Performance Comparison
 The following shows the performance of Pentium processors (32-bit machines) with and without MMX technology:

      Application      | Without MMX | With MMX | Speedup
      -----------------+-------------+----------+--------
      Video            |      155.52 |   268.70 |   1.72
      Image processing |      159.03 |   743.90 |   4.67
      3D geometry      |      161.52 |   166.44 |   1.03
      Audio            |      149.80 |   318.90 |   2.13
      OVERALL          |      156.00 |   255.43 |   1.64
Summary
 Vector processors are SISD processors whose instruction sets include instructions operating on vectors.
   They are implemented using pipelined functional units.
   They behave like SIMD machines.
 Array processors, being typical SIMD machines, execute the same operation on a set of interconnected processing units.
 Both vector and array processors are specialized for numerical problems expressed in matrix or vector formats.
   They are usually integrated inside a large computer.
 Many modern architectures deploy several parallel-architecture concepts at the same time, as in the Cray X1.
 Multimedia applications exhibit a large potential for SIMD parallelism, which can be exploited by extending the traditional SISD instruction set and architecture.