imagine 3 - Chip Architect

GenTera’s
I M A G IN E 3
Introducing:
GenTera’s
IMAGINE 3
HANS DE VRIES
Building Blocks
- Imagine 3 Core Processor: Multi-Stream (32) Scalar / Vector Processor, 80 Billion operations / second
- Advanced High Quality 3D Graphics / Volume processing Pipelines: 220 Billion operations / second
- Motion Estimator: 100 Billion op/s
- Graphics Mask Generator
- PCI/AGP Bus interface: 0.5 Gigabyte/s
- 128 bit DDR SDRAM Bus: 4.2 Gigabyte/s
- Data (Video) Input: 160 Megabyte/s
- Data (Video) Output: 1.0 Gigabyte/s
- Data flow Ring Input: 2.0 Gigabyte/s
- Data flow Ring Output: 2.0 Gigabyte/s
Core Processor
HISC™ processor architecture
120 General Purpose registers (2x32 bit)
256 Vector registers (2x32 bit)
256x4 MAC Vector registers (2x32 bit)
128 Special Purpose control registers. (2x32 bit), 1200 control table registers (2x32 bit)
80 Billion operations per second (320 operations per cycle),
including 64 Multiply Accumulates per cycle with saturate.
10 Gigabyte per second streaming I/O (memory & processor I/O)
40 Conditional operations per cycle. 24 internal addresses per cycle
32 simultaneous concatenated vector streams (32 bit) (128 in byte mode)
Single cycle 2D and 3D addressing modes. (1D, 2D and 3D memory management)
C and C++ compiler,
Assembler, Linker, Debugger
Visual Simulator
Soft In circuit Emulator
Image Processing Library
3D graphics Library
Multi Media Library
Machine Vision Library
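The throughput figures quoted on the slides imply a clock rate, which a few lines of arithmetic can check. This is only a sanity check on the quoted numbers (the 2 x 440 operations per cycle figure comes from the 3D texture/volume slide); the variable names are made up for illustration.

```python
# Figures quoted on the slides.
core_ops_per_second = 80e9         # core processor
core_ops_per_cycle = 320
pipeline_ops_per_cycle = 2 * 440   # 3D texture/volume hardware
macs_per_cycle = 64

# Clock rate implied by the core figures.
clock_hz = core_ops_per_second / core_ops_per_cycle
cycle_ns = 1e9 / clock_hz

# Rates that follow from that clock.
pipeline_ops_per_second = pipeline_ops_per_cycle * clock_hz
macs_per_second = macs_per_cycle * clock_hz

print(f"clock: {clock_hz / 1e6:.0f} MHz, cycle: {cycle_ns:.0f} ns")
print(f"3D pipelines: {pipeline_ops_per_second / 1e9:.0f} Billion operations/s")
print(f"MACs: {macs_per_second / 1e9:.0f} Billion/s")
```

The implied 4 ns cycle matches the cycle time stated on the 3D texture/volume slide.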
HISC Processor Architecture
HISC: Hierarchical Instruction Set Computer
RISC LEVEL:
Provides C and C++ compatibility.
VLIW LEVEL:
A moderate length VLIW instruction word plus a fully programmable bus interconnect directly controlled by the instruction code.
VARIABLE LENGTH VECTOR PROCESSING:
Enables up to 32 simultaneous and concatenated Vector Processing Streams. Word based Vector Processing (32, 2x16, 4x8) is symmetrically applied throughout the entire architecture.
EXTENDED VECTOR PROCESSING:
Numerous function specific Control Registers add extended functionality that is activated by the group of extended operations (as opposed to the basic operations). This increases the effective instruction word for vector operations to 1000+ bits.
Core Processor
Examples of Basic Processor Stream performance
(from external memory to external memory)
Standard GUI functions:
Screen to Screen Copy: 2000 Mega pixels/s (8 bit pixels), 500 Mega pixels/s (32 bit pixels)
3 operand ROPS: 1000 Mega pixels/s (8 bit pixels)
Bitmap to Color expansion: 2000 Mega pixels/s (8 bit pixels)
Windows Direct Draw GUI functions:
Pseudo to True Color: 500 Mega pixels/s (8 bit pseudo to 16 bit or 32 bit colors)
True Color to Pseudo: 500 Mega pixels/s (32,16 bit color to 8 bit pseudo color)
Z buffer aware copy: 666 Mega pixels/s (8 bit pixels, 16 bit Z buffer), 500 Mega pixels/s (16 bit pixels, 16 bit Z buffer)
Alpha Blended Copy: 250 Mega pixels/s (32 bit ARGB pixels)
Core Processor
Examples of Core Processor stream performance (2)
(from external memory to external memory)
Multi Media Functions: (numbers in result pixels/s)
YUV to RGB conversion: 500 Mega pixels/s (32 bit color, 16 bit hi-color, 8 bit pseudo)
DCT and IDCT (8x8 blocks): 167 Mega pixels/s (16 bit values, 32 bit calculations)
DCT and IDCT (8x8 blocks): 667 Mega pixels/s (8 bit values, 16 bit calculations)
Photo shop type Image Processing Functions: (numbers in result pixels/s)
3x3 kernel convolution: 2000 Mega pixels/s (8 bit pixels, 16 bit calculations)
7x7 kernel convolution: 500 Mega pixels/s (8 bit pixels, 16 bit calculations)
Bi-cubic Rotation: 1000 Mega pixels/s (8 bit pixels, 16 bit calculations)
Bi-cubic Scaling: 1000 Mega pixels/s (8 bit pixels, 16 bit calculations)
3D graphics Geometry:
(4x4) homogeneous transformations plus perspective divides for X , Y and Z for meshed
triangles in 32 bit floating point (IEEE): 50 Million triangles/s
Core Processor
[Diagram: core data path]
Data Read Ports: REG A0, VIO 0, REG A1, VIO 1 (busses A0, A1); REG B0, DIO 0, REG B1, DIO 1 (busses B0, B1)
Interconnect (100 % connectivity)
Data Processing Units: MAC X0, ALU X0, MAC X1, ALU X1, MAC Y0, ALU Y0, MAC Y1, ALU Y1 (busses X0, X1, Y0, Y1)
Data Write Ports: VIO WR, REG WR0, DIO WR, REG WR1
Core Processor
Control Register Busses
Control reg bus 1: bits [63:32]
Control reg bus 0: bits [31:0]
[Diagram: units attached to the control register busses: I3D0/1, MES0/1, RING0/1, REG, ALU, MAC, VIO 0/1, DIO, MSK0/1, VAU 0/1, SEQ, MTAB, EMI and the bus interconnect (A0, A1, B0, B1, X0, X1, Y0, Y1)]
Instruction Word
Highly orthogonal VLIW instruction word
Data Processing Functions
Field start bits, bits 63..0: ND0 (=0) 63, Dd 59, Wr0 48, B0 36, A0 24, Y0 12, X0 0
Field start bits, bits 127..64: (bit 127), Da 123, Wr1 112, B1 100, A1 88, Y1 76, X1 64
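The bit markers on the instruction word slide suggest how the 128 bit word splits into fields. The following is a minimal sketch, assuming each field runs from its marked start bit up to the next marker; that width rule, the `bit127` label and the function names are assumptions made for illustration, not taken from the slides.

```python
# Start bits within one 64 bit half, as marked on the slide; 64 closes the last field.
STARTS = [0, 12, 24, 36, 48, 59, 63, 64]
LOW_NAMES = ["X0", "Y0", "A0", "B0", "Wr0", "Dd", "ND0"]
HIGH_NAMES = ["X1", "Y1", "A1", "B1", "Wr1", "Da", "bit127"]  # "bit127" is a placeholder name

def split_half(half, names):
    """Cut one 64 bit half into fields (assumed: start bit up to next start bit)."""
    fields = {}
    for name, lo, hi in zip(names, STARTS, STARTS[1:]):
        fields[name] = (half >> lo) & ((1 << (hi - lo)) - 1)
    return fields

def split_instruction(word):
    """Split a 128 bit instruction word into its low-half and high-half fields."""
    fields = split_half(word & (2**64 - 1), LOW_NAMES)
    fields.update(split_half(word >> 64, HIGH_NAMES))
    return fields

# Example: X0 = 0xABC, Dd = 0xF, Y1 = 0x123, everything else zero.
word = 0xABC | (0xF << 59) | (0x123 << (64 + 12))
fields = split_instruction(word)
print({name: hex(value) for name, value in fields.items() if value})
```

Under this reading the data processing and read port fields are 12 bits wide and the write port fields 11 bits.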
Interconnect
The Instruction Word provides 8-way interconnectivity in Scalar-Processing Mode.
Data Processing Unit: Select path 1 and Select path 2 each select one of A0 A1 B0 B1 X0 X1 Y0 Y1.
Data Write Port: Select path selects one of A0 A1 B0 B1 X0 X1 Y0 Y1.
Interconnect
The Instruction Word provides 100% interconnectivity in Vector Processing Mode.
Sources: A0 REG, A0 MEM, B0 REG, B0 MEM, X0 ALU, X0 MAC, Y0 ALU, Y0 MAC, A1 REG, A1 MEM, B1 REG, B1 MEM, X1 ALU, X1 MAC, Y1 ALU, Y1 MAC.
Data Processing Unit: Select path 1 and Select path 2 each select one of these sources.
Data Write Port: its select path selects one of the same sources.
Instruction Word
Data processing instruction fields
Y0 starts at bit 12 and X0 at bit 0 (bit markers: 24, 20, 16, 12, 8, 4, 0). Each field selects a function plus path 1 and path 2:
- 1: MAC, path 1, path 2
- 0 0: ALU, path 1, path 2
- 0 1: Shift, Ufu, path 1, path 2
Instruction Word
Data read ports instruction fields
B0 starts at bit 36 and A0 at bit 24 (bit markers: 48, 44, 40, 36, 32, 28, 24).
[Diagram: encodings of the read port fields]
- memory port: size, 0 0 0 0, VIO function (A0); DIO read, size (B0)
- register port 0 0: Be31 / Be20, with 16 bit imm. [15:8] and 16 bit imm. [7:0]
- register port 0 1: register, size
- register port 1 0: control register, size
- register port 1: 11 bit signed immediate
Instruction Word
DIO address / data and (control-) register write ports fields
[Diagram: bit markers 127 / 63, 123 / 59, 62, 58, 56, 52, 48]
- ND: Non data-processing function
- DIO address select: DIO address, DIO rd/wr, size, rd addr x rd addr
- DIO data select: size, wr addr x wr data
- Wr0 register port: 0 = register path, 1 = control register path
Parallel Conditional Processing
64 bit Uniform Status Register
[63:56] X1, Y1: Status for Byte 7
[55:48] X1, Y1: Status for Byte 6
[47:40] X1, Y1: Status for Byte 5
[39:32] X1, Y1: Status for Byte 4
[31:24] X0, Y0: Status for Byte 3
[23:16] X0, Y0: Status for Byte 2
[15:8] X0, Y0: Status for Byte 1
[7:0] X0, Y0: Status for Byte 0
ALU Status (ALU, Shifts, Unary functions): Overflow, Carry, Minus, Zero (S0, C0, M0, Z0)
MAC Status: Wrong, Lower, Higher, Inside (W0, L0, H0, I0)
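Per-byte status is what makes conditional processing possible in 4 x 8 mode: every byte lane produces its own Overflow, Carry, Minus and Zero flags. The sketch below emulates that for a packed 4 x 8 addition using the conventional definitions of those flags; how the hardware packs the flags into the status register is not shown on the slides, so the dictionary layout and function names are purely illustrative.

```python
def lane_status(a, b):
    """Add two 8 bit values and return the result with conventional ALU flags."""
    total = a + b
    result = total & 0xFF
    return {
        "result": result,
        "carry": total > 0xFF,                 # unsigned carry out of bit 7
        "zero": result == 0,
        "minus": bool(result & 0x80),          # sign bit of the result
        # signed overflow: operands share a sign that the result does not
        "overflow": bool(~(a ^ b) & (a ^ result) & 0x80),
    }

def add4x8_with_status(word_a, word_b):
    """Add two packed 4 x 8 bit words lane by lane, one status set per byte."""
    lanes = []
    for byte in range(4):
        a = (word_a >> (8 * byte)) & 0xFF
        b = (word_b >> (8 * byte)) & 0xFF
        lanes.append(lane_status(a, b))
    return lanes

lanes = add4x8_with_status(0x7F0000FF, 0x01000001)
for byte, status in enumerate(lanes):
    print(byte, status)
```

A later conditional operation could then act on each byte independently, for example writing only the lanes whose Zero flag is clear.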
Parallel Conditional Processing
Status: Generation, Collection and Application
[Diagram: the ALU and MAC of X0 and Y0 generate status for bytes 0..3, those of X1 and Y1 for bytes 4..7. The collected status is applied at the A0, B0, V0 and A1, B1, V1 ports (VEC. and REG.) and at the MSK and VAU units.]
Register File
GENERAL PURPOSE REGISTERS, VECTOR REGISTERS
ADDRESSES
- 256 vector registers: Vector Index generators supply 8 x Write Indices (Write Port C), 8 x Read A Indices (Read Port A) and 8 x Read B Indices (Read Port B)
- 120 general registers: General Register Addresses from the Instruction Code supply 2 x Write Address, 2 x Read A Address and 2 x Read B Address
- up to 24 independent and conditional byte addresses
- up to 8 independent and conditional byte write enables
DATA PORTS (2 x 32 bit wide / 4 x 16 bit wide / 8 x 8 bit wide)
- Write Port C: Write Data (2,4,8 x), input BUS select
- Read Port A: Read A Data (2,4,8 x), output BUS register, to A1 and A0
- Read Port B: Read B Data (2,4,8 x), output BUS register, to B1 and B0
- INTERNAL BUS MATRIX
Function Units
ALU: Arithmetic, Boolean, Shift / Rotate, Unary Functions
MULTIPLIER (MAC): (un)signed x (un)signed; binary point at: end, middle or top; graphics formats ( 0.0..1.0 == 00..ff ); 4 x 8, 2 x 16, 1 x 32, 32 bit float
Vector Registers: 256 words x 64 bit
ACCUMULATOR: 4 x 8, 2 x 16, 1 x 32, 32 bit float; Variable Range Clamp
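The graphics format mentioned for the multiplier maps 00..ff onto 0.0..1.0, so that ff times ff gives ff again rather than fe. A minimal sketch of such a multiply follows; the rounding rule used here (add 127, divide by 255) is one common software formulation and only an assumption about what the hardware does.

```python
def mul_unorm8(a, b):
    """Multiply two 8 bit values that stand for 0.0..1.0 (00..ff).

    Assumed rounding: round to nearest by adding 127 before dividing by 255.
    """
    return (a * b + 127) // 255

print(mul_unorm8(0xFF, 0xFF))  # 1.0 * 1.0 stays at full scale
print(mul_unorm8(0xFF, 0x80))  # 1.0 * x returns x
print(mul_unorm8(0x80, 0x80))  # roughly 0.5 * 0.5
```

A plain 8 x 8 multiply followed by a shift of 8 would instead give fe for ff times ff, which is why a separate graphics format is useful.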
Multiplier / Accumulator
8 bit Matrix functions:
Quad Inproduct (16 multiplies & 12 adds per MAC)
32 bit input data into a 4 tap shift register (4 times for each byte)
[Diagram: 16 multiplier cells, each labelled 8 bit, 16 bit, 16 bit, summed into four 16 bit outputs]
Matrixvec (16 multiplies & 12 adds per MAC)
32 bit input data distributed to all four columns (4 times for 4 bytes)
[Diagram: 16 multiplier cells, each labelled 8 bit, 16 bit, 16 bit, summed into four 16 bit outputs]
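The count of 16 multiplies and 12 adds per MAC follows directly from a 4 x 4 matrix times 4-component vector product: each of the four output values needs four multiplies and three adds. The sketch below counts them explicitly; it models only the arithmetic, not the shift register or the 8 and 16 bit operand widths, and the function name is made up.

```python
def matxvec(matrix, vector):
    """4x4 matrix times 4-vector, counting multiplies and adds along the way."""
    multiplies = adds = 0
    result = []
    for row in matrix:
        acc = row[0] * vector[0]
        multiplies += 1
        for coef, value in zip(row[1:], vector[1:]):
            acc += coef * value
            multiplies += 1
            adds += 1
        result.append(acc)
    return result, multiplies, adds

identity = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]]
result, multiplies, adds = matxvec(identity, [1, 2, 3, 4])
print(result, multiplies, adds)
```

Four independent 4-component inner products (the quad inproduct) have the same operation count.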
Multiplier / Accumulator
8 bit Matrix functions:
Open GL Blend Function ( 8 multiplies & 4 adds per MAC)
Two 32 bit inputs, each into a 4 tap shift register (4 times for each byte)
[Diagram: 8 multiplier cells, each labelled 8 bit, 16 bit, 16 bit, summed in pairs into four 16 bit outputs]
Coefficients fixed or derived from the input operands:
0 BLEND_CONSTANT
1 BLEND_ZERO
2 BLEND_ONE
3 SRC_COLOR
4 INV_SRC_COLOR
5 SRC_ALPHA
6 INV_SRC_ALPHA
7 DST_ALPHA
8 INV_DST_ALPHA
9 DST_COLOR
10 INV_DST_COLOR
11 SRC_ALPHA_SATURATE
12 BOTH_SRC_ALPHA (source) / BOTH_SRC_ALPHA (dest)
13 BOTH_INV_SRC_ALPHA (source) / BOTH_INV_SRC_ALPHA (dest)
14 MAX_INTENSITY (source) / MAX_INTENSITY (dest)
15 MIN_INTENSITY (source) / MIN_INTENSITY (dest)
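A blend of one 4 x 8 bit pixel needs two multiplies and one add per channel, which gives the 8 multiplies and 4 adds quoted for the blend function. The sketch below shows this for a few of the listed coefficient selections; the rounding, the saturation to ff, the ARGB channel order and the Python names are assumptions made for illustration.

```python
def blend_factor(name, src, dst):
    """Return per-channel factors (0..255) for a few of the listed selections."""
    src_alpha = src[0]  # assumed channel order: (A, R, G, B)
    if name == "BLEND_ZERO":
        return (0, 0, 0, 0)
    if name == "BLEND_ONE":
        return (255, 255, 255, 255)
    if name == "SRC_ALPHA":
        return (src_alpha,) * 4
    if name == "INV_SRC_ALPHA":
        return (255 - src_alpha,) * 4
    raise ValueError(f"factor {name} not covered by this sketch")

def blend_pixel(src, dst, src_name, dst_name):
    """out = src * src_factor + dst * dst_factor per channel, saturated to 255."""
    sf = blend_factor(src_name, src, dst)
    df = blend_factor(dst_name, src, dst)
    out = []
    for s, d, fs, fd in zip(src, dst, sf, df):
        value = (s * fs + d * fd + 127) // 255   # 2 multiplies, 1 add (plus rounding)
        out.append(min(255, value))              # saturate
    return tuple(out)

src = (128, 255, 0, 0)     # half transparent red
dst = (255, 0, 0, 255)     # opaque blue
print(blend_pixel(src, dst, "SRC_ALPHA", "INV_SRC_ALPHA"))
print(blend_pixel(src, dst, "BLEND_ONE", "BLEND_ZERO"))
```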
Multiplier / Accumulator
16 bit Matrix functions:
Convolute (4 multiplies & 2 adds per Multiplier)
32 bit input data into a 2 tap shift register (2 times for each 16 bit word)
[Diagram: 4 multiplier cells, each labelled 16 bit, 32 bit, 32 bit]
Transform (4 multiplies & 2 adds per Multiplier)
32 bit input data distributed to both columns (2 times for each 16 bit word)
[Diagram: 4 multiplier cells, each labelled 16 bit, 32 bit, 32 bit]
Mix:
MH [63:32] = Coef 10[31:0] . Mb [31:16] + Coef 11[31:0] . Ma [31:16]
ML [31:0] = Coef 00[31:0] . Mb [15:0] + Coef 01[31:0] . Ma [15:0]
Merge:
MH [63:32] = Coef 10[31:0] . Ma [31:16] + Coef 11[31:0] . Ma [15:0]
ML [31:0] = Coef 00[31:0] . Mb [31:16] + Coef 01[31:0] . Mb [15:0]
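The Mix and Merge equations can be written out directly. The sketch below treats the 16 bit halves as unsigned and wraps each result to 32 bits; the slides do not state signedness or overflow handling, so both are assumptions, as are the function names.

```python
MASK32 = 0xFFFFFFFF

def hi16(word):
    return (word >> 16) & 0xFFFF

def lo16(word):
    return word & 0xFFFF

def mix(ma, mb, c00, c01, c10, c11):
    """Mix: each output combines the same half of Mb and Ma."""
    mh = (c10 * hi16(mb) + c11 * hi16(ma)) & MASK32
    ml = (c00 * lo16(mb) + c01 * lo16(ma)) & MASK32
    return mh, ml

def merge(ma, mb, c00, c01, c10, c11):
    """Merge: MH combines both halves of Ma, ML both halves of Mb."""
    mh = (c10 * hi16(ma) + c11 * lo16(ma)) & MASK32
    ml = (c00 * hi16(mb) + c01 * lo16(mb)) & MASK32
    return mh, ml

ma = 0x00030004   # halves 3 and 4
mb = 0x00050006   # halves 5 and 6
print(mix(ma, mb, 1, 2, 3, 4))
print(merge(ma, mb, 1, 2, 3, 4))
```

Each function uses 4 multiplies and 2 adds, matching the count given for the 16 bit matrix functions.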
Multiplier/Accumulator
Each of the 4 Multiplier/Accumulators handles all operations with the same hardware, utilizing:
- 32 x 32 bit extern, 32 x 32 bit intern, 64 bit accumulate
- 32 x 32 bit floating point, 64 bit accumulate
- 16 x 16 bit extern, 16 x 32 bit intern, 32 bit accumulate
- 8 x 8 extern, 8 x 16 intern
Imagine 3 operations per cycle:
64: 8x16 bit: quad in-product (4 comp.)
64: 8x16 bit: 4x4 matrix x vector
32: 8x16 bit: Open GL blending functions
16: 16x16 bit: in-product, cross-product
16: 16x16 bit: complex product
16: 16x32 bit: FIR filter
16: 16x32 bit: in-product, cross-product
16: 16x32 bit: complex product
Vector processing
Variable length vector processing made simple.
[Timing diagram over cycles 1 to 35, showing the operations of the example in order:]
genad(A0)
B0=input
A0=rd4x8(ri)
X0=mult(A0,B0,nuu)
genad(A1)
A1=rd4x8(ri)
Y0=subsat(X0,A1)
B1=rd(RING_Data)
X1=mult(Y0,B1,nus)
DA=again
D0=word4x8(uI)
X0=addsat(X1,D0)
Y0=matxvec(X0)
Y1=inproduct(X0)
X1=addsat(Y0,Y1)
outputV1
ACTUAL ASSEMBLY CODE FOR THE EXAMPLE ABOVE:
repeat, graph (label_1);;;
label_1:
genad(A0) => B0=input, A0=rd4x8(ri) => X0=mult(A,V,nuu ) ===> genad(A1) =>A1=rd4x8(ri) => Y0=subsat(X0,A1),
B1=rd4x8(RING_Data) => X1=mult(Y0,B1,nus) ===> DA=Again ==> D0=word4x8(uI), X0=addsat(X1,D0) => Y0=matxvec(X0),
Y1=inproduct(X0) =====> X1=addsat(Y0,Y1) => outputV1;
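The example chains saturating 4 x 8 operations (addsat, subsat) between the multiplies. What saturation means for a packed word can be pictured with a small emulation; the slides give only the operation names, so unsigned per-byte saturation and the Python function names below are assumptions.

```python
def _lanes(word):
    """Split a 32 bit word into four bytes, lowest byte first."""
    return [(word >> (8 * i)) & 0xFF for i in range(4)]

def _pack(lanes):
    """Pack four bytes (lowest first) back into a 32 bit word."""
    return sum(value << (8 * i) for i, value in enumerate(lanes))

def addsat4x8(a, b):
    """Per-byte add that clamps at 0xFF instead of wrapping (assumed unsigned)."""
    return _pack([min(0xFF, x + y) for x, y in zip(_lanes(a), _lanes(b))])

def subsat4x8(a, b):
    """Per-byte subtract that clamps at 0x00 instead of wrapping (assumed unsigned)."""
    return _pack([max(0x00, x - y) for x, y in zip(_lanes(a), _lanes(b))])

print(hex(addsat4x8(0x01020304, 0x01010101)))
print(hex(addsat4x8(0xFF000000, 0x01000000)))
print(hex(subsat4x8(0x05000010, 0x06000001)))
```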
10 Gigabyte Streaming I/O
The Imagine 3 core can stream data from memory or other processors at 10 GByte/sec. (compared to 0.48 GByte/sec. for the Imagine 1).
[Diagram: Imagine 3 Internal Data Processing Core with]
- Dataflow Ring input and Dataflow Ring output
- VECTOR UNITS: simultaneous input and output to and from memory
- DATA CACHE or 3D GRAPHICS / VOLUME pipelines: INPUT AND OUTPUT
Non-aligned SIMD
SIMD processing made simple with non-aligned memory accesses
(No complex time-consuming shift-mask-merge operations needed)
[Diagram: four 8 bit values taken from any byte position across consecutive 32 bit memory words and delivered as one 32 bit word]
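For comparison, this is the shift-mask-merge sequence that software has to perform when a processor can only load aligned 32 bit words. The sketch assumes little-endian byte order within a word; the function name and the list-of-words memory model are illustrative only.

```python
def read_u32_unaligned(words, byte_addr):
    """Read 32 bits at any byte address from a list of aligned 32 bit words.

    Assumes little-endian byte order (byte 0 is the lowest 8 bits of a word).
    """
    index, offset = divmod(byte_addr, 4)
    if offset == 0:
        return words[index]                                  # aligned: one load
    low = words[index] >> (8 * offset)                       # shift
    high = (words[index + 1] << (32 - 8 * offset)) & 0xFFFFFFFF  # shift and mask
    return low | high                                        # merge

memory = [0x44332211, 0x88776655]   # bytes 11 22 33 44 55 66 77 88
print(hex(read_u32_unaligned(memory, 0)))
print(hex(read_u32_unaligned(memory, 1)))
print(hex(read_u32_unaligned(memory, 3)))
```

With non-aligned accesses in hardware, the two loads and the shift, mask and merge steps collapse into a single vector read.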
Non Aligned Vector Accesses
2 input and 2 output vectors simultaneously
[Diagram: 32 bit words, 2 x 16 bit words, 16 bit words, 4 x 8 bit words, 8 bit words, 2 x 8 bit words]
Vector I/O
Memory Vector Accesses
Vector Access Units: up to 32 vectors in flight
[Diagram: Vector Access Units between the External Memory Interface and the Imagine 3 Processor Core]
Input side (two units), each with:
- 2 kB Vector pre-fetch buffer
- Vector pipeline: data/color input conversion, 2D restructuring
Output side (two units), each with:
- Vector pipeline: data/color output conversion, 2D restructuring
- Mask Unit: 256 pixels / voxels
- 2.25 kB Vector write buffer
1, 2 and 3D memory management
1 M Byte PAGE
TILE (X x Y):
- 8 bit pixel TILE: 1024 x 1024
- 16 bit pixel TILE: 512 x 1024
- 32 bit pixel TILE: 256 x 1024
BRICK (X x Y x Z):
- 8 bit voxel BRICK: 256 x 128 x 128
- 16 bit voxel BRICK: 128 x 128 x 128
- 32 bit voxel BRICK: 64 x 128 x 128
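The three tile formats all have rows of 1024 bytes and fill exactly one 1 MByte page, which a short calculation confirms. The offset function below additionally assumes a simple row-major layout inside the page; the slides do not describe the actual address mapping, so that layout and the names used are hypothetical.

```python
PAGE_BYTES = 1024 * 1024

# (width in pixels, height in pixels, bytes per pixel) for each tile format.
TILES = {8: (1024, 1024, 1), 16: (512, 1024, 2), 32: (256, 1024, 4)}

def tile_bytes(bits_per_pixel):
    """Total size of one tile in bytes."""
    width, height, size = TILES[bits_per_pixel]
    return width * height * size

def tile_offset(x, y, bits_per_pixel):
    """Byte offset of pixel (x, y) inside its page, assuming row-major order."""
    width, height, size = TILES[bits_per_pixel]
    if not (0 <= x < width and 0 <= y < height):
        raise ValueError("pixel outside the tile")
    return y * width * size + x * size

for bits in TILES:
    print(bits, "bit tile:", tile_bytes(bits), "bytes")
print(tile_offset(511, 1023, 16))
```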
3D texture/volume Hardware
Very High Quality
220 Billion operations/sec: 2 x 440 operations per cycle (4 ns)
Texture Quality: BI linear, TRI Linear and QUAD interpolation.
Texture Types: 32 bit ARGB, 16 bit (4 types), 8,4,2 and 1 bit pseudo color, 16 bit and 32 bit greyscale (signed and unsigned), 2x16 bit complex
Texture Size: 16,384 x 16,384 max (2D), 2048 x 2048 x 2048 max (3D)
Texture Dimension: 1, 2 and 3 dimensional textures.
Texture Clamping: Clamp and Wrap for all 3 co-ordinates.
Texture Border: 0 or 1 pixels texture borders, Border Color supported.
Texture MIP maps up to 16 levels: selection made for each individual pixel.
Perspective division for all 9 parameters: S, T, R, Alpha, Red, Green, Blue, Fog, Z
Perspective Correct Texture Mapping,
Perspective Correct Texture Lighting,
Perspective Correct Linear and Exponential (2 types) Fog,
Perspective Correct Depth Buffering
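Perspective division per parameter is what makes an interpolated value perspective correct: the parameter divided by w and 1/w are interpolated linearly in screen space, and one division per pixel recovers the parameter. The sketch below shows the textbook formulation for a single parameter along a span; it is not taken from the slides, and the function name and arguments are illustrative.

```python
def perspective_interp(v0, v1, w0, w1, t):
    """Perspective correct interpolation of one parameter between two vertices.

    v0, v1: parameter values at the vertices (for example S or Red)
    w0, w1: homogeneous w at the vertices
    t:      screen-space position between the vertices, 0.0..1.0
    """
    numerator = (1 - t) * (v0 / w0) + t * (v1 / w1)    # linear in screen space
    denominator = (1 - t) * (1 / w0) + t * (1 / w1)    # linear in screen space
    return numerator / denominator                      # the perspective division

print(perspective_interp(0.0, 10.0, 2.0, 2.0, 0.5))   # equal depth: plain linear result
print(perspective_interp(0.0, 10.0, 1.0, 3.0, 0.5))   # far vertex contributes less
```

Doing this for all nine parameters means nine such divisions per pixel, or one reciprocal followed by nine multiplies.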
3D graphics Pipelines
[Diagram: 3D graphics pipeline stages]
3D graphics pipeline control unit, Perspective MIP map processing pipeline:
- Bresenham Edge Start Interpolators (Q,R,S,T,Z-1) (F,A,R,G,B)
- Vector Start Interpolators (Q,R,S,T,Z-1) (F,A,R,G,B)
- Pixel Value Interpolators (Q,R,S,T,Z-1) (F,A,R,G,B)
- Perspective 3D correct Lighting
- Perspective 3D co-ordinate Generator: 5 stages
- Perspective Lighting & Fog Coefficients
- Memory Access Internal Delay Line for Interpolation, Lighting & Fog Coefficients: 3 - 17 stages
- Perspective Interpolation Coefficients: 5 stages
- Perspective MIP Map Addresses Calculations: 2 stages
- External Memory with MIP Map Textures: 4 - 6 stages
- Memory Access Re-order buffers
Texel Interp./Lighting control unit:
- Memory Access Input Fifo / Port Select
- Memory Access Data Load unit
- Texel Selection / Expansion
- Texel Color Look Up
- Texel Interpolation / Lighting coefficients generator
- Texel Interpolation / Lighting Multiply stage
- Texel Interpolation / Lighting Summation stage
- D BUS
3D texture/volume Hardware
3D graphics Pipeline + Core stream performance
(from external memory to external memory)
Direct Draw functions: (numbers in result pixels/s)
Bilinear Image Scale:
333 Mega pixels/s (32 bit gray scale or 32 bit color pixels )
Bilinear Image Rotate:
333 Mega pixels/s (32 bit gray scale or 32 bit color pixels )
Bilinear Affine Transform:
333 Mega pixels/s (32 bit gray scale or 32 bit color pixels )
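Scale, rotate and affine transform all reduce to the same per-pixel step: map the result pixel back to a fractional source position and interpolate bilinearly between the four surrounding source pixels. A minimal sketch of that step follows, for a single-channel image stored as a list of rows; edge handling by clamping and the function name are assumptions.

```python
def bilinear(image, x, y):
    """Sample a single-channel image at fractional position (x, y)."""
    height, width = len(image), len(image[0])
    x0, y0 = int(x), int(y)
    x1 = min(x0 + 1, width - 1)     # clamp at the right edge
    y1 = min(y0 + 1, height - 1)    # clamp at the bottom edge
    fx, fy = x - x0, y - y0
    top = image[y0][x0] * (1 - fx) + image[y0][x1] * fx
    bottom = image[y1][x0] * (1 - fx) + image[y1][x1] * fx
    return top * (1 - fy) + bottom * fy

image = [[0, 10],
         [20, 30]]
print(bilinear(image, 0.5, 0.5))   # centre of the four pixels
print(bilinear(image, 1.0, 0.0))   # exactly on a source pixel
```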
MPEG functions: (numbers in result pixels/s)
Bilinear Scaling plus
kYUV to αRGB
333 Mega pixels/s (32 bit αRGB pixels)
3D functions: (numbers in result pixels/sec)
Z-buffered, Perspective Correct, Bilinear Interpolated Texture mapping with perspective
correct lighting and exponential fog (Texture size up to 16k x 16k), MIP-Mapping:
300 Mega pixels/sec. (32 bit αRGB pixels, 16 bit hi-color, 8 bit pseudo, 16 bit Z values)
Fan Beam Back projection
The 3D Texture/Volume pipelines and the Multiplier / Accumulators in the Imagine 3 can handle eight 16 bit linear interpolated samples per cycle with 32 bit accuracy.
[Diagram: Back Projection Direction, Vector Direction]
Cone beam reconstruction
The Back projection in cone beam systems requires the inverse perspective mapping from filtered images back to a 3D volume. The Imagine 3 performs this directly with its 3D volume pipelines.
De-blur filtering
FIR filter performance (16 bit input, 32 bit calculations)
128 Tap: 32 Mega-pixels / second
256 Tap: 16 Mega-pixels / second
512 Tap: 8 Mega-pixels / second
Filtered Backprojection for Medical Imaging
324 x 512 to 256 x 256 (324 projections of 512 values, 256 x 256 result image)
De-blur filtering 10 ms (256 taps)
Backprojection 11 ms
Reconstruction 21 ms
Filtered Backprojection for Medical Imaging
840 x 928 to 512 x 512 (840 projections of 928 values, 512 x 512 result image)
De-blur filtering 100 ms (512 taps)
Backprojection 108 ms
Reconstruction 208 ms
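The de-blur filtering times follow from the FIR throughput: every projection value is one filtered result, so the time is the number of values divided by the filter rate. The short check below reproduces the quoted times to within rounding; the function and variable names are made up for illustration.

```python
def filter_time_ms(projections, values_per_projection, mega_pixels_per_second):
    """Time to FIR-filter all projection values at the given throughput."""
    samples = projections * values_per_projection
    return samples / (mega_pixels_per_second * 1e6) * 1e3

small = filter_time_ms(324, 512, 16)   # 256 taps: 16 Mega-pixels / second
large = filter_time_ms(840, 928, 8)    # 512 taps: 8 Mega-pixels / second
print(f"324 x 512: {small:.1f} ms, 840 x 928: {large:.1f} ms")

# Taps times throughput is the same for all three filter lengths.
tap_rate = {taps: taps * rate for taps, rate in [(128, 32), (256, 16), (512, 8)]}
print(tap_rate)
```

The constant product of 4096 million tap operations per second is close to the 16 operations per cycle listed for the 16x32 bit FIR filter at the 4 ns cycle time.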
De-blur filtering (FFT)
Complex input Fast Fourier Transform performance (vectorized)
               32 bit Floating Point    32 bit Integer    16 bit Integer
256 Point:     8 μs                     4 μs              2.0 μs
512 Point:     18 μs                    9 μs              4.4 μs
1024 Point:    40 μs                    20 μs             10 μs
2048 Point:    88 μs                    44 μs             22 μs
4096 Point:    192 μs                   96 μs             48 μs
8192 Point:    436 μs                   218 μs            109 μs
16384 Point:   896 μs                   448 μs            224 μs
Filtered Back-projection for Medical Imaging
1200 x 960 to 512 x 512 (1200 projections of 960 values, 512 x 512 result image)
FFT filtering 106 ms (2048 point FP)
Back-projection 157 ms
Reconstruction 263 ms
Radar Display Processing
Cartesian to Polar conversion with bi-linear interpolation 32 bit colors:
250 Mega-pixels /second
Motion Estimators
Motion Estimation Unit for MPEG1…MPEG4 video encoding
100 Billion operations / second
- software controllable,
- arbitrary MxN kernel sizes up to 256 by 256
- arbitrary search space sizes up to 4096 by 4096 for HDTV and higher
- allows optimizing algorithms (reduced search space)
- forward and backward prediction
- vector processing co-operation with core for bi-cubic pixel interpolation / rotation
Performance:
Compare a 16x16 pixel block with any other 16x16 pixel block
(half, quarter, 1/8th, 1/16th pixels with bi-cubic interpolation)
120 Million Block Compares / second
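A block compare can be pictured as follows. The slides do not name the comparison metric; the sum of absolute differences (SAD) used here is a common choice in video encoders and is an assumption, as are the function name and the list-of-rows frame layout.

```python
def block_sad(frame_a, ax, ay, frame_b, bx, by, size=16):
    """Sum of absolute differences between two size x size pixel blocks.

    (ax, ay) and (bx, by) are the top-left corners of the blocks.
    """
    total = 0
    for row in range(size):
        for col in range(size):
            total += abs(frame_a[ay + row][ax + col] - frame_b[by + row][bx + col])
    return total

# Synthetic 32 x 32 frame whose pixel value grows by 1 per column and 2 per row.
frame = [[x + 2 * y for x in range(32)] for y in range(32)]

print(block_sad(frame, 0, 0, frame, 0, 0))   # identical blocks
print(block_sad(frame, 0, 0, frame, 1, 0))   # shifted one pixel right
print(block_sad(frame, 0, 0, frame, 0, 1))   # shifted one pixel down
```

A motion search evaluates this cost for many candidate positions and keeps the position with the lowest value. One 16x16 compare touches 256 pixel pairs.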
Graphics Mask Generators
Generates Transparent and Opaque Masks for 512 pixels
multiple units work in parallel:
Window Mask Generator
Automatically clips pixels outside the View Port (scissoring)
Span line Mask Generator
for Concave Polygons and arbitrary Objects
Range Mask generator
for Depth Buffer Tests, Stencil Buffer Tests, Alpha Test, Chroma Keying Tests et cetera
Complex Mask Generator
for Concave and Complex Polygons according to the odd/even
or winding rules
Alpha Mask Generator
For objects with partially covered pixels
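How masks from several generators might combine can be sketched in a few lines: each generator yields one bit per pixel of a span, and a pixel is written only if every mask allows it. The AND combination, the function names and the use of Python lists as masks are assumptions made for illustration; the slides do not specify how the units are combined.

```python
def window_mask(x_start, y, length, x_min, x_max, y_min, y_max):
    """One bit per pixel of a horizontal span: True if inside the window."""
    inside_y = y_min <= y <= y_max
    return [inside_y and x_min <= x <= x_max
            for x in range(x_start, x_start + length)]

def range_mask(values, references):
    """Depth-test style mask: True where the new value is nearer (smaller)."""
    return [value < reference for value, reference in zip(values, references)]

def combine(*masks):
    """A pixel passes only if every mask allows it."""
    return [all(bits) for bits in zip(*masks)]

window = window_mask(x_start=0, y=5, length=8, x_min=2, x_max=5, y_min=0, y_max=10)
depth = range_mask([5, 5, 5, 5, 9, 9, 9, 9], [7] * 8)
print(window)
print(depth)
print(combine(window, depth))
```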
Graphics Mask Generators
The Window is defined by the Window registers: Window X min / max, Window Y min / max
The Spanline registers define the outlines of the triangle: Spanline Address, Spanline Length (-1), Spanline Delta Start, Spanline Delta End, Spanline 0..3 Start / End, Spanline Y min / max
Range mask 0..3: the Range Mask contains the result of the Depth buffer test (overlapping triangle)
Complex mask 0..3: the Complex Mask is used in this example to hold the Polygon Stipple pattern
[Diagram: triangle inside the window, partly covered by an overlapping triangle]
Multi media I/O units
Video Output
(Α), R, G, B outputs with 330 MHz dot clock for 1800 x 1400 screen format at 90 Hz.
12 (16) bit video out for Studio Quality video processing. Interface to DVI-TFT
transmitters for high resolution, high quality LCD displays.
Video Input
CCIR 656: 8 bit digital video input for NTSC, PAL, SECAM, HDTV and custom formats
Audio Codec 97 Interface
Standard from Intel, Creative Labs, Yamaha, Analog Devices and Nat.Semiconductor
Supports Analog speakers, Microphone, Headphone + Headphone micro, Telephony and
Modem signals, CD analog audio in, Analog Video Sound In, PC beep in, et cetera
Digital Audio: 4 stereo serial I/O ports (I2S type and S type emulation capabilities)
Supports CD , DVD and Dolby AC3 input or output
External Device Control
8 bit classic μP interface bus and I2C type emulation capability
MIDI interface (Input and output for synthesizers and keyboards)
Real Time Support
MULTI MEDIA REAL TIME SUPPORT
Level 1 Events (1 micro second response time requirement)
Horizontal Sync interrupts, Video I/O interrupts, Register Virtualization interrupts.
Level 2 Events(2 - 100 micro second response time requirement)
Communication Fifo interrupts, Mailbox Interrupts, I2S Fifo Interrupts, Ac97 Fifo Interrupts
Midi Interrupt, I2C interrupt, Vertical Sync Interrupts, Scheduler Clock Tick, et cetera
Threads ( 100 micro - 10 millisecond response time requirement)
Host Command Queues Manager
Audio Stream managers
Modem Stream managers
User definable threads
High-end Board
8 Processors: 3.2 Tera operations/s 4 GigaByte memory
High-end Board
8 Imagine 3 processors, 3200 Billion operations per second
32 GigaByte per second Memory Bandwidth
16 GigaByte per second Inter-Processor Bandwidth
- Perspective Volume Rendering: 1000 x 1000 x 1000 at 15 frames/second
(based on 25% volume traversal)
- Cone Beam Reconstruction: 512 x 512 x 512 from 1000² x 128 in 4 seconds
- Real Time 3D ultra sound reconstruction and visualization
- Real Time HDTV MPEG 4 video encoding
- Advanced Radar Processing
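The board-level totals are consistent with adding up the per-chip figures from the Building Blocks slide. The check below shows that arithmetic; whether the slides derived the totals exactly this way is an assumption, and the variable names are illustrative.

```python
# Per-chip figures from the Building Blocks slide, in Billion operations / second.
core_gops = 80
pipeline_gops = 220
motion_estimator_gops = 100
ring_gbytes_per_second = 2.0   # Dataflow Ring, per processor

processors = 8

per_chip_gops = core_gops + pipeline_gops + motion_estimator_gops
board_gops = processors * per_chip_gops
board_ring_gbytes = processors * ring_gbytes_per_second

print(f"per chip: {per_chip_gops} Billion operations/s")
print(f"board: {board_gops} Billion operations/s")
print(f"inter-processor: {board_ring_gbytes:.0f} GigaByte/s")
```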
High Speed Dataflow Ring
Up to 2 Gigabyte per second Dataflow Ring (SSTL-2)
Point-to-point with Broadcast options and auto configuration
High Speed System I/O
The Dataflow Ring also provides very high speed System I/O. Entry level systems can use the programmable Video Data I/O for general purpose I/O (160 MB/s in and 1 GB/s out per processor).
[Diagram: ring of eight Imagine 3 processors with an optional System I/O FPGA, e.g. Xilinx Virtex II]
- Video In: 160 MB/s
- Video out: 1 GB/s
- Dataflow input: up to 2.0 GB/s
- Dataflow output: up to 2.0 GB/s
Pipeline Processing
The Dataflow Ring allows long vector processing pipelines over multiple processors. Here is an example with just 2 processors.
[Diagram: two processors linked by the Dataflow Ring. Blocks shown: Vector Read from memory, Vector Write to memory, MAC as FIR filter, 256 entry vector register, ALU, Bi linear Interpolated Data from the Graphics pipeline, MAC as 3D blend unit.]
128 bit memory bus (reads)
Read clients on the 128 bit bus:
- 16 kbyte 1st Level data cache
- 16 kbyte 1st Level instruction cache
- Dual 3D-graphics pipelines
- Dual 128 word x 128 bit Vector input fifos
- PCI/AGP Memory Read access
- Video Output 128 word x 128 bit fifo
4.2 Gigabyte / second Memory Bus: 128 bit PC2100
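The 4.2 Gigabyte per second figure is the peak rate of a 128 bit wide PC2100 bus. PC2100 is DDR-266 memory, with a 133 MHz clock and two transfers per clock; the calculation below spells that out. The constant names are illustrative.

```python
BUS_WIDTH_BITS = 128
DDR_CLOCK_HZ = 133_333_333    # PC2100 (DDR-266) memory clock
TRANSFERS_PER_CLOCK = 2       # double data rate

bytes_per_transfer = BUS_WIDTH_BITS // 8
peak_bytes_per_second = bytes_per_transfer * DDR_CLOCK_HZ * TRANSFERS_PER_CLOCK
peak_gbytes_per_second = peak_bytes_per_second / 1e9

print(f"{bytes_per_transfer} bytes per transfer")
print(f"peak: {peak_gbytes_per_second:.2f} Gigabyte/s")
```

The nominal PC2100 rating of 2.1 Gigabyte per second applies to a 64 bit module, so a 128 bit bus doubles it.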
128 bit memory bus (writes)
Write clients on the 128 bit bus:
- 16 kbyte 1st level data cache
- PCI/AGP Memory Write access
- Dual 128 word x 128 bit Vector output fifos
- 16 word x 128 bit write buffer
4.2 Gigabyte / second Memory Bus (128 bit PC2100)
8-fold address interleaved memory reads and writes. Out of order accesses with coherency checking.
END