Introducing: GenTera’s IMAGINE 3
HANS DE VRIES

Building Blocks

- Imagine 3 Core Processor: Multi-Stream (32) Scalar / Vector Processor, 80 Billion operations / second
- Advanced High Quality 3D Graphics / Volume processing Pipelines: 220 Billion operations / second
- Motion Estimator: 100 Billion op/s
- Graphics Mask Generator
- PCI/AGP Bus interface: 0.5 Gigabyte/s
- 128 bit DDR SDRAM Bus: 4.2 Gigabyte/s
- Data (Video) Input: 160 Megabyte/s
- Data (Video) Output: 1.0 Gigabyte/s
- Data flow Ring Input: 2.0 Gigabyte/s
- Data flow Ring Output: 2.0 Gigabyte/s

Core Processor

HISC™ processor architecture

- 120 General Purpose registers (2x32 bit)
- 256 Vector registers (2x32 bit)
- 256x4 MAC Vector registers (2x32 bit)
- 128 Special Purpose control registers (2x32 bit), 1200 control table registers (2x32 bit)
- 80 Billion operations per second (320 operations per cycle), including 64 Multiply Accumulates per cycle with saturate
- 10 Giga Byte per second streaming I/O (memory & processor I/O)
- 40 Conditional operations per cycle
- 24 internal addresses per cycle
- 32 simultaneous concatenated vector streams (32 bit) (128 in byte mode)
- Single cycle 2D and 3D addressing modes (1D, 2D and 3D memory management)

Development tools and libraries:

- C and C++ compiler, Assembler, Linker, Debugger
- Visual Simulator
- Soft In circuit Emulator
- Image Processing Library
- 3D graphics Library
- Multi Media Library
- Machine Vision Library

HISC Processor Architecture

HISC: Hierarchical Instruction Set Computer

RISC LEVEL: provides C and C++ compatibility.

VLIW LEVEL: A moderate length VLIW instruction word plus a fully programmable bus interconnect directly controlled by the instruction code.

VARIABLE LENGTH VECTOR PROCESSING: Enables up to 32 simultaneous and concatenated Vector Processing Streams. Word based Vector Processing (32, 2x16, 4x8) is symmetrically applied throughout the entire architecture.

EXTENDED VECTOR PROCESSING: Numerous function-specific Control Registers add extended functionality that is activated by the group of extended operations (as opposed to the basic operations). This increases the effective instruction word for vector operations to 1000+ bits.

Core Processor

Examples of Basic Processor Stream performance (from external memory to external memory)

Standard GUI functions: Screen to Screen Copy, 3 operand ROPS, Bitmap to Color expansion. Rates in slide order:

  2000 Mega pixels/s   8 bit pixels
   500 Mega pixels/s   32 bit pixels
  1000 Mega pixels/s   8 bit pixels
  2000 Mega pixels/s   8 bit pixels

Windows Direct Draw GUI functions:

  Pseudo to True Color   500 Mega pixels/s   8 bit pseudo to 16 bit or 32 bit colors
  True Color to Pseudo   500 Mega pixels/s   32,16 bit color to 8 bit pseudo color
  Z buffer aware copy    666 Mega pixels/s   8 bit pixels, 16 bit Z buffer
                         500 Mega pixels/s   16 bit pixels, 16 bit Z buffer
  Alpha Blended Copy     250 Mega pixels/s   32 bit ARGB pixels

Core Processor

Examples of Core Processor stream performance (2) (from external memory to external memory)

Multi Media Functions (numbers in result pixels/s):

  YUV to RGB conversion        500 Mega pixels/s   (32 bit color, 16 bit hi-color, 8 bit pseudo)
  DCT and IDCT (8x8 blocks)    167 Mega pixels/s   (16 bit values, 32 bit calculations)
  DCT and IDCT (8x8 blocks)    667 Mega pixels/s   (8 bit values, 16 bit calculations)

Photoshop-type Image Processing Functions (numbers in result pixels/s):

  3x3 kernel convolution   2000 Mega pixels/s   (8 bit pixels, 16 bit calculations)
  7x7 kernel convolution    500 Mega pixels/s   (8 bit pixels, 16 bit calculations)
  Bi-cubic Rotation        1000 Mega pixels/s   (8 bit pixels, 16 bit calculations)
  Bi-cubic Scaling         1000 Mega pixels/s   (8 bit pixels, 16 bit calculations)

3D graphics Geometry: (4x4) homogeneous transformations plus perspective divides for X, Y and Z for meshed triangles in 32 bit floating point (IEEE): 50 Million triangles/s
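
The headline rates can be cross-checked against the per-cycle figures. The sketch below is only an illustration: it assumes that the 4 ns cycle quoted for the 3D texture/volume hardware also applies to the core processor, which the slides do not state explicitly, and the function name is made up for the example.

```python
# Cycle time quoted in the deck for the 3D texture/volume hardware: 4 ns.
# Assumption: the core processor runs on the same 4 ns cycle (250 MHz).
CYCLE_NS = 4
CLOCK_HZ = 1_000_000_000 // CYCLE_NS  # 250_000_000


def ops_per_second(ops_per_cycle: int, clock_hz: int = CLOCK_HZ) -> int:
    """Peak operations per second for a unit issuing ops_per_cycle every cycle."""
    return ops_per_cycle * clock_hz


core_ops = ops_per_second(320)          # core processor, 320 operations per cycle
pipeline_ops = ops_per_second(2 * 440)  # two 3D pipelines, 440 operations per cycle each
core_macs = ops_per_second(64)          # multiply accumulates only

print(core_ops // 10**9, "billion operations/s (core)")
print(pipeline_ops // 10**9, "billion operations/s (3D pipelines)")
print(core_macs // 10**9, "billion multiply accumulates/s (core)")
```

Under that assumption the results agree with the quoted 80 and 220 billion operations per second.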

Core Processor

[Figure: core data path. Data read ports A0, A1, B0 and B1 (labelled REG A0 / VIO 0, REG A1 / VIO 1, REG B0 / DIO 0, REG B1 / DIO 1) feed an interconnect with 100 % connectivity. The interconnect drives the data processing units (MAC and ALU for each of X0, X1, Y0 and Y1) and the data write ports (VIO WR, DIO WR, REG WR0, REG WR1).]

Core Processor: Control Register Busses

[Figure: control register bus 1 carries bits [63:32], control register bus 0 carries bits [31:0]. Units shown on the busses: I3D0/1, MES0/1, RING0/1, REG, ALU and MAC (X0, Y0, X1, Y1), VIO 0/1, DIO, MSK0/1, VAU 0/1, SEQ, MTAB, EMI and the bus interconnect (A0/1, B0/1).]

Instruction Word

Highly orthogonal VLIW instruction word: Data Processing Functions

  Bit boundaries 63, 59, 48, 36, 24, 12, 0:        ND0 (= 0), Dd, Wr0, B0, A0, Y0, X0
  Bit boundaries 127, 123, 112, 100, 88, 76, 64:   Da, Wr1, B1, A1, Y1, X1

Interconnect

[Figure: each data processing unit has two select paths (path 1 and path 2), and each data write port has one select path; every path selects among A0, A1, B0, B1, X0, X1, Y0 and Y1.]

The Instruction Word provides 8-way Interconnectivity in Scalar Processing Mode.

Interconnect

[Figure: the same select paths with the full source set: A0 REG, A0 MEM, B0 REG, B0 MEM, X0 ALU, X0 MAC, Y0 ALU, Y0 MAC, A1 REG, A1 MEM, B1 REG, B1 MEM, X1 ALU, X1 MAC, Y1 ALU, Y1 MAC.]

The Instruction Word provides 100% Interconnectivity in Vector Processing Mode.

Instruction Word: Data processing instruction fields

[Figure: layout of the X0 and Y0 fields against bit ticks 0, 4, 8, 12, 16, 20, 24. Three encodings are shown for each field, all carrying path 1 and path 2 selects: leading bit 1 for MAC, leading bits 0 0 for ALU, leading bits 0 1 for Shift and Ufu (unary functions).]

Instruction Word: Data read ports instruction fields

[Figure: encodings of the A0 and B0 read port fields against bit ticks 24, 28, 32, 36, 40, 44, 48. Variants shown: memory port with memory port size (VIO function for A0, DIO read size for B0), register port with Be31 / Be20 and register size, control register with size, 16 bit immediate ([15:8] and [7:0]) and 11 bit signed immediate.]

Instruction Word: DIO address / data and (control-) register write ports fields

[Figure: encodings against bit ticks 48, 52, 56, 58, 59, 62, 63 (Wr0, ND) and 123, 127. Labels shown: ND, DIO address, DIO rd/wr, DIO address select (size, rd addr x, rd addr), DIO data select (size, wr addr x, wr data), Non data-processing function, register port, register path, control register path.]

Parallel Conditional Processing

64 bit Uniform Status Register

  Bits      Units     Contents
  [63:56]   X1 Y1     Status for Byte 7
  [55:48]   X1 Y1     Status for Byte 6
  [47:40]   X1 Y1     Status for Byte 5
  [39:32]   X1 Y1     Status for Byte 4
  [31:24]   X0 Y0     Status for Byte 3
  [23:16]   X0 Y0     Status for Byte 2
  [15:8]    X0 Y0     Status for Byte 1
  [7:0]     X0 Y0     Status for Byte 0

ALU Status (ALU, Shifts, Unary functions): Overflow, Carry, Minus, Zero (S0 C0 M0 Z0)
MAC Status: Wrong, Lower, Higher, Inside (W0 L0 H0 I0)

Parallel Conditional Processing

Status: Generation, Collection and Application

[Figure: status bytes 0 to 3 are generated by the X0 and Y0 ALUs and MACs, and status bytes 4 to 7 by the X1 and Y1 ALUs and MACs. They are collected and applied at the A0/B0/V0 and A1/B1/V1 ports, the vector registers (VEC. REG.), the mask units (MSK) and the vector access units (VAU).]

Register File

GENERAL PURPOSE REGISTERS, VECTOR REGISTERS

ADDRESSES

- 256 vector registers, addressed by Vector Index generators: 8 x Write Indices (Write Port C), 8 x Read A Indices (Read Port A), 8 x Read B Indices (Read Port B)
- 120 general registers, addressed by General Register Addresses from the Instruction Code: 2 x Write Address, 2 x Read A Address, 2 x Read B Address
- up to 24 independent and conditional byte addresses
- up to 8 independent and conditional byte write enables

DATA PORTS (2 x 32 bit / 4 x 16 bit / 8 x 8 bit)

- Write Data 2,4,8 x (2 x 32 bit wide, 4 x 16 bit wide, 8 x 8 bit wide): Write Port C Input BUS select
- Read A Data 2,4,8 x: Read Port A output BUS register
- Read B Data 2,4,8 x: Read Port B output BUS register
- A1, A0, B1, B0 connect to the INTERNAL BUS MATRIX

Function Units

ALU: Arithmetic, Boolean, Shift / Rotate, Unary Functions

MULTIPLIER / MAC:

- (un)signed x (un)signed
- binary point at: end, middle or top
- graphics formats ( 0.0..1.0 == 00..ff )
- operand formats on both inputs: 4 x 8, 2 x 16, 1 x 32, 32 bit float
- Vector Registers: 256 words x 64 bit
- ACCUMULATOR
- Variable Range Clamp

Multiplier / Accumulator

8 bit Matrix functions:

Quad Inproduct (16 multiplies & 12 adds per MAC): 32 bit input data into a 4 tap shift register (4 times for each byte).

Matrixvec (16 multiplies & 12 adds per MAC): 32 bit input data distributed to all four columns (4 times for 4 bytes).

[Figure: for both functions, a 4 x 4 array of multipliers with 8 bit inputs and 16 bit products, summed into four 16 bit results.]
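
The operation counts quoted for Matrixvec follow directly from the shape of a 4x4 matrix times vector product. The sketch below is a plain software illustration, not a model of the hardware: the function name and the explicit counters are made up for the example, and no clamping or saturation is modelled.

```python
def matxvec_4x4(matrix, vector):
    """Multiply a 4x4 matrix by a 4-element vector, counting the operations.

    Returns (result, multiplies, adds). Inputs are assumed to be unsigned
    8 bit values, so each individual product fits in 16 bits.
    """
    multiplies = 0
    adds = 0
    result = []
    for row in matrix:
        acc = row[0] * vector[0]  # first product of the row, no add needed
        multiplies += 1
        for coef, x in zip(row[1:], vector[1:]):
            acc += coef * x       # one multiply and one add per remaining term
            multiplies += 1
            adds += 1
        result.append(acc)
    return result, multiplies, adds


identity = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]]
out, n_mul, n_add = matxvec_4x4(identity, [10, 20, 30, 40])
print(out, n_mul, n_add)
```

Four rows of four products give 16 multiplies, and each row needs three additions, giving 12 adds. With four Multiplier / Accumulators this corresponds to the 64 multiply accumulates per cycle quoted for the core.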

Multiplier / Accumulator

8 bit Matrix functions:

Open GL Blend Function (8 multiplies & 4 adds per MAC): 32 bit input data into a 4 tap shift register (4 times for each byte), for each of the two input operands.

[Figure: two groups of four multipliers with 8 bit inputs and 16 bit products, summed into 16 bit results.]

Coefficients fixed or derived from the input operands:

   0  BLEND_CONSTANT
   1  BLEND_ZERO
   2  BLEND_ONE
   3  SRC_COLOR
   4  INV_SRC_COLOR
   5  SRC_ALPHA
   6  INV_SRC_ALPHA
   7  DST_ALPHA
   8  INV_DST_ALPHA
   9  DST_COLOR
  10  INV_DST_COLOR
  11  SRC_ALPHA_SATURATE
  12  BOTH_SRC_ALPHA (source), BOTH_SRC_ALPHA (dest)
  13  BOTH_INV_SRC_ALPHA (source), BOTH_INV_SRC_ALPHA (dest)
  14  MAX_INTENSITY (source), MAX_INTENSITY (dest)
  15  MIN_INTENSITY (source), MIN_INTENSITY (dest)

Multiplier / Accumulator

16 bit Matrix functions:

Convolute (4 multiplies & 2 adds per Multiplier): 32 bit input data into a 2 tap shift register (2 times for each 16 bit word).

Transform (4 multiplies & 2 adds per Multiplier): 32 bit input data distributed to both columns (2 times for each 16 bit word).

[Figure: for both functions, a 2 x 2 array of multipliers with 16 bit inputs and 32 bit products, summed into two 32 bit results.]

Mix:
  MH [63:32] = Coef 10[31:0] . Mb [31:16] + Coef 11[31:0] . Ma [31:16]
  ML [31:0]  = Coef 00[31:0] . Mb [15:0]  + Coef 01[31:0] . Ma [15:0]

Merge:
  MH [63:32] = Coef 10[31:0] . Ma [31:16] + Coef 11[31:0] . Ma [15:0]
  ML [31:0]  = Coef 00[31:0] . Mb [31:16] + Coef 01[31:0] . Mb [15:0]

Multiplier / Accumulator

Each of the 4 Multiplier/Accumulators handles all operations by utilizing the same hardware!

Imagine 3 operations per cycle:

  64   8x16 bit    quad in-product (4 comp.)
  64   8x16 bit    4x4 matrix x vector
  32   8x16 bit    Open GL blending functions
  16   16x16 bit   in-product, cross-product
  16   16x16 bit   complex product
  16   16x32 bit   FIR filter
  16   16x32 bit   in-product, cross-product
  16   16x32 bit   complex product

[Figure: one Multiplier/Accumulator shown at three granularities: a single unit (32 x 32 bit extern, 32 x 32 bit intern, 64 bit accumulate, 32 x 32 bit floating point), four units (16 x 16 bit extern, 16 x 32 bit intern, 32 bit accumulate) and sixteen units (8 x 8 extern, 8 x 16 intern).]
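
The Mix and Merge formulas of the 16 bit matrix functions can be written out in software to make the operand routing explicit. This is a minimal sketch under stated assumptions: the operands are treated as unsigned, the results are plain Python integers with no accumulator width or clamping modelled, and the function names `halves`, `mix` and `merge` are illustrative rather than part of any GenTera library.

```python
def halves(word):
    """Split a 32 bit word into its (high, low) unsigned 16 bit halves."""
    return (word >> 16) & 0xFFFF, word & 0xFFFF


def mix(ma, mb, c00, c01, c10, c11):
    """Mix: each result combines the same half of Mb and Ma."""
    ma_hi, ma_lo = halves(ma)
    mb_hi, mb_lo = halves(mb)
    mh = c10 * mb_hi + c11 * ma_hi  # MH = Coef10 . Mb[31:16] + Coef11 . Ma[31:16]
    ml = c00 * mb_lo + c01 * ma_lo  # ML = Coef00 . Mb[15:0]  + Coef01 . Ma[15:0]
    return mh, ml


def merge(ma, mb, c00, c01, c10, c11):
    """Merge: each result combines both halves of a single operand."""
    ma_hi, ma_lo = halves(ma)
    mb_hi, mb_lo = halves(mb)
    mh = c10 * ma_hi + c11 * ma_lo  # MH = Coef10 . Ma[31:16] + Coef11 . Ma[15:0]
    ml = c00 * mb_hi + c01 * mb_lo  # ML = Coef00 . Mb[31:16] + Coef01 . Mb[15:0]
    return mh, ml


ma, mb = 0x00030004, 0x00010002  # Ma halves (3, 4), Mb halves (1, 2)
print(mix(ma, mb, 10, 20, 30, 40))
print(merge(ma, mb, 10, 20, 30, 40))
```

The only difference between the two functions is which 16 bit halves are paired: Mix blends corresponding halves of the two operands, while Merge folds the two halves of one operand into a single result.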

Vector processing

Variable length vector processing made simple.

[Figure: pipeline timing diagram over cycles 1 to 35, with these stages in order:]

  genad(A0)
  B0=input
  A0=rd4x8(ri)
  X0=mult(A0,B0,nuu)
  genad(A1)
  A1=rd4x8(ri)
  Y0=subsat(X0,A1)
  B1=rd(RING_Data)
  X1=mult(Y0,B1,nus)
  DA=again
  D0=word4x8(uI)
  X0=addsat(X1,D0)
  Y0=matxvec(X0)
  Y1=inproduct(X0)
  X1=addsat(Y0,Y1)
  outputV1

ACTUAL ASSEMBLY CODE FOR THE EXAMPLE ABOVE:

    repeat, graph (label_1);;;
    label_1:
    genad(A0) => B0=input, A0=rd4x8(ri) => X0=mult(A,V,nuu ) ===>
    genad(A1) => A1=rd4x8(ri) => Y0=subsat(X0,A1), B1=rd4x8(RING_Data) => X1=mult(Y0,B1,nus) ===>
    DA=Again ==> D0=word4x8(uI), X0=addsat(X1,D0) => Y0=matxvec(X0), Y1=inproduct(X0) =====>
    X1=addsat(Y0,Y1) => outputV1;

10 Gigabyte Streaming I/O

The Imagine 3 core can stream data from memory or other processors at 10 GByte/sec. (Compared to 0.48 GByte/sec. for the Imagine 1.)

[Figure: the Internal Data Processing Core connected to the Dataflow Ring input, the Dataflow Ring output, the VECTOR UNITS (simultaneous input and output to and from memory) and the DATA CACHE or 3D GRAPHICS / VOLUME pipelines (INPUT AND OUTPUT).]

Non-aligned SIMD

SIMD processing made simple with non-aligned memory accesses. (No complex time-consuming shift-mask-merge operations needed.)

[Figure: a 32 bit word of four 8 bit elements read from a position that straddles the boundaries of consecutive 32 bit memory words.]

Non Aligned Vector Accesses

Supported element formats: 32 bit words, 2 x 16 bit words, 16 bit words, 4 x 8 bit words, 2 x 8 bit words, 8 bit words.

2 Input and 2 output vectors simultaneous.

Vector I/O

Memory Vector Accesses. Vector Access Units: up to 32 vectors in flight.

[Figure: vector data paths between the External Memory Interface and the Imagine 3 Processor Core. Two input paths, each with a 2 kB Vector pre-fetch buffer and a Vector pipeline performing data/color input conversion and 2D restructuring. Two output paths, each with a Vector pipeline performing data/color output conversion and 2D restructuring and a 2.25 kB Vector write buffer. Two Mask Units, each for 256 pixels / voxels.]

1, 2 and 3D memory management

1 M Byte PAGE

2D tiles (X x Y):

  1024 x 1024   8 bit pixel TILE
   512 x 1024   16 bit pixel TILE
   256 x 1024   32 bit pixel TILE

3D bricks (X x Y x Z):

  256 x 128 x 128   8 bit voxel BRICK
  128 x 128 x 128   16 bit voxel BRICK
   64 x 128 x 128   32 bit voxel BRICK

3D texture/volume Hardware

Very High Quality. 220 Billion operations/sec: 2 x 440 operations per cycle (4 ns).

- Texture Quality: Bi-linear, Tri-linear and Quad interpolation.
- Texture Types: 32 bit ARGB, 16 bit (4 types), 8, 4, 2 and 1 bit pseudo color, 16 bit and 32 bit greyscale (signed and unsigned), 2x16 bit complex.
- Texture Size: 16,384 x 16,384 max (2D), 2048 x 2048 x 2048 max (3D).
- Texture Dimension: 1, 2 and 3 dimensional textures.
- Texture Clamping: Clamp and Wrap for all 3 co-ordinates.
- Texture Border: 0 or 1 pixel texture borders, Border Color supported.
- Texture MIP maps: up to 16 levels, selection made for each individual pixel.
- Perspective division for all 9 parameters: S, T, R, Alpha, Red, Green, Blue, Fog, Z.
- Perspective Correct Texture Mapping, Perspective Correct Texture Lighting, Perspective Correct Linear and Exponential (2 types) Fog, Perspective Correct Depth Buffering.

3D graphics Pipelines

[Figure: block diagram of one 3D graphics pipeline. Blocks and stage counts as labelled on the slide:]

- 3D graphics pipeline control unit
- Perspective MIP map processing pipeline
- Bresenham Edge Start Interpolators: (F,A,R,G,B) and (Q,R,S,T,Z-1)
- Vector Start Interpolators: (F,A,R,G,B) and (Q,R,S,T,Z-1)
- Pixel Value Interpolators: (F,A,R,G,B) and (Q,R,S,T,Z-1)
- Perspective correct 3D Lighting
- Perspective 3D co-ordinate Generator: 5 stages
- Perspective Lighting & Fog Coefficients
- Internal Delay Line for Interpolation, Lighting & Fog Coefficients: 3 - 17 stages
- Perspective Interpolation Coefficients: 5 stages
- Perspective MIP Map Addresses Calculations: 2 stages
- External Memory with MIP Map Textures: 4 - 6 stages
- Memory Access: Re-order buffers, Data Load unit, Input Fifo / Port Select
- Texel Selection / Expansion
- Texel Color Look Up
- Texel Interpolation / Lighting: control unit, coefficients generator, Multiply stage, Summation stage
- D BUS

3D texture/volume Hardware

3D graphics Pipeline + Core stream performance (from external memory to external memory)

Direct Draw functions (numbers in result pixels/s):

  Bilinear Image Scale        333 Mega pixels/s   (32 bit gray scale or 32 bit color pixels)
  Bilinear Image Rotate       333 Mega pixels/s   (32 bit gray scale or 32 bit color pixels)
  Bilinear Affine Transform   333 Mega pixels/s   (32 bit gray scale or 32 bit color pixels)

MPEG functions (numbers in result pixels/s):

  Bilinear Scaling plus kYUV to αRGB   333 Mega pixels/s   (32 bit αRGB pixels)

3D functions (numbers in result pixels/s):

  Z-buffered, Perspective Correct, Bilinear Interpolated Texture mapping with perspective correct lighting and exponential fog (Texture size up to 16k x 16k), MIP-Mapping: 300 Mega pixels/s (32 bit αRGB pixels, 16 bit hi-color, 8 bit pseudo, 16 bit Z values)

Fan Beam Back projection

The 3D Texture/Volume pipelines and the Multiplier / Accumulators in the Imagine 3 can handle eight 16 bit linear interpolated samples per cycle with 32 bit accuracy.
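
The eight-samples-per-cycle figure can be turned into a rough back-projection time estimate. The sketch below rests on two assumptions that the slides do not spell out: every projection contributes one interpolated sample to every pixel of the result image, and the 4 ns cycle quoted for the 3D texture/volume hardware applies. The function name is made up for the example.

```python
SAMPLES_PER_CYCLE = 8        # from the fan beam slide
CLOCK_HZ = 250_000_000       # assumption: 4 ns cycle, as quoted for the 3D hardware


def backprojection_ms(projections: int, width: int, height: int) -> float:
    """Estimated back-projection time in milliseconds.

    Assumes one linear interpolated sample per result pixel per projection.
    """
    samples = projections * width * height
    samples_per_second = SAMPLES_PER_CYCLE * CLOCK_HZ
    return 1000.0 * samples / samples_per_second


# Cases taken from the medical imaging examples in this deck.
for projections, size in [(324, 256), (840, 512), (1200, 512)]:
    ms = backprojection_ms(projections, size, size)
    print(f"{projections} projections to {size} x {size}: {ms:.1f} ms")
```

Under these assumptions the estimates come out close to the back-projection times of 11 ms, 108 ms and 157 ms quoted in the filtered back-projection examples of this deck.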

[Figure: fan beam geometry showing the Back Projection Direction and the Vector Direction.]

Cone beam reconstruction

Back projection in cone beam systems requires an inverse perspective mapping from the filtered images back to a 3D volume. The Imagine 3 performs this directly with its 3D volume pipelines.

De-blur filtering

FIR filter performance (16 bit input, 32 bit calculations):

  128 Tap   32 Mega-pixels / second
  256 Tap   16 Mega-pixels / second
  512 Tap    8 Mega-pixels / second

Filtered Backprojection for Medical Imaging, 324 x 512 to 256 x 256 (324 projections of 512 values, 256 x 256 result image):

  De-blur filtering   10 ms (256 taps)
  Backprojection      11 ms
  Reconstruction      21 ms

Filtered Backprojection for Medical Imaging, 840 x 928 to 512 x 512 (840 projections of 928 values, 512 x 512 result image):

  De-blur filtering   100 ms (512 taps)
  Backprojection      108 ms
  Reconstruction      208 ms

De-blur filtering (FFT)

Complex input Fast Fourier Transform performance (vectorized):

  Size          32 bit Floating Point   32 bit Integer   16 bit Integer
    256 Point     8 μs                    4 μs             2.0 μs
    512 Point    18 μs                    9 μs             4.4 μs
   1024 Point    40 μs                   20 μs            10 μs
   2048 Point    88 μs                   44 μs            22 μs
   4096 Point   192 μs                   96 μs            48 μs
   8192 Point   436 μs                  218 μs           109 μs
  16384 Point   896 μs                  448 μs           224 μs

Filtered Back-projection for Medical Imaging, 1200 x 960 to 512 x 512 (1200 projections of 960 values, 512 x 512 result image):

  FFT filtering     106 ms (2048 point FP)
  Back-projection   157 ms
  Reconstruction    263 ms

Radar Display Processing

Cartesian to Polar conversion with bi-linear interpolation, 32 bit colors: 250 Mega-pixels / second

Motion Estimators

Motion Estimation Unit for MPEG1…MPEG4 video encoding: 100 Billion operations / second

- software controllable
- arbitrary MxN kernel sizes up to 256 by 256
- arbitrary search space sizes up to 4096 by 4096 for HDTV and higher
- allows optimizing algorithms (reduced search space)
- forward and backward prediction
- vector processing co-operation with the core for bi-cubic pixel interpolation / rotation

Performance: Compare a 16x16 pixel block with any other 16x16 pixel block (half, quarter, 1/8th, 1/16th pixels with bi-cubic interpolation): 120 Million Block Compares / second

Graphics Mask Generators

Generate Transparent and Opaque Masks for 512 pixels. Multiple units work in parallel:

- Window Mask Generator: automatically clips pixels outside the View Port (scissoring)
- Span line Mask Generator: for Concave Polygons and arbitrary Objects
- Range Mask Generator: for Depth Buffer Tests, Stencil Buffer Tests, Alpha Test, Chroma Keying Tests et cetera
- Complex Mask Generator: for Concave and Complex Polygons according to the odd/even or winding rules
- Alpha Mask Generator: for objects with partially covered pixels

Graphics Mask Generators

[Figure: example of a triangle overlapped by another triangle inside a window, with the registers involved.]

- Window X min / max, Window Y min / max: the Window is defined by the Window registers.
- Spanline Address, Spanline Delta Start, Spanline Delta End, Spanline Length (-1), Spanline Y min / max, Spanline 0 to 3 Start / End: the Spanline registers define the outlines of the triangle.
- Range mask 0 to 3: the Range Mask contains the result of the Depth buffer test (overlapping triangle).
- Complex mask 0 to 3: the Complex Mask is used in this example to hold the Polygon Stipple pattern.

Multi media I/O units

Video Output
(Α), R, G, B outputs with 330 MHz dot clock for 1800 x 1400 screen format at 90 Hz.
12 (16) bit video out for Studio Quality video processing.
Interface to DVI-TFT transmitters for high resolution, high quality LCD displays.

Video Input
CCIR 656: 8 bit digital video input for NTSC, PAL, SECAM, HDTV and custom formats.

Audio Codec 97 Interface
Standard from Intel, Creative Labs, Yamaha, Analog Devices and Nat. Semiconductor.
Supports Analog speakers, Microphone, Headphone + Headphone micro, Telephony and Modem signals, CD analog audio in, Analog Video Sound In, PC beep in, et cetera.

Digital Audio
4 stereo serial I/O ports (I2S type and S type emulation capabilities).
Supports CD, DVD and Dolby AC3 input or output.

External Device Control
8 bit classic μP interface bus and I2C type emulation capability.
MIDI interface (input and output for synthesizers and keyboards).

Real Time Support

MULTI MEDIA REAL TIME SUPPORT

Level 1 Events (1 microsecond response time requirement):
Horizontal Sync interrupts, Video I/O interrupts, Register Virtualization interrupts.

Level 2 Events (2 - 100 microsecond response time requirement):
Communication Fifo interrupts, Mailbox interrupts, I2S Fifo interrupts, Ac97 Fifo interrupts, Midi interrupt, I2C interrupt, Vertical Sync interrupts, Scheduler Clock Tick, et cetera.

Threads (100 microsecond - 10 millisecond response time requirement):
Host Command Queues Manager, Audio Stream managers, Modem Stream managers, User definable threads.

High-end Board

8 Processors: 3.2 Tera operations/s
4 GigaByte memory

[Figure: board layout with the IMAGINE 3 processors.]

High-end Board

8 Imagine 3 processors, 3200 Billion operations per second
32 GigaByte per second Memory Bandwidth
16 GigaByte per second Inter-Processor Bandwidth

- Perspective Volume Rendering: 1000 x 1000 x 1000 at 15 frames/second (based on 25% volume traversal)
- Cone Beam Reconstruction: 512 x 512 x 512 from 1000^2 x 128 in 4 seconds
- Real Time 3D ultrasound reconstruction and visualization
- Real Time HDTV MPEG 4 video encoding
- Advanced Radar Processing

High Speed Dataflow Ring

Up to 2 Gigabyte per second Dataflow Ring (SSTL-2).
Point-to-point with Broadcast options and auto configuration.

[Figure: eight IMAGINE 3 processors connected in a ring.]

High Speed System I/O

The Dataflow Ring also provides very high speed System I/O. Entry level systems can use the programmable Video Data I/O for general purpose I/O (Video In 160 MB/s per processor, Video Out 1 GB/s per processor).

[Figure: eight IMAGINE 3 processors on the ring with an optional System I/O FPGA, e.g. Xilinx Virtex II. Dataflow input: up to 2.0 GB/s. Dataflow output: up to 2.0 GB/s.]

Pipeline Processing

The Dataflow Ring allows long vector processing pipelines over multiple processors. Here is an example with just 2 processors.

[Figure: two processors chained through the Dataflow Ring. Blocks shown: Vector Read from memory and Vector Write to memory on each processor, MAC as FIR filter, 256 entry vector register, ALUs, MAC as 3D blend unit, and Bi-linear Interpolated Data from the Graphics pipeline on each processor.]

128 bit memory bus (reads)

4.2 Gigabyte / second Memory Bus: 128 bit PC2100. Read clients:

- 16 kbyte 1st level data cache
- 16 kbyte 1st level instruction cache
- Dual 3D-graphics pipelines
- Dual 128 word x 128 bit Vector input fifos
- PCI/AGP Memory Read access
- Video Output 128 word x 128 bit fifo

128 bit memory bus (writes)

4.2 Gigabyte / second Memory Bus (128 bit PC2100). Write clients, through a 16 word x 128 bit write buffer:

- 16 kbyte 1st level data cache
- PCI/AGP Memory Write access
- Dual 128 word x 128 bit Vector output fifos

8-fold address interleaved memory reads and writes.
Out of order accesses with coherency checking.
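
The memory bus figure can be reproduced from the bus width and the memory type. The sketch below assumes that "PC2100" refers to standard DDR-266 SDRAM, that is a 133.33 MHz clock with two transfers per clock; the slides give only the bus width and the PC2100 label, and the function name is illustrative.

```python
BUS_WIDTH_BITS = 128                 # from the memory bus slides
MEMORY_CLOCK_HZ = 400_000_000 / 3    # assumption: DDR-266, 133.33 MHz clock
TRANSFERS_PER_CLOCK = 2              # double data rate


def peak_bandwidth_bytes_per_s(bus_width_bits: int,
                               clock_hz: float,
                               transfers_per_clock: int) -> float:
    """Peak bandwidth: bytes per transfer times transfers per second."""
    return bus_width_bits / 8 * clock_hz * transfers_per_clock


bandwidth = peak_bandwidth_bytes_per_s(
    BUS_WIDTH_BITS, MEMORY_CLOCK_HZ, TRANSFERS_PER_CLOCK
)
print(f"{bandwidth / 1e9:.2f} Gigabyte/s")
```

The result is slightly above 4.2 Gigabyte/s, which is consistent with the quoted figure being rounded down.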