Vector Unit Assembly
bquintero@fullsail.com

Overview
- Architecture Review
- VU0 Macro Mode
- Instruction Set
- Building a Vector Library

Review
- The PlayStation 2 has two vector units that are similar but not identical.
- VU0 is the CPU's alternate processing unit.
- VU1 is the GS's alternate processing unit.
- Each unit has a direct pipeline to its respective processor.
- The vector units are designed for 4D x 32-bit vectors.

Review
- VU0 and VU1 each have access to 32 float registers and 16 integer registers.
- The float registers are not like PC registers; they are 128 bits wide (a PC register is 32 bits).
- 128 bits holds four float values at once (a 4D vector).
- The integer registers are typically used as loop counters and for address calculation.

Review
- VU0 has two bus lines.
- One bus is dedicated to the CPU.
- The other bus is used to communicate with all other devices.
- VU0 has 4KB of cache.
- [Diagram: VU0 (4KB I$, 4KB D$) connected to the CPU core by a dedicated bus, and to system RAM by the shared bus.]

Vector Unit Processing Speed
- The graph shows some vector-math-intensive function calls.
- 200K calls were made to each function.
- [Graph: time in ms for 200K calls to Add, Scale, and Cross, comparing VU0 against the EE core.]

Macro and Micro Modes
- Vector Unit Zero (VU0) has two modes.
- Micro mode allows the vector processor to act as an independent CPU: a mini program is uploaded and executed in parallel with the main CPU.
- Macro mode allows the CPU to directly offload heavy vector computation with low overhead.
- Macro mode is the most popular method, hands down.

Micro Mode
- Once uploaded, the micro program executes independently of the CPU.
- This means we must time our execution so that the CPU fetches the result only after the Vector Unit has finished the program.
- Micro mode can cause serious stalls and timing issues, since execution speed is nearly impossible to determine.

Macro Mode
- Macro mode is a much easier way to execute fast math functionality.
- Assembly can be written as inline instructions, telling the compiler to offload the math to VU0.

Notes
- Just because it's written in assembly does not mean it will be faster.
- Switching CPU focus has its overheads.

Assembly Structure
- There is a typical pattern for writing assembly routines:
  - Load the variable data/addresses into registers.
  - Apply vector computations to those registers.
  - Store the result back to a variable address.
- The overhead of using assembly is in the load and store.
- Make sure the computation stage improves performance enough to offset the load/store overhead.

Vector Unit MIPS Instructions
- Coprocessor transfer instructions (store / load)
- Coprocessor branch instructions
- Macro (primitive) calculation instructions (add / subtract / multiply / divide / etc.)
- Micro subroutine execution instructions

VU Macro Instructions

EEVectorAdd
Adding two vectors using the EE core (CPU):

    // (Vec4T *v0, Vec4T *v1, Vec4T *v2)
    {
        v2->x = v0->x + v1->x;
        v2->y = v0->y + v1->y;
        v2->z = v0->z + v1->z;
        v2->w = v0->w + v1->w;
    }

VectorAdd
Adding two vectors using VU0. The same load/compute/store pattern extends to the other library primitives; see the subtraction sketch below.

    // (Vec4T *v0, Vec4T *v1, Vec4T *v2)
    {
        asm __volatile__("
            lqc2      vf05, 0x0(%0)
            lqc2      vf06, 0x0(%1)
            vadd.xyzw vf07, vf05, vf06
            sqc2      vf07, 0x0(%2)
            "
            :
            : "r"(v0), "r"(v1), "r"(v2));
    }
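VectorSub (sketch)
As an illustration of how the library grows, here is a minimal VectorSub that follows the same load/compute/store pattern. The function shape, register choices (vf05 through vf07), and inline-asm formatting simply mirror the VectorAdd example above and assume the PS2 ee-gcc toolchain; vsub is the VU0 macro-mode subtract. Treat it as a sketch, not course-provided code.

    // (Vec4T *v0, Vec4T *v1, Vec4T *v2) -- computes v2 = v0 - v1
    {
        asm __volatile__("
            lqc2      vf05, 0x0(%0)     # load *v0 into vf05
            lqc2      vf06, 0x0(%1)     # load *v1 into vf06
            vsub.xyzw vf07, vf05, vf06  # vf07 = vf05 - vf06, all four fields
            sqc2      vf07, 0x0(%2)     # store vf07 into *v2
            "
            :
            : "r"(v0), "r"(v1), "r"(v2));
    }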
EECrossProduct
Computing a cross product using the EE core. Notice that we must use a temp: each component reads from both inputs, so writing directly into the destination would corrupt the result if it aliases one of them.

    // (Vec4T *v1, Vec4T *v2, Vec4T *cross)
    {
        Vec4T temp;

        temp.x = v1->y * v2->z - v1->z * v2->y;
        temp.y = v1->z * v2->x - v1->x * v2->z;
        temp.z = v1->x * v2->y - v1->y * v2->x;

        VectorCopy(&temp, cross);
    }

CrossProduct
Computing a cross product using VU0:

    // (Vec4T *v1, Vec4T *v2, Vec4T *cross)
    {
        asm __volatile__("
            lqc2        vf05, 0x0(%0)
            lqc2        vf06, 0x0(%1)
            vopmula.xyz ACC,  vf05, vf06   # first half of the cross product
            vopmsub.xyz vf06, vf06, vf05   # subtract the second half
            vsub.w      vf06, vf00, vf00   # w = 0
            sqc2        vf06, 0x0(%2)
            "
            : // no output
            : "r"(v1), "r"(v2), "r"(cross));
    }

Vector Outer Product
- The vopmula instruction performs an outer product.
- The result is stored into the special-purpose ACC register.
- vopmsub then subtracts a second outer product from ACC, which is what completes the cross product above; a plain-C sketch of the sequence is given at the end of these notes.
- [Diagram: the x/y/z fields of VF05 and VF06 are combined and the result is written into the x/y/z fields of ACC.]

For Next Time
- Read Chapters 7.3.2 – 7.4.2
- Read Chapter 9.3
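Appendix: cross product reference sketch
To make the vopmula/vopmsub sequence concrete, here is a minimal plain-C sketch of what each instruction computes, assuming the usual VU outer-product component rotation (y*z, z*x, x*y). The function and variable names are ours for illustration and are not part of any PS2 library; the point is only to show why the two instructions together produce a cross product.

    #include <stdio.h>

    typedef struct { float x, y, z, w; } Vec4T;

    /* Plain-C illustration of the VU0 CrossProduct sequence above.
       Each stage is labeled with the instruction it mirrors. */
    static void CrossProductReference(const Vec4T *v1, const Vec4T *v2, Vec4T *cross)
    {
        Vec4T acc, out;

        /* vopmula.xyz ACC, vf05, vf06 -- first set of products goes to ACC */
        acc.x = v1->y * v2->z;
        acc.y = v1->z * v2->x;
        acc.z = v1->x * v2->y;

        /* vopmsub.xyz vf06, vf06, vf05 -- ACC minus the mirrored products */
        out.x = acc.x - v2->y * v1->z;   /* = v1.y*v2.z - v1.z*v2.y */
        out.y = acc.y - v2->z * v1->x;   /* = v1.z*v2.x - v1.x*v2.z */
        out.z = acc.z - v2->x * v1->y;   /* = v1.x*v2.y - v1.y*v2.x */

        /* vsub.w vf06, vf00, vf00 -- vf00.w minus itself clears w */
        out.w = 0.0f;

        *cross = out;
    }

    int main(void)
    {
        Vec4T a = { 1.0f, 0.0f, 0.0f, 0.0f };
        Vec4T b = { 0.0f, 1.0f, 0.0f, 0.0f };
        Vec4T c;

        CrossProductReference(&a, &b, &c);
        printf("cross = (%f, %f, %f)\n", c.x, c.y, c.z);   /* expect (0, 0, 1) */
        return 0;
    }

Compiled with any C compiler, this prints (0, 0, 1) for the x and y unit vectors, matching what the EE and VU0 versions of the cross product produce.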