Implementation of MPEG-2 Codec with MMX/SSE/SSE2 Technology
Speakers: Rong Jiang, Xu Jin
Instructor: Yu-Hen Hu

Outline
  Introduction
  MMX/SSE/SSE2
  MPEG-2 Video Compression
  What We Have Done
  Conclusion

MMX/SSE/SSE2
MMX:
  57 new instructions;
  8 64-bit-wide MMX registers (MM0-MM7);
  4 new data types (3 packed data types and 1 64-bit quadword).
SSE:
  8 new 128-bit SIMD floating-point registers (XMM0-XMM7);
  50 new instructions that operate on packed floating-point data;
  8 new instructions to control data cacheability;
  12 new instructions that extend the MMX instruction set.
SSE2:
  Support for packed 64-bit (double-precision) floating-point values;
  128-bit versions of the MMX integer instructions, operating on the XMM registers.

MPEG-2 Video Compression

Project Outline
1. Obtain existing MPEG-2 encoder/decoder source code in C
2. Generate profiling information
3. Identify the kernels (hot spots)
4. Rewrite the kernels using SSE/SSE2
5. Measure the performance

Profiling Results of the Original Code
[Figure: percentage of run time (%) spent in each of eight functions, for mpeg2decode and mpeg2encode. Labeled bars: idct() in mpeg2decode; dist1() and fdct() in mpeg2encode.]

Example 1: Optimizing dist1()
Original C code (s accumulates the sum of absolute differences over 16 pixels):

    if ((v = p1[0]  - p2[0])  < 0) v = -v; s += v;
    if ((v = p1[1]  - p2[1])  < 0) v = -v; s += v;
    if ((v = p1[2]  - p2[2])  < 0) v = -v; s += v;
    if ((v = p1[3]  - p2[3])  < 0) v = -v; s += v;
    if ((v = p1[4]  - p2[4])  < 0) v = -v; s += v;
    if ((v = p1[5]  - p2[5])  < 0) v = -v; s += v;
    if ((v = p1[6]  - p2[6])  < 0) v = -v; s += v;
    if ((v = p1[7]  - p2[7])  < 0) v = -v; s += v;
    if ((v = p1[8]  - p2[8])  < 0) v = -v; s += v;
    if ((v = p1[9]  - p2[9])  < 0) v = -v; s += v;
    if ((v = p1[10] - p2[10]) < 0) v = -v; s += v;
    if ((v = p1[11] - p2[11]) < 0) v = -v; s += v;
    if ((v = p1[12] - p2[12]) < 0) v = -v; s += v;
    if ((v = p1[13] - p2[13]) < 0) v = -v; s += v;
    if ((v = p1[14] - p2[14]) < 0) v = -v; s += v;
    if ((v = p1[15] - p2[15]) < 0) v = -v; s += v;

SSE2 version (GCC inline assembly):

    asm volatile (
        "movdqu  (%1), %%xmm0   \n\t"  /* load 16 bytes from p1 (unaligned)     */
        "movdqu  (%2), %%xmm1   \n\t"  /* load 16 bytes from p2                 */
        "psadbw  %%xmm0, %%xmm1 \n\t"  /* one partial SAD per 64-bit half       */
        "movdq2q %%xmm1, %%mm0  \n\t"  /* low partial sum -> mm0                */
        "psrldq  $8, %%xmm1     \n\t"  /* shift the high half down to the low   */
        "movdq2q %%xmm1, %%mm1  \n\t"  /* high partial sum -> mm1               */
        "paddd   %%mm1, %%mm0   \n\t"  /* add the two partial sums              */
        "movd    %%mm0, %0      \n\t"  /* result -> s                           */
        "emms                   \n\t"  /* clear MMX state before any x87 code   */
        : "=r"(s)
        : "r"(p1), "r"(p2)
        : "xmm0", "xmm1", "mm0", "mm1", "memory");

Result: a 4-5X speed-up, but it can be faster!
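For comparison, the same 16-byte kernel can also be written with SSE2 compiler intrinsics instead of inline assembly, which leaves register allocation to the compiler and never touches the MMX registers (so no emms is needed). This is a minimal sketch, assuming an x86/x86-64 compiler that provides <emmintrin.h>; the names sad16_scalar and sad16_sse2 are hypothetical and are not taken from the original encoder.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics; assumes an x86/x86-64 target */
#include <stdlib.h>     /* abs() */

/* Scalar reference: sum of |p1[i] - p2[i]| over 16 bytes,
   equivalent to the unrolled C code with s starting at 0. */
static int sad16_scalar(const unsigned char *p1, const unsigned char *p2)
{
    int s = 0;
    for (int i = 0; i < 16; i++)
        s += abs(p1[i] - p2[i]);
    return s;
}

/* SSE2 version: one PSADBW yields two partial sums,
   one in the low 64-bit lane and one in the high 64-bit lane. */
static int sad16_sse2(const unsigned char *p1, const unsigned char *p2)
{
    __m128i a   = _mm_loadu_si128((const __m128i *)p1); /* unaligned load (movdqu) */
    __m128i b   = _mm_loadu_si128((const __m128i *)p2);
    __m128i sad = _mm_sad_epu8(a, b);                   /* psadbw                  */
    __m128i hi  = _mm_srli_si128(sad, 8);               /* psrldq: high lane -> low */
    return _mm_cvtsi128_si32(_mm_add_epi32(sad, hi));   /* add lanes, take low 32 bits */
}
```

Called on two 16-byte rows, sad16_sse2(p1, p2) should return the same value as sad16_scalar(p1, p2); on 32-bit builds the compiler may need -msse2.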
The dist1() routine computes the distortion (sum of absolute differences) between the current block and a candidate reference block during motion estimation, the prediction stage of the encoder.

Four Ways to Write Super-Fast Code
1. Rearrange data fetching to maximize cache hits;
2. Unroll loops to eliminate unnecessary branches;
3. Use SSE instructions to take full advantage of data parallelism;
4. Apply code scheduling to exploit the multiple-issue capability of the Pentium 4's superscalar microarchitecture.

Example 2: Optimizing idct()
Three nested loops form the kernel of the DCT:

    for (i = 0; i < 8; i++)
        for (j = 0; j < 8; j++) {
            partial_product = 0.0;
            for (k = 0; k < 8; k++)
                partial_product += c[k][j] * block[i][k];
            tmp[i][j] = partial_product;
        }

A verbatim translation of this loop from C to assembly is not much faster: it misses the whole point of hand-writing an assembly procedure. We need parallelism!

Results
[Figure: two bar charts comparing the original and modified functions. Total run time (s): reported values 50.1 s, 16.34 s, 2.45 s and 3.83 s, annotated with speed-ups of 25X in idct() and 4X in dist1(). Percentage of total run time (%): reported values 68.72%, 34.39%, 13.04% and 9.99%.]
Experimental results are averaged over 3 runs.
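The idct() loop in Example 2 is a row-by-matrix product: for each row i, tmp[i][0..7] is the sum over k of block[i][k] times row k of c. That structure lets the j loop run four lanes at a time with SSE by broadcasting block[i][k] and multiplying it by a row of c. This is a minimal sketch, assuming single-precision float data (the slide does not show the element types); rowmul_scalar and rowmul_sse are hypothetical names, and the code covers only this matrix-product step, not a full IDCT or the authors' implementation.

```c
#include <assert.h>
#include <xmmintrin.h>  /* SSE intrinsics; assumes an x86/x86-64 target */

/* Scalar reference: the triple loop from the slide, in single precision. */
static void rowmul_scalar(float block[8][8], float c[8][8], float tmp[8][8])
{
    for (int i = 0; i < 8; i++)
        for (int j = 0; j < 8; j++) {
            float partial_product = 0.0f;
            for (int k = 0; k < 8; k++)
                partial_product += c[k][j] * block[i][k];
            tmp[i][j] = partial_product;
        }
}

/* SSE version: the j loop is computed 4 columns at a time.
   tmp[i][0..7] = sum over k of block[i][k] * c[k][0..7]. */
static void rowmul_sse(float block[8][8], float c[8][8], float tmp[8][8])
{
    for (int i = 0; i < 8; i++) {
        __m128 lo = _mm_setzero_ps();            /* accumulates tmp[i][0..3] */
        __m128 hi = _mm_setzero_ps();            /* accumulates tmp[i][4..7] */
        for (int k = 0; k < 8; k++) {
            __m128 b = _mm_set1_ps(block[i][k]); /* broadcast one input value */
            lo = _mm_add_ps(lo, _mm_mul_ps(b, _mm_loadu_ps(&c[k][0])));
            hi = _mm_add_ps(hi, _mm_mul_ps(b, _mm_loadu_ps(&c[k][4])));
        }
        _mm_storeu_ps(&tmp[i][0], lo);
        _mm_storeu_ps(&tmp[i][4], hi);
    }
}
```

Each iteration of the k loop now does eight multiply-adds with four SSE instructions instead of eight scalar ones; further gains would come from the other techniques listed above, such as unrolling the k loop and scheduling independent instructions together.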
Platform Compatibility (1)
Checking for MMX support (CPUID leaf 1, EDX bit 23; MSVC inline assembly, 32-bit x86):

    bool isMMXSupported()
    {
        int fSupported;
        __asm {
            mov  eax, 1            // CPUID level 1
            cpuid                  // EDX = feature flags
            and  edx, 0x800000     // test bit 23 of the feature flags
            mov  fSupported, edx   // != 0 if MMX is supported
        }
        return fSupported != 0;
    }

Platform Compatibility (2)
Checking for SSE support (CPUID leaf 1, EDX bit 25), or for AMD's MMX extensions (extended function 0x80000001, EDX bit 22):

    bool isISSESupported()
    {
        int processor;
        int features;
        int extfeatures = 0;
        __asm {
            pusha
            mov  eax, 1
            cpuid
            mov  processor, eax     // store processor family/model/stepping
            mov  features, edx      // store feature bits
            mov  eax, 080000000h
            cpuid                   // check which extended functions can be called
            cmp  eax, 080000001h    // extended feature bits available?
            jb   nofeatures         // jump if not supported
            mov  eax, 080000001h    // select function 0x80000001
            cpuid
            mov  extfeatures, edx   // store extended feature bits
        nofeatures:
            popa
        }
        if (((features >> 25) & 1) != 0)          // SSE
            return true;
        else if (((extfeatures >> 22) & 1) != 0)  // AMD MMX extensions
            return true;
        else
            return false;
    }

Run-time dispatch:
    SSE supported?  Y -> SSE routine
                    N -> MMX supported?  Y -> MMX routine
                                         N -> normal routine
    END

Thank you!
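On GCC or Clang, where MSVC-style __asm blocks are not available, the same feature check and the SSE -> MMX -> normal fallback from the dispatch flowchart can be sketched with the compiler's <cpuid.h> helper __get_cpuid. This is a sketch under stated assumptions: it tests only the leaf-1 EDX bits (25 for SSE, 23 for MMX) and ignores the AMD extended bit 22; kernel_path, choose_path and detect_path are hypothetical names.

```c
#include <assert.h>
#include <cpuid.h>   /* GCC/Clang wrapper for the CPUID instruction; x86 only */

/* CPUID leaf 1, EDX feature bits tested in the slides. */
#define FEAT_MMX (1u << 23)
#define FEAT_SSE (1u << 25)

typedef enum { PATH_NORMAL, PATH_MMX, PATH_SSE } kernel_path;

/* Decision flow from the flowchart: SSE? -> MMX? -> normal routine. */
static kernel_path choose_path(unsigned int edx)
{
    if (edx & FEAT_SSE)
        return PATH_SSE;
    if (edx & FEAT_MMX)
        return PATH_MMX;
    return PATH_NORMAL;
}

/* Query the running CPU once; fall back to plain C if leaf 1 is unavailable. */
static kernel_path detect_path(void)
{
    unsigned int eax, ebx, ecx, edx;
    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
        return PATH_NORMAL;
    return choose_path(edx);
}
```

A program would typically call detect_path() once at start-up and store a function pointer to the matching SSE, MMX or C routine, so the check is not repeated inside the hot loops.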