Implementation of MPEG2 Codec with MMX/SSE/SSE2 Technology

Speakers: Rong Jiang, Xu Jin
Instructor: Yu-Hen Hu
Outline

Introduction
MMX/SSE/SSE2
MPEG 2 Video Compression
What have we done?
Conclusion

MMX/SSE/SSE2

MMX
  57 new instructions;
  8 64-bit wide MMX registers;
  4 new data types (3 packed data types and 1 64-bit entity).
SSE
  8 new 128-bit SIMD floating-point registers;
  50 new instructions that work on packed floating-point data;
  8 new instructions to control data cacheability;
  12 new instructions that extend the MMX instruction set.
SSE2
  Support for 64-bit floating-point values.
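To make "packed" data concrete, here is a minimal sketch of adding two arrays of 64-bit floating-point values two at a time with SSE2 intrinsics. The function name add_pd is hypothetical and not part of the project; the code assumes a compiler that provides <emmintrin.h> and falls back to a plain loop when SSE2 is not enabled.

```c
#if defined(__SSE2__)
#include <emmintrin.h>  /* SSE2 intrinsics */
#endif

/* Hypothetical helper: out[i] = a[i] + b[i] for i in [0, n). n must be even. */
static void add_pd(const double *a, const double *b, double *out, int n)
{
#if defined(__SSE2__)
    for (int i = 0; i < n; i += 2) {
        __m128d va = _mm_loadu_pd(a + i);   /* two doubles in one 128-bit register */
        __m128d vb = _mm_loadu_pd(b + i);
        _mm_storeu_pd(out + i, _mm_add_pd(va, vb));  /* both lanes added at once */
    }
#else
    for (int i = 0; i < n; i++)             /* scalar fallback */
        out[i] = a[i] + b[i];
#endif
}
```

Each loop iteration processes two doubles because a 128-bit register holds exactly two 64-bit values.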
MPEG 2 video compression
Project outline
1. Obtain C source code for an MPEG2 encoder/decoder
2. Generate profiling information
3. Identify the kernels
4. Rewrite kernels using SSE
5. Performance results
Profiling results of the original code
[Figure: two bar charts, mpeg2decode and mpeg2encode. X-axis: Different Functions (1-8); y-axis: Percentage (%). Labeled bars: idct() in mpeg2decode; dist1() and fdct() in mpeg2encode.]
Example 1 – optimizing dist1()
if ((v = p1[0] - p2[0])<0) v = -v; s+= v;
if ((v = p1[1] - p2[1])<0) v = -v; s+= v;
if ((v = p1[2] - p2[2])<0) v = -v; s+= v;
if ((v = p1[3] - p2[3])<0) v = -v; s+= v;
if ((v = p1[4] - p2[4])<0) v = -v; s+= v;
if ((v = p1[5] - p2[5])<0) v = -v; s+= v;
if ((v = p1[6] - p2[6])<0) v = -v; s+= v;
if ((v = p1[7] - p2[7])<0) v = -v; s+= v;
if ((v = p1[8] - p2[8])<0) v = -v; s+= v;
if ((v = p1[9] - p2[9])<0) v = -v; s+= v;
if ((v = p1[10] - p2[10])<0) v = -v; s+= v;
if ((v = p1[11] - p2[11])<0) v = -v; s+= v;
if ((v = p1[12] - p2[12])<0) v = -v; s+= v;
if ((v = p1[13] - p2[13])<0) v = -v; s+= v;
if ((v = p1[14] - p2[14])<0) v = -v; s+= v;
if ((v = p1[15] - p2[15])<0) v = -v; s+= v;
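asm volatile (
    "movdqu  (%1), %%xmm0\n\t"    /* load 16 bytes from p1 */
    "movdqu  (%2), %%xmm1\n\t"    /* load 16 bytes from p2 */
    "psadbw  %%xmm0, %%xmm1\n\t"  /* two partial sums, one per 64-bit half */
    "movdq2q %%xmm1, %%mm0\n\t"   /* low partial sum */
    "psrldq  $8, %%xmm1\n\t"      /* shift right: high partial sum moves to the low half */
    "movdq2q %%xmm1, %%mm1\n\t"   /* high partial sum */
    "paddd   %%mm1, %%mm0\n\t"
    "movd    %%mm0, %0\n\t"
    "emms"                        /* leave MMX state before any x87 code runs */
    : "=r"(s)
    : "r"(p1), "r"(p2)
    : "xmm0", "xmm1", "mm0", "mm1", "memory");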
4-5X speed-up, but it can be faster!
This code segment sums the absolute differences between two rows of 16 pixels, as part of the residual calculation in the prediction stage of the encoder.
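The same 16-byte sum of absolute differences can also be sketched with compiler intrinsics instead of inline assembly, which avoids the MMX registers altogether. This is an illustration, not the presenters' code: the names sad16_scalar and sad16_sse2 are hypothetical, and the code assumes p1 and p2 point to unsigned bytes and that <emmintrin.h> is available when SSE2 is enabled.

```c
#include <stdint.h>
#include <stdlib.h>
#if defined(__SSE2__)
#include <emmintrin.h>  /* SSE2 intrinsics */
#endif

/* Scalar reference: sum of |p1[i] - p2[i]| over 16 bytes. */
static int sad16_scalar(const uint8_t *p1, const uint8_t *p2)
{
    int s = 0;
    for (int i = 0; i < 16; i++)
        s += abs((int)p1[i] - (int)p2[i]);
    return s;
}

/* SSE2 sketch (hypothetical name); uses the scalar loop when SSE2 is not enabled. */
static int sad16_sse2(const uint8_t *p1, const uint8_t *p2)
{
#if defined(__SSE2__)
    __m128i a   = _mm_loadu_si128((const __m128i *)p1);  /* unaligned 16-byte loads */
    __m128i b   = _mm_loadu_si128((const __m128i *)p2);
    __m128i sad = _mm_sad_epu8(a, b);      /* psadbw: one partial sum per 64-bit half */
    __m128i hi  = _mm_srli_si128(sad, 8);  /* psrldq: bring the high partial sum down */
    return _mm_cvtsi128_si32(_mm_add_epi32(sad, hi));
#else
    return sad16_scalar(p1, p2);
#endif
}
```

Keeping the scalar version next to the vector one makes it easy to check that both return the same value for arbitrary inputs.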
Four ways to write super-fast code

Rearrange data fetching to maximize cache hits;
Unroll loops to eliminate unnecessary branches;
Utilize SSE instructions to take full advantage of parallelism;
Apply code scheduling to exploit the multiple-issue capability of the Pentium 4's superscalar microarchitecture.
Example 2 – optimize idct()
Three nested loops form the kernel of the DCT:
for (i=0; i<8; i++)
    for (j=0; j<8; j++)
    {
        partial_product = 0.0;
        for (k=0; k<8; k++)
            partial_product += c[k][j]*block[i][k];
        tmp[i][j] = partial_product;
    }
A verbatim translation from C to assembly doesn't do much better. It misses the whole point of manually writing an assembly procedure.

We need parallelism!
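One way to expose that parallelism, sketched here under assumptions, is to compute a whole output row at once: for a fixed i, the row tmp[i][0..7] is the sum over k of block[i][k] times row k of c, so each step is a broadcast followed by two 4-wide multiply-adds. The function names are hypothetical, single-precision floats are used so that four values fit in one SSE register, and this is not claimed to be the presenters' actual kernel.

```c
#if defined(__SSE__)
#include <xmmintrin.h>  /* SSE intrinsics */
#endif

/* Scalar reference with the same loop structure as the slide (float precision). */
static void matmul8_scalar(float c[8][8], float block[8][8], float tmp[8][8])
{
    for (int i = 0; i < 8; i++)
        for (int j = 0; j < 8; j++) {
            float partial_product = 0.0f;
            for (int k = 0; k < 8; k++)
                partial_product += c[k][j] * block[i][k];
            tmp[i][j] = partial_product;
        }
}

/* SSE sketch: each output row is built from two 4-float accumulators. */
static void matmul8_sse(float c[8][8], float block[8][8], float tmp[8][8])
{
#if defined(__SSE__)
    for (int i = 0; i < 8; i++) {
        __m128 lo = _mm_setzero_ps();   /* accumulates tmp[i][0..3] */
        __m128 hi = _mm_setzero_ps();   /* accumulates tmp[i][4..7] */
        for (int k = 0; k < 8; k++) {
            __m128 b = _mm_set1_ps(block[i][k]);  /* broadcast one input value */
            lo = _mm_add_ps(lo, _mm_mul_ps(b, _mm_loadu_ps(&c[k][0])));
            hi = _mm_add_ps(hi, _mm_mul_ps(b, _mm_loadu_ps(&c[k][4])));
        }
        _mm_storeu_ps(&tmp[i][0], lo);
        _mm_storeu_ps(&tmp[i][4], hi);
    }
#else
    matmul8_scalar(c, block, tmp);      /* fallback when SSE is not enabled */
#endif
}
```

The innermost loop over j disappears because four values of j are handled by each vector instruction, and the accesses to c become contiguous row reads.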


Results
[Figure: two bar charts, each with x-axis "Original vs. Modified Functions". One chart plots Total Run Time (s), with labeled values 50.1s, 16.34s, 2.45s and 3.83s; the other plots Percentage in Total Run Time (%), with labeled values 68.72%, 34.39%, 13.04% and 9.99%. Annotations: 25X in idct(), 4X in dist1().]
Experimental results are averaged over 3 runs.
Platform Compatibility (1)
Algorithm for Checking Availability of MMX
bool isMMXSupported()
{
    int fSupported;
    asm
    {
        mov eax,1            // CPUID level 1
        cpuid                // EDX = feature flag
        and edx,0x800000     // test bit 23 of feature flag
        mov fSupported,edx   // != 0 if MMX is supported
    }
    return fSupported != 0;
}
Platform Compatibility (2)
Algorithm for Checking Availability of SSE
bool isISSESupported()
{
    int processor;
    int features;
    int extfeatures = 0;
    asm {
        pusha
        mov eax,1
        cpuid
        mov processor,eax       // Store processor family/model/step
        mov features,edx        // Store features bits
        mov eax,080000000h
        cpuid                   // Check which extended functions can be called
        cmp eax,080000001h      // Extended Feature Bits
        jb nofeatures           // Jump if not supported
        mov eax,080000001h      // Select function 0x80000001
        cpuid
        mov extfeatures,edx     // Store extended features bits
    nofeatures:
        popa
    }
    if (((features >> 25) & 1) != 0)
        return true;
    else if (((extfeatures >> 22) & 1) != 0)
        return true;
    else
        return false;
}
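With GCC or Clang, the same feature bits can be read without hand-written assembly through the __get_cpuid helper in <cpuid.h>. The sketch below is an alternative under that assumption, not the presenters' code; the simd_features struct and detect_simd function are hypothetical names, and on non-x86 targets every flag is simply reported as false.

```c
#include <stdbool.h>
#if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
#include <cpuid.h>      /* GCC/Clang helper: __get_cpuid */
#define HAVE_CPUID_H 1
#endif

/* Hypothetical result type for this sketch. */
typedef struct {
    bool mmx;
    bool sse;
    bool sse2;
} simd_features;

static simd_features detect_simd(void)
{
    simd_features f = { false, false, false };
#ifdef HAVE_CPUID_H
    unsigned int eax, ebx, ecx, edx;
    /* CPUID leaf 1: EDX holds the standard feature flags. */
    if (__get_cpuid(1, &eax, &ebx, &ecx, &edx)) {
        f.mmx  = ((edx >> 23) & 1) != 0;  /* bit 23: MMX  */
        f.sse  = ((edx >> 25) & 1) != 0;  /* bit 25: SSE  */
        f.sse2 = ((edx >> 26) & 1) != 0;  /* bit 26: SSE2 */
    }
#endif
    return f;
}
```

A caller would run detect_simd() once at start-up and cache the result, since the CPUID instruction is comparatively slow.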
[Flowchart: SSE? If yes, run the SSE Routine. If no, MMX? If yes, run the MMX Routine. If no, run the Normal Routine. END]
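The selection logic in the flowchart can be sketched as a small dispatch function. The enum and function names below are hypothetical; the run-time query assumes the GCC/Clang builtins __builtin_cpu_init and __builtin_cpu_supports on x86, and defaults to the normal routine elsewhere.

```c
/* Hypothetical labels for the three code paths in the flowchart. */
typedef enum {
    ROUTINE_NORMAL,
    ROUTINE_MMX,
    ROUTINE_SSE
} routine_kind;

/* Pure decision logic: prefer SSE, then MMX, then the normal routine. */
static routine_kind choose_routine(int has_sse, int has_mmx)
{
    if (has_sse)
        return ROUTINE_SSE;
    if (has_mmx)
        return ROUTINE_MMX;
    return ROUTINE_NORMAL;
}

/* Queries the running CPU; assumes GCC/Clang builtins on x86. */
static routine_kind choose_routine_for_this_cpu(void)
{
#if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
    __builtin_cpu_init();
    return choose_routine(__builtin_cpu_supports("sse"),
                          __builtin_cpu_supports("mmx"));
#else
    return ROUTINE_NORMAL;   /* no x86 feature query available */
#endif
}
```

Separating the decision from the CPU query keeps the branch order testable on any machine, whatever features it actually has.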
Thank you!