Accelerating Multimedia Applications using the Intel SSE and AVX ISA

MIN LI
05/08/2013
INTEL SSE AND AVX ISA
Intel ISA
SSE1, SSE2, SSE3, SSE4 (SSE4.1, SSE4.2)
SSE4.2: specialized for string and text applications (suitable for applications such as template matching and genome sequence comparison)
AVX (mainly for floating-point operations)
AVX1: 256 bits
AVX2: 256 bits (adds further instructions, including 256-bit integer operations)
XMM and YMM registers
XMM: 128 bits
YMM: 256 bits
INTEL OPENCV LIBRARY
OpenCV library
A wide range of multimedia applications
Object detection, face recognition, image processing…
Good candidates for speedup with the Intel SSE or AVX ISA
Computation-intensive workloads
I made two videos on YouTube showing some tricks for using the OpenCV library
https://www.youtube.com/watch?v=ISap9zEGE2I
https://www.youtube.com/watch?v=pqSgT0quMBc
GUIDELINES FOR ENABLING THE ISA
Intel SSE and AVX
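Run cat /proc/cpuinfo and make sure the SSE and AVX flags are listed. Otherwise enable them.
On my machine:
All SSE ISAs are activated
However, only AVX1 is activated, which means the 256-bit YMM registers can be used for floating-point operations only; integer operations are limited to the 128-bit XMM registers
Note: AVX2 is introduced with Intel's Haswell microarchitecture (2013)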
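The same check can be made from inside a program. The sketch below is a minimal example, assuming a GCC or Clang compiler on x86, where the __builtin_cpu_supports builtin is available; the simd_support struct and the two function names are hypothetical helpers introduced only for this illustration. On other compilers or targets it simply reports every feature as absent.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper type: one flag per ISA extension of interest. */
typedef struct {
    int sse2;
    int sse4_2;
    int avx;
    int avx2;
} simd_support;

/* Query the running CPU. Assumes GCC/Clang on x86; elsewhere all flags stay 0. */
simd_support query_simd_support(void)
{
    simd_support s = {0, 0, 0, 0};
#if (defined(__GNUC__) || defined(__clang__)) && \
    (defined(__x86_64__) || defined(__i386__))
    __builtin_cpu_init();  /* harmless if the runtime already initialized it */
    s.sse2   = __builtin_cpu_supports("sse2")   ? 1 : 0;
    s.sse4_2 = __builtin_cpu_supports("sse4.2") ? 1 : 0;
    s.avx    = __builtin_cpu_supports("avx")    ? 1 : 0;
    s.avx2   = __builtin_cpu_supports("avx2")   ? 1 : 0;
#endif
    return s;
}

/* Print a short report, similar in spirit to reading the cpuinfo flags. */
void print_simd_support(FILE *out)
{
    assert(out != NULL);
    simd_support s = query_simd_support();
    fprintf(out, "sse2=%d sse4.2=%d avx=%d avx2=%d\n",
            s.sse2, s.sse4_2, s.avx, s.avx2);
}
```

Calling print_simd_support(stdout) at program start is one way to decide between a scalar and a vectorized code path at run time.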
ACCELERATION CASE I
Original:
for( int i = 0; i < length; i += 4 ){
    double t0 = d1[i] - d2[i];
    double t1 = d1[i+1] - d2[i+1];
    double t2 = d1[i+2] - d2[i+2];
    double t3 = d1[i+3] - d2[i+3];
    total_cost += t0*t0 + t1*t1
                + t2*t2 + t3*t3;
}
After modification:
// assumes d1 and d2 are float arrays aligned to 16 bytes (required by _mm_load_ps)
int chunk = length / 4;
for( int i = 0; i < chunk; i++ ){
    __m128 m0, m1, m2;
    m0 = _mm_load_ps(&d1[4 * i]);
    m1 = _mm_load_ps(&d2[4 * i]);
    m1 = _mm_sub_ps(m0, m1);        // four differences
    m1 = _mm_mul_ps(m1, m1);        // four squared differences
    m1 = _mm_hadd_ps(m1, m1);       // pairwise sums (SSE3)
    m2 = _mm_shuffle_ps(m1, m1, _MM_SHUFFLE(2,3,0,1));
    m1 = _mm_add_ps(m1, m2);        // every lane now holds the sum of all four
    total_cost += _mm_cvtss_f32(m1);
    if( total_cost > best )
        break;
}
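For readers who want to try the idea in isolation, here is a self-contained sketch of the same sum-of-squared-differences loop. It is not the code from the slide: it uses unaligned loads so any float array works, and it replaces the SSE3 horizontal add with plain SSE shuffles so that it builds on x86-64 without extra compiler flags. The function names sq_dist_scalar and sq_dist_sse are made up for this example.

```c
#include <assert.h>
#include <xmmintrin.h>  /* SSE intrinsics */

/* Scalar reference: sum of squared differences, accumulated in double. */
double sq_dist_scalar(const float *d1, const float *d2, int length)
{
    double total = 0.0;
    for (int i = 0; i < length; i++) {
        double t = d1[i] - d2[i];
        total += t * t;
    }
    return total;
}

/* SSE sketch: processes four floats per iteration and stops early once the
   running cost exceeds `best`. Assumes length is a multiple of 4. */
double sq_dist_sse(const float *d1, const float *d2, int length, double best)
{
    assert(length % 4 == 0);
    double total = 0.0;
    for (int i = 0; i < length; i += 4) {
        __m128 a  = _mm_loadu_ps(d1 + i);   /* unaligned loads: no alignment needed */
        __m128 b  = _mm_loadu_ps(d2 + i);
        __m128 d  = _mm_sub_ps(a, b);
        __m128 sq = _mm_mul_ps(d, d);       /* [s0 s1 s2 s3] */

        /* Horizontal sum using SSE only: fold the upper half onto the lower. */
        __m128 hi   = _mm_movehl_ps(sq, sq);            /* [s2 s3 s2 s3] */
        __m128 sum2 = _mm_add_ps(sq, hi);               /* [s0+s2 s1+s3 . .] */
        __m128 odd  = _mm_shuffle_ps(sum2, sum2, _MM_SHUFFLE(1, 1, 1, 1));
        __m128 sum  = _mm_add_ss(sum2, odd);            /* lane 0 = s0+s1+s2+s3 */

        total += _mm_cvtss_f32(sum);
        if (total > best)
            break;
    }
    return total;
}
```

Note that each group of four is summed in single precision before being added to the double accumulator, so results can differ slightly from the scalar version for inputs that are not exactly representable.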
ACCELERATION CASE II
Original:
float minval = FLT_MAX, maxval = -FLT_MAX;
for( i = 0; i < N; i++, ++it )
{
    float v = *(const float*)it.ptr;
    if( v < minval )
    {
        minval = v;
        minidx = it.node()->idx;
    }
    if( v > maxval )
    {
        maxval = v;
        maxidx = it.node()->idx;
    }
}
if( _minval )
    *_minval = minval;
if( _maxval )
    *_maxval = maxval;
After modification:
__m128 m0, m1, m2, m3, m4, minArray, maxArray;
int chunk = N / 4;
// the first chunk initializes the running per-lane minima and maxima;
// minPos[j] and maxPos[j] are assumed to start at j
minArray = maxArray = _mm_load_ps( (const float*)it.ptr );
it += 4;
for(i = 1; i < chunk; i++){
    m0 = _mm_load_ps( (const float*)it.ptr );
    it += 4;
    m1 = _mm_min_ps(m0, minArray);
    m2 = _mm_max_ps(m0, maxArray);
    m3 = _mm_cmp_ps(m0, minArray, _CMP_LT_OS);  // lanes with a new minimum
    m4 = _mm_cmp_ps(m0, maxArray, _CMP_GT_OS);  // lanes with a new maximum
    int* mask1 = (int*) &m3;
    int* mask2 = (int*) &m4;
    for(int j = 0; j < 4; j++){
        if(mask1[j] == -1)
            minPos[j] = 4 * i + j;
        if(mask2[j] == -1)
            maxPos[j] = 4 * i + j;
    }
    minArray = m1; maxArray = m2;  // keep the values, not the compare masks
}
// a final pass over the four lanes picks the overall minimum and maximum
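A compact way to experiment with min/max plus position is sketched below. It assumes a plain contiguous float array rather than the iterator used above, tracks the positions in a second vector register instead of a scalar inner loop, and uses only SSE2 so it builds on x86-64 without extra flags. The names select_epi32 and minmax_idx_sse are hypothetical and exist only for this illustration.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Per-lane select: where mask is all-ones take a, otherwise take b. */
static inline __m128i select_epi32(__m128i mask, __m128i a, __m128i b)
{
    return _mm_or_si128(_mm_and_si128(mask, a), _mm_andnot_si128(mask, b));
}

/* Finds the minimum and maximum of v[0..n-1] and the index of the first
   occurrence of each. Assumes n > 0 and n is a multiple of 4. */
void minmax_idx_sse(const float *v, int n,
                    float *minval, int *minidx,
                    float *maxval, int *maxidx)
{
    assert(n > 0 && n % 4 == 0);
    __m128  vmin = _mm_loadu_ps(v);             /* first chunk seeds the state */
    __m128  vmax = vmin;
    __m128i idx  = _mm_setr_epi32(0, 1, 2, 3);  /* indices of the current chunk */
    __m128i imin = idx, imax = idx;
    const __m128i four = _mm_set1_epi32(4);

    for (int i = 4; i < n; i += 4) {
        __m128 x = _mm_loadu_ps(v + i);
        idx = _mm_add_epi32(idx, four);
        /* Strict comparisons keep the earliest index within each lane. */
        __m128i lt = _mm_castps_si128(_mm_cmplt_ps(x, vmin));
        __m128i gt = _mm_castps_si128(_mm_cmpgt_ps(x, vmax));
        imin = select_epi32(lt, idx, imin);
        imax = select_epi32(gt, idx, imax);
        vmin = _mm_min_ps(x, vmin);
        vmax = _mm_max_ps(x, vmax);
    }

    /* Reduce the four lanes; ties go to the smaller index. */
    float fmin[4], fmax[4];
    int   pmin[4], pmax[4];
    _mm_storeu_ps(fmin, vmin);
    _mm_storeu_ps(fmax, vmax);
    _mm_storeu_si128((__m128i *)pmin, imin);
    _mm_storeu_si128((__m128i *)pmax, imax);
    *minval = fmin[0]; *minidx = pmin[0];
    *maxval = fmax[0]; *maxidx = pmax[0];
    for (int j = 1; j < 4; j++) {
        if (fmin[j] < *minval || (fmin[j] == *minval && pmin[j] < *minidx)) {
            *minval = fmin[j]; *minidx = pmin[j];
        }
        if (fmax[j] > *maxval || (fmax[j] == *maxval && pmax[j] < *maxidx)) {
            *maxval = fmax[j]; *maxidx = pmax[j];
        }
    }
}
```

Keeping the indices in a register avoids the per-lane scalar loop inside the hot loop; the price is the extra reduction over four lanes at the end.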
LOAD OF STRUCTURES
Structures like this:
typedef struct point_ {
    int x;
    int y;
} point;
point* points;
Memory layout: points[0].x, points[0].y, points[1].x, points[1].y, ...
_mm_load_ only reads consecutive memory!
What does it look like inside the YMM register?
X0 Y0 X1 Y1 X2 Y2 X3 Y3
How to achieve the following using the SSE and AVX ISA?
X0 X1 X2 X3 Y0 Y1 Y2 Y3
Not easy!
PERMUTE AND BLEND
(1) __m256i temp = _mm256_load_si256((__m256i*) &points[4 * i]);
    X0 Y0 X1 Y1 X2 Y2 X3 Y3
(2) __m256 temp2 = _mm256_cvtepi32_ps(temp);
(3) v4si mask1 = {9,8,8,9};
(4) __m256 temp3 = _mm256_permutevar_ps(temp2, mask1);
    X0 X1 Y0 Y1 Y2 Y3 X2 X3
(5) __m256 temp4 = _mm256_permute2f128_ps(temp3, temp3, 0x01);
    Y2 Y3 X2 X3 X0 X1 Y0 Y1
(6) temp3 = _mm256_blend_ps(temp3, temp4, 0b00110011);
    X0 X1 X2 X3 Y2 Y3 Y0 Y1
(7) v4si mask2 = {0xd,4,4,0xd};
(8) temp3 = _mm256_permutevar_ps(temp3, mask2);
    X0 X1 X2 X3 Y0 Y1 Y2 Y3
(9) __m128 m1 = _mm256_extractf128_ps(temp3, 1);
    X0 X1 X2 X3
(10) __m128 m2 = _mm256_extractf128_ps(temp3, 0);
    Y0 Y1 Y2 Y3
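For comparison, the same x/y separation can be sketched with two 128-bit loads and two shuffles. This is a different technique from the 256-bit permute-and-blend sequence above, not a transcription of it: it assumes SSE2 only, handles four points per call, and the function name load_points_soa is made up for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <emmintrin.h>  /* SSE2 intrinsics */

typedef struct point_ {
    int x;
    int y;
} point;

/* Reads four interleaved points and writes their coordinates, converted to
   float, into separate arrays: xs = x0..x3, ys = y0..y3. */
void load_points_soa(const point *p, float xs[4], float ys[4])
{
    assert(p != NULL && xs != NULL && ys != NULL);

    /* Two unaligned 128-bit loads cover the four points. */
    __m128i lo = _mm_loadu_si128((const __m128i *)&p[0]);  /* x0 y0 x1 y1 */
    __m128i hi = _mm_loadu_si128((const __m128i *)&p[2]);  /* x2 y2 x3 y3 */

    /* Convert the 32-bit integers to single-precision floats. */
    __m128 flo = _mm_cvtepi32_ps(lo);
    __m128 fhi = _mm_cvtepi32_ps(hi);

    /* _mm_shuffle_ps takes its two low lanes from the first operand and its
       two high lanes from the second, so even elements give the x values and
       odd elements give the y values. */
    __m128 x = _mm_shuffle_ps(flo, fhi, _MM_SHUFFLE(2, 0, 2, 0));  /* x0 x1 x2 x3 */
    __m128 y = _mm_shuffle_ps(flo, fhi, _MM_SHUFFLE(3, 1, 3, 1));  /* y0 y1 y2 y3 */

    _mm_storeu_ps(xs, x);
    _mm_storeu_ps(ys, y);
}
```

Because the two-operand shuffle never has to cross a 128-bit lane boundary, this version needs no lane swap or blend; the 256-bit version pays for those extra steps.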
SIMULATION RESULTS
Too much overhead for loading structures
Need to find not only the min/max, but also their positions
[Figure: Runtime Comparison. Bar chart comparing the Original and AVX versions on CSD, MML, CNVP and CLPB; vertical axis from 0 to 1.6.]
CONCLUSION AND FUTURE WORK
OpenCV is well suited to SSE or AVX acceleration
A single task has a better chance of getting a speedup
Loading and rearranging a structure is a really cumbersome task
Hints for smart automated compilation (such as loading structures)
Suggestions for expanding the ISA (introducing new instructions)