http://software.intel.com/en-us/forums/topic/495421
I have continued to investigate why I don't appear to be getting the expected performance improvements from AVX instructions, or from OpenMP.
Considering Jim and Tim's comments about a memory access bottleneck, I have investigated the impact of the memory footprint on the vector and OpenMP performance improvements, with some success.
I have taken a recent OpenMP example I posted and run it for:
- different vector instruction compilation options, and
- varying memory usage footprints.
I have also carried out these tests on the two processors I have available:
- Intel® Xeon® W3520 @ 2.67 GHz, 8 MB cache, 12.0 GB memory, 120 GB SSD
- Intel® Core™ i5-2540M @ 2.60 GHz, 3 MB cache, 8.0 GB memory, 128 GB SSD
For those who know processors, both of these are cheap and have relatively poor performance for their class, so I am investigating what performance improvements can be achieved on low-spec Intel processors.
Apart from the processor class (which determines the available instruction set) and the clock rate, the other important influences on performance are:
- processor cache size (8 MB and 3 MB), and
- memory access rate (1066 MHz and 1333 MHz).
I presume the cache size is fixed by the processor chip, while the memory access rate is chosen by the PC manufacturer? Unfortunately, I am not in a position to test a range of these options, but perhaps others can.
Compiler Options
I have investigated compiler options for vector instructions and for OpenMP; six option combinations have been used. For vector instructions I have used:
- /O2 /QxHost (use the best vector instructions available: AVX on the i5, SSE4.2 on the Xeon)
- /O2 /QxSSE2 (limit vector instructions to SSE2)
- /O2 /Qvec- (no vector instructions)
Each of these has been built with and without /Qopenmp to identify the combined performance improvement that might be possible; the six builds are listed below.
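For reference, the six builds look roughly like this (a sketch, assuming ifort on Windows; the source file name matmul_test.f90 is illustrative, and the labels are my reading of the series names used in the charts):

sp_host:  ifort /O2 /QxHost  matmul_test.f90
sp_sse:   ifort /O2 /QxSSE2  matmul_test.f90
sp_vec:   ifort /O2 /Qvec-   matmul_test.f90
mp_host:  ifort /O2 /QxHost  /Qopenmp  matmul_test.f90
mp_sse:   ifort /O2 /QxSSE2  /Qopenmp  matmul_test.f90
mp_vec:   ifort /O2 /Qvec-   /Qopenmp  matmul_test.f90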
Memory Options
A range of memory footprints from 0.24 MB up to 10 GB has been tested, although the performance levels out once the memory footprint significantly exceeds the cache capacity.
For successive tests, the array dimension N is increased by 25% each time:

x = x * 1.25
n(i) = nint (x/4.) * 4            ! round N to a multiple of 4 so real(8) columns sit on 32-byte boundaries
call matrix_multiply ( n(i), times(:,i), cycles(i) )
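As a quick check on what those footprints mean (assuming the footprint counts the three real(8) N x N arrays A, B and C, which is my reading of the numbers): the footprint is 3 * N**2 * 8 bytes, so 0.24 MB corresponds to N of about 100 and 10 GB to N of roughly 20,000.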
The sample program I am using calculates a matrix multiplication, where (in real(8)) [C] = [C] + [A'].[B]. The advantage of this computation is that OpenMP can be applied at the outer loop, giving the best chance of efficient multi-threading. When it runs, Task Manager always shows about 99% CPU.
For small matrix sizes the matrix multiply is repeated in a cycle loop, with the OpenMP loop inside the cycle loop. This appears to be working, with a target elapsed time of at least 5 seconds (10 billion operations) being achieved. A sketch of the kernel is given below.
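This is not the exact code, just a minimal sketch of the kernel I have described, with OpenMP on the outer loop, dot_product for the inner contraction, and the cycle loop for small sizes; the subroutine name and argument list are illustrative:

subroutine matrix_multiply_sketch ( n, ncycle, A, B, C )
  integer, intent(in)    :: n, ncycle
  real(8), intent(in)    :: A(n,n), B(n,n)
  real(8), intent(inout) :: C(n,n)
  integer :: i, j, icyc

  do icyc = 1, ncycle                  ! repeat small cases so the run lasts long enough to time
!$OMP PARALLEL DO PRIVATE (j)
    do i = 1, n                        ! outer loop is shared among the OpenMP threads
      do j = 1, n
        ! [C] = [C] + [A'].[B] : column i of A dotted with column j of B
        C(i,j) = C(i,j) + dot_product ( A(:,i), B(:,j) )
      end do
    end do
!$OMP END PARALLEL DO
  end do
end subroutine matrix_multiply_sketch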
Idealised Results
From an idealised estimate of the performance improvement:
- SSE2 should provide a 2x improvement over /Qvec-.
- AVX should provide a 4x improvement (assuming 256-bit registers; the matrix dimension has been sized as a multiple of 4 so that the dot_product calls work on 32-byte-aligned data).
- OpenMP should provide a 4x improvement for 4 CPUs.
This potentially implies up to 16x for /QxAVX /Qopenmp combined, although this never occurs!
Actual Results
Results have been assessed on run time, measured with QueryPerformanceCounter.
Performance has also been calculated as Mflops (millions of floating point operations per second), where I have defined a single floating point operation as "s = s + A(k,i)*B(k,j)" (counting the multiply), although this could be described as 2 operations, as there is now little difference in cost between multiplication and addition.
Performance has also been reported as a performance ratio in comparison to /O2 /Qvec- for the same memory size. This gives a relative performance factor for the vector and/or OpenMP improvement. A sketch of the Mflop calculation is given below.
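As a sketch of the Mflop arithmetic (illustrative only: I have used SYSTEM_CLOCK rather than QueryPerformanceCounter to keep the sketch portable, and I am assuming cycles(i) holds the repeat count of the cycle loop):

integer(8) :: tick0, tick1, rate
real(8)    :: seconds, mflops

call system_clock ( tick0, rate )
call matrix_multiply ( n(i), times(:,i), cycles(i) )
call system_clock ( tick1 )

seconds = dble (tick1 - tick0) / dble (rate)
! one "operation" per execution of s = s + A(k,i)*B(k,j), i.e. n^3 per cycle of the kernel
mflops  = dble (n(i))**3 * dble (cycles(i)) / seconds / 1.0d6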
The results show that some performance improvement is being achieved by vector and OpenMP computation, but nowhere near as much as the ideal case.
While OpenMP always shows a 4x increase in CPU usage, the run-time improvement is typically much less. This is best assessed by comparing the OpenMP run-time performance with the single-thread run-time performance; a small worked example follows.
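For example, with illustrative numbers (not my measured results): if the single-thread build takes 20 seconds and the 4-thread build takes 8 seconds, then

speed-up   = 20 s / 8 s      = 2.5x
efficiency = 2.5 / 4 threads ≈ 62%

so ~100% CPU in Task Manager says little about how much of that CPU time is useful work.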
The biggest single influence on the achieved performance improvement is the memory footprint. For the hardware I am using there is little (in fact no!) improvement from AVX instructions once the calculation no longer fits in cache. My previous testing with large memory footprints had appeared to show that I was not getting the AVX instructions to work at all.
I have to ask: does AVX help for non-cached computation? I have not seen it do so on my hardware. Also, if AVX instructions only pay off when working from cache, what is all the talk about alignment, as I do not understand the relationship between the memory alignment and the cache alignment of vectors. (For reference, a sketch of how alignment can be requested is given below.)
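My understanding is that 32-byte alignment can be requested either with a compile option or with directives; a sketch, assuming Intel Fortran (whether the ALIGN attribute also applies to allocatable arrays depends on the compiler version, so treat this part as an assumption):

! compile with:  ifort /O2 /QxHost /align:array32byte ...
! or request alignment for specific arrays:
real(8), allocatable :: A(:,:), B(:,:), C(:,:)
!DIR$ ATTRIBUTES ALIGN : 32 :: A, B, C

! and assert it on the loop the compiler vectorises:
!DIR$ VECTOR ALIGNED
do k = 1, n
   s = s + A(k,i) * B(k,j)
end do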
This AVX behaviour for non-cached calculations may simply be masked by the memory access speed; I need to repeat the tests on hardware with a faster memory access rate. A rough bandwidth estimate below suggests why.
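As a back-of-envelope check (nominal peak figures and several simplifying assumptions, not measurements), take the i5's DDR3-1333 memory and assume dual channel:

peak bandwidth            ≈ 2 channels x 1333 MT/s x 8 bytes ≈ 21 GB/s
traffic per operation     ≈ 8 to 16 bytes once A and B no longer fit in cache
                            (B(k,j) always streamed; A(k,i) too when its column is not reused from cache)
bandwidth-limited ceiling ≈ 21 GB/s / 16 bytes ≈ 1,300 Mflops, or ≈ 2,700 Mflops if only B must be streamed

and that ceiling is the same whether the arithmetic is generated with SSE2 or AVX, which would explain why the AVX benefit disappears for large footprints.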
Reporting
I am preparing some tables and charts of the performance improvements I have identified.
The vertical axis is Mflops or the performance ratio.
The horizontal axis is the memory footprint, plotted on a log scale; the log scale makes the impact of the cache, and the lack of AVX benefit for large memory runs, easy to see.
Summary
- The memory access bottleneck is apparent, but I don't know how to overcome it.
- From the i5 performance, the AVX benefit does not appear to be realised.
- The influence of cache size and memory access rates can be seen in the Mflop charts below.
These tests probably show that notionally better processors are only worthwhile if the main bottleneck on performance is one that the better processor actually addresses. At the stage I have reached, I appear to get minimal benefit from AVX because of the memory access rate limit.
I would welcome any comments on these results, and I hope others can run the tests on alternative hardware configurations, noting the main hardware features identified above, in particular cache size and memory access speed.
i5 Mflop Results
[Chart: Mflop performance for the Core i5 for each compile and memory alternative. Series: mp_host, mp_sse, mp_vec, sp_host, sp_sse, sp_vec. Horizontal axis: Memory Usage (MB), log scale; vertical axis: Mflop performance, 0 to 8,000.]
Xeon Mflop Results
[Chart: Mflop performance for the Xeon for each compile and memory alternative. Series: mp_host, mp_sse, mp_vec, sp_host, sp_sse, sp_vec. Horizontal axis: Memory Usage (MB), log scale; vertical axis: Mflop performance, 0 to 8,000.]
i5 Performance Improvement
[Chart: Performance improvement for the Core i5 for each compile and memory alternative. Series: mp_host, mp_sse, mp_vec, sp_host, sp_sse, sp_vec. Horizontal axis: Memory Usage (MB), log scale 0.1 to 10,000; vertical axis: performance ratio relative to /Qvec-, 0.0 to 8.0.]
Xeon Performance Improvement
[Chart: Performance improvement for the Xeon for each compile and memory alternative. Series: mp_host, mp_sse, mp_vec, sp_host, sp_sse, sp_vec. Horizontal axis: Memory Usage (MB), log scale 0.1 to 10,000; vertical axis: performance ratio relative to /Qvec-, 0.0 to 8.0.]