http://software.intel.com/en-us/forums/topic/495421

I have continued to investigate why I don't appear to get the expected performance improvements from AVX instructions or from OpenMP. Considering Jim and Tim's comments about a memory access bottleneck, I have investigated the impact of memory footprint on vector and OpenMP performance, with some success.

I have taken a recent example I posted of OpenMP usage and run it for:
 - different vector instruction compilation options, and
 - varying memory usage footprints.

I have carried out these tests on the two processors I have available:
 - Intel® Xeon® W3520 @ 2.67 GHz with 8 MB cache, 12.0 GB memory and a 120 GB SSD
 - Intel® Core™ i5-2540M @ 2.60 GHz with 3 MB cache, 8.0 GB memory and a 128 GB SSD

For those who know processors, both of these are cheap and have relatively poor performance for their processor class, so I am investigating the performance improvements that can be achieved on low-spec Intel processors. Apart from processor class (which sets the available instruction set) and clock rate, the other important influences on performance are:
 - processor cache size (8 MB and 3 MB), and
 - memory access rate (1066 MHz and 1333 MHz).

I presume the cache size is defined by the processor chip, while the memory access rate is defined by the PC manufacturer? Unfortunately, I am not able to test a wider range of these options, but perhaps others can.

Compiler Options

I have investigated compiler options for vector instructions and for OpenMP calculation; six option combinations have been used. For vector instructions I have used:
 /O2 /Qxhost   (the best vector instructions available: AVX on the i5, SSE4.2 on the Xeon)
 /O2 /QxSSE2   (limit vector instructions to SSE2)
 /O2 /Qvec-    (no vector instructions)

Each of these has been combined with /Qopenmp to identify the combined performance improvement that could be possible.

Memory Options

A range of memory footprints from 0.24 MB up to 10 GB has been tested, although performance levels out once the memory footprint significantly exceeds the cache capacity. For each successive test, the array dimension N is increased by 25%:

   x = x * 1.25
   n(i) = nint (x/4.) * 4      ! round N to a multiple of 4 (32-byte boundary for real(8))
   call matrix_multiply ( n(i), times(:,i), cycles(i) )

The sample program calculates a matrix multiplication, where (real(8)) [C] = [C] + [A'].[B]. The advantage of this computation is that OpenMP can be applied at the outer loop, providing maximum efficiency for potential multi-processing; a sketch of the loop structure is shown below. When run, it always achieves about 99% CPU in Task Manager. For small matrix sizes the matrix multiply computation is repeated (cycled), with the OpenMP loop inside the cycle loop. This appears to be working, with a target elapsed time of at least 5 seconds (about 10 billion operations) being achieved.

Idealised Results

From an idealised estimate of performance improvement:
 - SSE2 should provide a 2x improvement over /Qvec-.
 - AVX should provide a 4x improvement (assuming 256-bit registers; the matrix size has been made a multiple of 4 so that dot_product calls are on 32-byte alignment).
 - OpenMP should provide a 4x improvement for 4 CPUs.

This potentially implies up to 16x for /QxAVX /Qopenmp, although this never occurs !!
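For reference, the following is a minimal sketch of the structure of the matrix_multiply test routine referred to above. The subroutine name and argument list follow the call shown in the sizing loop; the internals (array initialisation, how the cycle count is chosen, and the use of SYSTEM_CLOCK rather than QueryPerformanceCounter) are my assumptions, not the actual test code. Note that the three N x N real(8) matrices occupy 3*8*N**2 bytes, so the 0.24 MB starting footprint presumably corresponds to N = 100 and the ~10 GB upper end to N around 20,000.

   subroutine matrix_multiply ( n, times, cycles )
      integer,  intent(in)  :: n            ! matrix dimension (a multiple of 4)
      real(8),  intent(out) :: times(2)     ! elapsed and cpu seconds (assumed layout)
      integer,  intent(out) :: cycles       ! number of repeat cycles performed
   !
      real(8), allocatable :: a(:,:), b(:,:), c(:,:)
      integer     :: i, j, ic
      integer(8)  :: t0, t1, tick
      real(8)     :: cpu0, cpu1
   !
      allocate ( a(n,n), b(n,n), c(n,n) )
      call random_number (a) ; call random_number (b) ; c = 0
   !
   !  repeat small cases so each test performs about 10 billion operations
      cycles = max ( 1, nint ( 1.0d10 / dble(n)**3 ) )
   !
      call system_clock ( t0, tick ) ; call cpu_time (cpu0)
      do ic = 1, cycles                     ! cycle loop (outside the OpenMP region)
   !$omp parallel do private (j,i)
         do j = 1, n                        ! OpenMP applied to the outer loop
            do i = 1, n
   !           inner loop is  s = s + a(k,i)*b(k,j)  summed over k
               c(i,j) = c(i,j) + dot_product ( a(:,i), b(:,j) )
            end do
         end do
   !$omp end parallel do
      end do
      call system_clock ( t1, tick ) ; call cpu_time (cpu1)
   !
      times(1) = dble (t1-t0) / dble (tick)    ! elapsed seconds
      times(2) = cpu1 - cpu0                   ! total cpu seconds across all threads
   end subroutine matrix_multiply

Each thread writes to its own columns of [C], so no reduction or critical section is needed in the parallel region.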
Actual Results

Results have been assessed based on run time (QueryPerformanceCounter). Performance has also been calculated as Mflops (millions of floating point operations per second), where I have defined a single floating point operation as "s = s + A(k,i)*B(k,j)", counting only the multiplication, although this could be described as 2 operations since there is now little difference in cost between multiplication and addition. Performance is also reported as a ratio in comparison to /O2 /Qvec- for the same memory size; this gives a relative performance factor for the vector and/or OpenMP improvement. A sketch of these calculations is given at the end of this post.

The results show that some performance improvement is being achieved by vector or OpenMP computation, but not nearly as good as the ideal case. While OpenMP always shows a 4x increase in CPU usage, the run-time improvement is typically much less. This is best assessed by comparing the run-time performance with OpenMP to the single-CPU run-time performance.

The biggest single influence on the achieved performance improvement is the memory footprint. For the hardware I am using there is little (no!) improvement from AVX instructions once the calculation no longer fits in cache. My previous testing with large memory footprints appeared to show that I was not getting the AVX instructions to work. I would have to ask: does AVX help for non-cached computation? I have not shown this occurring on my hardware. Also, if AVX instructions only pay off when operating from cache, what is all the talk about alignment? I do not understand the relationship between memory alignment and cache alignment of vectors. Any AVX benefit for non-cached calculations can be masked by the memory access speed; I need to test on hardware with faster memory access.

Reporting

I am preparing some tables and charts of the performance improvements I have identified. The vertical axis is Mflops or the performance ratio; the horizontal axis is the memory footprint. With the memory footprint plotted on a log scale, the impact of the cache and the lack of AVX benefit for large memory runs are both apparent.

Summary

The memory access bottleneck is apparent, but I don't know how to overcome it. On the i5, the expected AVX performance does not appear to be realised. The influence of cache size and memory access rate can be seen in the Mflop charts below. These tests probably indicate that notionally better processors are only worthwhile if the main bottleneck on performance is related to the improvement those processors provide. At the stage I have tested, I appear to get minimal benefit from AVX due to the memory access rate limit.

I would welcome any comments on these results, and I hope others can run the tests on alternative hardware configurations, noting the main hardware features identified above, including cache size and memory access speed.
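For clarity, this is how the Mflops figure and the performance ratio described under "Actual Results" are computed. The program below is a minimal sketch with illustrative input values only; the variable names and example timings are mine, not taken from the actual test program.

   program performance_measures
      integer  :: n, cycles
      real(8)  :: elapsed, elapsed_novec, mflops, ratio
   !
      n = 1000 ; cycles = 10                     ! example values only
      elapsed = 2.5d0 ; elapsed_novec = 7.5d0    ! example timings only
   !
   !  one floating point operation = "s = s + A(k,i)*B(k,j)", so n**3 operations per multiply
      mflops = dble(n)**3 * dble(cycles) / elapsed / 1.0d6
   !
   !  performance ratio relative to the /O2 /Qvec- run of the same memory size
      ratio  = elapsed_novec / elapsed
   !
      write (*,*) 'Mflops =', mflops, '   ratio vs /Qvec- =', ratio
   end program performance_measures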
i5 Mflop Results

[Chart: Mflop performance for Core i5 for compile and memory alternatives. Vertical axis: Mflop performance (0 to 8,000); horizontal axis: memory usage (MB), log scale; series: mp_host, mp_sse, mp_vec, sp_host, sp_sse, sp_vec.]

Xeon Mflop Results

[Chart: Mflop performance for Xeon for compile and memory alternatives. Same axes and series as above.]

i5 Performance Improvement

[Chart: Performance improvement for Core i5 for compile and memory alternatives. Vertical axis: performance ratio from /Qvec- (0.0 to 8.0); horizontal axis: memory usage (MB), log scale; series: mp_host, mp_sse, mp_vec, sp_host, sp_sse, sp_vec.]

Xeon Performance Improvement

[Chart: Performance improvement for Xeon for compile and memory alternatives. Same axes and series as above.]