Comparing Intel C++ and Microsoft Visual C++ Compilers
Michael Baum, David Boyett, Holly Garrison
Baum, Boyett, & Garrison

Agenda
• Problem Statement
• System Environment
• Programs Used for Comparison
• Matrix Processing Programs Results and Analysis
• SPEC Benchmark Results and Analysis
• Conclusion

Problem Statement
• The general purpose of our project is to verify Intel’s claim that its compiler is 10% better than the Microsoft Visual C++ compiler.
• Data will be gathered with the Intel VTune tool from both SPEC CPU 2000 benchmarks and simple matrix processing programs.

System Environment
• Programs were run on a single-processor system with an Intel P4 2.4 GHz processor and 512 MB RAM.
  – Windows 2000 operating system
• Microsoft Visual .NET compiler
• Intel C++ Compiler 7.1 for Windows
• Intel VTune Performance Analyzer 7.0

Programs Used for Comparison
• SPEC CPU 2000 Benchmark
  – 164.gzip
  – 300.twolf
• Simple Matrix Processing Programs
  – Array Summation of 10000 elements
  – Matrix Multiplication of 250x250 matrices

VTune Setup
• Using Intel’s VTune application, the following events were measured:
  – Instruction Count
  – Clockticks and Clockticks per Instruction
  – Loads & Stores
  – Level 1 cache misses
  – Mispredicted Calls and Branches

Matrix Processing Programs Results
                         Mispredicted  Mispredicted     1st Level                                     Instruction  Clockticks per
Executable (*.exe)              Calls      Branches  Cache Misses       Loads     Stores  Clockticks        Count     Instruction
Array Sum 10000 (Intel)         1,518        22,285        49,890   1,268,145    844,962  18,995,295      981,030           19.36
Array Sum 10000 (VC++)          4,536        39,123       186,760     863,772  1,162,239  13,069,242    1,462,053            8.94
Matrix Mult 250 (Intel)           220         5,132             0           0    657,324   9,502,532    1,979,090            4.80
Matrix Mult 250 (VC++)            289        68,354    18,640,249  31,728,270    657,328  88,513,594   54,242,733            1.63
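The slides name the two simple test programs (array summation of 10000 elements and multiplication of 250x250 matrices) but do not show their source. The following is a minimal sketch of what such kernels might look like; the function names `sum_array` and `multiply`, the `Matrix` alias, and the element types are assumptions made for illustration, not the authors' code.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-ins for the "simple matrix processing programs".
// The original source is not in the slides, so names and types are assumed.

// Sum every element of an array (the slides used 10000 elements).
long long sum_array(const std::vector<int>& data) {
    long long total = 0;
    for (int value : data) {
        total += value;
    }
    return total;
}

using Matrix = std::vector<std::vector<double>>;

// Multiply two n x n matrices with the plain i-j-k triple loop
// (the slides used n = 250). Assumes a and b are square and equal in size.
Matrix multiply(const Matrix& a, const Matrix& b) {
    const std::size_t n = a.size();
    Matrix c(n, std::vector<double>(n, 0.0));
    for (std::size_t i = 0; i < n; ++i) {
        for (std::size_t j = 0; j < n; ++j) {
            double acc = 0.0;
            for (std::size_t k = 0; k < n; ++k) {
                acc += a[i][k] * b[k][j];
            }
            c[i][j] = acc;
        }
    }
    return c;
}
```

Kernels of this size consist of a single loop nest, so the quality of the code a compiler generates for that one loop largely determines the measured event counts.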
[Figure: Results for Array Summation Program. Chart with a logarithmic axis (1 to 100,000,000) comparing Array Sum 10000 (Intel) and Array Sum 10000 (VC++) on Clockticks per Instruction, Instruction Count, Clockticks, Stores, Loads, 1st Level Cache Misses, Mispredicted Branches, and Mispredicted Calls.]

[Figure: Results for Matrix Multiplication Program. Chart with a logarithmic axis (1 to 100,000,000) comparing Matrix Mult 250 (Intel) and Matrix Mult 250 (VC++) on the same eight measurements.]

Matrix Processing Analysis
• For simple matrix and array processing, the results supported Intel’s claim of a 10% better compiler.
  – With the exception of the number of Stores executed, the Intel compiler showed approximately a 50% savings in the measured operations.
• The Matrix Multiplication program showed one noteworthy result: the Intel compiler had zero events for both 1st Level Cache Misses and Loads.
  – Verified by multiple builds and runs

SPEC Benchmark Results
                    Mispredicted  Mispredicted      1st Level                                                      Instruction  Clockticks per
Executable (*.exe)         Calls      Branches   Cache Misses           Loads          Stores       Clockticks           Count     Instruction
164.gzip (Intel)          11,725   871,754,172  2,267,577,936  22,054,374,342  11,101,416,840  106,412,563,515  76,670,596,520            1.39
164.gzip (VC++)            7,695   869,317,015  2,273,066,852  22,074,844,248  11,108,909,049  107,286,054,470  76,671,138,915            1.40
300.twolf (Intel)            346     4,874,982      7,639,211      77,060,025      32,577,657      484,933,215     210,922,988            2.30
300.twolf (VC++)             537     4,797,552      7,526,588      76,831,638      33,214,416      473,946,742     211,425,444            2.24

[Figure: Results for 164.gzip Program. Chart with a logarithmic axis (1 to 1,000,000,000,000) comparing 164.gzip (Intel) and 164.gzip (VC++) on Clockticks per Instruction, Instruction Count, Clockticks, Stores, Loads, 1st Level Cache Misses, Mispredicted Branches, and Mispredicted Calls.]

[Figure: Results for 300.twolf Program. Chart with a logarithmic axis (1 to 1,000,000,000) comparing 300.twolf (Intel) and 300.twolf (VC++) on the same eight measurements.]

SPEC CPU 2000 Analysis
• SPEC CPU 2000 Benchmarks did not show any significant difference between the two compilers.
• SPEC Benchmarks were re-compiled and data sets were collected multiple times to verify the validity of the original data.

Conclusions
• Even though our group saw significant improvements in performance for our small test programs, these same gains could not be duplicated for the Benchmark applications.
• These variations might be the result of differences in program complexity.
• The Intel C++ Compiler showed results that were equal to, or in some cases better than, those of Microsoft Visual C++.
• While Intel’s claim of 10% better results may not hold in all cases, the Intel compiler is still the superior compiler.
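The comparison against the 10% claim comes down to simple arithmetic on the measured counts. The sketch below shows one way to express it; the helper names are hypothetical, and treating a lower event count (for example Clockticks) as "better" is an assumption, since the slides do not say which metric Intel's claim refers to.

```cpp
#include <cassert>
#include <cmath>

// Clockticks per instruction, the derived column in the results tables.
double clockticks_per_instruction(double clockticks, double instructions) {
    return clockticks / instructions;
}

// Percentage by which the Intel build's count is lower than the VC++ build's.
// Positive means the Intel build recorded fewer events; negative means more.
double percent_reduction(double intel, double vcpp) {
    return (vcpp - intel) / vcpp * 100.0;
}

// Hypothetical check of the "10% better" claim for a single metric.
bool meets_ten_percent_claim(double intel, double vcpp) {
    return percent_reduction(intel, vcpp) >= 10.0;
}
```

Applied to the Clockticks counts reported in the slides, `percent_reduction(106412563515.0, 107286054470.0)` comes out just under 1% for 164.gzip, while `percent_reduction(9502532.0, 88513594.0)` is roughly 89% for the matrix multiplication program. That gap is the contrast the conclusions describe between the small test programs and the benchmark applications.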