Comparing Intel C++ and Microsoft Visual C++ Compilers
Michael Baum
David Boyett
Holly Garrison
Baum, Boyett, & Garrison
Agenda
• Problem Statement
• System Environment
• Programs Used for Comparison
• Matrix Processing Programs Results and Analysis
• SPEC Benchmark Results and Analysis
• Conclusion
Problem Statement
• The general purpose of our project is to verify
Intel’s claim that its compiler is 10% better than
the Microsoft Visual C++ compiler.
• Data will be gathered with the Intel VTune Performance
Analyzer from both SPEC CPU 2000 benchmarks and
simple matrix processing programs.
System Environment
• Programs were run on a single-processor system with an
Intel Pentium 4 2.4 GHz processor and 512 MB of RAM.
– Windows 2000 operating system
• Microsoft Visual C++ .NET compiler
• Intel C++ Compiler 7.1 for Windows
• Intel VTune Performance Analyzer 7.0
Programs Used for Comparison
• SPEC CPU 2000 Benchmark
– 164.gzip
– 300.twolf
• Simple Matrix Processing Programs
– Array Summation of 10000 elements
– Matrix Multiplication of 250x250 matrices
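The slides do not include the source of the two simple test programs. As a rough illustration only, kernels of this kind might look like the sketch below. The function names, element types, and row-major layout are assumptions, and the code is written in present-day C++ rather than the dialect the 2003-era compilers accepted.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical array-summation kernel: adds every element with a plain loop.
long long array_sum(const std::vector<int>& values) {
    long long total = 0;
    for (int v : values) {
        total += v;
    }
    return total;
}

// Hypothetical naive O(n^3) multiplication of two n x n matrices stored
// in row-major order; returns the n x n product.
std::vector<double> matrix_multiply(const std::vector<double>& a,
                                    const std::vector<double>& b,
                                    std::size_t n) {
    assert(a.size() == n * n && b.size() == n * n);
    std::vector<double> c(n * n, 0.0);
    for (std::size_t i = 0; i < n; ++i) {
        for (std::size_t j = 0; j < n; ++j) {
            double acc = 0.0;
            for (std::size_t k = 0; k < n; ++k) {
                acc += a[i * n + k] * b[k * n + j];
            }
            c[i * n + j] = acc;
        }
    }
    return c;
}
```

To mirror the sizes on this slide, array_sum would be called on a 10000-element vector and matrix_multiply with n = 250.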
VTune Setup
• Using Intel’s VTune application, the following
events were measured:
– Instruction Count
– Clockticks and Clockticks per Instruction
– Loads & Stores
– Level 1 cache misses
– Mispredicted Calls and Branches
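Clockticks per Instruction can be derived from two of the other measurements: it is the clocktick count divided by the instruction count. A minimal sketch of that arithmetic follows. The function name is hypothetical and not part of VTune.

```cpp
#include <cassert>
#include <cstdint>

// Clockticks per instruction: total clockticks divided by the
// instruction count. Lower values mean fewer cycles per instruction.
double clockticks_per_instruction(std::uint64_t clockticks,
                                  std::uint64_t instructions) {
    assert(instructions > 0);  // guard against division by zero
    return static_cast<double>(clockticks) /
           static_cast<double>(instructions);
}
```

For example, 18,995,295 clockticks over 981,030 instructions gives roughly 19.36, the value reported for Array Sum 10000 (Intel).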
Matrix Processing Programs Results
| Executable (*.exe) | Mispredicted Calls | Mispredicted Branches | 1st Level Cache Misses | Loads | Stores | Clockticks | Instruction Count | Clockticks per Instruction |
|---|---|---|---|---|---|---|---|---|
| Array Sum 10000 (Intel) | 1,518 | 22,285 | 49,890 | 1,268,145 | 844,962 | 18,995,295 | 981,030 | 19.36 |
| Array Sum 10000 (VC++) | 4,536 | 39,123 | 186,760 | 863,772 | 1,162,239 | 13,069,242 | 1,462,053 | 8.94 |
| Matrix Mult 250 (Intel) | 220 | 5,132 | 0 | 0 | 657,324 | 9,502,532 | 1,979,090 | 4.80 |
| Matrix Mult 250 (VC++) | 289 | 68,354 | 18,640,249 | 31,728,270 | 657,328 | 88,513,594 | 54,242,733 | 1.63 |
Matrix Processing Programs Results (cont.)
[Chart: Results for Array Summation Program. Logarithmic axis from 1 to 100,000,000. Series: Array Sum 10000 (Intel) and Array Sum 10000 (VC++). Categories: Clockticks per Instruction, Instruction Count, Clockticks, Stores, Loads, 1st Level Cache Misses, Mispredicted Branches, Mispredicted Calls.]
Matrix Processing Programs Results (cont.)
[Chart: Results for Matrix Multiplication Program. Logarithmic axis from 1 to 100,000,000. Series: Matrix Mult 250 (Intel) and Matrix Mult 250 (VC++). Categories: Clockticks per Instruction, Instruction Count, Clockticks, Stores, Loads, 1st Level Cache Misses, Mispredicted Branches, Mispredicted Calls.]
Matrix Processing Analysis
• For simple matrix and array processing, the results
supported Intel’s claim of a 10% better compiler.
– With the exception of the number of Stores executed, the Intel
compiler showed approximately a 50% savings in the measured
operations.
• The Matrix Multiplication program showed one
noteworthy result: the Intel compiler had zero events for
both 1st Level Cache Misses and for Loads.
– Verified by multiple builds and runs
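The savings discussed above can be expressed as a percentage reduction relative to the Visual C++ count. The sketch below shows that calculation. The function name and the choice of the VC++ figure as the baseline are assumptions for illustration, not something the slides specify.

```cpp
#include <cassert>
#include <cstdint>

// Percentage by which `candidate` is lower than `baseline`.
// Positive: the candidate recorded fewer events. Negative: it recorded more.
double percent_savings(std::uint64_t baseline, std::uint64_t candidate) {
    assert(baseline > 0);  // guard against division by zero
    const double base = static_cast<double>(baseline);
    return 100.0 * (base - static_cast<double>(candidate)) / base;
}
```

Applied to the Matrix Mult 250 clockticks (88,513,594 for VC++ and 9,502,532 for Intel), this gives a reduction of roughly 89%.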
SPEC Benchmark Results
| Executable (*.exe) | Mispredicted Calls | Mispredicted Branches | 1st Level Cache Misses | Loads | Stores | Clockticks | Instruction Count | Clockticks per Instruction |
|---|---|---|---|---|---|---|---|---|
| 164.gzip (Intel) | 11,725 | 871,754,172 | 2,267,577,936 | 22,054,374,342 | 11,101,416,840 | 106,412,563,515 | 76,670,596,520 | 1.39 |
| 164.gzip (VC++) | 7,695 | 869,317,015 | 2,273,066,852 | 22,074,844,248 | 11,108,909,049 | 107,286,054,470 | 76,671,138,915 | 1.40 |
| 300.twolf (Intel) | 346 | 4,874,982 | 7,639,211 | 77,060,025 | 32,577,657 | 484,933,215 | 210,922,988 | 2.30 |
| 300.twolf (VC++) | 537 | 4,797,552 | 7,526,588 | 76,831,638 | 33,214,416 | 473,946,742 | 211,425,444 | 2.24 |
SPEC Benchmark Results
[Chart: Results for 164.gzip Program. Logarithmic axis from 1 to 1,000,000,000,000. Series: 164.gzip (Intel) and 164.gzip (VC++). Categories: Clockticks per Instruction, Instruction Count, Clockticks, Stores, Loads, 1st Level Cache Misses, Mispredicted Branches, Mispredicted Calls.]
SPEC Benchmark Results
[Chart: Results for 300.twolf Program. Logarithmic axis from 1 to 1,000,000,000. Series: 300.twolf (Intel) and 300.twolf (VC++). Categories: Clockticks per Instruction, Instruction Count, Clockticks, Stores, Loads, 1st Level Cache Misses, Mispredicted Branches, Mispredicted Calls.]
SPEC CPU 2000 Analysis
• The SPEC CPU 2000 benchmarks did not show any
significant difference between the two compilers:
clockticks differed by roughly 1% for 164.gzip and
2% for 300.twolf.
• SPEC Benchmarks were re-compiled and data sets
were collected multiple times to verify the validity
of the original data.
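One way to test the measurements against the 10% claim is to check whether one compiler's build needs at least 10% fewer clockticks than the other's. The sketch below does that. Treating clockticks as the measure of "better" is an assumption, and the function name is hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// True when `candidate` uses at least `claimed_percent` fewer clockticks
// than `baseline`. Counts are 64-bit because the SPEC runs exceed the
// 32-bit range (164.gzip records over 100 billion clockticks).
bool meets_claim(std::uint64_t baseline_clockticks,
                 std::uint64_t candidate_clockticks,
                 double claimed_percent) {
    assert(baseline_clockticks > 0);  // guard against division by zero
    const double base = static_cast<double>(baseline_clockticks);
    const double gain =
        100.0 * (base - static_cast<double>(candidate_clockticks)) / base;
    return gain >= claimed_percent;
}
```

The two 164.gzip clocktick counts (106,412,563,515 and 107,286,054,470) differ by under 1%, so the check returns false for a 10% claim whichever one is taken as the baseline.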
Conclusions
• Even though our group saw significant
improvements in performance for our small test
programs, these same gains could not be duplicated
for the SPEC benchmark applications.
• These variations might be the result of differences
in program complexity.
Conclusions (cont.)
• The Intel C++ Compiler showed results that were
equal to or in some cases better than those of
Microsoft Visual C++.
• While Intel’s claim of 10% better results may not
hold in all cases, its compiler is still the superior one.