What Programming Language/Compiler Researchers should Know about Computer Architecture

The University of Texas at Austin
Lizy Kurian John
Department of Electrical and Computer Engineering
The University of Texas at Austin
Lizy Kurian John, LCA, UT Austin
Somebody once said
“Computers are dumb actors and
compilers/programmers are the
master playwrights.”
Computer Architecture Basics
• ISAs
• RISC vs CISC
• Assembly language coding
• Datapath (ALU) and controller
• Pipelining
• Caches
• Out-of-order execution
Hennessy and Patterson architecture books
Basics
• ILP
• DLP
• TLP
• Massive parallelism
• SIMD/MIMD
• VLIW
• Performance and power metrics
Hennessy and Patterson architecture books
ASPLOS, ISCA, MICRO, HPCA
The Bottom Line
Programming language choice
affects performance and power,
e.g., Java
Compilers affect performance and
power
A Java Hardware Interpreter
[Diagram: a pipeline with Fetch, Decode, and Execute stages and a hardware bytecode translator. Inputs are bytecodes from a Java class file and a native executable; the translator converts bytecodes into native machine instructions.]
• Radhakrishnan, Ph.D. 2000 (ISCA 2000, ICS 2001)
• This technique is used by Nazomi Communications and
Parthus (Chicory Systems)
[Figure: HardInt Performance (4-way performance). Bar chart of execution cycles (millions) for db, javac, jess, mpeg, and mtrt under JDK 1.1.6 Interpreter, JDK 1.1.6 JIT, JDK 1.2 Interpreter, JDK 1.2 JIT, and Hard-Int.]
• Hard-Int performs consistently better than the interpreter
• In JIT mode, significant performance boost in 4 of 5
applications.
Compiler and Power
[Figure: a data dependence graph (DDG) over instructions A through F, with two schedules of it across Cycles 1 to 4.]
Schedule 1: Peak Power = 3, Energy = 6
Schedule 2: Peak Power = 2, Energy = 6
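The comparison on this slide can be reproduced with a short sketch. It assumes a unit-energy model: each instruction costs one unit of energy in the cycle where it issues, so per-cycle power is the number of instructions issued in that cycle and energy is the total instruction count. The dependence edges and both cycle assignments below are hypothetical stand-ins chosen to match the reported totals; they are not the exact graph or schedules from the figure.

```python
# Hypothetical DDG: each instruction maps to the instructions it depends on.
DEPS = {"A": [], "B": [], "E": [], "C": ["A"], "D": ["B"], "F": ["C", "D"]}

# Two hypothetical four-cycle schedules: one list of instructions per cycle.
PEAKY_SCHEDULE = [["A", "B", "E"], ["C"], ["D"], ["F"]]
BALANCED_SCHEDULE = [["A", "B"], ["C", "E"], ["D"], ["F"]]


def is_legal(schedule, deps):
    """True if every instruction issues exactly once, after all its predecessors."""
    cycle_of = {ins: c for c, group in enumerate(schedule) for ins in group}
    issued = sum(len(group) for group in schedule)
    if issued != len(deps) or set(cycle_of) != set(deps):
        return False
    return all(cycle_of[p] < cycle_of[i] for i, preds in deps.items() for p in preds)


def peak_power(schedule):
    """Largest number of instructions issued in any single cycle."""
    return max(len(group) for group in schedule)


def energy(schedule):
    """Total instructions issued, at one energy unit each."""
    return sum(len(group) for group in schedule)


if __name__ == "__main__":
    for name, sched in [("peaky", PEAKY_SCHEDULE), ("balanced", BALANCED_SCHEDULE)]:
        print(name, is_legal(sched, DEPS), peak_power(sched), energy(sched))
```

Under this model the scheduler cannot change energy, since the same six instructions execute either way, but it can lower peak power by spreading independent instructions across cycles.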
Valluri et al., HPCA workshop, 2001
• Quantitative study
• Influence of state-of-the-art optimizations on
energy and power of the processor examined
• Optimizations studied
  • Standard -O1 to -O4 of DEC Alpha’s cc compiler
  • Four individual optimizations: simple basic-block
    instruction scheduling, loop unrolling, function
    inlining, and aggressive global scheduling
Standard Optimizations on Power
(all values normalized to O0 = 100)

Benchmark  Opt level  Energy  Exec Time  Insts   Avg Power  IPC
compress   O0         100     100        100     100        100
compress   O1         74.48   81.55      81.52   91.33      99.96
compress   O2         75.13   81.44      82.04   92.25      100.73
compress   O3         75.13   81.44      82.04   92.25      100.73
compress   O4         79.01   82.77      86.11   95.45      104.03
go         O0         100     100        100     100        100
go         O1         66.2    64.13      68.94   103.23     107.5
go         O2         62.62   61.31      63.01   102.14     102.78
go         O3         62.62   61.31      63.01   102.14     102.78
go         O4         63.67   62.19      63.75   102.38     102.51
li         O0         100     100        100     100        100
li         O1         81.32   83.66      83.18   97.2       99.42
li         O2         79.6    75.97      82.97   104.78     109.21
li         O3         79.6    75.97      82.97   104.78     109.21
li         O4         85.71   77.89      90.96   110.05     116.78
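The columns on this slide are tied together by two identities: energy is average power times execution time, and IPC is instruction count divided by cycle count, which tracks execution time at a fixed clock. A minimal check on the compress rows, treating every value as a percentage of the O0 baseline; the function names and the tolerance are illustrative choices, not part of the study.

```python
# (energy, exec_time, insts, avg_power, ipc) for compress, each as % of O0.
COMPRESS = {
    "O1": (74.48, 81.55, 81.52, 91.33, 99.96),
    "O2": (75.13, 81.44, 82.04, 92.25, 100.73),
    "O4": (79.01, 82.77, 86.11, 95.45, 104.03),
}


def energy_from(avg_power, exec_time):
    """Relative energy implied by relative power and time (all in %)."""
    return avg_power * exec_time / 100.0


def ipc_from(insts, exec_time):
    """Relative IPC implied by relative instruction count and time (all in %)."""
    return insts / exec_time * 100.0


def max_mismatch(rows):
    """Largest gap between a reported column and the value the identities imply."""
    worst = 0.0
    for energy, exec_time, insts, avg_power, ipc in rows.values():
        worst = max(
            worst,
            abs(energy_from(avg_power, exec_time) - energy),
            abs(ipc_from(insts, exec_time) - ipc),
        )
    return worst


if __name__ == "__main__":
    print("largest mismatch in percentage points:", max_mismatch(COMPRESS))
```

Read this way, the compress rows show energy falling mostly because execution time falls, while average power stays within about 10% of the baseline.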
Somebody once said
“Computers are dumb actors and
compilers/programmers are the
master playwrights.”
A large part of modern out-of-order
processors is hardware that could have
been eliminated if a good compiler
existed.
Let me get more arrogant
A large part of modern out-of-order
processors was designed because
computer architects thought compiler
writers could not do a good job.
Value Prediction
Is a slap in your face
Shen and Lipasti
Value Locality
• Likelihood that an instruction’s
computed result, or a similar predictable
result, will occur soon
• Observation – a limited set of unique
values constitutes the majority of values
produced and consumed during
execution
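One way to make this definition concrete is to measure, for each static instruction, how often its result matches one of the last few values it produced. A minimal sketch, assuming a trace supplied as (pc, value) pairs and a per-instruction history of configurable depth; the trace format, the function name, and the sample trace are illustrative and not taken from the slides.

```python
from collections import defaultdict, deque


def value_locality(trace, depth=1):
    """Fraction of results matching one of the last `depth` distinct values
    produced by the same static instruction (identified by its pc)."""
    history = defaultdict(lambda: deque(maxlen=depth))
    hits = 0
    total = 0
    for pc, value in trace:
        seen = history[pc]
        total += 1
        if value in seen:
            hits += 1
            seen.remove(value)  # re-appended below as the most recent value
        seen.append(value)      # a full deque evicts its oldest entry
    return hits / total if total else 0.0


if __name__ == "__main__":
    # Hypothetical trace: one load that mostly returns 0, one that alternates.
    sample = [(0x10, 0), (0x10, 0), (0x10, 0), (0x10, 5),
              (0x20, 1), (0x20, 2), (0x20, 1), (0x20, 2)]
    print(value_locality(sample, depth=1))
    print(value_locality(sample, depth=2))
```

A deeper history captures instructions that cycle through a small set of values, which a depth of one misses.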
Load Value Locality
Causes of value locality
• Data redundancy – many 0s, sparse matrices,
white space in files, empty cells in
spreadsheets
• Program constants
• Computed branches – the base address for jump
tables is a run-time constant
• Virtual function calls – involve code to load a
function pointer, which can be constant
Causes of value locality (contd.)
• Memory alias resolution – the compiler
conservatively generates code that may contain
stores that alias with loads
• Register spill code – stores and subsequent
loads
• Convergent algorithms – convergence in parts
of algorithms before global convergence
• Polling algorithms
2 Extremist Views
Anything that can be done in
hardware should be done in
hardware.
Anything that can be done in
software should be done in
software.
What do we need?
The dumb actor
or
the defiant actor, who pays very
little attention to the script?
Challenging all compiler
writers
The last 15 years were the defiant actor’s era.
What about the next 15? TLP, multithreading,
parallelizing compilers: it’s time for a lot
more dumb acting from the architect’s side.
And it’s time for some good scriptwriting from
the compiler writer’s side.
BACKUP
Compiler Optimizations
• cc – native C compiler on DEC Alpha
21064 running the OSF/1 operating system
• gcc – used to study the effect of
individual optimizations
Standard Optimization Levels in cc
-O0 – No optimizations performed
-O1 – Local optimizations such as CSE, copy
propagation, IVE, etc.
-O2 – Inline expansion of static procedures and
global optimizations such as loop unrolling and
instruction scheduling
-O3 – Inline expansion of global procedures
-O4 – Software pipelining, loop vectorization, etc.
Standard Optimization Levels in
gcc
-O0 – No optimizations performed
-O1 – Local optimizations such as CSE, copy
propagation, dead-code elimination, etc.
-O2 – Aggressive instruction scheduling
-O3 – Inlining of procedures
NOTE:
• Almost the same optimizations in each level of cc and gcc
• In cc and gcc, optimizations that increase ILP are in
levels -O2, -O3, and -O4
• cc used wherever possible; gcc used where
specific hooks are required
Individual Optimizations
• Four gcc optimizations, all
applied on top of -O1
  • -fschedule-insns – local register allocation
    followed by basic-block list scheduling
  • -fschedule-insns2 – postpass scheduling
  • -finline-functions – integrates all simple
    functions into their callers
  • -funroll-loops – performs
    loop unrolling
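The four configurations on this slide can be written down as compiler command lines. A minimal sketch that only builds the argument lists, assuming a hypothetical source file `bench.c` and a made-up output naming scheme; the flags themselves are the gcc options named on the slide.

```python
BASE = ["gcc", "-O1"]
INDIVIDUAL_FLAGS = [
    "-fschedule-insns",
    "-fschedule-insns2",
    "-finline-functions",
    "-funroll-loops",
]


def build_commands(source="bench.c"):
    """Return {flag: argv}: one gcc invocation per optimization on top of -O1.

    `source` and the output names are hypothetical placeholders.
    """
    commands = {}
    for flag in INDIVIDUAL_FLAGS:
        # Replace only the leading "-f", so -funroll-loops becomes bench_unroll-loops.
        output = "bench" + flag.replace("-f", "_", 1)
        commands[flag] = BASE + [flag, source, "-o", output]
    return commands


if __name__ == "__main__":
    for argv in build_commands().values():
        print(" ".join(argv))
```

Each list can be passed to `subprocess.run(argv, check=True)` to produce one binary per optimization, so that every measurement differs from the -O1 baseline by a single flag.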
Some observations
• Energy consumption falls when the number of
instructions executed falls, i.e., when the
total work done is less, energy is less
• Power dissipation is directly proportional
to IPC
Observations (contd.)
• Function inlining was found to be good
for both power and energy
• Unrolling was found to be good for
energy consumption but bad for power
dissipation
MMX/SIMD
Automatic use of SIMD ISA extensions is still
difficult 10+ years after the
introduction of MMX.
Standard Optimizations on Power
(Contd)
(all values normalized to O0 = 100)

Benchmark  Opt level  Energy  Exec Time  Insts   Avg Power  IPC
saxpy      O0         100     100        100     100        100
saxpy      O1         97.38   100.24     92.49   97.15      92.27
saxpy      O2         97.69   99.38      92.49   98.3       93.07
saxpy      O3         97.69   99.38      92.49   98.3       93.07
saxpy      O4         98.31   99.27      92.84   99.02      93.51
su2cor     O0         100     100        100     100        100
su2cor     O1         42.09   51.04      33.21   82.46      65.06
su2cor     O2         40.99   47.52      33.1    86.28      69.67
su2cor     O3         40.99   46.37      33.1    87.65      71.38
swim       O0         100     100        100     100        100
swim       O1         30.1    36.64      20.01   82.15      54.63
swim       O2         28.93   34.01      19.05   85.06      56.01
swim       O3         28.93   34.01      19.05   85.06      56.01