COP5622 Advanced Topics in Compilation

Advanced Topics in
Compilation
COP5622
Prof. Robert van Engelen
Fall 2005
Syllabus
• Lectures: Mon & Wed, 2:00PM, 103LOV
• Prerequisites: COP5621
• Instructor: Prof. Robert van Engelen
• Office: 471DSL
• Office hours: Tue 1:00PM
• Email: [email protected]
• Web page: http://www.cs.fsu.edu/~engelen/courses/COP5622
Books
… Embedded Computing?
“If you round off the fractions, embedded systems consume
100% of the worldwide production of microprocessors.”
- Jim Turley, Editor, Computer Industry Analyst
There is no sharp line between embedded and
general-purpose computing.
Modern (embedded) CPUs combine general-purpose and DSP features, mutatis mutandis.
The “Center of Gravity” of
Computing
Category | Mainframes | Minicomputers | Desktop systems | Smart products
Era | 1950s | 1970s | 1980s | 2000s
Form factor | Multi-cabinet | Multiple boards | Single board | Single chip
Resource type | Corporate | Departmental | Personal | Embedded
Users per CPU | 100s-1000s | 10s-100s | 1 user | 100s CPUs/user
Typical system cost | $1 million+ | $100,000s+ | $1,000-$10,000s | $10-$100
Worldwide units | 10,000s+ | 100,000+ | 100,000,000s | 100,000,000,000s
Major platforms | IBM, CDC, Burroughs, Sperry, GE, Honeywell, Univac, NCR | DEC, IBM, Prime, Wang, HP, Pyramid, Data General, … | Apple, IBM, Compaq, Sun, HP, SGI, Dell, … | ?
Operating systems | By manufacturer | By manufacturer, some Unix | DOS, MacOS, Windows, Unix/Linux | ?
Source: J. Fisher, P. Faraboschi, C. Young
Superscalar, VLIW, and EPIC
Name | Issue structure | Hazard detection | Scheduling | Distinguishing characteristic | Examples
Superscalar (static) | Dynamic | Hardware | Static | In-order execution | Sun UltraSparc II/III
Superscalar (dynamic) | Dynamic | Hardware | Dynamic | Some out-of-order execution | IBM Power2
Superscalar (speculative) | Dynamic | Hardware | Dynamic with speculation | Out-of-order execution with speculation | Pentium III/4, MIPS R10K, Alpha 21264, HP PA 8500, IBM RS64III
VLIW/LIW | Static | Software | Static | No hazards between issue packets | Trimedia, i860
EPIC | Mostly static | Mostly software | Mostly static | Explicit dependences marked by compiler | Itanium, Itanium2
Source: J. Hennessy & D. Patterson
Superscalar versus VLIW
Source: J. Fisher, P. Faraboschi, C. Young
HP PA-8000
Source: J. Fisher, P. Faraboschi, C. Young
• Instruction reorder buffer is used to issue operations to the execution units
• Operations are scheduled out-of-order
• Instruction reorder buffer takes prime real estate
Role of the Compiler for
Superscalar and VLIW
Processor style:
  Sequential architectures: Superscalar
  Dependence architectures: Dataflow
  Independence architectures: VLIW
Dependence information in the program:
  Sequential architectures: Implicit in register names
  Dependence architectures: An exact description of all dependence information
  Independence architectures: Description of operations that are independent
How dependent operations are typically exposed:
  Sequential architectures: By the hardware’s control unit
  Dependence architectures: By the compiler (they are embedded in the program)
  Independence architectures: By the compiler (they are implicit in the program)
How independent operations are typically exposed:
  Sequential architectures: By the hardware’s control unit
  Dependence architectures: By the hardware’s control unit
  Independence architectures: By the compiler (they are embedded in the program)
Where scheduling is typically performed:
  Sequential architectures: In the hardware’s control unit
  Dependence architectures: In the hardware’s control unit
  Independence architectures: In the compiler
Role of the compiler:
  Sequential architectures: Rearranges code to make ILP more evident and accessible
  Dependence architectures: Replaces some of the hardware
  Independence architectures: Replaces virtually all hardware dedicated to ILP exposure and scheduling
Source: J. Fisher, P. Faraboschi, C. Young
So, What’s Next?
• Some evidence things are about to change:
– Intel announced radically redesigned x86 core
– Apple decided to adopt Intel cores
– Steve Jobs: Performance per Watt must increase
• PowerPC: 15 computation units per Watt
• New Intel x86 core: 70 computation units per Watt
– Out-of-order hardware is power hungry
– VLIW shown to reduce power consumption
– New compiler technology (Elbrus) bought by Intel
• Conclusion…
Can VLIW Really Compete
with Superscalar?
“A fanatic is one who can’t change his mind and won’t
change the subject.”
- [attributed to] Sir Winston S. Churchill, British Prime Minister
“Transmeta and Itanium not living up to promises”
Fallacy: VLIW controls everything in software.
Fallacy: VLIWs require “Heroic Compilers” to
do what superscalars do in hardware.
Embedded System
Complexity and Cost
“Any intelligent fool can make things bigger, more complex,
and more violent. It takes a touch of genius -- and a lot of
courage -- to move in the opposite direction.”
- Ernst F. Schumacher, German Economist, 1910-1977
In embedded systems, smaller is often better
• Silicon cost scales with cube of area
• Low-power constraints
• Also custom designs, e.g. ASIC (> 100,000
units), FPGA (low volume), ASIP, SoPC
VLIW and ILP
• ILP has significant impact on performance,
which is important for high-end systems
– Superscalar
– VLIW/EPIC
• For embedded systems, ILP yields
performance gains and power savings (lower
clock rate), provided that ILP implementation
is low-cost and low-power
– VLIW
– DSP and custom
Example (VEX)
Source: J. Fisher, P. Faraboschi, C. Young
Example (Compacted VEX)
Source: J. Fisher, P. Faraboschi, C. Young
ILP Compilers
“A worker may be the hammer’s master, but the hammer still
prevails. A tool knows exactly how it is meant to be handled,
while the user of the tool can only have an approximate
idea.”
- Milan Kundera, Czech writer, 1929-
An ILP compiler is possibly the largest
investment in engineering effort in a VLIW
system.
Compiler in the Embedded
Toolchain Workflow
Compilation with Profiling
What is Important in an ILP
Compiler?
• Parallelism is key to performance,
price/performance, power, and cost
• One-design-fits-all compilers (gcc) cannot
optimize well over all platforms
• Compiler technology lags behind hardware
• Back-end optimizations are crucial for ILP
– Responsible for finding and organizing parallelism
– May need multiple intermediate representations
(IRs)
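To illustrate what "finding and organizing parallelism" can mean in a back end, here is a minimal sketch of greedy list scheduling. It is not taken from the lecture: the function name `list_schedule`, the operation names, and the unit-latency assumption are all hypothetical choices made for illustration.

```python
def list_schedule(ops, deps, width):
    """Pack operations into issue packets of at most `width` operations.

    Assumptions (illustrative only): every operation has unit latency, and
    `deps` maps an operation to the set of operations it must wait for.
    """
    remaining = list(ops)
    done = set()
    packets = []
    while remaining:
        # An operation is ready once all of its predecessors have issued
        # in an earlier packet, so ops in the same packet are independent.
        ready = [op for op in remaining if deps.get(op, set()) <= done]
        if not ready:
            raise ValueError("cyclic dependences")
        packet = ready[:width]
        packets.append(packet)
        done.update(packet)
        remaining = [op for op in remaining if op not in packet]
    return packets


ops = ["load_a", "load_b", "add", "mul", "store"]
deps = {"add": {"load_a", "load_b"}, "mul": {"load_a"}, "store": {"add", "mul"}}
wide = list_schedule(ops, deps, width=2)
narrow = list_schedule(ops, deps, width=1)
print(len(wide), "packets at width 2;", len(narrow), "packets at width 1")
```

A production scheduler would also model latencies, functional-unit types, and register pressure; the sketch only shows how a compiler can group independent operations statically.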
Structure of an ILP Compiler
Embedded-Specific Tradeoffs
for Compilers
• Space, time, and energy tradeoffs
• These are contrasting goals for the compiler
• Ideally, a compiler should expose some
linear combination of the optimization
dimensions to application developers:
K1{speed} + K2{code_size} + K3{energy_efficiency}
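The linear combination above can be sketched as a scoring function that picks among compiled variants of the same program. This is a hedged illustration: the variant names and their numbers are invented, and each metric is assumed to be a normalized score in [0, 1] where higher is better (so `code_size` here means compactness).

```python
def score(variant, k1, k2, k3):
    """Weighted sum K1*speed + K2*code_size + K3*energy_efficiency."""
    return (k1 * variant["speed"]
            + k2 * variant["code_size"]
            + k3 * variant["energy_efficiency"])


def pick_variant(variants, k1, k2, k3):
    """Return the name of the variant with the highest weighted score."""
    return max(variants, key=lambda name: score(variants[name], k1, k2, k3))


# Hypothetical measurements for two builds of the same program.
variants = {
    "unrolled": {"speed": 1.0, "code_size": 0.4, "energy_efficiency": 0.6},
    "compact": {"speed": 0.7, "code_size": 1.0, "energy_efficiency": 0.8},
}

speed_first = pick_variant(variants, k1=1.0, k2=0.0, k3=0.0)
size_first = pick_variant(variants, k1=0.2, k2=0.6, k3=0.2)
print(speed_first, size_first)
```

Changing the weights changes the winner, which is the point of exposing K1, K2, and K3 to the application developer rather than fixing them in the compiler.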
Effect of Compiler
Optimizations on Space
• Code size determines cost of ROM
• Should minimize I-cache and D-cache misses
• Code layout techniques:
– DAG-based placement
– Pettis-Hansen
– Inlining
– Cache line coloring
– Temporal-order placement
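As a rough illustration of profile-guided code layout, the sketch below forms chains of basic blocks greedily from edge execution counts, in the spirit of the bottom-up block positioning associated with Pettis-Hansen. It is a simplification, not the published algorithm: the function name, the block names, and the profile counts are hypothetical.

```python
def layout_blocks(blocks, edges, entry):
    """Greedy chain formation: hot control-flow edges become fall-throughs.

    `edges` maps (src, dst) to a profiled execution count (assumed data).
    """
    chain_of = {b: [b] for b in blocks}  # each block starts as its own chain
    # Visit edges from hottest to coldest; ties are broken by block names.
    for (src, dst), _count in sorted(edges.items(), key=lambda e: (-e[1], e[0])):
        a, b = chain_of[src], chain_of[dst]
        # Merge only if src ends one chain and dst begins a different chain.
        if a is not b and a[-1] == src and b[0] == dst:
            a.extend(b)
            for blk in b:
                chain_of[blk] = a
    # Emit the chain containing the entry block first, then the others.
    order, seen = [], set()
    for blk in [entry] + list(blocks):
        chain = chain_of[blk]
        if id(chain) not in seen:
            seen.add(id(chain))
            order.extend(chain)
    return order


# Hypothetical if/else diamond where the "else" side (C) is the hot path.
blocks = ["A", "B", "C", "D"]
edges = {("A", "B"): 10, ("A", "C"): 90, ("B", "D"): 10, ("C", "D"): 90}
order = layout_blocks(blocks, edges, entry="A")
print(order)
```

The hot path A, C, D ends up contiguous, so the common case runs as straight-line code and the cold block B is moved out of the way. The sketch does not handle the case where a hot back edge pulls the entry block into the middle of a chain.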
Code Placement Gains
Source: J. Fisher, P. Faraboschi, C. Young
Fundamentals of Power
Dissipation: Switching
Unlike TTL or ECL, CMOS transistors drain current mainly while switching, and power dissipation depends linearly
on frequency and quadratically on voltage:
fs: switching frequency
Ci: load capacitance on net i
A: fraction of nets in circuit that actually switch
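A small numeric sketch can make the linear and quadratic dependence concrete. It assumes the common textbook form P = A · fs · V² · ΣCi; the supply voltage symbol V, the exact form (some texts include a factor of 1/2), and all numeric values are assumptions rather than content from the slide.

```python
def switching_power(activity, f_s, v_dd, net_capacitances):
    """Estimate dynamic power as A * fs * V^2 * sum(Ci) (assumed form).

    activity: fraction of nets that switch (A)
    f_s: switching frequency in Hz (fs)
    v_dd: supply voltage in volts (assumed symbol)
    net_capacitances: load capacitance of each net in farads (Ci)
    """
    return activity * f_s * v_dd ** 2 * sum(net_capacitances)


caps = [1e-12] * 1000  # hypothetical: 1000 nets of 1 pF each
base = switching_power(0.1, 100e6, 1.2, caps)
half_voltage = switching_power(0.1, 100e6, 0.6, caps)
double_freq = switching_power(0.1, 200e6, 1.2, caps)

# Ratios relative to the baseline show the scaling behaviour.
print("half voltage:", half_voltage / base)
print("double frequency:", double_freq / base)
```

Under this model, halving the voltage cuts switching power to about a quarter, while doubling the frequency doubles it.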
Fundamentals of Power
Dissipation: Leakage
CMOS transistors drain minimal current when open or
closed: leakage ranges from typically only 1% of total dissipation for a 0.25 µm process to
about 30% for a 90 nm process technology.
Dynamic voltage scaling (DVS) lowers the operating
frequency and increases the execution time from t1 to t2 = t1(f1/f2),
with an energy saving that is approximately quadratic:
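The quadratic saving can be checked with a short calculation. The sketch assumes that switching power is proportional to f · V², that a task needs a fixed number of cycles, and that the supply voltage is lowered in proportion to the frequency; the function names, the constant `k`, and the numbers are hypothetical.

```python
def scaled_time(t1, f1, f2):
    """Execution time after changing frequency from f1 to f2: t2 = t1 * (f1 / f2)."""
    return t1 * (f1 / f2)


def task_energy(cycles, f, v, k=1.0):
    """Energy = power * time, with power assumed to be k * f * v^2."""
    power = k * f * v ** 2
    time = cycles / f
    return power * time


# Hypothetical operating points: halve both frequency and voltage.
e1 = task_energy(cycles=1e9, f=200e6, v=1.8)
e2 = task_energy(cycles=1e9, f=100e6, v=0.9)
t2 = scaled_time(5.0, 200e6, 100e6)

print("energy ratio:", e2 / e1)
print("new run time for a 5 s task:", t2)
```

Because the frequency cancels out of power times time, the energy per task depends only on V² in this model: the task takes twice as long but uses roughly a quarter of the energy.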
Power-aware Software
Techniques
• Reducing switching activity
• Power-aware instruction selection
• Scheduling for minimal dissipation
• Memory access optimizations
• Data remapping
Effect of Compiler
Optimizations on Power
• Most of the traditional “scalar” optimizations
benefit space, time, and (therefore) power
• Not so for:
– Loop unrolling
– Tail duplication
– Inlining and cloning
– Speculation and predication
– Global code motion
• These and other optimizations will be
discussed in subsequent lectures