
Exascale: Power, Cooling, Reliability, and Future Arithmetic
John Gustafson
HPC User Forum
Seattle, September 2010
The First Cluster: The “Cosmic Cube”
64 Intel 8086-8087 nodes
700 watts, total
6 cubic feet
Chuck Seitz and Geoffrey Fox developed an alternative to Cray vector mainframes for $50,000, in 1984. Motivated by QCD physics problems, but soon found useful for a very wide range of apps.
But Fox & Seitz did not invent the Cosmic Cube. Stan Lee did.
From Wikipedia: “A device created by a secret society of scientists to further their ultimate goal of world conquest.”
Tales of Suspense #79, July 1966
Terascale Power Use Today (Not to Scale)
A 1-Tflop/s machine today draws roughly 5 kW in total:

Component            Power     Basis
Heat removal         2000 W    40% of total power consumed (all levels, chip to facility)
Power supply loss     950 W    19% of total (81% efficient power supplies)
Control              1500 W    7.5 nJ/instruction @ 0.2 instruction/flop
Compute               200 W    1 Tflop/s @ 200 pJ per flop
Memory                150 W    0.1 byte/flop @ 1.5 nJ per byte
Disk                  100 W    10 TB disk @ 10 W per TB
Comm.                 100 W    100 pJ comm. per flop

Derived from data from S. Borkar, Intel Fellow
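As a sanity check, here is a minimal Python sketch that rebuilds the budget from the per-operation figures above; treating the remaining 41% of facility power (after 40% cooling and 19% supply loss) as the IT load is my assumption, not a statement from the slide.

```python
# Back-of-envelope check of the ~5 kW terascale budget above, using the
# per-operation figures from the slide. Treating the remaining 41% of total
# power as the IT load (100% - 40% cooling - 19% supply loss) is an assumption.

FLOPS = 1e12  # 1 Tflop/s sustained

control = 7.5e-9 * 0.2 * FLOPS   # 7.5 nJ/instruction at 0.2 instructions per flop -> 1500 W
memory  = 1.5e-9 * 0.1 * FLOPS   # 1.5 nJ/byte at 0.1 byte per flop                ->  150 W
comm    = 100e-12 * FLOPS        # 100 pJ of communication per flop                ->  100 W
disk    = 10 * 10                # 10 TB of disk at 10 W per TB                    ->  100 W
compute = 200e-12 * FLOPS        # 200 pJ per flop                                 ->  200 W

it_load = control + memory + comm + disk + compute   # 2050 W
total = it_load / (1.0 - 0.40 - 0.19)                # add 40% heat removal, 19% supply loss
print(f"IT load {it_load:.0f} W, facility total {total:.0f} W")   # about 5000 W
```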
Let’s See that Drawn to Scale…
A SIMD accelerator approach gives up Control to reduce wattage per Tflop/s. That can work for applications that are very regular and SIMD-like (vectorizable with long vectors).
Energy Cost by Operation Type

Operation                    Approximate energy consumed today
64-bit multiply-add            200 pJ
Read 64 bits from cache        800 pJ
Move 64 bits across chip      2000 pJ
Execute an instruction        7500 pJ
Read 64 bits from DRAM       12000 pJ
Notice that 12000 pJ @ 3 GHz = 36 watts!
SiCortex’s solution was to drop the memory speed, but performance dropped proportionately.
Larger caches actually reduce power consumption.
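The arithmetic behind the 36-watt note generalizes to every row of the table; a quick Python sketch, assuming one such operation per cycle at 3 GHz (that rate is my simplification, purely for illustration):

```python
# Power implied by each per-operation energy in the table at a sustained rate of
# one operation per cycle at 3 GHz (an assumed rate, for illustration only).

RATE_HZ = 3e9  # 3 GHz

energy_pj = {
    "64-bit multiply-add":       200,
    "read 64 bits from cache":   800,
    "move 64 bits across chip": 2000,
    "execute an instruction":   7500,
    "read 64 bits from DRAM":  12000,
}

for op, pj in energy_pj.items():
    watts = pj * 1e-12 * RATE_HZ
    print(f"{op:26s} {pj:6d} pJ  ->  {watts:5.1f} W at 3 GHz")
# Last line: 12000e-12 J * 3e9 1/s = 36 W, matching the note above.
```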
Energy Cost of a Future HPC Cluster

Scale       Power     Size
Exaflop     20 MW     Data Center
Petaflop    20 kW     Cabinet
Teraflop    20 W      Chip/Module
No cost for interconnect? Hmm…
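The table implies the same energy budget at every scale; a quick sketch of the arithmetic (the comparison with today's roughly 5000 pJ/flop all-in figure comes from the earlier terascale budget):

```python
# Energy per flop implied by the table above, at each scale.

targets = [
    ("Exaflop",  20e6, 1e18),   # 20 MW data center
    ("Petaflop", 20e3, 1e15),   # 20 kW cabinet
    ("Teraflop", 20.0, 1e12),   # 20 W chip/module
]

for name, watts, flops in targets:
    pj_per_flop = watts / flops / 1e-12
    print(f"{name:9s} {watts:>12,.0f} W / {flops:.0e} flop/s = {pj_per_flop:.0f} pJ per flop")
# All three scales budget a flat 20 pJ per flop -- versus ~5000 pJ per flop
# all-in today -- and nothing at all for interconnect.
```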
But while we’ve been building HPC clusters, Google and Microsoft have been very, very busy…
Cloud computing has already eclipsed HPC for sheer scale
• Cloud computing means using a remote data center to manage scalable, reliable, on-demand access to applications
• Provides applications and infrastructure over the Internet
• “Scalable” here means:
  – Possibly millions of simultaneous users of the app.
  – Exploiting thousand-fold parallelism in the app.
From Tony Hey, Microsoft
Mega-Data Center Economy of Scale
• A 50,000-server facility is 6–7x more cost-effective than a 1,000-server facility in key respects
• Don’t expect a TOP500 score soon.
• Secrecy?
• Or… not enough interconnect?

Technology        Cost in small data center         Cost in large data center          Ratio
Network           $95 per Mbps per month            $13 per Mbps per month             7.1
Storage           $2.20 per GB per month            $0.40 per GB per month             5.7
Administration    ~140 servers per administrator    >1000 servers per administrator    7.1

Over 18 million square feet. Each.
Each data center is the size of 11.5 football fields.
Data courtesy of James Hamilton
Computing by the truckload
• Build racks and cooling and communication together in a “container”
• Hookups: power, cooling, and interconnect
• But: designed for capacity computing, not capability computing
• I estimate each center is already over 70 megawatts… and 20 petaflops total!
From Tony Hey, Microsoft
Arming for search engine warfare
It’s starting to look like…
a steel mill!
A steel mill takes ~500 megawatts
• Half the steel mills in the US are abandoned
• Maybe some should be converted to data centers!
• Self-contained power plant
• Is this where “economy of scale” will top out for clusters as well?
With great power comes great responsibility
—Uncle Ben
Yes, and also some really big heat sinks.
—John G.
An unpleasant math surprise lurks…
64-bit precision is looking long in the tooth. (gulp!)
[Chart: machine word length in bits vs. year, 1940–2010 — Zuse 22; Univac, IBM 36; CDC 60; Cray 64; most vendors 64; x86 80 (stack only)]
At 1 exaflop/s (10^18 flop/s), 15 decimals don’t last long.
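To see why, here is a back-of-envelope error-growth model (my illustration, not a calculation from the talk): the worst-case relative rounding error of a long recursive sum grows roughly like N times the unit roundoff, and even the optimistic random-walk model grows like sqrt(N) times it.

```python
# Rough model of double-precision error growth over N operations.

import math

EPS = 2.2e-16  # unit roundoff of IEEE 754 double precision

for n_ops in (1e9, 1e12, 1e15, 1e18):
    worst = n_ops * EPS              # pessimistic accumulation bound
    walk  = math.sqrt(n_ops) * EPS   # optimistic random-walk estimate
    print(f"N = {n_ops:.0e}: worst-case ~ {worst:.1e}, random-walk ~ {walk:.1e}")
# At N = 1e18 the worst-case bound exceeds 1 (all 15 decimals gone), and even
# the random-walk estimate leaves only about 7 of them.
```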
It’s unlikely a code uses the best precision.
• Too few bits gives unacceptable errors
• Too many bits wastes memory, bandwidth, joules
• This goes for integers as well as floating point
[Chart: optimum precision, in bits (0–80), across all floating-point operations in an application, with the fixed 64-bit IEEE 754 double-precision level marked for reference]
Ways out of the dilemma…
• Better hardware support for 128-bit, if only for use as a check
• Interval arithmetic has promise, if programmers can learn to use it properly (not just apply it to point arithmetic methods); a toy sketch follows this list
• Increasing precision automatically with the memory hierarchy might even allow a return to 32-bit
• Maybe it’s time to restore Numerical Analysis to the standard curriculum of computer scientists?
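To give a flavor of the second item, a toy sketch: the Interval class below is a hypothetical illustration, not a real package, and it omits the directed rounding a proper interval library would use.

```python
# A toy interval type: every value carries a lower and upper bound, so lost
# precision shows up as a widening interval instead of a silently wrong point
# result. Real interval arithmetic would also round the bounds outward.

from dataclasses import dataclass

@dataclass
class Interval:
    lo: float
    hi: float

    def __add__(self, other):
        return Interval(self.lo + other.lo, self.hi + other.hi)

    def __mul__(self, other):
        corners = (self.lo * other.lo, self.lo * other.hi,
                   self.hi * other.lo, self.hi * other.hi)
        return Interval(min(corners), max(corners))

x = Interval(0.999, 1.001)   # a quantity known only to about three decimals
y = Interval(-0.001, 0.001)  # a quantity that straddles zero
print(x * x + y)             # rigorous bounds on x*x + y
```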
The hierarchical precision idea
Going down the hierarchy, capacity grows; going up, latency falls and precision and range grow.

Level        Capacity              Latency       Precision                  Range
Registers    16 FP values          1 cycle       69 significant decimals    10^–2525221 to 10^+2525221
L1 Cache     4096 FP values        6 cycles      33 significant decimals    10^–9864 to 10^+9864
L2 Cache     2 million FP values   30 cycles     15 significant decimals    10^–307 to 10^+307
Local RAM    2 billion FP values   200 cycles    7 significant decimals     10^–38 to 10^+38
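A minimal software sketch of the idea (NumPy here purely for illustration, not part of the proposal): bulk data stays in the 32-bit “RAM” format, while the running sum lives either at storage precision or in a wider “register” precision.

```python
# Bulk data stored in 32 bits (as it would sit in local RAM), with the running
# sum kept either at storage precision or in a wider "register" precision.

import math
import numpy as np

rng = np.random.default_rng(0)
data = rng.random(1_000_000).astype(np.float32)   # 32-bit values in "RAM"

acc32 = np.float32(0.0)                           # accumulate at storage precision
for block in np.array_split(data, 1000):          # data moves up the hierarchy in blocks
    acc32 = np.float32(acc32 + block.sum(dtype=np.float32))

acc64 = float(data.sum(dtype=np.float64))         # accumulate in a wider format

exact = math.fsum(float(x) for x in data)         # correctly rounded reference sum
print(f"32-bit accumulation, relative error: {abs(float(acc32) - exact) / exact:.1e}")
print(f"64-bit accumulation, relative error: {abs(acc64 - exact) / exact:.1e}")
```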
Another unpleasant surprise lurking…
No more automatic caches?
• Hardware cache policies are designed to minimize miss rates, at the expense of low cache utilization (typically around 20%).
• Memory transfers will soon be half the power consumed by a computer, and computing is already power-constrained.
• Software will need to manage memory hierarchies explicitly. And source codes need to expose memory moves, not hide them.
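A minimal sketch of what “exposing memory moves” can look like in source code (NumPy, purely illustrative): a blocked matrix multiply in which every tile is explicitly copied into a small scratchpad-sized buffer before use, instead of trusting an automatic cache to stage it.

```python
# Blocked matrix multiply with explicit, visible data movement: each tile is
# copied into a scratchpad buffer before use and the result tile is written
# back afterwards, so the memory traffic appears in the source code.

import numpy as np

def blocked_matmul(A, B, tile=64):
    n = A.shape[0]                    # assumes square matrices, n divisible by tile
    C = np.zeros((n, n), dtype=A.dtype)
    for i in range(0, n, tile):
        for j in range(0, n, tile):
            c_tile = np.zeros((tile, tile), dtype=A.dtype)            # scratchpad accumulator
            for k in range(0, n, tile):
                a_tile = np.ascontiguousarray(A[i:i+tile, k:k+tile])  # explicit copy in
                b_tile = np.ascontiguousarray(B[k:k+tile, j:j+tile])  # explicit copy in
                c_tile += a_tile @ b_tile                             # compute on resident tiles
            C[i:i+tile, j:j+tile] = c_tile                            # explicit copy out
    return C

A = np.random.rand(256, 256)
B = np.random.rand(256, 256)
assert np.allclose(blocked_matmul(A, B), A @ B)
```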
Summary
• Mega-data center clusters have eclipsed HPC clusters, but HPC can learn a lot from their methods in getting to exascale.
• Clusters may grow to the size of steel mills, dictated by economies of scale.
• We may have to rethink the use of 64-bit flops everywhere, for a variety of reasons.
• Speculative data motion (like automatic caching) reduces operations per watt… it’s on the way out.