Slide 1 - Amano Lab

Supercomputers
Special Course of Computer Architecture
H.Amano
Contents
• What are supercomputers?
• Architecture of supercomputers
• Representative supercomputers
• Exa-scale supercomputer project
Defining Supercomputers
• High-performance computers mainly for scientific computation.
  – Huge amounts of computation for biochemistry, physics, astronomy, meteorology, etc.
  – Very expensive: developed and managed with national funding.
  – High-level techniques are required to develop and manage them.
  – The USA, Japan and China compete for the top-1 supercomputer.
  – A large amount of national funding is used, so they tend to make political news → in Japan, the supercomputer project became a target of the budget review in Dec. 2009.
K achieved 10 PFLOPS and became the top-1 machine last year, but Sequoia took the spot back last month.
FLOPS
• Floating-point Operations Per Second
• Floating-point number
  – (mantissa) × 2^(exponent)
  – Double precision: 64bit; single precision: 32bit.
  – The IEEE standard defines the format and the rounding.

         sign  exponent  mantissa
Single    1       8         23
Double    1      11         52
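The single/double layout above can be checked directly with Python's standard library. A minimal sketch (the helper name `double_fields` is ours, not a standard API):

```python
# Extract the IEEE 754 double-precision fields from the table above:
# 1 sign bit, 11 exponent bits (biased by 1023), 52 mantissa bits.
import struct

def double_fields(x):
    # Reinterpret the 64-bit double as an unsigned integer.
    bits = struct.unpack(">Q", struct.pack(">d", x))[0]
    sign = bits >> 63
    exponent = (bits >> 52) & 0x7FF       # 11 bits
    mantissa = bits & ((1 << 52) - 1)     # 52 bits
    return sign, exponent, mantissa

# 1.0 is stored as (-1)^0 * 1.0 * 2^(1023-1023):
print(double_fields(1.0))   # (0, 1023, 0)
```

For single precision the same idea applies with `">I"`/`">f"` and field widths 1/8/23.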
The range of performance

10^6   M (Mega)  (100万 = 1 million)
10^9   G (Giga)  (10億 = 1 billion)
10^12  T (Tera)  (1兆 = 1 trillion)
10^15  P (Peta)  (1000兆 = 1 quadrillion)
10^18  E (Exa)   (100京 = 1 quintillion)

iPhone4S: 140 MFLOPS
High-end PC: 50-80 GFLOPS
Powerful GPU: Tera-FLOPS
Supercomputers: 10 TFLOPS - 16 PFLOPS

10 PFLOPS = 1京 (10^16) operations per second in Japanese
→ the name "K" (京) comes from it.
Growth rate: 1.9 times/year
How to select top 1?
• Top500/Green500: performance executing Linpack
  – Linpack is a kernel for matrix computation.
  – Scale-free.
  – Performance-centric.
• Gordon Bell Prize
  – Peak performance, price/performance, special achievement.
• HPC Challenge
  – Global HPL: matrix computation → computation.
  – Global RandomAccess: random memory access → communication.
  – EP STREAM per system: heavy-load memory access → memory performance.
  – Global FFT: a complicated problem requiring both memory and communication performance.
• Nov.: ACM/IEEE Supercomputing Conference
  – Top500, Gordon Bell Prize, HPC Challenge, Green500
• Jun.: International Supercomputing Conference
  – Top500, Green500
[Chart: Rmax (PFLOPS) of the top-5 machines from Jun. 2010 to Nov. 2011 — Sequoia (USA, 16 PFLOPS) and K (Japan) at the top, followed by Tianhe (天河, China), Jaguar (USA), Nebulae (China), Kraken (USA), Roadrunner (USA), Tsubame (Japan) and Jugene (Germany). From the SACSIS2012 invited speech.]
Top 500, Nov. 2011

Name                Development                  Hardware                                    Cores    Performance   Power
                                                                                                      TFLOPS        (kW)
K (京) (Japan)      RIKEN AICS                   SPARC64 VIIIfx 2.0GHz, Tofu                 705024   10510 (11280) 12659.9
                                                 interconnect; Fujitsu
Tianhe-1A (天河)    National Supercomputer       NUDT YH MPP, Xeon X5670 6C 2.93GHz,         186368   2566 (4701)   4040
(China)             Center, Tianjin              NVIDIA 2050; NUDT
Jaguar (USA)        DOE/SC/Oak Ridge             Cray XT5-HE, Opteron 6-core 2.6GHz;         224162   1759 (2331)   6950
                    National Lab.                Cray Inc.
Nebulae (China)     National Supercomputing      Dawning TC3600 Blade, Xeon X5650 6C         120640   1271 (2974)   2580
                    Centre in Shenzhen           2.66GHz, Infiniband QDR, NVIDIA 2050;
                                                 Dawning
TSUBAME2.0 (Japan)  GSIC, Tokyo Inst. of         HP ProLiant SL390s G7, Xeon 6C X5670,       73238    1192 (2287)   1398.6
                    Technology                   NVIDIA GPU; NEC/HP

(The parenthesized performance is the peak; the other value is the Linpack Rmax.)
Green 500, Nov. 2011

Rank  Machine                                      Place                          MFLOPS/W   Total kW
1     BlueGene/Q, Power BQC 16C 1.60GHz, Custom    IBM - Rochester                2026.48    85.12
2-5   BlueGene/Q, Power BQC 16C 1.60GHz, Custom    IBM - Thomas J. Watson         1689.86-
      / BlueGene/Q Prototype                       Research Center / Rochester    2026.48
6     DEGIMA Cluster, Intel i5, ATI Radeon GPU,    Nagasaki Univ.                 1378.32    47.05
      Infiniband QDR
7     Bullx B505, Xeon E5649 6C 2.53GHz,           Barcelona Supercomputing       1266.26    81.50
      Infiniband QDR, NVIDIA 2090                  Center
8     Curie Hybrid Nodes - Bullx B505, Nvidia      TGCC / GENCI                   1010.11    108.80
      M2090, Xeon E5640 2.67GHz, Infiniband QDR

IBM Blue Gene/Q machines took ranks 1-5. Rank 10 is Tsubame-2.0 (Tokyo Inst. of Technology).
Why Top 1?
• Top-1 of the Top500 is just a measure of matrix computation.
• Top-1 of Green500, the Gordon Bell Prize, and top-1 of each HPC Challenge program
  → all of these machines are valuable.
  TV and newspapers focus too much on the Top500.
• However, most top-1 computers also got the Gordon Bell Prize and HPC Challenge top-1.
  – K and Sequoia
• The impact of being top 1 is great!
Why are supercomputers so fast?
× Because they use a high-frequency clock — no.
[Chart: clock frequency of high-end PCs, 1992-2008 — from the Alpha21064 at 150MHz (1992), rising about 40%/year to the Pentium4 at 3.2GHz and Nehalem at 3.3GHz; K runs at 2GHz and Sequoia at 1.6GHz.]
The speed-up of the clock saturated in 2003 because of power and heat dissipation.
The clock frequency of K and Sequoia is lower than that of common PCs.
The 3 major methods of parallel processing in supercomputers
Supercomputer = massively parallel computer
– SIMD (Single Instruction Stream, Multiple Data Streams)
  • Most accelerators
– Pipelined processing
  • Vector computers
– MIMD (Multiple Instruction Streams, Multiple Data Streams)
  • Homogeneous (vs. accelerators), scalar (vs. vector machines)
– Although all supercomputers use the three methods at various levels, each can be classified by which method it mainly relies on.

Key issues other than the computational nodes:
  Large, high-bandwidth memory
  Large disks
  High-speed interconnection networks
SIMD (Single Instruction Stream,
Multiple Data Streams)
[Diagram: one instruction memory feeds a single instruction to many processing units, each with its own data memory.]
• All processing units execute the same instruction.
• Low degree of flexibility.
• Illiac-IV / MMX instructions / ClearSpeed / IMAP / GP-GPU (coarse-grained)
• CM-2 (fine-grained)
GPGPU (General-Purpose computing on Graphics Processing Units)
– TSUBAME2.0 (Xeon+Tesla; Top500 2010/11 4th)
– Tianhe-1 (天河一号) (Xeon+FireStream; 2009/11 5th)
※ Parentheses show the development environment.
[Diagram: GeForce GTX280, 240 cores — the host feeds an input assembler and a thread execution manager, which dispatch to arrays of thread processors, each with per-block shared memory (PBSM), with load/store access to global memory.]
GPU (NVIDIA's GTX580)
[Diagram: 512 GPU cores (128 × 4) around a 768 KB L2 cache — 40nm CMOS, 550 mm².]
Cell Broadband Engine
[Diagram: one PPE (PXU, L1C, L2C) and eight SPEs (SXU + local store + DMA each), connected by 4 × 16B data rings at 1.6GHz, with BIF/IOIF0/IOIF1 and MIC interfaces. Used in the PS3 and IBM Roadrunner — a common platform for supercomputers and games.]
Peak performance vs. Linpack performance
[Chart, PFLOPS: K (Japan; homogeneous), Tianhe (天河, China; using GPUs), Nebulae (China; using GPUs), Jaguar (USA) and Tsubame (Japan). The difference between peak and Linpack is large in machines with accelerators; the accelerator type is energy-efficient.]
Pipeline processing
[Diagram: a 6-stage pipeline.]
Each stage sends its result and receives new input every clock cycle.
N stages ≈ N times the performance.
Data dependencies cause RAW hazards and degrade the performance.
If a large array is processed, many stages can work efficiently.
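A back-of-envelope sketch of why a large array keeps all stages busy, under the idealized model above (no hazards, one result per cycle once the pipe is full; the function names are ours):

```python
# n items through a k-stage pipeline take (k - 1) fill cycles plus
# one cycle per item; without pipelining they take n * k cycles.
def pipelined_cycles(n_items, n_stages):
    return (n_stages - 1) + n_items

def sequential_cycles(n_items, n_stages):
    return n_items * n_stages

# For a large array the speedup approaches the stage count:
n, k = 10_000, 6
print(sequential_cycles(n, k) / pipelined_cycles(n, k))  # ≈ 6
```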
Vector computers
[Diagram sequence: elements a0, a1, a2, ... and b0, b1, b2, ... stream from the vector registers through a pipelined multiplier computing X[i] = A[i]*B[i] and a pipelined adder computing Y = Y + X[i], one element per clock.]
The classic style of supercomputer since the Cray-1.
The Earth Simulator may be the last vector supercomputer.
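The operation streamed through the multiplier and adder above is a multiply-accumulate over whole vectors. A minimal functional sketch (a real vector machine processes one element per clock in hardware; `vector_madd` is our illustrative name):

```python
# X[i] = A[i] * B[i] feeding Y = Y + X[i], as in the diagrams.
def vector_madd(A, B):
    y = 0.0
    for a, b in zip(A, B):   # one "element per cycle"
        x = a * b            # multiplier stage: X[i] = A[i]*B[i]
        y += x               # adder stage:      Y = Y + X[i]
    return y

print(vector_madd([1.0, 2.0, 3.0], [4.0, 5.0, 6.0]))  # 32.0
```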
MIMD (Multiple Instruction Streams /
Multiple Data Streams)
• Multiple processors (cores) can work independently.
  – Synchronization mechanisms
  – Data communication: shared memory
• All supercomputers are MIMD machines with multiple cores.
• However, K and Sequoia (BlueGene/Q) are typical massively parallel MIMD machines:
  – homogeneous computers
  – scalar processors
MIMD (Multiple Instruction Streams /
Multiple Data Streams)
[Diagram: four nodes (Node 0-3), each a processor that can work independently, connected through an interconnection network to a shared memory.]
Multi-Core (Intel's Nehalem-EX)
[Diagram: 8 CPU cores around two L3 cache blocks — 24MB L3 cache, 45nm CMOS, 600 mm².]
Intel 80-Core Chip
[Diagram: Intel 80-core chip [Vangal, ISSCC'07].]
How to program them?
• Can common PC programs be accelerated on supercomputers?
  – Yes, to a certain degree, by parallel compilers.
• However, to use many cores efficiently, specialists must optimize the programs.
  – Multi-process programming with MPI
  – OpenMP
  – OpenCL/CUDA → GPU accelerator type
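The common idea behind MPI and OpenMP programs — split a loop across workers, then combine the partial results — can be sketched without either API. This is only an analogy using Python's standard multiprocessing module (all names below are ours, not MPI or OpenMP calls):

```python
# Divide the data among worker processes, let each compute a
# partial result, then reduce — the pattern MPI/OpenMP codes use.
from multiprocessing import Pool

def partial_sum(chunk):
    # each "core" works on its own slice of the data
    return sum(x * x for x in chunk)

def parallel_sum_of_squares(data, nprocs=4):
    size = (len(data) + nprocs - 1) // nprocs
    chunks = [data[i:i + size] for i in range(0, len(data), size)]
    with Pool(nprocs) as pool:
        # "reduce" step: combine the per-worker partial sums
        return sum(pool.map(partial_sum, chunks))

if __name__ == "__main__":
    print(parallel_sum_of_squares(list(range(1000))))  # 332833500
```

In real supercomputer codes the same decomposition is written with `MPI_Send`/`MPI_Recv` or OpenMP pragmas, and the optimization of the decomposition is exactly where the specialists come in.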
IBM's BlueGene/Q
• Successor of Blue Gene/L and Blue Gene/P.
• Sequoia is built from BlueGene/Q.
• 18 Power processors (16 computational, 1 control and 1 redundant) and network interfaces are provided on a chip.
• The inner-chip interconnect is a crossbar switch.
• 5-dimensional mesh/torus.
• 1.6GHz clock.
[Photo from the IBM web site: the fastest computer — also a simple NUMA.]
Japanese supercomputers
• K supercomputer
  – A homogeneous, scalar-type massively parallel computer.
• Earth Simulator
  – A vector computer.
  – The difference between peak and Linpack performance is small.
• TIT's Tsubame
  – Uses a lot of GPUs. An energy-efficient supercomputer.
• Nagasaki University's DEGIMA
  – Uses a lot of GPUs. A hand-made supercomputer with high cost-performance. Gordon Bell Prize cost-performance winner.
• GRAPE projects
  – Dedicated supercomputers for astronomy. SIMD; various versions won the Gordon Bell Prize.
Supercomputer K (SACSIS2012 invited speech)
[Diagram: the SPARC64 VIIIfx chip — 8 cores sharing an L2 cache and memory, with an interconnect controller to the Tofu interconnect (6-D torus/mesh). 4 nodes/board, 24 boards/rack, 96 nodes/rack. RDMA mechanism. NUMA, or UMA+NORMA.]
[Photos: K's water-cooling system and racks.]
Tofu: a 6-dimensional torus
[Diagram: Tofu combines a 3-dimensional mesh with 3-dimensional torus units. Examples of k-ary n-cubes with node addresses: a 3-ary 1-cube, a 3-ary 2-cube, a 3-ary 3-cube, and a 4-dimensional mesh grouping the 3-cubes 0***, 1***, 2***.]
Why K could get top 1
• The delay of BlueGene/Q / Sequoia
  – Financial crisis in the USA.
• The withdrawal of NEC/Hitachi
  – At the start, a complex system combining a vector machine and a scalar machine was planned.
  – The whole budget could then be used for the scalar machine alone.
• The budget review made the project famous.
  – Enough funding was provided in a short period.
• The engineers at Fujitsu did a really good job.
The Earth Simulator (2002): simple NUMA (SACSIS2012 invited talk)
Peak performance: 40TFLOPS
[Diagram: 640 nodes (Node 0 ... Node 639), each with 8 vector processors (0-7) sharing a 16GB memory, connected by an interconnection network (16GB/s × 2).]
TIT's Tsubame: a well-balanced supercomputer with GPUs.
Nagasaki Univ.'s DEGIMA.
GRAPE-DR: Kei Hiraki, "GRAPE-DR", http://www.fpl.org (FPL2007).
Exa-scale computer
• The Japanese national project for an exa-scale computer has started.
• Feasibility studies have started.
  – U. Tokyo, Tsukuba Univ., Tohoku Univ. and Riken.
• It is difficult to build supercomputers with original Japanese chips.
• In Japan, a vendor takes a loss developing supercomputers.
• The vendor may recover the development cost later by selling smaller systems.
• However, Japanese semiconductor companies will not be able to supply such large development funds.
• If Intel's CPUs or NVIDIA's GPUs are used, a huge amount of national funding will flow to US companies.
• For exa-scale, 70,000,000 cores are needed.
  – The budget limitation is more severe than the technical limits.
Amdahl's law
Serial part: 1%; parallel part: 99%, accelerated by parallel processing.
Execution time with p cores: 0.01 + 0.99/p
→ 50 times speedup with 100 cores, 91 times with 1000 cores.
If there is even a small serial-execution part, the performance improvement is limited.
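The two figures on the slide follow directly from the formula; a quick check (the function name is ours):

```python
# Amdahl's law: with serial fraction s, the speedup on p cores is
# 1 / (s + (1 - s)/p); it can never exceed 1/s, however large p is.
def amdahl_speedup(s, p):
    return 1.0 / (s + (1.0 - s) / p)

# s = 1% as in the slide:
print(round(amdahl_speedup(0.01, 100)))    # 50
print(round(amdahl_speedup(0.01, 1000)))   # 91
```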
Why exa-scale supercomputers?
• The ratio of the serial part becomes small for large-scale problems.
  – Linpack is a scale-free benchmark.
  – Serial part 1 day + parallel part 10 years → 1 day + 1 day: a big impact.
• Are there big programs which cannot be solved by K but can be solved by exa-scale supercomputers?
  – The number of such programs will decrease.
  – Can we find new areas of application?
• It is important that such big computing power is open for research.
Should we develop floating-point-computation-
centric supercomputers?
• What do people want big supercomputers to do?
  – Finding new medicines: pattern matching.
  – Simulation of earthquakes; meteorology for analyzing global warming.
  – Big data.
  – Artificial intelligence.
• Most of these are not suitable for floating-point-centric supercomputers.
• "Supercomputers for big data" or "super-cloud computers" might be required.
Motivation and limitation
• Integrates computer technologies including architecture, hardware, software, dependability techniques, semiconductors and applications.
• A flagship and a symbol.
• Other than supercomputers, no computer development remains in Japan.
• Super computing power is open for peaceful research.
• It is a tool that makes impossible analyses possible.
• What needs infinite computing power?
• Is it a Japanese supercomputer if all cores and accelerators are made in the USA?
• Does a floating-point-centric supercomputer built to solve Linpack as fast as possible really fit the demand?
Watch the exa-scale computer project!
Exercise
• A target program:
  serial computation part: 1
  parallel computation part: N^3
• K: 700,000 cores
• Exa: 70,000,000 cores
• What N makes Exa 10 times faster than K?
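A sketch for checking your closed-form answer numerically, under the model of the previous slides: execution time with p cores is 1 + N³/p (the function names are ours):

```python
# Serial part 1, parallel part N^3 divided over the cores.
def time_on(n, cores):
    return 1 + n**3 / cores

# Smallest N where Exa (70M cores) is at least `ratio` times
# faster than K (700K cores).
def smallest_n(ratio=10, k=700_000, exa=70_000_000):
    n = 1
    while time_on(n, k) / time_on(n, exa) < ratio:
        n += 1
    return n

print(smallest_n())
```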