NEC Supercomputers for Meteorological Applications

CAS2001
Road Map and Product Strategy
Oct. 29, 2001
Tadashi Watanabe
History of High Performance Computers
[Figure: performance in FLOPS (log scale, 10^6 (M) to 10^13 (T)) versus year (1970 to 2005) for scalar, vector, vector multiprocessor, and microprocessor-based systems.]
Systems shown: CDC6600, CDC7600, TI ASC, STAR-100, ILLIAC IV, CRAY-1, CYBER205, S-810/20, VP-200, SX-2, CRAY Y-MP, S-820/80, CRAY Y-MP8, VP2600, SX-3/44, SX-3/44R, CRAY C90, S3800/480, CRAY T932, CM-5, VPP500/222, T3E, SR2201, SX-4/512, VPP700/512, ASCI, SX-5/512, VPP5000, ASCI White, SR8000F1, SX-6, ASCI Q, Earth Simulator
Architecture of Supercomputers
・Scalar processing: Mainframe, CDC6600/7600
  Limitation: performance limited by scalar processing
・Vector processing (memory to memory): CYBER200
  Limitation: bottleneck in memory throughput; addressed by vector registers and vectorizing compilers
・Vector processing (vector register): CRAY-1, SX-2, VP-200, S810/S820
  Limitation: performance limited by a single processor; addressed by multiprocessors and parallelizing compilers
・Shared memory multiprocessors: CRAY XMP/YMP, CRAY-C90/T90, SX-3/SX-4/SX-5, VP2000, S3800
  Limitation: bottleneck in memory throughput
・Distributed memory parallel processors (scalar or vector processors, each with its own main memory, connected by a network): VPP500, T3E, SP-2, CM5, nCUBE, PARAGON
  Limitation: difficult to code
・Distributed shared memory parallel processors (SMP nodes connected by a network): SX-5/SX-6, RS6000/SP, O2K, TX7
Capacity Computing and Capability Computing
Capacity Computing
・Goals: workload and throughput; wall-clock time is secondary
・Many small problems, not challenging
・Fit on μ-P based MPPs or workstation clusters
Capability Computing
・Goal: wall-clock time (turnaround time, TAT)
・Large critical problems that do not fit in conventional systems
・Best fit on SMPs with powerful processors
Products for Capacity and Capability Computing
[Figure: products plotted by performance for capability computing versus performance for capacity computing (throughput).]
Vector Processor vs Scalar Processor
・Vector: Capability Computing
・Scalar: Capacity Computing
[Figure: applications plotted by data size (small to large) and arithmetic operations (small to large). Vector oriented: Weather/Climate, Genome, Chemistry. Scalar oriented: Crash, CFD, Structural Analysis.]
AzusA* (IA-64 architecture): Itanium (800MHz), max 16-way, max 64GB shared memory
*code name
AzusA: NEC’s Strategic Itanium Product
・The world’s first large-scale Itanium server in operation
・Leverages NEC’s expertise in supercomputers and mainframes to develop highly scalable and reliable Itanium servers
AzusA Features
Advanced features of AzusA provided by NEC’s original chipset
• Based on expertise in supercomputers and mainframes
[Figure: four cells (Cell#0 to Cell#3), each with CPUs and memory, holding 16 Intel Itanium™ processors in total, with attached PCI boxes.]
High Performance: 16 Intel Itanium™ processors
Large Memory Space: 64-bit addressing, 64GB main memory
Large Configuration: 128 PCI slots (33MHz) or 64 slots (66MHz)
High Availability: hot-swappable replaceable parts (CPU cell, PCI card, fan, power supply); ECC and/or parity protection on data paths
8 Disk Drives
Flexibility: partitioning into up to 4 systems
Higher scalability, availability, and flexibility
NEC Itanium Server Roadmap
[Figure: scalability (number of CPUs) versus time, 2000 to 2003, across the Itanium, McKinley, and Madison processor generations.]
・High-End: AzusA (Itanium 16CPU); future products scaling from 16-32 CPUs up to 32-512 CPUs
・Midrange: McKinley 8CPU, Madison 8CPU
・Low-End: Itanium 4CPU, McKinley 4CPU, Madison 4CPU
Note: plan subject to change
SX-6: The facts
SX-Series History
◆ The latest technology in the SX-Series: NEC introduces SX-6, a new generation of supercomputers
・1983 SX Series: the first computer in the world surpassing 1 GFLOPS
・1989 SX-3 Series: shared memory, multi-function processor; UNIX OS
・1994 SX-4 Series: innovative CMOS technology; entirely air-cooled
・1998 SX-5 Series: high sustained performance; large-capacity shared memory
・2001 SX-6 Series: single-chip vector processor; greater scalability
Built with the collaboration of ISVs and users, global alliances (CRAY, BULL), and accumulated state-of-the-art HPC technology and board packaging technology
Goal: to be the market leader in the large-scale HPC market
SX-6 single node system
• High performance supercomputer
• Ultra-high bandwidth shared memory subsystem
• Maximum 8 processors, 8 Gigaflops each
• Maximum 64 Gigabyte memory
• Maximum 64 Gigaflops per node
SX-6 multi node system
• Maximum 128 nodes
• Maximum 1024 CPUs, max 8 TFLOPS
• Internode crossbar switch
• 8 GB/s interconnect bandwidth per node
• 1 TB/s maximum interconnect bandwidth per system
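The peak and bandwidth figures on the SX-6 single-node and multi-node slides are consistent with simple multiplication. A minimal sketch that checks them, assuming node peak is CPUs × per-CPU peak and system interconnect bandwidth is nodes × per-node bandwidth; the function names are hypothetical and not part of any NEC tool:

```python
# Figures taken from the SX-6 single-node and multi-node slides.
CPUS_PER_NODE = 8        # maximum processors per node
GFLOPS_PER_CPU = 8       # peak GFLOPS per processor
MAX_NODES = 128          # maximum nodes in a multi-node system
IXS_GBPS_PER_NODE = 8    # internode bandwidth per node, GB/s


def node_peak_gflops(cpus: int = CPUS_PER_NODE, gflops: int = GFLOPS_PER_CPU) -> int:
    """Peak of one node: every CPU running at its peak rate."""
    return cpus * gflops


def system_peak_gflops(nodes: int = MAX_NODES) -> int:
    """Peak of a multi-node system: number of nodes times node peak."""
    return nodes * node_peak_gflops()


def system_interconnect_gbps(nodes: int = MAX_NODES, per_node: int = IXS_GBPS_PER_NODE) -> int:
    """Aggregate interconnect bandwidth, assuming it sums over nodes."""
    return nodes * per_node


if __name__ == "__main__":
    print("node peak (GFLOPS):", node_peak_gflops())
    print("max CPUs:", MAX_NODES * CPUS_PER_NODE)
    print("system peak (GFLOPS):", system_peak_gflops())
    print("interconnect (GB/s):", system_interconnect_gbps())
```

The system peak comes out at 8192 GFLOPS, which the slide rounds to "max 8 TFLOPS", and 1024 GB/s is the "1 TB/s" figure.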
SX-6 system software
• Proven Operating System: SUPER-UX
• Development Tools: C, C++, Fortran90, MPI, OpenMP, Vampir/SX, TotalView
• Enhanced Multi-Node Batch System
• Enhanced System Management Tools
• User-friendly middleware
Focus Markets
・Environment & Meteorology: DMI, DKRZ, CHMI, IAP, INGV, …, MSC, INPE, BOM, KMA, JAMSTEC, NIES, ...
・Aerospace: NLR, DLR, EADS Airbus, ONERA, NAL ...
・Automotive: IFP, Mecalog, Volkswagen, Porsche, DaimlerChrysler, Renault, Toyota, Mazda, Nissan, ...
・Seismic: Veritas, IFP, ...
・Research: HLRS Stuttgart, CSCS, MPG, …, NIFS, Tohoku University, Osaka University, ...
SX-6: The technology
SX Series Processor Evolution
・SX-4: 8 vector pipes, 457 x 386 mm; performance 2 GFLOPS at 8.0 ns; 0.35μm CMOS LSI; 37 chips
・SX-5: 16 vector pipes, 225 x 225 mm vector CPU; 8 GFLOPS at 4.0 ns; 0.25μm CMOS; 32 chips
・SX-6: 8 GFLOPS at 2.0 ns; 0.15μm CMOS; single-chip processor
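Peak rate and cycle time together give the clock frequency and the number of floating-point results per clock (GFLOPS × ns = flops per cycle). A small sketch of that arithmetic using only the figures above; the table layout and function names are illustrative, and the SX-6 pipe count is left out because the slide does not give it:

```python
# (peak GFLOPS, cycle time in ns, vector pipes) from the processor evolution slide.
# None marks a value the slide does not state.
PROCESSORS = {
    "SX-4": (2.0, 8.0, 8),
    "SX-5": (8.0, 4.0, 16),
    "SX-6": (8.0, 2.0, None),
}


def clock_mhz(cycle_ns: float) -> float:
    """Clock frequency in MHz from the cycle time in ns."""
    return 1000.0 / cycle_ns


def flops_per_cycle(peak_gflops: float, cycle_ns: float) -> float:
    """Floating-point results per clock cycle: GFLOPS x ns."""
    return peak_gflops * cycle_ns


if __name__ == "__main__":
    for name, (gflops, ns, pipes) in PROCESSORS.items():
        fpc = flops_per_cycle(gflops, ns)
        line = f"{name}: {clock_mhz(ns):.0f} MHz, {fpc:.0f} flops/cycle"
        if pipes is not None:
            line += f", {fpc / pipes:.0f} flops/cycle per pipe"
        print(line)
```

This gives 125, 250, and 500 MHz for SX-4, SX-5, and SX-6, and 2 flops per pipe per cycle for both SX-4 and SX-5: the SX-6 keeps the 8 GFLOPS of the SX-5 with half the cycle time.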
SX Series Memory Evolution
・SX-4: 457 x 386 mm card; capacity 256MB/card; memory chip: 4Mb SSRAM, 32Mb SDRAM
・SX-5: 457 x 386 mm card; 4-8GB/card; 64-128Mb SDRAM
・SX-6: 105 x 176 mm card; 2GB/card; 256Mb DDR-SDRAM
Size Comparison
・SX-6: CPU 128 GFLOPS (64GF × 2 nodes), memory 128GB, 64GF per cabinet
・SX-5: CPU 160 GFLOPS, memory 128GB
[Figure: footprint comparison of the two systems; dimensions shown: 2.0m, 1.1m, 2.8m, 1.8m, 3.2m, 6.8m ~ 7.4m.]
SX-6: Parallel Processing and Performance
Keys to Efficiency in Parallel Processing
・Load Balancing
・Communication Overhead
・Synchronization
Load Balancing
[Figure: one job divided into many small tasks on many less powerful CPUs, versus a small number of large tasks on few powerful CPUs.]
・Many/less powerful CPUs: large number of small tasks
・Few/powerful CPUs: small number of large tasks
Which is more efficient and easier?
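One way to make the load-balancing question concrete is a toy model, assuming a job of N equal, indivisible work units spread as evenly as possible, so the busiest CPU holds ceil(N/P) units and every other CPU idles until it finishes. The function name and the N = 100 example are hypothetical, not measurements:

```python
import math


def balance_efficiency(n_units: int, n_cpus: int) -> float:
    """Fraction of aggregate CPU time spent on useful work.

    Toy model (assumption): n_units equal, indivisible tasks are spread as
    evenly as possible, so the busiest CPU runs ceil(n_units / n_cpus) tasks
    and the others wait for it.
    """
    busiest = math.ceil(n_units / n_cpus)
    return n_units / (n_cpus * busiest)


if __name__ == "__main__":
    # Same job of 100 units on few powerful CPUs vs. many less powerful CPUs.
    for cpus in (8, 64):
        print(f"{cpus} CPUs: efficiency {balance_efficiency(100, cpus):.4f}")
```

With 8 CPUs about 96% of CPU time is useful (100/104), while 64 CPUs drop to about 78% (100/128); because every unit scales with CPU speed, the result does not depend on the per-CPU rate.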
Communication Overhead
[Figure: many less powerful CPUs linked by many paths, versus few powerful CPUs linked by few paths.]
・Many/less powerful CPUs: large number of small tasks; low bandwidth and many paths among CPUs
・Few/powerful CPUs: small number of large tasks; high bandwidth and few paths among CPUs
Which is more efficient and easier?
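The trade-off on this slide is often reasoned about with a latency-bandwidth cost model. A sketch, assuming every message pays a fixed start-up latency and full connectivity needs one path per CPU pair; the 10 µs latency is a hypothetical value, while 8 GB/s is the SX-6 per-node interconnect figure:

```python
def all_pairs_paths(n_cpus: int) -> int:
    """Point-to-point paths needed to connect every pair of CPUs directly."""
    return n_cpus * (n_cpus - 1) // 2


def transfer_time(total_bytes: float, n_messages: int,
                  latency_s: float, bandwidth_bytes_per_s: float) -> float:
    """Latency-bandwidth model: each message pays a fixed latency, and the
    whole payload streams at the given bandwidth."""
    return n_messages * latency_s + total_bytes / bandwidth_bytes_per_s


if __name__ == "__main__":
    print("paths, 8 CPUs:", all_pairs_paths(8))
    print("paths, 64 CPUs:", all_pairs_paths(64))
    # Hypothetical: 1 GB exchanged at 8 GB/s with 10 us latency per message,
    # sent either as one large message or as 100,000 small ones.
    GB = 1e9
    for msgs in (1, 100_000):
        print(f"{msgs} messages: {transfer_time(1 * GB, msgs, 10e-6, 8 * GB):.5f} s")
```

Under these assumptions, splitting the same gigabyte into 100,000 small messages raises the time from about 0.125 s to about 1.125 s, and going from 8 to 64 CPUs raises the number of all-pairs paths from 28 to 2016.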
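Synchronization
[Figure: fork/join of many small tasks on many less powerful CPUs, versus fork/join of a small number of large tasks on few powerful CPUs.]
・Many/less powerful CPUs: large number of small tasks
・Few/powerful CPUs: small number of large tasks
Which is more efficient and easier?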
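A toy fork/join cost model illustrates the synchronization question at equal aggregate peak (8 CPUs of 8 GFLOPS versus 64 CPUs of 1 GFLOPS). It assumes compute divides perfectly and each join is a tree barrier costing a fixed amount per level, i.e. ceil(log2 P) levels; the workload, join count, and barrier cost are all hypothetical:

```python
import math


def fork_join_time(work_gflop: float, n_cpus: int, gflops_per_cpu: float,
                   n_joins: int, barrier_step_s: float) -> float:
    """Run time under a toy fork/join model (assumptions): compute divides
    perfectly across CPUs, and each join is a tree barrier costing
    barrier_step_s per level, with ceil(log2(n_cpus)) levels."""
    compute = work_gflop / (n_cpus * gflops_per_cpu)
    levels = math.ceil(math.log2(n_cpus)) if n_cpus > 1 else 0
    return compute + n_joins * levels * barrier_step_s


if __name__ == "__main__":
    # Same aggregate peak (64 GFLOPS); hypothetical work and barrier cost.
    few = fork_join_time(640.0, 8, 8.0, 1_000_000, 1e-6)
    many = fork_join_time(640.0, 64, 1.0, 1_000_000, 1e-6)
    print(f"8 x 8 GFLOPS:  {few:.1f} s")
    print(f"64 x 1 GFLOPS: {many:.1f} s")
```

Both configurations spend 10 s computing, but the 64-CPU case pays twice as many barrier levels per join (6 versus 3), so it finishes in about 16 s instead of about 13 s.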
NEC’s Approach for Capability Computing
(SX-6 Systems Configuration)
[Figure: SX-6 configuration. IXS: full non-blocking X-bar, 8GB/sec bisection bandwidth. Memory: large number of independent memory banks (4096 banks). Node: full non-blocking X-bar (256GB/sec), 32GB/sec bandwidth, 8GF/CPU vector CPUs.]
・Few but powerful CPUs with vector processing
・Powerful SMP
・High bandwidth with non-blocking X-bar
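Two pieces of arithmetic sit behind this configuration. Dividing the 256GB/sec node crossbar among 8 CPUs gives 32GB/sec, consistent with reading the 32GB/sec figure as per-CPU bandwidth, or 4 bytes per flop at 8 GFLOPS. The many banks matter for strided vector access; the sketch assumes simple low-order interleaving (word address mod 4096), which is an illustration and not a description of the actual SX-6 bank mapping:

```python
from math import gcd

NODE_CROSSBAR_GBPS = 256   # full non-blocking crossbar per node (slide figure)
CPUS_PER_NODE = 8
GFLOPS_PER_CPU = 8
MEMORY_BANKS = 4096        # independent memory banks (slide figure)


def bytes_per_flop(bandwidth_gbps: float, gflops: float) -> float:
    """Memory bytes available per floating-point operation at peak."""
    return bandwidth_gbps / gflops


def banks_touched(stride_words: int, banks: int = MEMORY_BANKS) -> int:
    """Distinct banks hit by a long strided access, assuming low-order
    interleaving (bank = word address mod number of banks)."""
    return banks // gcd(stride_words, banks)


if __name__ == "__main__":
    per_cpu = NODE_CROSSBAR_GBPS / CPUS_PER_NODE
    print("GB/s per CPU:", per_cpu)
    print("bytes per flop:", bytes_per_flop(per_cpu, GFLOPS_PER_CPU))
    for stride in (1, 2, 3, 4096):
        print(f"stride {stride}: {banks_touched(stride)} banks")
```

Under this mapping, unit and odd strides spread over all 4096 banks, while a stride equal to the bank count hits a single bank, which is why a large bank count helps sustain vector loads.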
SX-6 vs SX-5
[Figure: improvement ratio of user time (SX-5 user time / SX-6 user time, 0.5 to 2.5) for climate codes versus vector operation ratio (97.0% to 100.0%), SX-5 [8GF] vs SX-6 [8GF].]
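The horizontal axis matters because SX-5 and SX-6 share an 8 GFLOPS vector peak but differ in cycle time (4.0 ns versus 2.0 ns). A simple Amdahl-style model, offered only as one possible reading of the chart and not fitted to the measured climate codes, assumes scalar work speeds up with the clock while vector work does not; the vector-to-scalar time ratio `r` is hypothetical, and memory bandwidth and other differences between the machines are ignored:

```python
def user_time(vector_ratio: float, scalar_time: float, vector_time: float) -> float:
    """Normalized run time: the non-vectorized fraction runs at scalar speed,
    the vectorized fraction at vector speed (Amdahl-style model)."""
    return (1.0 - vector_ratio) * scalar_time + vector_ratio * vector_time


def sx5_over_sx6(vector_ratio: float, r: float = 0.02) -> float:
    """SX-5 user time / SX-6 user time under stated assumptions: scalar time
    per operation halves with the clock (4.0 ns -> 2.0 ns), vector time per
    operation is equal (both 8 GFLOPS peak), and r is a hypothetical
    vector-to-scalar time ratio on SX-6."""
    sx6 = user_time(vector_ratio, 1.0, r)
    sx5 = user_time(vector_ratio, 2.0, r)
    return sx5 / sx6


if __name__ == "__main__":
    for pct in (97.0, 98.0, 99.0, 100.0):
        print(f"{pct:5.1f}%: ratio {sx5_over_sx6(pct / 100.0):.2f}")
```

In this model the ratio falls toward 1.0 as the vector operation ratio approaches 100%, because the only difference it captures is the faster scalar clock.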
Performance on SX-6/SX-5
(Electromagnetic Field Analysis)
[Figure: effective GFLOPS (0 to 24) versus peak GFLOPS (0 to 64) for SX-6 [8GF] and SX-5 [8GF] with 2, 4, and 8 CPUs.]
Performance on SX-6/SX-5
(Crystal Structure Analysis)
[Figure: effective GFLOPS (0 to 24) versus peak GFLOPS (0 to 64) for SX-6 [8GF] and SX-5 [8GF] with 2, 4, and 8 CPUs.]
Vector vs Scalar (Climate App.)
[Figure: effective GFLOPS (0 to 200) versus peak GFLOPS (0 to 600) for SX-6 (8GF/CPU) and SX-5 (8GF/CPU) at 32, 48, and 64 CPUs, compared with scalar servers at 10% and 15% efficiency.]
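The scalar-server lines on this chart follow from effective = efficiency × peak. A sketch inverting that relation to ask how much peak a scalar server would need for a given sustained rate; the 100 GFLOPS target is hypothetical and not read from the chart:

```python
def required_peak_gflops(effective_gflops: float, efficiency: float) -> float:
    """Peak GFLOPS needed to sustain a target rate at a given efficiency."""
    return effective_gflops / efficiency


if __name__ == "__main__":
    sx6_64cpu_peak = 64 * 8  # 64 CPUs x 8 GFLOPS, per the chart legend
    print("SX-6, 64 CPUs, peak:", sx6_64cpu_peak, "GFLOPS")
    target = 100.0  # hypothetical sustained rate, GFLOPS
    for eff in (0.10, 0.15):
        print(f"scalar server at {eff:.0%}: "
              f"{required_peak_gflops(target, eff):.0f} GFLOPS peak")
```

Sustaining 100 GFLOPS would take about 1000 GFLOPS of peak at 10% efficiency or about 667 at 15%, both more than the 512 GFLOPS peak of a 64-CPU SX-6.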
Technological Competence
All technologies for high performance computing are available internally within NEC:
– Semiconductor Devices
– Packaging
– Hardware Design
– Interconnections and Network
– Operating Systems Software
– Languages and Tools
– Applications Tuning and Support
Memory Chip Capacity and Transistors (Tr) per μ-Processor
[Figure: ITRS’99 projections from 1999 to 2014 of memory bits per chip (256M up to 64G), transistors per microprocessor chip (100M up to 1G), and feature size (nm).]
Simulating “Earth” on Supercomputer
Supercomputer simulation:
- can visualize
- can virtually experiment
- can forecast the future
However, current supercomputers are not powerful enough for further analysis of problems on Planet Earth.
[Figure: North American 24-hour precipitation; each CPU executes its share of the computation.]
The Earth Simulator: > 40TFLOPS, power x 1000 compared with the NEC SX-4; 1Q2002
Project of the Science & Technology Agency
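The note that each CPU executes its share of the computation describes domain decomposition. A minimal sketch, assuming a one-dimensional block split of latitude rows; the grid size and function name are hypothetical, and production climate models may decompose the domain differently:

```python
def block_partition(n_rows: int, n_cpus: int) -> list[tuple[int, int]]:
    """Split rows 0..n_rows-1 into n_cpus contiguous [start, stop) blocks
    whose sizes differ by at most one row."""
    base, extra = divmod(n_rows, n_cpus)
    blocks = []
    start = 0
    for cpu in range(n_cpus):
        size = base + (1 if cpu < extra else 0)  # first `extra` CPUs get one more row
        blocks.append((start, start + size))
        start += size
    return blocks


if __name__ == "__main__":
    # Hypothetical example: 180 latitude rows shared among 8 CPUs.
    for cpu, (lo, hi) in enumerate(block_partition(180, 8)):
        print(f"CPU {cpu}: rows {lo}-{hi - 1}")
```

Giving the leftover rows to the first CPUs keeps every block within one row of the others, which is the load-balancing concern raised in the parallel processing section.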
HPC Road Map
[Figure: SX Series roadmap over 1995 to 2001 and beyond: SX-4, SX-5, SX-6, SX-6X, SX-6XX, and the Earth Simulator.]
Where NEC is
・Technology Leader in High Performance Computing
・Leading Supplier of HPC Platforms for Large Scale
Technical and Engineering Computing
・Key Contributor to Vector Supercomputer Development
・Committed to Development of Vector Supercomputing
END