CAS2001

NEC Supercomputers for Meteorological Applications
Road Map and Product Strategy
Oct. 29, 2001
Tadashi Watanabe
Solutions /

History of High Performance Computers
[Figure: peak performance in FLOPS (log scale, 10^6 to 10^13, with M, G and T marks) versus year (1970 to 2005). Vector and scalar systems are marked separately, and the eras are labeled Vector, Multiprocessors and Microprocessors. Systems plotted: CDC6600, CDC7600, TI ASC, STAR-100, ILLIAC IV, CRAY-1, CYBER205, S-810/20, VP-200, SX-2, CRAY Y-MP, S-820/80, CRAY Y-MP8, VP2600, SX-3/44, CRAY C90, SX-3/44R, S3800/480, CRAY T932, CM-5, VPP500/222, SR2201, T3E, SX-4/512, VPP700/512, SX-5/512, VPP5000, ASCI, ASCI White, SR8000F1, SX-6, ASCI Q, Earth Simulator.]

Architecture of Supercomputers
[Figure: evolution of supercomputer architectures, where each step answers the limitation of the step before it.]
・Scalar Processing (Mainframe, CDC6600/7600): performance limitation by scalar processing
・Vector Processing, Memory to Memory (CYBER200): bottleneck in memory throughput
・Vector Processing, Vector Register, with vectorizing compiler (CRAY-1, SX-2, VP-200, S810/S820): performance limitation by single processor
・Shared Memory Multiprocessors, with parallelizing compiler (CRAY XMP/YMP, CRAY-C90/T90, SX-3/SX-4/SX-5, VP2000, S3800): bottleneck in memory throughput
・Distributed Memory Parallel Processor, connected by a network (VPP500, T3E, SP-2, CM5, nCUBE, PARAGON): difficult to code
・Distributed Shared Memory, SMP nodes connected by a network (SX-5/SX-6, RS6000/SP, O2K, TX7)

Capacity Computing and Capability Computing …

Capacity Computing
・Goals: workload and throughput; wall clock time is secondary
・Many small problems - not challenging
・Fit on μ-P based MPP or workstation clusters

Capability Computing
・Goal: wall clock time (TAT)
・Large critical problems - do not fit in conventional systems
・Best fit on SMP with powerful processors

Products for Capacity and Capability Computing
[Figure: products positioned by performance for capability computing against performance for capacity computing (throughput).]

Vector Processor vs Scalar Processor
・Vector: Capability Computing
・Scalar: Capacity Computing
[Figure: application areas placed by data size (small to large) against arithmetic operations (small to large), divided into a vector-oriented and a scalar-oriented region. Applications shown: Weather/Climate, Genome, Chemistry, Crash, CFD, Structural Analysis. Scalar product shown: AzusA, IA-64 architecture, Itanium (800 MHz), max 16-way, max 64 GB shared memory.]

AzusA (code name)
・NEC’s strategic Itanium product
・The world’s first large-scale Itanium server in operation
・Leverages NEC’s expertise in supercomputers and mainframes to develop highly scalable and reliable Itanium servers

AzusA Features
Advanced features enabled by NEC's original chipset
• Based on expertise in supercomputers and mainframes
[Figure: system diagram with Cell#0 to Cell#3 (CPU and MEM in each cell), the NEC chipset, a PCI box, 16 Intel Itanium(TM) processors and 8 disk drives.]
• High Performance: 16 Intel Itanium(TM) processors
• Large Memory Space: 64-bit addressing, 64 GB main memory
• Large Configuration: 128 PCI slots (33 MHz) or 64 slots (66 MHz)
• High Availability: replaceable parts are hot-swappable (CPU cell, PCI card, fan, power supply); data paths are ECC and/or parity protected
• Flexibility: partitioning (into up to 4 systems)
Higher scalability, availability, and flexibility

NEC Itanium Server Roadmap
[Figure: roadmap by product segment (High-End, Midrange, Low-End) over 2000 to 2003, across the processor generations Itanium, McKinley (2002) and Madison (2003), with scalability increasing toward future products. Labels: Itanium 16CPU (AzusA); 16-32 CPU; 16, 32-512; 32-512 CPU future products; McKinley 8CPU; Madison 8CPU; Itanium 4CPU; McKinley 4CPU; Madison 4CPU.]
Note: plan subject to change

SX-6: The facts

SX-Series History
◆THE LATEST TECHNOLOGY IN THE SX-SERIES
NEC INTRODUCES SX-6: A NEW GENERATION OF SUPERCOMPUTERS
・2001 SX-6 Series (NEW GENERATION): SINGLE-CHIP VECTOR PROCESSOR; GREATER SCALABILITY
・1998 SX-5 Series: HIGH SUSTAINED PERFORMANCE; LARGE CAPACITY SHARED MEMORY
・1994 SX-4 Series: INNOVATIVE CMOS TECHNOLOGY; ENTIRELY AIR-COOLED
・1989 SX-3 Series: SHARED MEMORY, MULTI-FUNCTION PROCESSOR; UNIX OS
・1983 SX Series: THE FIRST COMPUTER IN THE WORLD SURPASSING 1 GFLOPS
Built on: ACCUMULATED HPC TECHNOLOGY; STATE-OF-THE-ART BOARD PACKAGING TECHNOLOGY; GLOBAL ALLIANCES (CRAY, BULL); COLLABORATION WITH ISVs AND USERS
Goal: TO BE THE MARKET LEADER IN THE LARGE SCALE HPC MARKET

SX-6 single node system
• High performance supercomputer
• Ultra-high bandwidth shared memory subsystem
• Maximum 8 processors, 8 Gigaflops each
• Maximum 64 Gigabyte memory
• Maximum 64 Gigaflops per node

SX-6 multi node system
• Maximum 128 nodes
• Maximum 1024 CPUs, max 8 TFLOPS
• Internode crossbar switch
• 8 GB/s interconnect bandwidth per node
• 1 TB/s maximum interconnect bandwidth per system

SX-6 system software
• Proven operating system: SUPER-UX
• Development tools: C, C++, Fortran90, MPI, OpenMP, Vampir/SX, TotalView
• Enhanced multi-node batch system
• Enhanced system management tools
• User friendly middleware

Focus Markets
・Environment & Meteorology: DMI, DKRZ, CHMI, IAP, INGV, … MSC, INPE, BOM, KMA, JAMSTEC, NIES, ...
・Aerospace: NLR, DLR, EADS Airbus, ONERA, NAL, ...
・Automotive: IFP, Mecalog, Volkswagen, Porsche, DaimlerChrysler, Renault, Toyota, Mazda, Nissan, ...
・Seismic: Veritas, IFP, ...
・Research: HLRS Stuttgart, CSCS, MPG, … NIFS, Tohoku University, Osaka University, ...
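The single-node and multi-node maxima quoted for the SX-6 are consistent with one another, which a few lines of arithmetic can confirm. The sketch below is illustrative only: the constant and function names are invented for this example, and the inputs are the per-CPU, per-node and per-system figures from the SX-6 slides.

```python
# Inputs are the maxima quoted on the SX-6 slides.
# All names here are illustrative, not NEC terminology.
GFLOPS_PER_CPU = 8          # peak per vector CPU
CPUS_PER_NODE = 8           # maximum CPUs in one node
MAX_NODES = 128             # maximum nodes in a multi-node system
IXS_GB_PER_S_PER_NODE = 8   # interconnect bandwidth per node


def node_peak_gflops() -> int:
    """Peak GFLOPS of one fully populated node."""
    return GFLOPS_PER_CPU * CPUS_PER_NODE


def system_cpus() -> int:
    """CPU count of a fully populated multi-node system."""
    return CPUS_PER_NODE * MAX_NODES


def system_peak_tflops() -> float:
    """Peak TFLOPS of the full system, using 1 TFLOPS = 1000 GFLOPS."""
    return node_peak_gflops() * MAX_NODES / 1000


def system_interconnect_gb_per_s() -> int:
    """Sum of the per-node interconnect bandwidth over all nodes."""
    return IXS_GB_PER_S_PER_NODE * MAX_NODES


if __name__ == "__main__":
    print("node peak (GFLOPS):  ", node_peak_gflops())
    print("system CPUs:         ", system_cpus())
    print("system peak (TFLOPS):", system_peak_tflops())
    print("interconnect (GB/s): ", system_interconnect_gb_per_s())
```

The products come to 64 GFLOPS per node, 1024 CPUs, roughly 8.2 TFLOPS (quoted as 8 TFLOPS) and 1024 GB/s of aggregate interconnect bandwidth (quoted as 1 TB/s).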
SX-6: The technology

SX Series Processor Evolution (Vector CPU)
・SX-4: 8 vector pipes; 457 x 386 mm; performance 2 GFLOPS at 8.0 ns; LSI 0.35 μm CMOS; 37 chips
・SX-5: 16 vector pipes; 225 x 225 mm; performance 8 GFLOPS at 4.0 ns; LSI 0.25 μm CMOS; 32 chips
・SX-6: performance 8 GFLOPS at 2.0 ns; LSI 0.15 μm CMOS; single-chip processor

SX Series Memory Evolution
・SX-4: 457 x 386 mm; capacity 256 MB/card; memory chip 4 Mb SSRAM, 32 Mb SDRAM
・SX-5: 457 x 386 mm; capacity 4-8 GB/card; memory chip 64-128 Mb SDRAM
・SX-6: 105 x 176 mm; capacity 2 GB/card; memory chip 256 Mb DDR-SDRAM

Size Comparison
・SX-6: CPU 128 GFLOPS (64 GF x 2 nodes); memory 128 GB; 64 GF per cabinet
・SX-5: CPU 160 GFLOPS; memory 128 GB
[Figure: cabinet footprint drawings for the two systems, with dimension labels 2.0 m, 1.1 m, 2.8 m, 1.8 m, 3.2 m and 6.8 m ~ 7.4 m.]

SX-6: Parallel Processing and Performance

Keys for Efficiencies in Parallel Processing
・Load Balancing
・Communication Overhead
・Synchronization

Load Balancing
・Many/less powerful CPUs: the job is split into a large number of small tasks
・Few/powerful CPUs: the job is split into a small number of large tasks
Which is more efficient and easier?

Communication Overhead
・Many/less powerful CPUs: a large number of small tasks; low bandwidth and many paths among CPUs
・Few/powerful CPUs: a small number of large tasks; high bandwidth and few paths among CPUs
Which is more efficient and easier?

Synchronization
・Many/less powerful CPUs: fork and join across a large number of small tasks
・Few/powerful CPUs: fork and join across a small number of large tasks
Which is more efficient and easier?
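One way to read the "many paths among CPUs" point is to count the CPU pairs that may need to exchange data. The sketch below assumes that every pair of CPUs is a potential communication path, which is an assumption of this example rather than something the slides state. It compares two hypothetical ways of reaching the same 64 GFLOPS peak: 8 CPUs of 8 GFLOPS, as in one SX-6 node, and 64 CPUs of 1 GFLOPS.

```python
def cpus_needed(total_peak_gflops: int, gflops_per_cpu: int) -> int:
    """Smallest CPU count whose combined peak reaches the target (ceiling division)."""
    if total_peak_gflops < 0 or gflops_per_cpu <= 0:
        raise ValueError("peak must be non-negative and per-CPU rate positive")
    return -(-total_peak_gflops // gflops_per_cpu)


def pairwise_paths(num_cpus: int) -> int:
    """Number of distinct CPU pairs, assuming any two CPUs may communicate."""
    if num_cpus < 0:
        raise ValueError("num_cpus must be non-negative")
    return num_cpus * (num_cpus - 1) // 2


if __name__ == "__main__":
    TARGET_PEAK_GFLOPS = 64  # one SX-6 node, per the slides
    # 8 GFLOPS/CPU is the SX-6 figure; 1 GFLOPS/CPU is a hypothetical weaker CPU.
    for per_cpu in (8, 1):
        n = cpus_needed(TARGET_PEAK_GFLOPS, per_cpu)
        print(f"{per_cpu} GFLOPS/CPU: {n} CPUs, {pairwise_paths(n)} CPU pairs")
```

Under this assumption the pair count grows roughly with the square of the CPU count (28 pairs for 8 CPUs, 2016 pairs for 64), which is the intuition behind preferring few, powerful CPUs.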
NEC’s Approach for Capability Computing (SX-6 Systems Configuration)
[Figure: system configuration. Nodes are connected by the IXS, a full non-blocking X-bar with 8 GB/s bisection bandwidth. Within a node, memory is organized as a large number of independent memory banks (4096 banks) and is reached through a full non-blocking X-bar (256 GB/s), with 32 GB/s bandwidth to the vector CPUs (8 GF/CPU).]
・Few but powerful CPUs with vector processing
・Powerful SMP
・High bandwidth with non-blocking X-bar

SX-6 vs SX-5
[Figure: climate codes. Improvement ratio of user time (SX-5 user time / SX-6 user time, axis 0.5 to 2.5) versus vector operation ratio (97.0% to 100.0%). Series: SX-5 [8GF], SX-6 [8GF].]

Performance on SX-6/SX-5 (Electro Magnetic Field Analysis)
[Figure: effective GFLOPS (0 to 24) versus peak GFLOPS (8 to 64) at 2, 4 and 8 CPUs. Series: SX-6 [8GF], SX-5 [8GF].]

Performance on SX-6/SX-5 (Crystal Structure Analysis)
[Figure: effective GFLOPS (0 to 24) versus peak GFLOPS (8 to 64) at 2, 4 and 8 CPUs. Series: SX-6 [8GF], SX-5 [8GF].]

Vector vs Scalar (Climate App.)
[Figure: effective Gflops (0 to 200) versus peak Gflops (0 to 600) at 32, 48 and 64 CPUs. Series: SX-6 (8GF/CPU), SX-5 (8GF/CPU), Scalar Server (10% eff.), Scalar Server (15% eff.).]
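The scalar reference lines in the Vector vs Scalar chart are labeled by efficiency, which is most naturally read as effective performance divided by peak performance. The sketch below works under that assumption, with efficiency held constant at every system size. The function names are illustrative, and the 100 Gflops target is an arbitrary example value, not a number read from the chart.

```python
def effective_gflops(peak_gflops: float, efficiency: float) -> float:
    """Effective rate under the assumption effective = peak * efficiency."""
    if not 0.0 <= efficiency <= 1.0:
        raise ValueError("efficiency must be between 0 and 1")
    return peak_gflops * efficiency


def peak_needed_gflops(target_effective_gflops: float, efficiency: float) -> float:
    """Peak rate required to sustain a target effective rate at a given efficiency."""
    if not 0.0 < efficiency <= 1.0:
        raise ValueError("efficiency must be in (0, 1]")
    return target_effective_gflops / efficiency


if __name__ == "__main__":
    CHART_MAX_PEAK = 600.0   # right-hand end of the chart's peak axis
    EXAMPLE_TARGET = 100.0   # hypothetical effective target, for illustration only
    for eff in (0.10, 0.15):  # the two scalar-server lines on the chart
        at_max = effective_gflops(CHART_MAX_PEAK, eff)
        needed = peak_needed_gflops(EXAMPLE_TARGET, eff)
        print(f"{eff:.0%} efficiency: {at_max:.0f} effective Gflops at 600 peak; "
              f"{needed:.0f} peak Gflops needed for {EXAMPLE_TARGET:.0f} effective")
```

At the 600 Gflops end of the peak axis this gives 60 and 90 effective Gflops for the 10% and 15% lines, so a scalar server at these efficiencies needs several times the peak rate to sustain a given effective rate.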
Technological Competence
All technologies for High Performance Computing are available internally within NEC:
- Semiconductor Devices
- Packaging
- Hardware Design
- Interconnections and Network
- Operating Systems Software
- Languages and Tools
- Applications Tuning and Support

Memory Chip and Tr in μ-Processor
[Figure: memory bits per chip, transistors per chip in microprocessors, and feature size in nm, plotted against year from 1999 to 2014. Source: ITRS’99.]

Simulating “Earth” on Supercomputer
Supercomputer simulation:
- can visualize
- can virtually experiment
- can forecast the future
However, current supercomputers are not powerful enough for further analysis of problems on Planet Earth.
[Figure: North American 24-hour precipitation simulation, in which each CPU executes its share of the computation.]
Power x 1000: from the NEC SX-4 to the Earth Simulator (> 40 TFLOPS, 1Q2002), a project of the Science & Technology Agency

HPC Road Map
[Figure: road map over 1995 to 2001 showing the SX Series (SX-4, SX-5, SX-6, SX-6X, SX-6XX) and the Earth Simulator.]

Where NEC is
・Technology Leader in High Performance Computing
・Leading Supplier of HPC Platforms for Large Scale Technical and Engineering Computing
・Key Contributor to Vector Supercomputer Development
・Committed to Development of Vector Supercomputing

END