Commodity Computing Clusters - next-generation supercomputers?
Paweł Pisarczyk, ATM S. A.
pawel.pisarczyk@atm.com.pl
Agenda
• Introduction
• Supercomputer classification
• Architecture and implementations
• Commodity clusters
• Processors
• Operating systems
• Summary
Supercomputer
• “A supercomputer is a device for turning compute-bound problems into I/O-bound problems” - Seymour Cray
• A supercomputer is a computer system that leads
the world in terms of processing capacity,
particularly speed of calculations, at the time of its
introduction.
source: http://en.wikipedia.org
Supercomputer History (1)
• 1945-50 - Manchester Mark I
• 1950-55 - MIT Whirlwind
• 1955-60 - IBM 7090 - 210 KFLOPS
• 1960-65 - CDC 6600 - 10.24 MFLOPS
• 1965-70 - CDC 7600 - 32.27 MFLOPS
• 1970-75 - CDC Cyber 76
Supercomputer History (2)
• 1975-80 - Cray-1 - 160 MFLOPS
• 1980-85 - Cray X-MP - 500 MFLOPS
• 1985-90 - Cray Y-MP - 1.3 GFLOPS
• 1990-95 - Fujitsu Numerical Wind Tunnel - 236 GFLOPS
• 1995-00 - Intel ASCI Red - 2.150 TFLOPS
• 2000-02 - IBM ASCI White, SP Power3 375 MHz - 7.226 TFLOPS
• 2002-03 - NEC Earth Simulator - 35 TFLOPS
Supercomputer Classes (1)
• General-purpose supercomputers:
– vector processing machines - the same operation
carried out on a large amount of data simultaneously
– tightly connected cluster computers (NUMA) - communication-oriented architectures engineered from the
ground up, based on high-speed interconnects and a large
number of processors
– commodity clusters - a collection of a large number of
commodity PCs (COTS) interconnected by a high-bandwidth, low-latency network
Supercomputer Classes (2)
• Special-purpose supercomputers - high
performance computing devices with a hardware
architecture dedicated to solving a single problem
(equipped with custom ASICs or FPGA chips)
Examples:
– Deep Blue
– GRAPE for astrophysics
Flynn taxonomy - 1972 (1)
• SISD - Single Instruction Single Data (DEC, Sun
Microsystems, PC)
• SIMD - Single Instruction Multiple Data
– computers with a large number of processing units (i.e.
ALUs) - CPP DAP Gamma II, Quadrics Apemille
– vector processing machines - NEC SX6, IA32 MMX
• MISD - Multiple Instruction Single Data
– theoretical model, no practical implementation
Flynn taxonomy - 1972 (2)
• MIMD - Multiple Instruction Multiple Data
– SM-MIMD - Shared Memory MIMD
• global address space
• SMP systems and ccNUMA systems
– DM-MIMD - Distributed Memory MIMD
• many nodes with local address spaces
• high-bandwidth, low-latency communication
• common NUMA architectures (Non Uniform Memory
Access)
• operating systems have to be communication-oriented
(Mach project)
SM-MIMD implementations
• S-COMA - Simple Cache-Only Memory
Architecture
– common SMP systems
• ccNUMA - Cache Coherent NUMA
– SGI Origin 3000
– SGI Altix 3000
– HP SuperDome
S-COMA (SMP)
[Diagram: CPU 0, CPU 1, ... CPU N, each with its own L2 cache, all attached to one shared RAM]
ccNUMA
[Diagram: CPU 0 ... CPU N, each with its own L2 cache; L3 caches sit between the L2 caches and the memory banks RAM 0 ... RAM K]
ccNUMA implementation
SGI Altix 3000 (ccNUMA)
• 64 Itanium 2 (IA64) processors
• C-brick modules with 2 CPUs and ASIC SHUB
• NUMAflex, NUMAlink interconnects (6.4 GB/s,
2.4 GB/s)
• Modified Linux kernel (2.6 NUMA support)
DM-MIMD implementations
• Massively parallel systems (NUMA)
– communication oriented architecture
– low-latency, high-bandwidth interconnects
– topologies: hypercube, torus, tree
– butterfly networks, Omega networks
– communication engineered from the ground up
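The topologies listed above differ in which nodes are directly linked. The following is a minimal sketch, not taken from the presentation, of how neighbor sets could be computed for a hypercube and a torus. The function names `hypercube_neighbors` and `torus_neighbors` are hypothetical and chosen only for illustration.

```python
def hypercube_neighbors(node, dimensions):
    """Neighbors of `node` in a hypercube with 2**dimensions nodes.

    Two nodes are linked when their binary labels differ in exactly one bit.
    """
    return [node ^ (1 << bit) for bit in range(dimensions)]


def torus_neighbors(coord, shape):
    """Neighbors of `coord` in a torus whose size along each axis is `shape`.

    Every axis wraps around, so a node has one neighbor in each direction
    along each axis.
    """
    neighbors = []
    for axis, size in enumerate(shape):
        for step in (-1, 1):
            moved = list(coord)
            moved[axis] = (coord[axis] + step) % size
            neighbors.append(tuple(moved))
    return neighbors


# Node 5 (binary 101) in a 3-dimensional hypercube.
print(hypercube_neighbors(5, 3))
# Corner node of a 4 x 4 x 4 torus; the wrap-around links lead to coordinate 3.
print(torus_neighbors((0, 0, 0), (4, 4, 4)))
```

In a hypercube the number of links per node grows with the logarithm of the node count, while in a 3-D torus it stays at six regardless of machine size.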
DM-MIMD implementations
• Commodity clusters
– a cluster is a collection of connected, independent
computers working in unison to solve a problem
– COTS technology
– nodes are interconnected by Ethernet LAN, Myrinet,
QsNet ELAN etc.
– computation can be performed by using popular
programming toolkits and frameworks: OpenMP, MPI
– clusters require dedicated management software
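Cluster programs are commonly written in a message-passing style: the data is split across nodes, each node works on its own part, and the partial results are combined. The sketch below only simulates that pattern with Python threads and queues standing in for nodes and the network. It is not MPI, and the names `worker` and `cluster_sum` are hypothetical.

```python
import queue
import threading


def worker(rank, inbox, outbox):
    """One simulated node: receive a chunk, sum it locally, send the result back."""
    chunk = inbox.get()              # "receive" the data assigned to this node
    outbox.put((rank, sum(chunk)))   # "send" the partial result to the root


def cluster_sum(data, node_count):
    """Scatter `data` over `node_count` simulated nodes and reduce the partial sums."""
    inboxes = [queue.Queue() for _ in range(node_count)]
    results = queue.Queue()
    nodes = [
        threading.Thread(target=worker, args=(rank, inboxes[rank], results))
        for rank in range(node_count)
    ]
    for node in nodes:
        node.start()
    # Scatter: node `rank` gets every node_count-th element starting at `rank`.
    for rank, inbox in enumerate(inboxes):
        inbox.put(data[rank::node_count])
    # Reduce: collect one partial sum per node and add them up.
    total = sum(results.get()[1] for _ in range(node_count))
    for node in nodes:
        node.join()
    return total


print(cluster_sum(list(range(1, 101)), 4))
```

On a real cluster each node would be a separate machine with its own memory, and the queue operations would be replaced by network sends and receives.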
NUMA implementations
Cray T3E-1350
• Processor: Alpha 21164 675 MHz
• Number of CPUs: 40 - 2176
• 3-D Torus topology
• Operating system: UNICOS/mk - microkernel-based
• Peak performance: 3 TFLOPS
Commodity cluster implementation (1)
Linux Networx/Quadrics
• Processor: Intel Xeon 2.4 GHz
• CPUs: 2304
• Interconnections: QsNet ELAN3
• Operating system: Linux + management tools +
Lustre Cluster File System
• Linpack performance: 7.6 TFLOPS
• 3rd computer on the TOP500 list
• Developed for Lawrence Livermore National
Laboratory in 2002
Commodity cluster implementation (2)
HP XC6000 Cluster (XC3000 Cluster)
• Processor: Intel Itanium 2 6M 1.5 GHz (Intel Xeon 3 GHz)
• Node: HP Integrity rx2600 (HP ProLiant DL380)
• Number of processors: 34-512
• Interconnections: QsNet ELAN3 (Myricom Myrinet XP)
• Operating system: Linux + SSI Middleware +
management tools + Lustre Cluster File System
• Peak performance: 204 GFLOPS with 34 CPUs, 3 TFLOPS with 512 CPUs
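The peak figures quoted for these machines can be reproduced as CPU count x clock frequency x floating-point operations per cycle. The sketch below is a back-of-the-envelope check, not part of the original slides. The per-cycle values are assumptions: 4 for Itanium 2 (two fused multiply-add units) and 2 for the Alpha 21164 (one add plus one multiply). The function name `peak_gflops` is hypothetical.

```python
def peak_gflops(cpu_count, clock_ghz, flops_per_cycle):
    """Theoretical peak in GFLOPS: CPUs x clock (GHz) x floating-point ops per cycle."""
    return cpu_count * clock_ghz * flops_per_cycle


ITANIUM2_FLOPS_PER_CYCLE = 4     # assumption: two fused multiply-add units
ALPHA_21164_FLOPS_PER_CYCLE = 2  # assumption: one add and one multiply per cycle

# HP XC6000 with 34 and 512 Itanium 2 CPUs at 1.5 GHz.
print(peak_gflops(34, 1.5, ITANIUM2_FLOPS_PER_CYCLE))
print(peak_gflops(512, 1.5, ITANIUM2_FLOPS_PER_CYCLE))
# Largest Cray T3E-1350 configuration: 2176 Alpha 21164 CPUs at 675 MHz.
print(peak_gflops(2176, 0.675, ALPHA_21164_FLOPS_PER_CYCLE))
```

Under these assumptions the results are 204 GFLOPS, about 3.07 TFLOPS and about 2.94 TFLOPS, consistent with the rounded figures on the slides. Peak values are upper bounds; measured Linpack results, which the TOP500 list ranks by, are lower.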
Commodity Clusters - software
• Operating system - Linux or SSI Linux (Single
System Image)
• Platform for specialized applications for science,
engineering and business (simulation, modeling,
data mining)
• Distributed computation environments are used for
software development (OpenMP, MPI)
• Common supercomputer applications require
porting to clusters
Performance Scaling
[Diagram labels: Scale Right; Scale-Out (Cluster); Scale-Up (SMP, ccNUMA)]
Processors (1)
• Many types of existing processors are used in
supercomputers
• Microprocessor development directions:
– Increasing clock frequency and the speed of instruction
stream processing
– Processing a large collection of data with a single processor
instruction - SIMD
– Control path multiplication – multithreading
Processors (2)
• Vector processors
– NEC SX-6
– Cray (Cray X1)
• RISC processors
– MIPS
– IBM Power4
– Alpha
• CISC processors
– IA32
– AMD x86-64
• VLIW processors
– IA64
Intel Itanium 2 features
• State-of-the-art, unconventional 64-bit architecture
• New programming model implementing the VLIW paradigm
• EPIC technology – Explicitly Parallel Instruction Computing –
the compiler determines instruction dependencies and tells the processor
how to process the instruction stream in parallel
• Many registers (128 64-bit), register stack management
• 6 GFLOPS peak performance
• The full advantages of the processor can be exploited only with a dedicated
compiler
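The EPIC idea that the compiler, not the hardware, decides which instructions may run in parallel can be illustrated with a toy model. The sketch below splits an in-order instruction list into groups with no register dependencies inside a group. It is a simplified illustration under stated assumptions, not how any real IA64 compiler works: it ignores memory dependencies, functional-unit limits, bundle templates and instruction reordering. The name `group_instructions` and the instruction encoding are hypothetical.

```python
def group_instructions(instructions):
    """Split an in-order instruction list into groups that are safe to issue together.

    Each instruction is (dest_register, source_registers). A new group starts
    whenever an instruction reads or writes a register already written in the
    current group (a read-after-write or write-after-write dependency).
    """
    groups, current, written = [], [], set()
    for dest, sources in instructions:
        if dest in written or written.intersection(sources):
            groups.append(current)
            current, written = [], set()
        current.append((dest, sources))
        written.add(dest)
    if current:
        groups.append(current)
    return groups


program = [
    ("r1", ("r2", "r3")),  # r1 = r2 op r3
    ("r4", ("r5", "r6")),  # independent of the first instruction
    ("r7", ("r1", "r4")),  # needs r1 and r4, so it starts a new group
    ("r8", ("r2", "r5")),  # independent of r7, stays in the same group
]

for number, group in enumerate(group_instructions(program)):
    print(number, [dest for dest, _ in group])
```

Here the four instructions fall into two groups of two, so a processor with enough functional units could issue each pair in the same cycle without checking dependencies itself.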
Operating systems
• Monolithic kernel based OSs - UNIX (modification
of existing solutions)
– BSD
– Solaris
– Irix
– Linux
• Microkernel based OSs
– Mach
Microkernel architecture
[Diagram: Task A, Task B and Task C, kernel, hardware]
Summary
• Today there are many supercomputer
architectures
• Both vector processors and common RISC, CISC,
VLIW chips are used for supercomputers
• Commodity clusters running Linux are
an attractive way to build a
supercomputer
TOP 500 list (1)
1. Earth Simulator, NEC - 35.86 TFLOPS
2. HP Alphaserver SC, HP - 13.88 TFLOPS
3. Linux Networx / Quadrics IA32 - 7.634 TFLOPS
TOP 500 list (2)
Source: http://www.top500.org/list/2003/06/