From Beowulf to professional turn-key solutions
Einar Rustad, VP Business Development
The difference is in the software...
Outline
• Scali Background
• Clustering Rationale
• Scali Products
• Technology
History and Facts at a Glance
• History:
  • Based on a development project for High Performance SAR Processing (Military), 1994-1996 (concurrently with the Beowulf project at NASA)
  • Spin-off from Kongsberg Gruppen ASA, 1997
• Organisation:
  • 30 Employees
  • Head Office in Oslo, branch in Houston, sales offices in Germany, France, UK
• Main Owners:
  • Four Seasons Venture, SND Invest, Kongsberg, Intel Corp., Employees
Paderborn, PC2
1997: PSC1, 8 x 4 Torus, 64 Processors, P-3 300MHz, 19.2 GFlops
1998: PSC2, 12 x 8 Torus, 192 Processors, P-3 450MHz, 86.4 GFlops
A Major Software Challenge
[Chart: µP cycle time vs. DRAM access time in nanoseconds, log scale from 0.1 to 10000, over 1978-2000; the gap between processor and memory speed widens steadily]
Increasing Performance
• Faster Processors
  • Frequency
  • Instruction Level Parallelism
• Better Algorithms
  • Compilers
  • Brainpower
• Parallel Processing
  • Compilers
  • Tools (Profilers, Debuggers)
  • More Brainpower
[Chart: relative speedup, 1x to 13x, for successive tuning steps labelled B, F, F+C, F+Blk, F+Blk+C]
Clusters vs SMPs
Use of SMPs:
• Common Access to Shared Resources: Processors, Memory, Storage Devices
• Running Multiple Applications
• Running Multiple Instances of the Same Application
• Running Parallel Applications
Use of Clusters:
• Common Access to Shared Resources: Processors, Distributed Memory, Storage Devices
• Running Multiple Applications
• Running Multiple Instances of the Same Application
• Running Parallel Applications
Why Do We Need SMPs?
• Small SMPs make great nodes for building clusters!
• The most cost-effective cluster node is a dual-processor SMP
  • High volume manufacturing
  • High utilization of shared resources
    – Bus
    – Memory
    – I/O
Clustering Makes Mo(o)re Sense
• Microprocessor performance increases 50-60% per year
  • 1 year lag: 1.0 SHV Unit = 1.6 Proprietary Units
  • 2 year lag: 1.0 SHV Unit = 2.6 Proprietary Units
• Volume disadvantage
  • When volume doubles, cost is reduced to 90%
  • 1,000 Proprietary Units vs 1,000,000 SHV units => Proprietary Unit about 3 X more expensive
• 2 years lag and 1:1000 volume disadvantage => about 7 X worse price/performance (see the sketch below)
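These ratios follow directly from the two assumptions above: the 90%-per-doubling cost curve and the yearly performance lag. A minimal sketch that reproduces the slide's own numbers (the 2.6x lag factor and the unit volumes are taken from the bullets above):

#include <math.h>
#include <stdio.h>

int main(void) {
    /* Cost curve from the slide: each doubling of volume cuts unit cost to 90%. */
    double doublings = log2(1000000.0 / 1000.0);   /* 1,000 vs 1,000,000 units */
    double rel_cost  = pow(0.9, doublings);        /* SHV cost relative to proprietary */
    printf("Proprietary unit cost: %.1fx the SHV unit\n", 1.0 / rel_cost); /* ~2.9x */

    /* Two-year technology lag: 1.0 SHV unit = 2.6 proprietary units of performance. */
    double lag = 2.6;
    printf("Price/performance disadvantage: %.1fx\n", lag / rel_cost);     /* ~7.4x */
    return 0;
}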
Hardware Acquisition: Massive Savings!
Fan Systems (Bristol) compute server; savings made:
• Non-EDS compute server acquired by IE: £75k (Alpha/PC cluster with 24 proc.)
• EDS solution with 24 proc. (SGI Origin 2000): £300k; savings: £225k
• EDS solution with the same computing power (SGI Origin 2000): £1.2M; savings: £1.1M
[Chart: normalized £/MFlops per processor, scale 0 to 35: the proposed HPC platforms (Alpha EV6.7 21264 667MHz, Pentium IV 1.7GHz, Pentium III 800MHz) are very cost-effective; the EDS platforms (SUN Ultra 10 UltraSparc II 450MHz, SGI Origin 2000 R10k 250MHz) are high-cost]
(Source: RR Defence(E), Installation Engineering, Bristol; François Moyroud, February 14, 2001; UNCLASSIFIED)
Software Focal Points
• High Performance Communication: ScaMPI
  – Record performance (>380MB/s, <4µs latency; see the sketch below)
  – Rich set of functions
  – Debugging options (trace file generation)
  – Performance tuning with high precision timers
  – Easy to use
• Cluster Management: Scali Universe
  – Single System Environment
  – Remote Operation
  – Job Scheduling
  – Monitoring
  – Alarms
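Bandwidth and latency figures like these are normally obtained with a ping-pong microbenchmark. A minimal sketch in portable MPI; it runs unchanged under ScaMPI or any other MPI implementation (the repetition count and default message size are arbitrary choices, not Scali's):

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define REPS 1000

int main(int argc, char **argv) {
    int rank, size = argc > 1 ? atoi(argv[1]) : 8;   /* message size in bytes */
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    char *buf = malloc(size);

    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    for (int i = 0; i < REPS; i++) {
        if (rank == 0) {          /* bounce the message between ranks 0 and 1 */
            MPI_Send(buf, size, MPI_BYTE, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, size, MPI_BYTE, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(buf, size, MPI_BYTE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Send(buf, size, MPI_BYTE, 0, 0, MPI_COMM_WORLD);
        }
    }
    double t = (MPI_Wtime() - t0) / (2.0 * REPS);    /* one-way time */
    if (rank == 0)
        printf("%d bytes: %.2f us, %.1f MB/s\n", size, t * 1e6, size / t / 1e6);
    free(buf);
    MPI_Finalize();
    return 0;
}

Half the round-trip time at small messages approximates the latency; at large messages, bytes divided by one-way time approaches the sustainable bandwidth.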
What Kind of Interconnect?
• The choice of cluster interconnect depends entirely on the applications and the size of the cluster
• Ethernet: long latency, low bandwidth
  – Poor scalability
  – Suited mainly to embarrassingly parallel codes
• SCI: ultra-low latency, high bandwidth
  – Good scalability
  – All kinds of parallel applications
System Interconnects
High Performance Interconnect:
• Torus topology
• IEEE/ANSI Std 1596 SCI
• 667MBytes/s per segment per ring
• Shared address space
Maintenance and LAN Interconnect:
• 100Mbit/s or Gigabit Ethernet
• Channel bonding option
3-D Torus Topology
Distributed switching: each node connects its PCI bus through the PSB (PCI-SCI bridge) and the B-Link to three LC-3 link controllers, one per SCI ring in the X, Y and Z dimensions.
[Diagram: PCI-bus to PSB, B-Link to three LC-3 chips, driving the X, Y and Z SCI rings]
2D/3D Torus (D33X)
[Diagram: PCI at 532MB/s into the PSB66, B-Link at 640MB/s to three LC-3 link controllers driving 6 x 667MB/s SCI links]
Shared Nothing Data Transfers
[Diagram: user data travels from the application through a system buffer in host memory and out via the network adapter; the system is involved in every transfer]
Shared Address Space Data Transfers
[Diagram: two nodes; user data moves from the sending application's host memory through the network adapters directly into the receiving application's user data, bypassing the system buffers on both sides]
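The difference between the two figures is the number of copies and who makes them. A toy model in plain C: ordinary arrays stand in for the kernel buffers and for the SCI mapping (in the real mechanism a remote segment is mapped into the local address space, so a store is itself the transfer):

#include <stdio.h>
#include <string.h>

#define LEN 4096
static char user_src[LEN], sys_buf_tx[LEN], sys_buf_rx[LEN], user_dst[LEN];

/* Shared nothing: user data -> local system buffer -> (wire) ->
 * remote system buffer -> user data: three copies end to end. */
static int transfer_shared_nothing(void) {
    memcpy(sys_buf_tx, user_src, LEN);    /* copy 1: into the kernel buffer */
    memcpy(sys_buf_rx, sys_buf_tx, LEN);  /* copy 2: adapter moves it       */
    memcpy(user_dst, sys_buf_rx, LEN);    /* copy 3: out of the kernel      */
    return 3;
}

/* Shared address space: the remote user buffer is mapped locally,
 * so a single store sequence is the whole transfer. */
static int transfer_shared_address(void) {
    char *mapped_remote = user_dst;       /* stands in for an SCI mapping   */
    memcpy(mapped_remote, user_src, LEN); /* copy 1: direct remote store    */
    return 1;
}

int main(void) {
    printf("shared nothing:       %d copies\n", transfer_shared_nothing());
    printf("shared address space: %d copies\n", transfer_shared_address());
    return 0;
}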
A2A Scalability with 66MHz/64bit PCI
[Chart: aggregate all-to-all bandwidth (GByte/s, log scale) vs. number of nodes (1 to 10000, log scale) for ringlet, 2D-torus, 3D-torus and 4D-torus topologies against the per-node PCI limit, with marked points at 12, 144 and 1728 nodes]
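A rough model of why higher-dimensional tori keep scaling in this chart: all-to-all traffic is limited by the bisection of the network, which for a k-ary d-dimensional torus grows with k^(d-1), while the per-node PCI bus caps what each node can inject. A sketch under assumed simplifications (uniform traffic, the slide's 667MB/s links and 12-ary rings; this is an illustrative model, not Scali's own):

#include <math.h>
#include <stdio.h>

int main(void) {
    const double link = 0.667;  /* GB/s per SCI link (from the slide)     */
    const double pci  = 0.532;  /* GB/s per node, 64bit/66MHz PCI         */
    const int    k    = 12;     /* nodes per ring, matching 12/144/1728   */
    for (int d = 1; d <= 4; d++) {               /* ringlet, 2D, 3D, 4D   */
        double nodes = pow(k, d);
        double bisection = 2.0 * pow(k, d - 1) * link; /* links across the cut */
        double inject = nodes * pci / 2.0;       /* half the nodes send over   */
        printf("%dD, %6.0f nodes: all-to-all limit ~%7.1f GB/s\n",
               d, nodes, bisection < inject ? bisection : inject);
    }
    return 0;
}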
Scalability
[Chart: measured all-to-all aggregate bandwidth (MBytes/s, log scale from 1 to 100000) at 64-byte message size vs. number of nodes, 2 to 200]
Fluent Benchmarks
[Charts: six panels (Small class 1, 2, 3 and Medium class 1, 2, 3) plotting performance against number of CPUs (2 to 32) for Ethernet and SCI]
(Performance metric is "Jobs per Day")
Performance with Fluent
• A benchmark from the transportation industry
• Problem is partitioned along the principal axes and consists of about 4.5 million cells
• Dual Intel Xeon 2.0GHz, Intel 860 chipset, 3GB RDRAM per node
• Fluent 5.7

Jobs per day:
             4 CPUs    8 CPUs    16 CPUs
SCI          317.6     523.6     960.0
100baseT     290.9     411.4     587.8
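The scaling story in this table is easier to see as parallel efficiency relative to the 4-CPU run; a quick check (numbers copied from the table above, ideal speedup assumed linear):

#include <stdio.h>

int main(void) {
    double sci[] = {317.6, 523.6, 960.0};   /* jobs/day over SCI      */
    double eth[] = {290.9, 411.4, 587.8};   /* jobs/day over 100baseT */
    int cpus[]   = {4, 8, 16};
    for (int i = 1; i < 3; i++) {
        double ideal = (double)cpus[i] / cpus[0];
        printf("%2d CPUs: SCI %.0f%% efficient, 100baseT %.0f%% efficient\n",
               cpus[i],
               100.0 * (sci[i] / sci[0]) / ideal,
               100.0 * (eth[i] / eth[0]) / ideal);
    }
    return 0;
}

At 16 CPUs SCI retains roughly 76% of ideal speedup against about 51% for Fast Ethernet.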
Performance with Fluent (cont'd)
• NEW YORK CITY, NY -- September 25, 2001 -- Sun Microsystems, Inc. (Nasdaq: SUNW) today announced that the newly announced 900 MHz, 72-way Sun Fire™ 15K server outperformed a 1 GHz, 128-way IBM system by over 23 percent on the large FL5L1 dataset, using FLUENT™, a leading Computational Fluid Dynamics (CFD) software application.
• ARMONK, NY -- October 26, 2001 -- The IBM eServer p690, code-named "Regatta," today set a world record for processing speed on the important Fluent engineering benchmark, providing nearly 80 percent more power than the new Sun Fire 15K, which has twice as many processors and is nearly double the price.
Performance with Fluent (cont'd)
• FL5L3: Turbulent Flow Through a Transition Duct
  • Number of cells: 9,792,512
  • Cell type: hexahedral
  • Models: RSM turbulence
  • Solver: segregated implicit

Jobs per day:
• Scali TeraRack (AMD K7, 1400), 96 CPUs: 354.2
• IBM pSeries 690 Turbo (POWER4, 1300), 32 CPUs: 199.7
• Sun SunFire 15K (USIII, 900), 72 CPUs: 127.6
Performance with Fluent (cont'd)
Relative performance/price ratio (running the FL5L3 benchmark):
• Scali TeraRack (AMD K7, 1400+): 13.4
• IBM pSeries 690 Turbo (POWER4, 1300): 2.5
• Sun SunFire 15K (USIII, 900): 1.0
ClusterEdge™
• Universe XE
• SCI Interconnect HW
• High Performance Communication Libraries
  • ScaMPI
  • Scali IP
  • Shmem
  • Scali SAN
Platform Support
• Operating systems: RH 6.2, 7.0, 7.1, 7.2; SuSE 7.0, 7.1; Solaris 8
• Architectures: ia32 (PII, PIII, P4, AMD Athlon, Athlon MP); ia64 (Itanium); Alpha (EV6, EV67, EV68); SPARC (UltraSPARC III)
• x86 chipsets: 440LX, 440BX, 440GX, i840, i850, i860; VIA Apollo Pro 133A; ServerWorks LE, HE, WS (HE-SL)
• Itanium chipsets: Intel 460GX, HP zx1
• Athlon chipsets: VIA KX133, VIA KT133, AMD 750, AMD 760, AMD 760MP, AMD 760MPX
• Alpha chipsets: Tsunami/Typhoon 21272
• SCI boards: Dolphin D311/D312, D315, D316; Dolphin D33X series
Universe Architecture
[Diagram: GUIs on a remote workstation and on the control node (frontend) connect via TCP/IP sockets to a server daemon (S) on the frontend, which talks via TCP/IP to a node daemon (C) on each cluster node; the cluster nodes themselves are interconnected by SCI]
Scali Universe
• Multiple cluster management
• Common login per cluster
• Individual or grouped node management
[Screenshot: Scali Universe GUI]
Software Configuration Management
Nodes are categorized once. From then on, new software is installed by one mouse click, or with a single command.
System Monitoring
• Resource monitoring: CPU, memory, disk
• Hardware monitoring: temperature, fan speed
• Operator alarms on selected parameters at specified thresholds
Monitoring (cont'd)
[Screenshot: per-node monitoring graphs]
Events/Alarms
[Screenshot: events and alarms view]
OpenPBS integration
[Screenshot: OpenPBS job scheduling integrated in the Universe GUI]
Fault Tolerance
• 2D torus topology gives more routing options
• XY routing algorithm
• Node 33 fails
• Nodes on 33's ringlets become unavailable
• Cluster fractured with the current routing setting
[Diagram: 4 x 4 torus, nodes 11 to 44; with XY routing, failed node 33 takes its X and Y rings out of service]
Fault Tolerance
• Scali advanced routing algorithm:
  • From the Turn Model family of routing algorithms
  • All nodes but the failed one can be utilised as one big partition (see the sketch below)
[Diagram: the same 4 x 4 torus re-routed around failed node 33; the 15 surviving nodes form a single partition]
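A sketch of the routing argument on the 4 x 4 torus above, not Scali's actual implementation: under strict XY (dimension-order) routing many paths between surviving nodes must pass through the failed node, while routes allowed extra turns, as in the Turn Model family, can detour around it. The simulation uses unidirectional +X/+Y rings and 0-based coordinates, so the slide's node 33 becomes (2,2):

#include <stdio.h>

#define K 4                        /* 4 x 4 torus; node id = y*K + x        */
static const int DX = 2, DY = 2;   /* failed node, "33" in slide numbering  */

int main(void) {
    int dead = DY * K + DX;

    /* 1) Strict XY routing: travel the X ring first, then the Y ring. */
    int ok = 0, pairs = 0;
    for (int s = 0; s < K * K; s++)
        for (int d = 0; d < K * K; d++) {
            if (s == d || s == dead || d == dead) continue;
            pairs++;
            int x = s % K, y = s / K, blocked = 0;
            while (x != d % K) { x = (x + 1) % K; if (x == DX && y == DY) blocked = 1; }
            while (y != d / K) { y = (y + 1) % K; if (x == DX && y == DY) blocked = 1; }
            ok += !blocked;
        }
    printf("strict XY routing: %d of %d pairs still routable\n", ok, pairs);

    /* 2) Allow detours (extra turns): BFS over the surviving ring links. */
    int seen[K * K] = {0}, q[K * K], head = 0, tail = 0;
    seen[0] = 1; q[tail++] = 0;                 /* start from node (0,0)     */
    while (head < tail) {
        int v = q[head++], x = v % K, y = v / K;
        int nb[2] = { y * K + (x + 1) % K,      /* +X ring neighbour         */
                      ((y + 1) % K) * K + x };  /* +Y ring neighbour         */
        for (int i = 0; i < 2; i++)
            if (nb[i] != dead && !seen[nb[i]]) { seen[nb[i]] = 1; q[tail++] = nb[i]; }
    }
    int n = 0;
    for (int v = 0; v < K * K; v++) n += seen[v];
    printf("with detours: %d of %d surviving nodes form one partition\n", n, K * K - 1);
    return 0;
}

The BFS only shows that detours exist; the Turn Model's contribution is choosing which turns to allow so that the detours remain deadlock-free.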
Some Reference Customers
• Spacetec/Tromsø Satellite Station, Norway
• Norwegian Defense Research Establishment
• Parallab, Norway
• Paderborn Parallel Computing Center, Germany
• Fujitsu Siemens Computers, Germany
• Spacebel, Belgium
• Aerospatiale, France
• Fraunhofer Gesellschaft, Germany
• Lockheed Martin TDS, USA
• University of Geneva, Switzerland
• University of Oslo, Norway
• Uni-C, Denmark
• University of Lund, Sweden
• University of Aachen, Germany
• DNV, Norway
• DaimlerChrysler, Germany
• AEA Technology, Germany
• BMW AG, Germany
• Audi AG, Germany
• University of New Mexico, USA
• Max Planck Institut für Plasmaphysik, Germany
• University of Alberta, Canada
• University of Manitoba, Canada
• Etnus Software, USA
• Oracle Inc., USA
• University of Florida, USA
• deCODE Genetics, Iceland
• Uni-Heidelberg, Germany
• GMD, Germany
• Uni-Giessen, Germany
• Uni-Hannover, Germany
• Uni-Düsseldorf, Germany
• Linux NetworX, USA
• Magmasoft AG, Germany
• University of Umeå, Sweden
• University of Linköping, Sweden
• PGS Inc., USA
• US Naval Air, USA
Reference Customers (cont'd)
• Rolls Royce Ltd., UK
• Norsk Hydro, Norway
• NGU, Norway
• University of Santa Cruz, USA
• Jodrell Bank Observatory, UK
• NTT, Japan
• CEA, France
• Ford/Visteon, Germany
• ABB AG, Germany
• National Technical University of Athens, Greece
• Medasys Digital Systems, France
• PDG Linagora S.A., France
• Workstations UK, Ltd., England
• Bull S.A., France
• The Norwegian Meteorological Institute, Norway
• Nanco Data AB, Sweden
• Aspen Systems Inc., USA
• Atipa Linux Solution Inc., USA
• California Institute of Technology, USA
• Compaq Computer Corporation Inc., USA
• Fermilab, USA
• Ford Motor Company Inc., USA
• General Dynamics Inc., USA
• Intel Corporation Inc., USA
• Iowa State University, USA
• Los Alamos National Laboratory, USA
• Penguin Computing Inc., USA
• Times N Systems Inc., USA
• Monash University, Australia
• University of Southern Mississippi, USA
• Jacusiel Acuna Ltda., Chile
• University of Copenhagen, Denmark
• Caton Sistemas Alternativos, Spain
• Mapcon Geografical Inform, Sweden
• Fujitsu Software Corporation, USA
• City Team OY, Finland
• Falcon Computers, Finland
• Link Masters Ltd., Holland
• MIT, USA
• Paralogic Inc., USA
• Sandia National Laboratory, USA
• Sicorp Inc., USA
• University of Delaware, USA
• Western Scientific Inc., USA
• Group of Parallel and Distr. Processing, Brazil
Conclusions
• Industrial users want:
  • ISV applications
  • Single point of contact
  • Ease-of-use
  • Support
  • Lower TCO, not just low cost
  • Short deployment time
  • Focus on their own areas of expertise, not on being computer companies
End of Presentation

Backup Slides
SCI vs. Myrinet 2000
• All benchmarks conducted by The Numerically Intensive Computing Group at Penn State's Center for Academic Computing, beatnic@cac.psu.edu
• Machines: dual 1GHz PIII with ServerWorks HE-SL
• Myrinet setup: GM 1.2.3 and MPI-GM 1.2.4 (with everything such as directcopy and shared memory transfers enabled)
• SCI setup: SSP 2.0.2
• Observations: Myrinet's eager protocol was broken, and Scali had to change the copyright on its "bandwidth" program to help Myricom debug their protocol. Hence, only ping-pong numbers are reported.
SCI vs. M2K: Ping-Pong Comparison
[Chart: ping-pong bandwidth (MByte/s, 0 to 200) vs. message length (2 bytes to 16MB) for M2K and SCI, with a "% faster" curve (-20% to 180%) on the right-hand axis showing SCI's advantage]