Infinite possibilities. Finite Solutions.

Dolphin Wulfkit and Scali software
The Supercomputer interconnect
Amal D’Silva
amal@summnet.com
Summation Enterprises Pvt. Ltd.
Preferred System Integrators since 1991.
Agenda
• Dolphin Wulfkit hardware
• Scali Software / some commercial benchmarks
• Summation profile
Interconnect Technologies
[Figure: design space for different interconnect technologies. Application areas (processor, cache, memory, I/O, LAN, WAN) are mapped against application requirements (distance, bandwidth, latency), showing where Ethernet, ATM, FibreChannel, SCSI, Myrinet/cLan, proprietary busses and Dolphin SCI technology fit, together with a table of indicative bus vs. cluster-interconnect requirements.]
Interconnect impact on cluster performance
Some real-world examples from the Top500 May 2004 list:
• Intel, Bangalore cluster:
– 574 Xeon 2.4 GHz CPUs / GigE interconnect
– Rpeak: 2755 GFLOPs, Rmax: 1196 GFLOPs
– Efficiency: 43%
• Kabru, IMSc, Chennai:
– 288 Xeon 2.4 GHz CPUs / Wulfkit 3D interconnect
– Rpeak: 1382 GFLOPs, Rmax: 1002 GFLOPs
– Efficiency: 72%
Simply put, Kabru gives 84% of the performance with HALF the number of CPUs!
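Worked out from the numbers above, efficiency is simply Rmax divided by Rpeak, and the 84% figure compares Kabru's Rmax with that of the Intel cluster:

\[
\text{Efficiency} = \frac{R_{\text{max}}}{R_{\text{peak}}}:\qquad
\frac{1196}{2755} \approx 43\%,\qquad
\frac{1002}{1382} \approx 72\%,\qquad
\frac{1002}{1196} \approx 84\%
\]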
Commodity interconnect limitations
• Cluster performance depends primarily on two factors: bandwidth and latency.
• Gigabit Ethernet is limited to 1000 Mbit/s (roughly 80 MBytes/s in the real world). This limit is fixed irrespective of processor power.
• With increasing processor speeds, latency (the time taken to move data from one node to another) plays an ever larger role in cluster performance.
• Gigabit Ethernet typically gives an internode latency of 120 to 150 microseconds. As a result, CPUs in a node are often idling, waiting for data from another node.
• In any switch-based architecture, the switch becomes a single point of failure: if the switch goes down, so does the cluster.
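As an illustration of how such latency and bandwidth figures are typically measured, here is a minimal MPI ping-pong sketch. It is not Scali-specific and is only a sketch: any MPI implementation should build it with its compiler wrapper (e.g. mpicc) and run it with two processes, one per node. Rank 0 bounces a small message off rank 1 and reports the average one-way latency; using a large buffer instead gives a bandwidth estimate.

/* pingpong.c: measure average one-way latency between two MPI ranks.
 * Illustrative sketch; build with an MPI compiler wrapper, e.g.
 *   mpicc pingpong.c -o pingpong
 * and run with two processes placed on two different nodes. */
#include <mpi.h>
#include <stdio.h>

#define ITERS     1000
#define MSG_BYTES 8              /* small message: dominated by latency */

int main(int argc, char **argv)
{
    int rank, size, i;
    char buf[MSG_BYTES] = {0};
    double t0, t1;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    if (size < 2) {
        if (rank == 0) printf("run with at least 2 processes\n");
        MPI_Finalize();
        return 1;
    }

    MPI_Barrier(MPI_COMM_WORLD);
    t0 = MPI_Wtime();
    for (i = 0; i < ITERS; i++) {
        if (rank == 0) {         /* rank 0 sends, then waits for the echo */
            MPI_Send(buf, MSG_BYTES, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, MSG_BYTES, MPI_CHAR, 1, 0, MPI_COMM_WORLD, &status);
        } else if (rank == 1) {  /* rank 1 echoes everything straight back */
            MPI_Recv(buf, MSG_BYTES, MPI_CHAR, 0, 0, MPI_COMM_WORLD, &status);
            MPI_Send(buf, MSG_BYTES, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }
    t1 = MPI_Wtime();

    if (rank == 0)               /* half the round-trip time = one-way latency */
        printf("average one-way latency: %.2f microseconds\n",
               (t1 - t0) * 1e6 / (2.0 * ITERS));

    MPI_Finalize();
    return 0;
}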
Dolphin Wulfkit advantages
• Internode bandwidth: 260 MBytes/s on Xeon platforms (over three times faster than Gigabit Ethernet).
• Latency: under 5 microseconds (over TWENTY-FIVE times quicker than Gigabit Ethernet).
• Matrix-type internode connections: no switch, hence no single point of failure.
• Cards can be moved across processor generations, protecting the hardware investment.
Dolphin Wulfkit advantages (contd.)
• Linear scalability: adding 8 nodes to a 16-node cluster involves known, fixed costs, namely eight nodes and eight Dolphin SCI cards. With any switch-based architecture there are additional issues to consider, such as "unused ports" on the switch; with Gigabit Ethernet, for example, one has to throw away the 16-port switch and buy a 32-port switch.
• Real-world performance on par with, or better than, proprietary interconnects such as Memory Channel (HP) and NUMAlink (SGI), at cost-effective price points.
Wulfkit: The Supercomputer Interconnect
• Wulfkit is based on the Scalable Coherent Interface (SCI); the ANSI/IEEE 1596-1992 standard defines a point-to-point interface and a set of packet protocols.
• Wulfkit is not a networking technology but a purpose-designed cluster interconnect.
• The SCI interface has two unidirectional links that operate concurrently.
• Bus-imitating protocol with packet-based handshake protocols and guaranteed data delivery.
• Up to 667 MBytes/s internode bandwidth.
PCI-SCI Adapter Card: 1 slot, 2 dimensions
• SCI adapters (64-bit, 66 MHz)
– PCI/SCI adapter (D335)
– D330 card with LC3 daughter card
– Supports 2 SCI ring connections
– Switching over B-Link
– Used for Wulfkit 2D clusters
– PCI 64/66
– D339 2-slot version
[Figure: block diagram of the 2D adapter card: PCI bus, PSB, two LC link controllers, SCI connections.]
System Interconnect
High Performance
Interconnect:
•Torus Topology
•IEEE/ANSI std. 1596 SCI
•667MBytes/s/segment/ring
•Shared Address Space
Maintenance and LAN
Interconnect:
•100Mbit/s Ethernet
•(out of band monitoring)
System Architecture
[Figure: a GUI on a remote workstation connects via TCP/IP sockets to the server daemon on the control node (frontend), which communicates with node daemons on a 4x4 2D torus SCI cluster.]
3D Torus topology (for more than 64 to 72 nodes)
[Figure: 3D adapter block diagram: PCI, PSB66, three LC-3 link controllers.]
Linköping University - NSC - SCI Clusters
Also in Sweden: Umeå University, 120 Athlon nodes.
• Monolith: 200 nodes, 2x Xeon 2.2 GHz, 3D SCI
• INGVAR: 32 nodes, 2x AMD 900 MHz, 2D SCI
• Otto: 48 nodes, 2x P4 2.26 GHz, 2D SCI
• Commercial cluster under installation: 40 nodes, 2x Xeon, 2D SCI
• Total: 320 SCI nodes
MPI Connect middleware and MPI Manage cluster setup/management tools
http://www.scali.com
Scali Software Platform
• Scali MPI Manage
– Cluster Installation / Management
• Scali MPI Connect
– High Performance MPI Libraries
Scali MPI Connect
• Fault tolerant
• High bandwidth
• Low latency
• Multi-thread safe
• Simultaneous inter/intra-node operation
• UNIX command line replicated
• Exact message size option
• Manual/debugger mode for selected processes
• Explicit host specification
• Job queuing: PBS, DQS, LSF, CCS, NQS, Maui
• Conformance to MPI-1.2 verified through 1665 MPI tests
Scali MPI Manage features
• System Installation and Configuration
• System Administration
• System Monitoring, Alarms and Event Automation
• Work Load Management
• Hardware Management
• Heterogeneous Cluster Support
Fault Tolerance
• 2D torus topology: more routing options
• XY routing algorithm
• Node 33 fails: nodes on node 33's ringlets become unavailable
• Cluster is fractured with the current routing setting
[Figure: 4x4 torus, nodes 11 to 44; the ringlets through failed node 33 are cut off under plain XY routing.]
Fault Tolerance (contd.)
• Scali advanced routing algorithm: from the Turn Model family of routing algorithms
• All nodes but the failed one can be utilised as one big partition
[Figure: the same 4x4 torus rerouted around failed node 33; the remaining nodes stay connected as one partition.]
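To make the routing discussion concrete, below is a toy sketch in C of plain XY (dimension-order) routing on an n x n torus. It is purely illustrative: the coordinate type and node numbering are hypothetical, and this is not Scali's implementation. A packet is first moved along the X ring until the column matches, then along the Y ring; because that path is fixed, a failed node on it leaves the packet with no detour, which is exactly the fracture that the Turn Model based routing above avoids.

/* xy_route.c: toy XY (dimension-order) routing on an n x n torus.
 * Illustrative only; types and node numbering are hypothetical. */
typedef struct { int x, y; } coord_t;

/* Shortest direction (+1 or -1) from 'from' to 'to' on a ring of size n. */
static int ring_step(int from, int to, int n)
{
    int fwd = (to - from + n) % n;       /* hops in the +1 direction  */
    return (fwd <= n - fwd) ? +1 : -1;   /* pick the shorter way round */
}

/* Next hop for a packet at 'cur' heading for 'dst' under plain XY routing:
 * correct the X coordinate first, only then the Y coordinate. */
coord_t xy_next_hop(coord_t cur, coord_t dst, int n)
{
    coord_t next = cur;
    if (cur.x != dst.x)
        next.x = (cur.x + ring_step(cur.x, dst.x, n) + n) % n;
    else if (cur.y != dst.y)
        next.y = (cur.y + ring_step(cur.y, dst.y, n) + n) % n;
    /* If a node on this fixed X-then-Y path is down there is no detour:
     * that is the fracture the Turn Model routing described above avoids. */
    return next;
}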
Scali MPI Manage GUI
[Screenshot of the Scali MPI Manage GUI.]
Monitoring (contd.)
[Screenshot of the monitoring view.]
System Monitoring
Resource monitoring:
• CPU
• Memory
• Disk
Hardware monitoring:
• Temperature
• Fan speed
Operator alarms on selected parameters at specified thresholds.
SCI vs. Myrinet 2000: Ping-Pong comparison
[Chart: ping-pong bandwidth (MBytes/s) versus message length (2 bytes to 16 MB) for SCI and Myrinet 2000 (M2K), with a "% faster" curve showing SCI's advantage.]
Itanium vs. Cray T3E: Bandwidth
[Chart: bandwidth comparison.]
Itanium vs. Cray T3E: Latency
[Chart: latency comparison.]
Some Reference Customers
• Spacetec/Tromsø Satellite Station, Norway
• Norwegian Defense Research Establishment
• Parallab, Norway
• Paderborn Parallel Computing Center, Germany
• Fujitsu Siemens computers, Germany
• Spacebel, Belgium
• Aerospatiale, France
• Fraunhofer Gesellschaft, Germany
• Lockheed Martin TDS, USA
• University of Geneva, Switzerland
• University of Oslo, Norway
• Uni-C, Denmark
• Paderborn Parallel Computing Center
• University of Lund, Sweden
• University of Aachen, Germany
• DNV, Norway
• DaimlerChrysler, Germany
• AEA Technology, Germany
• BMW AG, Germany
• Audi AG, Germany
• University of New Mexico, USA
• Max Planck Institute für Plasmaphysik, Germany
• University of Alberta, Canada
• University of Manitoba, Canada
• Etnus Software, USA
• Oracle Inc., USA
• University of Florida, USA
• deCODE Genetics, Iceland
• Uni-Heidelberg, Germany
• GMD, Germany
• Uni-Giessen, Germany
• Uni-Hannover, Germany
• Uni-Düsseldorf, Germany
• Linux NetworX, USA
• Magmasoft AG, Germany
• University of Umeå, Sweden
• University of Linkøping, Sweden
• PGS Inc., USA
• US Naval Air, USA
Some more Reference Customers
• Rolls Royce Ltd., UK
• Norsk Hydro, Norway
• NGU, Norway
• University of Santa Cruz, USA
• Jodrell Bank Observatory, UK
• NTT, Japan
• CEA, France
• Ford/Visteon, Germany
• ABB AG, Germany
• National Technical University of Athens, Greece
• Medasys Digital Systems, France
• PDG Linagora S.A., France
• Workstations UK, Ltd., England
• Bull S.A., France
• The Norwegian Meteorological Institute, Norway
• Nanco Data AB, Sweden
• Aspen Systems Inc., USA
• Atipa Linux Solution Inc., USA
• California Institute of Technology, USA
• Compaq Computer Corporation Inc., USA
• Fermilab, USA
• Ford Motor Company Inc., USA
• General Dynamics Inc., USA
• Intel Corporation Inc., USA
• IOWA State University, USA
• Los Alamos National Laboratory, USA
• Penguin Computing Inc., USA
• Times N Systems Inc., USA
• University of Alberta, Canada
• Manash University, Australia
• University of Southern Mississippi, Australia
• Jacusiel Acuna Ltda., Chile
• University of Copenhagen, Denmark
• Caton Sistemas Alternativos, Spain
• Mapcon Geografical Inform, Sweden
• Fujitsu Software Corporation, USA
• City Team OY, Finland
• Falcon Computers, Finland
• Link Masters Ltd., Holland
• MIT, USA
• Paralogic Inc., USA
• Sandia National Laboratory, USA
• Sicorp Inc., USA
• University of Delaware, USA
• Western Scientific Inc., USA
• Group of Parallel and Distr. Processing, Brazil
Application Benchmarks
With Dolphin SCI and Scali MPI
NAS parallel benchmarks (16 CPUs / 8 nodes)
[Chart: NPB 2.3 performance relative to MPICH (60% to 240%) for the BT, CG, EP, FT, IS, LU, MG and SP kernels, comparing MPICH, ScaMPI/SCI, ScaMPI/tcpip and ScaMPI/DET2.]
Magma (16 CPUs / 8 nodes)
[Chart: Magma performance in jobs/day (0 to 65) for MPICH, ScaMPI/SCI, ScaMPI/tcp and ScaMPI/det.]
Eclipse (16 CPUs / 8 nodes)
[Chart: Eclipse300 performance in jobs/day (0 to 100) for MPICH, ScaMPI/SCI, ScaMPI/tcpip, ScaMPI/DET and ScaMPI/DET2.]
FEKO: Parallel Speedup
Boden out-of-core model, 2 CPUs per node
[Chart: speedup versus number of CPUs (8 to 32) for calculation of matrix A, solution of the linear set of equations, and total time, compared against linear speedup.]
Acusolve (16 CPUs / 8 nodes)
[Chart: Acusolve performance in jobs/day (60 to 72) for MPICH, ScaMPI/SCI, ScaMPI/tcpip and ScaMPI/DET.]
Visage (16 CPUs / 8 nodes)
[Chart: Visage performance in jobs/day (200 to 350) for ScaMPI/SCI, ScaMPI/tcpip, ScaMPI/DET and ScaMPI/DET2.]
CFD scaling, mm5: linear to 400 CPUs
mm5 t3a data set
[Chart: performance in MFLOPS versus number of CPUs (2 to 400), scaling essentially linearly.]
Scaling: Fluent on the Linköping cluster
Fluent performance, 64 million cells
[Chart: performance in jobs/day at 32, 64 and 128 CPUs.]
Dolphin Software
• All Dolphin software is free open source (GPL or LGPL)
• SISCI
• SCI-SOCKET
– Low-latency socket library
– TCP and UDP replacement
– User and kernel level support
– Release 2.0 available
• SCI-MPICH (RWTH Aachen)
– MPICH 1.2 and some MPICH 2 features
– New release is being prepared, beta available
• SCI Interconnect Manager
– Automatic failover recovery
– No single point of failure in 2D and 3D networks
• Other
– SCI Reflective Memory, Scali MPI, Linux Labs SCI Cluster Cray-compatible shmem and Clugres PostgreSQL, MandrakeSoft Clustering HPC solution, Xprime X1 Database Performance Cluster for Microsoft SQL Servers, ClusterFrame from Qlusters and SunCluster 3.1 (Oracle 9i), MySQL Cluster
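Since SCI-SOCKET is described above as a drop-in TCP/UDP replacement, ordinary Berkeley-sockets code should not need changes to use it. The sketch below is a plain TCP sender; the host name "node2" and port "5000" are placeholders, and nothing in it is SCI-specific, so under SCI-SOCKET the same code would, per the claims above, simply run with lower latency over the SCI interconnect.

/* tcp_send.c: minimal TCP client; nothing here is SCI-specific.
 * "node2" and port 5000 are placeholder values for illustration. */
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <netdb.h>
#include <sys/socket.h>

int main(void)
{
    struct addrinfo hints, *res;
    const char *msg = "hello over plain TCP (or SCI-SOCKET)\n";
    int fd;

    memset(&hints, 0, sizeof(hints));
    hints.ai_family = AF_UNSPEC;       /* IPv4 or IPv6 */
    hints.ai_socktype = SOCK_STREAM;   /* TCP */

    if (getaddrinfo("node2", "5000", &hints, &res) != 0) {
        fprintf(stderr, "getaddrinfo failed\n");
        return 1;
    }
    fd = socket(res->ai_family, res->ai_socktype, res->ai_protocol);
    if (fd < 0 || connect(fd, res->ai_addr, res->ai_addrlen) < 0) {
        perror("connect");
        return 1;
    }
    write(fd, msg, strlen(msg));       /* unchanged socket I/O */
    close(fd);
    freeaddrinfo(res);
    return 0;
}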
Summation Enterprises Pvt. Ltd.
Brief Company Profile
• Our expertise: Clustering for High Performance
Technical Computing, Clustering for High Availability,
Terabyte Storage solutions, SANs
• O.S. skills: Linux (Alpha 64-bit, x86 32- and 64-bit), Solaris (SPARC and x86), Tru64 UNIX, Windows NT/2K/2003 and the QNX Realtime O.S.
Summation milestones
• Working with Linux since 1996
• First in India to deploy/ support 64bit Alpha Linux
workstations (1999)
• First in India to spec, deploy and support a 26 Processor
Alpha Linux cluster (2001)
• Only company in India to have worked with Gigabit, SCI
and Myrinet interconnects
• Involved with the design, setup, support of many of the
largest HPTC clusters in India.
Exclusive Distributors /
System Integrators in India
• Dolphin Interconnect AS, Norway
– SCI interconnect for Supercomputer performance
• Scali AS, Norway
– Cluster management tools
• Absoft, Inc., USA
– FORTRAN Development tools
• Steeleye Inc., USA
– High Availability Clustering and Disaster Recovery Solutions for Windows
& Linux
– Summation is the sole Distributor, Consulting services & Technical
support partner for Steeleye in India
Partnering with Industry leaders
• Sun Microsystems, Inc.
– Focus on Education & Research segments
– High Performance Technical Computing,
Grid Computing Initiative with Sun Grid
Engine (SGE/ SGEE)
– HPTC Competency Centre
Wulfkit / HPTC users
• Institute of Mathematical Sciences, Chennai
– 144 node Dual Xeon Wulfkit 3D cluster
– 9 node Dual Xeon Wulfkit 2D cluster
– 9 node Dual Xeon Ethernet cluster
– 1.4 TB RAID storage
• Bhabha Atomic Research Centre, Mumbai
– 64 node Dual Xeon Wulfkit 2D cluster
– 40 node P4 Wulfkit 3D cluster
– Alpha servers / Linux OpenGL workstations / Rackmount servers
• Harish Chandra Research Institute, Allahabad
– Forty-two node Dual Xeon Wulfkit cluster
– 1.1 TB RAID storage
Wulfkit / HPTC users (contd.)
• Intel Technology India Pvt. Ltd., Bangalore
– Eight node Dual Xeon Wulfkit Clusters (ten nos.)
• NCRA (TIFR), Pune
– 4 node Wulfkit 2D cluster
• Bharat Forge Ltd., Pune
– Nine node Dual Xeon Wulfkit 2D cluster
• Indian Rare Earths Ltd., Mumbai
– 26 Processor Alpha Linux cluster with RAID storage
• Tata Institute of Fundamental Research, Mumbai
– RISC/Unix servers, Four node Xeon cluster
• Centre for Advanced Technology, Indore
– Alpha/ Sun Workstations
Questions?
Amal D’Silva
email: amal@summnet.com
GSM: 98202 83309