Dolphin Wulfkit and Scali Software: The Supercomputer Interconnect
Amal D'Silva (amal@summnet.com)
Summation Enterprises Pvt. Ltd. - Preferred System Integrators since 1991

Agenda
• Dolphin Wulfkit hardware
• Scali software / some commercial benchmarks
• Summation profile

Interconnect Technologies
[Figure: design space for different interconnect technologies. Application areas: WAN, LAN, I/O, memory, cache, processor. Application requirements: distance, bandwidth, latency. Technologies shown: Ethernet, ATM, Fibre Channel, SCSI, Myrinet, cLAN, proprietary busses and Dolphin SCI, with Dolphin SCI positioned in the cluster-interconnect region between network and bus technologies.]

Interconnect impact on cluster performance
Some real-world examples from the Top500 May 2004 list:
• Intel, Bangalore cluster: 574 Xeon 2.4 GHz CPUs, GigE interconnect
  Rpeak: 2755 GFLOPs, Rmax: 1196 GFLOPs, efficiency: 43%
• Kabru, IMSc, Chennai: 288 Xeon 2.4 GHz CPUs, Wulfkit 3D interconnect
  Rpeak: 1382 GFLOPs, Rmax: 1002 GFLOPs, efficiency: 72%
Simply put, Kabru delivers 84% of the performance with HALF the number of CPUs!

Commodity interconnect limitations
• Cluster performance depends primarily on two factors: bandwidth and latency.
• Gigabit Ethernet is limited to 1000 Mbps (approximately 80 Megabytes/s in the real world). This ceiling is fixed irrespective of processor power.
• With increasing processor speeds, latency (the time taken to move data from one node to another) plays an ever larger role in cluster performance.
• Gigabit typically gives an internode latency of 120-150 microseconds. As a result, CPUs in a node are often idle, waiting for data from another node.
• In any switch-based architecture, the switch is a single point of failure: if the switch goes down, so does the cluster.

Dolphin Wulfkit advantages
• Internode bandwidth: 260 Megabytes/s on Xeon (over three times faster than Gigabit).
• Latency: under 5 microseconds (over twenty-five times quicker than Gigabit).
• Matrix-type internode connections: no switch, hence no single point of failure.
• Cards can be moved across processor generations, which protects the investment.

Dolphin Wulfkit advantages (contd.)
• Linear scalability: adding 8 nodes to a 16-node cluster involves known, fixed costs, namely eight nodes and eight Dolphin SCI cards. With any switch-based architecture there are additional issues such as "unused ports" on the switch to consider; for Gigabit, one has to "throw away" the 16-port switch and buy a 32-port switch.
• Real-world performance on par with or better than proprietary interconnects such as Memory Channel (HP) and NUMAlink (SGI), at cost-effective price points.

Wulfkit: The Supercomputer Interconnect
• Wulfkit is based on the Scalable Coherent Interface (SCI); the ANSI/IEEE 1596-1992 standard defines a point-to-point interface and a set of packet protocols.
• Wulfkit is not a networking technology but a purpose-designed cluster interconnect.
• Each SCI interface has two unidirectional links that operate concurrently.
• Bus-imitating protocol with packet-based handshakes and guaranteed data delivery.
• Up to 667 Megabytes/s internode bandwidth.

PCI-SCI Adapter Card (1 slot, 2 dimensions)
• SCI adapters (64-bit, 66 MHz PCI): PCI/SCI adapter D335, a D330 card with an LC3 daughter card.
• Supports 2 SCI ring connections; switching over the B-Link.
• Used for Wulfkit 2D clusters.
• D339: 2-slot version.
[Diagram: 2D adapter card with PCI 64/66 bus, PSB PCI-SCI bridge and two LC link controllers driving two SCI links.]
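The switchless, matrix-style cabling works because each adapter sits on one X ring and one Y ring of a 2D torus, so a node's neighbours follow directly from its coordinates. The short C sketch below is purely illustrative (it is not Dolphin or Scali code; the 4x4 size and the two-digit node labels 11-44 are assumptions made for this example):

/* Illustrative sketch, not Dolphin/Scali code: list the ring neighbours of
 * every node in a 2D torus, where each SCI adapter provides one X-ring and
 * one Y-ring connection. Torus dimensions and the two-digit node labels
 * (column digit, then row digit) are assumed for illustration only. */
#include <stdio.h>

#define DIM_X 4   /* nodes per X ring (assumed) */
#define DIM_Y 4   /* nodes per Y ring (assumed) */

/* Two-digit label: column 1..4 followed by row 1..4, i.e. 11..44. */
static int label(int x, int y) { return (x + 1) * 10 + (y + 1); }

static void neighbours(int x, int y)
{
    /* Each SCI link is unidirectional, but the ring closes, so every node
     * still has an upstream and a downstream neighbour on both of its rings. */
    printf("node %d: X-ring %d <- %d -> %d, Y-ring %d <- %d -> %d\n",
           label(x, y),
           label((x + DIM_X - 1) % DIM_X, y), label(x, y), label((x + 1) % DIM_X, y),
           label(x, (y + DIM_Y - 1) % DIM_Y), label(x, y), label(x, (y + 1) % DIM_Y));
}

int main(void)
{
    for (int x = 0; x < DIM_X; x++)
        for (int y = 0; y < DIM_Y; y++)
            neighbours(x, y);
    return 0;
}

Adding a node only means joining one more X ring and one more Y ring, which is why the scalability cost stays fixed per node rather than jumping at switch-size boundaries.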
SCI System Interconnect
High-performance interconnect:
• Torus topology
• IEEE/ANSI std. 1596 SCI
• 667 MBytes/s per segment/ring
• Shared address space
Maintenance and LAN interconnect:
• 100 Mbit/s Ethernet (out-of-band monitoring)

System Architecture
[Diagram: GUIs on a remote workstation and on the control node (frontend) talk to the server daemon over TCP/IP sockets; the server daemon communicates with a node daemon on every node of a 4x4 2D-torus SCI cluster.]

3D Torus topology (for more than 64-72 nodes)
[Diagram: 3D adapter with PCI bus, PSB66 PCI-SCI bridge and three LC-3 link controllers, one per torus dimension.]

Linköping University - NSC - SCI Clusters
• Monolith: 200 nodes, 2x Xeon 2.2 GHz, 3D SCI
• INGVAR: 32 nodes, 2x AMD 900 MHz, 2D SCI
• Otto: 48 nodes, 2x P4 2.26 GHz, 2D SCI
• Commercial cluster under installation: 40 nodes, 2x Xeon, 2D SCI
• Total: 320 SCI nodes
Also in Sweden: Umeå University, 120 Athlon nodes.

The difference is in the software...
Scali: MPI Connect middleware and MPI Manage cluster setup/management tools (http://www.scali.com)

Scali Software Platform
• Scali MPI Manage – cluster installation/management
• Scali MPI Connect – high-performance MPI libraries

Scali MPI Connect
• Fault tolerant
• High bandwidth
• Low latency
• Multi-thread safe
• Simultaneous inter-/intra-node operation
• UNIX command line replicated
• Exact message size option
• Manual/debugger mode for selected processes
• Explicit host specification
• Job queuing: PBS, DQS, LSF, CCS, NQS, Maui
• Conformance to MPI-1.2 verified through 1665 MPI tests

Scali MPI Manage features
• System installation and configuration
• System administration
• System monitoring, alarms and event automation
• Workload management
• Hardware management
• Heterogeneous cluster support

Fault Tolerance
• The 2D torus topology offers more routing options.
• With the plain XY routing algorithm, if node 33 fails, the nodes on node 33's ringlets become unavailable: the cluster is fractured under the current routing setting.
[Diagram: 4x4 torus, nodes 11-44; the X and Y ringlets passing through the failed node 33 are cut off.]

Fault Tolerance (contd.)
• Scali's advanced routing algorithm, from the Turn Model family of routing algorithms, routes around the failure.
• All nodes but the failed one can be utilised as one big partition.
[Diagram: the same 4x4 torus with traffic rerouted around node 33.]

Scali MPI Manage GUI
[Screenshot: Scali MPI Manage graphical user interface.]

Monitoring (contd.)
[Screenshot: monitoring view.]

System Monitoring
Resource monitoring:
• CPU
• Memory
• Disk
Hardware monitoring:
• Temperature
• Fan speed
Operator alarms on selected parameters at specified thresholds.

SCI vs. Myrinet 2000: Ping-Pong Comparison
[Chart: ping-pong bandwidth (MByte/s) versus message length from 0 bytes to 16 MB for SCI and Myrinet 2000 (M2K), with a secondary axis showing the percentage by which SCI is faster.]

Itanium vs. Cray T3E: Bandwidth
[Chart: ping-pong bandwidth comparison.]

Itanium vs. T3E: Latency
[Chart: ping-pong latency comparison.]
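The ping-pong comparisons above follow a standard microbenchmark pattern: two MPI ranks bounce a message back and forth, and the round-trip time yields both latency and bandwidth. The sketch below uses only standard MPI-1 calls, so it is not Scali-specific; the default 1 MB message size and 1000 iterations are arbitrary choices made for illustration.

/* Minimal ping-pong latency/bandwidth sketch using standard MPI-1 calls.
 * Not Scali-specific: any MPI-1.2 implementation should build it with its
 * mpicc and run it on two ranks. Use a tiny message (e.g. 4 bytes) to look
 * at latency, a large one to look at bandwidth. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    int rank, iters = 1000;
    int size = 1 << 20;                 /* 1 MB message, arbitrary default */
    char *buf;
    double t0, t1;
    MPI_Status status;

    if (argc > 1)
        size = atoi(argv[1]);           /* optional message size in bytes */

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    buf = malloc((size_t)size);

    MPI_Barrier(MPI_COMM_WORLD);
    t0 = MPI_Wtime();
    for (int i = 0; i < iters; i++) {
        if (rank == 0) {
            MPI_Send(buf, size, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, size, MPI_CHAR, 1, 0, MPI_COMM_WORLD, &status);
        } else if (rank == 1) {
            MPI_Recv(buf, size, MPI_CHAR, 0, 0, MPI_COMM_WORLD, &status);
            MPI_Send(buf, size, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }
    t1 = MPI_Wtime();

    if (rank == 0) {
        double rtt = (t1 - t0) / iters;         /* seconds per round trip */
        printf("half round-trip latency: %.2f us\n", rtt / 2.0 * 1e6);
        printf("bandwidth: %.1f MB/s\n", 2.0 * size / rtt / 1e6);
    }
    free(buf);
    MPI_Finalize();
    return 0;
}

Run on two ranks placed on different nodes, this is the kind of measurement behind the Gigabit, Myrinet and SCI latency and bandwidth figures quoted earlier in the deck.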
Some Reference Customers
• Spacetec/Tromsø Satellite Station, Norway
• Norwegian Defense Research Establishment
• Parallab, Norway
• Paderborn Parallel Computing Center, Germany
• Fujitsu Siemens Computers, Germany
• Spacebel, Belgium
• Aerospatiale, France
• Fraunhofer Gesellschaft, Germany
• Lockheed Martin TDS, USA
• University of Geneva, Switzerland
• University of Oslo, Norway
• Uni-C, Denmark
• University of Lund, Sweden
• University of Aachen, Germany
• DNV, Norway
• DaimlerChrysler, Germany
• AEA Technology, Germany
• BMW AG, Germany
• Audi AG, Germany
• University of New Mexico, USA
• Max Planck Institut für Plasmaphysik, Germany
• University of Alberta, Canada
• University of Manitoba, Canada
• Etnus Software, USA
• Oracle Inc., USA
• University of Florida, USA
• deCODE Genetics, Iceland
• Uni-Heidelberg, Germany
• GMD, Germany
• Uni-Giessen, Germany
• Uni-Hannover, Germany
• Uni-Düsseldorf, Germany
• Linux NetworX, USA
• Magmasoft AG, Germany
• University of Umeå, Sweden
• University of Linköping, Sweden
• PGS Inc., USA
• US Naval Air, USA

Some more Reference Customers
• Rolls Royce Ltd., UK
• Norsk Hydro, Norway
• NGU, Norway
• University of Santa Cruz, USA
• Jodrell Bank Observatory, UK
• NTT, Japan
• CEA, France
• Ford/Visteon, Germany
• ABB AG, Germany
• National Technical University of Athens, Greece
• Medasys Digital Systems, France
• PDG Linagora S.A., France
• Workstations UK, Ltd., England
• Bull S.A., France
• The Norwegian Meteorological Institute, Norway
• Nanco Data AB, Sweden
• Aspen Systems Inc., USA
• Atipa Linux Solution Inc., USA
• California Institute of Technology, USA
• Compaq Computer Corporation Inc., USA
• Fermilab, USA
• Ford Motor Company Inc., USA
• General Dynamics Inc., USA
• Intel Corporation Inc., USA
• Iowa State University, USA
• Los Alamos National Laboratory, USA
• Penguin Computing Inc., USA
• Times N Systems Inc., USA
• University of Alberta, Canada
• Monash University, Australia
• University of Southern Mississippi, USA
• Jacusiel Acuna Ltda., Chile
• University of Copenhagen, Denmark
• Caton Sistemas Alternativos, Spain
• Mapcon Geografical Inform, Sweden
• Fujitsu Software Corporation, USA
• City Team OY, Finland
• Falcon Computers, Finland
• Link Masters Ltd., Holland
• MIT, USA
• Paralogic Inc., USA
• Sandia National Laboratory, USA
• Sicorp Inc., USA
• University of Delaware, USA
• Western Scientific Inc., USA
• Group of Parallel and Distr. Processing, Brazil

Application Benchmarks with Dolphin SCI and Scali MPI

NAS Parallel Benchmarks (16 CPUs / 8 nodes)
[Chart: NPB 2.3 kernels BT, CG, EP, FT, IS, LU, MG and SP; performance of ScaMPI/SCI, ScaMPI/tcpip and ScaMPI/DET2 relative to MPICH (= 100%); y-axis 60% to 240%.]

Magma (16 CPUs / 8 nodes)
[Chart: Magma throughput in jobs/day (axis 0 to 65) for MPICH, ScaMPI/SCI, ScaMPI/tcp and ScaMPI/det.]

Eclipse (16 CPUs / 8 nodes)
[Chart: Eclipse300 throughput in jobs/day (axis 0 to 100) for MPICH, ScaMPI/SCI, ScaMPI/tcpip, ScaMPI/DET and ScaMPI/DET2.]
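The application benchmarks here and on the following slides report results either as throughput in jobs per day or as performance relative to MPICH. A minimal sketch of that arithmetic follows; the run times in it are invented placeholders used only to show the conversion, not measured values.

/* Convert wall-clock run times into the two metrics used in the charts:
 * throughput in jobs/day and performance relative to MPICH (= 100%).
 * The run times below are placeholders for illustration, not measurements. */
#include <stdio.h>

int main(void)
{
    const char *mpi[] = { "MPICH", "ScaMPI/tcpip", "ScaMPI/SCI" };
    double runtime_s[] = { 3600.0, 3200.0, 2400.0 };   /* invented examples */
    double mpich_jobs_per_day = 86400.0 / runtime_s[0];

    for (int i = 0; i < 3; i++) {
        double jobs_per_day = 86400.0 / runtime_s[i];  /* seconds per day / job time */
        printf("%-14s %6.1f jobs/day  %5.1f%% of MPICH\n",
               mpi[i], jobs_per_day, 100.0 * jobs_per_day / mpich_jobs_per_day);
    }
    return 0;
}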
FEKO: Parallel Speedup (Boden, out-of-core, 2 CPUs per node)
[Chart: speedup versus number of CPUs (8 to 32) for calculation of matrix A, solution of the linear set of equations and total times, plotted against the linear-speedup reference; y-axis 0 to 5.]

Acusolve (16 CPUs / 8 nodes)
[Chart: Acusolve throughput in jobs/day (axis 60 to 72) for MPICH, ScaMPI/SCI, ScaMPI/tcpip and ScaMPI/DET.]

Visage (16 CPUs / 8 nodes)
[Chart: Visage throughput in jobs/day (axis 200 to 350) for ScaMPI/SCI, ScaMPI/tcpip, ScaMPI/DET and ScaMPI/DET2.]

CFD scaling: mm5 linear to 400 CPUs
[Chart: mm5 "t3a" data set, performance in Mflops versus CPU count from 2 to 400; y-axis up to 35,000.]

Scaling: Fluent, Linköping cluster
[Chart: Fluent performance on a 64-million-cell model, jobs/day versus CPU count (32, 64, 128).]

Dolphin Software
• All Dolphin software is free open source (GPL or LGPL).
• SISCI.
• SCI-SOCKET: low-latency socket library; TCP and UDP replacement; user- and kernel-level support; release 2.0 available.
• SCI-MPICH (RWTH Aachen): MPICH 1.2 plus some MPICH 2 features; a new release is being prepared, beta available.
• SCI Interconnect Manager: automatic failover recovery; no single point of failure in 2D and 3D networks.
• Other: SCI reflective memory, Scali MPI, Linux Labs SCI Cluster with Cray-compatible shmem and Clugres PostgreSQL, MandrakeSoft Clustering HPC solution, Xprime X1 Database Performance Cluster for Microsoft SQL Server, ClusterFrame from Qlusters, SunCluster 3.1 (Oracle 9i), MySQL Cluster.
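SCI-SOCKET is presented above as a drop-in TCP/UDP replacement with user- and kernel-level support, so ordinary BSD-socket code is exactly the kind of code it targets. The sketch below is plain, standard socket code (port 5555 is an arbitrary choice for illustration); nothing in it is SCI-specific, which is the point: whether it ends up running over SCI-SOCKET or plain Ethernet is a deployment decision rather than a source-code change.

/* Plain BSD-socket echo server, unchanged from ordinary TCP code.
 * SCI-SOCKET is described as a TCP/UDP replacement, so code like this is
 * meant to run over it without source modification. Port 5555 is arbitrary. */
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

int main(void)
{
    int srv = socket(AF_INET, SOCK_STREAM, 0);
    struct sockaddr_in addr;
    char buf[4096];

    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port = htons(5555);          /* arbitrary port for the example */

    if (bind(srv, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
        perror("bind");
        return 1;
    }
    listen(srv, 1);

    int conn = accept(srv, NULL, NULL);   /* wait for one client */
    ssize_t n;
    while ((n = recv(conn, buf, sizeof(buf), 0)) > 0)
        send(conn, buf, n, 0);            /* echo the bytes straight back */

    close(conn);
    close(srv);
    return 0;
}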
Summation Enterprises Pvt. Ltd.: Brief Company Profile
• Our expertise: clustering for High Performance Technical Computing, clustering for High Availability, terabyte storage solutions, SANs.
• O.S. skills: Linux (Alpha 64-bit; x86 32- and 64-bit), Solaris (SPARC and x86), Tru64 UNIX, Windows NT/2000/2003 and the QNX real-time O.S.

Summation milestones
• Working with Linux since 1996.
• First in India to deploy and support 64-bit Alpha Linux workstations (1999).
• First in India to spec, deploy and support a 26-processor Alpha Linux cluster (2001).
• Only company in India to have worked with Gigabit, SCI and Myrinet interconnects.
• Involved in the design, setup and support of many of the largest HPTC clusters in India.

Exclusive Distributors / System Integrators in India
• Dolphin Interconnect AS, Norway – SCI interconnect for supercomputer performance.
• Scali AS, Norway – cluster management tools.
• Absoft, Inc., USA – FORTRAN development tools.
• Steeleye Inc., USA – High Availability clustering and disaster recovery solutions for Windows and Linux. Summation is the sole distributor, consulting services and technical support partner for Steeleye in India.

Partnering with Industry Leaders
• Sun Microsystems, Inc.
  – Focus on education and research segments
  – High Performance Technical Computing; Grid Computing initiative with Sun Grid Engine (SGE/SGEE)
  – HPTC Competency Centre

Wulfkit / HPTC users
• Institute of Mathematical Sciences, Chennai
  – 144-node Dual Xeon Wulfkit 3D cluster
  – 9-node Dual Xeon Wulfkit 2D cluster
  – 9-node Dual Xeon Ethernet cluster
  – 1.4 TB RAID storage
• Bhabha Atomic Research Centre, Mumbai
  – 64-node Dual Xeon Wulfkit 2D cluster
  – 40-node P4 Wulfkit 3D cluster
  – Alpha servers / Linux OpenGL workstations / rackmount servers
• Harish Chandra Research Institute, Allahabad
  – 42-node Dual Xeon Wulfkit cluster
  – 1.1 TB RAID storage

Wulfkit / HPTC users (contd.)
• Intel Technology India Pvt. Ltd., Bangalore
  – 8-node Dual Xeon Wulfkit clusters (ten of them)
• NCRA (TIFR), Pune
  – 4-node Wulfkit 2D cluster
• Bharat Forge Ltd., Pune
  – 9-node Dual Xeon Wulfkit 2D cluster
• Indian Rare Earths Ltd., Mumbai
  – 26-processor Alpha Linux cluster with RAID storage
• Tata Institute of Fundamental Research, Mumbai
  – RISC/Unix servers, 4-node Xeon cluster
• Centre for Advanced Technology, Indore
  – Alpha / Sun workstations

Questions?
Amal D'Silva
email: amal@summnet.com
GSM: 98202 83309