From Beowulf to professional turn-key solutions
Einar Rustad - VP Business Development
Slide 1 - 22.03.2016
The difference is in the software...

Outline
• Scali Background
• Clustering Rationale
• Scali Products
• Technology

History and Facts at a glance
• History:
  – Based on a development project for High Performance SAR Processing (Military), 1994-1996 (concurrently with the Beowulf project at NASA)
  – Spin-off from Kongsberg Gruppen ASA, 1997
• Organisation:
  – 30 Employees
  – Head Office in Oslo, branch in Houston, sales offices in Germany, France and the UK
• Main Owners:
  – Four Seasons Venture, SND Invest, Kongsberg, Intel Corp., Employees

Paderborn, PC2
• 1997: PSC1 - 8 x 4 Torus, 64 Processors (P-3, 300MHz), 19.2GFlops
• 1998: PSC2 - 12 x 8 Torus, 192 Processors (P-3, 450MHz), 86.4GFlops

A Major Software Challenge
[Chart: µP cycle time vs. DRAM access time in nanoseconds, 1978-2000 - the gap between processor and memory speed keeps widening.]

Increasing Performance
• Faster Processors
  – Frequency
  – Instruction Level Parallelism
• Better Algorithms
  – Compilers
  – Brainpower
• Parallel Processing
  – Compilers
  – Tools (Profilers, Debuggers)
  – More Brainpower
[Bar chart: speedups for the optimization combinations F, B, F+Blk, F+C and F+Blk+C.]

Clusters vs SMPs
• Use of SMPs:
  – Common Access to Shared Resources: Processors, Memory, Storage Devices
  – Running Multiple Applications
  – Running Multiple Instances of the Same Application
  – Running Parallel Applications
• Use of Clusters:
  – Common Access to Shared Resources: Processors, Distributed Memory, Storage Devices
  – Running Multiple Applications
  – Running Multiple Instances of the Same Application
  – Running Parallel Applications
Why Do We Need SMPs?
• Small SMPs make Great Nodes for building Clusters!
• The most Cost-Effective Cluster Node is a Dual Processor SMP
  – High Volume Manufacturing
  – High Utilization of Shared Resources: Bus, Memory, I/O

Clustering makes Mo(o)re Sense
• Microprocessor Performance Increases 50-60% per Year
  – 1-year lag: 1.0 SHV Unit = 1.6 Proprietary Units
  – 2-year lag: 1.0 SHV Unit = 2.6 Proprietary Units
• Volume Disadvantage: each time volume doubles, unit cost falls to 90%
  – 1,000 Proprietary Units vs 1,000,000 SHV Units => Proprietary Unit 3x more Expensive (0.9 compounded over log2(1000) ≈ 10 doublings leaves about 35% of the cost)
  – 2-year lag and 1:100 Volume Disadvantage => 7x Worse Price/Performance

UNCLASSIFIED
Hardware acquisition - massive savings!
Fan Systems (Bristol) compute server - price/performance ratio per processor
[Chart: normalized £/Mflops per processor for Alpha EV6.7 21264 667MHz, Pentium IV 1.7GHz, Pentium III 800MHz, Sun Ultra 10 UltraSPARC II 450MHz and SGI Origin 2000 R10k 250MHz; the SGI platform is high-cost, the commodity platforms very cost-effective.]
• Savings made:
  – Non-EDS compute server acquired by IE (Alpha/PC cluster with 24 proc.): £75k
  – EDS solution with 24 proc. (SGI Origin 2000): £300k
  – Savings: £225k
  – EDS solution with the same computing power (SGI Origin 2000): £1.2M
  – Savings: £1.1M
(Proposed HPC platforms, February 14, 2001 - RR Defence(E), Installation Engineering, Bristol - François Moyroud)

Software Focal Points
• High Performance Communication: ScaMPI
  – Record Performance (>380MB/s, <4µs Latency)
  – Rich set of functions
  – Debugging options (trace file generation)
  – Performance tuning with high-precision timers
  – Easy to use
• Cluster Management: Scali Universe
  – Single System Environment
  – Remote Operation
  – Job Scheduling
  – Monitoring
  – Alarms

What Kind of Interconnect?
• The choice of cluster interconnect depends entirely on the Applications and the size of the Cluster
• Ethernet - Long Latency, Low Bandwidth
  – Poor Scalability
  – Embarrassingly Parallel codes
• SCI - Ultra-low Latency, High Bandwidth
  – Good Scalability
  – All kinds of parallel applications

System Interconnects
• High Performance Interconnect:
  – Torus Topology
  – IEEE/ANSI std. 1596 SCI
  – 667MBytes/s per segment per ring
  – Shared Address Space
• Maintenance and LAN Interconnect:
  – 100Mbit/s or Gigabit Ethernet
  – Channel Bonding option

3-D Torus Topology
• Distributed Switching
[Diagram: PCI bus -> PSB -> B-Link -> three LC-3 link controllers driving the X, Y and Z SCI rings.]

2D/3D Torus (D33X)
[Diagram: PCI at 532MB/s -> PSB66 -> B-Link at 640MB/s -> three LC-3 controllers with 6 x 667MB/s SCI links.]

Shared Nothing Data Transfers
[Diagram: user data is staged through a system buffer in host memory before the network adapter transmits it.]

Shared Address Space Data Transfers
[Diagram: the network adapter moves user data directly between the two applications' address spaces, without staging through system buffers.]

A2A Scalability with 66MHz/64bits PCI
[Chart: aggregate all-to-all bandwidth (GByte/s) vs number of nodes for ringlet, 2D-Torus, 3D-Torus and 4D-Torus topologies against the PCI limit, with 12, 144 and 1728 nodes marked.]

Scalability, All-to-All (64-byte message size)
[Chart: aggregate bandwidth (MBytes/s) vs number of nodes, from 2 to 200.]
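The all-to-all curves above flatten at small message sizes because every message pays a fixed per-message latency. A toy linear cost model makes this concrete; the model and helper names are my illustration, not Scali's code, and only the two constants (<4 µs latency, >380 MB/s bandwidth) come from the ScaMPI slide:

```python
# Hypothetical linear cost model for one message: t(n) = L + n/B.
# L and B are the ScaMPI figures quoted in this deck; the model
# itself is an illustration, not Scali's performance model.

LATENCY_S = 4e-6      # per-message latency (seconds)
BANDWIDTH = 380e6     # sustained bandwidth (bytes/second)

def transfer_time(nbytes: int) -> float:
    """Time to send one message of nbytes under the linear model."""
    return LATENCY_S + nbytes / BANDWIDTH

def effective_bandwidth(nbytes: int) -> float:
    """Achieved bytes/second for a single message of nbytes."""
    return nbytes / transfer_time(nbytes)

# Half of peak bandwidth is reached at the "half-performance"
# message length n_1/2 = L * B; below it, latency dominates.
n_half = LATENCY_S * BANDWIDTH
print(f"n_1/2 = {n_half:.0f} bytes")

for n in (64, 1024, 65536, 1 << 20):
    print(f"{n:>8} B: {effective_bandwidth(n) / 1e6:6.1f} MB/s")
```

Under this model n_1/2 ≈ 1.5 KB, so the 64-byte all-to-all above runs almost entirely in the latency-bound regime, which is why interconnect latency, not link bandwidth, decides that benchmark.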
Fluent Benchmarks
[Charts: Fluent performance for six cases - Small class 1-3 and Medium class 1-3 - comparing SCI with Ethernet at 2 to 32 CPUs; the SCI advantage grows with the CPU count.]
(Performance Metric is "Jobs per Day")

Performance with Fluent
• A benchmark from the Transportation Industry
• Problem is partitioned along the principal axes and consists of about 4.5 million cells
• Dual Intel Xeon 2.0GHz, Intel 860 chip-set, 3GB RDRAM per node
• Fluent 5.7

Jobs per day:
            4 CPUs   8 CPUs   16 CPUs
  SCI       317.6    523.6    960.0
  100baseT  290.9    411.4    587.8

Performance with Fluent (cont'd)
• NEW YORK CITY, NY -- September 25, 2001 -- Sun Microsystems, Inc. (Nasdaq: SUNW) today announced that the newly announced 900 MHz, 72-way Sun Fire™ 15K server outperformed a 1 GHz, 128-way IBM system by over 23 percent on the large FL5L1 dataset, using FLUENT™, a leading Computational Fluid Dynamics (CFD) software application.
• ARMONK, NY, October 26, 2001 -- The IBM @server p690, codenamed "Regatta," today set a world record for processing speed on the important Fluent engineering benchmark, providing nearly 80 percent more power than the new Sun Fire 15K, which has twice as many processors and is nearly double the price.
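A useful way to read the jobs-per-day table above is scaling efficiency: how much of the ideal linear speedup each interconnect retains as CPUs are added. The figures are the slide's; the helper below is just illustrative bookkeeping, not part of any Scali tool:

```python
# Scaling efficiency from the Fluent "jobs per day" table.
# Efficiency at p CPUs = (perf_p / perf_p0) / (p / p0), relative to
# the smallest run p0. The data is from the slide above.

def efficiency(perf, base=None):
    """Map CPU count -> efficiency relative to the smallest run."""
    p0 = base or min(perf)
    return {p: (perf[p] / perf[p0]) / (p / p0) for p in perf}

sci      = {4: 317.6, 8: 523.6, 16: 960.0}   # SCI, jobs/day
ethernet = {4: 290.9, 8: 411.4, 16: 587.8}   # 100baseT, jobs/day

for name, perf in (("SCI", sci), ("100baseT", ethernet)):
    eff = efficiency(perf)
    print(name, {p: round(e, 2) for p, e in sorted(eff.items())})
```

SCI keeps roughly 76% efficiency at 16 CPUs, while 100baseT drops to about 51%, which is the quantitative form of "SCI scales, Ethernet does not" argued earlier in the deck.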
Performance with Fluent (cont'd)
• FL5L3 - Turbulent Flow Through a Transition Duct
  – Number of cells: 9,792,512
  – Cell type: hexahedral
  – Models: RSM turbulence
  – Solver: segregated implicit
• Jobs per day:
  – Scali TeraRack (AMD K7, 1400), 96 CPUs: 354.2
  – IBM pSeries 690 Turbo "Regatta" (POWER4, 1300), 32 CPUs: 199.7
  – Sun SunFire 15K (USIII, 900), 72 CPUs: 127.6

Performance with Fluent (cont'd)
Relative Performance/Price ratio (running the FL5L3 benchmark):
• Scali TeraRack (AMD K7, 1400+): 13.4
• IBM pSeries 690 Turbo (POWER4, 1300): 2.5
• Sun SunFire 15K (USIII, 900): 1.0

ClusterEdge™
• Universe XE
• SCI Interconnect HW
• High Performance Communication Libraries:
  – ScaMPI
  – Scali IP
  – Shmem
  – Scali SAN

Platform Support
• Operating systems: RH 6.2, 7.0, 7.1, 7.2; SuSE 7.0, 7.1; Solaris 8
• Architectures: ia32 (PII, PIII, P4, AMD Athlon, Athlon MP), ia64 (Itanium), Alpha (EV6, EV67, EV68), SPARC (UltraSPARC III)
• x86 chipsets: 440LX, 440BX, 440GX, i840, i850, i860; VIA Apollo Pro 133A; ServerWorks LE, HE, WS (HE-SL)
• Itanium chipsets: Intel 460GX, HP zx1
• Athlon chipsets: VIA KX133, VIA KT133, AMD 750, AMD 760, AMD 760MP, AMD 760MPX
• Alpha chipsets: Tsunami/Typhoon 21272
• SCI boards: Dolphin D311/D312, D315, D316; Dolphin D33X series

Universe Architecture
[Diagram: a GUI on a remote workstation connects over TCP/IP sockets to the server daemon on the control node (frontend), which talks to node daemons on the cluster nodes; the nodes themselves are interconnected with SCI.]

Scali Universe
• Multiple Cluster Management
• Common Login per Cluster
• Individual or Grouped Node Management
Software Configuration Management
• Nodes are categorized once. From then on, new software is installed with one mouse click or a single command.

System Monitoring
• Resource Monitoring: CPU, Memory, Disk
• Hardware Monitoring: Temperature, Fan Speed
• Operator Alarms on selected Parameters at Specified Thresholds

Monitoring (cont'd)
[Screenshots of the monitoring GUI.]

Events/Alarms

OpenPBS Integration

Fault Tolerance
• 2D Torus topology: more routing options
• XY routing algorithm
• Node 33 fails
• Nodes on 33's ringlets become unavailable
• Cluster is fractured with the current routing setting
[Diagram: a 4 x 4 torus of nodes 11-44; under XY routing the failure of node 33 cuts off its X and Y ringlets.]

Fault Tolerance
• Scali advanced routing algorithm: from the Turn Model family of routing algorithms
• All nodes but the failed one can be utilised as one big partition
[Diagram: the same 4 x 4 torus rerouted so that the 15 surviving nodes stay connected.]
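The fracturing the first Fault Tolerance slide describes can be sketched in a few lines. This is a hypothetical model of dimension-order (XY) routing on unidirectional torus rings, written for illustration only - it is not Scali's routing code, and it does not reproduce the turn-model reroute, only the problem it solves:

```python
# Hypothetical sketch of dimension-order (XY) routing on a 2D torus
# with unidirectional rings (as in SCI ringlets): a packet travels
# along its X ring to the destination column, then along the Y ring.
# If any hop lands on a failed node the route is dead -- which is why
# a single failure can fracture the cluster, as the slide shows.

def xy_route(src, dst, size=4):
    """Return the list of (x, y) hops from src to dst, X first."""
    (sx, sy), (dx, dy) = src, dst
    path = [(sx, sy)]
    while sx != dx:                   # move along the X ring
        sx = (sx + 1) % size
        path.append((sx, sy))
    while sy != dy:                   # then along the Y ring
        sy = (sy + 1) % size
        path.append((sx, sy))
    return path

def reachable_pairs(failed, size=4):
    """Count ordered (src, dst) pairs whose XY route avoids `failed`."""
    nodes = [(x, y) for x in range(size) for y in range(size)
             if (x, y) != failed]
    return sum(failed not in xy_route(s, d, size)
               for s in nodes for d in nodes if s != d)

# With node (2, 2) down ("node 33" in the slide's numbering), a large
# share of the XY routes pass through it and fail:
total = 15 * 14
print(reachable_pairs(failed=(2, 2)), "of", total, "pairs still routable")
```

A turn-model algorithm, as named on the second slide, restores full connectivity by allowing a restricted set of extra turns so routes can detour around the failed node.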
Some Reference Customers
• Spacetec/Tromsø Satellite Station, Norway
• Norwegian Defense Research Establishment
• Parallab, Norway
• Paderborn Parallel Computing Center, Germany
• Fujitsu Siemens Computers, Germany
• Spacebel, Belgium
• Aerospatiale, France
• Fraunhofer Gesellschaft, Germany
• Lockheed Martin TDS, USA
• University of Geneva, Switzerland
• University of Oslo, Norway
• Uni-C, Denmark
• University of Lund, Sweden
• University of Aachen, Germany
• DNV, Norway
• DaimlerChrysler, Germany
• AEA Technology, Germany
• BMW AG, Germany
• Audi AG, Germany
• University of New Mexico, USA
• Max Planck Institute für Plasmaphysik, Germany
• University of Alberta, Canada
• University of Manitoba, Canada
• Etnus Software, USA
• Oracle Inc., USA
• University of Florida, USA
• deCODE Genetics, Iceland
• Uni-Heidelberg, Germany
• GMD, Germany
• Uni-Giessen, Germany
• Uni-Hannover, Germany
• Uni-Düsseldorf, Germany
• Linux NetworX, USA
• Magmasoft AG, Germany
• University of Umeå, Sweden
• University of Linkøping, Sweden
• PGS Inc., USA
• US Naval Air, USA

Reference Customers (cont'd)
• Rolls Royce Ltd., UK
• Norsk Hydro, Norway
• NGU, Norway
• University of Santa Cruz, USA
• Jodrell Bank Observatory, UK
• NTT, Japan
• CEA, France
• Ford/Visteon, Germany
• ABB AG, Germany
• National Technical University of Athens, Greece
• Medasys Digital Systems, France
• PDG Linagora S.A., France
• Workstations UK, Ltd., England
• Bull S.A., France
• The Norwegian Meteorological Institute, Norway
• Nanco Data AB, Sweden
• Aspen Systems Inc., USA
• Atipa Linux Solution Inc., USA
• California Institute of Technology, USA
• Compaq Computer Corporation Inc., USA
• Fermilab, USA
• Ford Motor Company Inc., USA
• General Dynamics Inc., USA
• Intel Corporation Inc., USA
• Iowa State University, USA
• Los Alamos National Laboratory, USA
• Penguin Computing Inc., USA
• Times N Systems Inc., USA
• University of Alberta, Canada
• Monash University, Australia
• University of Southern Mississippi, USA
• Jacusiel Acuna Ltda., Chile
• University of Copenhagen, Denmark
• Caton Sistemas Alternativos, Spain
• Mapcon Geografical Inform, Sweden
• Fujitsu Software Corporation, USA
• City Team OY, Finland
• Falcon Computers, Finland
• Link Masters Ltd., Holland
• MIT, USA
• Paralogic Inc., USA
• Sandia National Laboratory, USA
• Sicorp Inc., USA
• University of Delaware, USA
• Western Scientific Inc., USA
• Group of Parallel and Distr. Processing, Brazil

Conclusions
• Industrial Users want:
  – ISV Applications
  – Single Point of Contact
  – Ease-of-Use
  – Support
  – Lower TCO, not just low Cost
  – Short deployment time
  – Focus on their own areas of expertise, not on being computer companies

End of Presentation

Backup Slides

SCI vs.
Myrinet 2000
• All benchmarks conducted by the Numerically Intensive Computing Group at Penn State's Center for Academic Computing, beatnic@cac.psu.edu
• Machines: dual 1GHz PIII with ServerWorks HE-SL
• Myrinet setup: GM 1.2.3 and MPI-GM 1.2.4 (with everything such as directcopy and shared memory transfers enabled)
• SCI setup: SSP 2.0.2
• Observations: Myrinet's eager protocol was broken, and Scali had to change its copyright on the "bandwidth" program to help Myricom debug their protocol. Hence, only ping-pong numbers are reported.

SCI vs. M2K: Ping-Pong Comparison
[Chart: ping-pong bandwidth in MByte/sec (left axis) for M2K and SCI, and SCI's advantage in percent (right axis), vs message length from 2 bytes to 16 MB.]
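The ping-pong methodology behind these plots is simple: time many round trips of a fixed message size and report half the round trip as the one-way cost. The sketch below imitates that method over a local socket pair; it is my illustration in the spirit of the "bandwidth" program mentioned above, not that program itself, and its numbers say nothing about SCI or Myrinet:

```python
# Hypothetical ping-pong microbenchmark: time round trips of a given
# message size; one-way latency = round trip / 2, bandwidth = size /
# one-way time. Runs over a local socketpair for illustration only.

import socket
import threading
import time

def echo_server(conn, nbytes, iters):
    """Bounce every received message straight back (the 'pong' side)."""
    for _ in range(iters):
        buf = b""
        while len(buf) < nbytes:
            buf += conn.recv(nbytes - len(buf))
        conn.sendall(buf)

def pingpong(nbytes, iters=100):
    """Return (one-way latency in s, bandwidth in bytes/s) for nbytes."""
    left, right = socket.socketpair()
    t = threading.Thread(target=echo_server, args=(right, nbytes, iters))
    t.start()
    msg = bytes(nbytes)
    start = time.perf_counter()
    for _ in range(iters):
        left.sendall(msg)
        buf = b""
        while len(buf) < nbytes:
            buf += left.recv(nbytes - len(buf))
    elapsed = time.perf_counter() - start
    t.join()
    left.close()
    right.close()
    one_way = elapsed / (2 * iters)
    return one_way, nbytes / one_way

for n in (64, 4096, 65536):
    lat, bw = pingpong(n)
    print(f"{n:>6} B: {lat * 1e6:8.1f} us one-way, {bw / 1e6:8.1f} MB/s")
```

As in the chart above, small messages expose latency and large messages expose sustained bandwidth, which is why the comparison is plotted across the full range of message lengths.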