Connection Machine Architecture
Greg Faust, Mike Gibson, Sal Valente
CS-6354 Computer Architecture, Fall 2009

Historic Timeline
• 1981: MIT AI-Lab Technical Memo on CM
• 1982: Thinking Machines Inc. (TMI) Founded
• 1985: Danny Hillis wins ACM “Best PhD” Award
• 1986: CM-1 Ships
• 1987: CM-2 Ships
• 1991: CM-5 Announced
• 1991: CM-5 Ships
• 1994: TMI files for Chapter 11; Sun/Oracle pick the bones
• Heavily DARPA funded/backed: $16M+ in direct contracts plus subsidized CM sales

Involved Notables
• Danny Hillis – CM inventor and TMI Founder
• Charles Leiserson – Fat tree inventor
• Richard Feynman – Nobel Prize winning Physicist
• Marvin Minsky – MIT AI Lab “Visionary”
• Guy Steele – Common Lisp, Grace Hopper Award
• Stephen Wolfram – Mathematica inventor
• Doug Lenat – AI researcher, creator of Cyc
• Greg Papadopoulos – MIT Media lab, Sun CTO
• various others

CM-1 and CM-2 Architecture
• Original design goal: support neuron-like simulations
• Up to 64K single-bit processors (actually 3 bits in and 2 out)
• 16 processors/chip, 32 chips/PCB, 16 PCBs/cube, 8 cubes/hypercube
• Hypercube architecture
  – Each 16-processor chip is a hyper-node
• Each processor has 4K bits of bit-addressable RAM
  – Distributed Physical Memory
  – Global Memory Addresses
• Up to 4 front-end computers talk to sequencers via a 4x4 crossbar
• “Sequencers” issue SIMD instructions over a Broadcast Network
• Bit processors communicate via 2D local HW grid connections (“NEWS”)
• Bit processors communicate via the hypercube network using MSG passing
• Lots of Twinkling Lights!!
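The packaging numbers on the CM-1 and CM-2 architecture slide can be checked with a little arithmetic. With one hypercube node per 16-processor chip, the full machine has 4096 nodes, which makes a 12-dimensional hypercube. The sketch below is illustrative only. The function names are hypothetical, and the dimension-order route is the standard textbook scheme for hypercubes, assumed here for illustration rather than taken from the actual CM router.

```python
# Packaging figures taken from the slide.
PROCS_PER_CHIP = 16
CHIPS_PER_PCB = 32
PCBS_PER_CUBE = 16
CUBES = 8


def machine_size():
    """Return (hypercube nodes, bit processors, hypercube dimensions)."""
    chips = CHIPS_PER_PCB * PCBS_PER_CUBE * CUBES   # one hypercube node per chip
    procs = chips * PROCS_PER_CHIP
    dims = chips.bit_length() - 1                   # log2 for a power of two
    return chips, procs, dims


def neighbors(node, dims):
    """Hypercube neighbors differ from `node` in exactly one address bit."""
    return [node ^ (1 << d) for d in range(dims)]


def dimension_order_route(src, dst, dims):
    """Textbook route (an assumption, not the CM router): fix differing
    address bits from lowest to highest dimension, one hop per bit."""
    path = [src]
    cur = src
    for d in range(dims):
        bit = 1 << d
        if (cur ^ dst) & bit:
            cur ^= bit
            path.append(cur)
    return path


if __name__ == "__main__":
    chips, procs, dims = machine_size()
    print(chips, procs, dims)                 # 4096 65536 12
    print(dimension_order_route(5, 6, dims))  # [5, 4, 6]
```

The number of hops on such a route equals the number of differing address bits, so no message needs more than 12 hops between chips in a full-size machine.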

[Figure: CM-1 and CM-2 architecture diagram]

CM-1 and CM-2 Programming
• ISA supports:
  – Bit-oriented operations
  – Arbitrary-precision multi-bit scalar ops using a bit-serial implementation on the bit processors
  – Full multi-dimensional vector ops
• “Virtual Processor” idea is similar to CUDA threads, but virtual processors are statically allocated
• OS and programming tools run on the front-ends
• *Lisp was the initial programming language
• Later C* and CM-Fortran

CM-2 Improvements
• 1 Weitek IEEE FP coprocessor per 32 1-bit processors
• Up to 256K bits of memory per processor
• Added ECC to memory
• Implemented the I/O subsystem
  – Up to 80 GByte RAID array called “Data Vault”, using 39 striped disks and ECC, plus spare disks on standby
  – High-speed graphics output
• En-route MSG combining in the hypercube router
• New implementation of multi-dimensional NEWS on top of the hypercube (special addressing mode)

[Figure: CM-1 photo]

CM-5 vs CM-1 and CM-2
• Significant departure from CM-1 and CM-2
• Targeted more at scientific and business applications
• More Commercial Off-The-Shelf (“COTS”) components
• Large array of SPARC processing nodes
  – 1-bit processors are abandoned
• Abandoned the “NEWS” grid and hypercube networks
• Delivered a 1024-node machine, with claims that 16K nodes were possible
• Even More Twinkling Lights!
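As a rough illustration of the bit-serial arithmetic listed for the CM-1 and CM-2 bit processors, the sketch below adds two multi-bit fields one bit per step, with every simulated processor executing the same step on its own memory. Each step reads two memory bits and a carry flag and writes a sum bit and a new carry, which loosely mirrors the "3 bits in and 2 out" description. The memory layout, field addresses, and function names are all hypothetical; this is not the actual CM instruction set.

```python
def to_bits(value, width):
    """Little-endian bit list for `value`, `width` bits long."""
    return [(value >> i) & 1 for i in range(width)]


def from_bits(bits):
    """Inverse of to_bits."""
    return sum(b << i for i, b in enumerate(bits))


def bit_serial_add(memories, a_addr, b_addr, out_addr, width):
    """Add the fields at a_addr and b_addr in every processor's memory.

    The outer loop stands in for the broadcast instruction stream (one step
    per bit position). The inner loop stands in for all processors acting
    at once. Returns each processor's final carry flag.
    """
    carries = [0] * len(memories)
    for i in range(width):
        for p, mem in enumerate(memories):
            a = mem[a_addr + i]
            b = mem[b_addr + i]
            c = carries[p]
            mem[out_addr + i] = a ^ b ^ c             # sum bit
            carries[p] = (a & b) | (c & (a ^ b))      # carry flag
    return carries


if __name__ == "__main__":
    WIDTH = 8
    operands = [(3, 4), (100, 27), (255, 1)]
    # Hypothetical layout: field A, then field B, then the result field.
    memories = [to_bits(a, WIDTH) + to_bits(b, WIDTH) + [0] * WIDTH
                for a, b in operands]
    carry_out = bit_serial_add(memories, 0, WIDTH, 2 * WIDTH, WIDTH)
    sums = [from_bits(m[2 * WIDTH:3 * WIDTH]) for m in memories]
    print(sums, carry_out)  # [7, 127, 0] [0, 0, 1]
```

Widening the field only adds more steps, which is how a 1-bit processor can support arbitrary-precision operands at the cost of time proportional to the field width.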

[Figure: CM-5 photo, "Watch it Blink"]

CM-5 Overall Architecture
• “Coordinated Homogeneous Array of RISC Microprocessors” or “CHARM”
• Asymmetric CoProcessors Model
  – Large array of Processor Nodes
  – Small collection of Control Nodes
• 2 separate scalable networks
  – One for data
  – One for control and synchronization
• Still uses striped RAID for high disk bandwidth

Division of Labor
• Processor Nodes can be assigned to a “Partition”
• One Control Node per Partition
• Control Node runs scalar code, then broadcasts parallel work to the Processor Nodes
• Processor Nodes receive a program, not an instruction stream, and each has its own Program Counter
• Processor Nodes can access another node's memory by reading or writing a global memory address
• Processor Nodes also communicate via MSG passing
• Processor Nodes cannot issue system calls

Control Nodes
• Full Sun workstations
• Running UNIX
• Connected to the “Outside World”
• Handle Partition time sharing
• Connected to both data and control networks
• Perform system diagnostics

Processor Nodes
• Each node is a 5-chip microprocessor
  – Off-the-shelf SPARC processor @ 40 MHz
  – 32 MBytes of local node memory
  – Multi-port memory controller for added bandwidth
  – “Caching techniques do not perform as well on large parallel machines”
  – Proprietary 4-FPU vector coprocessor
  – Proprietary network controller

[Figure: CM-5 processor node diagram]

Data Network Architecture
• Point-to-point inter-node communication and I/O
• Implemented as a Fat Tree
  – Fat Trees invented by TMI employee Charles Leiserson
• Claim: bandwidth is expandable on site
• Delivers 5 GB/sec bisection bandwidth on a 1024-node machine
• Data router chip is an 8x8 crossbar switch
• Faulty nodes are mapped out of the network
  – Programs cannot assume a network topology
• Network can be flushed when time-share swaps occur
• Network, not processors, guarantees end-to-end delivery

[Figure: Fat tree structure]

Separate Control Network
• Synchronization & control network
• Complete binary tree organization
• Provides broadcast capability
• Implements barrier operations
• Implements interrupts for timesharing
• Performs reduction operators (Sum, Max, AND, OR, Count, etc.)

CM-5 Programming
• Supports multiple parallel high-level languages and programming styles
  – Including the Data Parallel Model from CM-1 and CM-2
• Goal: hide many decisions from programmers
  – CM-1, CM-2 vs CM-5 ISA changes
  – Use of Processor Node CPU vs Vector CoProcessors
  – Partition-wide synchronizations generated by the compiler
• Is it MIMD, SPMD, SIMD?
  – “Globally Synchronized MIMD”

Sample CM Apps
• Machine Learning
  – Neural nets, concept clustering, genetic algorithms
• VLSI Design
• Geophysics (Oil Exploration), Plate Tectonics
• Particle Simulation
• Fluid Flow Simulation
• Computer Vision
• Computer Graphics, Animation
• Protein Sequence Matching
• Global Climate Model Simulation

References
• Danny Hillis PhD: The Connection Machine
• Inc: The Rise and Fall of Thinking Machines
• Wiki: Connection Machine
• ACM: The CM-5 Connection Machine
• ACM: The Network Architecture of the CM-5
• IEEE: Architecture and Applications of the Connection Machine
• IEEE: Fat-trees: universal networks for hardware-efficient supercomputing
• Encyclopedia of Computer Science and Technology