Computing beyond a Million Processors - bio-inspired massively-parallel architectures

Steve Furber, The University of Manchester, steve.furber@manchester.ac.uk
Andrew Brown, The University of Southampton, adb@ecs.soton.ac.uk

SBF is supported by a Royal Society-Wolfson Research Merit Award

ACACES 12 July 2009

Outline
• Computer Architecture Perspective
• Building Brains
• Living with Failure
• Design Principles
• The SpiNNaker system
• Concurrency
• Conclusions

Multi-core CPUs
• High-end uniprocessors
  – diminishing returns from complexity
  – wire vs transistor delays
• Multi-core processors
  – cut-and-paste
  – simple way to deliver more MIPS
• Moore’s Law
  – more transistors
  – more cores
… but what about the software?

Multi-core CPUs
• General-purpose parallelization
  – an unsolved problem
  – the ‘Holy Grail’ of computer science for half a century?
  – but imperative in the many-core world
• Once solved
  – few complex cores, or many simple cores?
  – simple cores win hands-down on power-efficiency!

Back to the future
• Imagine…
  – a limitless supply of (free) processors
  – load-balancing is irrelevant
  – all that matters is:
    • the energy used to perform a computation
    • formulating the problem to avoid synchronisation
    • abandoning determinism
• How might such systems work?

Bio-inspiration
• How can massively parallel computing resources accelerate our understanding of brain function?
• How can our growing understanding of brain function point the way to more efficient parallel, fault-tolerant computation?

Building brains
• Brains demonstrate
  – massive parallelism (10^11 neurons)
  – massive connectivity (10^15 synapses)
  – excellent power-efficiency
    • much better than today’s microchips
  – low-performance components (~ 100 Hz)
  – low-speed communication (~ metres/sec)
  – adaptivity – tolerant of component failure
  – autonomous learning

Neurons
• Multiple inputs (dendrites)
• Single output (axon)
  – digital “spike”
  – fires at 10s to 100s of Hz
  – output connects to many targets
• Synapse at input/output connection
[Figure: spike trace]
(www.ship.edu/~cgboeree/theneuron.html)

Neurons
• A flexible biological control component
  – very simple animals have a handful
  – bees: 850,000
  – humans: 10^11
(photo courtesy of the Brain Mind Institute, EPFL)

Neurons
• Regular high-level structure
  – e.g. 6-level cortical microarchitecture
    • low-level vision, to
    • language, etc.
• Random low-level structure
  – adapts over time
(faculty.washington.edu/rhevner/Miscellany.html)

Neural Computation
• To compute we need:
  – Processing
  – Communication
  – Storage
• Processing: abstract model
  – linear sum of weighted inputs
    • ignores non-linear processes in dendrites
  – non-linear output function
  – learn by adjusting synaptic weights
[Figure: inputs x1 to x4, weights w1 to w4, output function f, output y]

Processing
• Leaky integrate-and-fire model
  – inputs are a series of spikes
  – total input is a weighted sum of the spikes
  – neuron activation is the input with a “leaky” decay
  – when activation exceeds threshold, output fires
  – habituation, refractory period, …?
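The leaky integrate-and-fire bullets above can be sketched in a few lines of Python. This is a minimal discrete-time illustration, not the SpiNNaker implementation: the function name `simulate_lif`, the time constant `tau`, the `threshold`, the step size `dt` and the example weights are all assumptions chosen for readability, and input spikes are treated as impulses that add their weight directly to the activation.

```python
def simulate_lif(spike_trains, weights, tau=10.0, threshold=1.0, dt=1.0, steps=100):
    """Discrete-time leaky integrate-and-fire neuron (illustrative sketch).

    spike_trains: one set per input, holding the step indices at which
                  that input spikes.
    weights:      one synaptic weight per input.
    Returns the list of step indices at which the output fires.
    """
    activation = 0.0
    output_spikes = []
    for step in range(steps):
        # Total input: weighted sum of the inputs that spike at this step.
        total_input = sum(
            w for w, train in zip(weights, spike_trains) if step in train
        )
        # "Leaky" decay of the activation (forward Euler), plus the new input.
        activation += dt * (-activation / tau) + total_input
        # When the activation exceeds the threshold, fire and reset.
        if activation > threshold:
            output_spikes.append(step)
            activation = 0.0
    return output_spikes


if __name__ == "__main__":
    # Two closely spaced input spikes sum to a firing event ...
    print(simulate_lif([{0, 1}], [0.6]))
    # ... but two widely spaced spikes leak away before they can sum.
    print(simulate_lif([{0, 30}], [0.6]))
```

The two calls show the role of the leak: the same pair of spikes either triggers an output or does not, depending only on how close together they arrive.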
$x_i(t) = \sum_k \delta(t - t_i^k)$
$I = \sum_i w_i x_i$
$\dot{A} = -A/\tau_A + I$
if $A > \theta_A$: fire & set $A = 0$

Processing
(www.izhikevich.com)
• Izhikevich model
  – two variables, one fast, one slow:
    $\dot{v} = 0.04v^2 + 5v + 140 - u + I$
    $\dot{u} = a(bv - u)$
  – neuron fires when v > 30; then:
    $v \leftarrow c$, $u \leftarrow u + d$
  – a, b, c & d select behaviour
[Figure: traces of v and u over time]

Communication
• Spikes
  – biological neurons communicate principally via ‘spike’ events
  – asynchronous
  – information is only:
    • which neuron fires, and
    • when it fires
[Figure: spike trace]

Storage
• Synaptic weights
  – stable over long periods of time
    • with diverse decay properties?
  – adaptive, with diverse rules
    • Hebbian, anti-Hebbian, LTP, LTD, ...
• Axon ‘delay lines’
• Neuron dynamics
  – multiple time constants
• Dynamic network states

The Good News...
[Figure: Transistors per Intel chip. Millions of transistors per chip (log scale, 0.001 to 100) against Year (1970 to 2000), from the 4004, 8008, 8080, 8086, 286, 386 and 486 through the Pentium, Pentium II, Pentium III and Pentium 4]

...and the Bad News
• Device variability &
• Component failure
[Figure: Vout2 (V) against Vout1 (V), both axes 0.0 to 1.0]

Atomic Scale devices
The simulation Paradigm now
A 22 nm MOSFET, in production 2008
A 4.2 nm MOSFET, in production 2023

A view from Intel
• The Good News:
  – we will have 100 billion transistor ICs
• The Bad News:
  – billions will fail in manufacture
    • unusable due to parameter variations
  – billions more will fail over the first year of operation
    • intermittent and permanent faults
(Shekhar Borkar, Intel Fellow)

A view from Intel
• Conclusions:
  – one-time production test will be out
  – burn-in to catch infant mortality will be impractical
  – test hardware will be an integral part of the design
  – dynamically self-test, detect errors, reconfigure, adapt, ...
(Shekhar Borkar, Intel Fellow)

Design principles
• Virtualised topology
  – physical and logical connectivity are decoupled
• Bounded asynchrony
  – time models itself
• Energy frugality
  – processors are free
  – the real cost of computation is energy

SpiNNaker project
• Multi-core CPU node
  – 20 ARM968 processors
  – to model large-scale systems of spiking neurons
• Scalable up to systems with 10,000s of nodes
  – over a million processors
  – >10^8 MIPS total
• Power ~ 25 mW/neuron

SpiNNaker project

SpiNNaker project
• Fault-tolerant architecture for large-scale neural modelling
• A billion neurons in real time
• A step-function increase in the scale of neural computation
• Cost- and energy-efficient

SpiNNaker system

CMP node

ARM968 subsystem

GALS organization
• clocked IP blocks
• self-timed interconnect
• self-timed interchip links

Circuit-level concurrency
• Delay-insensitive comms
  – 3-of-6 RTZ on chip
  – 2-of-7 NRZ off chip
• Deadlock resistance
  – Tx & Rx circuits have high deadlock immunity
  – Tx & Rx can be reset independently
    • each injects a token at reset
    • true transition detector filters surplus token
[Figure: Tx and Rx link circuit; signals data, ack, din, dout (2 phase), ¬reset, (4 phase), ¬ack]

System-level concurrency
• Breaking symmetry
  – any processor can be Monitor Processor
    • local ‘election’ on each chip, after self-test
  – all nodes are identical at start-up
    • addresses are computed relative to node with host connection (0,0)
  – system initialised using flood-fill
    • nearest-neighbour packet type
    • boot time (almost) independent of system scale

Application-level concurrency
• Event-driven real-time software
  – spike packet arrived
    • initiate DMA
  – DMA of synaptic data completed
    • process inputs
    • insert axonal delay
  – 1 ms Timer interrupt
[Figure: event flow. The processor sleeps until an event arrives. Packet Received Interrupt: fetch_Synaptic_Data(); Priority 1. DMA Completion Interrupt: update_Stimulus(); Priority 2. Timer Millisecond Interrupt: update_Neurons(); Priority 3. Then goto_Sleep();]

Application-level concurrency
• Cross-system delay << 1 ms
  – hardware routing
  – ‘emergency’ routing
    • failed links
    • Congestion
  – if all else fails
    • drop packet

Biological concurrency
• Firing rate population codes
  – N neurons
  – diverse tuning
  – collective coding of a physical parameter
  – accuracy ∝ √N
  – robust to neuron failure
[Figure: firing rate (0 to 500) against parameter (-1 to 1)]
(Neural Engineering, Eliasmith & Anderson 2003)

Biological concurrency
• Single spike/neuron codes
  – choose N to fire from a population of M
  – order of firing may or may not matter

Number of codes        Unordered N-of-M   Ordered N-of-M   M-bit binary
Formula                C(M,N)             M!/(M-N)!        2^M
e.g. M=100, N=20       10^21              10^39            10^30
e.g. M=1000, N=200     10^216             10^591           10^301

Software progress
• ARM SoC Designer SystemC model
• 4 chip x 2 CPU top-level Verilog model
• Running:
  • boot code
  • Izhikevich model
  • PDP2 codes
• Basic configuration flow
• …it all works!

Where might this lead?
• Robots
  – iCub EU project
  – open humanoid robot platform
  – mechanics, but no brain!

Conclusions
• Many-core processing is coming
  – soon we will have far more processors than we can program
• When (if?) we crack parallelism…
  – more small processors are better than fewer large processors
  – synchronization, coherent global memory, determinism, are all impediments
• Biology suggests a way forward!
  – but we need new theories of biological concurrency

UoM SpiNNaker team
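
As a worked check on the code counts quoted on the single spike/neuron codes slide, the three quantities (unordered N-of-M, ordered N-of-M and M-bit binary) can be computed exactly with Python's arbitrary-precision integers. This is only an illustrative sketch: the helper names `code_counts` and `order_of_magnitude` are made up for this example, and the exponent is taken as log10 rounded to the nearest integer, which is an assumption about how the slide's figures were rounded.

```python
import math


def code_counts(m, n):
    """Return (unordered, ordered, binary) code counts for a population of m.

    unordered: ways to choose n firing neurons from m, order ignored.
    ordered:   ways to choose n firing neurons from m, order significant.
    binary:    number of distinct m-bit binary words, for comparison.
    """
    unordered = math.comb(m, n)   # M! / (N! (M-N)!)
    ordered = math.perm(m, n)     # M! / (M-N)!
    binary = 2 ** m
    return unordered, ordered, binary


def order_of_magnitude(value):
    """Nearest power of ten, as an integer exponent."""
    return round(math.log10(value))


if __name__ == "__main__":
    for m, n in [(100, 20), (1000, 200)]:
        exponents = [order_of_magnitude(c) for c in code_counts(m, n)]
        print(f"M={m}, N={n}: 10^{exponents[0]}, 10^{exponents[1]}, 10^{exponents[2]}")
```

With this rounding the exponents come out as 21, 39 and 30 for M=100, N=20, and 216, 591 and 301 for M=1000, N=200. The point of the comparison is that an ordered N-of-M code can carry more information than an M-bit binary word even though only a fraction of the population fires.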