Computer Architecture – The history repeats!
D N Ranasinghe

History and the Future
• From early hardwired (application-specific) analogue ‘processors’ to stored-program-controlled, generic, flexible processors
• Von Neumann architecture extended to vector and dataflow
  • Instructions operating on stored data
  • Other models of computing: dataflow, evolution driven, pattern driven
• RISC and CISC
  • Patterson (Berkeley RISC) and Hennessy (Stanford MIPS) pioneer RISC as an alternative to CISC families such as Intel x86
  • Easy to pipeline; design based on observed program behavior (infrequent memory references)
• Future? See https://cacm.acm.org/magazines/2019/2/234352-a-newgolden-age-for-computer-architecture/fulltext

Workloads and limitations
• Early ASICs: digital signal processor (DSP) chips for digital filters, Fourier transforms, etc., with precision
• Moore’s law
  • Transistor count per chip doubles roughly every two years; often quoted as CPU performance doubling every 18 months or so
  • Has been a rule of thumb for processor designers, but is now reaching its limit
  • Nanometre-scale VLSI fabrication brings circuit dimensions down to the order of a few wavelengths and makes power dissipation a limiting issue
• Emergence of multicore (symmetric multiprocessing/SMP) and many-core processors (GPU)
• Unrelenting push for higher throughput, from MIPS to petaflops
  • IPS = cycles per second / CPI
  • Increase the clock rate and/or reduce CPI

Flynn’s classification
• SISD
• SIMD – vector processing (good for some applications)
  • Each instruction operates on a data vector
• MIMD – many-core and multicore (mostly shared memory), and distributed memory
  • Shared memory – cache coherence problems; concurrent programming paradigms: locks, semaphores and monitors
  • Distributed memory – message-passing clusters; what is a good programming paradigm for task parallelism? Still being asked (e.g., scatter-gather?)
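The throughput relation from the workloads slide, IPS = cycles per second / CPI, can be checked with a short calculation. The sketch below is only illustrative: the function names and the four design points (clock rates and CPI values) are hypothetical, chosen to show that raising the clock rate and lowering CPI are independent levers.

```python
def instructions_per_second(clock_hz: float, cpi: float) -> float:
    """IPS = cycles per second / cycles per instruction (CPI)."""
    if cpi <= 0:
        raise ValueError("CPI must be positive")
    return clock_hz / cpi


# Hypothetical design points (clock rate in Hz, CPI); not measured data.
designs = {
    "baseline (2 GHz, CPI 2.0)": (2.0e9, 2.0),
    "faster clock (3 GHz, CPI 2.0)": (3.0e9, 2.0),
    "lower CPI (2 GHz, CPI 1.0)": (2.0e9, 1.0),
    "both (3 GHz, CPI 0.5)": (3.0e9, 0.5),
}


def report(design_points):
    """Map each design name to its throughput in MIPS."""
    return {
        name: instructions_per_second(clock_hz, cpi) / 1e6
        for name, (clock_hz, cpi) in design_points.items()
    }


for name, mips in report(designs).items():
    print(f"{name}: {mips:.0f} MIPS")
```

With these assumed numbers, halving CPI at a fixed clock gains more than the 1.5x clock increase, and a CPI below 1 is only reachable when more than one instruction completes per cycle.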
CPI classification
• CPI > 1
  • Non-pipelined
• CPI = 1 (ideal)
  • Instruction pipeline
  • Data hazards, control hazards, structural hazards
  • Data hazards: RAW, WAR, WAW
  • Control hazards: branch prediction (BTB)
• CPI < 1
  • Multiple issue

Exploiting parallelism (in code)
• ILP – instruction-level parallelism
• Compiler support
  • Loop unrolling
• Hardware-based dynamic scheduling – in-order issue and out-of-order execution on single-issue pipelines/Tomasulo (dataflow model)
  • Tracks the execution progress of in-flight instructions (scoreboard; reservation stations in Tomasulo's scheme)
  • Register renaming removes WAR and WAW hazards
  • The common data bus resolves RAW hazards in dataflow fashion: a waiting instruction starts as soon as its operands are broadcast, which gives rise to out-of-order execution
• Hardware-based speculation – dynamic branch prediction, speculative execution on the predicted outcome of the branch, and undoing the effects if necessary
• Multiple-issue pipelines

TLP
• Thread-level parallelism
• Motivation: to reduce context switching overhead
• Hyper-threading/SMT
• In multiple-issue pipelines, several threads can be in flight at once
• Multiple issue + RISC pipelined + multithreaded programs

Exploiting data parallelism
• Data parallelism can be captured by vectors
• GPUs as graphics processors optimizing on vector processing capability
• Throughput exceeding 1 TFLOPS
• CUDA as programming language for handling threads operating on vectors

Basic Structure
• A CUDA thread of SIMD instructions (a warp, in NVIDIA's terminology) operates on a SIMD vector of, say, 32 data elements, each element having its own small general-purpose register set, ALU and LD/ST unit
• Say, 16 such threads constitute a thread block, which has a per-block local memory
• Say, 16 such thread blocks constitute a logical grid, and one or more such logical grids share the global GPU memory

Emergence of domain-specific architectures
• Neural network (NN) structure -> Google Tensor Processing Unit (TPU)
• SDN controller
• Automata processor – FPGA implementation of NFA (string matching: bioinformatics?)
• Graph processor
• Nvidia GPU vector processor
• Theoretically, any specific type of computation can be designed and implemented in hardware using an FPGA (e.g., evolutionary model/DNA chips?); making it generic to its class of problems is hard.

Google’s TPU
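The domain-specific list maps neural-network structure onto the TPU. A minimal sketch of why that mapping is attractive, assuming a plain fully connected layer: almost all of the work is multiply-accumulate (MAC) operations inside a matrix-vector product, which is the class of operation a matrix-oriented accelerator is built to speed up. The function name, weights and inputs below are hypothetical, and pure Python stands in for the hardware.

```python
def dense_layer(weights, bias, x):
    """Compute y = relu(W x + b) and count the multiply-accumulates used.

    weights: list of rows (one row per output neuron)
    bias:    one bias value per output neuron
    x:       input vector
    """
    outputs = []
    macs = 0
    for row, b in zip(weights, bias):
        acc = b
        for w, xi in zip(row, x):
            acc += w * xi          # one multiply-accumulate (MAC)
            macs += 1
        outputs.append(max(0.0, acc))  # ReLU activation
    return outputs, macs


# Hypothetical 2x3 layer, values chosen only for illustration.
W = [[1.0, -2.0, 0.5],
     [0.0, 1.0, 1.0]]
b = [0.5, -1.0]
x = [2.0, 1.0, 4.0]

y, macs = dense_layer(W, b, x)
print(y, macs)  # prints: [2.5, 4.0] 6
```

A layer with m outputs and n inputs needs m × n MACs, and the rows are independent of one another, so dedicated hardware can evaluate many of them in parallel instead of looping as this sketch does.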