Computer Architecture

Computer Architecture – The history repeats!
D N Ranasinghe
History and the Future
• From early hardwired (application-specific) analogue ‘processors’ to
stored-program, general-purpose, flexible processors
• Von Neumann architecture extended to vector and dataflow
• Instructions operating on stored data
• Other models of computing: data flow, evolution driven, pattern driven
• Hennessy and Patterson pioneer RISC (Stanford MIPS and Berkeley RISC) as an alternative to the CISC family (e.g., Intel x86)
• Pipelinable, based on program behavior (infrequent memory references)
• Future? See https://cacm.acm.org/magazines/2019/2/234352-a-new-golden-age-for-computer-architecture/fulltext
Workloads and limitations
• Early ASICs: digital signal processor (DSP) chips for digital filters, Fourier
transforms, etc., with precision
• Moore’s law
• Transistor count per chip doubles roughly every two years; often quoted as CPU performance doubling every 18 months or so
• Has been a rule of thumb for processor designers, but is now reaching its limit
• Nanometre-scale VLSI fabrication brings circuit dimensions down to the order of a few
wavelengths, and power dissipation becomes a limiting issue
• Emergence of multi core (symmetric multi processing/SMP) and many core
processors (GPU)
• Unrelenting push for higher throughput from MIPS to petaflops
• IPS = cycles per second / CPI
• Increase the clock rate and/or reduce CPI
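The relation IPS = cycles per second / CPI can be checked with a short calculation. The following is a minimal Python sketch; the function name and the clock rates and CPI values are illustrative assumptions, not measurements of any real processor.

```python
def instructions_per_second(clock_hz: float, cpi: float) -> float:
    """IPS = cycles per second / cycles per instruction."""
    if cpi <= 0:
        raise ValueError("CPI must be positive")
    return clock_hz / cpi


# Hypothetical processors (numbers chosen only for illustration).
baseline = instructions_per_second(2e9, 2.0)      # 2 GHz clock, CPI = 2
faster_clock = instructions_per_second(4e9, 2.0)  # double the clock rate
lower_cpi = instructions_per_second(2e9, 1.0)     # halve the CPI instead

for name, ips in [("baseline", baseline),
                  ("faster clock", faster_clock),
                  ("lower CPI", lower_cpi)]:
    print(f"{name}: {ips / 1e6:.0f} MIPS")
```

Doubling the clock rate and halving the CPI give the same throughput here, which is why both routes appear as options.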
Flynn’s classification
• SIMD – vector processing (good for some applications)
• Each instruction operates on a data vector
• MIMD – many-core and multicore (mostly shared memory), and
distributed memory
• Shared memory – cache coherence problems; concurrent programming
paradigms: locks, semaphores and monitors
• Distributed memory – message passing clusters; what is a good programming
paradigm for task parallelism? Still being asked (e.g., scatter-gather?)
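As an illustration of the lock-based paradigm listed for shared memory, here is a minimal Python sketch using the standard `threading` module. The function names, thread count and iteration count are assumptions made for the example; it shows only mutual exclusion on a shared counter, not cache coherence itself.

```python
import threading

counter = 0                 # shared state visible to every thread
lock = threading.Lock()     # protects the read-modify-write on counter


def add_many(n: int) -> None:
    """Increment the shared counter n times, holding the lock each time."""
    global counter
    for _ in range(n):
        with lock:          # only one thread at a time may update counter
            counter += 1


def run(num_threads: int = 4, n: int = 10_000) -> int:
    """Start the workers, wait for all of them, return the final count."""
    global counter
    counter = 0
    threads = [threading.Thread(target=add_many, args=(n,))
               for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter


total = run()
print(total)
```

With the lock in place, the result is always the number of threads times the number of increments per thread.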
CPI classification
• >1
• Non pipelined
• =1
• Instruction pipeline
• Data hazards, control hazards, structural hazards
• Data hazards: RAW, WAR, WAW
• Control hazards: branch prediction (BTB)
• <1
• Multiple issue
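The three data hazards can be told apart by comparing the registers that two instructions read and write. The sketch below is a minimal Python illustration; the `Instr` type, the register names and the sample instructions are hypothetical and not tied to any particular instruction set.

```python
from typing import FrozenSet, NamedTuple, Set


class Instr(NamedTuple):
    text: str               # human-readable form, for display only
    dest: str               # register written
    srcs: FrozenSet[str]    # registers read


def hazards(first: Instr, second: Instr) -> Set[str]:
    """Data hazards between two instructions, `first` earlier in program order."""
    found = set()
    if first.dest in second.srcs:
        found.add("RAW")    # second reads what first writes
    if second.dest in first.srcs:
        found.add("WAR")    # second overwrites what first still has to read
    if first.dest == second.dest:
        found.add("WAW")    # both write the same register
    return found


i1 = Instr("ADD R1, R2, R3", "R1", frozenset({"R2", "R3"}))
i2 = Instr("SUB R4, R1, R5", "R4", frozenset({"R1", "R5"}))
i3 = Instr("MUL R2, R6, R7", "R2", frozenset({"R6", "R7"}))
i4 = Instr("OR  R1, R8, R9", "R1", frozenset({"R8", "R9"}))

for later in (i2, i3, i4):
    print(f"{i1.text} -> {later.text}: {sorted(hazards(i1, later))}")
```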
Exploiting parallelism (in code)
• ILP – Instruction level parallelism
• Compiler support
• Loop unrolling
• Hardware-based dynamic scheduling – in-order issue and out-of-order
execution on single-issue pipelines, e.g., Tomasulo’s algorithm (dataflow model)
• Keeps a scoreboard of execution progress (reservation stations in Tomasulo’s algorithm)
• Register renaming removes WAR and WAW hazards
• The common data bus resolves RAW hazards in dataflow fashion: each result is broadcast to the
instructions waiting on it, which execute as soon as their operands are ready, giving rise to out-of-order execution
• Hardware-based speculation – dynamic branch prediction, speculative execution past
the predicted branch outcome, and undoing the effects on a misprediction
• Multiple issue pipelines
• Thread level parallelism
• Motivation to reduce context switching overhead
• Hyper threading/SMT
• In multiple-issue pipelines, instructions from several threads can be issued in the same cycle
• Multiple issue + RISC pipelined + multithreaded programs
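To see why register renaming removes WAR and WAW hazards while leaving RAW dependences intact, consider the following minimal Python sketch. It is an assumption-laden toy: the `rename` function, the three-operand instruction tuples and the unlimited pool of physical registers `P0, P1, ...` are all hypothetical, and real hardware renames with a finite register file or reservation-station tags.

```python
from itertools import count


def rename(program):
    """Rewrite (dest, src1, src2) tuples over fresh physical registers."""
    fresh = count()
    mapping = {}  # architectural register -> current physical register

    def lookup(reg):
        # A register read before any write gets a physical register on first use.
        if reg not in mapping:
            mapping[reg] = f"P{next(fresh)}"
        return mapping[reg]

    renamed = []
    for dest, src1, src2 in program:
        s1, s2 = lookup(src1), lookup(src2)  # read sources before remapping dest
        mapping[dest] = f"P{next(fresh)}"    # every write gets a new register
        renamed.append((mapping[dest], s1, s2))
    return renamed


program = [
    ("R1", "R2", "R3"),  # R1 = R2 op R3
    ("R4", "R1", "R5"),  # RAW on R1: a true dependence, must survive renaming
    ("R2", "R6", "R7"),  # WAR on R2 against the first instruction
    ("R1", "R8", "R9"),  # WAW on R1 against the first instruction
]

renamed = rename(program)
for before, after in zip(program, renamed):
    print(before, "->", after)
```

After renaming, every destination is distinct and no later instruction writes a register that an earlier one reads, so only the true (RAW) dependence between the first two instructions remains.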
Exploiting Data parallelism
• Data parallelism can be captured by vectors
• GPUs are graphics processors optimized for vector processing
• Throughput exceeding 1 TFLOPS
• CUDA as the programming language for handling threads operating on vectors
Basic Structure
• A SIMD thread (a warp, in Nvidia’s terms) operates on a SIMD vector of, say, 32 data elements; each element
is handled by one CUDA thread (lane) with its own small general-purpose register set, ALU and LD/ST unit
• Say, 16 such SIMD threads constitute a thread block, which has a per-block local memory
• Say, 16 such thread blocks constitute a logical grid, and one or more such logical
grids share the global GPU memory
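The grid, thread block and thread hierarchy determines which data element each thread works on. The sketch below simulates this in plain Python rather than CUDA, so the "kernel" is an ordinary function and the threads run one after another. The sizes reuse the example figures above (32 elements per SIMD vector, 16 of these per block, 16 blocks per grid); the function names and the vector-add workload are assumptions made for illustration.

```python
THREADS_PER_BLOCK = 32 * 16   # 16 SIMD vectors of 32 elements per thread block
BLOCKS_PER_GRID = 16          # thread blocks in the logical grid


def vector_add_kernel(block_idx, thread_idx, a, b, out):
    """Work done by one thread: handle exactly one element."""
    # Same arithmetic as CUDA's blockIdx.x * blockDim.x + threadIdx.x
    i = block_idx * THREADS_PER_BLOCK + thread_idx
    if i < len(out):          # guard: the grid may be larger than the data
        out[i] = a[i] + b[i]


def launch(a, b):
    """Sequential stand-in for a kernel launch over the whole grid."""
    out = [0] * len(a)
    for block_idx in range(BLOCKS_PER_GRID):
        for thread_idx in range(THREADS_PER_BLOCK):
            vector_add_kernel(block_idx, thread_idx, a, b, out)
    return out


n = 8000                      # deliberately not a multiple of the block size
a = list(range(n))
b = [2 * x for x in a]
result = launch(a, b)
print(result[:4], result[-1])
```

On a GPU the two loops in `launch` disappear: all threads of a block run on the hardware concurrently, and the bounds check handles the threads that fall past the end of the data.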
Emergence of domain specific architectures
• Neural network (NN) structure -> Google Tensor Processing Unit (TPU)
• SDN controller
• Automata processor – FPGA implementation of NFA (string matching:
• Graph processor
• Nvidia GPU vector processor
• In theory, any specific computation can be designed and
implemented in hardware using an FPGA (e.g., evolutionary
models/DNA chips?); Google’s TPU is an example of making such
hardware generic to its class of problems
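The NFA-based string matching mentioned for the automata processor can be mimicked in software by tracking the set of active states. The following is a minimal Python sketch under stated assumptions: it handles only a literal pattern, the function name `nfa_find` is hypothetical, and the states are advanced in a loop, whereas a hardware implementation would update all active states in the same step.

```python
def nfa_find(pattern: str, text: str) -> list:
    """Return the end positions (exclusive) of every occurrence of pattern."""
    accept = len(pattern)     # state k means "the first k characters matched"
    active = {0}              # set of currently active NFA states
    matches = []
    for pos, ch in enumerate(text):
        nxt = {0}             # start state stays active: a match may begin anywhere
        for state in active:
            if state < accept and pattern[state] == ch:
                nxt.add(state + 1)
        active = nxt
        if accept in active:
            matches.append(pos + 1)
    return matches


print(nfa_find("aba", "ababa"))
```

Because several states can be active at once, overlapping occurrences are found in a single pass over the text.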