Architecture Classifications
Prof. (Dr.) Parul Goyal, Professor
High Performance Computing BCSE-526
B.Tech Eighth Semester, Computer Science & Engineering, M. M. Engineering College, Maharishi Markandeshwar (Deemed to Be University), Mullana, Ambala - 133207

• Flynn’s taxonomy
  • A way of describing the information flow in computers: an architectural definition
  • Information is divided into instructions (I) and data (D)
  • There can be a single (S) or multiple (M) instances of each
  • Four combinations: SISD, SIMD, MISD, MIMD

SISD
• Single Instruction, Single Data
• An absolutely serial execution model
• Typically viewed as describing a serial computer, but today’s CPUs exploit parallelism
[Figure: a single processor (P) operating on a single data element in memory (M)]

SIMD
• Single Instruction, Multiple Data
• In this case one instruction is applied to multiple data streams at the same time
[Figure: a single instruction processor K broadcasts instructions to an array of processing elements (PEs) P; each processor typically has its own data memory (Ma, Mb, Mc)]

MISD
• Multiple Instruction, Single Data
• Largely useless definition (not important)
• Closest relevant example would be a CPU that can ‘pipeline’ instructions
[Figure: processors P, each with its own instruction memory Mi, attached to one data memory Ma; each processor has its own instruction stream but operates on the same data stream]
• Example: systolic array, a network of small elements connected in a regular grid, operating under a global clock and reading and writing elements from/to neighbours

MIMD
• Multiple Instruction, Multiple Data
• Covers a host of modern architectures
[Figure: several processors P and memories M; processors have independent data and instruction streams and may communicate directly or via shared memory]

Instruction Set Architecture
• ISA: the interface between hardware and software
• ISAs are typically common to a CPU family, e.g.
x86, MIPS (more alike than different)
• Assembly language is a realization of the ISA in a form that is easy to remember (and program)

Key Concept in ISA evolution and CPU design
• Efficiency gains are to be had by executing as many operations per clock cycle as possible
• Instruction level parallelism (ILP)
  • Exploit parallelism within the instruction stream
  • Programmer does not see this parallelism explicitly
• Goal of modern CPU design: maximize the number of instructions per clock cycle (IPC), equivalently reduce cycles per instruction (CPI)

ILP versus thread level parallelism
• Many modern programs have more than one (parallel) “thread” of execution
• Instruction level parallelism breaks down a single thread of execution to try and find parallelism at the instruction level
[Figure: one “thread” of instructions 1, 2, 3; these instructions are executed in parallel even though there is one thread]

ILP techniques
• The two main ILP techniques are
  • Pipelining, including additional techniques such as out-of-order execution
  • Superscalar execution

Pipelining
• Multiple instructions overlapped in execution
• Throughput optimization: doesn’t reduce time for individual instructions
[Figure: instructions 1, 2 and 3 overlapped in a seven-stage pipeline (Stage 1 to Stage 7)]

Design sweetspot
• Pipeline stepping time is determined by the slowest operation in the pipeline
• Best speed-up: if all operations take the same amount of time
• Net time per instruction (ideal, balanced case) = unpipelined time per instruction / number of pipeline stages
• Perfect speed-up factor = number of pipeline stages
• Never achieved: start-up overheads to consider

Pipeline compromises
• Time to issue one instruction without pipelining, Stages 1 to 7: 10ns + 10ns + 5ns + 10ns + 5ns + 10ns + 5ns = 55ns
• Pipelined, every stage is stepped at the slowest stage time: 7 stages x 10ns = 70ns
• The 5ns stages (3, 5 and 7) take longer than necessary

Superscalar execution
• Careful about definitions: superscalar execution is not simply
about having multiple instructions in flight
• Superscalar processors have more than one of a given functional unit (such as the arithmetic logic unit (ALU) or load/store unit)

Benefits of superscalar design
• Having more than one functional unit of a given type can help schedule more instructions within the pipeline
• The Pentium IV pipeline was 20 stages deep!
  • Enormous throughput potential but big pipeline stall penalty
• Subdividing the pipeline into many shorter stages in this way is sometimes called superpipelining

Other ways of increasing ILP
• Branch prediction
  • Predict which path will be taken by assigning certain probabilities
• Out of order execution
  • Independent operations can be rescheduled in the instruction stream
• Pipelined functional units
  • Floating point units can be pipelined to increase throughput

Limits of ILP
• See D. Wall “Limits of ILP” 1991
• Probability of hitting hazards (instructions that cannot be pipelined) increases with added pipeline length
• Instruction fetch and decode rate
  • Remember the “von Neumann” bottleneck? Would be nice to have a single instruction for multiple operations…
• Branch prediction
  • Multiple condition statements increase branches severely
• Cache locality and memory limitations
  • Finite limits to effectiveness of prefetch

Scalar Processor Architectures
[Figure: ‘Scalar’ architectures split into two branches, pipelined and superscalar]
• Pipelined: functional unit parallelism, e.g. load/store and arithmetic units can be used in parallel (instructions in parallel)
• Superscalar: multiple functional units, e.g.
4 floating point units can operate at the same time
Modern processors exploit parallelism, and can’t really be called SISD

Complex Instruction Set Computing
• CISC: older design idea (the x86 instruction set is CISC)
• Many (powerful) instructions supported within the ISA
• Upside: makes assembly programming much easier (lots of assembly programming in the 1960s-70s)
• Upside: reduced instruction memory usage
• Downside: designing the CPU is much harder

Reduced Instruction Set Computing
• RISC: newer concept than CISC (but still old)
• ARM and RISC-V(!) are RISC designs; Intel and AMD x86 chips are RISC only under the hood (see last bullet)
• Small instruction set; a CISC-type operation becomes a chain of RISC operations
• Upside: easier to design the CPU
• Upside: smaller instruction set => higher clock speed
• Downside: assembly language typically longer (a compiler design issue though)
• Most modern x86 processors are implemented using RISC techniques

Birth of RISC
• Roots can be traced to three research projects
  • IBM 801 (late 1970s, J. Cocke)
  • Berkeley RISC processor (~1980, D. Patterson)
  • Stanford MIPS processor (~1981, J. Hennessy)
• Stanford & Berkeley projects driven by interest in building a simple chip that could be made in a university environment
• Commercialization benefitted from 3 independent projects
  • Berkeley Project -> begat SPARC (Sun Microsystems)
  • Stanford Project -> begat MIPS (used by SGI)

RISC processors
• Complexity has nonetheless increased significantly
• Superscalar execution (where the CPU has multiple functional units of the same type, e.g. two add units) requires complex circuitry to control scheduling of operations
• A digression: what if we could remove the scheduling complexity by using a smart compiler…?
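
The earlier point that a CISC-type operation becomes a chain of RISC operations can be made concrete with a small sketch. Everything here is an assumption made for illustration: the memory-to-memory ADD, the register names `r1`/`r2`, the opcode tuples and the helper names `expand_cisc_add` and `run` are hypothetical and do not correspond to any real ISA or to how a real decoder works.

```python
# Toy illustration only: a made-up CISC-style instruction and made-up
# RISC-style operations. None of this models a real ISA.

def expand_cisc_add(dst_addr, src_addr):
    """Expand a hypothetical memory-to-memory ADD into load/store-style steps."""
    return [
        ("LOAD", "r1", dst_addr),    # r1 <- mem[dst_addr]
        ("LOAD", "r2", src_addr),    # r2 <- mem[src_addr]
        ("ADD", "r1", "r1", "r2"),   # r1 <- r1 + r2 (registers only)
        ("STORE", "r1", dst_addr),   # mem[dst_addr] <- r1
    ]

def run(ops, memory):
    """Interpret the toy operations against a dict-based memory."""
    regs = {}
    for op in ops:
        if op[0] == "LOAD":
            regs[op[1]] = memory[op[2]]
        elif op[0] == "ADD":
            regs[op[1]] = regs[op[2]] + regs[op[3]]
        elif op[0] == "STORE":
            memory[op[2]] = regs[op[1]]
    return memory

memory = {0x10: 5, 0x14: 7}
risc_ops = expand_cisc_add(0x10, 0x14)
result = run(risc_ops, dict(memory))   # run on a copy of memory
print(len(risc_ops), result[0x10])     # prints "4 12"
```

Only the LOAD and STORE steps touch memory; the arithmetic works purely on registers. One powerful instruction has become four simple ones, which is the "assembly language typically longer" downside noted on the RISC slide.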

RISC behemoth: ARM
• Most common chips in the world are now based on designs from Advanced RISC Machines (ARM)
• Started out 36 years ago building microcomputers in the UK
• Licenses its ISA out to other companies
• Apple, Nvidia, Samsung, AMD, Broadcom, Fujitsu, Amazon, Huawei and Qualcomm all use ARM technology

VLIW & EPIC
• VLIW: very long instruction word
• Idea: pack a number of non-interdependent operations into one long instruction
• Strong emphasis on compilers to schedule instructions
• When executed, words are easily broken up, allowing operations to be dispatched to independent execution units
[Figure: 3 instructions (Instr 1, Instr 2, Instr 3) scheduled into one long instruction word]

VLIW & EPIC II
• Natural successor to RISC: designed to avoid the need for complex scheduling in RISC designs
• VLIW processors should be faster and less expensive than RISC
• EPIC: explicitly parallel instruction computing, Intel’s implementation (roughly) of VLIW
• ISA is called IA-64

VLIW & EPIC III
• Hey, it’s 2021: why aren’t we all using Intel Itanium processors?
• AMD figured out an easy extension to make x86 support 64 bits & introduced multicore
• Backwards compatibility + “good enough performance” + poor Itanium compiler performance killed IA-64

RISC vs CISC recap
RISC (popular by mid 80s): operations on registers
• Pro: Small instruction set makes design easy
• Pro: Decreased CPI, but also get faster CPU through easier design (tc reduced)
• Con: Complicated instructions must be built from simpler ones
• Con: Efficient compiler technology absolutely essential
CISC (pre 1970s): operations directly on memory
• Pro: Many powerful instructions, easy to write assembly language*
• Pro: Reduced memory requirement for instructions, reduced number of total instructions (Ni)*
• Con: ISA often large and wasteful (20-25% usage)
• Con: ISA hard to debug during development

Who “won”? Not VLIW!
• Modern x86 processors are RISC-CISC hybrids
• ISA is translated at hardware level to shorter instructions
• Very complicated designs though, lots of scheduling hardware
• MIPS, Sun SPARC, DEC Alpha were much truer implementations of the RISC ideal
• Modern metric for determining RISCkyness of design: does the ISA access memory only through LOAD/STORE instructions?

Evolution of Instruction Sets
• Single Accumulator (EDSAC 1950)
• Accumulator + Index Registers (Manchester Mark I, IBM 700 series 1953)
• Separation of Programming Model from Implementation
  • High-level Language Based (B5000 1963)
  • Concept of a Family (IBM 360 1964)
• General Purpose Register Machines
  • Complex Instruction Sets (VAX, Intel 432 1977-80)
  • Load/Store Architecture (CDC 6600, Cray 1 1963-76)
• RISC (MIPS, SPARC, HP-PA, IBM RS6000, PowerPC ... 1987)
• LIW/“EPIC”? (IA-64 ... 1999)

Simultaneous multithreading
• Completely different technology to ILP
• NOT multi-core
• Designed to overcome lack of fine grained parallelism in code
• Idea is to fill any potential gaps in the processor pipeline by switching between threads of execution on very short time scales
• Requires the programmer to have created a parallel program for this to work, though
• One physical processor looks like two logical processors

Motivation for SMT
• Strong motivation for SMT: memory latency is making load operations take longer and longer
• Need some way to hide this bottleneck (memory wall again!)
• SMT: switch execution over to threads that have their data and execute those
• Tera MTA (Tera later bought Cray Research and became Cray Inc.): an attempt to design a computer entirely around this concept

SMT Example: IBM Power 9
• 12 to 24 cores, each core can support up to 8 threads
• SMT gives ~40-50% improvement in performance going from 1 to 2 threads
  • Not bad
  • Intel Hyperthreading ~ 20-30% improvement
• 8 threads gets to a 100% performance increase

Multiple cores
• Simply add more CPUs
• Easiest way to increase throughput now
• Why do this?
  • Response to the problem of increasing power dissipation on modern CPUs
  • We’ve essentially reached the limit on improving individual core speeds
• Design involves compromise: n CPUs must now share the memory bus, leaving less bandwidth for each

Intel & AMD multi-core processors
• Intel 56-core processors
  • “Xeon Platinum”
  • Design envelope 400W, but divide by number of cores => each core is very power efficient
  • $20k each(!)
• AMD has 64-core processors
  • “Ryzen Threadripper”
  • 280 W design envelope
  • Individual cores not as good as Intel though (20% less speed)
  • $4k

RISC-V (2010)
• A new approach to CPU design
  • “Linux of processor design”
• ISA design available via open source licenses
  • No fees to use it
• Design tools readily available
• Dozens of CPU designs now created based on RISC-V
• Further opens up the possibility of domain specific hardware

Summary
• Flynn’s taxonomy categorizes instruction and data flow in computers
• Modern processors are MIMD
• Pipelining and superscalar design improve CPU performance by increasing the instructions per clock
• CISC/RISC design approaches appear to be reaching the limits of their applicability
• In the absence of improved single core performance, designers are simply integrating more cores
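
As a worked illustration of the summary point on pipelining, the stage times from the "Pipeline compromises" slide (10, 10, 5, 10, 5, 10, 5 ns) can be put through a small calculation. This is a minimal sketch under simplifying assumptions that are not from the slides: no hazards or stalls, one instruction entering per step, and a fill cost of (stages - 1) steps. The function name `pipeline_times` is made up for this example.

```python
def pipeline_times(stage_ns, n_instructions):
    """Return (unpipelined_ns, pipelined_ns, speedup) for an idealised pipeline.

    Assumes no stalls: the first instruction needs k steps to fill the
    pipeline, and each later instruction completes one step after the last.
    """
    k = len(stage_ns)
    step = max(stage_ns)                    # stepping time = slowest stage
    unpipelined = n_instructions * sum(stage_ns)
    pipelined = (k + n_instructions - 1) * step
    return unpipelined, pipelined, unpipelined / pipelined

stages = [10, 10, 5, 10, 5, 10, 5]          # ns, from the slide
print(pipeline_times(stages, 1))            # one instruction: 55 ns vs 70 ns
print(pipeline_times(stages, 1000))         # long run: 55000 ns vs 10060 ns
print(pipeline_times([10] * 7, 1_000_000))  # balanced stages, speedup near 7
```

A single instruction is slower when pipelined (70 ns against 55 ns), which matches the slide's remark that pipelining is a throughput optimization. Over 1000 instructions the speed-up is about 5.5 rather than 7, because three stages are stepped more slowly than they need to be. Even with perfectly balanced stages, the fill overhead keeps the speed-up just under the number of stages.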