Architecture Classifications

Prof. (Dr.) Parul Goyal, Professor
High Performance Computing, BCSE-526
B.Tech Eighth Semester, Computer Science & Engineering
M. M. Engineering College, Maharishi Markandeshwar (Deemed to Be University), Mullana, Ambala - 133207

Flynn’s taxonomy
• A way of describing the information flow in computers: an architectural definition
• Information is divided into instructions (I) and data (D)
• There can be single (S) or multiple (M) instances of both
• Four combinations: SISD, SIMD, MISD, MIMD

SISD
• Single Instruction, Single Data
• An absolutely serial execution model
• Typically viewed as describing a serial computer, but today’s CPUs exploit parallelism
[Figure: a single processor (P) operating on a single data element in memory (M)]

SIMD
• Single Instruction, Multiple Data
• In this case one instruction is applied to multiple data streams at the same time
[Figure: array of processors. A single instruction processor, K, broadcasts each instruction to the processing elements (PEs); each processor (P) typically has its own data memory (Ma, Mb, Mc)]

MISD
• Multiple Instruction, Single Data
• Largely useless definition (not important)
• Closest relevant example would be a CPU that can ‘pipeline’ instructions
• Example: systolic array, a network of small elements connected in a regular grid, operating under a global clock and reading and writing elements from/to neighbours
[Figure: each processor (P) has its own instruction stream (Mi) but operates on the same data stream (Ma)]

MIMD
• Multiple Instruction, Multiple Data
• Covers a host of modern architectures
[Figure: several processors (P) and memories (M). Processors have independent data and instruction streams and may communicate directly or via shared memory]

Instruction Set Architecture
• ISA: the interface between hardware and software
• ISAs are typically common to a CPU family, e.g.
x86, MIPS (more alike than different)
• Assembly language is a realization of the ISA in a form that is easy to remember (and program)

Key Concept in ISA evolution and CPU design
• Efficiency gains are to be had by executing as many operations per clock cycle as possible
• Instruction level parallelism (ILP)
  • Exploit parallelism within the instruction stream
  • Programmer does not see this parallelism explicitly
• Goal of modern CPU design: maximize the number of instructions per clock cycle (IPC), equivalently reduce cycles per instruction (CPI)

ILP versus thread level parallelism
• Many modern programs have more than one (parallel) “thread” of execution
• Instruction level parallelism breaks down a single thread of execution to try and find parallelism at the instruction level
[Figure: one “thread” of instructions 1, 2, 3. These instructions are executed in parallel even though there is one thread]

ILP techniques
• The two main ILP techniques are
  • Pipelining, including additional techniques such as out-of-order execution
  • Superscalar execution

Pipelining
• Multiple instructions overlapped in execution
• Throughput optimization: doesn’t reduce time for individual instructions
[Figure: instructions 1, 2 and 3 overlapped in a seven-stage pipeline (Stage 1 to Stage 7)]

Design sweetspot
• Pipeline stepping time is determined by the slowest operation in the pipeline
• Best speed-up: if all operations take the same amount of time
• Net time per instruction = unpipelined instruction time / number of pipeline stages (ideal case; this equals the stepping time when all stages take the same time)
• Perfect speed-up factor = number of pipeline stages
• Never achieved: start-up overheads to consider

Pipeline compromises
               Stage 1  Stage 2  Stage 3  Stage 4  Stage 5  Stage 6  Stage 7  Total
Unpipelined    10ns     10ns     5ns      10ns     5ns      10ns     5ns      55ns
Pipelined      10ns     10ns     10ns     10ns     10ns     10ns     10ns     70ns
• Time to issue an instruction without pipelining: 55ns
• In the pipeline, the 5ns stages take longer than necessary, so one instruction takes 70ns

Superscalar execution
• Careful about definitions: superscalar execution is not simply
about having multiple instructions in flight
• Superscalar processors have more than one of a given functional unit (such as the arithmetic logic unit (ALU) or load/store)

Benefits of superscalar design
• Having more than one functional unit of a given type can help schedule more instructions within the pipeline
• The Pentium IV pipeline was 20 stages deep!
• Enormous throughput potential but big pipeline stall penalty
• Incorporation of multiple units into the pipeline is sometimes called superpipelining

Other ways of increasing ILP
• Branch prediction
  • Predict which path will be taken by assigning certain probabilities
• Out of order execution
  • Independent operations can be rescheduled in the instruction stream
• Pipelined functional units
  • Floating point units can be pipelined to increase throughput

Limits of ILP
• See D. Wall “Limits of ILP” 1991
• Probability of hitting hazards (instructions that cannot be pipelined) increases with added length
• Instruction fetch and decode rate
  • Remember the “von Neumann” bottleneck? Would be nice to have single instruction for multiple operations…
• Branch prediction
  • Multiple condition statements increase branches severely
• Cache locality and memory limitations
  • Finite limits to effectiveness of prefetch

Scalar Processor Architectures
• ‘Scalar’
• Pipelined
• Functional unit parallelism, e.g. load/store and arithmetic units can be used in parallel (instructions in parallel)
• Superscalar: multiple functional units, e.g.
4 floating point units can operate at the same time
• Modern processors exploit parallelism, and can’t really be called SISD

Complex Instruction Set Computing
• CISC: older design idea (x86 instruction set is CISC)
• Many (powerful) instructions supported within the ISA
• Upside: makes assembly programming much easier (lots of assembly programming in the 1960s-70s)
• Upside: reduced instruction memory usage
• Downside: designing the CPU is much harder

Reduced Instruction Set Computing
• RISC: newer concept than CISC (but still old)
• ARM and RISC-V(!) are RISC ISAs; Intel and AMD x86 CPUs are RISC-style designs internally
• Small instruction set; a CISC-type operation becomes a chain of RISC operations
• Upside: easier to design the CPU
• Upside: smaller instruction set => higher clock speed
• Downside: assembly language typically longer (a compiler design issue though)
• Most modern x86 processors are implemented using RISC techniques

Birth of RISC
• Roots can be traced to three research projects
  • IBM 801 (late 1970s, J. Cocke)
  • Berkeley RISC processor (~1980, D. Patterson)
  • Stanford MIPS processor (~1981, J. Hennessy)
• Stanford & Berkeley projects driven by interest in building a simple chip that could be made in a university environment
• Commercialization benefitted from 3 independent projects
  • Berkeley Project -> begat Sun Microsystems
  • Stanford Project -> begat MIPS (used by SGI)

RISC processors
• Complexity has nonetheless increased significantly
• Superscalar execution (where the CPU has multiple functional units of the same type, e.g. two add units) requires complex circuitry to control the scheduling of operations
• A digression: what if we could remove the scheduling complexity by using a smart compiler…?
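
To make the RISC slide's point that a CISC-type operation becomes a chain of RISC operations more concrete, here is a minimal sketch in Python. The mnemonics (ADDM, LOAD, ADD, STORE), the register names and the dict-based memory are all hypothetical and chosen only for illustration; they do not come from any real ISA or from the slides.

```python
# Toy illustration only: ADDM, LOAD, ADD and STORE are hypothetical
# mnemonics, not instructions from any real ISA.

def expand_addm(dst, src1, src2):
    """Expand a memory-to-memory add (CISC style) into load/store-style steps."""
    return [
        ("LOAD", "r1", src1),        # r1 <- memory[src1]
        ("LOAD", "r2", src2),        # r2 <- memory[src2]
        ("ADD", "r3", "r1", "r2"),   # r3 <- r1 + r2 (registers only)
        ("STORE", "r3", dst),        # memory[dst] <- r3
    ]

def run(program, memory):
    """Execute the simple operations against a dict-based memory."""
    regs = {}
    for op, *args in program:
        if op == "LOAD":
            reg, addr = args
            regs[reg] = memory[addr]
        elif op == "ADD":
            dest, a, b = args
            regs[dest] = regs[a] + regs[b]
        elif op == "STORE":
            reg, addr = args
            memory[addr] = regs[reg]
        else:
            raise ValueError(f"unknown operation: {op}")
    return memory

memory = {"x": 2, "y": 3, "z": 0}
program = expand_addm("z", "x", "y")
run(program, memory)
print(len(program), memory["z"])  # prints: 4 5
```

One memory-to-memory add turns into four simpler operations, of which only the loads and the store touch memory. This is the trade the slides describe: longer instruction sequences in exchange for simpler hardware.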
RISC behemoth: ARM
• Most common chips in the world are now based on designs from Advanced RISC Machines (ARM)
• Started out 36 years ago building microcomputers in the UK
• Licences its ISA out to other companies
• Apple, Nvidia, Samsung, AMD, Broadcom, Fujitsu, Amazon, Huawei and Qualcomm all use ARM technology

VLIW & EPIC
• VLIW: very long instruction word
• Idea: pack a number of non-interdependent operations into one long instruction
• Strong emphasis on compilers to schedule instructions
• When executed, words are easily broken up and allow operations to be dispatched to independent execution units
[Figure: 3 instructions (Instr 1, Instr 2, Instr 3) scheduled into one long instruction word]

VLIW & EPIC II
• Natural successor to RISC: designed to avoid the need for complex scheduling in RISC designs
• VLIW processors should be faster and less expensive than RISC
• EPIC: explicitly parallel instruction computing, Intel’s implementation (roughly) of VLIW
• ISA is called IA-64

VLIW & EPIC III
• Hey, it’s 2021, why aren’t we all using Intel Itanium processors?
• AMD figured out an easy extension to make x86 support 64 bits & introduced multicore
• Backwards compatibility + “good enough performance” + poor Itanium compiler performance killed IA-64

RISC vs CISC recap
RISC (popular by mid 80s): operations on registers
• Pro: small instruction set makes design easy
• Pro: decreased CPI, but also get faster CPU through easier design (tc reduced)
• Con: complicated instructions must be built from simpler ones
• Con: efficient compiler technology absolutely essential
CISC (pre 1970s): operations directly on memory
• Pro: many powerful instructions, easy to write assembly language*
• Pro: reduced memory requirement for instructions, reduced number of total instructions (Ni)*
• Con: ISA often large and wasteful (20-25% usage)
• Con: ISA hard to debug during development

Who “won”? Not VLIW!
• Modern x86 processors are RISC-CISC hybrids
• ISA is translated at hardware level to shorter instructions
• Very complicated designs though, lots of scheduling hardware
• MIPS, Sun SPARC, DEC Alpha were much truer implementations of the RISC ideal
• Modern metric for determining RISCkyness of design: does the ISA have LOAD STORE instructions to memory?

Evolution of Instruction Sets
• Single Accumulator (EDSAC 1950)
• Accumulator + Index Registers (Manchester Mark I, IBM 700 series 1953)
• Separation of Programming Model from Implementation
  • High-level Language Based (B5000 1963)
  • Concept of a Family (IBM 360 1964)
• General Purpose Register Machines
  • Complex Instruction Sets (Vax, Intel 432 1977-80)
  • Load/Store Architecture (CDC 6600, Cray 1 1963-76)
• RISC (Mips, Sparc, HP-PA, IBM RS6000, PowerPC . . . 1987)
• LIW/“EPIC”? (IA-64 . . . 1999)

Simultaneous multithreading
• Completely different technology to ILP
• NOT multi-core
• Designed to overcome lack of fine grained parallelism in code
• Idea is to fill any potential gaps in the processor pipeline by switching between threads of execution on very short time scales
• Requires the programmer to have created a parallel program for this to work though
• One physical processor looks like two logical processors

Motivation for SMT
• Strong motivation for SMT: memory latency making load operations take longer and longer
• Need some way to hide this bottleneck (memory wall again!)
• SMT: switch over execution to threads that have their data and execute those
• Tera MTA (Tera later became Cray Inc.): an attempt to design a computer entirely around this concept

SMT Example: IBM Power 9
• 12 to 24 cores, each core can support up to 8 threads
• SMT gives ~40-50% improvement in performance for 1-2 threads
• Not bad
• Intel Hyperthreading: ~20-30% improvement
• 8 threads gets to 100% performance increase

Multiple cores
• Simply add more CPUs
• Easiest way to increase throughput now
• Why do this?
• Response to the problem of increasing power output on modern CPUs
• We’ve essentially reached the limit on improving individual core speeds
• Design involves compromise: n CPUs must now share the memory bus, so less bandwidth to each

Intel & AMD multi-core processors
• Intel 56-core processors
  • “Xeon Platinum”
  • Design envelope 400W, but divide by number of processors => each core is very power efficient
  • $20k each(!)
• AMD has 64-core processors
  • “Ryzen Threadripper”
  • 280W design envelope
  • Individual cores not as good as Intel though (20% less speed)
  • $4k

RISC-V (2010)
• A new approach to CPU design
• “Linux of processor design”
• ISA design available via open source licenses
• No fees to use it
• Design tools readily available
• Dozens of CPU designs now created based on RISC-V
• Further opens up the possibility of domain specific hardware

Summary
• Flynn’s taxonomy categorizes instruction and data flow in computers
• Modern processors are MIMD
• Pipelining and superscalar design improve CPU performance by increasing the instructions per clock
• CISC/RISC design approaches appear to be reaching the limits of their applicability
• In the absence of improved single core performance, designers are simply integrating more cores
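
As a worked illustration of the pipelining arithmetic from the Design sweetspot and Pipeline compromises slides, the sketch below compares unpipelined and pipelined execution time. It is an idealized model that assumes no hazards or stalls. The stage times are the ones given in the slides; the function names and the instruction count of 1000 are assumptions made for this example.

```python
# Idealized pipeline timing model: no hazards or stalls are modelled.
# Stage times (ns) come from the "Pipeline compromises" slide; the
# function names and the instruction count are illustrative assumptions.

STAGE_TIMES_NS = [10, 10, 5, 10, 5, 10, 5]

def unpipelined_time(stage_times, n_instructions):
    """Each instruction runs through every stage before the next starts."""
    return n_instructions * sum(stage_times)

def pipelined_time(stage_times, n_instructions):
    """Every stage is stepped at the pace of the slowest stage."""
    step = max(stage_times)
    fill = len(stage_times) * step             # first instruction: full latency
    return fill + (n_instructions - 1) * step  # then one result per step

n = 1000
serial = unpipelined_time(STAGE_TIMES_NS, n)
piped = pipelined_time(STAGE_TIMES_NS, n)
speedup = serial / piped

print(unpipelined_time(STAGE_TIMES_NS, 1), pipelined_time(STAGE_TIMES_NS, 1))  # prints: 55 70
print(serial, piped)      # prints: 55000 10060
print(round(speedup, 2))  # prints: 5.47
```

With these numbers a single instruction is slower in the pipeline (70ns against 55ns), but throughput improves by a factor approaching 55/10 = 5.5 rather than the ideal 7, because the 5ns stages are stretched to the 10ns stepping time. With seven equal 10ns stages the same model approaches a speed-up of 7 for long instruction streams, without ever reaching it, because of the start-up cost of filling the pipeline.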
