15-740/18-740 Computer Architecture
Lecture 3: SIMD, MIMD, and ISA Principles
Prof. Onur Mutlu
Carnegie Mellon University
Fall 2011, 9/14/2011

Review of Last Lecture
- Vector processors
  - Advantages and disadvantages
  - Amortization of control overhead over multiple data elements
  - Vector vs. scalar code execution time
- SIMD vs. MIMD
- Concept of on-chip networks

Today
- Wrap up the introduction to on-chip networks
- ISA-level tradeoffs

Review: SIMD vs. MIMD
- SIMD: concurrency arises from performing the same operation on different pieces of data
- MIMD: concurrency arises from performing different operations on different pieces of data
  - Control/thread parallelism: execute different threads of control in parallel -> multithreading, multiprocessing
  - Idea: use multiple processors to solve a problem

Review: Flynn's Taxonomy of Computers
- Mike Flynn, "Very High-Speed Computing Systems," Proc. of the IEEE, 1966
- SISD: a single instruction operates on a single data element
- SIMD: a single instruction operates on multiple data elements (array processor, vector processor)
- MISD: multiple instructions operate on a single data element (closest forms: systolic array processor, streaming processor)
- MIMD: multiple instructions (multiple instruction streams) operate on multiple data elements (multiprocessor, multithreaded processor)

Review: On-Chip Network Based Multi-Core Systems
- A scalable multi-core is a distributed system on a chip
- (Figure: a grid of processing elements (cores, L2 banks, memory controllers, accelerators, etc.), each attached to a router; each router has buffered input ports with virtual channels (VC 0/1/2), control logic with a routing unit (RC), VC allocator (VA), and switch allocator (SA), and a crossbar connecting the ports from/to East, West, North, South, and the local PE.)

Review: Idea of On-Chip Networks
- Problem: connecting many cores with a single bus is not scalable
  - A single point of connection limits communication bandwidth: what if multiple core pairs want to communicate with each other at the same time?
  - Electrical loading on the single bus limits bus frequency
- Idea: use a network to connect cores
  - Connect neighboring cores via short links
  - Communicate between cores by routing packets over the network
- Dally and Towles, "Route Packets, Not Wires: On-Chip Interconnection Networks," DAC 2001.

Review: Bus
+ Simple
+ Cost-effective for a small number of nodes
+ Easy to implement coherence (snooping)
- Not scalable to a large number of nodes (limited bandwidth; electrical loading reduces frequency)
- High contention
(Figure: four processors, each with a private cache, sharing a single bus to memory.)

Review: Crossbar
- Every node is connected to every other node
- Good for a small number of nodes
+ Least contention in the network: high bandwidth
- Expensive
- Not scalable due to quadratic cost
- Used in core-to-cache-bank networks in IBM POWER5 and Sun Niagara I/II
(Figure: an 8x8 crossbar connecting nodes 0-7 on one side to nodes 0-7 on the other.)

Review: Mesh
- O(N) cost
- Average latency: O(sqrt(N))
- Easy to lay out on-chip: regular and equal-length links
- Path diversity: many ways to get from one node to another (a routing sketch follows below)
- Used in the Tilera 100-core processor and many on-chip network prototypes
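The router-design questions later in this lecture ask what the routing algorithm should be. As a concrete reference point, here is a minimal sketch in C of dimension-order (XY) routing, one common deterministic routing algorithm for a 2D mesh; the coordinate convention, the Direction names, and the example destination (3,2) are illustrative assumptions, not details of any particular router.

    /* Minimal sketch of deterministic XY (dimension-order) routing on a 2D mesh.
     * Coordinates and the Direction enum are illustrative only. */
    #include <stdio.h>

    typedef enum { EAST, WEST, NORTH, SOUTH, LOCAL } Direction;

    /* Route along X until the column matches, then along Y. */
    Direction xy_route(int cur_x, int cur_y, int dst_x, int dst_y) {
        if (cur_x < dst_x) return EAST;
        if (cur_x > dst_x) return WEST;
        if (cur_y < dst_y) return NORTH;
        if (cur_y > dst_y) return SOUTH;
        return LOCAL;                /* packet has arrived */
    }

    int main(void) {
        /* Hop-by-hop route from node (0,0) to node (3,2). */
        int x = 0, y = 0;
        while (!(x == 3 && y == 2)) {
            Direction d = xy_route(x, y, 3, 2);
            if (d == EAST)  x++;
            if (d == WEST)  x--;
            if (d == NORTH) y++;
            if (d == SOUTH) y--;
            printf("hop to (%d,%d)\n", x, y);
        }
        return 0;
    }

Note that this deterministic policy always picks one of the many minimal paths a mesh offers; adaptive routing, discussed below, is one way to exploit the remaining path diversity.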
Review: Torus
- A mesh is not symmetric at its edges: performance is very sensitive to whether a task is placed on an edge node or a middle node
- A torus avoids this problem
+ Higher path diversity than a mesh
- Higher cost
- Harder to lay out on-chip
- Unequal link lengths

Review: Torus, continued
- Weave (fold) the nodes to make inter-node latencies roughly constant
(Figure: a folded torus layout that equalizes link lengths.)

Review: Example NoC: 100-core Tilera Processor
- Wentzlaff et al., "On-Chip Interconnection Architecture of the Tile Processor," IEEE Micro 2007.

The Need for QoS in the On-Chip Network
- One can create malicious applications that continuously access the same resource and thereby deny service to less aggressive applications
- Need to provide packet scheduling mechanisms that ensure applications' service requirements (bandwidth/latency) are satisfied
- Grot et al., "Preemptive Virtual Clock: A Flexible, Efficient, and Cost-effective QOS Scheme for Networks-on-Chip," MICRO 2009.

On-Chip Networks: Some Questions
- Is a mesh/torus the best topology?
- How do you design the router? (Energy efficient, low latency.) What is the routing algorithm? Is it adaptive or deterministic?
- How does the router prioritize between different threads'/applications' packets?
  - How does the OS/application communicate the importance of applications to the routers?
  - How does the router provide bandwidth/latency guarantees to applications that need them?
- Where do you place different resources (e.g., memory controllers)?
- How do you maintain cache coherence?
- How does the OS scheduler place tasks?
- How is data placed in distributed caches?

What is Computer Architecture?
- The science and art of designing, selecting, and interconnecting hardware components and designing the hardware/software interface to create a computing system that meets functional, performance, energy consumption, cost, and other specific goals.
- We will soon distinguish between the terms architecture, microarchitecture, and implementation.

Why Study Computer Architecture?

Moore's Law
- Moore, "Cramming more components onto integrated circuits," Electronics Magazine, 1965.

Why Study Computer Architecture?
- Make computers faster, cheaper, smaller, and more reliable, by exploiting advances and changes in the underlying technology/circuits
- Enable new applications: life-like 3D visualization 20 years ago? Virtual reality? Personal genomics?
- Adapt the computing stack to technology trends: innovation in software is built on trends and changes in computer architecture (> 50% performance improvement per year)
- Understand why computers work the way they do

An Example: Multi-Core Systems
(Figure: die photo of a multi-core chip with four cores, per-core L2 caches, a shared L3 cache, a DRAM interface/memory controller, and DRAM banks. Die photo credit: AMD Barcelona.)

Unexpected Slowdowns in Multi-Core
- A low-priority "memory performance hog" (Core 1) can severely slow down a high-priority application (Core 0)
- Moscibroda and Mutlu, "Memory Performance Attacks: Denial of Memory Service in Multi-Core Systems," USENIX Security 2007.

Why the Disparity in Slowdowns?
(Figure: on a multi-core chip, one core running matlab and another running gcc, each with a private L2 cache, share the interconnect, the DRAM memory controller, and DRAM banks 0-3; the shared DRAM memory system is where the unfairness arises.)

DRAM Bank Operation
- A DRAM bank is a two-dimensional array of rows and columns. The row decoder first brings the addressed row into the row buffer; the column mux then selects the addressed column's data.
- Example access sequence: (Row 0, Column 0), (Row 0, Column 1), (Row 0, Column 85), (Row 1, Column 0). Starting from an empty row buffer, the first access opens Row 0; the next two accesses to Row 0 are row hits; the access to Row 1 is a row conflict. (A toy model of this sequence follows below.)
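To make the hit/conflict distinction concrete, here is a toy C model of a single bank with one row buffer, replaying the access sequence above. This is not real DRAM timing: the latency numbers are placeholders chosen only so that a conflict visibly costs more than a hit.

    /* Toy model of one DRAM bank with a single row buffer. Latencies are
     * placeholder values, not real DRAM timing parameters. */
    #include <stdio.h>

    #define ROW_HIT_LATENCY      15   /* placeholder: column access only          */
    #define ROW_CONFLICT_LATENCY 45   /* placeholder: close row + activate + col  */

    int main(void) {
        int open_row = -1;                      /* -1 means the row buffer is empty */
        int rows[] = {0, 0, 0, 1};
        int cols[] = {0, 1, 85, 0};
        int total = 0;

        for (int i = 0; i < 4; i++) {
            int lat;
            if (rows[i] == open_row) {
                lat = ROW_HIT_LATENCY;          /* row already in the row buffer */
                printf("(Row %d, Col %d): HIT, %d cycles\n", rows[i], cols[i], lat);
            } else {
                lat = ROW_CONFLICT_LATENCY;     /* must open the requested row   */
                printf("(Row %d, Col %d): EMPTY/CONFLICT, %d cycles\n",
                       rows[i], cols[i], lat);
                open_row = rows[i];
            }
            total += lat;
        }
        printf("total: %d cycles\n", total);
        return 0;
    }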
DRAM Controllers
- A row-conflict memory access takes significantly longer than a row-hit access
- Current controllers take advantage of the row buffer
- Commonly used scheduling policy: FR-FCFS [Rixner et al., ISCA 2000]
  (1) Row-hit first: service row-hit memory accesses first
  (2) Oldest first: then service older accesses first
- This scheduling policy aims to maximize DRAM throughput
- Rixner et al., "Memory Access Scheduling," ISCA 2000.
- Zuravleff and Robinson, "Controller for a synchronous DRAM ...," US Patent 5,630,096, May 1997.

The Problem
- Multiple threads share the DRAM controller
- DRAM controllers are designed to maximize DRAM throughput
- DRAM scheduling policies are thread-unfair
  - Row-hit first: unfairly prioritizes threads with high row-buffer locality (threads that keep accessing the same row)
  - Oldest-first: unfairly prioritizes memory-intensive threads
- The DRAM controller is vulnerable to denial-of-service attacks: one can write programs that exploit this unfairness

Fundamental Concepts

What is Computer Architecture?
- The science and art of designing, selecting, and interconnecting hardware components and designing the hardware/software interface to create a computing system that meets functional, performance, energy consumption, cost, and other specific goals.
- Traditional definition: "The term architecture is used here to describe the attributes of a system as seen by the programmer, i.e., the conceptual structure and functional behavior as distinct from the organization of the data flow and controls, the logic design, and the physical implementation." Gene Amdahl, IBM Journal of R&D, April 1964.

Levels of Transformation
- The simple view of the stack: Problem -> Algorithm -> Program -> ISA -> Microarchitecture -> Circuits -> Electrons
- The expanded view: Problem -> Algorithm -> User Programs -> Runtime System (VM, OS, MM) -> ISA -> Microarchitecture -> Circuits/Technology -> Electrons

Levels of Transformation, continued
- ISA: the agreed-upon interface between software and hardware
  - What software (the compiler) assumes, and what hardware promises
  - What the software writer needs to know to write system/user programs
- Microarchitecture: a specific implementation of an ISA; not visible to the software
- Microprocessor: ISA + uarch + circuits
- "Architecture" = ISA + microarchitecture

ISA vs. Microarchitecture
- What is part of the ISA vs. the uarch?
  - Gas pedal: the interface for "acceleration"
  - Internals of the engine: implement "acceleration"
- The implementation (uarch) can vary as long as it satisfies the specification (ISA)
  - Add instruction vs. adder implementation: bit-serial, ripple-carry, and carry-lookahead adders all implement the same add (sketched below)
  - The x86 ISA has many implementations: 286, 386, 486, Pentium, Pentium Pro, ...
- The uarch usually changes faster than the ISA
  - There are few ISAs (x86, SPARC, MIPS, Alpha) but many uarchs. Why?
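A minimal sketch of the "add instruction vs. adder implementation" point: the ISA specifies what ADD computes, while the microarchitecture chooses how. Below, a bit-by-bit ripple-carry model and the machine's native '+' (standing in for a faster implementation such as carry-lookahead) produce identical architectural results; the function names are illustrative.

    /* The ISA defines *what* ADD computes; the microarchitecture chooses *how*.
     * A ripple-carry adder modeled bit by bit and the native '+' give identical
     * architectural results, differing only in implementation. */
    #include <stdint.h>
    #include <stdio.h>
    #include <assert.h>

    /* One full adder per bit position, carries rippling from LSB to MSB. */
    uint32_t ripple_carry_add(uint32_t a, uint32_t b) {
        uint32_t sum = 0, carry = 0;
        for (int i = 0; i < 32; i++) {
            uint32_t ai = (a >> i) & 1, bi = (b >> i) & 1;
            uint32_t s = ai ^ bi ^ carry;
            carry = (ai & bi) | (ai & carry) | (bi & carry);
            sum |= s << i;
        }
        return sum;
    }

    int main(void) {
        uint32_t a = 1234567, b = 7654321;
        /* Same ISA-level result; only the "how" differs. */
        assert(ripple_carry_add(a, b) == a + b);
        printf("%u\n", ripple_carry_add(a, b));
        return 0;
    }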
ISA
- Instructions: opcodes, addressing modes, data types; instruction types and formats; registers, condition codes
- Memory: address space, addressability, alignment; virtual memory management
- Call and interrupt/exception handling
- Access control, priority/privilege
- I/O
- Task management
- Power and thermal management
- Multithreading support, multiprocessor support

Microarchitecture
- Implementation of the ISA under specific design constraints and goals
- Anything done in hardware without exposure to software:
  - Pipelining
  - In-order versus out-of-order instruction execution
  - Memory access scheduling policy
  - Speculative execution
  - Superscalar processing (multiple instruction issue?)
  - Clock gating
  - Caching? Levels, size, associativity, replacement policy
  - Prefetching?
  - Voltage/frequency scaling?
  - Error correction?

Design Point
- A set of design considerations and their importance; it leads to tradeoffs in both the ISA and the uarch
- Considerations:
  - Cost
  - Performance
  - Maximum power consumption
  - Energy consumption (battery life)
  - Availability
  - Reliability and correctness (or is it?)
  - Time to market
- The design point is determined by the "Problem" space (the application space)

Tradeoffs: Soul of Computer Architecture
- ISA-level tradeoffs
- Uarch-level tradeoffs
- System- and task-level tradeoffs: how to divide the labor between hardware and software

ISA-level Tradeoffs: Semantic Gap
- Where to place the ISA? The semantic gap
  - Closer to a high-level language (HLL) or closer to hardware control signals? -> complex vs. simple instructions
  - RISC vs. CISC vs. HLL machines
    - FFT, QUICKSORT, POLY, FP instructions?
    - VAX INDEX instruction (array access with bounds checking)
- Tradeoffs:
  - Simple compiler and complex hardware vs. complex compiler and simple hardware
    - Caveat: translation (indirection) can change the tradeoff!
  - Burden of backward compatibility
  - Performance? Optimization opportunity: who (the compiler or the hardware) puts more effort into optimization? (Example: the VAX INDEX instruction)
  - Instruction size, code size

x86: Small Semantic Gap: String Operations
- REP MOVS DEST SRC
- How many instructions does this take in Alpha?

Small Semantic Gap Examples in VAX
- FIND FIRST: find the first set bit in a bit field; helps OS resource-allocation operations
- SAVE CONTEXT, LOAD CONTEXT: special context-switching instructions
- INSQUE, REMQUE: operations on a doubly linked list
- INDEX: array access with bounds checking
- String operations: compare strings, find substrings, ...
- Cyclic redundancy check instruction
- EDITPC: implements editing functions to display fixed-format output
- Digital Equipment Corp., "VAX-11/780 Architecture Handbook," 1977-78.

Small versus Large Semantic Gap
- CISC vs. RISC
  - Complex instruction set computer -> complex instructions; initially motivated by "not good enough" code generation
  - Reduced instruction set computer -> simple instructions; John Cocke, mid 1970s, IBM 801; goal: enable better compiler control and optimization
- RISC motivated by:
  - Memory stalls (no work done in a complex instruction when there is a memory stall?) When is this correct?
  - Simplifying the hardware -> lower cost, higher frequency
  - Enabling the compiler to optimize the code better: find fine-grained parallelism to reduce stalls

Small versus Large Semantic Gap, continued
- John Cocke's RISC (large semantic gap) concept: the compiler generates the control signals, i.e., open microcode
- Advantages of a small semantic gap (complex instructions):
  + Denser encoding -> smaller code size -> saves off-chip bandwidth, better cache hit rate (better packing of instructions)
  + Simpler compiler
- Disadvantages:
  -- Larger chunks of work -> the compiler has less opportunity to optimize
  -- More complex hardware -> translation to control signals and optimization need to be done by hardware
- Read: Colwell et al., "Instruction Sets and Beyond: Computers, Complexity, and Controversy," IEEE Computer 1985.
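Returning to the REP MOVS question a few slides back, here is the same tradeoff in miniature, sketched in C rather than in either ISA (an illustration, not actual compiler output): the small-semantic-gap view packs the whole copy into one opaque primitive, while the large-semantic-gap view exposes a loop of simple operations that the compiler can inspect and optimize.

    /* Illustration of the semantic-gap tradeoff (not real compiler output). */
    #include <stddef.h>
    #include <string.h>

    /* "Large semantic gap" view: an explicit loop of simple operations,
     * fully visible to (and optimizable by) the compiler. */
    void copy_simple(char *dst, const char *src, size_t n) {
        for (size_t i = 0; i < n; i++)   /* each iteration ~ load, store,   */
            dst[i] = src[i];             /* increment, compare, branch      */
    }

    /* "Small semantic gap" view: the whole copy is one opaque primitive,
     * much as a single REP MOVS-style instruction would be; dense to encode,
     * but the optimization burden shifts to the hardware (or library). */
    void copy_complex(char *dst, const char *src, size_t n) {
        memcpy(dst, src, n);
    }

    int main(void) {
        char a[8] = "hello", b[8], c[8];
        copy_simple(b, a, 6);
        copy_complex(c, a, 6);
        return 0;
    }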
ISA-level Tradeoffs: Instruction Length
- Fixed length: all instructions have the same length
  + Easier to decode a single instruction in hardware
  + Easier to decode multiple instructions concurrently
  -- Wasted bits in instructions (why is this bad?)
  -- Harder-to-extend ISA (how to add new instructions?)
- Variable length: instruction lengths differ (determined by opcode and sub-opcode)
  + Compact encoding (why is this good?)
    (Intel 432: Huffman encoding, sort of; 6- to 321-bit instructions. How?)
  -- More logic to decode a single instruction
  -- Harder to decode multiple instructions concurrently
- Tradeoffs:
  - Code size (memory space, bandwidth, latency) vs. hardware complexity
  - ISA extensibility and expressiveness
  - Performance? Smaller code vs. imperfect decode

ISA-level Tradeoffs: Uniform Decode
- Uniform decode: the same bits in each instruction correspond to the same meaning
  - The opcode is always in the same location; ditto the operand specifiers, immediate values, ...
  - Many "RISC" ISAs: Alpha, MIPS, SPARC
  + Easier decode, simpler hardware
  + Enables parallelism: generate the target address before knowing the instruction is a branch
  -- Restricts the instruction format (fewer instructions?) or wastes space
- Non-uniform decode
  - E.g., the opcode can be in the 1st through 7th byte in x86
  + More compact and powerful instruction format
  -- More complex decode logic (e.g., more logic to speculatively generate the branch target)

x86 vs. Alpha Instruction Formats
(Figure: the variable-length x86 instruction format vs. the fixed 32-bit Alpha instruction format.)

ISA-level Tradeoffs: Number of Registers
- Affects:
  - The number of bits used for encoding a register address
  - The number of values kept in fast storage (the register file)
  - (uarch) The size, access time, and power consumption of the register file
- A large number of registers:
  + Enables better register allocation (and optimizations) by the compiler -> fewer saves/restores
  -- Larger instruction size
  -- Larger register file size
  -- (In superscalar processors) more complex dependency-check logic

ISA-level Tradeoffs: Addressing Modes
- An addressing mode specifies how to obtain an operand of an instruction
  - Register
  - Immediate
  - Memory (displacement, register indirect, indexed, absolute, memory indirect, autoincrement, autodecrement, ...)
- More modes:
  + Help better support programming constructs (arrays, pointer-based accesses)
  -- Make it harder for the architect to design
  -- Too many choices for the compiler? Many ways of doing the same thing complicate compiler design
- Read: Wulf, "Compilers and Computer Architecture"

x86 vs. Alpha Instruction Formats
(Figure repeated, followed by x86 addressing examples: register, register indirect, absolute, register + displacement, indexed (base + index), and scaled (base + index*4) addressing.)

Other ISA-level Tradeoffs
- Load/store vs. memory/memory architectures
- Condition codes vs. condition registers vs. compare-and-test
- Hardware interlocks vs. software-guaranteed interlocking
- VLIW vs. single instruction vs. SIMD
- 0-, 1-, 2-, and 3-address machines (stack, accumulator, 2- or 3-operand)
- Precise vs. imprecise exceptions
- Virtual memory vs. not
- Aligned vs. unaligned access
- Supported data types
- Software- vs. hardware-managed page fault handling
- Granularity of atomicity
- Cache coherence (hardware vs. software)
- ...

Programmer vs. (Micro)architect
- Many ISA features are designed to aid programmers, but they complicate the hardware designer's job
- Virtual memory vs. overlay programming: should the programmer be concerned about the size of code blocks?
- Unaligned memory access: without it, the compiler/programmer needs to align data (see the sketch below)
- Transactional memory?
- VLIW vs. SIMD? Superscalar execution vs. SIMD?
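A small illustration of the data-alignment point above: when the hardware supports only aligned accesses, the compiler pads data structures so that each field falls at a naturally aligned offset, trading memory space for simpler hardware. The exact layout is ABI- and compiler-dependent; the values in the comments are what typical 32/64-bit ABIs produce, stated as an assumption rather than a guarantee.

    /* Why alignment concerns the compiler/programmer: padding is inserted so
     * each field sits at a naturally aligned offset. Layout is ABI-dependent. */
    #include <stdio.h>
    #include <stddef.h>

    struct padded {
        char  c;     /* 1 byte, then typically 3 bytes of padding ...          */
        int   i;     /* ... so this 4-byte field starts at offset 4            */
        char  d;     /* 1 byte, then trailing padding rounds up the size       */
    };

    int main(void) {
        printf("sizeof(struct padded) = %zu\n", sizeof(struct padded));
        printf("offsetof(i)           = %zu\n", offsetof(struct padded, i));
        /* On most common ABIs this prints 12 and 4, not the 6 and 1 that a
         * fully packed layout would give. */
        return 0;
    }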
Transactional Memory
- Lock-based version (Thread 1 and Thread 2 both execute the same code):

    enqueue(Q, v) {
        Node_t node = malloc(...);
        node->val = v;
        node->next = NULL;
        acquire(lock);
        if (Q->tail) Q->tail->next = node;
        else Q->head = node;
        Q->tail = node;
        release(lock);
    }

- Transactional version (no locks):

    begin-transaction
    ...
    enqueue(Q, v);   // no locks
    ...
    end-transaction

Transactional Memory, continued
- A transaction is executed atomically: ALL or NONE
- If there is a data conflict between two transactions, only one of them completes; the other is rolled back
  - Both write to the same location, or
  - One reads from a location the other writes

ISA-level Tradeoff: Supporting TM
- Still under research
- Pros:
  - Could make programming with threads easier
  - Could improve parallel program performance vs. locks. Why?
- Cons:
  - What if it does not pan out? All future microarchitectures might have to support the new instructions (for backward-compatibility reasons)
  - Complexity?
- How does the architect decide whether or not to support TM in the ISA? (How do you evaluate the whole stack?)

ISA-level Tradeoffs: Instruction Pointer
- Do we need an instruction pointer in the ISA?
  - Yes: control-driven, sequential execution
    - An instruction is executed when the IP points to it
    - The IP changes sequentially (except for control-flow instructions)
  - No: data-driven, parallel execution
    - An instruction is executed when all of its operand values are available (dataflow)
- Tradeoffs: MANY high-level ones
  - Ease of programming (for average programmers)?
  - Ease of compilation?
  - Performance: extraction of parallelism?
  - Hardware complexity?

The Von Neumann Model
(Figure: MEMORY with memory address and memory data registers; a PROCESSING UNIT with an ALU and temporary registers; a CONTROL UNIT with the instruction pointer and instruction register; INPUT and OUTPUT units.)

The Von Neumann Model, continued
- Stored-program computer (instructions in memory)
- One instruction at a time
- Sequential execution
- Unified memory: the interpretation of a stored value depends on the control signals
- All major ISAs today use this model
- Underneath (at the uarch level), the execution model is very different: multiple instructions at a time, out-of-order execution, separate instruction and data caches

Fundamentals of Uarch Performance Tradeoffs
- An ideal machine would have perfect:
  - Instruction supply: zero-cycle latency (no cache misses), no branch mispredicts, no fetch breaks
  - Data path (functional units): perfect data flow (register/memory dependencies), zero-cycle interconnect (operand communication), enough functional units, zero-latency compute?
  - Data supply: zero-cycle latency, infinite capacity, zero cost
- We will examine all of these throughout the course (especially data supply)

How to Evaluate Performance Tradeoffs

    Execution time = (# of instructions / program) x (# of cycles / instruction) x (time / cycle)

- Instructions/program is determined by the algorithm, the program, the compiler, and the ISA
- Cycles/instruction is determined by the ISA and the microarchitecture
- Time/cycle is determined by the microarchitecture, the logic design, the circuit implementation, and the technology

Improving Performance
- Reducing instructions/program
- Reducing cycles/instruction (CPI)
- Reducing time/cycle (clock period)
(A worked example of the execution-time equation follows.)
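The following worked example plugs made-up numbers (not measurements of any real machine) into the execution-time equation, showing how improving any one of the three factors reduces execution time.

    /* Illustrative numbers only:
     *   exec time = (instructions/program) x (cycles/instruction) x (time/cycle) */
    #include <stdio.h>

    double exec_time(double insts, double cpi, double clock_period_ns) {
        return insts * cpi * clock_period_ns * 1e-9;   /* result in seconds */
    }

    int main(void) {
        double base       = exec_time(1e9, 2.0, 0.5);  /* 1B insts, CPI 2, 2 GHz */
        double better_cpi = exec_time(1e9, 1.5, 0.5);  /* better microarchitecture */
        double fewer_inst = exec_time(8e8, 2.0, 0.5);  /* better algorithm or ISA  */
        printf("baseline:    %.3f s\n", base);         /* 1.000 s */
        printf("lower CPI:   %.3f s\n", better_cpi);   /* 0.750 s */
        printf("fewer insts: %.3f s\n", fewer_inst);   /* 0.800 s */
        return 0;
    }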
Improving Performance (Reducing Execution Time)
- Reducing instructions/program
  - More efficient algorithms and programs
  - A better ISA?
- Reducing cycles/instruction (CPI)
  - Better microarchitecture design
    - Execute multiple instructions at the same time
    - Reduce the latency of instructions (a 1-cycle vs. a 100-cycle memory access)
- Reducing time/cycle (clock period)
  - Technology scaling
  - Pipelining

Improving Performance: Semantic Gap
- Reducing instructions/program
  - Complex instructions: small code size (+)
  - Simple instructions: large code size (--)
- Reducing cycles/instruction (CPI)
  - Complex instructions can take more cycles to execute (--)
    - REP MOVS
    - How about ADD with condition-code setting? (See the sketch below.)
  - Simple instructions can take fewer cycles to execute (+)
- Reducing time/cycle (clock period)
  - Does instruction complexity affect this? It depends.
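As a final sketch of the "ADD with condition-code setting" question: even a simple add does extra work per instruction if the ISA requires it to produce condition codes alongside the sum. The flag definitions below follow the common N/Z/C/V two's-complement conventions; the struct and function names are illustrative and do not correspond to any particular ISA's encoding.

    /* Sketch of what an ADD that sets condition codes asks of the hardware:
     * besides the 32-bit sum, the usual N/Z/C/V flags must be produced. */
    #include <stdint.h>
    #include <stdio.h>

    typedef struct { uint32_t sum; int n, z, c, v; } AddFlags;

    AddFlags add_setcc(uint32_t a, uint32_t b) {
        AddFlags r;
        uint64_t wide = (uint64_t)a + (uint64_t)b;
        r.sum = (uint32_t)wide;
        r.n = (r.sum >> 31) & 1;                        /* negative (sign bit)  */
        r.z = (r.sum == 0);                             /* zero                 */
        r.c = (wide >> 32) & 1;                         /* unsigned carry out   */
        r.v = (~(a ^ b) & (a ^ r.sum)) >> 31 & 1;       /* signed overflow      */
        return r;
    }

    int main(void) {
        AddFlags r = add_setcc(0x7FFFFFFF, 1);          /* signed overflow case */
        printf("sum=%u n=%d z=%d c=%d v=%d\n", r.sum, r.n, r.z, r.c, r.v);
        return 0;
    }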