Computer Organization (EENG 3710)
Instructor: Partha Guturu, EE Department

Quick Recap on our respective roles
• Who is responsible for your learning of Computer Organization?
• Some aphorisms on teaching philosophy:
  – “I do not teach my pupils. I provide conditions in which they can learn.” – Albert Einstein
  – “I hear and I forget. I see and I remember. I do and I understand.” – Chinese proverb
  – “Give a man a fish and you feed him for a day. Teach a man to fish and you feed him for a lifetime.” – Chinese proverb

What does the data say?
• Even if you are fascinating, people only remember the first 15 minutes of what you say.
[Figure: percent of students paying attention versus time from start of lecture (0 to 60 minutes); attention falls off sharply after the opening minutes.]

What’s so good about our approach?
• Learner-centric approach
• Life-long learning
• Proactive versus reactive

Course Objectives: What do you need to learn?
• High-level view of a computer
• Different types:
  – Desktops/laptops
  – Servers
  – Embedded systems
• Anatomy of a computer and our focus here
• Computer organization versus architecture
• Instruction sets
• Different components of a computer and their interworking
• Computer performance issues

Different Applications & Requirements
• Desktop applications
  – Emphasis on performance of integer and floating-point (FP) data types
  – Little regard for program (code) size and power consumption
• Server applications
  – Database, file system, web applications, time-sharing
  – FP performance is much less important than integer and character-string performance
  – Little regard for program (code) size and power consumption
• Embedded applications
  – Digital signal processors (DSPs), media processors, control
  – High value placed on program size and power consumption
    • Less memory is cheaper and lower power
    • Reduce chip costs: FP instructions may be optional

Embedded Computers in Your Car
[Figure: the many embedded processors in a modern car.]

Relative levels of demand for different computer types
[Figure.]

Anatomy of a Computer & Our Focus
• Layers of abstraction, from application down to transistors:
    Application (e.g., browser)
    Operating System / Compiler / Assembler   (software)
    Instruction Set Architecture
    Processor / Memory / I/O system           (hardware)
    Datapath and Control
    Digital Design
    Circuit Design
    Transistors
• Coordination of many levels (layers) of abstraction

Why a Compiler?
“In Paris they simply stared when I spoke to them in French; I never did succeed in making those idiots understand their own language.” – Mark Twain, The Innocents Abroad, 1869

Why a High-Level Language?
• Ease of thinking and coding in an English/math-like language
• Enhanced productivity because of the ease of debugging and validation
• Maintainability
• Target-independent development
• Availability of optimizing compilers

A Dissection to Reveal Finer Details
• High-level language program (e.g., C):
    temp = v[k];
    v[k] = v[k+1];
    v[k+1] = temp;
• Compiler translates it into an assembly language program (e.g., MIPS):
    lw $t0, 0($2)
    lw $t1, 4($2)
    sw $t1, 0($2)
    sw $t0, 4($2)
• Assembler translates that into a machine language program (MIPS): the corresponding 32-bit binary words, e.g.
    0000 1010 1100 0101 1001 1111 0110 1000
    1100 0101 1010 0000 0110 1000 1111 1001
    1010 0000 0101 1100 1111 1001 1000 0110
    0101 1100 0000 1010 1000 0110 1001 1111
• Machine interpretation: hardware architecture description (Logisim, VHDL, Verilog, etc.)
• Architecture implementation: logic circuit description (Logisim, etc.)

What is in a Computer?
• Components:
  – Processor (datapath, control)
  – Input (mouse, keyboard)
  – Output (display, printer)
  – Memory (cache (SRAM), main memory (DRAM))
• Our primary focus: the processor (datapath and control)
  – Implemented using millions of transistors
  – Impossible to understand by looking at each transistor
  – We need abstraction!
5 Major Components of a Computer
• Processor: Control (the “brain”) and Datapath (the “brawn”)
• Memory: where programs and data live when running
• Devices:
  – Input: keyboard, mouse; disk (where programs and data live when not running)
  – Output: display, printer
[Figures: personal computer; processor chip (CPU) components; motherboard layout.]

Dramatic Changes in Technology
• Processor
  – Logic capacity: about 30%-35% per year
  – Clock rate: about 30% per year
• Memory (DRAM: Dynamic Random Access Memory)
  – Capacity: about 60% per year (4x every 3 years)
  – Memory speed: about 10% per year
  – Cost per bit: improves about 25% per year
• Disk
  – Capacity: about 60%-100% per year
  – Speed: about 10% per year
• Network bandwidth
  – 10 Mb, then (10 years later) 100 Mb, then (5 years later) 1 Gb

Growth Capacity of DRAM Chips
• K = 1024 (2^10)
• In recent years the growth rate has slowed to 2x every 2 years
[Figure: number of transistors on an IC versus year.]

Dramatic Changes in Technology: Moore’s Law
• Gordon Moore (Intel cofounder): 2x transistors per chip every 1.5 years, called “Moore’s Law”

The Underlying Technologies
  Year  Technology                     Relative Performance/Unit Cost
  1951  Vacuum tube                    1
  1965  Transistor                     35
  1975  Integrated circuit (IC)        900
  1995  Very large scale IC (VLSI)     2,400,000
  2005  Ultra VLSI                     6,200,000,000

What if technology in the automobile industry advanced at the same rate?
“If the automobile had followed the same development cycle as the computer, a Rolls-Royce would today cost $100, get a million miles per gallon, and explode once a year, killing everyone inside.” – Robert X. Cringely, InfoWorld magazine

Complex Chip Manufacturing Process Enabled by Technological Breakthroughs
[Figure.]

Computer Architecture versus Computer Organization
“Computer architecture is the abstract image of a computing system that is seen by a machine language (or assembly language) programmer, including the instruction set, memory address modes, processor registers, and address and data formats; whereas the computer organization is a lower level, more concrete, description of the system that involves how the constituent parts of the system are interconnected and how they interoperate in order to implement the architectural specification.” – Phillip A. Laplante (2001), Dictionary of Computer Science, Engineering, and Technology
-> The organization can change without changing the architecture (e.g., a 64-bit architecture implemented on a 16-bit machine using 4 clock cycles).

Course Outline (topic, number of weeks)
• Introduction to Computer Organization (1)
• Computer Instructions (2)
• Arithmetic and Logic Unit (1)
• Performance Analysis (1)
• Data Path and Control (2)
• Performance Enhancement with Pipelining (2)
• Memory Hierarchy and Virtual Memory Concepts (2)
• Storage, Networks, and Other Peripherals (1)
• Engineering Design with Microcomputers (2)

Course Objectives
• Know the different software and hardware components of a digital computer.
• Comprehend how the different components of a digital computer collaborate to produce the end result in an application development process.
• Apply principles of logic design to digital computer design.
• Analyze a digital computer and decompose it into modules and lower-level logical blocks involving both combinational and sequential circuit elements.
• Synthesize various components of a computer’s Arithmetic Logic Unit, Control Units, and Data Paths.
• Understand and assess (evaluate) CPU performance, and learn methods to enhance computer performance.
Language of the Computer
• We will have a quick look at the MIPS language.
• MIPS: not to be confused with “million instructions per second.”
• MIPS: Microprocessor without Interlocked Pipelined Stages, a RISC (Reduced Instruction Set Computer) processor developed by MIPS Technologies.
• By 1990, 1 out of 3 RISC processors used MIPS; the architecture is also called MIPS.
• Cisco routers, the Nintendo 64, the Sony PlayStation and PlayStation 2, etc., use MIPS designs.

Why bother to learn assembly language?
• “The difference between mediocre and star programmers is that star programmers understand assembly language, whether or not they use it on a daily basis.”
• “Assembly language is the language of the computer itself. To be a programmer without ever learning assembly language is like being a professional race car driver without understanding how your carburetor works. To be a truly successful programmer, you have to understand exactly what the computer sees when it is running a program. Nothing short of learning assembly language will do that for you. Assembly language is often seen as a black art among today’s programmers, with those knowing this art being more productive, more knowledgeable, and better paid, even if they primarily work in other languages.”

Basic Instruction Format
Three instruction formats, each 32 bits wide:
  R: Opcode (bits 31-26) | rs (25-21) | rt (20-16) | rd (15-11) | shamt (10-6) | funct (5-0)
  I: Opcode (bits 31-26) | rs (25-21) | rt (20-16) | Immediate (15-0)
  J: Opcode (bits 31-26) | Memory Address (25-0)

Now Guess the MIPS Architecture
• How many registers?
• How big a memory could be supported?
• What is the memory word size?
• How do we handle data in RAM?
Non-architectural design/implementation issues that vary from design to design:
• Roles of registers

Instruction Set Architecture (ISA)
• Instructions: the words of a computer’s language are called instructions.
• Instruction set: the vocabulary of a computer’s language is called its instruction set.
• Instruction Set Architecture (ISA): the set of instructions a particular CPU implements is an Instruction Set Architecture.

The Instruction Set Architecture (ISA)
• The ISA is the interface description separating the software (above) from the hardware (below).

ISA Sales
[Figure.]

ISA: CISC vs. RISC
• The early trend was to add more and more instructions to new CPUs to do elaborate operations.
  – CISC (Complex Instruction Set Computer)
  – The primary goal of CISC architecture is to complete a task in as few lines of assembly as possible.
  – The VAX architecture had an instruction to multiply polynomials!
• RISC philosophy (Cocke at IBM, Patterson, Hennessy; 1980s)
  – Reduced Instruction Set Computer
  – Keep the instruction set small and simple; this makes it easier to build fast hardware.
  – Let software do complicated operations by composing simpler ones.

The MIPS ISA
• Instruction categories:
  – Load/store
  – Computational
  – Jump and branch
  – Floating point (coprocessor)
  – Memory management
  – Special
• Registers: R0-R31, plus PC, HI, and LO
• 3 instruction formats, all 32 bits wide:
  – R: OP | rs | rt | rd | sa | funct
  – I: OP | rs | rt | immediate
  – J: OP | jump target

MIPS Registers and their Roles
  Name      Number  Use                              Preserved across a call?
  $zero     0       The constant value 0             N.A.
  $at       1       Assembler temporary              No
  $v0-$v1   2-3     Values for function results      No
  $a0-$a3   4-7     Arguments (expression
                    evaluation)                      No
  $t0-$t7   8-15    Temporaries                      No
  $s0-$s7   16-23   Saved temporaries                Yes
  $t8-$t9   24-25   Temporaries                      No
  $k0-$k1   26-27   Reserved for OS kernel           No
  $gp       28      Global pointer                   Yes
  $sp       29      Stack pointer                    Yes
  $fp       30      Frame pointer                    Yes
  $ra       31      Return address                   Yes
($gp, $sp, $fp, and $ra are all preserved across a call.)

Simple operations
• Compute f = (a+b) - (c-d), assuming these variables are in some $s registers.
• Memory operations: the base-register concept.
• Why is a multiplication factor of 4 required for the array index n? Answer: memory addresses in MIPS are byte addresses, and each word occupies 4 bytes.

Quick Recap: Compilers
• A C++ program to sort 10 numbers is fed (as input) to a C++ compiler: Machine-X code that translates any C++ program into an assembly program for Machine X.
• The output, a Machine-X assembly program to sort 10 numbers, is fed to an assembler: Machine-X code that translates any Machine-X assembly program into Machine-X machine code.
• The output is Machine-X machine code to sort 10 numbers; given 10 numbers as input, running it on Machine X outputs the sorted list of 10 numbers.
• The two translation steps in this chain can be merged into a single step.

Quick Recap: Shortcut Compilers
• A C++ compiler can instead be Machine-X code that translates any C++ program directly into Machine-X machine code, skipping the separate assembly step.

Quick Recap: Bootstrapping
• Write, in C++, a compiler that translates any C++ program into Machine-Y code.
• Compile it with the existing C++ compiler on Machine X; the output is Machine-X code that translates any C++ program into Machine-Y code.
• Run the same C++ source through this output program; the result is Machine-Y code for the C++ compiler (i.e., code to translate any C++ program into Machine-Y code).
• This can be installed and run on Machine Y; thus you have a compiler for Machine Y.

Chapter 2: MIPS Programming

Quick Recap: MIPS
• MIPS language: expansion of the acronym
• Number of registers, and the architecture in general
• The 3 instruction formats and their various fields (e.g., rs, rt, rd, shamt, etc.)
• Now we proceed with:
  – MIPS assembly instruction formats
  – Coding simple problems and translating them into MIPS machine code

Simple Statements
• C code: d = (a + b) - (c + d)
• Machine code assuming a, b, c, and d are in MIPS registers
• Machine code assuming a, b, c, and d are in consecutive memory locations from a given starting address (use lw, sw)

Loops and Branches
• Develop assembly code for a typical C code fragment to add 100 numbers:

    // Read 100 numbers into an array A
    sum = 0;
    for (i = 0; i < 100; i++) {
        sum = sum + A[i];
    }
    // Print sum

Procedure Calls
• Caller and callee: who should preserve which registers?
• Leaf and recursive procedure examples for explaining the conventions, and the jal and jr instructions.

SPIM (Courtesy: Prof. Jerry Breecher, Clark University; Appendix A)

MIPS Simulation
• SPIM is a simulator:
  – Reads a MIPS assembly language program.
  – Simulates each instruction.
  – Displays values of registers and memory.
  – Supports breakpoints and single stepping.
  – Provides simple I/O for interacting with the user.

SPIM Versions
• spim is the command-line version.
• xspim is the X-Windows version (Unix workstations).
• There is also a Windows version. You can use this at home; it can be downloaded from http://www.cs.wisc.edu/~larus/spim.html

Resources on the Web
• There’s a very good SPIM tutorial at http://chortle.ccsu.edu/AssemblyTutorial/Chapter-09/ass09_1.html
• In fact, there’s a tutorial for a good chunk of the ISA portion of this course at http://chortle.ccsu.edu/AssemblyTutorial/tutorialContents.html
• Here are a couple of other good references: Patterson_Hennessy_AppendixA.pdf and http://babbage.clarku.edu/~jbreecher/comp_org/labs/Introduction_To_SPIM.pdf

SPIM Program
• MIPS assembly language.
• Must include a label “main”; this will be called by the SPIM startup code (which allows you to have command-line arguments).
• Can include named memory locations, constants, and string literals in a “data segment”.
General Layout
• Data definitions start with the .data directive.
• Code definitions start with the .text directive.
  – “Text” is the traditional name for the memory that holds a program.
  – You usually have a bunch of subroutine definitions and a “main”.

Simple Example

    .data               # data memory
    foo: .word 0        # 32-bit variable
    .text               # program memory
    .align 2            # word alignment
    .globl main         # main is global
    main: lw $a0, foo

Data Definitions
• You can define variables/constants with:
  – .word: defines 32-bit quantities.
  – .byte: defines 8-bit quantities.
  – .asciiz: zero-delimited ASCII strings.
  – .space: allocates some bytes.

Data Examples

    .data
    prompt: .asciiz "Hello World\n"
    msg:    .asciiz "The answer is "
    x:      .space 4
    y:      .word 4
    str:    .space 100

MIPS: Software Conventions for Registers
[Figure: register-convention table.]

Simple I/O
• SPIM provides some simple I/O using the “syscall” instruction.
• The specific I/O done depends on some registers:
  – You set $v0 to indicate the operation.
  – Parameters go in $a0, $a1.

I/O Functions
• A system call is used to communicate with the system and do simple I/O:
  – Load the function code into $v0.
  – Load the arguments (if any) into registers $a0, $a1, or $f12 (for floating point).
  – Do: syscall
  – Results are returned in registers $v0 or $f0.
Example: Reading an Int

    li $v0, 5        # indicate we want function 5 (read integer)
    syscall          # upon return from the syscall, $v0 has the integer
                     # typed by a human in the SPIM console
    # Now print that same integer
    move $a0, $v0    # get the number to be printed into the argument register
    li $v0, 1        # indicate we're doing a write-integer
    syscall

Printing a String

    .data
    msg: .asciiz "SPIM IS FUN"
    .text
    .globl main
    main:
        li $v0, 4
        la $a0, msg
        syscall
        jr $ra

A Typical MIPS READ and WRITE Program

    .data 0x10000000
    A: .word 0, 0
    .text
    main:
        la $t0, A
        li $v0, 5          # set up the call-code register for read
        syscall
        sw $v0, ($t0)
        li $v0, 5          # set up the call-code register for read
        syscall
        sw $v0, 4($t0)
        lw $t1, 0($t0)
        lw $t2, 4($t0)
        add $t3, $t1, $t2
        li $v0, 1          # set up the call-code register for print
        move $a0, $t3
        syscall

A C Program with Read and Sum Loops

    int main (int argc, char **argv)   // older versions of C accept: void main()
    {
        int A[5], i, sum;
        for (i = 0; i <= 4; i++) {
            scanf("%d", &A[i]);
        }
        sum = 0;
        for (i = 0; i <= 4; i++) {
            sum = sum + A[i];
        }
        printf("The sum of 5 numbers is: %d\n", sum);
    }

The MIPS Equivalent of the C Program with Read and Sum Loops

    .data
    A:   .word 0           # create space for the first word A[0], initialized to 0
         .space 16         # create space for 4 more words, A[1]..A[4]
    msg: .asciiz "The sum of 5 numbers is: "
    .text
    main:
        la $t0, A          # store in $t0 the address of A[0], the first of five words
        li $t1, 0          # store in $t1 the initial value of the loop variable
        li $t2, 4          # store in $t2 the final value of the loop variable
        li $t3, 0          # initialize $t3, the byte offset that increments by 4 with each word read
    loop:
        add $t4, $t0, $t3  # put in $t4 the address of the next word
        li $v0, 5          # initialize $v0 for the read syscall
        syscall
        sw $v0, ($t4)      # put the newly read integer into the word location pointed to by $t4
        addi $t3, $t3, 4   # increment $t3 by 4 for the next word address
        addi $t1, $t1, 1
        ble $t1, $t2, loop

        li $t1, 0          # do the same initialization for the identical loop at addloop
        li $t2, 4
        li $t3, 0
        li $s0, 0
    addloop:
        add $t4, $t0, $t3
        lw $t5, ($t4)      # read the integer at the address in $t4 into $t5
        add $s0, $s0, $t5  # update the partial sum in $s0 by adding the new integer
        addi $t3, $t3, 4
        addi $t1, $t1, 1
        ble $t1, $t2, addloop

        li $v0, 4          # make the system ready to print a string
        la $a0, msg        # load the starting address (msg) of the string into $a0, the argument register
        syscall
        li $v0, 1          # make the system ready to print the integer (sum)
        move $a0, $s0
        syscall

SPIM Subroutines
• The stack is set up for you; just use $sp.
• You can view the stack in the data window.
• main is called as a subroutine (have it return using jr $ra).
• For now, don’t worry about details, but the next few pages give some excellent examples of how stacks work.

Why Are Stacks So Great?
• Some machines provide a memory stack as part of the architecture (e.g., VAX).
• Sometimes stacks are implemented via software convention (e.g., MIPS).

MIPS Function Calling Conventions
• A typical stack frame ($sp points at the frame):

    fact:
        addiu $sp, $sp, -32
        sw $ra, 20($sp)
        ...
        sw $s0, 4($sp)
        ...
        lw $ra, 20($sp)
        addiu $sp, $sp, 32
        jr $ra

C Program for a Leaf Procedure

    void main()
    {
        int e, f, g, h, result;
        scanf("%d", &e);
        scanf("%d", &f);
        scanf("%d", &g);
        scanf("%d", &h);
        result = leaf_procedure(e, f, g, h);
        printf("Result = %d\n", result);
    }

    int leaf_procedure(int e, int f, int g, int h)
    {
        int res;
        int temp1, temp2;   // not required (only to keep it close to the MIPS code)
        temp1 = e + f;
        temp2 = g + h;
        res = temp1 - temp2;
        return res;
    }

Page 1: MIPS code for the main (calling) program of leaf_procedure

    .data
    e: .word 0
    f: .word 0
    g: .word 0
    h: .word 0
    .text
    main:
        la $t0, e            # load the address of e into $t0
        li $t1, 0            # set the loop iteration variable to 0
    readLoop:
        sll $t2, $t1, 2      # since each word is 4 bytes long, multiply the loop variable by 4
        add $t3, $t0, $t2    # the first time through the loop, $t3 holds the address of e
        li $v0, 5            # prepare for the read syscall
        syscall
        sw $v0, ($t3)        # the newly read value goes to e, f, g, or h depending on
                             # whether the loop variable $t1 contains 0, 1, 2, or 3,
                             # that is, whether $t2 is 0, 4, 8, or 12
        addi $t1, $t1, 1
        xori $t2, $t1, 4     # you can destroy the original $t2 value because you
                             # recompute it from $t1 at the beginning of the loop
        bne $t2, $zero, readLoop  # you haven't read all 4 integers; go back to readLoop

Page 2: MIPS Code Continuation for the main of leaf_procedure

    # Reading complete. Make preparations for leaf_procedure, which computes
    # (e+f)-(g+h), by placing the arguments in the argument registers.
        lw $a0, 0($t0)       # load e into $a0
        lw $a1, 4($t0)       # load f into $a1
        lw $a2, 8($t0)       # load g into $a2
        lw $a3, 12($t0)      # load h into $a3
        jal leaf_procedure   # this instruction stores the address of the next
                             # instruction (the return address, that is, the address
                             # of the instruction at the "print" label) in $ra and
                             # jumps to the label leaf_procedure
    print:
        move $t0, $v0
        li $v0, 1            # prepare for print
        move $a0, $t0
        syscall
        j last

Page 3: MIPS Code for leaf_procedure itself

    leaf_procedure:
        addi $sp, $sp, -12   # make space on the stack for 3 integers
        sw $s0, 0($sp)       # save the contents of the registers you plan to use
                             # temporarily in this procedure on the stack, so the
                             # original values can be restored before returning to
                             # the calling program
        sw $s1, 4($sp)
        sw $s2, 8($sp)
        add $s1, $a0, $a1    # add e and f (in $a0 and $a1, respectively); put the sum in $s1
        add $s2, $a2, $a3    # add g and h (in $a2 and $a3, respectively); put the sum in $s2
        sub $s0, $s1, $s2    # subtract g+h (in $s2) from e+f (in $s1); put the result in $s0
        # Make preparations for returning to the calling procedure (main in this case)
        move $v0, $s0        # put the computed value into the return-value register
        lw $s0, 0($sp)       # restore the values on the stack to the original registers
        lw $s1, 4($sp)
        lw $s2, 8($sp)
        addi $sp, $sp, 12    # pop the stack frame
        jr $ra               # jump to the location pointed to by $ra ("print", in our case)
    last:                    # the main program will stop here, as there is no valid instruction here
MIPS Function Calling Conventions: a recursive example

    main() {
        printf("The factorial of 10 is %d\n", fact(10));
    }

    int fact (int n)
    {
        if (n <= 1) return 1;
        return n * fact(n - 1);
    }

    .text
    .globl main
    main:
        subu $sp, $sp, 32    # stack frame size is 32 bytes
        sw $ra, 20($sp)      # save return address
        li $a0, 10           # load argument (10) in $a0
        jal fact             # call fact
        la $a0, LC           # load string address in $a0
        move $a1, $v0        # load fact result in $a1
        jal printf           # call printf
        lw $ra, 20($sp)      # restore $ra
        addu $sp, $sp, 32    # pop the stack
        jr $ra               # exit()
    .data
    LC: .asciiz "The factorial of 10 is %d\n"

    .text
    fact:
        subu $sp, $sp, 8     # stack frame is 8 bytes
        sw $ra, 8($sp)       # save return address
        sw $a0, 4($sp)       # save argument (n)
        subu $a0, $a0, 1     # compute n-1
        bgtz $a0, L2         # if n-1 > 0 (i.e., n > 1) go to L2
        li $v0, 1
        j L1                 # return(1)
    L2: jal fact             # new argument (n-1) is already in $a0; call fact
        lw $a0, 4($sp)       # load n
        mul $v0, $v0, $a0    # fact(n-1) * n
    L1: lw $ra, 8($sp)       # restore $ra
        addu $sp, $sp, 8     # pop the stack
        jr $ra               # return, result in $v0

[A sequence of figures steps through the stack contents during the recursive calls to fact.]

Sample SPIM Programs (on the web)
• multiply.s: a multiplication subroutine based on repeated addition, and a test program that calls it.
  http://babbage.clarku.edu/~jbreecher/comp_org/labs/multiply.s
• fact.s: computes factorials using the multiply subroutine.
  http://babbage.clarku.edu/~jbreecher/comp_org/labs/fact.s
• sort.s: the sorting program from the text.
  http://babbage.clarku.edu/~jbreecher/comp_org/labs/sort.s
• strcpy.s: the strcpy subroutine and test code.
  http://babbage.clarku.edu/~jbreecher/comp_org/labs/strcpy.s

EENG 3710 Computer Organization
Arithmetic 3: ALU Design – Integer Addition, Multiplication & Division
Adapted from David H. Albonesi. Copyright David H. Albonesi and the University of Rochester. E. J. Kim

Integer multiplication
• Pencil-and-paper binary multiplication:

          1000    (multiplicand)
        x 1001    (multiplier)
          ----
          1000    (partial products)
         00000
        000000
     + 1000000
     ---------
       1001000    (product)

• Key elements:
  – Examine the multiplier bits from right to left
  – Shift the multiplicand left one position each step
  – Simplification: each step, add the multiplicand to the running product only when the current multiplier bit is 1

• Walkthrough (multiplicand 1000, multiplier 1001), condensed from the build slides:
  – Initialize the product register to 0: running product 00000000.
  – Multiplier bit = 1: add the multiplicand: 00000000 + 1000 = 00001000. Shift the multiplicand left: 10000.
  – Multiplier bit = 0: do nothing. Shift the multiplicand left: 100000.
  – Multiplier bit = 0: do nothing. Shift the multiplicand left: 1000000.
  – Multiplier bit = 1: add the multiplicand: 00001000 + 1000000 = 01001000 (product).

Integer multiplication: 64-bit hardware implementation
• The multiplicand is loaded into the right half of a 64-bit multiplicand register.
• The product register is initialized to all 0’s.
• Repeat the following 32 times:
  – If the multiplier register LSB = 1, add the multiplicand to the product
  – Shift the multiplicand one bit left
  – Shift the multiplier one bit right
[Figures: multiplication hardware; algorithm flowchart.]

Integer multiplication: refinement
• Drawback: half of the 64-bit multiplicand register is zeros, so half of the 64-bit adder is adding zeros.
• Solution: shift the product right instead of the multiplicand left; only the left half of the product register is added to the multiplicand.
• Walkthrough (multiplicand 1000, multiplier 1001), condensed from the build slides:
  – Running product 00000000. Bit = 1: add 1000 to the left half: 10000000; shift the product right: 01000000.
  – Bit = 0: do nothing; shift right: 00100000.
  – Bit = 0: do nothing; shift right: 00010000.
  – Bit = 1: add 1000 to the left half: 10010000; shift right: 01001000 (product).
[Figure: hardware implementation.]

• Final improvement: use the right half of the product register to hold the multiplier.
[Figure: final algorithm.]

Multiplication of signed numbers
• Naive approach:
  – Convert to positive numbers
  – Multiply
  – Negate the product if the multiplier and multiplicand signs differ
  – Slow, and requires extra hardware

Booth’s algorithm
• Invented for speed:
  – Shifting was faster than addition at the time
  – Objective: reduce the number of additions required
• Fortunately, it works for signed numbers as well.
• Basic idea: the additions from a string of 1’s in the multiplier can be converted to a single addition and a single subtraction.
  – Example: multiplying by 00111110 is equivalent to multiplying by 01000000 (requiring an addition for that one bit position) and subtracting 00000010 (a subtraction for that bit position), instead of additions for each of the five 1 bits.
• The rules: starting from right to left, look at two adjacent bits of the multiplier (place a zero at the right of the LSB to start):
  – If the bits are 00, do nothing
  – If the bits are 10, subtract the multiplicand from the product (beginning of a string of 1’s)
  – If the bits are 01, add the multiplicand to the product (end of a string of 1’s)
  – If the bits are 11, do nothing (middle of a string of 1’s)

• Example (Booth recoding): multiplicand 0010, multiplier 1101.
  – The register holds product + multiplier with an extra bit position appended at the right: 00001101 0.
  – Step 1: the low two bits are 10, so subtract the multiplicand by adding its two’s complement, 1110, to the left half: 0000 + 1110
11110110 1 +0010 00001011 0 +1110 11110101 1 (multiplicand) • Example Booth recoding 0010 (multiplicand) 00001101 0 +1110 11110110 1 +0010 00001011 0 +1110 111110101 (product) • Integer division Pencil and paper binary division (divisor) 1000 01001000 (dividend) Integer division • Pencil and paper binary division 1 (divisor) 1000 01001000 - 1000 0001 (dividend) (partial remainder) • Integer division Pencil and paper binary division 1 (divisor) 1000 01001000 - 1000 00010 (dividend) • Integer division Pencil and paper binary division 10 (divisor) 1000 01001000 - 1000 00010 (dividend) • Integer division Pencil and paper binary division 10 (divisor) 1000 01001000 - 1000 000100 (dividend) • Integer division Pencil and paper binary division 100 (divisor) 1000 01001000 - 1000 000100 (dividend) • Integer division Pencil and paper binary division 100 (divisor) 1000 01001000 - 1000 0001000 (dividend) • Integer division Pencil and paper binary division 1001 (divisor) 1000 01001000 - 1000 0001000 - 0001000 0000000 (quotient) (dividend) (remainder) • Integer division Pencil and paper binary division 1001 (divisor) 1000 01001000 - 1000 0001000 - 0001000 0000000 (quotient) (dividend) (remainder) • Steps in hardware – Shift the dividend left one position – Subtract the divisor from the left half of the dividend – If result positive, shift left a 1 into the quotient – Else, shift left a 0 into the quotient, and repeat from • Initial state (divisor) 1000 Integer division 01001000 (dividend) 0000 (quotient) • Integer division Shift dividend left one position (divisor) 1000 10010000 (dividend) 0000 (quotient) Integer division • Subtract divisor from left half of dividend (divisor) 1000 10010000 (dividend) - 1000 (keep these 00010000bits) 0000 (quotient) • Integer division Result positive, left shift a 1 into the quotient (divisor) 1000 10010000 - 1000 00010000 (dividend) 0001 (quotient) Integer division • Shift partial remainder left one position (divisor) 1000 10010000 - 1000 
00100000 (dividend) 0001 (quotient) Integer division • Subtract divisor from left half of partial remainder (divisor) 1000 10010000 - 1000 00100000 - 1000 10100000 (dividend) 0001 (quotient) Integer division • Result negative, left shift 0 into quotient (divisor) 1000 10010000 - 1000 00100000 - 1000 10100000 (dividend) 0010 (quotient) Integer division • Restore original partial remainder (how?) (divisor) 1000 10010000 - 1000 00100000 - 1000 10100000 00100000 (dividend) 0010 (quotient) Integer division • Shift partial remainder left one position (divisor) 1000 10010000 - 1000 00100000 - 1000 10100000 01000000 (dividend) 0010 (quotient) Integer division • Subtract divisor from left half of partial remainder (divisor) 1000 10010000 - 1000 00100000 - 1000 10100000 01000000 - 1000 11000000 (dividend) 0010 (quotient) Integer division • Result negative, left shift 0 into quotient (divisor) 1000 10010000 - 1000 00100000 - 1000 10100000 01000000 - 1000 11000000 (dividend) 0100 (quotient) Integer division • Restore original partial remainder (divisor) 1000 10010000 - 1000 00100000 - 1000 10100000 01000000 - 1000 11000000 01000000 (dividend) 0100 (quotient) Integer division • Shift partial remainder left one position (divisor) 1000 10010000 - 1000 00100000 - 1000 10100000 01000000 - 1000 11000000 10000000 (dividend) 0100 (quotient) Integer division • Subtract divisor from left half of partial remainder (divisor) 1000 10010000 - 1000 00100000 - 1000 10100000 01000000 - 1000 11000000 10000000 - 1000 00000000 (dividend) 0100 (quotient) Integer division • Result positive, left shift 1 into quotient (divisor) 1000 10010000 - 1000 00100000 - 1000 10100000 01000000 - 1000 11000000 10000000 - 1000 00000000 (remainder) (dividend) 1001 (quotient) Integer division • Hardware implementation What operations do we do here? 
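The restoring-division procedure just traced (shift left, trial subtraction, shift a quotient bit in, restore on a negative result) and the Booth multiplication from the previous section can both be modeled in a few lines of software. This is a behavioral sketch in Python, not the register-level hardware; the function names and the default 4-bit width are illustrative, chosen to match the worked examples:

```python
def booth_multiply(multiplicand, multiplier, bits=4):
    """Booth's algorithm on `bits`-bit two's-complement operands.
    Models a register holding [product | multiplier | extra bit]."""
    mask = (1 << bits) - 1
    width = 2 * bits + 1                      # product + multiplier + extra bit
    prod = (multiplier & mask) << 1           # place a 0 to the right of the LSB
    for _ in range(bits):
        pair = prod & 0b11                    # current multiplier bit, extra bit
        if pair == 0b10:                      # beginning of a string of 1's
            prod -= (multiplicand & mask) << (bits + 1)
        elif pair == 0b01:                    # end of a string of 1's
            prod += (multiplicand & mask) << (bits + 1)
        prod &= (1 << width) - 1
        msb = prod >> (width - 1)             # arithmetic shift right one place
        prod = (prod >> 1) | (msb << (width - 1))
    result = prod >> 1                        # drop the extra bit
    if result >> (2 * bits - 1):              # reinterpret as signed
        result -= 1 << (2 * bits)
    return result

def restoring_divide(dividend, divisor, bits=4):
    """Unsigned restoring division: 2*bits-bit dividend, bits-bit divisor.
    Returns (quotient, remainder)."""
    rem, quot = dividend, 0
    for _ in range(bits):
        rem <<= 1                             # shift dividend/remainder left
        trial = (rem >> bits) - divisor       # subtract divisor from left half
        if trial >= 0:
            rem = (trial << bits) | (rem & ((1 << bits) - 1))
            quot = (quot << 1) | 1            # shift a 1 into the quotient
        else:
            quot <<= 1                        # restore: left half kept unchanged
    return quot, rem >> bits
```

Here `booth_multiply(0b0010, 0b1101)` reproduces the slide example (2 x -3 = -6), and `restoring_divide(0b01001000, 0b1000)` returns the quotient 1001 and remainder 0000 from the pencil-and-paper walkthrough.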
Load dividend here initially
Integer and floating point revisited
[Datapath figure: PC and instruction memory feed an integer register file (with HI and LO), an integer ALU, an integer multiplier, a floating-point register file, a flt pt adder, a flt pt multiplier, and data memory]
• Integer ALU handles add, subtract, logical, set less than, equality test, and effective address calculations
• Integer multiplier handles multiply and divide – HI and LO registers hold the result of integer multiply and divide
Floating point representation
• Floating point (fp) numbers represent reals – Example reals: 5.6745, 1.23 x 10^-19, 345.67 x 10^6 – Floats and doubles in C
• Fp numbers are in signed magnitude representation of the form (-1)^S x M x B^E where – S is the sign bit (0=positive, 1=negative) – M is the mantissa (also called the significand) – B is the base (implied) – E is the exponent
• Example: 22.34 x 10^-4 has S=0, M=22.34, B=10, E=-4
Floating point representation
• Fp numbers are normalized in that M has only one digit to the left of the "decimal point" – Between 1.0 and 9.9999… in decimal – Between 1.0 and 1.1111… in binary – Simplifies fp arithmetic and comparisons – Normalized: 5.6745 x 10^2, 1.23 x 10^-19 – Not normalized: 345.67 x 10^6, 22.34 x 10^-4, 0.123 x 10^-45 – In binary format, normalized numbers are of the form (-1)^S x 1.M x B^E • Leading 1 in 1.M is implied
Floating point representation tradeoffs
• Representing a wide enough range of fp values with enough precision ("decimal" places) given limited bits: (-1)^S x 1.M x B^E must fit in 32 bits (S, E, and M fields) – More E bits increases the range – More M bits increases the precision – A larger B increases the range but decreases the precision – The distance between consecutive fp numbers is not constant!
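The last point, that the gap between consecutive fp numbers is not constant, is easy to check empirically. A short Python sketch using only the standard library (math.ulp requires Python 3.9 or later):

```python
import math

# Each binade [2^E, 2^(E+1)) holds the same count of representable
# values, so the spacing between neighboring fp numbers doubles
# with every increase in the exponent.
for x in (1.0, 2.0, 1024.0, 2.0**52):
    print(x, math.ulp(x))   # ulp = distance to the next larger double
```

At 1.0 the gap is 2^-52, while at 2^52 consecutive doubles are a full integer apart.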
… … BE BE+1 BE+2 Floating point representation tradeoffs • Allowing for fast arithmetic implementations – Different exponents requires lining up the significands; larger base increases the probability of equal exponents • Handling very small and very large numbers representable negative numbers (S=1) exponent overflow 0 exponent underflow representable positive numbers (S=0) exponent overflow • Sorting/comparing fp numbers fp numbers can be treated as integers for sorting and comparing purposes if E is placed to the left (-1)S x 1.M x BE S E bigger E is bigger number M If E’s are same, bigger M is bigger number • Example – 3.67 x 106 > 6.34 x 10-4 > 1.23 x 10-4 Biased exponent notation • 111…111 represents the most positive E and 000…000 represents the most negative E for sorting/comparing purposes • To get correct signed value for E, need to subtract a bias of 011…111 • Biased fp numbers are of the form (-1)S x 1.M x BE-bias • Example: assume 8 bits for E – Bias is 01111111 = 127 – Largest E represented by 11111111 which is 255 – 127 = 128 – Smallest E represented by 00000000 which is 0 – 127 = -127 IEEE 754 floating point standard • Created in 1985 in response to the wide range of fp formats used by different companies – Has greatly improved portability of scientific applications • B=2 S 1 bit E M 8 bits 23 bits • Single precision (sp) format (“float” in C) S 1 bit E M 11 bits 52 bits • Double precision (dp) format (“double” in C) IEEE 754 floating point standard • Exponent bias is 127 for sp and 1023 for dp • Fp numbers are of the form (-1)S x 1.M x 2E-bias – 1 in mantissa and base of 2 are implied – Sp form is (-1)S x 1.M22 M21 …M0 x 2E-127 and value is (-1)S x (1+(M22x2-1) +(M21x2-2)+…+(M0x2-23)) x 2E-127 • Sp example 1 00000001 1000…00 0M S E – Number is –1.1000…000 x 21-127=-1.1 x 2-126=1.763 x 10-38 IEEE 754 floating point standard • Denormalized numbers – Allow for representation of very small numbers representable negative numbers exponent overflow 0 
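The single-precision layout and bias described above can be verified by unpacking a float's raw bits. A hedged Python sketch (`decode_sp` is an illustrative helper name; the field handling follows the IEEE 754 definitions, including the E=0 denormalized case):

```python
import struct

def decode_sp(x):
    """Unpack an IEEE 754 single-precision value into its S, E, M fields
    and rebuild it as (-1)^S x 1.M x 2^(E-127)."""
    (word,) = struct.unpack(">I", struct.pack(">f", x))
    s = word >> 31
    e = (word >> 23) & 0xFF
    m = word & 0x7FFFFF
    if e == 0:                     # denormalized: (-1)^S x 0.M x 2^-126
        value = (-1) ** s * (m / 2**23) * 2.0**-126
    elif e == 0xFF:                # infinities and NaNs
        value = float("nan") if m else (-1) ** s * float("inf")
    else:                          # normalized: the leading 1 is implied
        value = (-1) ** s * (1 + m / 2**23) * 2.0 ** (e - 127)
    return s, e, m, value

# The slide's bit pattern S=1, E=00000001, M=100...0 is
# -1.1 (binary) x 2^(1-127) = -1.5 x 2^-126
print(decode_sp(-1.5 * 2.0**-126))
```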
exponent underflow representable positive numbers – Identified by E=0 and a non-zero M exponent overflow – Format is (-1)S x 0.M x 2-(bias-1) – Smallest positive dp denormalized number is 0.00…01 x 2-1022 = 2-1074 smallest positive dp normalized number is 1.0 x 21023 – Hardware support is complex, and so often handled by software Floating point addition • Make both exponents the same – Find the number with the smaller one – Shift its mantissa to the right until the exponents match • Must include the implicit 1 (1.M) • Add the mantissas • Choose the largest exponent • Put the result in normalized form – Shift mantissa left or right until in form 1.M – Adjust exponent accordingly • Handle overflow or underflow if necessary • Round • Renormalize if necessary if rounding produced an unnormalized result Floating point addition • Algorithm • Floating point addition example Initial values 1 S E 0 S 00000001 00000011 E 0000…0110 0M 0100…0011 1M • Floating point addition example Identify smaller E and calculate E difference 1 S 00000001 E 0000…0110 0M difference = 2 0 S 00000011 E 0100…0011 1M • Floating point addition example Shift smaller M right by E difference 1 S E 0 S 00000001 00000011 E 0100…0001 1M 0100…0011 1M • Floating point addition example Add mantissas 1 S 00000001 E 0 S 00000011 E 0100…0001 1M 0100…0011 1M -0.0100…00011 + 1.0100…00111 = 1.0000…00100 0 S E 0000…0010 0M • Floating point addition example Choose larger exponent for result 1 S E 0 S 00000011 E 0 S 00000001 00000011 E 0100…0001 1M 0100…0011 1M 0000…0010 0M • Floating point addition example Final answer (already normalized) 1 S E 0 S 00000011 E 0 S 00000001 00000011 E 0100…0001 1M 0100…0011 1M 0000…0010 0M • Floating point addition Hardware design determine smaller exponent • Floating point addition Hardware design shift mantissa of smaller number right by exponent difference • Floating point addition Hardware design add mantissas • Floating point addition Hardware design normalize result by 
shifting mantissa of result and adjusting larger exponent • Floating point addition Hardware design round result • Floating point addition Hardware design renormalize if necessary Floating point multiply • Add the exponents and subtract the bias from the sum – Example: (5+127) + (2+127) – 127 = 7+127 • Multiply the mantissas • Put the result in normalized form – Shift mantissa left or right until in form 1.M – Adjust exponent accordingly • Handle overflow or underflow if necessary • Round • Renormalize if necessary if rounding produced an unnormalized result • Set S=0 if signs of both operands the same, S=1 otherwise Floating point multiply • Algorithm • Floating point multiply example Initial values 1 S E 0 S 00000111 11100000 E 1000…0000 0M 1000…0000 0M -1.5 x 27-127 1.5 x 2224-127 • Floating point multiply example Add exponents 1 S E 0 S 00000111 11100000 E 1000…0000 0M 1000…0000 0M 00000111 + 11100000 = 11100111 (231) -1.5 x 27-127 1.5 x 2224-127 • Floating point multiply example Subtract bias 1 S E 0 S 00000111 11100000 E 1000…0000 0M 1000…0000 0M -1.5 x 27-127 1.5 x 2224-127 11100111 – 01111111 = 11100111 + 10000001 = 01101000 (104) 01101000 S E M • Floating point multiply example Multiply the mantissas 1 S E 0 S 00000111 11100000 E 1000…0000 0M 1000…0000 0M 1.1000… x 1.1000… = 10.01000… 01101000 S E M -1.5 x 27-127 1.5 x 2224-127 • Floating point multiply example Normalize by shifting 1.M right one position and adding one to E 1 S E 0 S 00000111 11100000 E 1000…0000 0M 1000…0000 0M 10.01000… => 1.001000… 01101001 S E 001000… M -1.5 x 27-127 1.5 x 2224-127 • Floating point multiply example Set S=1 since signs are different 1 S E 0 S 00000111 11100000 E 1 01101001 S E 1000…0000 0M 1000…0000 0M 001000… M -1.5 x 27-127 1.5 x 2224-127 -1.125 x 2105-127 Rounding • Fp arithmetic operations may produce a result with more digits than can be represented in 1.M • The result must be rounded to fit into the available number of M positions • Tradeoff of hardware cost 
(keeping extra bits) and speed versus accumulated rounding error Rounding • Examples from decimal multiplication • Renormalization is required after rounding in c) Rounding • Examples from binary multiplication (assuming two bits for M) 1.01 x 1.01 = 1.1001 (1.25 x 1.25 = 1.5625) 1.11 x 1.01 = 10.0011 (1.75 x 1.25 = 2.1875) Result has twice as many bits 1.10 x 1.01 = 1.111 May require renormalization after rounding (1.5 x 1.25 = 1.875) Rounding • In binary, an extra bit of 1 is halfway in between the two possible representations 1.001 (1.125) is halfway between 1.00 (1) and 1.01 (1.25) 1.101 (1.625) is halfway between 1.10 (1.5) and 1.11 (1.75) • IEEE 754 rounding modes Truncate – Remove all digits beyond those supported – 1.00100 -> 1.00 • Round up to the next value – 1.00100 -> 1.01 • Round down to the previous value – 1.00100 -> 1.00 – Differs from Truncate for negative numbers • Round-to-nearest-even – Rounds to the even value (the one with an LSB of 0) – 1.00100 -> 1.00 Implementing rounding • A product may have twice as many digits as the multiplier and multiplicand – 1.11 x 1.01 = 10.0011 • For round-to-nearest-even, we need to know LSB of final rounded result – The value to the right of the LSB (round bit) 1.00101 rounds to 1.01 – Whether any other digits to the right of the round Roundare bit 1’s Sticky bit = 0 OR 1 = 1 digit • The sticky bit is the OR of these digits 1.00100 rounds to 1.00 Implementing rounding • The product before normalization may have 2 digits to the left of the binary point bb.bbbb… • Product register format needs to be 1b.bbbb… • Two possible cases 01.bbbb… r sssss… r sssss… Need this as a result bit! Implementing rounding • The guard bit (g) becomes part of the unrounded result when the MSB = 0 • g, r, and s suffice for rounding addition as well MIPS floating point registers floating point registers 31 f0 f1 0 . . . 
control/status register 31 FCR31 0 implementation/revision 31register FCR0 0 f30 f31 • 32 32-bit FPRs – 16 64-bit registers (32-bit register pairs) for dp floating point – Software conventions for their usage (as with GPRs) • Control/status register – Status of compare operations, sets rounding mode, MIPS floating point instruction overview • Operate on single and double precision operands • Computation – Add, sub, multiply, divide, sqrt, absolute value, negate – Multiply-add, multiply-subtract • Added as part of MIPS-IV revision of ISA specification • Load and store – Integer register read for EA calculation – Data to be loaded or stored in fp register file MIPS R10000 arithmetic units EA calc P C instruction memory integer register file integer ALU integer ALU + multiplier flt pt adder flt pt register file flt pt multiplier flt pt divider flt pt sq root data memory • MIPS R10000 arithmetic units Integer ALU + shifter – All instructions take one cycle • Integer ALU + multiplier – Booth’s algorithm for multiplication (5-10 cycles) – Non-restoring division (34-67 cycles) • Floating point adder – Carry propagate (2 cycles) • Floating point multiplier (3 cycles) – Booth’s algorithm • Floating point divider (12-19 cycles) • Floating point square root unit Processor Design - 1 Adopted from notes by David A. Patterson, John Kubiatowicz, and others. Copyright © 2001 University of California at Berkeley 203 Outline of Slides • • • • • • Overview Design a processor: step-by-step Requirements of the instruction set Components and clocking Assembling an adequate Data path Controlling the data path 204 Chapter 5.1 - Processor Design 1 The Big Picture: Where Are We Now? • The five classic components of a computer Processor Input Control Memory Datapath • Today’s topic: design a single cycle processor Output machine design Arithmetic Chapter 5.1 - Processor Design 1 205 inst. 
set design technology
The CPU
° Processor (CPU): the active part of the computer, which does all the work (data manipulation and decision-making)
° Datapath: the portion of the processor that contains the hardware necessary to perform the operations required by the processor (the brawn)
° Control: the portion of the processor (also in hardware) that tells the datapath what needs to be done (the brain)
Big Picture: The Performance Perspective
• Performance of a machine is determined by: – Instruction count – Clock cycle time – Clock cycles per instruction (CPI)
• Processor design (datapath and control) will determine: – Clock cycle time – Clock cycles per instruction
• What we will do today: – Single-cycle processor: • Advantage: one clock cycle per instruction • Disadvantage: long cycle time
How to Design a Processor: Step-by-Step
1. Analyze instruction set => datapath requirements – the meaning of each instruction is given by the register transfers – the datapath must include storage elements for the ISA registers • possibly more – the datapath must support each register transfer
2. Select a set of datapath components and establish a clocking methodology
3. Assemble a datapath meeting the requirements
4. Analyze the implementation of each instruction to determine the settings of the control points that effect the register transfer
5. Assemble the control logic
The MIPS Instruction Formats
• All MIPS instructions are 32 bits long.
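Because every instruction is a fixed 32-bit word, decoding reduces to shifts and masks over the instruction fields. A Python sketch (`decode_mips` is an illustrative name; the field boundaries follow the standard MIPS R/I/J layouts):

```python
def decode_mips(word):
    """Split a 32-bit MIPS instruction word into its possible fields.
    Which fields are meaningful depends on the format (R, I, or J)."""
    return {
        "op":     (word >> 26) & 0x3F,   # bits 31..26, all formats
        "rs":     (word >> 21) & 0x1F,   # bits 25..21, R and I formats
        "rt":     (word >> 16) & 0x1F,   # bits 20..16, R and I formats
        "rd":     (word >> 11) & 0x1F,   # bits 15..11, R format only
        "shamt":  (word >> 6)  & 0x1F,   # bits 10..6,  R format only
        "funct":  word & 0x3F,           # bits 5..0,   R format only
        "imm16":  word & 0xFFFF,         # bits 15..0,  I format
        "target": word & 0x3FFFFFF,      # bits 25..0,  J format
    }

# addu $3, $1, $2 encodes as op=0, rs=1, rt=2, rd=3, shamt=0, funct=0x21
fields = decode_mips((1 << 21) | (2 << 16) | (3 << 11) | 0x21)
```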
The three instruction formats: 31 26 21 16 11 6 – R-type op rs rt rd shamt – I-type 31 6 bits 26 op – J-type 31 6 bits 26 op 209 5 bits 21 rs 5 bits 5 bits 16 5 bits 5 bits 0 funct 6 bits 0 immediate rt 5 bits 16 bits target address 6 bits 26 bits • The different fields are: – op: operation of the instruction – rs, rt, rd: the source and destination register specifiers – shamt: shift amount – funct: selects the variant of the operation in the “op” field – address / immediate: address offset or immediate value – target address: target address of the jump instruction Chapter 5.1 - Processor Design 1 0 Step 1a: The MIPS-lite Subset for Today • ADD and SUB 31 26 op - addU rd, rs, rt - subU rd, rs, rt 21 rs 6 bits 16 rt 5 bits 5 bits 11 6 0 rd shamt funct 5 bits 5 bits 6 bits • OR Immediate: - ori rt, rs, imm16 31 op • LOAD / STORE Word - lw rt, rs, imm16 - sw rt, rs, imm16 26 - beq rs, rt, imm16 31 rs 5 bits 6 bits 5 bits 16 bits 16 rt 5 bits 21 rs 0 immediate 5 bits 21 26 op 16 rt 5 bits 26 op 6 bits • BRANCH: 210 rs 6 bits 31 21 0 immediate 16 bits 16 rt 5 bits 0 Chapter 5.1 - Processor Design 1 immediate 16 bits Logical Register Transfers • Register Transfer Logic gives the meaning of the instructions • All start by fetching the instruction op | rs | rt | rd | shamt | funct = MEM[ PC ] op | rs | rt | Imm16 211 = MEM[ PC ] inst Register Transfers ADDU R[rd] R[rs] + R[rt]; PC PC + 4 SUBU R[rd] R[rs] – R[rt]; PC PC + 4 ORi R[rt] R[rs] | zero_ext(Imm16); LOAD R[rt] MEM[ R[rs] + sign_ext(Imm16)]; PC PC + 4 STORE MEM[ R[rs] + sign_ext(Imm16) ] R[rt]; PC PC + 4 BEQ if ( R[rs] == R[rt] ) then PC PC + 4 + sign_ext(Imm16)] || 00 else PC PC + 4 PC PC + 4 Chapter 5.1 - Processor Design 1 Step 1: Requirements of the Instruction Set • Memory – instruction & data • Registers (32 x 32) – read RS – read RT – Write RT or RD • PC • Extender • Add and Sub register or extended immediate • Add 4 or extended immediate to PC 212 Chapter 5.1 - Processor Design 1 Step 2: Components of the 
Datapath • Combinational Elements • Storage Elements –Clocking methodology 213 Chapter 5.1 - Processor Design 1 Combinational Logic Elements (Basic Building Blocks) OP CarryIn Adder A Y B 32 Sum A 32 Carry 32 B MUX ALU Adder 32 32 Result 32 Select A ALU B 214 32 32 MUX • 32 Chapter 5.1 - Processor Design 1 Storage Element: Register File • Register File consists of 32 registers: – Two 32-bit output busses: busA and busB – One 32-bit input bus: busW • Register is selected by: – RA (number) selects the register to put on busA (data) – RB (number) selects the register to put on busB (data) – RW (number) selects the register to be written via busW (data) when Write Enable is 1 RW RA RB Write Enable 5 5 5 busW 32 Clk • Clock input (CLK) – The CLK input is a factor ONLY during write operation – During read operation, behaves as a combinational logic block: • RA or RB valid busA or busB valid after “access time.” 215 busA 32 32 32-bit Registers busB 32 Chapter 5.1 - Processor Design 1 Storage Element: Idealized Memory • Memory (idealized) – One input bus: Data In – One output bus: Data Out • Memory word is selected by: – Address selects the word to put on Data Out – Write Enable = 1: address selects the memory word to be written via the Data In bus Write Enable Address Data In 32 Clk • Clock input (CLK) – The CLK input is a factor ONLY during write operation – During read operation, behaves as a combinational logic block: • Address valid Data Out valid after “access time.” 216 Chapter 5.1 - Processor Design 1 DataOut 32 Memory Hierarchy (Ch. 
7) • Want a single main memory, both large and fast • Problem 1: large memories are slow while fast memories are small • Example: MIPS registers (fast, but few) • Solution: mix of memories provides illusion of single large, fast memory • Cache: a small, fast memory; Holds a copy of part of a larger, slower memory • Imem, Dmem are really separate caches memories 217 Chapter 5.1 - Processor Design 1 Digression: Sequential Logic, Clocking • Combinational circuits: no memory • Output depends only on the inputs • Sequential circuits: have memory • How to ensure memory element is updated neither too soon, nor too late? • Recall hardware multiplier • Product/multiplier register is the writable memory element • Gate propagation delay means ALU result takes time to stabilize; Delay varies with inputs • Must wait until result stable before write to product/multiplier register else get garbage • How to be certain ALU output is stable? 218 Chapter 5.1 - Processor Design 1 Adding a Clock to a Circuit • Clock: free running signal with fixed cycle time (clock period) high (1) low (0) period rising edge falling edge ° Clock determines when to write memory element • level-triggered - store clock high (low) • edge-triggered - store only on clock edge ° We will use negative (falling) edge-triggered methodology 219 Chapter 5.1 - Processor Design 1 Role of Clock in MIPS Processors • single-cycle machine: does everything in one clock cycle • instruction execution = up to 5 steps • must complete 5th step before cycle ends falling clock edge rising clock edge clock signal instruction execution step 1/step 2/step 3/step 4/step 5 220 datapath stable register(s) written Chapter 5.1 - Processor Design 1 SR-Latches • SR-latch with NOR Gates • S = 1 and R = 1 not allowed 221 ° Symbol for SR-Latch with NOR gates Chapter 5.1 - Processor Design 1 SR-Latches • SR-latch with NAND Gates, also known as S´R´ -latch • S = 0 and R = 0 not allowed Chapter 5.1 - Processor Design 1 222 ° Symbol for SR-Latch 
with NAND gates SR-Latches with Control Input • SR-latch with NAND Gates and control input C ° C = 0, no change of state; 223 Chapter 5.1 - Processor Design 1 ° C = 1, change is allowed; • If S = 1 and R = 1, Q and Q´ are Indetermined D-Latches • D-latch based on SR-Latch with NAND Gates and control input C ° C = 0, no change of state; • Q (t + t ) = Q (t ) ° C = 1, change is allowed; • Q (t + t ) = D (t ) • No Indeterminate Output 224 Chapter 5.1 - Processor Design 1 Negative Edge-Triggered MasterSlave D-Flip-Flop ° Symbol for D-Flip Flop. Chapter 5.1 - Processor Design 1 ° Arrowhead (>) indicates an edgetriggered sequential circuit. 225 ° Bubble means that triggering is effective during the HighLow C transition Clocking Methodology for the Entire Datapath Clk Setup Hold Setup Hold Don’t Care . . . . . . . . . . . . • Design/synthesis based on pulsed-sequential circuits – All combinational inputs remain at constant levels and only clock signal appears as a pulse with a fixed period Tcc • All storage elements are clocked by the same clock edge • Cycle time Tcc = CLK-to-q + longest delay path + Setup time + clock skew • (CLK-to-q + shortest delay path - clock skew) > hold time 226 Chapter 5.1 - Processor Design 1 Step 3: Assemble Data Path Meeting Requirements • Register Transfer Requirements Datapath “Assembly” • Instruction Fetch • Read Operands and Execute Operation 227 Chapter 5.1 - Processor Design 1 Stages of the Datapath (1/6) Problem: a single, atomic block which “executes an instruction” (performs all necessary operations beginning with fetching the instruction) would be too bulky and inefficient Solution: break up the process of “executing an instruction” into stages, and then connect the stages to create the whole datapath Smaller stages are easier to design Easy to optimize (change) one stage without touching the others 228 Chapter 5.1 - Processor Design 1 Stages of the Datapath (2/6) There is a wide variety of MIPS instructions: so what general steps do 
they have in common? Stage 1: instruction fetch No matter what the instruction, the 32-bit instruction word must first be fetched from memory (the cache-memory hierarchy) Also, this is where we increment PC (that is, PC = PC + 4, to point to the next instruction: byte addressing so + 4) 229 Chapter 5.1 - Processor Design 1 Stages of the Datapath (3/6) Stage 2: Instruction Decode upon fetching the instruction, we next gather data from the fields (decode all necessary instruction data) first, read the Opcode to determine instruction type and field lengths second, read in data from all necessary registers -for add, read two registers -for addi, read one register -for jal, no reads necessary 230 Chapter 5.1 - Processor Design 1 Stages of the Datapath (4/6) °Stage 3: ALU (Arithmetic-Logic Unit) the real work of most instructions is done here: arithmetic (+, -, *, /), shifting, logic (&, |), comparisons (slt) what about loads and stores? -lw $t0, 40($t1) -the address we are accessing in memory = the value in $t1 + the value 40 -so we do this addition in this stage 231 Chapter 5.1 - Processor Design 1 Stages of the Datapath (5/6) °Stage 4: Memory Access actually only the load and store instructions do anything during this stage; the others remain idle since these instructions have a unique step, we need this extra stage to account for them as a result of the cache system, this stage is expected to be just as fast (on average) as the others 232 Chapter 5.1 - Processor Design 1 Stages of the Datapath (6/6) °Stage 5: Register Write most instructions write the result of some computation into a register examples: arithmetic, logical, shifts, loads, slt what about stores, branches, jumps? -don’t write anything into a register at the end -these remain idle during this fifth stage 233 Chapter 5.1 - Processor Design 1 1. Instruction Fetch 234 imm 2. Decode/ Register Read ALU Data memory rd rs rt registers +4 instruction memory PC Generic Steps: Datapath 3. Execute 4. Memory 5. 
Reg. Write
Datapath Walkthroughs for Different Instructions http://engineering.unt.edu/electrical/public/guturu/datapath.pdf
Datapath Walkthroughs (1/3) add $r3, $r1, $r2 # r3 = r1 + r2
Stage 1: fetch this instruction, increment PC;
Stage 2: decode to find it is an add, then read registers $r1 and $r2;
Stage 3: add the two values retrieved in Stage 2;
Stage 4: idle (nothing to write to memory);
Stage 5: write the result of Stage 3 into register $r3
[Datapath figure: add r3, r1, r2 reads reg[1] and reg[2], the ALU computes reg[1]+reg[2], and the sum is written to register 3]
Datapath Walkthroughs (2/3) slti $r3, $r1, 17
Stage 1: fetch this instruction, increment PC
Stage 2: decode to find it is an slti, then read register $r1
Stage 3: compare the value retrieved in Stage 2 with the integer 17
Stage 4: idle
Stage 5: write the result of Stage 3 into register $r3
[Datapath figure: slti r3, r1, 17 reads reg[1], the ALU computes reg[1]-17, and the comparison result is written to register 3]
Datapath Walkthroughs (3/3) sw $r3, 17($r1)
Stage 1: fetch this instruction, increment PC
Stage 2: decode to find it is a sw, then read registers $r1 and $r3
Stage 3: add 17 to the value in register $r1 (retrieved in Stage 2)
Stage 4: write the value in register $r3 (retrieved in Stage 2) into the memory address computed in Stage 3
Stage 5: idle (nothing to write into a register)
[Datapath figure: sw r3, 17(r1) computes reg[1]+17 in the ALU and performs MEM[r1+17] <= r3]
Why Five Stages? (1/2) Could we have a different number of stages? Yes, and other architectures do. So why does MIPS have five if instructions tend to go idle for at least one stage?
There is one instruction that uses all five stages: the load 242 Chapter 5.1 - Processor Design 1 Why Five Stages? (2/2) lw $r3, 17($r1) Stage 1: fetch this instruction, inc. PC Stage 2: decode to find it’s a lw, then read register $r1 Stage 3: add 17 to value in register $r1 (retrieved in Stage 2) Stage 4: read value from memory address compute in Stage 3 Stage 5: write value found in Stage 4 into register $r3 243 Chapter 5.1 - Processor Design 1 244 registers 17 reg[1]+17 ALU Data memory imm reg[1] MEM[r1+17] +4 x 1 3 LW r3, 17(r1) PC instruction memory Example: lw Instruction Chapter 5.1 - Processor Design 1 Datapath Summary °The datapath based on data transfers required to perform instructions registers rd rs rt imm ALU Data memory +4 instruction memory PC °A controller causes the right transfers to happen opcode, funct Controller 245 Chapter 5.1 - Processor Design 1 Overview of the Instruction Fetch Unit • The common operations – Fetch the Instruction: mem[PC] – Update the program counter: • Sequential Code: PC PC + 4 – Branch and Jump: PC “something else” Clk PC Next Address Logic Address Instruction Memory Instruction Word 32 246 Chapter 5.1 - Processor Design 1 Add & Subtract addu rd, rs, rt R[rd] R[rs] op R[rt]; Example: – Ra, Rb, and Rw come from instruction’s rs, rt, and rd fields – ALUctr and RegWr: control logic after decoding the instruction 31 26 21 op rs 6 bits 16 11 rt 5 bits 5 bits Rd Rs Rt RegWr 5 5 5 Rw Ra Rb 247 32 32-bit Registers 0 rd shamt funct 5 bits 5 bits 6 bits ALU ctr busA 32 busB 32 ALU busW 32 Clk 6 Result 3 2 Chapter 5.1 - Processor Design 1 Register-Register Timing: One complete cycle Clk PC Old Value Rs, Rt, Rd, Op, Func ALUctr RegWr busA, B busW Clk-to-Q New Value Old Value Old Value Old Value Old Value Old Value 248 New Value Register File Access Time New Value ALU Delay New Value busA 32 busB 32 ALUct r ALU Rd Rs Rt RegWr 5 5 5 Rw Ra Rb busW 32 32-bit 32 Clk Registers Instruction Memory Access Time New Value Delay through 
Control Logic New Value Register Write Occurs Here 3 2 Result Chapter 5.1 - Processor Design 1 Logical Operations With Immediate 31 26 21 op 31 rs 6 bits 11 16 rt 5 bits immediate 5 bits 16 15 Rd Rt Mux Rs Rt? RegWr 5 5 5 immediate ALUct r ALU ZeroExt 249 16 bits Result 32 Mux busA Rw Ra Rb 32 32 32-bit Registers busB 32 16 0 • R[rt] R[rs] op ZeroExt[ imm16 ] RegDst imm16 16 bits rd? 0000000000000000 16 bits busW 32 Clk 0 32 ALUSrc Chapter 5.1 - Processor Design 1 Load Operations R[rt] Mem[R[rs] + SignExt[imm16]]; Example: lw rt, rs, imm16 31 26 op 6 bits 21 rs 5 bits 16 rt 5 bits ALU ctr ALU W_Src 32 MemWr M ux Extender Mux busA Rw Ra Rb 32 32 32-bit Registers busB 32 imm16 16 0 immediate 16 bits rd Rd Rt RegDst Mux Rs Rt? RegWr5 5 5 busW 32 Clk 11 WrEnAdr Data In 32 ?? Data 32 32 Clk Memory ALUSrc ExtOp 250 Chapter 5.1 - Processor Design 1 Store Operations Mem[ R[rs] + SignExt[imm16] R[rt] ]; Example: sw rt, rs, imm16 31 26 21 op 16 rs 6 bits 5 bits 0 rt immediate 5 bits 16 bits Rd Rt RegDst Mux Rs Rt RegWr5 5 5 32 ExtOp 251 MemWr W_Src 32 M ux 16 Extender imm16 ALU busA Rw Ra Rb 32 32 32-bit Registers busB 32 Mux busW 32 Clk ALU ctr Data In32 Clk WrEn Adr 32 Data Memory ALUSrc Chapter 5.1 - Processor Design 1 The Branch Instruction 31 26 op 6 bits 21 16 rs 5 bits rt 5 bits 0 immediate 16 bits • beq rs, rt, imm16 – mem[PC] – Equal R[rs] == R[rt] Fetch the instruction from memory Calculate the branch condition – if (Equal) Calculate the next instruction’s address • PC PC + 4 + ( SignExt(imm16) 4 ) – else • PC PC + 4 252 Chapter 5.1 - Processor Design 1 Datapath for Branch Operations 26 21 op rs 6 bits beq 16 rt 5 bits 0 immediate 5 bits rs, rt, imm16 16 bits Datapath generates condition (equal) Inst Address nPC_sel 4 32 PC Mux Adder Rs Rt 5 5 busA Rw Ra Rb 32 32 32-bit Registers busB 32 RegWr 5 00 Adder 253 PC Ext imm16 Cond busW Clk Equal? 
31 Clk Chapter 5.1 - Processor Design 1 Summary: A Single Cycle Datapath 32 0 32 32 WrEn Adr Data In Data Clk Memory Mux imm16 16 1 00 = ALU imm16 ALUc MemWr MemtoReg tr Rs Rt 5 5 Rw Ra Rb busA 32 32 32-bit 0 Registers busB 32 Mux Clk Imm16 Equal Rt 0 Extender Clk Rt Rd Instruction<31:0> <0:15> busW 32 <11:15> PC PC Ext Adder Mux Adder 4 Rs RegDst Rd 1 RegWr 5 <16:20> nPC_sel <21:25> Inst Memory Adr 1 ExtOp ALUSrc 254 Chapter 5.1 - Processor Design 1 An Abstract View of the Critical Path • Register file and ideal memory: – The CLK input is a factor ONLY during write operation – During read operation, behave as combinational logic: • Address valid Output valid after “access time.” PC Clk Next Address 255 Clk Imm 1 6 A 32 32 ALU Ideal Instruction Instruction Memory Rd Rs Rt 5 5 5 Instruction Address 32 Rw Ra Rb 32 32-bit Registers Critical Path (Load Operation) = PC’s Clk-to-Q + Instruction Memory’s Access Time + Register File’s Access Time + ALU to Perform a 32-bit Add + Data Memory Access Time + Setup Time for Register File Write + Clock Skew B 32 Data Address Data In Ideal Data Memory Clk Chapter 5.1 - Processor Design 1 An Abstract View of the Implementation Ideal Instruction Memory PC Clk 32 Instruction Rd Rs Rt 5 5 5 Rw Ra Rb 32 32-bit Registers Clk Control Signals Conditions A 32 32 ALU Next Address Instruction Address Control B 32 Data Address Data In Ideal Data Memory Data Out Clk Datapath 256 Chapter 5.1 - Processor Design 1 Steps 4 & 5: Implement the control In The Next Section 257 Chapter 5.1 - Processor Design 1 Summary: MIPS-lite Implementations • single-cycle: uses single l-o-n-g clock cycle for each instruction executed • Easy to understand, but not practical • slower than implementation that allows instructions to take different numbers of clock cycles • fast instructions: (beq) fewer clock cycles • slow instructions (mult?): more cycles • multicycle, pipelined implementations later • Next time, finish the single-cycle implementation 258 Chapter 5.1 
Summary

• 5 steps to design a processor:
  1. Analyze the instruction set => datapath requirements
  2. Select a set of datapath components & establish the clocking methodology
  3. Assemble a datapath meeting the requirements
  4. Analyze the implementation of each instruction to determine the settings of the control points that effect the register transfers
  5. Assemble the control logic
• MIPS makes it easier:
  – Instructions are the same size
  – Source registers are always in the same place
  – Immediates have the same size and location
  – Operations are always on registers/immediates
• Single-cycle datapath: CPI = 1, but the clock cycle time is long
• Next time: implementing control

Processor Design - 2

Adapted from notes by David A. Patterson, John Kubiatowicz, and others.
Copyright © 2001 University of California at Berkeley
Chapter 5.2 - Processor Design 2

Summary: A Single Cycle Datapath

[Diagram: the single-cycle datapath repeated from the previous section, with control signals nPC_sel, RegDst, RegWr, ExtOp, ALUSrc, ALUctr, MemWr, and MemtoReg]

An Abstract View of the Critical Path

• Register file and ideal memory behave as combinational logic during reads (address valid => output valid after the "access time"); the CLK input matters only for writes
• Critical Path (Load Operation) = PC's Clk-to-Q + Instruction Memory's Access Time + Register File's Access Time + ALU Delay to Perform a 32-bit Add + Data Memory Access Time + Setup Time for Register File Write + Clock Skew

The Big Picture: Where are We Now?
• The Five Classic Components of a Computer: Processor (Control + Datapath), Memory, Input, Output
• Next Topic: Designing the Control for the Single Cycle Datapath

An Abstract View of the Implementation

[Diagram: the control unit takes the instruction and the condition signals (e.g., Zero) and produces the control signals that steer the datapath]

Recap: A Single Cycle Datapath

• Rs, Rt, Rd and Imm16 are hardwired into the datapath from the Instruction Fetch Unit
• We have everything except the control signals
  – Today's lecture will show you how to generate them

Recap: Meaning of the Control Signals

• nPC_sel:
  0 => PC <- PC + 4
  1 => PC <- PC + 4 + SignExt(Im16) || 00
• Later in the lecture: the higher-level connection between this mux and the branch condition

Recap: Meaning of the Control Signals

• ExtOp:    0 => "zero" extend; 1 => "sign" extend
• ALUSrc:   0 => regB; 1 => immediate
• ALUctr:   "add", "sub", "or"
• MemWr:    1 => write memory
• MemtoReg: 0 => ALU output; 1 => Memory output
• RegDst:   0 => "rt"; 1 => "rd"
• RegWr:    1 => write register

The add Instruction

 31      26      21      16      11       6       0
 |  op   |  rs   |  rt   |  rd   | shamt | funct |
  6 bits  5 bits  5 bits  5 bits  5 bits  6 bits

• add rd, rs, rt
  – mem[PC]: Fetch the instruction from memory
  – R[rd] <- R[rs]
+ R[rt]: the actual operation
  – PC <- PC + 4: Calculate the next instruction's address

Fetch Unit at the Beginning of add

• Fetch the instruction from Instruction memory: Instruction <- mem[PC]
  (This is the same for all instructions)

The Single Cycle Datapath during add

• R[rd] <- R[rs] + R[rt]
• Control settings: nPC_sel = +4, RegDst = 1, RegWr = 1, ALUctr = Add, ALUSrc = 0, ExtOp = x, MemWr = 0, MemtoReg = 0

Instruction Fetch Unit at the End of add

• PC <- PC + 4
  – This is the same for all instructions except Branch and Jump

The Single Cycle Datapath during Or Immediate

• R[rt] <- R[rs] or ZeroExt(Imm16)
• First, try to fill in the control settings yourself: nPC_sel = ?, RegDst = ?, RegWr = ?, ALUctr = ?, ALUSrc = ?, ExtOp = ?, MemWr = ?, MemtoReg = ?

The Single Cycle Datapath during Or Immediate (solution)

• R[rt] <- R[rs] or ZeroExt(Imm16)
• Control settings: RegDst = 0, RegWr = 1, ALUctr = Or, MemWr = 0, MemtoReg = 0,
nPC_sel = +4, ALUSrc = 1, ExtOp = 0 (zero extend)

The Single Cycle Datapath during Load

• R[rt] <- Data Memory {R[rs] + SignExt[imm16]}
• Control settings: nPC_sel = +4, RegDst = 0, RegWr = 1, ALUctr = Add, ALUSrc = 1, ExtOp = 1 (sign extend), MemWr = 0, MemtoReg = 1

The Single Cycle Datapath during Store

• Data Memory {R[rs] + SignExt[imm16]} <- R[rt]
• First, try to fill in the control settings yourself

The Single Cycle Datapath during Store (solution)

• Data Memory {R[rs] + SignExt[imm16]} <- R[rt]
• Control settings: nPC_sel = +4, RegDst = x, RegWr = 0, ALUctr = Add, ALUSrc = 1, ExtOp = 1 (sign extend), MemWr = 1, MemtoReg = x

The Single Cycle Datapath during Branch

• if (R[rs] - R[rt] == 0) then Zero <- 1; else Zero <- 0
• Control settings: RegDst = x, RegWr = 0, ALUctr = Sub, MemWr = 0, MemtoReg = x,
nPC_sel = "Br", ALUSrc = 0, ExtOp = x

Instruction Fetch Unit at the End of Branch

• if (Zero == 1) then PC = PC + 4 + (SignExt(imm16) || 00); else PC = PC + 4
• What is the encoding of nPC_sel?
  – Direct MUX select?
  – Branch / not branch?
• Let's choose the second option:

    nPC_sel   zero?   MUX
      0         x      0
      1         0      0
      1         1      1

Step 4: Given Datapath: RTL Control

[Diagram: the instruction fields (op, rs, rt, rd, funct, imm16) feed a control block that produces nPC_sel, RegWr, RegDst, ExtOp, ALUSrc, ALUctr, MemWr, and MemtoReg for the datapath; the datapath returns Zero]

A Summary of Control Signals

inst   Register Transfer
ADD    R[rd] <- R[rs] + R[rt];  PC <- PC + 4
       ALUsrc = RegB, ALUctr = "add", RegDst = rd, RegWr, nPC_sel = "+4"
SUB    R[rd] <- R[rs] - R[rt];  PC <- PC + 4
       ALUsrc = RegB, ALUctr = "sub", RegDst = rd, RegWr, nPC_sel = "+4"
ORi    R[rt] <- R[rs] OR zero_ext(Imm16);  PC <- PC + 4
       ALUsrc = Im, Extop = "Z", ALUctr = "or", RegDst = rt, RegWr, nPC_sel = "+4"
LOAD   R[rt] <- MEM[ R[rs] + sign_ext(Imm16) ];  PC <- PC + 4
       ALUsrc = Im, Extop = "Sn", ALUctr = "add", MemtoReg, RegDst = rt, RegWr, nPC_sel = "+4"
STORE  MEM[ R[rs] + sign_ext(Imm16) ] <- R[rt];  PC <- PC + 4
       ALUsrc = Im, Extop = "Sn", ALUctr = "add", MemWr, nPC_sel = "+4"
BEQ    if ( R[rs] == R[rt] ) then PC <- PC + 4 + (sign_ext(Imm16) || 00) else PC <- PC + 4
       nPC_sel = "Br", ALUctr = "sub"

A Summary of Control Signals

             add      sub      ori      lw       sw       beq      jump
func         10 0000  10 0010
op           00 0000  00 0000  00 1101  10 0011  10 1011  00 0100  00 0010
RegDst       1        1        0        0        x        x        x
ALUSrc       0        0        1        1        1        0        x
MemtoReg     0        0        0        1        x        x        x
RegWrite     1        1        1        1        0        0        0
MemWrite     0        0        0        0        1        0        0
nPCsel       0        0        0        0        0        1        0
Jump         0        0        0        0        0        0        1
ExtOp        x        x        0        1        1        x        x
ALUctr<2:0>  Add      Subtract Or       Add      Add      Subtract xxx

(x = We Don't Care :-)  See Appendix A.)

Instruction formats:

 31      26      21      16      11       6       0
 R-type |  op   |  rs   |  rt   |  rd   | shamt | funct |    add, sub
 I-type |  op   |  rs   |  rt   |      immediate        |    ori, lw, sw, beq
 J-type |  op   |            target address             |    jump
The Concept of Local Decoding

op           00 0000  00 1101  10 0011  10 1011  00 0100  00 0010
             R-type   ori      lw       sw       beq      jump
RegDst       1        0        0        x        x        x
ALUSrc       0        1        1        1        0        x
MemtoReg     0        0        1        x        x        x
RegWrite     1        1        1        0        0        0
MemWrite     0        0        0        1        0        0
Branch       0        0        0        0        1        0
Jump         0        0        0        0        0        1
ExtOp        x        0        1        1        x        x
ALUop<N:0>   "R-type" Or       Add      Add      Subtract xxx

• The Main Control decodes the 6-bit op field into an N-bit ALUop; a local ALU Control block combines ALUop with the 6-bit func field to produce the 3-bit ALUctr that drives the ALU

The Encoding of ALUop

• In this exercise, ALUop has to be 2 bits wide to represent:
  – (1) "R-type" instructions
  – "I-type" instructions that require the ALU to perform: (2) Or, (3) Add, and (4) Subtract
• To implement the full MIPS ISA, ALUop has to be 3 bits to also represent:
  – (5) And (example: andi)

                  R-type    ori    lw     sw     beq       jump
ALUop (Symbolic)  "R-type"  Or     Add    Add    Subtract  xxx
ALUop<2:0>        1 00      0 10   0 00   0 00   0 01      xxx

The Decoding of the "func" Field

(P. 286 of the text)

funct<5:0>   Instruction Operation   ALUctr<2:0>   ALU Operation
10 0000      add                     010           Add
10 0010      subtract                110           Subtract
10 0100      and                     000           And
10 0101      or                      001           Or
10 1010      set-on-less-than        111           Set-on-less-than

The Truth Table for ALUctr

• Recall: ALUop is "R-type" = 1 00, Or = 0 10, Add = 0 00, Subtract = 0 01, and only the low 4 bits of funct (funct<3:0>) are needed
ALUop          func           ALU        ALUctr
<2> <1> <0>    <3><2><1><0>   Operation  <2> <1> <0>
 0   0   0      x  x  x  x    Add         0   1   0
 0   x   1      x  x  x  x    Subtract    1   1   0
 0   1   x      x  x  x  x    Or          0   0   1
 1   x   x      0  0  0  0    Add         0   1   0
 1   x   x      0  0  1  0    Subtract    1   1   0
 1   x   x      0  1  0  0    And         0   0   0
 1   x   x      0  1  0  1    Or          0   0   1
 1   x   x      1  0  1  0    Set on <    1   1   1

The Logic Equation for ALUctr<2>

ALUop          func           ALUctr<2>
 0   x   1      x  x  x  x     1
 1   x   x      0  0  1  0     1
 1   x   x      1  0  1  0     1        (this makes func<3> a don't care)

• ALUctr<2> = !ALUop<2> & ALUop<0>
            + ALUop<2> & !func<2> & func<1> & !func<0>

The Logic Equation for ALUctr<1>

ALUop          func           ALUctr<1>
 0   0   0      x  x  x  x     1
 0   x   1      x  x  x  x     1
 1   x   x      0  0  0  0     1
 1   x   x      0  0  1  0     1
 1   x   x      1  0  1  0     1

• ALUctr<1> = !ALUop<2> & !ALUop<1>
            + ALUop<2> & !func<2> & !func<0>
  (note: the first term must be !ALUop<1>, not !ALUop<0>, or the Or case ALUop = 010 would wrongly set this bit — check the truth table rows above)

The Logic Equation for ALUctr<0>

ALUop          func           ALUctr<0>
 0   1   x      x  x  x  x     1
 1   x   x      0  1  0  1     1
 1   x   x      1  0  1  0     1

• ALUctr<0> = !ALUop<2> & ALUop<1>
            + ALUop<2> & !func<3> & func<2> & !func<1> & func<0>
            + ALUop<2> & func<3> & !func<2> & func<1> & !func<0>

The ALU Control Block

• The local ALU Control block implements exactly the three equations above: inputs are the 3-bit ALUop from the Main Control and the 6-bit func field; output is the 3-bit ALUctr

Step 5: Logic for Each Control Signal

• nPC_sel <= if (OP == BEQ) then "Br" else "+4"
• ALUsrc  <= if (OP == "Rtype") then "regB" else "immed"
• ALUctr  <= if (OP == "Rtype") then funct
             elseif (OP == ORi) then "OR"
             elseif (OP == BEQ)
then "sub" else "add"
• ExtOp    <= _____________
• MemWr    <= _____________
• MemtoReg <= _____________
• RegWr    <= _____________
• RegDst   <= _____________

Step 5: Logic for each control signal (solution)

• nPC_sel  <= if (OP == BEQ) then "Br" else "+4"
• ALUsrc   <= if (OP == "Rtype") then "regB" else "immed"
• ALUctr   <= if (OP == "Rtype") then funct
              elseif (OP == ORi) then "OR"
              elseif (OP == BEQ) then "sub"
              else "add"
• ExtOp    <= if (OP == ORi) then "zero" else "sign"
• MemWr    <= (OP == Store)
• MemtoReg <= (OP == Load)
• RegWr    <= if ((OP == Store) || (OP == BEQ)) then 0 else 1
• RegDst   <= if ((OP == Load) || (OP == ORi)) then 0 else 1

The "Truth Table" for the Main Control

op               00 0000  00 1101  10 0011  10 1011  00 0100  00 0010
                 R-type   ori      lw       sw       beq      jump
RegDst           1        0        0        x        x        x
ALUSrc           0        1        1        1        0        x
MemtoReg         0        0        1        x        x        x
RegWrite         1        1        1        0        0        0
MemWrite         0        0        0        1        0        0
nPC_sel          0        0        0        0        1        0
Jump             0        0        0        0        0        1
ExtOp            x        0        1        1        x        x
ALUop (Symbolic) "R-type" Or       Add      Add      Subtract xxx
ALUop<2>         1        0        0        0        0        x
ALUop<1>         0        1        0        0        0        x
ALUop<0>         0        0        0        0        1        x

A Real MIPS Datapath (CNS T0)

[Diagram: block diagram of the CNS T0 MIPS datapath]

Summary: A Single Cycle Processor

[Diagram: the complete single-cycle processor — the Main Control decodes Instr<31:26> into RegDst, ALUSrc, MemtoReg, RegWrite, MemWrite, nPC_sel, Jump, ExtOp, and ALUop; the local ALU Control combines ALUop with Instr<5:0> to produce ALUctr]

Recap: An Abstract View of the Critical Path (Load)

• Register file and ideal memory:
  – The CLK input is a factor ONLY during write operations
  – During read operations, they behave as combinational logic:
    • Address valid => Output valid after
"access time."
• Critical Path (Load Operation) = PC's Clk-to-Q + Instruction Memory's Access Time + Register File's Access Time + ALU Delay to Perform a 32-bit Add + Data Memory Access Time + Setup Time for Register File Write + Clock Skew

Worst Case Timing (Load)

[Timing diagram: after the clock edge, the PC changes Clk-to-Q later; Rs, Rt, Rd, Op, and Func appear after the instruction-memory access time; the control signals (ALUctr, ExtOp, ALUSrc, MemtoReg, RegWr) settle after the delay through the control logic; busA and busB appear after the register-file access time (busB also passes through the extender & mux); the address appears after the ALU delay, the data after the data-memory access time, and the register write occurs at the next rising edge]

Drawback of this Single Cycle Processor

• Long cycle time:
  – The cycle time must be long enough for the load instruction:
    PC's Clk-to-Q + Instruction Memory Access Time + Register File Access Time + ALU Delay (address calculation) + Data Memory Access Time + Register File Setup Time + Clock Skew
• The cycle time for load is much longer than needed for all other instructions

Summary

• Single-cycle datapath: CPI = 1, but the clock cycle time is long
• 5 steps to design a processor:
  1. Analyze the instruction set => datapath requirements
  2. Select a set of datapath components & establish the clocking methodology
  3. Assemble a datapath meeting the requirements
  4. Analyze the implementation of each instruction to determine the settings of the control points that effect the register transfers
  5.
Assemble the control logic
• Control is the hard part
• MIPS makes control easier:
  – Instructions are the same size
  – Source registers are always in the same place
  – Immediates have the same size and location
  – Operations are always on registers/immediates

Designing a Multi-cycle Processor

Adapted from the lecture notes of John Kubiatowicz (UCB)

Recap: A Single Cycle Datapath

[Diagram: the single-cycle datapath — Instruction Fetch Unit, register file, extender, ALU, and data memory, with control signals nPC_sel, RegDst, RegWr, ExtOp, ALUSrc, ALUctr, MemWr, and MemtoReg]

Recap: The "Truth Table" for the Main Control

[Table repeated from the previous section: for each opcode (R-type, ori, lw, sw, beq, jump), the values of RegDst, ALUSrc, MemtoReg, RegWrite, MemWrite, Branch, Jump, ExtOp, and ALUop<2:0>]

The Big Picture: Where are We Now?

• The Five Classic Components of a Computer: Processor (Control + Datapath), Memory, Input, Output
• Today's Topic: Designing the Datapath for the Multiple Clock Cycle Datapath

Abstract View of our single cycle processor

• The single-cycle processor looks like an FSM with the PC as state
[Diagram: Instruction Fetch -> Register Fetch -> Ext -> Exec (ALU) -> Mem Access -> Result Store, with the Main Control and ALU control driving nPC_sel, ExtOp, ALUSrc, ALUctr, MemRd, MemWr, RegDst, RegWr, and the Equal condition fed back]

What's wrong with our CPI=1 processor?
[Diagram: per-class critical paths through the single-cycle datapath —
  Arithmetic & Logical: PC -> Inst Memory -> Reg File -> mux -> ALU -> mux -> setup
  Load:                 PC -> Inst Memory -> mux -> Reg File -> ALU -> Data Mem -> mux -> setup  (the critical path)
  Store:                PC -> Inst Memory -> Reg File -> ALU -> Data Mem
  Branch:               PC -> Inst Memory -> Reg File -> mux -> cmp -> mux]

• Long cycle time
• All instructions take as much time as the slowest
• Real memory is not as nice as our idealized memory — it cannot always get the job done in one (short) cycle

Basic Limits on Cycle Time

• Next address logic:   PC <= branch ? PC + offset : PC + 4
• Instruction fetch:    InstructionReg <= Mem[PC]
• Register access:      A <= R[rs]
• ALU operation:        R <= A + B

Partitioning the CPI=1 Datapath

• Add registers between the smallest steps
• Place enables on all registers
• Critical path?

Example Multicycle Datapath

[Diagram: the multicycle datapath with inter-stage registers — IR after Instruction Fetch, A and B after Operand Fetch, S after Exec, M after Mem Access, before Result Store into the register file; E holds the Equal condition]

Recall: Step-by-step Processor Design

  Step 1: ISA => Logical Register Transfers
  Step 2: Components of the Datapath
  Step 3: RTL + Components => Datapath
  Step 4: Datapath + Logical RTs => Physical RTs
  Step 5: Physical RTs => Control

Step 4: R-type (add, sub, . . .)

• Logical Register Transfer:
  ADDU   R[rd] <- R[rs] + R[rt];   PC <- PC + 4
• Physical Register Transfers:
  IR <- MEM[pc]
  A <- R[rs]; B <- R[rt]
  S <- A + B
  R[rd] <- S;   PC <- PC + 4
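In the R-type execute step, "fun" means the ALU operation selected from the funct field by the local ALU control developed in the previous section. As a recap, here is a sketch of that two-level decode in code; the function name `alu_ctr` is mine, and the encodings are the deck's (ALUop: R-type = 100, Or = 010, Add = 000, Subtract = 001; ALUctr: Add = 010, Subtract = 110, And = 000, Or = 001, Set-on-less-than = 111).

```python
def alu_ctr(aluop, func):
    """Local ALU control: combine the 3-bit ALUop from the main
    control with the low 4 bits of the funct field to produce the
    3-bit ALUctr, per the logic equations derived in the deck."""
    a2, a1, a0 = (aluop >> 2) & 1, (aluop >> 1) & 1, aluop & 1
    f3, f2, f1, f0 = (func >> 3) & 1, (func >> 2) & 1, (func >> 1) & 1, func & 1
    c2 = (not a2 and a0) or (a2 and not f2 and f1 and not f0)
    c1 = (not a2 and not a1) or (a2 and not f2 and not f0)
    c0 = (not a2 and a1) or (a2 and ((not f3 and f2 and not f1 and f0) or
                                     (f3 and not f2 and f1 and not f0)))
    return (int(bool(c2)) << 2) | (int(bool(c1)) << 1) | int(bool(c0))

# I-type cases: ALUop alone decides
assert alu_ctr(0b000, 0) == 0b010   # lw/sw -> Add
assert alu_ctr(0b001, 0) == 0b110   # beq   -> Subtract
assert alu_ctr(0b010, 0) == 0b001   # ori   -> Or
# R-type cases: funct<3:0> decides
assert alu_ctr(0b100, 0b0000) == 0b010   # add
assert alu_ctr(0b100, 0b1010) == 0b111   # set-on-less-than
```

The assertions reproduce the rows of the ALUctr truth table, which is a quick way to convince yourself the minimized equations are consistent with it.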
Step 4: Logical immed

• Logical Register Transfer:
  ORI    R[rt] <- R[rs] OR ZExt(Im16);   PC <- PC + 4
• Physical Register Transfers:
  IR <- MEM[pc]
  A <- R[rs]; B <- R[rt]
  S <- A or ZExt(Im16)
  R[rt] <- S;   PC <- PC + 4

Step 4: Load

• Logical Register Transfer:
  LW     R[rt] <- MEM[R[rs] + SExt(Im16)];   PC <- PC + 4
• Physical Register Transfers:
  IR <- MEM[pc]
  A <- R[rs]; B <- R[rt]
  S <- A + SExt(Im16)
  M <- MEM[S]
  R[rt] <- M;   PC <- PC + 4

Step 4: Store

• Logical Register Transfer:
  SW     MEM[R[rs] + SExt(Im16)] <- R[rt];   PC <- PC + 4
• Physical Register Transfers:
  IR <- MEM[pc]
  A <- R[rs]; B <- R[rt]
  S <- A + SExt(Im16); MEM[S] <- B
  PC <- PC + 4

Step 4: Branch

• Logical Register Transfer:
  BEQ    if R[rs] == R[rt] then PC <= PC + 4 + (SExt(Im16) || 00)
         else PC <= PC + 4
• Physical Register Transfers:
  IR <- MEM[pc]
  E <- (R[rs] == R[rt])
  if !E then PC <- PC + 4
  else PC <- PC + 4 + (SExt(Im16) || 00)
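The physical register transfers for a load can be traced directly in code. This is a minimal sketch, not the hardware: memories and the register file are plain Python containers, and the local names IR, A, B, S, M mirror the inter-stage registers on the slides.

```python
def lw_physical_rts(mem, regs, pc, rs, rt, imm16):
    """Walk the multicycle physical register transfers for
    lw rt, imm16(rs), one per line, in slide order."""
    IR = mem[pc]                      # IR <- MEM[pc]
    A, B = regs[rs], regs[rt]         # A <- R[rs]; B <- R[rt]
    if imm16 & 0x8000:                # SExt(Im16): sign-extend 16 bits
        imm16 -= 0x10000
    S = A + imm16                     # S <- A + SExt(Im16)
    M = mem[S]                        # M <- MEM[S]
    regs[rt] = M                      # R[rt] <- M
    pc = pc + 4                       # PC <- PC + 4
    return regs, pc

# Hypothetical contents: word 77 at address 104, base register r1 = 100
mem = {0: "lw r2, 4(r1)", 104: 77}
regs = [0, 100, 5]
regs, pc = lw_physical_rts(mem, regs, 0, 1, 2, 4)
assert regs[2] == 77 and pc == 4
```

Each line of the function corresponds to one controller state of the lw path, which is exactly why lw takes 5 cycles in the multicycle design.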
Multi-Cycle Data Path

http://www.ee.unt.edu/public/guturu/MultiCycleDesign.docx
www.ee.unt.edu/public/guturu/multicyclestatemachine.pdf

Alternative data-path (book): Multiple Cycle Datapath

• Minimizes hardware: 1 memory, 1 adder
[Diagram: a single shared memory (the IorD mux selects PC or ALUOut as the address), the Instruction Register, the register file, one ALU with ALUSelA/ALUSelB input muxes, and the Target/ALUOut registers; control signals include PCWr, PCWrCond, IorD, MemWr, IRWr, RegDst, RegWr, ALUOp, ExtOp, PCSrc, and BrWr]

Our Control Model

• The state specifies the control points for the Register Transfer
• The transfer occurs upon exiting the state (on the same falling edge)
[Diagram: FSM — the inputs (conditions) and the current state feed the next-state logic; the output logic produces the control points, which depend on the inputs]

Step 4: Control Specification for the multicycle processor

  "instruction fetch":       IR <= MEM[PC]
  "decode / operand fetch":  A <= R[rs]; B <= R[rt]
  Execute:
    R-type:  S <= A fun B
    ORi:     S <= A or ZX
    LW:      S <= A + SX
    SW:      S <= A + SX
    BEQ:     PC <= Next(PC, Equal)
  Memory:
    LW:      M <= MEM[S]
    SW:      MEM[S] <= B;  PC <= PC + 4
  Write-back:
    R-type:  R[rd] <= S;   PC <= PC + 4
    ORi:     R[rt] <= S;   PC <= PC + 4
    LW:      R[rt] <= M;   PC <= PC + 4

Traditional FSM Controller

[Diagram: a state register plus a truth table mapping (state, op, Equal) to the next state and the datapath control points]

Step 5 (datapath + state diagram control)

• Translate the RTs into control points
• Assign states
• Then go build the controller

Mapping RTs to Control Points

  IR <= MEM[PC]                   imem_rd, IRen
  A <= R[rs]; B <= R[rt]          Aen, Ben, Een
  S <= A fun B                    ALUfun, Sen
  R[rd] <= S; PC <= PC + 4        RegDst, RegWr, PCen
  (and similarly for the ORi, LW, SW, and BEQ states)

Assigning States

  0000  "instruction fetch":  IR <= MEM[PC]
  0001  "decode":             A <= R[rs]; B <= R[rt]
  Execute:     R-type 0100 (S <= A fun B), ORi 0110 (S <= A or ZX), LW 1000 (S <= A + SX), SW 1011 (S <= A + SX), BEQ 0011 (PC <= Next(PC))
  Memory:      LW 1001 (M <= MEM[S]), SW 1100 (MEM[S] <= B; PC <= PC + 4)
  Write-back:  R-type 0101 (R[rd] <= S; PC <= PC + 4), ORi 0111 (R[rt] <= S; PC <= PC + 4), LW 1010 (R[rt] <= M; PC <= PC + 4)

(Mostly) Detailed Control Specification

[Table: for each (state, op field, Equal) combination — the next state and the settings of the IR enable/select, PC ops, A/B/E enables, extender select, ALU function, S, register file R/W, memory R/W, Mem-to-Reg, and write destination. State 0000 always goes to 0001 with IR enabled; state 0001 dispatches on the opcode to 0011 (BEQ), 0100 (R-type), 0110 (ORI), 1000 (LW), or 1011 (SW) — in a Moore machine its outputs are all the same; each instruction's final state returns to 0000.]

Performance Evaluation

• What is the average CPI?
  – The state diagram gives the CPI for each instruction type
  – The workload gives the frequency of each type

  Type          CPIi for type   Frequency   CPIi x freqi
  Arith/Logic   4               40%         1.6
  Load          5               30%         1.5
  Store         4               10%         0.4
  Branch        3               20%         0.6
                                Average CPI: 4.1

Introduction to Pipelining

Adapted from the lecture notes of Dr. John Kubiatowicz (UC Berkeley)

Pipelining is Natural!

• Laundry Example
• Ann, Brian, Cathy, and Dave each have one load of clothes to wash, dry, and fold
• Washer takes 30 minutes
• Dryer takes 40 minutes
• "Folder" takes 20 minutes

Sequential Laundry

[Timeline: loads A, B, C, and D run back-to-back, 30 + 40 + 20 minutes each, from 6 PM to midnight]
• Sequential laundry takes 6 hours for 4 loads

Pipelined Laundry: Start work ASAP

[Timeline: load A starts washing at 6 PM; each following load starts as soon as the washer is free, so the dryer — the slowest stage at 40 minutes — sets the rhythm]
• Pipelined laundry takes 3.5 hours for 4 loads

Pipelining Lessons

• Latency vs. Throughput
• Question:
  – What is the latency in both cases?
  – What is the throughput in both cases?
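Those two questions can be answered by computing both schedules. A small sketch, using the stage times from the laundry slides (wash 30, dry 40, fold 20 minutes; 4 loads):

```python
WASH, DRY, FOLD = 30, 40, 20   # minutes, from the slides
LOADS = 4

# Sequential: each load runs start-to-finish before the next begins.
seq_total = LOADS * (WASH + DRY + FOLD)           # 360 min = 6 hours

# Pipelined: the first load takes a full pass through all stages;
# after that, the dryer (the slowest stage) releases one load per beat.
pipe_total = (WASH + DRY + FOLD) + (LOADS - 1) * max(WASH, DRY, FOLD)  # 210 min

# Latency of one load is unchanged by pipelining...
latency_first_load = WASH + DRY + FOLD            # 90 min either way
# ...but throughput (loads per minute) improves.
throughput_seq = LOADS / seq_total
throughput_pipe = LOADS / pipe_total
```

So latency stays 90 minutes for a single load, while throughput rises from 4/360 to 4/210 loads per minute — exactly the "helps throughput, not latency" lesson stated next.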
• Pipelining doesn't help the latency of a single task; it helps the throughput of the entire workload

Pipelining Lessons [contd…]

• Question:
  – What is the fastest operation in the example?
  – What is the slowest operation in the example?
• Pipeline rate is limited by the slowest pipeline stage

Pipelining Lessons [contd…]

• Multiple tasks operate simultaneously using different resources

Pipelining Lessons [contd…]

• Question: Would the speedup increase if we had more steps?
• Potential speedup = Number of pipe stages

Pipelining Lessons [contd…]

• Washer takes 30 minutes; dryer takes 40 minutes; "folder" takes 20 minutes
• Question: What happens if the "folder" also took 40 minutes?
• Unbalanced lengths of pipe stages reduce the speedup

Pipelining Lessons [contd…]

• Time to "fill" the pipeline and time to "drain" it reduce the speedup

Five Stages of an Instruction Cycle

  Cycle 1   Cycle 2   Cycle 3   Cycle 4   Cycle 5
  Ifetch    Reg/Dec   Exec      Mem       Wr

• Ifetch: Instruction Fetch — fetch the instruction from the Instruction Memory
• Reg/Dec: Register Fetch and Instruction Decode
• Exec: Calculate the memory address
• Mem: Read the data from the Data Memory
• Wr: Write the data back to the register file

Conventional Pipelined Execution Representation

[Diagram: successive instructions, each one cycle apart, flowing through IFetch - Dcd - Exec - Mem - WB]

Example

[Diagram: three loads — lw $1, 100($0); lw $2, 200($0); lw $3, 300($0) — each passing through Inst. Fetch, Reg, ALU, Mem, Reg, overlapped one cycle apart]
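For the three-instruction example above, the speedup arithmetic used on the next slides (8 ns per instruction unpipelined, 2 ns per pipeline stage) can be checked in a few lines. Note that 1000 × 8 ns + 24 ns is 8024 ns, not a round 8000.

```python
# Three instructions: 24 ns unpipelined vs. 14 ns pipelined (fill + 3 beats).
base_unpiped, base_piped = 24, 14

# Add 1000 more instructions at 8 ns each unpipelined, vs. one 2 ns
# pipeline beat each once the pipeline is full.
unpiped = 1000 * 8 + base_unpiped     # 8024 ns
piped = 1000 * 2 + base_piped         # 2014 ns
speedup = unpiped / piped             # ~3.98, approaching 8/2 = 4
```

As the instruction count grows, the fill/drain overhead amortizes away and the speedup approaches the stage-count bound — which is why instruction throughput, not single-instruction latency, is the metric that matters.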
Example [contd…]

• Time_pipelined = Time_non-pipelined / Pipe stages
• Assumptions:
  – Stages are perfectly balanced
  – Ideal conditions

Definitions

• Performance is in units of things per second — bigger is better
• If we are primarily concerned with response time:
  performance(x) = 1 / execution_time(x)
• "X is n times faster than Y" means:
  n = Performance(X) / Performance(Y) = Execution_time(Y) / Execution_time(X)

Example [contd…]

• Speedup in this case = 24 / 14 = 1.7
• Let's add 1000 more instructions:
  – Time (non-pipelined) = 1000 x 8 ns + 24 ns = 8024 ns
  – Time (pipelined) = 1000 x 2 ns + 14 ns = 2014 ns
  – Speedup = 8024 / 2014 = 3.98 ≈ 4 = 8/2
• Instruction throughput is the important metric (as opposed to the latency of an individual instruction), because real programs execute billions of instructions

Pipeline Hazards

• Structural Hazard
[Diagram: overlapped IFetch - Dcd - Exec - Mem - WB streams competing for the same resource]

Pipeline Hazard [contd…]

• Control Hazard
• Example:
  – add $4, $5, $6
  – beq $1, $2, 40
  – lw  $3, 300($0)

Pipeline Hazard [contd…]

• Data Hazards
• Example:
  – add $s0, $t0, $t1
  – sub $t2, $s0, $t3

Summary: Pipelining Lessons

• Pipelining doesn't help the latency of a single task; it helps the throughput of the entire workload
• Pipeline rate is limited by the slowest pipeline stage
• Multiple tasks operate simultaneously using different resources
• Potential speedup = Number of pipe stages
• Unbalanced lengths of pipe stages reduce the speedup
• Time to "fill" the pipeline and time to "drain" it reduce the speedup
• Stall for dependences

Summary of Pipeline Hazards

• Structural Hazards — hardware design
• Control Hazards — decisions based on results
• Data Hazards — data dependency

Pipelining - II

Adapted from CS 152C (UC Berkeley) lecture notes of Spring 2002

Revisiting Pipelining Lessons

[Timeline: the pipelined laundry diagram repeated]
• Pipelining doesn't help the latency of a single task; it helps the throughput of the entire workload
• Pipeline rate is limited by the slowest pipeline stage
• Multiple tasks operate simultaneously using different resources
• Potential speedup = Number of pipe stages
• Unbalanced lengths of pipe stages reduce the speedup
• Time to "fill" the pipeline and time to "drain" it reduce the speedup
• Stall for dependences

Revisiting Pipelining Hazards

• Structural Hazards — hardware design
• Control Hazards — decisions based on results
• Data Hazards — data dependency

Control Signals for the existing Datapath

• The five stages: IF (Instruction Fetch), ID (Instruction Decode / register file read), EX (Execute / address calculation), MEM (Memory Access), WB (Write Back)
[Diagram: the datapath annotated with the five stage boundaries; the right-to-left control paths (write-back and branch) are what can lead to hazards]

Place registers between each step

[Diagram: the same datapath with the pipeline registers IF/ID, ID/EX, EX/MEM, and MEM/WB inserted between the stages]

Example

  10   lw   r1, r2(35)
  14   addI r2, r2, 3
  20   sub  r3, r4, r5
  24   beq  r6, r7, 100
  30   ori  r8, r9, 17
  34   add  r10, r11, r12
  100  and  r13, r14, 15

Start: Fetch 10

[Diagram: cycle 1 — instruction 10 (lw r1, r2(35)) is in the fetch stage; the rest of the pipeline holds no-ops]
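The cycle-by-cycle walk on these slides can be generated mechanically. A simplified sketch (it assumes one instruction issued per cycle and ignores the redirect when the beq is taken, which the final slide of the walk shows):

```python
def stage_of(inst_index, cycle):
    """In an ideal 5-stage pipeline issuing one instruction per cycle,
    instruction i (0-based issue order) occupies stage (cycle - i)
    during 'cycle' (1-based), if that falls within the 5 stages."""
    stages = ["Fetch", "Decode", "Exec", "Mem", "WB"]
    k = cycle - inst_index
    return stages[k - 1] if 1 <= k <= len(stages) else None

# Issue order from the example program above
program = ["lw r1, r2(35)", "addI r2, r2, 3", "sub r3, r4, r5",
           "beq r6, r7, 100", "ori r8, r9, 17"]

# Cycle 3 of the walk: instruction 20 fetched, addI decoded, lw in Exec
assert stage_of(0, 3) == "Exec"    # lw
assert stage_of(1, 3) == "Decode"  # addI
assert stage_of(2, 3) == "Fetch"   # sub
```

By cycle 5 all five stages are busy at once, which is the steady state the slides are building toward.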
• Fetch 14, Decode 10
• Fetch 20, Decode 14, Exec 10
• Fetch 24, Decode 20, Exec 14, Mem 10
• Fetch 30, Dcd 24, Ex 20, Mem 14, WB 10
• Fetch 100, Dcd 30, Ex 24, Mem 20, WB 14
[Each snapshot shows the instruction fields, register values and control signals latched in the pipeline registers; the taken beq redirects the fetch to address 100]

Pipelining the Load Instruction
[Diagram: three lw instructions overlapped across Cycles 1–7, each passing through Ifetch, Reg/Dec, Exec, Mem, Wr]
The five independent functional units in the pipeline datapath are:
– Instruction Memory for the Ifetch stage
– Register File's Read ports (bus A and bus B) for the Reg/Dec stage
– ALU for the Exec stage
– Data Memory for the Mem stage
– Register File's Write port (bus W) for the Wr stage

Pipelining the R Instruction
[Diagram: an R-type instruction uses four stages across Cycles 1–4: Ifetch, Reg/Dec, Exec, Wr]
• Ifetch: fetch the instruction from the Instruction Memory
• Reg/Dec: register fetch and instruction decode
• Exec:
– ALU operates on the two register operands
– Update PC
• Wr: write the ALU output back to the register file

Pipelining Both L and R Type
[Diagram: a mix of R-type and Load instructions overlapped across Cycles 1–9]
Oops! We have a problem!
We have a pipeline conflict, or structural hazard:
– Two instructions try to write to the register file at the same time!
– Only one write port

Important Observations
• Each functional unit can only be used once per instruction
• Each functional unit must be used at the same stage for all instructions:
– Load uses the Register File's Write Port during its 5th stage
– R-type uses the Register File's Write Port during its 4th stage

Solution
• Delay the R-type's register write by one cycle:
– Now R-type instructions also use the Reg File's write port at Stage 5
– The Mem stage is a NOOP stage: nothing is being done.
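The pipelining speedup arithmetic from the earlier example (an 8 ns non-pipelined instruction against a 2 ns pipelined clock, assuming the five-stage pipeline used throughout these slides) can be checked with a short sketch; the function name and its defaults are illustrative, not from the slides.

```python
def exec_times(n, t_instr=8, stages=5, t_clock=2):
    """Total time (ns) to run n instructions, non-pipelined vs. pipelined.

    Non-pipelined: each instruction takes t_instr ns, back to back.
    Pipelined: after a (stages - 1)-cycle fill delay, one instruction
    completes every t_clock ns.
    """
    t_seq = n * t_instr
    t_pipe = (stages + n - 1) * t_clock
    return t_seq, t_pipe

# 3 instructions: fill time dominates
t_seq, t_pipe = exec_times(3)
print(t_seq, t_pipe, round(t_seq / t_pipe, 1))   # 24 14 1.7

# 1003 instructions: throughput dominates
t_seq, t_pipe = exec_times(1003)
print(t_seq, t_pipe, round(t_seq / t_pipe, 2))   # 8024 2014 3.98
```

With only 3 instructions the speedup is 1.7; with ~1000 instructions it approaches the ideal 8/2 = 4, which is exactly the throughput argument the slide makes.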
[Diagram: the fixed schedule – R-type instructions now pass through Ifetch, Reg/Dec, Exec, a NOOP Mem stage, then Wr, so all register writes occur in Stage 5]

Datapath (Without Pipeline)
Register transfers implemented by the datapath:
IR <- Mem[PC]; PC <- PC + 4
A <- R[rs]; B <- R[rt]
S <- A + B; S <- A or ZX; S <- A + SX
M <- Mem[S]; Mem[S] <- B
if Cond then PC <- PC + SX
R[rd] <- S; R[rt] <- S; R[rd] <- M
[Diagram: Inst. Mem, Reg File, Exec and Data Mem connected directly, with no pipeline registers]

Datapath (With Pipeline)
IR <- Mem[PC]; PC <- PC + 4
A <- R[rs]; B <- R[rt]
S <- A + B; S <- A or ZX; S <- A + SX
M <- S; M <- Mem[S]; Mem[S] <- B
if Cond then PC <- PC + SX
R[rd] <- M; R[rt] <- M
[Diagram: the same datapath with A, B, S, M and D registers latching values between the stages]

Structural Hazard and Solution
[Diagram: Load followed by Instr 1–4 over time (clock cycles); each instruction uses Mem, Reg, ALU, Mem, Reg in successive cycles]

Control Hazard – #1 Stall
[Diagram: Add, Beq, then Load; the Load's fetch is delayed ("lost potential") until the branch outcome is known]
• Stall: wait until the decision is clear
• Impact: 2 lost cycles (i.e. 3 clock cycles per branch instruction) => slow

Control Hazard – #2 Predict
[Diagram: Add, Beq, Load issued back to back under a branch prediction]
• Predict: guess one direction, then back up if wrong
• Impact: 0 lost cycles per branch instruction if right, 1 if wrong (right 50% of the time)
• More dynamic scheme: history of 1 branch

Control Hazard – #3 Delayed Branch
[Diagram: Add, Beq, a Misc instruction in the delay slot, then Load]
• Delayed Branch: redefine branch behavior (the branch takes place after the next instruction)
• Impact: 0 clock cycles per branch instruction if we can find an instruction to put in the "slot" (~50% of the time)

Data Hazards (RAW)
Dependencies backwards in time are hazards:
add r1, r2, r3
sub r4, r1, r3
and r6, r1, r7
or r8, r1, r9
xor r10, r1, r11
[Diagram: r1 is written in add's WB stage, but the following instructions read r1 in earlier cycles]

Data Hazards [contd…]
• "Forward" the result from one stage to another:
[Diagram: the same sequence with forwarding paths carrying add's ALU result directly to the dependent instructions]

Data Hazards [contd…]
Dependencies backwards in time are hazards:
lw r1, 0(r2)
sub r4, r1, r3
• Can't solve this with forwarding: we must delay/stall the instruction that depends on the load

Hazard Detection
[Diagram: stage-overlap patterns that give rise to each hazard class – structural hazards, RAW (read after write) data hazards, control hazards, WAW (write after write) data hazards, and WAR (write after read) data hazards]

Forwarding Unit For Resolving Data Hazards
www.ee.unt.edu/public/guturu/MIPS-Pipeline-With-Forwarding-Unit.pdf

Hazard Detection and Forwarding Units
www.ee.unt.edu/public/guturu/MIPS_Pipeline_Hazard_Detection.jpg

Three Generic Data Hazards
• Read After Write (RAW): InstrJ tries to read an operand before InstrI writes it
I: add r1, r2, r3
J: sub r4, r1, r3
• Caused by a "Data Dependence" (in compiler nomenclature). This hazard results from an actual need for communication.

Three Generic Data Hazards
• Write After Read (WAR): InstrJ writes an operand before InstrI reads it
I: sub r4, r1, r3
J: add r1, r2, r3
K: mul r6, r1, r7
• Called an "anti-dependence" by compiler writers.
This results from reuse of the name "r1".
• Can't happen in the MIPS 5-stage pipeline because:
– All instructions take 5 stages, and
– Reads are always in stage 2, and
– Writes are always in stage 5

Three Generic Data Hazards
• Write After Write (WAW): InstrJ writes an operand before InstrI writes it
I: sub r1, r4, r3
J: add r1, r2, r3
K: mul r6, r1, r7
• Called an "output dependence" by compiler writers. This also results from reuse of the name "r1".
• Can't happen in the MIPS 5-stage pipeline because:
– All instructions take 5 stages, and
– Writes are always in stage 5
• We will see WAR and WAW in later, more complicated pipes

Hazard Detection
Suppose instruction i is about to be issued and a predecessor instruction j is in the instruction pipeline.
[Diagram: a window on execution – only pending instructions can cause hazards]
• A RAW hazard exists on a register if Rregs(i) ∩ Wregs(j) ≠ Ø
• A WAW hazard exists on a register if Wregs(i) ∩ Wregs(j) ≠ Ø
• A WAR hazard exists on a register if Wregs(i) ∩ Rregs(j) ≠ Ø

Computing CPI
• Start with the base CPI
• Add stalls:
CPI = CPI_base + CPI_stall
CPI_stall = STALL_type1 x freq_type1 + STALL_type2 x freq_type2
• Suppose:
– CPI_base = 1
– freq_branch = 20%, freq_load = 30%
– Branches always cause a 1-cycle stall
– Loads cause a 2-cycle stall
• Then: CPI = 1 + (1 x 0.20) + (2 x 0.30) = 1.8

Summary
• Control signals need to be propagated
• Insert registers between every stage to "remember" and "propagate" values
• Solutions to the Control Hazard are Stall, Predict and Delayed Branch
• The solution to the Data Hazard is "Forwarding"
• Effective CPI = CPI_ideal + CPI_stall

Memory Subsystem and Cache
Adapted from lecture notes of Dr. Patterson and Dr. Kubiatowicz of UC Berkeley

The Big Picture
[Diagram: Processor (Control + Datapath), Memory, Input and Output]

Technology Trends
         Capacity        Speed (latency)
Logic:   2x in 3 years   2x in 3 years
DRAM:    4x in 3 years   2x in 10 years
Disk:    4x in 3 years   2x in 10 years

DRAM generations (capacity ratio 1000:1, but cycle-time ratio only about 2:1!):
Year   Size     Cycle Time
1980   64 Kb    250 ns
1983   256 Kb   220 ns
1986   1 Mb     190 ns
1989   4 Mb     165 ns
1992   16 Mb    145 ns
1995   64 Mb    120 ns

Technology Trends [contd…]
Processor-DRAM Memory Gap (latency)
[Chart, 1980–2000: CPU performance grows ~60%/yr ("Moore's Law", 2X/1.5 yr) while DRAM latency improves only ~9%/yr (2X/10 yrs, "Less' Law?"); the Processor-Memory Performance Gap grows ~50%/year]

The Goal: Large, Fast, Cheap Memory!!!
• Fact:
– Large memories are slow
– Fast memories are small
• How do we create a memory that is large, cheap and fast (most of the time)?
– Hierarchy
– Parallelism
• By taking advantage of the principle of locality:
– Present the user with as much memory as is available in the cheapest technology.
– Provide access at the speed offered by the fastest technology.
[Diagram: the hierarchy – Registers (~1 ns, 100s of bytes), On-Chip Cache (~10 ns, K bytes), Second Level Cache (SRAM) and Main Memory (DRAM) (~100 ns, M bytes), Secondary Storage / Disk (~10,000,000 ns = 10 ms, G bytes), Tertiary Storage / Tape (~10,000,000,000 ns = 10 sec, T bytes)]

Today's Situation
• Rely on caches to bridge the gap
• Microprocessor-DRAM performance gap = the time of a full cache miss, in instructions executed:
– 1st Alpha (7000): 340 ns / 5.0 ns = 68 clks x 2, or 136 instructions
– 2nd Alpha (8400): 266 ns / 3.3 ns = 80 clks x 4, or 320 instructions

Memory Hierarchy (1/4)
• Processor
– executes programs
– runs on the order of nanoseconds to picoseconds
– needs to access code and data for programs: where are these?
• Disk
– HUGE capacity (virtually limitless)
– VERY slow: runs on the order of milliseconds
– so how do we account for this gap?
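The miss-cost figures in the "Today's Situation" slide are just the miss latency divided by the cycle time, scaled by how many instructions the processor could have issued per clock; a minimal sketch (the function name is mine, not from the slides):

```python
def miss_cost(miss_ns, cycle_ns, issue_width):
    """Full cache-miss cost, in clocks and in forgone instruction issues."""
    clks = int(miss_ns / cycle_ns)   # whole clock cycles spent waiting
    return clks, clks * issue_width

print(miss_cost(340, 5.0, 2))  # 1st Alpha (7000): (68, 136)
print(miss_cost(266, 3.3, 4))  # 2nd Alpha (8400): (80, 320)
```

The same gap expressed in instructions keeps growing as clocks get faster and issue widths get wider, which is why the hierarchy below matters.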
Memory Hierarchy (2/4)
• Memory (DRAM)
– smaller than disk (not limitless capacity)
– contains a subset of the data on disk: basically the portions of programs that are currently being run
– much faster than disk: memory accesses don't slow down the processor quite as much
– Problem: memory is still too slow (hundreds of nanoseconds)
– Solution: add more layers (caches)

Memory Hierarchy (3/4)
[Diagram: Processor at the top; Level 1, Level 2, Level 3, …, Level n below, with increasing distance from the processor, decreasing cost/MB, and increasing size at each level]

Memory Hierarchy (4/4)
• If a level is closer to the Processor, it must be:
– smaller
– faster
– a subset of all lower (more distant) levels (it contains the most recently used data)
• Each level must contain at least all the data held in the levels above it
• The lowest level (usually disk) contains all available data

Analogy: Library
• You're writing a term paper (the Processor) at a table in Evans
• Evans Library is equivalent to disk:
– essentially limitless capacity
– very slow to retrieve a book
• The table is memory:
– smaller capacity: you must return a book when the table fills up
– easier and faster to find a book there once you've already retrieved it

Analogy: Library [contd…]
• Open books on the table are the cache:
– smaller capacity: only a few open books fit on the table; again, when the table fills up, you must close a book
– much, much faster to retrieve data
• Illusion created: the whole library open on the tabletop
– Keep as many recently used books open on the table as possible, since you are likely to use them again
– Also keep as many books on the table as possible, since that is faster than going to the library

Memory Hierarchy Basics
• Disk contains everything.
• When the Processor needs something, bring it into all the lower levels of memory.
• Cache contains copies of the data in memory that are being used.
• Memory contains copies of the data on disk that are being used.
• The entire idea is based on Temporal Locality: if we use it now, we'll want to use it again soon (a Big Idea)

Caches: Why does it Work?
[Chart: probability of reference vs. address (0 to 2^n - 1) – references cluster around recently used addresses]
• Temporal Locality (Locality in Time): keep the most recently accessed data items closer to the processor
• Spatial Locality (Locality in Space): move blocks consisting of contiguous words to the upper levels
[Diagram: the upper-level memory exchanges blocks (Blk X, Blk Y) with the lower-level memory; the processor reads from and writes to the upper level]

Cache Design Issues
• How do we organize the cache?
• Where does each memory address map to? (Remember that the cache is a subset of memory, so multiple memory addresses map to the same cache location.)
• How do we know which elements are in the cache?
• How do we quickly locate them?

Direct Mapped Cache
• In a direct-mapped cache, each memory address is associated with one possible block within the cache
– Therefore, we only need to look in a single location in the cache for the data, if it exists in the cache
– The block is the unit of transfer between cache and memory

Direct Mapped Cache [contd…]
[Diagram: a 16-byte memory (addresses 0–F) mapping onto a 4-entry cache (indices 0–3)]
• Cache Location 0 can be occupied by data from:
– Memory locations 0, 4, 8, …
– In general: any memory location that is a multiple of 4

Issues with Direct Mapped Cache
• Since multiple memory addresses map to the same cache index, how do we tell which one is in there?
• What if we have a block size > 1 byte?
• Result: divide the memory address into three fields:
ttttttttttttttttt iiiiiiiiii oooo
– tag: to check that we have the correct block
– index: to select the block
– offset: to select the byte within the block

Example of a direct mapped cache
• For a 2^N byte cache:
– The uppermost (32 - N) bits are always the Cache Tag
– The lowest M bits are the Byte Select (Block Size = 2^M)
[Diagram: a 32-bit address split into Cache Tag (bits 31–9, e.g. 0x50), Cache Index (bits 8–4, e.g. 0x01) and Byte Select (bits 3–0, e.g. 0x00); the tag is stored as part of the cache "state", alongside a Valid Bit and the Cache Data (Byte 0 … Byte 31 in block 0, Byte 32 … Byte 63 in block 1, …, up to Byte 1023)]

Terminology
• All fields are read as unsigned integers.
• Index: specifies the cache index (which "row" of the cache we should look in)
• Offset: once we've found the correct block, specifies which byte within the block we want
• Tag: the remaining bits after offset and index are determined; these are used to distinguish between all the memory addresses that map to the same location

Terminology [contd…]
• Hit: the data appears in some block in the upper level (example: Block X)
– Hit Rate: the fraction of memory accesses found in the upper level
– Hit Time: the time to access the upper level, which consists of RAM access time + the time to determine hit/miss
• Miss: the data needs to be retrieved from a block in the lower level (Block Y)
– Miss Rate = 1 - (Hit Rate)
– Miss Penalty: the time to replace a block in the upper level + the time to deliver the block to the processor
• Hit Time << Miss Penalty

How is the hierarchy managed?
• Registers <-> Memory
– by the compiler (programmer?)
• Cache <-> Memory
– by the hardware
• Memory <-> Disks
– by the hardware and operating system (virtual memory)
– by the programmer (files)

Example
• Suppose we have 16 KB of data in a direct-mapped cache with 4-word blocks
• Determine the size of the tag, index and offset fields if we're using a 32-bit architecture
• Offset
– need to specify the correct byte within a block
– a block contains 4 words = 16 bytes = 2^4 bytes
– need 4 bits to specify the correct byte

Example [contd…]
• Index: (~index into an "array of blocks")
– need to specify the correct row in the cache
– the cache contains 16 KB = 2^14 bytes
– a block contains 2^4 bytes (4 words)
– # rows/cache = # blocks/cache (since there's one block/row)
= (bytes/cache) / (bytes/row) = 2^14 / 2^4 = 2^10 rows/cache
– need 10 bits to specify this many rows

Example [contd…]
• Tag: use the remaining bits as the tag
– tag length = mem addr length - offset - index = 32 - 4 - 10 = 18 bits
– so the tag is the leftmost 18 bits of the memory address

Accessing data in cache
• Ex.: 16 KB of data, direct-mapped, 4-word blocks
• Read 4 addresses: 0x00000014, 0x0000001C, 0x00000034, 0x00008014
• Memory values (only the cache/memory level of the hierarchy is shown):
Address (hex)   Value of Word
00000010        a
00000014        b
00000018        c
0000001C        d
00000030        e
00000034        f
00000038        g
0000003C        h
00008010        i
00008014        j
00008018        k
0000801C        l

Accessing data in cache [contd…]
• The 4 addresses divided (for convenience) into Tag, Index, Byte Offset fields:
000000000000000000 0000000001 0100   (0x00000014)
000000000000000000 0000000001 1100   (0x0000001C)
000000000000000000 0000000011 0100   (0x00000034)
000000000000000010 0000000001 0100   (0x00008014)
       Tag            Index   Offset

16 KB Direct Mapped Cache, 16 B blocks
• Valid bit: determines whether anything is stored in that row (when the computer is initially turned on, all entries are invalid)
[Diagram: cache table with Valid, Tag and data columns 0x0-3, 0x4-7, 0x8-b, 0xc-f for indices 0–1023; all Valid bits start at 0]

Read 0x00000014 = 0…00 0..001 0100
• Tag = 0, Index = 1, Offset = 4, so we read cache block 1
• No valid data there (Valid bit is 0): a miss
• So load that data into the cache, setting the tag and the valid bit
• Read from the cache at the offset: return word b
Read 0x0000001C = 0…00 0..001 1100
• Index = 1 again: the data is valid and the tag matches (a hit)
• So read at the offset: return word d

Read 0x00000034 = 0…00 0..011 0100
• Index = 3, so we read cache block 3
• No valid data there: a miss
• Load that cache block (e, f, g, h), setting the tag and the valid bit, and return word f

Read 0x00008014 = 0…10 0..001 0100
• Index = 1, so we read Cache Block 1; the data is valid
• But Cache Block 1's tag does not match (0 != 2): a miss
• So replace block 1 with the new data (i, j, k, l) and the new tag (2)
• And return word j

Things to Remember
• We would like to have the capacity of disk at the speed of the processor: unfortunately, this is not feasible.
• So we create a memory hierarchy:
– each successively lower level contains the "most used" data from the next higher level
• Exploit temporal and spatial locality

Virtual Memory
Adapted from lecture notes of Dr. Patterson and Dr. Kubiatowicz of UC Berkeley

View of Memory Hierarchies
[Diagram: Regs <-> Cache <-> L2 Cache <-> Memory <-> Disk <-> Tape, from the faster upper level to the larger lower level; the units of transfer are instructions/operands, blocks, blocks, pages and files. "Thus far" covers registers through memory; "Next" is Virtual Memory (memory <-> disk)]

Memory Hierarchy: Some Facts
Level           Capacity     Access Time     Cost                   Xfer Unit (staged by)
CPU Registers   100s bytes   <10s ns         —                      1-8 bytes (prog./compiler)
Cache           K bytes      10-100 ns       $.01-.001/bit          8-128 bytes (cache cntl)
Main Memory     M bytes      100 ns - 1 us   $.01-.001              512-4K bytes, pages (OS)
Disk            G bytes      ms              10^-4 - 10^-3 cents    Mbytes, files (user/operator)
Tape            infinite     sec-min         10^-6                  —

Virtual Memory: Motivation
• If the Principle of Locality allows caches to offer (usually) the speed of cache memory with the size of DRAM memory, then recursively, why not use it at the next level to get the speed of DRAM memory with the size of Disk memory?
• Treat Memory as a "cache" for Disk!!!
• Share memory between multiple processes but still provide protection:
– don't let one program read/write the memory of another
• Address space:
– give each program the illusion that it has its own private memory
– Suppose code starts at addr 0x40000000.
But different processes have different code, both at the same address! So each program has a different view of memory.

Advantages of Virtual Memory
• Translation:
– A program can be given a consistent view of memory, even though physical memory is scrambled
– Makes multithreading reasonable (now used a lot!)
– Only the most important part of a program (its "Working Set") must be in physical memory.
– Contiguous structures (like stacks) use only as much physical memory as necessary, yet can still grow later.
• Protection:
– Different threads (or processes) are protected from each other.
– Different pages can be given special behavior (Read Only, Invisible to user programs, etc.)
– Kernel data is protected from User programs
– Very important for protection from malicious programs => far more "viruses" under Microsoft Windows
• Sharing:
– Can map the same physical page to multiple users ("Shared memory")

Virtual to Physical Address Translation
[Diagram: a program operates in its virtual address space; a hardware mapping translates each virtual address (inst. fetch, load, store) into a physical address in physical memory (incl. caches)]
• Each program operates in its own virtual address space, as if it were the only program running
• Each is protected from the others
• The OS can decide where each goes in memory
• Hardware (HW) provides the virtual -> physical mapping

Mapping Virtual Memory to Physical Memory
• Divide memory into equal-sized chunks (about 4 KB)
• Any chunk of Virtual Memory can be assigned to any chunk of Physical Memory (a "page")
[Diagram: a process's virtual address space (Code, Static, Heap, Stack) mapped chunk by chunk into a 64 MB physical memory]

Paging Organization (eg: 1 KB pages)
• The page is the unit of mapping:
[Diagram: virtual pages 0, 1, …, 31 at virtual addresses 0, 1024, …, 31744 are translated (the "Trans MAP") to physical pages 0, 1, …, 7 at physical addresses 0, 1024, …, 7168]
• The page is also the unit of transfer from disk to physical memory

Virtual Memory Mapping
• Virtual Address = page no. | offset
• The Page Table Base Register points to the Page Table; the page number indexes into it (actually, a concatenation)
• Each Page Table entry holds a Valid bit, Access Rights and the Physical Page Address
• Physical Memory Address = Physical Page Address concatenated with the offset
• The Page Table is located in physical memory

Issues in VM Design
• What is the size of the information blocks transferred from secondary to main storage (M)? => page size (contrast with the physical block size on disk, i.e. sector size)
• Which region of M is to hold the new block? => placement policy
• How do we find a page when we look for it? => block identification
• When a block of information is brought into M and M is full, some region of M must be released to make room for the new block => replacement policy
• What do we do on a write? => write policy
• A missing item is fetched from secondary memory only on the occurrence of a fault => demand load policy
[Diagram: reg <-> cache <-> mem (pages/frames) <-> disk]

Virtual Memory Problem #1
• Mapping every address means 1 extra memory access for every memory access
• Observation: since there is locality in the pages of data, there must be locality in the virtual addresses of those pages
• Why not use a cache of virtual-to-physical address translations to make translation fast?
(small is fast)
• For historical reasons, this cache is called a Translation Lookaside Buffer, or TLB

Memory Organization with TLB
• TLBs are usually small, typically 128 - 256 entries
• Like any other cache, the TLB can be fully associative, set associative, or direct mapped
[Diagram: the Processor sends a VA to the TLB; on a TLB hit the PA goes to the cache; on a TLB miss, translation proceeds through the page table; on a cache hit, data returns to the processor, otherwise it comes from Main Memory]

Typical TLB Format
Virtual Address | Physical Address | Dirty | Ref | Valid | Access Rights
• The TLB is just a cache on the page table mappings
• TLB access time is comparable to cache access time (much less than main memory access time)
• Ref: used to help calculate LRU on replacement
• Dirty: since we use write back, we need to know whether or not to write the page to disk when it is replaced

What if not in TLB?
• Option 1: Hardware checks the page table and loads the new Page Table Entry into the TLB
• Option 2: Hardware traps to the OS, and it is up to the OS to decide what to do
• MIPS follows Option 2: the hardware knows nothing about the page table format

TLB Miss
• If the address is not in the TLB, MIPS traps to the operating system
• The operating system knows which program caused the TLB fault (or page fault), and knows what virtual address was requested
[Example TLB contents: valid = 1, virtual page 2, physical page 9]

TLB Miss: If the data is in Memory
• We simply add the entry to the TLB, evicting an old entry from the TLB
[Diagram: the TLB table (valid / virtual / physical columns) with the new entry added]

What if the data is on disk?
• We load the page off the disk into a free block of memory, using a DMA transfer
– Meanwhile we switch to some other process waiting to be run
• When the DMA is complete, we get an interrupt and update the process's page table
– So when we switch back to the task, the desired data will be in memory

What if the memory is full?
• We load the page off the disk into the least recently used block of memory, using a DMA transfer
– Meanwhile we switch to some other process waiting to be run
• When the DMA is complete, we get an interrupt and update the process's page table
– So when we switch back to the task, the desired data will be in memory

Virtual Memory Problem #2
• The Page Table is too big!
– 4 GB Virtual Memory ÷ 4 KB pages ~ 1 million Page Table Entries
– 4 MB just for the Page Table of 1 process; with 25 processes, 100 MB for Page Tables!
• A variety of solutions trade off the memory size of the mapping function against slower handling when the TLB misses
– Make the TLB large enough, and highly associative, so that it rarely misses on address translation

Two Level Page Tables
[Diagram: a Super Page Table points to 2nd-level Page Tables, which map only the used regions (Code, Static, Heap, Stack) of the virtual address space into the 64 MB physical memory]

Summary
• Apply the Principle of Locality recursively
• Reduce the miss penalty? Add a (L2) cache
• Manage memory to disk? Treat it as a cache:
– Included protection as a bonus, now critical
– Use a Page Table of mappings vs. tag/data in a cache
• Virtual-to-physical address translation too slow?
– Add a cache of virtual-to-physical address translations, called a TLB

Summary
• Virtual Memory allows protected sharing of memory between processes, with less swapping to disk and less fragmentation than always-swap or base/bound schemes
• Spatial Locality means the Working Set of pages is all that must be in memory for a process to run fairly well
• A TLB reduces the performance cost of VM
• We need a more compact representation to reduce the memory-size cost of a simple 1-level page table (especially for 32- to 64-bit address spaces)
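The page-table arithmetic from "Virtual Memory Problem #2" can be reproduced directly; the 4-byte entry size used here is an assumption implied by the slide's 4 MB total, not stated explicitly.

```python
def page_table_stats(vaddr_bits=32, page_bytes=4 * 1024, pte_bytes=4, processes=25):
    """Cost of a flat, 1-level page table (pte_bytes = 4 is an assumed size)."""
    entries = 2 ** vaddr_bits // page_bytes   # one PTE per virtual page
    per_process = entries * pte_bytes         # bytes of page table per process
    return entries, per_process, per_process * processes

entries, per_proc, total = page_table_stats()
print(entries, per_proc // 2 ** 20, total // 2 ** 20)  # 1048576 4 100
```

That is ~1 million entries and 4 MB per process, and 100 MB across 25 processes, which is exactly the pressure that motivates the two-level page tables described above.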