Computer Science 3724 Fall Semester, 2014 1 What will we do in this course? • We will look at the design of an instruction set for a simple processor. The processor is based on a “real” processor, the MIPS R2000. • We will see how logic relates to switching (and transistors) and how logic forms a calculus for designing digital circuits. • We will construct the basic logic blocks required to build a simple computer. • We will look at the internal structure of that simple processor, having a 32 bit instruction length and a 32 bit data word. • We will design the processor, and add enhancements to improve the speed of execution of its instructions. • Then we will design a memory system for the processor, and see how we can match its speed to the processor. 2 Why bother with all this? • Both software and hardware affect performance. Understanding how they interact is essential. • Understanding how computers work helps us be better programmers. • We may have to provide advice on which computer to purchase for some application. • Computing performance has improved exponentially for 40 years. – Why is the growth rate so fast? – How long can this continue? – How does this growth affect the programs I design? – How does it affect the value of hardware and software? • How does increased computation speed affect computer peripherals? (e.g., input/output devices.) 3 About questions??? Who questions much, shall learn much, and retain much.” — Francis Bacon “Asking a question is embarrassing for a moment, but not asking is embarrassing for a lifetime.” — Haruki Murakami, Kafka on the Shore, 2006, p. 255. If there is something you don’t understand, or need clarified, ask. If you think of a question after class, come to my office and ask. If you don’t understand the answer, ask again! 4 A possible users view of a computer system USER DESKTOP BROWSER OFFICE ENVIRONMENT EDITOR MAIL SPREADSHEET DATABASE COMPILERS OPERATING SYSTEM LIBRARIES COMPUTER INPUT AND OUTPUT DEVICES MEMORY 5 A typical “desktop system”: In this course we will be concerned mainly with the processor. (The part typically not on the desktop.) 6 Inside the processor box: Where is the processor? 7 A look at the “motherboard”: 8 The basic functional blocks of a simple computer CPU MEMORY INPUT/ OUTPUT We sometimes refer to five classic components of as computer. We often consider the CPU, or processor, as two components — a datapath and a control unit. The datapath performs arithmetic and logical operations on data stored temporarily in internal registers. The control unit determines exactly what operations are performed. It also controls access to memory and I/O devices. 9 What are some characteristics of those components? Characteristics of input: • wide range of speed — keyboard, touch screen, network, video • different modes — touch, video, voice, . . . • different sampling rates — temperature, speed, . . . Characteristics of output: • again a wide range of speed — text, speech, video, . . . • range of technologies — almost any controllable device Characteristics of the processor: • Does relatively simple operations at high speed • Does exactly as it is instructed • Very efficient for repetitive operations • Technology developing at a consistent (rapid) rate — roughly doubling every two years Characteristics of memory: • Processors require fast memory, to match processor speeds. • Very fast memory is relatively expensive, slow memory is relatively cheap. 
10 Here some of the inputs and outputs are obvious, but note the lack of wire connections. Where is the processor? 11 Inputs? Outputs? Processor? 12 A typical instruction for a computer w := a + b + c What does this mean to the computer? What are a, b, c and w? How is the expression evaluated? Is it the same as w := a + c + b? How many computer instructions will this expression require? How long will it take to execute? Does the execution time depend on the values of a, b, and c? Is the result exact? Why or why not? Does the speed or accuracy depend on the particular processor? Could using more than one processor speed up the calculation? How about if the calculation was more complex? 13 Historical performance increase: The following shows the increase in number of transistors for Intel processors, memory, and other devices (from cmg.org): These processors span the range of 4 to 64 bit processors. Note the exponential growth in number of transistors, roughly doubling every two years. This growth was first observed by Gordon Moore, the co-founder of Intel, and is called Moore’s law. 14 Projections for the future: The following graphs use data from the International Technology Roadmap for Semiconductors (ITRS) 2004 update documentation. We can see that the predictions were actually pessimistic! ITRS produces a new roadmap every two years, and the latest is the 2013 roadmap. (See http://www.itrs.net) Transistor size: 100 90 channel width (nm) 80 70 60 50 40 30 20 10 2002 2004 2006 2008 15 2010 Year 2012 2014 2016 2018 Memory size (GB) - single chip Memory size (GB/chip): 10 1 2002 2004 2006 2008 2010 2012 Year 2014 2016 Note the log scale on the y-axis. This plot shows a stepwise exponential growth with time. Why does memory have this behavior? What happens between the beginning and end of each step? 16 2018 Clock Frequency: Clock frequency (GHz) - on chip 100 10 1 2002 2004 2006 2008 2010 2012 2014 2016 2018 Year 17 Number of transistors on chip - processors: Millions of transistors on chip 100000 low cost high performance 10000 1000 100 2002 2004 2006 2008 2010 2012 2014 2016 2018 Year 18 Can this continue? What will stop this kind of growth? Presently, a transistor in a high performance processor has a “size” of about 25 nm. A silicon atom has a “size” of about 0.54 nm (actually, the distance between atoms in a silicon crystal.) When transistors change state, or “switch”, they use energy. For a fixed power supply voltage, the heat energy produced depends on the number of transistors. Presently, processors require high speed fans to keep them cool enough. Cooling a processor is a serious problem, even now. What limits the size of a switching device? What is the minimum energy required to remember a bit of information? Is there a limit to the speed at which a computation can be performed? 19 Other technologies following a type of Moore’s Law: The following data was taken from the Technium website (http://www.kk.org/thetechnium) Doubling time of various technologies, in months: Technology measure Time Optical network dollars/bit 9 Wireless network bits/second 10 Data communication bits/dollar 12 Digital cameras pixels/dollar 12 Magnetic storage GB/in2 12 RAM (dynamic) bits/dollar 18 Processor power consumption watts/cm2 18 DNA sequencing dollars/base pair 18 Disk storage GB/dollar 20 Why does this happen for some technologies and not others? What limits the growth in these cases? 20 Instruction set architectures What is the minimum instruction set required for a processor? 
Consider a flowchart for a program. i := i − 1 no i<0? yes Only two symbols are really necessary; data operations (boxes) and control operations (arrows, or links). Does this mean that we really only need two instructions? Can input and output be handled, as well? 21 Actually, it is possible to combine both types of operation in one instruction, and this is all that is required to have a fully functioning computer. Can you figure out what this instruction could be? A machine with only one instruction would have interesting properties. It is an interesting exercise to determine what they are. Although a single instruction processor is interesting, it is not very efficient, since many instructions are required to do even simple operations. The course home page has a link to a simulator for a single instruction processor. A more useful exercise is to determine a small but efficient instruction set for a particular processor. 22 What must an instruction contain? • An encoding for the operation to be performed (op code) • The addresses of the operands, and a destination address for the result The instruction encoding (op code) depends on the number of different instructions to be encoded. An instruction may require 0, 1, 2, more operands. An example of a type of instruction which requires no operand is an operation on a stack. Here, the operation (e.g., addition) uses the top value and the next value on the stack, and the result replaces the top of the stack. Typical stack operations are push (place a value on the stack) and pop (removes a value from the stack). Some operations are inherently unary operations; e.g., negation. More complex operations (e.g., addition) can add an operand to the value in a fixed register (often called an accumulator) and store the result in this accumulator. It would have the form Acc ← Mem[addr] op Acc where op is an arbitrary binary operator. 23 Operations using two addresses can have a number of forms. For example: Mem[addr1] ← Mem[addr1] op Mem[addr2] or Acc ← Mem[addr1] op Mem[addr2] Operations using three addresses can implement a full binary operation (like c = a + b) directly: Mem[addr3] ← Mem[addr1] op Mem[addr2] Encoding several memory addresses in an instruction requires a large instruction size. Most processors have at least 32 address bits (4GB memory), so an instruction using three memory operands would require more than 3 × 32 = 96 bits. Some processors have variable length instructions (e.g., INTEL processors, used in the PC); others have fixed length instructions (e.g., the MIPS style processors, used in many game processors). Generally, the decoding of fixed length instructions is simpler than the decoding of variable length instructions. It is also common for certain instructions to encode data within the instruction. Typically, the data would be restricted to a constant with a small range of values. (Incrementing by a small number is a common operation, and encoding the data directly in an instruction is efficient.) This is usually called immediate data. 24 How complex should the instructions be? It is possible to have instructions that are quite complex. For example, a well-known processor from the past had a single instruction which could evaluate a polynomial of arbitrary order. There are (at least) two schools of thought on the design of instruction sets. One is that the instruction set should attempt to be as close as possible to modern computer languages. Such ISA’s are called Complex Instruction Set architectures (CISC architectures). 
The idea is that compilers for such architectures are simpler. These architectures typically have variable length instructions, with many addressing modes. Each instruction may take several (or many) machine cycles (clock periods). The PC (Intel, AMD) architectures are of this type. Another is that the instructions should be as simple and fast as possible. These instruction set architectures usually have fixed size instructions, and each instruction completes in a single clock cycle. Such ISA’s are called Reduced Instruction Set architectures (RISC architectures). The MIPS architecture which we will be discussing later is of this type. 25 Register files Many processors have sets of registers, often called register files. Instructions can address individual registers in the register file, using far fewer bits than a full memory address. For example, the MIPS processor has 32 32 bit registers; each register therefore requires only a 5 bit address, and a three operand instruction operating on registers only would require 3 × 5 = 15 bits for the operand addresses. The PC has 8 32 bit general registers, and a number of special purpose registers. Some processors (those in the PC, for example) allow instructions which mix memory and register operations. Other processors permit arithmetic and logic operations only on registers. Of course, both types have instructions to copy values between the register file and memory. 26 Addressing modes Many processors have several ways of constructing the memory address for a data operand. The register file may be used to provide part of the address. This is particularly useful for list or tabular data structures. The simplest addressing mode is where the address is part of the instruction itself. It may be used for accessing data, or for determining the target address for a branch or jump. Another form of addressing is relative addressing. Here the instruction contains a displacement from the current address. This is most commonly used with branch instructions. For such a branch, the target address is calculated as address = PC + displacement where PC is the program counter, which contains the address of the current instruction. Addressing modes which involve a register often add the value in a register to a displacement value from the instruction. These would be calculated as address = Ri + displacement where Ri is the register designated for the address by the instruction. This type of addressing is called indexed or based addressing. This type of addressing is useful for manipulating list data structures; the list can be traversed simply by incrementing or decrementing the register value. 27 This idea can be extended to the use of two registers. This would be useful for addressing data in a 2-dimensional structure like a table. The address would be calculated as address = Ri + Rj + displacement and is usually called based indexed addressing. Here, each register can be manipulated independently, allowing for row and column operations. More complex addressing modes are also possible. The target of an address can be a data value (corresponding to a variable in a program — the address is the variable name). It can also be an instruction, such as the target of a jump or branch instruction. It can also be another address (this corresponds to a pointer in languages like C, or a reference in other languages.) This capability is called indirect addressing, and may be supported by the instruction set architecture of the processor. 
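As a concrete illustration of indexed (based) addressing, the C fragment below does by hand what the hardware does: a "register" holds the base byte address, the instruction supplies a displacement, and the list is traversed simply by stepping the register by the element size. The array name and values are invented for this sketch, and it is ordinary C rather than MIPS code; it only mimics the address arithmetic. The extension to two registers is described next.

    #include <stdio.h>
    #include <string.h>

    int main(void) {
        int list[5] = {3, 1, 4, 1, 5};               /* a small list in memory                      */
        unsigned char *Ri = (unsigned char *) list;  /* "register" Ri holds the base byte address   */
        int displacement = 0;                        /* constant displacement from the instruction  */

        for (int i = 0; i < 5; i++) {
            int value;
            /* effective address = Ri + displacement, as in the addressing-mode formula above */
            memcpy(&value, Ri + displacement, sizeof value);
            printf("element %d = %d\n", i, value);
            Ri += sizeof(int);                       /* step the register to traverse the list      */
        }
        return 0;
    }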
Indirect addressing is used in the construction of more complex data structures like linked lists, trees, etc. Many processors support several different addressing modes. In fact, the PC supports all the addressing modes mentioned, and several others. 28 Relating instruction sets to logic It is also useful to consider what the internal structure of a computer would be, independent of any particular instruction set. For example, the requirement that instructions and data be fetched from memory (and that memory is independent from the processor) requires that the processor be able to generate and maintain a memory address, and that it be able to provide data to or receive data from memory. This implies that there are two entities which can hold information (an address and a data word) stable long enough for a memory read or write. Typically, in logic, we would implement these with registers. The address is held in the memory address register (MAR). The data is held in the memory data register (MDR). Circuitry is also required to generate and maintain instruction addresses. Most often, the next instruction to be executed is the next instruction in memory. This circuitry is usually called the program counter (PC). The instructions themselves contain addresses for data, and there must be a control unit to decode instructions and manage flow of control (e.g., branches). Data, and computational results, are stored in internal registers. There must be circuitry to perform the required arithmetic and/or logical operations (the datapath). 29 Combining all these observations, we require a structure similar to the following: General Registers and/or Accumulator M D R Instruction decode and Control Unit ALU PC PCU 30 Address Generator M A R The internal structure of a modern style processor — the MIPS R5000: More such photomicrographs are available at url http://micro.magnet.fsu.edu/chipshots 31 The MIPS instruction set architecture The MIPS has a 32 bit architecture, with 32 bit instructions, a 32 bit data word, and 32 bit addresses. It has 32 addressable internal registers requiring a 5 bit register address. Register 0 always has the the constant value 0. Addresses are for individual bytes (8 bits) but instructions must have addresses which are a multiple of 4. This is usually stated as “instructions must be word aligned in memory.” There are three basic instruction types with the following formats: R−type (register) 31 26 25 op 21 20 rs 6 bits 5 bits 16 15 rt 11 10 rd 6 5 0 shamt funct 5 bits 5 bits 5 bits 6 bits I−type (immediate) 31 26 25 op 21 20 rs 16 15 rt 0 immediate 6 bits 5 bits 5 bits 16 bits J−type (jump) 31 26 25 0 op target 6 bits 26 bits All op codes are 6 bits. All register addresses are 5 bits. 32 R−type (register) 31 26 25 op 21 20 rs 16 15 rt rd 11 10 6 5 shamt 0 funct The R-type instructions are 3 operand arithmetic and logic instructions, where the operands are contained in the registers indicated by rs, rt, and rd. For all R-type instructions, the op field is 000000. The funct field selects the particular type of operation for R-type operations. The shamt field determines the number of bits to be shifted (0 to 31). 
These instructions perform the following: R[rd] ← R[rs] op R[rt] Following are examples of R-type instructions: Instruction add add unsigned subtract Example Meaning add $s1, $s2, $s3 $s1 = $s2 + $s3 addu $s1, $s2, $s3 $s1 = $s2 + $s3 sub $s1, $s2, $s3 $s1 = $s2 - $s3 subtract unsigned subu $s1, $s2, $s3 $s1 = $s2 - $s3 and or and $s1, $s2, $s3 $s1 = $s2 & $s3 or $s1, $s2, $s3 $s1 = $s2 | $s3 33 I−type (immediate) 31 26 25 op 21 20 rs 16 15 rt 0 immediate The 16 bit immediate field contains a data constant for an arithmetic or logical operation, or an address offset for a branch instruction. This type of branch is called a relative branch. Following are examples of I-type instructions of type: R[rt] ← R[rs] op imm Instruction add Example Meaning addi $s1, $s2, imm $s1 = $s2 + imm add unsigned addiu $s1, $s2, imm $s1 = $s2 + imm subtract subi $s1, $s2, imm $s1 = $s2 - imm and andi $s1, $s2, imm $s1 = $s2 & imm Another I-type instruction is the branch instruction. Examples of this are: Instruction branch on equal Example Meaning beq $s1, $s2, imm if $s1 == $s2 go to PC + 4 + (4 × imm) branch on not equal bne $s1, $s2, imm if $s1 != $s2 go to PC + 4 + (4 × imm) Why is the imm field multiplied by 4 here? 34 J−type (jump) 31 26 25 0 op target The J-type instructions are all jump instructions. The two we will discuss are the following: Instruction jump Example Meaning j target go to address 4 × target : PC[28:31] jump and link jal target $31 = PC + 4; go to address 4 × target : PC[28:31] Why is the PC incremented by 4? Why is the target field multiplied by 4? Recall that the MIPS processor addresses data at the byte level, but instructions are addressed at the word level. Moreover, all instructions must be aligned on a word boundary (an integer multiple of 4 bytes). Therefore, the next instruction is 4 byte addresses from the current instruction. Since jumps must have an instruction as target, shifting the target address by 2 bits (which is the same as multiplying by 4) allows the instruction to specify larger jumps. Note that the jump instruction cannot span (jump across) all of memory. 35 There are a few more interesting instructions, for comparison, and memory access: R-type instructions: Instruction Example Meaning set less than slt $s1, $s2, $s3 if ($s2 < $s3), $s1=1; else $s1=0 jump register jr $ra go to $ra set less than also has an unsigned form. jump register is typically used to return from a subprogram. I-type instructions: Instruction Example Meaning set less than slti $s1, $s2, imm if ($s2 < imm), $s1=1; immediate else $s1=0 load word lw $s1, imm($s2) $s1 = Memory[$s2 + imm] store word sw $s1, imm($s2) Memory[$s2 + imm] = $s1 load word and store word are the only instructions that access memory directly. Because data must be explicitly loaded before it is operated on, and explicitly stored afterwards, the MIPS is said to be a load/store architecture. This is often considered to be an essential feature of a reduced instruction set architecture (RISC). 36 The MIPS assembly language The previous diagrams showed examples of code in a general form which is commonly used as a simple kind of language for a processor — a language in which each line in the code corresponds to a single instruction in the language understood by the machine. For example, add $1, $2, $3 means take add together the contents of registers $2 and $3 and store the result in register $1. We call this type of language an assembly language. 
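Returning to the questions asked above about why the imm and target fields are multiplied by 4: both fields count words rather than bytes, and the arithmetic can be checked directly. The C sketch below computes a branch target as PC + 4 + 4 × imm and a jump target as the upper 4 bits of the PC concatenated with 4 × target, as in the tables above; the PC and field values are made-up examples.

    #include <stdio.h>
    #include <stdint.h>

    int main(void) {
        uint32_t pc = 0x00400018;      /* example: address of the branch/jump instruction        */

        /* I-type branch: imm is a signed word offset, relative to the following instruction */
        int16_t imm = -3;              /* example offset (a backward branch)                      */
        uint32_t branch_target = pc + 4 + (int32_t)imm * 4;

        /* J-type jump: the 26-bit target is a word address; the top 4 bits come from the PC */
        uint32_t target = 0x0100004;   /* example 26-bit target field                             */
        uint32_t jump_target = (pc & 0xF0000000u) | (target << 2);

        printf("branch target = 0x%08x\n", (unsigned)branch_target);   /* prints 0x00400010 */
        printf("jump target   = 0x%08x\n", (unsigned)jump_target);     /* prints 0x04000010 */
        return 0;
    }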
The language of the machine itself, called the machine language, consists only of 0’s and 1’s — a binary code. The machine language instruction corresponding to the previous instruction (with the different fields identified) is:

    31-26    25-21    20-16    15-11    10-6     5-0
    000000   00010    00011    00001    00000    100000
    op       rs       rt       rd       shamt    funct

There are usually programs, called assemblers, to translate the more human readable assembly code to machine language. 37 Compilers and assemblers A compiler translates a “high level” language like C or Java into the “machine language” for a particular environment (operating system and target machine type). It is generally possible to compile a high-level language program to run on almost any commercial computer system. A single high level language statement corresponds to several, and often many, machine instructions. Some modern language compilers (e.g., Java) produce output that does not correspond to any “real” computer, but rather to a “virtual” or model computer. This output (called bytecode, or p-code) can then be executed by a software model of the virtual machine (interpreted) or further translated into the machine language of the underlying processor. 38 An assembler translates an “assembly language” into the “machine language” for a particular target machine. Assembly languages for different target machines are different. Assembly language instructions normally translate one-for-one to machine instructions. (Some particular combinations of a few machine instructions may correspond to only one assembly instruction.) Assembly code has a simple format. It normally includes labels, instructions, and directives. labels correspond directly to addresses (much like variable names in high-level languages), but are also used to label instructions — for example, a jump target. Labels are character strings followed by “:”. For example, in the code following, loop: is a label. instructions define the particular operations to be executed. directives provide information for the assembler itself. Directives are preceded by a “.”. For example, the directive .align 2 forces the next item to align itself on a word boundary. Typically, there are at least two separate sections, indicated by directives, dividing the program into instructions and data. 39 A simple assembly language program The following shows a short assembly code segment, for an infinite loop:

            .text
            .align 2
            addi $1, $0, 0      # set register 1 to 0
    loop:   sw   $0, 128($1)    # store 0 at 128 + the location
                                # pointed to by register 1
            addi $1, $1, 4      # increment register 1 by 4
            j    loop           # go back to loop

Here, loop is a label, .text and .align are directives. The text following a # is a comment. This corresponds to the following machine language program (assuming it starts at memory location 0):

    location   instruction
    0          001000 00000 00001 00000 00000 000000
    4          101011 00001 00000 00000 00010 000000
    8          001000 00001 00001 00000 00000 000100
    12         000010 00000 00000 00000 00000 000001

40 How does an assembler work? It is a fairly simple process to write a program to translate these instructions into machine code; it is a simple one-for-one translation. The main problem is with labels — forward references, in particular. Most simple assemblers make two “passes” over the assembler code; in the first pass all the labels and their corresponding addresses are placed in a symbol table. In the second pass, the instructions are generated, using the addresses from the symbol table. The output of the assembler is an object file.
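As a small illustration of what the second pass produces, the C sketch below packs the six fields of add $1, $2, $3 into a 32-bit word. The field widths and the funct code for add (100000, i.e. 32) come from the instruction formats given earlier; the function name encode_rtype is simply a name chosen for this sketch.

    #include <stdio.h>
    #include <stdint.h>

    /* Pack the six R-type fields (op 6, rs 5, rt 5, rd 5, shamt 5, funct 6 bits). */
    static uint32_t encode_rtype(unsigned op, unsigned rs, unsigned rt,
                                 unsigned rd, unsigned shamt, unsigned funct) {
        return (op << 26) | (rs << 21) | (rt << 16) |
               (rd << 11) | (shamt << 6) | funct;
    }

    int main(void) {
        /* add $1, $2, $3  ->  op = 000000, rs = 2, rt = 3, rd = 1, shamt = 0, funct = 100000 */
        uint32_t word = encode_rtype(0, 2, 3, 1, 0, 32);
        printf("add $1, $2, $3 assembles to 0x%08x\n", (unsigned)word);   /* prints 0x00430820 */
        return 0;
    }

The resulting word is the bit pattern 000000 00010 00011 00001 00000 100000 shown above; words like this are what the assembler writes into the text section of the object file.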
This object file still may contain unresolved references (say, to library functions) which are resolved by the linker. We will look in more detail at how functions work in assembly language later, but it is usual to provide functions for common operations in a library. For example, there is a function printf which accepts a format string and one or more values to print as arguments. (This is actually a standard C function.) 41 In UNIX systems, object files have six components: • An object file header describing the sizes of the other sections • The text segment containing the actual machine code • The data segment containing the data in the source file • relocation information identifying data and instructions that rely on absolute addresses, which must be changed if the program is moved from one part of memory to another. • The symbol table associating labels with addresses, and holding places for unresolved references. • debugging information, containing concise information about how the program was compiled, so a debugger can associate memory addresses with lines in the source file. The following diagram shows the steps involved in assembling and running a program: 42 Programmer ❄ Assembly language program ❄ Assembler ❄ Machine language program Libraries Other functions ❙ ❙ ❙ ✇ ❄ ✠ Loader ❄ Memory ✛ Processor Input ❄ Output 43 MIPS memory usage MIPS systems typically divided memory into three parts, called segments. These segments are the text segment which contains the program’s instructions, the data segment, which contains the program’s data, and the stack segment which contains the return addresses for function calls, and also contains register values which are to be saved and restored. It may also contain local variables. 7fffffffhex 10000000hex stack Dynamic data Static data 00000000 11111111 400000hex 00000000 11111111 Reserved 00000000 11111111 Stack segment Data segment Text segment 00000000 11111111 The data segment is divided into 2 parts, the lower part for static data (with size known at compile time) and the upper part, which can grow, upward, for dynamic data structures. The stack segment varies in size during the execution of a program, as functions are called and returned from. It starts at the top of memory and grows down. 44 More about assemblers Sometimes, an assembler will accept a statement that does not correspond exactly to a machine instruction. For example, it may correspond to a small set of machine instructions. These are called pseudoinstructions. This is done when a particular set of statements are frequently used, and have a simple translation to a set of machine instructions. The original MIPS assembly language had a number of these. For example, the pseudoinstruction load double ld $4, 0($1) would generate the following two instructions: lw $4, 0($1) lw $5, 4($1) The pseudoinstruction load address la $4, label generates the instructions lui $4, imm u and ori $4, $4, imm l which load the upper and lower 16 bits of the address, respectively. The pseudoinstruction mov $5, $1 moves the contents of register $1 to register $5 What single MIPS instruction corresponds to this pseudoinstruction? 45 Macros Assemblers also provide set of instructions similar to functions, which can accept a formal argument. These are called macros. A macro is expanded as text, so code is generated each time the macro is used, and the formal argument is replaced as text in the macro. Consequently, there is no function call — the macro is expanded directly in the code. 
Following is a macro which uses the function printf to print an integer:

            .data
    int_str: .asciiz "%d"
            .text
            .macro print_int($arg)
            la   $a0, int_str   # load format string address
                                # into first argument register
            mov  $a1, $arg      # load macro’s parameter
                                # (arg) into second argument
                                # register
            jal  printf
            .end_macro

This macro would be “called” with a formal argument like print_int($7) and would have the effect of inserting the above code, with register $7 replacing the string $arg. 46 Translating programs to assembly language Given the program statement y = a + b − c + d what is an equivalent assembly code? Assuming that a, b, c, d are in registers $5 to $8, respectively, and that y is in $9, then we could have:

    add $9, $5, $6      # y = a + b
    sub $10, $8, $7     # tmp = d - c
    add $9, $9, $10     # y = y + tmp

Note that we have introduced a temporary register, $10 (tmp), here. This is not really necessary. To place the values of a, b, c, and d in the registers, from memory, assuming register $20 contains the address for variable a, and variables b, c, d, and y are the next consecutive words in memory, we could write

    lw $5, 0($20)       # load a in reg $5
    lw $6, 4($20)       # load b in reg $6
    lw $7, 8($20)       # load c in reg $7
    lw $8, 12($20)      # load d in reg $8

To store the value of y in memory, we could write

    sw $9, 16($20)      # store reg $9 in y

47 Simple data structures It is common to use some kind of data structure in a high-level programming language. How would the following be translated into MIPS assembly language?

    A[i] = A[i] + B;

Assuming there is a label Astart at the beginning of the data array A[], and that register $19 has the value 4 × i and that the value of B is in register $18:

    lw  $8, Astart($19)     # load A[i] in reg $8
    add $8, $18, $8         # add B to A[i]
    sw  $8, Astart($19)     # store reg $8 in variable A[i]

48 Program structures — loops Extending the previous example to a simple loop; how would the following be translated to MIPS assembly language?

    for (i = 0; i < 10; i++) {
        A[i] = A[i] + B;
    }

Here, we need to set up a counter, say, in register $6, and compare it to 10.

            addi $6, $0, 0          # initialize counter i to 0
            addi $19, $0, 0         # initialize array address
            addi $5, $0, 10         # set test value for loop
    loop:   lw   $8, Astart($19)    # load A[i] in reg $8
            add  $8, $18, $8        # add B to A[i] (B is in $18)
            sw   $8, Astart($19)    # store reg $8 in variable A[i]
            addi $6, $6, 1          # increment counter
            addi $19, $19, 4        # increment array address
            bne  $5, $6, loop       # jump back until counter
                                    # equals 10

Note that this is not the most efficient code; the array index itself could be used to terminate the loop, using one less register, and one less instruction in the loop. 49 Conditional expressions Consider the following C code:

    if (i == j)
        x = x + h;
    else
        x = x - h;

Assume i, j, x and h are already in registers $4, $5, $6, and $7, respectively. In MIPS assembly language, this could be written as:

            bne $4, $5, else    # jump to the "else" clause
            add $6, $6, $7      # execute the "then" clause
            j   endif           # jump past the "else" clause
    else:   sub $6, $6, $7      # execute the "else" clause
    endif:  . . .

A similar, but extended, structure could be written for case structures. 50 Subprograms We have already seen the instruction to jump to a subprogram, jal, which places the value of PC + 4 (the address of the next instruction in memory) into register $31. We have also seen how the subprogram returns back to the main program using the instruction jr $31. There are still some questions about subprograms, however. First, what happens when a subprogram calls another subprogram?
There must be some way to save the “old” return address before overwriting the value in register $31. Next, how are arguments passed to the subprogram? To answer the first question, a stack data structure is used to save the return address in register $31 before a subprogram is called. The operation of placing a value on the stack is called pushing a value onto the stack. Returning a value from the stack to the register is called popping a value from the stack. By convention, register $29 is used as a stack pointer. It is initially set to a high value (7fffffffhex), decremented every time a value is pushed, and incremented whenever a value is popped. 51 The following diagram shows the state of the stack after three nested subprogram calls. (The main program calls subprogram 1, which calls subprogram 2, which calls subprogram 3 and then returns from it; the stack holds the return address to main, then the return address to subprogram 1, then the return address to subprogram 2, with sp pointing to the most recently pushed entry.) Note that the stack pointer always points to the last element placed on the stack. It is decremented before pushing, and incremented after popping. The return address is not the only thing which must be saved during the execution of a subprogram. Arguments may also be passed to a subprogram on the stack. If a subprogram can call itself (recursion) then its entire state must be saved. This includes the contents of registers used by the subprogram, and values of local variables, etc. These are also saved on the stack. The whole block of memory used by the stack in handling a procedure call is referred to as a procedure call frame. 52 The procedure call frame is usually completely contained in the stack, and is often called simply a stack frame. In order to facilitate accessing data in the stack frame, there is usually a frame pointer which points to the start of a frame. The stack pointer points to the end of the frame. (The diagram shows a typical frame: arguments 5, 6, . . . passed by the caller sit just above $fp; below it are the saved registers, then the local variables, then the argument build area, with $sp marking the end of the frame; the distance from $fp to $sp is the frame size.) In the MIPS convention, register $30 is the frame pointer. In order to properly preserve the contents of registers in a procedure call, both the caller and callee must agree on who is responsible for saving each register. The following convention was used with most MIPS compilers: 53
The following convention was used with most MIPS compilers: 53 MIPS register names and conventions about their use Register Name zero at v0 v1 a0 a1 a2 a3 t0 t1 t2 t3 t4 t5 t6 t7 s0 s1 s2 s3 s4 s5 s6 s7 t8 t9 k0 k1 gp sp fp ra Number 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 Usage Constant 0 Reserved for assembler Expression evaluation and results of a function Argument 1 Argument 2 Argument 3 Argument 4 Temporary (not preserved across call) Temporary (not preserved across call) Temporary (not preserved across call) Temporary (not preserved across call) Temporary (not preserved across call) Temporary (not preserved across call) Temporary (not preserved across call) Temporary (not preserved across call) Saved temporary (preserved across call) Saved temporary (preserved across call) Saved temporary (preserved across call) Saved temporary (preserved across call) Saved temporary (preserved across call) Saved temporary (preserved across call) Saved temporary (preserved across call) Saved temporary (preserved across call) Temporary (not preserved across call) Temporary (not preserved across call) Reserved for OS kernel Reserved for OS kernel Pointer to global area Stack pointer Frame pointer Return address (used by function call) 54 What happens when a procedure is called Before calling a procedure, the caller must: 1. Pass the arguments to the callee procedure; The first 4 arguments are passed in registers $a0 - $a3 ($4 $7). The remaining arguments are placed on the stack. 2. Save any caller-saved registers that the caller expects to use after the call. This includes the argument registers and the temporary registers $t0 - $t9. (The callee may use these registers, altering the contents.) 3. Execute a jal to the called procedure (callee). This saves the return address in $ra. At this point, the callee must set up its stack frame: 1. Allocate memory on the stack by subtracting the frame size from the $sp. 2. Save any registers the caller expects to have left unchanged. These include $ra, $fp, and the registers $s0 - $s7. 3. Set the value of the frame pointer by adding the stack frame size to $fp and subtracting 4. The procedure can then execute its function. Note that the argument list on the stack belongs to the stack frame of the caller. 55 Returning from a procedure When the callee returns to the caller, the following steps are required: 1. If the procedure is a function returning a value, the value is placed in register $v0 and, if two words are required, $v1 (registers $2 and $3). 2. All callee-saved registers are restored by popping the values from the stack, in the reverse order from which they were pushed. 3. The stack frame is popped by adding the frame size to $sp. 4. The callee returns control to the caller by executing jr $ra Note that some of the operations may not be required for every procedure call, and modern compilers would only generate the steps required for each particular procedure. For example, the lowest level subprograms to be called (“leaf nodes”) would not have to save $ra. If a programming language does not allow a subprogram to call itself (recursion) then implementing a stack frame may not be required, but a stack is still required for nested procedure calls. 56 Who does what operation (caller or callee) is to some extent arbitrary, and different systems may use quite different conventions. 
For example, in some systems the subprogram arguments are part of the callee stack frame, unlike the MIPS, in which they belong to the frame of the caller. The designation of certain registers as caller saved, and others as callee saved, is also arbitrary, and to some extent depends on how many registers are available. Having registers which a procedure can use without the overhead of saving and restoring tends to lower the overhead of a procedure call. Processors with few general registers (e.g., the INTEL processors) would likely construct a stack frame quite differently. It is imperative, however, that all the programs which will be linked together strictly follow the same conventions. Typically, procedures from several languages (e.g., assembly, C, Java) can be intermixed at run time, so the compilers and linkers must follow the same conventions if they are to interact correctly. 57 An example of a recursive function (factorial) The following is a simple factorial function written in C:

C program for factorial (recursive)

    main ()
    {
        printf ("the factorial of 10 is %d\n", fact(10));
    }

    int fact (int n)
    {
        if (n < 1)
            return 1;
        else
            return (n * fact (n-1));
    }

Following is the same code, in MIPS assembly language. First, the main program is shown, followed by the factorial function itself. Note that the MIPS specifies a minimum size of 32 bytes for a stack frame.

58 # MIPS assembly code showing recursive function calls:

            .text                   # Text section
            .align 2                # Align following on word boundary.
            .globl main             # Global symbol main is the entry
            .ent   main             # point of the program.
    main:   subiu $sp, $sp, 32      # Allocate stack space for return
                                    # address and local variables
                                    # (32 bytes). (Stack "grows" downward.)
            sw    $ra, 20($sp)      # Save return address
            sw    $fp, 16($sp)      # Save old frame pointer
            addiu $fp, $sp, 28      # Set up frame pointer
            li    $a0, 10           # put argument (10) in $a0
            jal   fact              # jump to factorial function

    # the factorial function returns a value in register $v0
    #       la    $a0, $LC          # Put format string pointer in $a0
    #       move  $a1, $v0          # put result in $a1
    #       jal   printf            # print the result

59 # Instead of using printf, we can use a syscall

            move  $s0, $v0          # put result in $s0

    # Print label for output.
            li    $v0, 4            # Syscall code for print string
                                    # goes in register $v0
            la    $a0, $LC          # Put format string pointer in $a0
            syscall                 # print string

    # Print integer result
            li    $v0, 1            # Syscall code for print integer
            move  $a0, $s0          # Put integer to be printed in $a0
            syscall                 # print integer

            move  $v0, $0           # Clear register v0.
    # end of print output
    # restore saved registers
            lw    $ra, 20($sp)      # restore return address
            lw    $fp, 16($sp)      # Restore old frame pointer
            addiu $sp, $sp, 32      # Pop stack frame
            jr    $ra               # return to caller (shell)

            .rdata
    $LC:    .ascii "The factorial of 10 is "

60 # factorial function

            .text                   # Text section
    fact:   subiu $sp, $sp, 32      # Allocate stack frame (32 bytes)
            sw    $ra, 20($sp)      # Save return address
            sw    $fp, 16($sp)      # Save old frame pointer
            addiu $fp, $sp, 28      # Set up frame pointer
            sw    $a0, 0($fp)       # Save argument (n)

    # here we do the required calculation
    # first check for terminal condition
            bgtz  $a0, $L2          # Branch if n > 0
            li    $v0, 1            # Return 1
            j     $L1               # Jump to code to return

    # do recursion
    $L2:    subiu $a0, $a0, 1       # subtract 1 from n
            jal   fact              # jump to factorial function,
                                    # returning fact(n-1) in $v0
            lw    $v1, 0($fp)       # Load n (saved earlier) into $v1
            mul   $v0, $v0, $v1     # compute (fact(n-1) * n)
                                    # and return result in $v0

61 # restore saved registers and return

    $L1:                            # result is in $2
            lw    $ra, 20($sp)      # restore return address
            lw    $fp, 16($sp)      # Restore old frame pointer
            addiu $sp, $sp, 32      # pop stack
            jr    $ra               # return to calling program

62 When is assembly language used? Modern compilers optimize code so well that assembly language is rarely used to increase performance. Consequently, in large computer systems, assembly language is rarely used. Today its main application is in small systems (typically single chip microcontrollers) where some special function is being implemented, or there is a need to meet some particular timing constraint. Typically, such systems have limited memory for programs and data, and are dedicated to performing a small number of very specific functions. These kinds of constraints are often typical of I/O functions, and it is for this type of application that assembly language is still occasionally useful. Generally, a programmer will solve a problem using a higher level language like C first (this makes the resulting code more portable). Only if the timing or size constraints are not met will the programmer resort to recoding part or all of the function in assembler. 63 Switching Functions - logic: Many things can be described by two distinct states; for example, a light can be “on” or “off;” a switch can be “open” or “closed;” a statement can be “true” or “false.” Devices which have exactly two distinct states, say “on” and “off,” are often particularly simple to construct; in particular, electronic devices with two distinct states are much simpler to construct than devices with, say, 10 states. A typical electronic device with 2 states is a switch, which can be “on” (switch closed) or “off” (switch open). A very effective switch can be made with a single transistor. Transistor switches can be very small; current commercial integrated circuit technology routinely manufactures devices containing many millions of these switches in a single integrated circuit, with each switch capable of switching, or changing state, in a time of less than 0.1 nanosecond (abbreviated ns; 1 ns is the time required for light to travel approximately 30 cm, or 1 foot). Since such binary (i.e., 2-state) devices are so simple, it is useful to examine the kinds of operations which can be performed involving only 2 states. An “algebra” of entities having exactly two states (“true” and “false”, or “1” and “0”) was developed by the mathematician George Boole, and later called Boolean Algebra. This algebra was applied to electronic switching circuits by Shannon, as a “switching algebra.” 64 Exam logic — not what Boole intended?
65 We can define a switching algebra as an algebraic system consisting of the set {0,1}, two binary operations called OR (+) and AND (·) and one unary operation (denoted by an overbar, ¯) called NOT, or complementation, or inversion. These operations are defined as follows: OR (+) AND (·) NOT (¯) 0+0=0 0·0=0 0=1 0+1=1 0·1=0 1=0 1+0=1 1·0=0 1+1=1 1·1=1 These relations are often expressed in “truth tables” as follows: OR AND A B A+B A B A·B NOT 0 0 0 0 0 0 A A 0 1 1 0 1 0 0 1 1 0 1 1 0 0 1 0 1 1 1 1 1 1 Following are the “circuit symbols” for these functions: A A+B B A A .B A A B OR AND Note that the symbol for NOT is actually the ◦. 66 NOT This switching algebra has a number of properties which can be readily shown: Idempotency A+A=A A·A=A Commutativity A + B = B + A A·B =B·A Associativity (A + B) + C = A + (B + C) (A · B) · C = A · (B · C) Distributivity A · (B + C) = A · B + A · C A + (B · C) = (A + B) · (A + C) Absorption A + (A · B) = A A · (A + B) = A Concensus A·B+A·C +B·C =A·B+A·C (A + B) · (A + C) · (B + C) = (A + B) · (A + C) 67 Another useful property is de Morgans theorem: A+B =A·B A·B =A+B de Morgans theorem can be generalized to F (A1, A2, . . . , An, 0, 1, ·, +) = F (A1, A2, . . . , An, 1, 0, +, ·) That is, the complement of any expression can be obtained by replacing each variable and element with its complement, and at the same time interchanging the OR and AND operators. Duality Note that each of the preceding properties occur in pairs, and that one of the pairs can be obtained from the other simply by replacing the AND operator by the OR operator, and vice versa. This property is called duality, and the operators AND and OR are said to be dual. This property comes about because if the 1’s and 0’s are interchanged in the definition of the AND function, the function becomes the OR function; similarly, the OR function becomes the AND function. This property is general, and if one theorem is shown to be true, then its dual will also be true. 68 The circuit symbol notation can be extended to other logic gates; for example, the following represents the functions NAND (not AND) and NOR (not OR): A A A+B B A .B B NOR NAND Note that the NOT function is represented by the ◦. N-input OR and N-input AND gates are represented by the symbols: A1 A2 A1 A2 A1 + A2 + ... An An A1 A2 ... An An n−input OR n−input AND There is another commonly used circuit symbol, the exclusive-OR function, denoted by the symbol ⊕. It is defined as follows: A B A⊕B 0 0 0 0 1 1 1 0 1 1 1 0 A B 69 A⊕B Switch implementation of switching functions: The functions NOT, AND and OR can be implemented with simple switches. In fact, in digital electronic circuits, transistors are used as simple switches in circuits similar to those which follow. Note that the power supplied to the circuit is shown, (a battery), as is a device to detect the output (a lamp). They are not part of the logic, but are required to make the switching logic useful. In the AND function, the two switches are in series with each other; in the OR function, the two switches are connected in parallel. For the NOT function, the switch is connected in parallel with the output (the lamp). NAND and NOR gates can be constructed similarly. A ✟q ✟ A B ✟q ✟q ✟ ✟ q ✁✁ A (a) NOT gate ❝ ♠ ❝ ♠ (b) AND gate ✟q ✟ B (c) OR gate These circuits can be combined to form more complex switching functions. (If you have not seen it before, try to construct a simple switching circuit for the XOR function). Note that the inputs for these simple switches are mechanical; e.g. 
the press of a finger. For electronic switches such as transistors, the inputs can be the outputs of other logic functions, so very complex logic circuits can be designed which operate “automatically”. 70 ❝ ♠ Canonical forms of switching functions: It is possible to construct a truth table for any switching function (i.e., any function of switching variables.) The truth table provides a complete, unique description of the switching function, but it is cumbersome. We can derive from the truth table certain unique expressions which defines the function exactly; in fact, the expression is exactly equivalent to the truth table. One such expression is called the minterm form of the expression, or, alternately, the sum of products (SOP) form. e.g., for the function Y = A ⊕ B, the truth table is: A B Y =A⊕B 0 0 0 0 1 1 1 0 1 1 1 0 This is equivalent to Y =A·B+A·B This is the minterm form of the function. It is obtained by ORing together all the minterms. Minterms are the AND terms corresponding to each 1 in the function column. 71 Minterms are obtained by ANDing together the variables, or their complements, which have a 1 in the function column. If the variable has value 1, the variable is taken; if not, its complement is taken. The minterms are then ORed together to give the function specified in the truth table. Note: 1. Each minterm contains all the variables or their complements, exactly once. 2. Each minterm is unique, except for permutation of the variables. Therefore, the minterm form of the function is unique. 3. Any expression which contains only variables in the minterm (sum of products) form, where each product term contains all the variables, or their complement, exactly once, is a minterm expression. This means that, no matter how a function is derived, if it contains only minterms then it must be a minterm form of the function. 72 A dual form of the preceding, called a maxterm form, or product of sums (POS) form can also be written. The maxterm form of a function can be obtained from the truth table by applying the principle of duality to the way described previously for deriving the minterm form of a function. Equivalently, we can write down the minterm expression for the complement of the function, Y , and apply de Morgans theorem; e.g., A B Y =A⊕B Y =A⊕B 0 0 0 1 0 1 1 0 1 0 1 0 1 1 0 1 In minterm form, Y =A·B+A·B Complementing both sides, Y =Y =A·B+A·B Applying de Morgans theorem, Y = (A · B) · (A · B) = (A + B) · (A + B) This is the maxterm form of the switching function Y = A ⊕ B 73 The maxterm form can more easily be obtained from the truth table by ORing together all the variables or their complements which give a zero for the function; if the variable has value 0 then it is ORed directly, if it has a value 1, it is complemented. Each term is called a maxterm, or a sum term. The function is equal to the AND of all the maxterms. Example: Find the minterm and maxterm expressions corresponding to the following truth table: A B C Y Minterms Maxterms 0 0 0 0 1 A·B·C 1 0 0 1 0 2 0 1 0 1 A·B·C 3 0 1 1 1 A·B·C 4 1 0 0 0 A+B+C 5 1 0 1 0 A+B+C 6 1 1 0 1 A·B·C 7 1 1 1 1 A·B·C A+B+C Minterm form: Y = A·B·C +A·B·C +A·B·C +A·B·C +A·B·C Maxterm form: Y = (A + B + C) · (A + B + C) · (A + B + C) 74 Sometimes the minterm and maxterm expression are written in a kind of “shorthand,” where the values (0 or 1) of the set of variables is used to form a binary number, the decimal equivalent of which designates the appropriate minterm or maxterm. 
e.g., the minterm form of the previous function is written as: Y = X (0, 2, 3, 6, 7) The maxterm form is written as: Y = Y (1, 4, 5) Note that the order in which the variable are written down in the truth table is important, in this case. The numbers which appear in the minterm form do not appear in the maxterm form, and vice versa. The minterm or maxterm form of the function is not usually the simplest or most concise; e.g., the preceding function could be simplified to the following: Y = A·B·C +A·B·C +A·B·C +A·B·C +A·B·C = A·C +B Systematic ways exist to reduce the complexity of minterm or maxterm forms of switching functions but we will not discuss them here. The problem is computationally complex (NP-hard). 75 Practical examples (1) Design of a half adder A binary half adder is a switching circuit which will add together two binary digits (called binary bits), producing two output bits, a sum bit, S, and a carry bit, C. It has the following truth table: A B S C 0 0 0 0 0 1 1 0 1 0 1 0 1 1 0 1 It is immediately obvious from the truth table that the two functions S and C can be implemented as S = A ⊕ B, and C = A · B as shown in the following, which also shows a logic symbol for a half adder. (Logic symbols for devices more complex than the basic logic gates are usually just boxes with appropriately labeled inputs and outputs). A B s ✩ s C ✪ A C B S S (b) (a) 76 (2) Design of a full adder A binary full adder is a switching circuit which will add together two binary digits (called binary bits), and a third bit called a carry bit which may have come from a previous full adder. It produces both a sum bit and a carry bit as output. The full adder therefore has 3 inputs A, B and C where C is the carry bit, and 2 outputs; the sum, S, and the carry C+. It is possible to connect N such full adders together to add two N bit numbers. It should be immediately obvious that the sum bit for a full adder can be obtained using two half adders, one which adds digits A and B together producing, say, Z as the sum and the other adding Z and C together to form the sum of A, B and C. Clearly, a carry output should be produced when either of the half adders produces a carry, so the carry output for the full adder can be obtained by ORing together the Carry outputs of the full adders. Such an implementation of a full adder is shown in the following: A ❍ A C ❍ B C B S Z ❍ ❍ ❍ A C B S 77 C+ S The preceding implementation relied on our knowledge of the half adder. We will now consider the design of a full adder starting from its description as a truth table: A B C S C+ 0 0 0 0 0 0 0 1 1 0 0 1 0 1 0 0 1 1 0 1 1 0 0 1 0 1 0 1 0 1 1 1 0 0 1 1 1 1 1 1 We can write the outputs in minterm form as: S = A·B·C +A·B·C +A·B·C +A·B·C C+ = A · B · C + A · B · C + A · B · C + A · B · C These functions can be implemented directly as shown in the next slide. 78 A B C A B C A B C A B C ✩ ❈ ❈ ✪ ❈ ❈ ✩ ❈ ❈ ❈ ❡ ❈❈ ❡ ✪ ❡ ✩ ✪ ✪✄ ✪ ✄ ✄ ✄ ✪ ✄ ✄ ✩ ✄ ✄ ✄ A B C A B C S A B C A B C ✪ ✩ ❈ ❈ ✪ ❈ ❈ ✩ ❈ ❈ ❈ ❡ ❈❈ ❡ ✪ ❡ ✩ ✪ ✪✄ ✪ ✄ ✄ ✄ ✪ ✄ ✄ ✩ ✄ ✄ ✄ ✪ Note that this implementation of the full adder requires more logic gates than the implementation shown earlier, but the circuit is implemented using only three “levels” of logic; a NOT level (not shown), an AND level and an OR level. The previous implementation would require more logic levels if it were implemented using only ANDOR-NOT (AON) logic. This implementation would consequently be a “slower” implementation, due to the inherent delay in each logic gate. 
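The minterm expressions for S and C+ can be checked with a few lines of C, using single-bit variables and the bitwise operators & (AND), | (OR) and ~ (NOT). The sketch below implements both outputs directly from their sum-of-products forms taken from the truth table and prints the result for every input combination; the function names are chosen only for this example.

    #include <stdio.h>

    /* Full adder outputs from their minterm (sum-of-products) forms.
       A, B and the carry-in C are single bits (0 or 1). */
    static unsigned fa_sum(unsigned A, unsigned B, unsigned C) {
        /* S = A'B'C + A'BC' + AB'C' + ABC */
        return ((~A & ~B & C) | (~A & B & ~C) | (A & ~B & ~C) | (A & B & C)) & 1;
    }

    static unsigned fa_carry(unsigned A, unsigned B, unsigned C) {
        /* C+ = A'BC + AB'C + ABC' + ABC */
        return ((~A & B & C) | (A & ~B & C) | (A & B & ~C) | (A & B & C)) & 1;
    }

    int main(void) {
        printf(" A B C | S C+\n");
        for (unsigned A = 0; A <= 1; A++)
            for (unsigned B = 0; B <= 1; B++)
                for (unsigned C = 0; C <= 1; C++)
                    printf(" %u %u %u | %u  %u\n", A, B, C,
                           fa_sum(A, B, C), fa_carry(A, B, C));
        return 0;
    }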
The function S cannot be simplified any further using only AON logic, but C+ can be rewritten as: C+ = B · C + A · C + A · B 79 C+ Both the half adder and the full adder are useful functional blocks. In particular, the full adder is often used as the basic building block to construct larger adders which add many bits simultaneously. The following figure shows the implementation of a four bit adder, which adds together two four bit numbers (A3A2 A1A0, and B3B2B1B0) and produces a 5 bit result (S4S3S2S1 S0), using four full adders. In general, n full adders are required to implement an adder for two n-bit words. The general expression for the sum and carry bits generated in the ith addition stage is Si = (Aii ⊕ Bi) ⊕ Ci Ci+1 = Ai · Bi + ((Ai + Bi) · Ci) 0 A0 B0 A1 B1 C S B A C+ S0 C S B A C+ S1 C S B A C+ A2 B2 A3 B3 80 S2 C S B A C+ S3 S4 Note that, for this type of adder, before the result of the add operation is correct, the carry result must be allowed to propagate through all of the full adders. Because of this, this implementation of a wide word adder is called a ripple carry adder. It is possible, of course, to design an adder which has no such ripple at all, simply by creating a truth table for each bit of the n-bit adder and implementing each bit from, say, its minterm form. This becomes quite tedious, however, for large n (specification of an 8-bit adder would require 9 truth tables, each with 256 lines). It is also possible to devise logic functions which generate the only the n carry bits for an add operation on two n bit numbers. These functions are called carry look-ahead functions, and are commonly used in the construction of fast wide word adders. Such carry lookahead adders are commonly used to implement the add operations on the fastest computers. As we will see, it is possible to connect the carry look-ahead units in a tree-like fashion to give reasonably fast carry generation in a much smaller time than required in the ripple carry adder. 81 Carry look-ahead adders Logic expressions for this “carry look-ahead” function can be derived from the logic functions for the full adder. Recall that, for 2 n-bit words A = (An−1 , An−2 , . . . , A1 , A0) and B = (Bn−1 , Bn−2, . . . , B1, B0) we saw earlier that the bit Si for the ith digit of the sum S was Si = (Ai ⊕ Bi) ⊕ Ci and the carry Ci+1 was Ci+1 = Ai · Bi + (Ai + Bi) · Ci The expression for the carry, Ci+1, can be rewritten as Ci+1 = Gi+1 + Pi+1 · Ci where and Gi+1 = Ai · Bi is the carry generation term Pi+1 = Ai + Bi is the carry propagation term Note that Gi+1 and Pi+1 depend only on Ai and Bi. From this recurrence relation, we see that, if we have an initial carry in C0 = G0 then C 1 = G1 + P 1 · G0 C2 = G2 + P2 · (G1 + P1 · G0) = G2 + P 2 · G1 + P 2 · P 1 · G0 C 3 = G3 + P 3 · G2 + P 3 · P 2 · G1 + P 3 · P 2 · P 1 · G0 ... 82 Note that the product terms for the ith carry bit correspond to AND gates with inputs numbering from 2 to i; consequently for large i the AND gates will require a large number of inputs. (They cannot be connected in series because this would reintroduce a ripple effect.) Fortunately, carry look-ahead units can be cascaded in a kind of tree fashion, as shown below. The fact that this cascading is possible is apparent from the original relation, Ci+1 = Gi+1 + Pi+1 · Ci. All that is necessary at any level, say, l, is to have the term Cl−1 available. This can come from a previous level of carry look-ahead units, and replaces the initial carry input, C0. 
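The carry look-ahead equations can also be exercised in software. The C sketch below forms the generate and propagate terms for a 4-bit group, expands the carries C1 to C4 from the equations above, and checks the resulting sum against ordinary addition. The operand values are arbitrary examples, and the arrays g[] and p[] here simply hold the generate and propagate terms belonging to bit i.

    #include <stdio.h>

    int main(void) {
        /* Example 4-bit operands and carry in; any values 0..15 can be tried. */
        unsigned A = 0xB, B = 0x6, c0 = 0;

        unsigned g[4], p[4], carry[5];
        carry[0] = c0;
        for (int i = 0; i < 4; i++) {
            unsigned ai = (A >> i) & 1, bi = (B >> i) & 1;
            g[i] = ai & bi;     /* carry generate for bit i  */
            p[i] = ai | bi;     /* carry propagate for bit i */
        }

        /* Expanded look-ahead equations: each carry depends only on g, p and the
           initial carry, so in hardware all of them appear after two gate levels. */
        carry[1] = g[0] | (p[0] & carry[0]);
        carry[2] = g[1] | (p[1] & g[0]) | (p[1] & p[0] & carry[0]);
        carry[3] = g[2] | (p[2] & g[1]) | (p[2] & p[1] & g[0])
                 | (p[2] & p[1] & p[0] & carry[0]);
        carry[4] = g[3] | (p[3] & g[2]) | (p[3] & p[2] & g[1])
                 | (p[3] & p[2] & p[1] & g[0]) | (p[3] & p[2] & p[1] & p[0] & carry[0]);

        unsigned sum = 0;
        for (int i = 0; i < 4; i++) {
            unsigned ai = (A >> i) & 1, bi = (B >> i) & 1;
            sum |= ((ai ^ bi ^ carry[i]) & 1) << i;     /* Si = (Ai xor Bi) xor Ci */
        }
        sum |= carry[4] << 4;                           /* carry out is result bit 4 */

        printf("A + B + c0 = %u (expected %u)\n", sum, A + B + c0);
        return 0;
    }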
(Note that the adders used in with a carry look-ahead unit should have outputs for generate, G, and propagate, P rather than for the carry, Cout). The following figure shows an implementation of part (half) of a 16 bit adder, using 4-bit carry look-ahead functions: ✻ ✻ Pout Gout ✲ ✥✥✥ Cin ✥✥✥ P0 G0 C1 P1 G1 C2 P2 G2 C3 P3 G3 ✥ ✥ ✥✥ ✥✥✥ ✥ ✻ ✻ ✻✻ ✻ ✻✻ ✻ ✻✻ ✥ ✥ ✭✭✻ ✭✭ ✭ ✥✥✥ ✭ ✭ ✥ ✭ ✭ ✂ ❊ ✭ ✥ ✭ ✭ ✭ ✭✭ ✥✥ ✭✭ ✭✭ ✂ ❊ ✭✭✭✭✭✭✭✭✭✭ ✭✭ ✥✥✥ ✭ ✭ ✥ ✭ ✥ ✭ ✭ ✭✭✭✭ ❊ ✂ ✭✭ ✥✥ ✭✭ ✭✭ ✭✭✭✭ ❊ ✥✥✥ ✭✭✭ ✂ ✻ ✻ ✻ ✻ Pout Gout Pout Gout r✲ Cin r✲ Cin ... P0 G0 C1 P1 G1 C2 P2 G2 C3 P3 G3 P0 G0 C1 P1 G1 C2 P2 G2 C3 P3 G3 ✻✻ ✻✻ ✻✻ ✻✻ ✻✻ ✻✻ ✻✻ ✻✻ P G P G P G P G ✲C ✲C ✲C ✲C S AB S AB S AB S AB ❄ ✻✻ ❄ ✻✻ ❄ ✻✻ ❄ ✻✻ S3 A3 B3 S2 A2 B2 S1 A1 B1 S0 A0 B0 P G P G P G P G ✲C ✲C ✲C ✲C S AB S AB S AB S AB ❄ ✻✻ ❄ ✻✻ ❄ ✻✻ ❄ ✻✻ . . . S7 A7 B7 S6 A6 B6 S5 A5 B5 S4 A4 B4 83 Combinational Logic — Using MSI circuits: When designing logic circuits, the “discrete logic gates”; i.e., individual AND, OR, OR etc. gates, are often neither the simplest nor the most effective devices we could use. There are available many standard MSI (medium scale integrated) functions which can do many of the things commonly required in logic circuits. These devices, or similar devices, are often used as components of “programmable logic devices.” The digital multiplexer One MSI function which has been available for a long time is the digital selector, or multiplexer. It is the digital equivalent of the rotary switch or selector switch (e.g., the channel selector on a TV set). Its function is to accept a binary number as a “selector input,” and present the logic level connected to that input line as output from the data selector. A circuit diagram for a possible 4-line to 1-line data selector/multiplexer (abbreviated as MUX for multiplexer) is shown in the following slide. Here, the output Y is equal to the input I0, I1, I2, I3 depending on whether the select lines S1 and S0 have values 00, 01, 10, 11 for S1 and S0 respectively. That is, the output Y is selected to be equal to the input of the line given by the binary value of the select lines (or address) S1S0. 84 S1 S0 r r I0 ❍ ❍ ❍❢ ✟ ✟ ✟ ❍❍ ❍❢ r ✟ ✟✟ I1 r r I2 r I3 ✣✢ ✣✢ ✣✢ ✣✢ PP ✏✏ PP ❅ ✏✏ PP ❅ ✏ P❅ ✏✏ The logic equation for this 4-line to 1-line MUX is: Y = I0 · S1 · S0 + I1 · S1 · S0 + I2 · S1 · S0 + I3 · S1 · S0 This device can be used simply as a data selector/multiplexer, or it can be used to perform logic functions. Its simplest application is to implement a truth table directly; e.g., with a 4 line to 1 line MUX, it is possible to implement any 2-variable function directly, simply by connecting I0, I1, I2, I3 to logic 1 in logic 0, as dictated by a truth table. In this way, a MUX can be used as a simple look-up table for switching functions. This facility makes the MUX a very general purpose logic device. Connecting the inputs to a 4-bit memory makes the device a programmable logic device. 85 Example: Use a 4 line to 1 line MUX to implement the function shown in the following truth table (Y = A · B + A · B) A B Y 1 0 0 1 0 0 1 = I0 0 1 0 = I1 1 0 0 = I2 1 1 1 = I3 I0 I1 I2 I3 S1 S0 A B Simply connecting I0 = 1, I1 = 0, I2 = 0, I3 = 1, and the inputs A and B to the S1 and S0 selector inputs of the 4-line to 1-line MUX implement this truth table, as shown above. The 4-line to 1-line MUX can also be used to implement any function of three logical variables, as well. To see this, we need note only that the only possible functions of one variable C, are C, C, and the constants 0 or 1. 
(i.e., C, C, C + C = 1, and 0) We need only connect the appropriate value, C, C, 0 or 1, to I0, I1, I2, I3 to obtain a function of 3 variables. The MUX still behaves as a table lookup device; it is now simply looking up values of another variable. 86 Y Example: Implement the function Y (A, B, C) = A · B · C + A · B · C + A · B · C + A · B · C Using a 4-line to 1-line MUX. Here, again, we use the A and B variables as data select inputs. We can use the above equation to construct the table shown below. The residues are what is “left over” in each minterm when the “address” variables are taken away. To implement this circuit, we connect I0 and I3 to C, and I1 and I2 to C, as shown: Input “Address” Other variables (residues) I0 A·B C I1 A·B C I2 A·B C I3 A·B C C C C C I0 I1 I2 I3 S1 S0 A B In general a 4 input MUX can give any function of 3 inputs, an 8 input MUX can give any functional of 4 variables, and a 16 input MUX, any function of 5 variables. 87 Y Example: Use an 8 input MUX to implement the following equation: Y = A·B·C ·D+A·B·C ·D+A·B·C ·D+A·B·C ·D+ A·B·C ·D+A·B·C ·D+A·B·C ·D+A·B·C ·D Again, we will use A, B, C as data select inputs, or address inputs, connected to S2, S1 and S0, respectively. Input Address Residues I0 A·B·C D I1 A·B·C D I2 A·B·C D+D =1 I3 A·B·C I4 A·B·C D I5 A·B·C D I6 A·B·C D+D =1 I7 A·B·C D D 1 0 D D 1 0 I0 I1 I2 I3 I4 I5 I6 I7 S2 S1 S0 Y A B C Values of the address set A, B, C with no residues corresponding to the address in the above table must have logic value 0 connected to the corresponding data input. The select variables A, B, C must be connected to S2, S1 ,and S0, respectively. 88 MUX “trees” In practice, about 16 line to 1 line MUX’s are the largest which can be reasonably constructed as a single circuit. It is possible to use a “tree” of smaller MUX’s to make arbitrarily large MUX’s. The following shows an implementation of a 16 line to 1 line MUX using five 4 line to 1 line MUX’s. I0 I1 I2 I3 I0 I1 I2 I3 S1 S0 S1 S0 I4 I5 I6 I7 I0 I1 I2 I3 S1 S0 ❇ ❇ ❇ ❇ ❇ ❇ ❇ ❇ ❇ ❇ ❇ ❇ ❇ ❇ ❅ ❇ ❅ ❅ ❇ ❅ ❇ ❅ ❅ S1 S0 I8 I9 I10 I11 I0 I1 I2 I3 S1 S0 S1 S0 I12 I13 I14 I15 I0 I1 I2 I3 S1 S0 ✂ ✂ ✂ ✂ ✂ ✂ ✂ ✂ ✂ ✂ ✂ ✂ ✂ ✂ ✂ ✂ ✂ S1 S0 89 I0 I1 I2 I3 S1 S0 S3 S2 Decoders (demultiplexers): Another commonly used MSI device is the decoder. Decoders, in general, transform a set of inputs into a different set of outputs, which are coded in a particular manner; e.g., certain decoders are designed to decode binary or BCD coded numbers and produce the correct output to display a digit on a 7 segment (calculator type) display. Normally, however, the term “decoder” implies a device which performs, in a sense, the inverse operation of a multiplexer. A decoder accepts an n digit number as its n “select” inputs and produces an output (usually a logic 0) at one of its 2n possible outputs. Decoders are usually referred to as n line to 2n line decoders; e.g. a 3 line to 8 line decoder. This type of decoder is really a kind of binary to unary decoder. Most decoders have inverted outputs, so the selected output is set to logic 0, while all the other outputs remain at logic 1. As well, most decoders have an “enable” input E, which “enables” the operation of the decoder — when the E input is set to 0, the device behaves as a decoder and selects the output determined by the select inputs; when the E input is set to 1, the outputs of the decoder are all set to 1. (The bar over the E indicates that it is an “active low” input; that is, a logic 0 enables the function). 
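Both of the MSI building blocks described above — the multiplexer used as a look-up table (including the residue trick) and the n-line to 2^n-line decoder — are easy to model behaviourally. The following sketch is our own illustration; the names and the 3-variable example function are invented, not taken from any data sheet.

```python
def mux4(i, s1, s0):
    """4-line to 1-line multiplexer: output the data input selected by (S1, S0)."""
    return i[2 * s1 + s0]

# Residue trick (our own example): Y = A xor B xor C using a 4-to-1 MUX.
# A and B drive the select lines; the data inputs carry the residues C, not-C, not-C, C.
def y_xor3(a, b, c):
    return mux4([c, 1 - c, 1 - c, c], a, b)

assert all(y_xor3(a, b, c) == a ^ b ^ c
           for a in (0, 1) for b in (0, 1) for c in (0, 1))

def decoder3to8(s2, s1, s0, enable_bar=0):
    """3-line to 8-line decoder with active-low outputs and an active-low enable.
    The selected output goes to 0; all others (or all of them, when disabled) stay at 1."""
    outputs = [1] * 8
    if enable_bar == 0:
        outputs[4 * s2 + 2 * s1 + s0] = 0
    return outputs

print(decoder3to8(1, 0, 1))       # O5 is selected (low): [1, 1, 1, 1, 1, 0, 1, 1]
print(decoder3to8(1, 0, 1, 1))    # disabled: every output remains 1
```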
The enable input allows decoders to be connected together in a treelike fashion, much as we saw for MUX’s. 90 A typical 3 line to 8 line decoder with an enable input behaves according to the following truth table, and has the circuit symbol as shown. E S2 S1 S0 O0 O1 O2 O3 O4 O5 O6 O7 1 0 0 x 0 0 x 0 0 x 0 1 1 0 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 1 1 1 0 0 1 0 1 1 1 1 1 1 0 1 1 1 0 1 1 1 0 1 1 1 1 1 1 1 1 1 0 0 0 1 1 1 0 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 0 1 1 1 0 ❢ S0 S1 S2 E O 0 O1 O2 O3 O4 O5 O6 O7 ❢ ❢ ❢ ❢ ❢ ❢ ❢ ❢ Note that, when the E input is enabled, an output of 0 is produced corresponding to each minterm of S2, S1, S0 . These minterm can be combined together using other logic gates to form any required logic function of the input variables. In fact, the minterms can be used to produce several functions at the same time. Using de Morgans theorem, we can see that when the outputs are inverted, as is normally the case, then the minterm form of the function can be obtained by NANDing the required terms together. 91 Example: An implementation the functions defined by the following truth table using a decoder and NAND gates is shown below: A B C Y1 Y2 0 0 0 0 1 0 0 1 1 1 0 1 0 1 0 0 1 1 0 0 1 0 0 1 0 1 0 1 0 1 1 1 0 0 1 1 1 1 0 0 C B A S0 S1 S2 O0 O1 O2 O3 O4 O5 O6 O7 ❢ ❢ s ❅ ❢ ❅ ✱ ❅ ❅ ✱ ❢ ❅ ❅✱ ✱✱ ❅✱❅ ❢ ✱ ✱✱❅ ◗✱ ❅ ✱❅ ❅ ❢ ✱◗◗ ✱ ❅ ✱ ◗◗ ❅ ❢ ✱ ◗ ◗ ◗ ❢ ✩ ✐ ✪ ✩ ✐ 92 Y1 ✪ Note that additional functions of the same variables would only require another NAND gate for each function. Y2 Read only memory (ROM): Often, devices requiring 8 or more input variables are implemented using a ROM. A simple type of ROM can be constructed from a decoder, a MUX, and a number of wires. We will look at a small (16 bit) ROM constructed in this way. Normally, memory is arranged in a square array, as shown in the following slide. This general organization is used for other types of memory, as well. To use the ROM to implement a logic function, the address lines are used as the variable inputs, and the contents of the memory are the function values. Usually the memory has a word length of more than 1 bit; typically 4 or 8 bits, so several functions can be implemented simultaneously. In the following figure, it is assumed that the decoder produces a logic 1 as output when the input code selects that output; otherwise it produces logic 0, and that a logic 1 output “outvotes” a logic 0 output, in the sense that if both are present on the same wire, the logic 1 will dominate. (“Real” circuits usually have the opposite behavior). A bit is “programmed” when a link is present. 93 O3 A A3 S1 B A2 S0 O1 s ✡s O2 s ✡s s ✡s s ✡s O0 C A1 S1I0 D A0 S0 I1 I2 I3 Y In this example, the function is Y =A·B·C ·D+A·B·C ·D+A·B·C ·D+A·B·C ·D corresponding to memory locations 0010, 0110, 1001, and 1010. A type of programmable read-only memory uses small “fuse links” to connect the horizontal and vertical wires at each intersection. This device is “programmed” by passing sufficient current through a link to “blow” the fuse. The link could also be a transistor which could be turned on or off, allowing a read-write type of memory to be implemented. We will see logic devices which could be used for this purpose shortly. 94 Programmable logic arrays (PLA’s) The ROM implementation of a function may become quite expensive for functions with a large number of variables, because all potential minterms of the function are implemented, whether or not they are needed. 
A programmable logic array (PLA) requires that only the minterms required for a function be implemented, and allows the implementation of several functions simultaneously. Moreover, the functions can be implemented directly from their minterm forms (although it is often possible to eliminate some of the minterms, further decreasing the cost of the PLA). The PLA can be considered as a direct POS (or SOP) implementation of a set of switching functions, with a set of AND functions followed by a set of OR functions. A PLA is often said to have an “AND” plane followed by an “OR” plane. In practice, either NAND or NOR gates are used, with the resulting PLA said to be a NAND/NAND or a NOR/NOR device. The next slide shows a full adder implemented using a NAND/NAND PLA. Note that, since the full adder does not require the minterm A·B ·C, this minterm is not included in the “AND” plane of the PLA. Note also that the PLA can implement a function in POS form directly, without reducing the function to minterm form. This often leads to opportunities for minimizing the area of a PLA. Also, a PLA can implement additional functions of the same set of variables simply by adding another logic gate to the “OR” plane. 95 The PLA is an efficient device for the implementation of several functions of the same set of variables. AND plane t t t t t t t t t t t t t t ❣ ✁❆ ✁ ❆ ✁ ❆ ❣ ✁❆ ✁ ❆ ✁ ❆ A B AB̄C AB̄ C̄ t t t ABC AB C̄ t t OR plane t ĀBC ĀB C̄ ❣ ✁❆ ✁ ❆ ✁ ❆ t ĀB̄C ✥ ❣ t ✦ ✥ t ❣ t ✦ ✥ ❣ ✦ ✥ ❣ ✦ ✥ ❣ ✦ ✥ ❣ t t t t ✦ ✥ ❣ t ✦ ✧ ✧ ❣ ✦ ❣ ✦ S C 96 C+ Sequential Logic Sequential logic differs from combinational logic in that the output of the logic device is dependent not only on the present inputs to the device, but also on past inputs; i.e., the output of a sequential logic device depends on its present internal state and the present inputs. This implies that a sequential logic device has some kind of memory of at least part of its “history” (i.e., its previous inputs). A simple memory device can be constructed from combinational devices with which we are already familiar. By a memory device, we mean a device which can remember if a signal of logic level 0 or 1 has been connected to one of its inputs, and can make this fact available at an output. A very simple, but still useful, memory device can be constructed from a simple OR gate, as shown: s A Q In this memory device, if A and Q are initially at logic 0, then Q remains at logic 0. However if the single input A ever becomes a logic 1, then the output Q will be logic 1 ever after, regardless of any further changes in the input at A. In this simple memory, the output is a function of the state of the memory element only; after the memory is “written” then it cannot be changed back. However, it can be “read.” Such a device could be used as a simple read only memory, which could be “programmed” only once. 97 Often a state table or timing diagram is used to describe the behaviour of a sequential device. Following is both a state table and a timing diagram for this simple memory shown previously. The state table shows the state which the device enters after an input (the “next state”), for all possible states and inputs. For this device, the output is the value stored in the memory. State table Present State Input Next State Output Qn A Qn+1 0 0 0 0 A 0 1 1 1 0 1 1 1 1 1 1 1 Q time → Note that the output of the memory is used as one of the inputs; this is called feedback and is characteristic of programmable memory devices. 
(Without feedback, a “permanent” electronic memory device would not be possible.) The use of feedback can introduce problems which are not found in strictly combinational circuits. In particular, it is possible to inadvertently construct devices for which the output is not determined by the inputs, and for which it is not possible to predict the output. A simple example is an inverter with its input connected to its output. Such a device is logically inconsistent; in a physical implementation the device would probably either oscillate from 1 to 0 to 1 · · · or remain at an intermediate value between logic 0 and logic 1, producing an invalid and erroneous output. 98 The R-S latch More complicated, stable, memory elements could be constructed using simple logic gates. In particular, simple, alterable memory cells can be readily constructed. One basic (but not often used in this form) memory device is the following, called an RS (reset-set) latch, or flip flop. It is the most basic of all the class of circuits which are called latches or flip flops. A logic diagram for this device is shown in the following, together with its circuit symbol and state table.. S R Qn+1 Qn+1 0 0 0 1 1 0 Qn 0 1 Qn 1 0 1 1 0 0 ❢r R Q ❍❍ ✟✟ ❍❍ ✟ ✟✟❍❍ ✟ ❍ ✟ ❢r S R Q S Q Q We can analyze this circuit to determine all possible outputs Q and Q for all inputs to R and S; e.g., suppose we raise S to logic 1, with Q at logic 0. Then Q must be 0, (the output of a NOR gate is 0 if any input is 1), and Q must be 1. If S is returned to 0, then Q remains 0 and Q remains 1; i.e., the RS latch “remember” if S was set to 1, while R is 0. If R is raised to logic 1 while S is at logic 0, then Q is set to logic 0, and Q is set to logic 1; i.e., the latch is reset. If both R and S are raised to logic 1, then both Q and Q will be at logic 0. This output is inconsistent with the identification of Q and Q as the two outputs, and therefore should be avoided. 99 Race conditions Clearly, setting both R and S to 1 should be avoided to prevent logical inconsistency. However, a far more serious problem occurs if R and S change from logic 1 to logic 0 simultaneously. This situation is called a race condition. If both R and S are at logic 1, then Q and Q are at logic 0. When R and S are both set to 0, then both Q and Q should switch to logic 1. However, when they switch to logic 1, they should cause a switch back to logic 0 again, because of the logic 1 input to each NOR gate. If both NOR gates were identical, this would occur over and over again, indefinitely — an oscillation of the outputs Q and Q from state 0 to 1 and back, with a period depending on the time delay for the NOR gate. In practice, one gate is a little faster than the other, and the final outcome depends on the relative speeds of the two gates. other. However, the final outcome cannot be predicted. 100 Clocked latches There are three control signals often associated with flip flops; they are the clock or enable signal; the preset, and the clear signal. The clock signal is ANDed with the inputs R and S, so that these signals can reach the flip-flop only when the clock pulse is 1; all other times the inputs to the inputs to the flip-flop are 0, and it retains its previous value. The clock input is used for several purposes; it is used to “capture” data which is available for only a short time; and it is used to synchronize several flip-flops so they can all operate simultaneously, or synchronously. 
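The behaviour of the basic RS latch described above — hold, set, reset, and the disallowed R = S = 1 combination — can be sketched by iterating the two cross-coupled NOR equations until the outputs settle. This is only an illustrative model of our own (real latches settle through gate delays, and the race condition cannot be captured this simply).

```python
def nor(a, b):
    return 1 - (a | b)

def rs_latch(r, s, q, q_bar):
    """Iterate the cross-coupled NOR equations until the outputs stop changing."""
    for _ in range(10):                       # a few iterations are enough to settle
        new_q, new_q_bar = nor(r, q_bar), nor(s, q)
        if (new_q, new_q_bar) == (q, q_bar):
            break
        q, q_bar = new_q, new_q_bar
    return q, q_bar

q, q_bar = 0, 1                               # latch initially reset
q, q_bar = rs_latch(0, 1, q, q_bar)           # S = 1: set
print(q, q_bar)                               # 1 0
q, q_bar = rs_latch(0, 0, q, q_bar)           # S back to 0: the value is remembered
print(q, q_bar)                               # 1 0
q, q_bar = rs_latch(1, 0, q, q_bar)           # R = 1: reset
print(q, q_bar)                               # 0 1
print(rs_latch(1, 1, q, q_bar))               # R = S = 1: both outputs forced to 0 (disallowed)
```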
The following figure shows a circuit diagram for a clocked RS flip-flop, together with its circuit symbol. Note the special symbol, similar to an arrowhead, which denotes the clock input. ✜ R clock S r ❢r ✢ ❍❍ ✟ ❍❍ ✟✟ ✟ ✟ ❍❍ ❍ ✟✟ ✜ ❢r ✢ 101 Q R Q > S Q Q Asynchronous preset and clear The preset and clear signals are used to set the state of the flipflop regardless of the state of the clock input. Because they are not synchronized by the clock pulse, they are said to be asynchronous. The following figure shows how a simple clocked RS flip flop with preset and clear inputs could be constructed from simple logic gates. Clear ✜ Clear R clock S r ❢r ✢ ❍❍ ✟✟ ✟ ❍❍ ✟✟❍❍ ✟ ✟ ❍ ✜ ❢r ✢ Q R Q > S Q Q P reset P reset The preset and clear act as an unclocked RS flip-flop, and consequently a logic 1 should not be applied to both at the same time. A dual form of the RS flip flop, the RS flip flop can be implemented with NAND gates, as follows: R ✜ ❢r Q ❢r Q ✢ ❍❍ ✟ ✟ ❍❍ ✟ ✟✟❍❍ ❍ ✟✟ ✜ ✢ S Clock inputs, together with preset and clear inputs could similarly be provided for this device. Since the inputs to this device are inverted, the preset and clear inputs would also be inverted. 102 The D Latch and the D flip-flop It is possible to create a latch which has no race condition, simply by providing only one input to a RS latch, and generating an inverted signal to present to the other terminal of the latch. In this case, the S and R inputs are always inverted with respect to each other, and no race condition can occur. A circuit for a D latch follows: D clock r ✜ ❍ ❍ ❍❢ ✟ ✟✟ ✝✆ r ❢r ✢ ❍❍ ✟✟ ❍❍ ✟ ✟✟❍❍ ✟ ❍ ✟ ✜ ❢r ✢ Q D Q > Q Q The D latch is used to capture, or “latch” the logic level which is present on the Data line when the clock input is high. If the data on the D line changes state while the clock pulse is high, then the output, Q, follows the input, D. This effect can be seen in the timing diagram in the next slide. The D flip-flop, while a slightly more complicated circuit, performs a function very similar to the D latch. In the case of the D flip-flop, however, the rising edge of the clock pulse is used to “capture” the input to the flip flop. This device is very useful when it is necessary to “capture” a logic level on a line which is very rapidly varying. 103 The following figureshows a timing diagram for a D-type flip-flop. This type of device is said to be “edge triggered” — either rising edge triggered (i.e. a 0–1 transition) or falling edge triggered (i.e., a 1–0 transition) devices are available. CLOCK D Q time → (a) The D latch (b) The D flip flop Both the D latch and D flip-flop have the following truth table: P reset Clear Clock D Q Q 0 1 x x 1 0 1 0 x x 0 1 0 0 x x 1 1 1 1 ↑ or 1 0 0 1 1 1 ↑ or 1 1 1 0 1 1 0 X Q0 Q0 The symbol ↑ means a leading edge, or 0 − 1 transition as the clock input to the flip flop. For a D latch, it would be the level 1. 104 The JK flip-flop The JK flip flop is the most versatile flip-flop, and the most commonly used flip flop when discrete devices are used to implement arbitrary state machines. Like the RS flip-flop, it has two data inputs, J and K, and a clock input. It has no undefined states or race condition, however. It is always behaves like it is edge triggered; normally on the falling edge. The JK flip-flop has the following characteristics: 1. If one input (J or K) is at logic 0, and the other is at logic 1, then the output is set or reset (by J and K respectively), just like the RS flip-flop, but on the (falling) clock edge. 2. 
If both inputs are 0, then it remains in the same state as it was before the clock pulse occurred; again like the RS flip flop. 3. If both inputs are high, however the flip-flop changes state whenever the (falling) edge of a clock pulse occurs; i.e., the clock pulse toggles the flip-flop. 105 There are two basic types of JK flip-flops. The first type is basically an RS flip-flop with its outputs Q and Q ANDed together with J and K respectively. This type of JK flip-flop has no special name. Note that the connection between the outputs and the inputs to the AND gates determines the input conditions to R and S when J = K = 1. This connection is what causes the toggling, and eliminates the invalid condition which occurs in the RS flip flop. A simplified form of this flip-flop is shown in (a) below. The second type of JK flip-flop is called a master-slave flip flop. This consists of two RS flip flops arranged so that when the clock pulse enables the first, or master, latch, it disables the second, or slave, latch. When the clock changes state again (i.e., on its falling edge) the output of the master latch is transferred to the slave latch. Again, toggling is accomplished by the connection of the output with the input AND gates. An example of this type of flip-flop is shown in (b). The circuit symbol for a JK flip flop is shown in (c). Master J Clock K ✏ ✑ > ✏ S Q ✑ R Q q✄✂ ✄✂ q J K .. . ✏ ✏ ❝ ❝q ✑ ✑ ❍❍ ✟✟ q q ❍ ❍ ✟❝ ✟ ✟ ❍ ✟✟ ❍❍ ✏ ✏ ❝q ❝ ✑ ✑ .. . (b) (a) 106 Slave ✏ ✏ ❝ ❝q Q ✑ ✑ ❍❍ ✟✟ ✟ ❍ ✟✟ ❍❍ ✏ ✏ ❝q Q ❝ ✑ ✑ J Q > K Q (c) The T flip flop This type of flip-flop is a simplified version of the JK flip-flop. It is not usually found as an IC chip by itself, but is used in many kinds of circuits, especially counter and dividers. Its only function is that it toggles itself with every clock pulse (on either the leading edge, on the trailing edge) it can be constructed from the RS flip-flop as shown below. R Q r✄✂ Q ✄ r ✂ >T S T Q (b) (a) This flip flop is normally set, or “loaded” with the preset and clear inputs. It can be used to obtain an output pulse train with a frequency of half that of the clock pulse train, as seen from the timing diagram. In this example, the T flip flop is triggered on the falling edge of the clock pulse. Several T flip-flops are often connected together to form a “divide by N” counter, where N is usually a power of 2. 107 Data registers: The simplest type of register is a data register, which is used for the temporary storage of a “word” of data. In its simplest form, it consists of a set of N D flip flops, all sharing a common clock. All of the digits in the N bit data word are connected to the data register by an N line “data bus”. Following is a four bit data register, implemented with four D flip flops. I0 O0 ❄ D Q ✻ I1 ❄ D > r Q ✻ I2 O2 ❄ D > Q Clock O1 Q ✻ I3 ❄ D > Q r Q > Q r O3 Q r The data register is said to be a synchronous device, because all the flip flops change state at the same time (they share a common clock input). 108 ✻ Shift registers Another common form of register used in computers and in many other types of logic circuits is a shift register. It is simply a set of flip flops (usually D latches or RS flip-flops) connected together so that the output of one becomes the input of the next, and so on in series. It is called a shift register because the data is shifted through the register by one bit position on each clock pulse. Following is a four bit shift register, implemented with D flip flops. 
in D Q D > r D > Q Clock Q Q D > Q > Q r r Q Q r On the leading edge of the first clock pulse, the signal in on the D input is latched in the first flip flop. On the leading edge of the next clock pulse, the contents of the first flip-flop is stored in the second flip-flop, and the signal which is present at the DATA input is stored is the first flip-flop, etc. Because the data is entered one bit at a time, this called a serial-in shift register. Since there is only one output, and data leaves the shift register one bit at a time, then it is also a serial out shift register. (Shift registers are named by their method of input and output; either serial or parallel.) 109 out Parallel input can be provided through the use of the preset and clear inputs to the flip-flop. The parallel loading of the flip-flop can be synchronous (i.e., occurs with the clock pulse) or asynchronous (independent of the clock pulse) depending on the design of the shift register. Parallel output can be obtained from the outputs of each flip-flop as shown. O0 ✻ in D Q > r D > Q Clock r Q O1 O2 O3 ✻ ✻ ✻ r D Q > Q r D > Q r r Q Q r Communication between a computer and a peripheral device is often done serially, while computation in the computer itself is usually performed with parallel logic circuitry. A shift register can be used to convert information from serial form to parallel form, and vice versa. Many different kinds of shift registers can be constructed, depending upon the particular function required. 110 Counters — weighted coding of binary numbers A simple binary counter can be made using T flip flops. The flipflops are attached to each other in a way so that the output of one acts as the clock for the next, and so on. In this case, the position of the flip-flop in the chain determines its weight; i.e., for a binary counter, the “power of two” it corresponds to. A 3-bit (modulo 8) binary counter could be configured with T flip-flops as shown: O0 >T O1 ✻ r ❅ ❅ ❅ >T O2 ✻ r ❅ ❅ ❅ ✻ >T Following is a timing diagram for this circuit: .. . 0 .. . 1 .. . 2 .. . 3 .. . 4 .. . 5 .. . 6 .. . 7 .. . 8 .. . 9 .. . . . 10 .. 11 .. CLOCK O0 O1 O2 Note that is this counter, each flip-flops changes state on the falling edge of the pulse from the previous flip-flop. Therefore there will be a slight time delay, due to the propagation delay of the flip-flops between the time one flip-flop changes state and the time the next one changes state. i.e., the change of state ripples through the counter, and these counters are therefore called ripple counters. 111 It is possible to design counters which will count up, count down, and which can be preset to any desired number. Counters can also be made which count in BCD, base 12 or any other number base. A count down counter can be made by connecting the Q output to the clock input in the previous counter. Using the preset and clear inputs, and by gating the output of each T flip flop with another logic level, using AND gates (say logic 0 for counting down, logic 1 for counting up) then a presetable up-down binary counter can be constructed. 
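A behavioural sketch of the ripple counter just described: each T flip-flop toggles on the falling edge of its own clock input, and the Q output of each stage clocks the next stage. The class and helper names are our own illustration, not part of the course material.

```python
class TFlipFlop:
    """Toggle flip-flop triggered on the falling (1 -> 0) edge of its clock input."""
    def __init__(self):
        self.q = 0
        self.last_clock = 0

    def clock(self, level):
        if self.last_clock == 1 and level == 0:   # falling edge detected
            self.q ^= 1
        self.last_clock = level
        return self.q

# Three stages give a modulo-8 ripple counter; stage i is clocked by stage i-1's Q output.
stages = [TFlipFlop() for _ in range(3)]

def apply_clock(level):
    signal = level
    for ff in stages:
        signal = ff.clock(signal)      # each Q drives the next stage's clock input
    return [ff.q for ff in stages]     # [O0, O1, O2], least significant bit first

for pulse in range(10):
    apply_clock(1)                     # clock goes high ...
    o = apply_clock(0)                 # ... and the falling edge advances the count
    print(pulse + 1, o, "count =", o[0] + 2 * o[1] + 4 * o[2])
```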
The following figure shows an up-down counter, without preset or clear: count enable count up/down 1 = up 0 = down Clock q q ✄✂ O0 q ❍❍ ❞ ✟✟ J Q > K Q q q q O1 q q ✄✂ ✒✑ ✒✑ J Q > K Q q q q q q ✄✂ ✒✑ ✒✑ 112 O2 J Q > K Q q q q O3 q q ✄✂ ✒✑ ✒✑ J Q > K Q q Synchronous counters The counters shown previously have been “asynchronous counters”; so called because the flip flops do not all change state at the same time, but change as a result of a previous output. The output of one flip flop is the input to the next; the state changes consequently “ripple through” the flip flops, requiring a time proportional to the length of the counter. It is possible to design synchronous counters, using JK flip flops, where all flip flops change state at the same time; i.e., the clock pulse is presented to each JK flip flop at the same time. This can be easily done by noting that, for a binary counter, any given digit changes its value (from 1 to 0 or from 0 to 1) whenever all the previous digits have a value of 1. Following is an example of a 4-bit binary synchronous counter. O0 O1 r r ✁ Clock r J Q > K r r r ✁ Q r J Q K r ✁ r ✣✢ r ✁ > Q r 113 O2 J r Q > K O3 Q ✣✢ r ✁ J Q > K Q State machines A “state machine” is a device in which the output depends in some systematic way on variables other than the immediate inputs to the device. These “other variables” are called the state variables for the machine, and depend on the history of the machine. For example, in a counter, the state variables are the values stored in the flip flops. For a binary machine, with n possible state variables, there may be as many as 2n possible states, with each state corresponding to a unique assignment of values to the state variables. The behavior of a state machine can be completely described by a “state table,” or equivalently, a “state diagram.” The next slide shows a state table which describes the operation of a modulo 8 counter; the counter has 8 states, denoted S0 to S7, a single input, the clock input, and 3 output digits, O2, O1 and O0 . In this state table, the entries where the clock input is 0 have been expressed on a single line; in a full state table, this line would actually correspond to 8 lines. The essence of a state table can be captured in a state diagram. A state diagram is a graph with labelled nodes and arcs; the nodes are the states (denoted by circles, labelled with the state), and the arcs are the possible transitions between states. The arcs are labelled with the input which causes the transition, and the output which results from the input. The next slide also shows a state diagram for a modulo 8 counter. 114 input present next outputs state state O2 O1 O0 0 Sx no change no change 1 S0 S1 0 0 1 1 S1 S2 0 1 0 1 S2 S3 0 1 1 1 S3 S4 1 0 0 1 S4 S5 1 0 1 1 S5 S6 1 1 0 1 S6 S7 1 1 1 1 S7 S0 0 0 0 0/000 ✤✜ ✬✩ ✠ 1/000 1/001 ✬✩ ✬✩ 0 ❛ ✜ ✤ ❘ ✠ ✦ ✯ ❥ ✦ ✫✪ ❛ S 0/111 0/110 S7 ✣ ✫✪ ✕☞ 1/111 ☞ ✬✩ ✤ ❘ S6 S1 0/001 ✢ ✫✪ ▲ 1/010 ❯▲ ✬✩ ✜ ✠ S2 0/010 ✣ ✢ ✫✪ ✫✪ 0/100 ❑ ☞1/011 1/110 ▲ ✤✜ ✬✩ ✬✩ ☛☞ ▲ ✜ ✤ ❘ ✠ 0/101 0/011 3 5 ❛ ✬✩ ✠ ✦ ❨❛ ✦ ✙ ✣ ✢ ✫✪ ✫✪ S S 1/101 S4 1/100 ✫✪ 115 Designing a state machine Typically, when we design a state machine, we first identify the required states (i.e., identify what information must be remembered), and then consider how to go from state to state, and, finally, what output to produce (i.e., identify state transitions and outputs). The following examples show how a state machine can be obtained from a written description of the device. 
Example — the serial adder The serial adder accepts as input two serial strings of digits of arbitrary length, starting with the low order bits, and produces the sum of the two bit streams as its output. (The input bit streams could come from, say, two shift registers clocked simultaneously.) This device can be easily described as a state machine. We first decide what must be “remembered” — in this case, it is easy; all that must be remembered is whether or not there is a carry to be added into the next highest order bits. Therefore, the device will have two states, carry = 0 (C0), and carry = 1 (C1), as shown below. We next identify the transitions between the states, and the necessary outputs, also shown in the state diagram. 00/0 01/1 10/1 ✬✩ 11/0 ✤ ❘ ✲ 10/0 ✢ 11/1 ✫✪ C0 ✛ ✣ ✫✪ ✬✩ ✜ 01/0 ✠ C1 00/1 116 input/output The corresponding state table, containing exactly the same information as the state diagram is as follows: Present state Inputs Next state Output C0 0 0 C0 0 C0 0 1 C0 1 C0 1 0 C0 1 C0 1 1 C1 0 C1 0 0 C0 1 C1 0 1 C1 0 C1 1 0 C1 0 C1 1 1 C1 1 117 Example — a sequence detector A state machine is required which outputs a logic 1 whenever the input sequence 0101 is detected, and which outputs a otherwise. The input is supplied serially, one bit at a time. The following is an example input sequence and output sequence: input 0 0 1 0 1 0 1 1 0 0 0 1 0 1 0 0 output 0 0 0 0 1 0 1 0 0 0 0 0 0 1 0 0 This state machine can be designed in a straightforward way. Assume that the machine is initially in some state, say, state A. If a 0 is input, then this is may be the start of the required sequence, so the machine should output a 0 and go to the next state, state B. If a 1 is input, then this is certainly not the start of the required sequence, so the machine should output a 0 and stay in state A. When the machine is in state A, therefore, it has detected no digits of the required input sequence. When the machine is in state B, it has detected exactly one digit (the first 0) of the required input sequence. 1/0 ✤✜ ✬✩ ✠ A 0/0 ✫✪ ✲ ✬✩ B ✫✪ 118 If the machine is in state B and a 0 is input, then two consecutive 0’s must have been input; this input is clearly not the second digit of the required sequence, but it may be the first digit of the sequence. Therefore, the machine should stay in state B and output a 0. If a 1 is input while the machine is in state B, then the first two digits of the required sequence have been detected, and the machine should go to the next state, state C, and output a 0. When the machine is in state C, it has detected exactly two digits (0 1) from the required sequence. 1/0 0/0 ✤✜ ✤✜ ✬✩ ✠ 0/0 A ✫✪ ✲ ✬✩ ✠ 1/0 B ✫✪ ✲ ✬✩ C ✫✪ If the machine is in state C and a 0 is input, then three digits of the required sequence have been input, so the machine should go to its next state, state D, and output a 0. If a 1 is input when the machine is in state C, then this input is clearly no part of the required sequence, so the machine should start over in state A and output a 0. 1/0 0/0 ✤✜ ✤✜ ✬✩ ✠ A 0/0 ✫✪ ■ ❅ ❅ ❅ ❅ ✲ ✬✩ ✠ 1/0 B ✫✪ 1/0 119 ✲ 0/0 ✬✩ C ✫✪ ✲ ✬✩ D ✫✪ If the machine is in state D and a 0 is input, then this is not the required input (the input has been 0 1 0 0), but this may be the first digit of another sequence, so the machine should go to state B and output a 0. If a 1 is input while the machine is in state D, then the required sequence (0 1 0 1) has been detected, so a 1 should be output. 
Moreover, the last two digits input may be the first two digits of another sequence, so the machine should go to state C. This completes the state diagram, as shown in the following figure: 1/0 0/0 ✤✜ ✤✜ ✬✩ ✠ 0/0 A ✫✪ ■ ❅ ❅ ❅ ❅ 0/0 ✲ ✬✩ ✠✠ 1/0 B ✫✪ 1/0 ✲ ❅ ❅ 0/0 ❅❅✬✩ ✬✩ ✲ C ✛ ✫✪ 1/1 D ✫✪ The following is a state table corresponding to the state diagram: Present Input Next Output State State A 0 B 0 A 1 A 0 B 0 B 0 B 1 C 0 C 0 D 0 C 1 A 0 D 0 B 0 D 1 C 1 120 Algorithmic State Machines: An interesting way of specifying a state machine, equivalent to the use of a state table or a state diagram, is by the use of a “flowchart”; actually, a particular type of flowchart called an ASM (algorithmic state machine) diagram. In an ASM diagram, or flowchart, the “algorithm” which is implemented by the state machine is presented in a clear fashion. The following figure shows a flowchart for the controller for the traffic light at an intersection where there is both East/West and North/South traffic. ❄ NS green EW red (50 seconds) S↓ ❄ N↑ NS yellow EW red (10 seconds) E→ ←W ❄ NS red EW green (50 seconds) N↑ ❄ NS red EW yellow (10 seconds) ❄ 121 We call the rectangular blocks “action blocks,” because they specify some action. Note that, although not specifically shown, “time” is implicitly an input in this flowchart; moreover, each of the individual blocks does not necessarily correspond to an individual “state” of the system. Since the blocks specify different time periods, they imply some method to measure time, for example, by counting clock pulses. A more explicit flowchart is shown in the following slide, which assumes that a clock pulse occurs every 10 seconds. The state diagram shown in the figure is equivalent to the flowchart; note, however, that the flowchart looks simpler, in that the clock input is implicit, and only transitions out of the state are explicitly shown. The square blocks correspond to the states of the system. Transitions are specified by the arrows in the flowchart. The output is specified in the blocks, and, in this example, the clock input is explicit. Recall that, in the first flowchart, each block did not necessarily correspond to an individual state of the traffic light controller. Either flow chart could be considered an ASM diagram, but we will prefer the style in which each block corresponds to a unique state. 122 CLOCK/COLOR NS green EW red (10 seconds) ✬✩ 0/GR ✤ ❘ 1/GR NS red EW yellow (10 seconds) ✛ 1/GR ✬✩ 0/GR ✤ ❘ ❄ ❄ NS red EW green (10 seconds) NS green EW red (10 seconds) B 1/GR ✬✩ 0/GR ✤ ❘ ❄ ❄ NS red EW green (10 seconds) C 1/GR ✬✩ 0/GR ✤ ❘ ❄ ❄ NS red EW green (10 seconds) D 1/GR ✬✩ 0/GR ✤ ❘ ❄ ❄ NS red EW green (10 seconds) 1/RG ✬✩ ✜ ✠ 0/RG J 1/RG ✬✩ ✜ ✠ 0/RG I E 1/RG ✬✩ ✜ ✠ 0/RG H ✣ ✢ ✫✪ ✫✪ ✻ ✻ 1/YR ✬✩ 0/YR ✤ ❘ ❄ ✲ K ✣ ✢ ✫✪ ✫✪ ✻ ✻ NS green EW red (10 seconds) ✬✩ ✜ ✠ 0/RG ✣ ✢ ✫✪ ✫✪ ✻ ✻ NS green EW red (10 seconds) 1/RY ✣ ✢ ✫✪ ✫✪ ✻ ✻ NS green EW red (10 seconds) L ✣ ✢ ✫✪ ✫✪ ✻ ✻ NS yellow EW red (10 seconds) ✛ A ✬✩ ✜ ✠ 0/RY ❄ NS red EW green (10 seconds) F ✲ 1/RG ✬✩ ✜ ✠ 0/RG G 1/RG ✣ ✢ ✫✪ ✫✪ 123 Consider another traffic light example, this time with external inputs, shown in the next slide. In this example, the East/West traffic, is less frequent than North/South traffic, and if there is no East/West traffic, the North/South light should remain green. Eastbound or westbound traffic is sensed by traffic sensors, labeled ET and WT respectively, as shown in the diagram. 
Again, this flowchart has implicit time, or clock, inputs, and introduces decision blocks (the diamond shaped blocks) for the traffic sensor inputs. The decision blocks cause a transition from one block to one of several others, depending on whether or not some condition is met. As before, this flowchart does not correspond directly to a state diagram, but can readily be expanded to one which does, as shown in the slide following, where each rectangular block corresponds to a 10 second interval. This structure, consisting of arrows, action blocks, and decision blocks is sufficiently general to specify any algorithm. These flowcharts are often called ASM diagrams, when they refer to control devices in particular. They are (if carefully drawn) equivalent to state diagrams, where the rectangular blocks correspond to the states of the state machine (or, as in some of the examples, they correspond to blocks of states which can readily be expanded). The decision boxes and arrows correspond to transitions between the states, and the clock input is implicit in the ordering of the blocks. 124 ✛ ❄ NS green EW red (20 seconds) ✲❄ ❝ ❄ NS green EW red (10 seconds) S↓ N↑ no ✛ no ✛ WT ET ❄ ❅ ❅ ET ❅ ❅ ❄ ❅ ❅ ❅ ❅ yes ❝ yes WT ❅ ✲ ❄ ❅ ❅ ❅ ❄ NS yellow EW red (10 seconds) N↑ ❄ NS red EW green (20 seconds) ❄ NS red EW yellow (10 seconds) ❄ 125 ✲ ET WT/color ✛ ✛ ❄ ✬✩ ❄ NS green EW red (10 seconds) A ✫✪ X/GR ❄ ✬✩ ❄ NS green EW red (10 seconds) B ✫✪ ✲❄ ❜ X/GR ❄ NS green EW red (10 seconds) no ✛ no ✛ 00/GR ❄ ❅ ❅ ET ❅ ✣ ✫✪ X/GR ❄ ✬✩ ❄ NS yellow EW red (10 seconds) D ✫✪ X/RG ❄ ✬✩ ❄ NS red EW green (10 seconds) E ✫✪ X/RG ❄ ✬✩ ❄ NS red EW green (10 seconds) F ✫✪ X/RY ❄ ✬✩ ❄ NS red EW yellow (10 seconds) ❄ C 01/YR 10/YR 11/YR ❅ ❄ ❅ ❅ ❅ ❅ yes yes ❜ WT ❅ ✲ ❄ ❅ ❅ ❅ ❄ ✬✩ ✤ ❘ G ✲ 126 ✫✪ ❄ ✲ Implementation of state machines: There are a number of ways in which state machines described by a state diagram or state table or ASM diagram can be implemented. The actual “details” of the implementation depend on a number of things such as the type of flip-flop used to hold the state information (e.g. D ff’s or JK ff’s), the way in which the next-state logic is to be implemented (e.g., using simple logic gates, MUX’s, PLA’s, etc.), and the way in which the state information is stored in the flip flops (this is usually referred to as the “state coding”). Although these details determine the actual physical implementation of the device, the method used to arrive at this implementation is quite general, and can be summarized as follows: 1. Construct the state table (or state diagram, or a complete ASM diagram) for the device, and ensure that it correctly describes the required device. 2. Assign binary values to the states, to be encoded using flipflops. 3. Design the logic necessary to produce the appropriate values for the flop flops to enter the next state, using the present state and input values as inputs to this logic. Also, design the logic to produce the appropriate outputs. 127 For the second step, two commonly used codings are: 1. a binary weighted coding, in which each state is specified by a binary number. For N states this coding requires [log2(N )] flipflops. [log2(N )] means “the integer equal to or the next integer greater than log2(N )”. 2. a unary coding, often called a “one hot” coding, in which each state is assigned to a flip flop. A value 1 stored in the flip flop means that the device is in the state corresponding to that particular flip flop. 
Since a device can be in only one state at any time, there will always be exactly one flip flop with a value 1 stored in it. Although this coding requires more flip flops to store the state information (N , rather than log2(N )), the nextstate logic is usually much simpler to design, because only two flip flops need to have their values changed; the one corresponding to the present state, and the one corresponding to the next state. 128 Step (3) is not difficult, since the logic required is only simple combinational logic, but it can be quite tedious, especially for a binary weighted coding since several (possibly all) flip flops may have to change their values to produce the appropriate next-state. (For example, in a 3 bit counter, the change from state 011 to state 100 requires that all flip flops change state at the same time). The type of flip flop used also affects the design effort. D flip flops have only one input, which is equal to the value to be stored in the flip flop. JK flip flops have two inputs to be controlled, each by a separate block of logic. This means that the design effort is easier for D flip flops (because the next-state logic must only produce one input for each flip flop), but the number of logic gates may be fewer for a JK flip flop (because there are more ways to change the state). Since we are more interested in reducing the design effort, we will use D flip flops for our designs. The next slide shows an ASM diagram for a device with four states, A,B,C,D, one input, Y, and one output, Z, together with the corresponding state diagram. The state table is also shown. The outputs from the device are specified on the arrows in the ASM diagram, and in the state diagram. 129 0/0 ✲ ❝✛ ✻ ✤✜ ✻ ❄ ✬✩ ✠ A A ✛ ✫✪ ❄ ✛ ✚❩ Y =0 ✚ ❩ ❩ Z = 0 ✚✚ ❩ ❩ ✚ ❩ ✚ ❩ ✚ ❩✚ 1/0 ✲ ❝❄Y =1 Z=0 ✻ ❄ B ✲ Z=0 C 1/0 Z=0 ❄ ✬✩ C ✫✪ ❄ ✬✩ D ✛ ✫✪ X/0 ❄ ❄ ✚❩ ✚ ❩ ✚ ❩ ✚ ❩ ❩ ✚ ❩ ✚ ❩ ✚ ❩✚ B X/0 ❄ Y =1 Z=0 ❄ ✬✩ D ✫✪ Y =0 Z = 1✲ 130 0/1 State Table Present State Input Next State Output A 0 A 0 A 1 B 0 B 0 C 0 B 1 C 0 C 0 D 0 C 1 D 0 D 0 A 1 D 1 B 0 We will first design a state machine corresponding to this state table using a binary weighted coding for the states. (Later we will design the same state machine using the unary, or “one hot” state coding.) The design requires log2(4) = 2 flip flops; we will use D flip flops as memory elements. We choose the following coding for the states: State F F1 F F0 A 0 0 B 0 1 C 1 0 D 1 1 (This choice of state coding is arbitrary; the problem of finding a state coding which requires a minimum number of logic gates for its implementation is NP-hard). 131 We next reconstruct the state table, including the values for each flip flop, as follows: Present State Input D FF inputs required to Output produce next state for QF F1 QF F0 A B C D Y DF F1 DF F0 Z 0 0 0 A 0 0 0 0 0 1 B 0 1 0 0 1 0 C 1 0 0 0 1 1 C 1 0 0 1 0 0 D 1 1 0 1 0 1 D 1 1 0 1 1 0 A 0 0 1 1 1 1 B 0 1 0 This table is, effectively, three truth tables, one for each of the D inputs to F F1 and F F0, and one for the output, Z. Note that there are three inputs to each truth table; namely, the outputs of F F1 and F F0, and Y. The required logic could be implemented in several ways; using simple logic gates, using three 4 line to 1 line MUX’s (or three 8 line to 1 line MUX’s), or using one 3 line to 8 line decoder, and several NAND gates. The implementation shown in the following slide uses the 3 line to 8 line decoder, and three NAND gates. 
(A PLA implementation is particularly attractive for state machines, because all the logic functions to be implemented are functions of the same set of input variables.) 132 O7 O6 O5 O4 O3 O2 O1 O0 S0 S1 S2 ❡ ❡ ❡ ❡ ❡ ❡ ❡ ❡ ✑ ✑ ✑ ✑ ✑ ✑ ✑ ✑ ✑ ✑ ✑ ✑ ✘ ✑ ✘✘✘ ✘ ✘ ✏ ✘✘ ✏✏ t ✘✘ ✘ ✏✏ ❜ ✏ ★ ❜✏✏ ✏ ★ ❜ ✏ t ❜ ❍❍ ❜ ★★ ❜ ❍❍ ★ ❍❍ ❍ ❜❜ ★ ❍★ ❍❍ ❜ ❛❛★❍❍❍ ❍❍❜ ★❛❛ ❍ ❍ ★ ❛❛ ❍❍ ❍ ❛❛ ❛❛ Y clock ❍❍ ❍❍ ❥ ✟✟ ✟✟ ✩ ❣ ✧ ✧ ✧ ✧ ✧ ✪ ✞ ✧✝ ✧ ✧ ✧ ✧ t ✩ ❣ ✪ D > Q F F0 D > Z Q F F1 ✞ ✝ ✞ ✝ The preceding technique would be quite tedious for the one hot state coding, since there would be four flip flops, and the state table would consequently require 32 lines. For the one hot state coding, we can consider each flip flop individually, and design the logic required to set it to 1 or 0, without concern for the other flip flops (except, of course, that one or more of them may potentially provide an input to the logic.) In this case, we can break up the state table into a separate truth table for the inputs required to produce each state; that is, we group together in separate tables the lines corresponding to each separate “next state.” For the previous example, we have the following four tables: 133 For State A Present State Input, Y Next State Output A 0 A 0 D 0 A 1 For State B Present State Input, Y Next State Output A 1 B 0 D 1 B 0 For State C Present State Input, Y Next State Output B 0 C 0 B 1 C 0 For State D Present State Input, Y Next State Output C 0 D 0 C 1 D 0 The “Next State” column is, of course, not required in the tables. Each table can be used to design the logic required to set the corresponding flip flop. The following are directly from these tables: For F FA, DA = A · Y + D · Y = (A + D) · Y For F FB , DB = A · Y + D · Y = (A + D) · Y For F FC , DC = B · Y + B · Y = B For F FD , DD = C · Y + C · Y = C The output, Z, would be evaluated as Z = D · Y 134 With a little practice, these design equations can be obtained directly from a state diagram or ASM diagram. A corresponding circuit diagram would be as shown below. Note that, for the one hot coding, although more “next state” circuits must be designed, they are normally much simpler than for the binary coded state assignment. In fact, for a small device, the implementation effort may be much less for a “one hot” implementation than for an implementation using binary coded states. Yr ❍❍ ❞ r ✟✟ ✁ clock ✁ ✏ Z ✏ D Q ✑ > r ✁ r ✏ A D Q ✑ > ✄ ✁ r B D Q > C r Repeating the design equations from the previous page: DA = A · Y + D · Y = (A + D) · Y DB = A · Y + D · Y = (A + D) · Y DC = B · Y + B · Y = B DD = C · Y + C · Y = C Z =D·Y 135 D Q > Dr ✑ Of course, sometimes a simple solution to a design problem is apparent, without requiring much design effort. For example, if we wanted to design a device to detect the sequence 0101, say, and output a 1 when this sequence was detected, we could construct a state table for the device, and complete the design as in the previous examples. (Recall that we constructed a state diagram and state table for this device earlier). There is another, simpler solution, however, using a 4 bit serial in parallel out shift register (i.e., 4 D FF’s,) four comparators, and an AND gate, as shown below. (The comparator is the complement of the XOR function, and is often represented in a circuit as an XAND function). This simple design can be used to detect any four bit sequence, by simply changing the inputs to the comparators. 
✩ Z ✓✏ 0 input, X clock D Q > r r r ✓✏ 1 D Q > r r 136 ✓✏ 0 D Q > r ✓✏ 1 D Q > ✪ Structured implementation of state machines State machines are typically implemented in three ways; using individual logic gates, typically called a “random logic” implementation, using a PLA, and using memory as a “look-up” for the combinational logic. The PLA and memory (typically read-only memory) implementations are quite effective, because a state machine has a fixed (usually relatively small) number of inputs and outputs, and both those approaches can be readily automated. An implementation based on memory is often called a microcoded implementation. (This term is often reserved for the memory-based implementation of the control unit of a computer.) It is common to have a small “microcode engine” including simple functions like a counter and registers, and for the microcode itself to be “sequenced” by the counter. (In a sense, the counter is used to fetch microcode words from memory, and these microcode words control the external state changes and outputs.) 137 State machine models Mealy state machines Up to this point, we have implicitly considered only one model for a state machine; a model in which the outputs are a function of both the present state and the input. This model, shown pictorially below, is called the Mealy model for sequential devices. It is a general model for state machines, and assumes that there are two types of inputs; clock inputs and data inputs. The clock inputs cause the state transitions and “gate” the outputs, (so the outputs are really “pulse” outputs; i.e., they are valid only when the clock is asserted). The data transitions determine the values of next-states and outputs. Essentially, the clock inputs control the timing of the state transitions and outputs, while the data inputs determine their values. Primary inputs ❏❏ ❏❏ State values ❏❏ Combinational logic Primary outputs ❏❏ Memory next-state values Clock So far, the state diagrams we have drawn correspond to this model; we have labeled the transitions with the inputs which cause the transition, and the output corresponding to the transition. 138 Moore state machines Another, model for state machines is the Moore model, in which the outputs are associated with the states of the device. In the Moore machine, the outputs are stable for the full time the device is in a given state. (The outputs are said to be “level” outputs, and are valid even when the clock inputs are not asserted.) Again, there are two types of inputs, clock inputs and data inputs. In this case, however, the clock inputs only directly enable the state transitions. In this model, the transitions are functions of the present states and inputs, but the outputs are functions of the states only. Below is a pictorial representation of the Moore model of a state machine. (The Moore model describes state machines like the traffic light controllers we have seen as ASM diagrams in a very natural manner.) Primary inputs ❏❏ ✡✡ ❏❏ ✡✡ State values Next-state Combinational logic ❏❏ ✡✡ next-state values Memory Clock ❏❏ ✡✡ Output Combinational logic 139 ❏❏ ✡✡ Primary outputs The following figure shows a state diagram for the Mealy machine derived earlier which produces a 1 as output whenever the sequence 0101 is input. This machine has four states, and the outputs are associated with the inputs to the state machine. 
1/0 ✤✜ ✬✩ ✠ A 0/0 0/0 0/0 ✫✪ ■ ❅ ❅ ❅ ❅ ✤✜ ✲ ✬✩ ✠✠ 1/0 B ✫✪ 1/0 ✲ ❅ ❅ ✬✩ 0/0 ❅❅✬✩ ✲ C ✛ ✫✪ 1/1 D ✫✪ The next figure shows a state diagram for a Moore machine which performs the same function. ✬✩ ✠ A/0 0 ✫✪ ■ ❅ ❅ ■ ❅ ❅ ❅ ❅ ❅ ❅ ❅ 0 0 ✜ ✤ 1 ✜ ✤ ✲ ✬✩ ✠✠ B/0 1 ✫✪ 1 1 ❅ ❅ ❅ ✬✩ ✬✩ ❅ 0 ✲ C/0 ✲ D/0 ✧✫✪ ✧ ✧ ✧ 1 ✧ ✧✧✒ ✧ ✧ ✧✧ ✧ ✧ ✬✩ ✧ ✧ ✧ ✠ ✧ ✧ E/1 ✧ 0 ✫✪ ✫✪ Note that a state diagram for a Moore machine is labeled differently from a Mealy machine state diagram; the transitions are labeled only with the inputs which cause the transition, while the states are labeled with the corresponding outputs. (The output is a function of the state only, and does not depend directly on the input.) 140 Comparing the Mealy and Moore state diagrams, it is clear that they are very similar for states A, B, C and D. State E is a “new” state, because the output of 1 must be associated with some state. In fact, state E is equivalent to state C in the Mealy diagram — Mealy state C has been split into two Moore states, one (C) with output 0, and one (E) with output 1. They are equivalent in the sense that the outputs from both Mealy state C and Moore state E (and Moore state C, too) have the same next-states for the same inputs. Note that state C was the only Mealy state in which the incoming arcs (or arrows) correspond to different outputs. Moreover, the state which was “split” retained all transitions to the corresponding nextstates in the Mealy machine. (In this example, all of the other states are associated with only one output.) In general, from this observation, it is possible to convert any Mealy type machine into an equivalent Moore type machine, and vice-versa. First, we must define what we mean for two state machines to be equivalent. Two state machines are said to be equivalent if they produce exactly the same output for all inputs. Consequently, to derive an equivalent Moore machine from a Mealy machine, it must be possible to guarantee that the two machines produce the same output after any arbitrary input string has been input. This can be done by splitting all the Mealy states corresponding to different outputs, and ensuring that these states are connected to next-states which correspond to equivalent states in the original Mealy machine. 141 As a slightly more complex example, the Mealy machine specified by the following state table, where x is the single external input, and y is the output, and having a state diagram as shown below can be converted into a Moore machine as follows: Present State Next State Output, y, for x=0 x=1 x=0 x=1 A C B 0 0 B A D 1 0 C B A 1 1 D D C 1 0 1/0 ✬✩ 1/0 A ✛ ✫✪ ❅ 0/1 ■ ❅ ❅ ❅ ❅ ❅ ❅ ❅ ❅ ❅ ✲ ❅ ❅ ❅ ✬✩ ✬✩ ✜ ❘ ❅ ✠ 1/0 0/1 D C ✛ ✬✩ B ✛ 0/1 ✫✪ ✒✫✪ 0/0 ✢ ✫✪ 1/1 Each state with different output values associated with transitions into the state is split into states corresponding to each different output; e.g., state B has a transition from state A with an output of 0, and from state C with an output of 1. Therefore, State B is split into two states, B0, with an output of 0, and B1, with an output of 1. Every transition to B with output 0 goes to B0; every transition to B with an output 1 goes to B1. The next-states of B0 and B1 are exactly the same as for B. State D is split into two states D0 and D1, similarly. 142 The state table becomes the following, corresponding to the state diagram shown below. 
Present State Next State State x=0 x=1 output A C B0 1 B0 A D0 0 B1 A D0 1 C B1 A 0 D0 D1 C 0 D1 D1 C 1 Here we have added a column called state output, which is the output the device has while in a given state. The output no longer depends on the input, x. 1 ❅ ❅ ❅ ✬✩ ✬✩ ✬✩ ✬✩1 ❘ ❅ ✲ 1 ✛ C/0 D0 /0 B0 /0 A/1 ✛ ✫✪ ✫✪ 0 ✟✟ ❅ ✒✫✪ ✒✫✪ ■ ❅ ✟ ■ ❅ ❅ ■ ❅ ✟ ❅ ✟✟ ❅ ❅❅❅❅ 0 ✟ ✟❅ ❅ ❅ 0 ❅1 ✟ ❅ ❅ ✟ 1 ✟ ❅ ❅ ❅ ✟ ❅ ✟✟ ❅ ❅ ✟ 0✟✟ 1 ❅ 0 ❄ ❅ ✟ ❅ ✬✩ ✬✩ ✜ ❅ ✟ ✠ ✟ ❅ ✟ ✠ ❅ B /1 ✟ D1 /1 0 1 ✢ ✫✪ ✫✪ 143 We can see that the Moore machine accepts the same input sequences as the Mealy machine we started with, and produces the same output sequence. In addition, it produces the output 1 when started in state A, without having any input sequence. i.e., a Moore machine accepts a zero length sequence, called a null sequence, and produces an output (while in its initial state.) If we wish, we can add a new state A0 as the initial state, which produces a different output, say, 0, indicating that the machine is in its initial state. (There will be no transitions back into this initial state.) Note that, in general, any Mealy machine with N internal states and P outputs can be converted to a Moore machine with at most P × N + 1 states. (The Mealy machine and its corresponding Moore machine will be equivalent, in the sense that both will give exactly the same output for all possible input sequences.) 144 Present State Output Next State x=0 x=1 A0 0 C B0 A1 1 C B0 B0 0 A1 D0 B1 1 A1 D0 C 0 B1 A1 D0 0 D1 C D1 1 D1 C ✬✩ A0 /0 ❍❍ ❍❍ ✫✪ ❍ ❅ ❍❍ ❅ ❍ ❅ ❍❍0 ❅ ❍ ❍❍ ❅ 1 1 ❍ ❅ ❍❍ ❅ ❍ ❅ ❍❍ ❅ ❅ ❍ ❅ ✬✩ ❘ ❍ ❘ ❅ ✬✩ ✬✩1 ✬✩ ❘ ❅ ✲ 1 ✛ B0 /0 A1 /1 ✛ D0 /0 C/0 ✟ ✫✪ ✫✪ 0 ❅ ✒✫✪ ✟✒✫✪ ■ ❅ ✟ ■ ❅ ❅ ✟ ■ ❅ ❅ ✟✟ 0 ❅ ❅❅❅❅ ✟ ❅ ❅ ❅ 0 ✟✟ ❅ 1 ✟ ❅ ❅ 1 ❅ ✟ ❅ ❅ ❅ ✟✟ ❅ ✟ ❅ ✟ 0✟✟ 1 ❅ 0 ❅ ❄ ✟ ❅ ✬✩ ✜ ❅✬✩ ✟ ✠ ✟ ❅ ✟ ✠ ❅ B /1 ✟ D1 /1 0 1 ✢ ✫✪ ✫✪ 145 “Computer arithmetic” What kinds of numbers are typically represented in a computer system? Positive integers (for addressing, pointers.) Signed integers (for integer arithmetic.) “Real numbers” for arithmetic using real numbers. Perhaps the most important characteristics of the representation of numbers in a computer system is that numbers are represented using a fixed number of binary digits. This introduces the problems of overflow or underflow. For example, the sum of the following four bit (unsigned) numbers would be 1101 + 0101 = 10010 Using only four bits, the result would be 0010, and, depending on the particular processor and instruction, the overflow may or may not be detected. When an overflow or underflow is detected, the processor usually causes an exception. In this case, the program jumps to a predetermined location in memory which handles the exception, and the return address is also saved so the program can continue after the address is handled. We will discuss exceptions later. 146 How are signed integers represented? Following are four possibilities for representing negative numbers: one’s complement representation A negative number is the complement of a positive number; e.g. 00000001 represents 1 11111110 represents -1 Note that there are two representations of zero. This representation is not commonly used. two’s compliment representation This is the one’s complement representation plus 1; e.g. 00000001 represents 1 11111111 represents -1 This is the usual representation for signed integers. 
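A short sketch of the two's complement representation just described, again purely as our own illustration: encoding and decoding an 8-bit word, and the wrap-around (overflow) that a fixed word length forces.

```python
WIDTH = 8

def to_twos_complement(value, width=WIDTH):
    """Encode a (possibly negative) integer as a width-bit two's complement pattern."""
    return value & ((1 << width) - 1)

def from_twos_complement(bits, width=WIDTH):
    """Interpret a width-bit pattern as a signed two's complement integer."""
    return bits - (1 << width) if bits & (1 << (width - 1)) else bits

print(format(to_twos_complement(1), "08b"))    # 00000001
print(format(to_twos_complement(-1), "08b"))   # 11111111
print(format(to_twos_complement(-7), "08b"))   # 11111001

# With a fixed word length the adder simply discards any carry out of the top bit,
# which is why the same hardware adds unsigned and two's complement numbers alike.
s = (to_twos_complement(100) + to_twos_complement(50)) & ((1 << WIDTH) - 1)
print(from_twos_complement(s))   # -106: the true sum 150 does not fit in 8 signed bits (overflow)
```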
To extend the size of a 2’s complement integer, it is sign extended; e.g., to make the 8-bit representation of -7 a 16 bit representation, the high order bit in the 8-bit representation is used to fill in the higher order bits. 11111001 1111111111111001 -7 (8-bit 2’s complement) -7 (16-bit 2’s complement) 147 sign-magnitude representation An integer has a single bit, say the high order bit, which represents the sign. For an eight bit number, the representation would be sddddddd 00000001 represents 1 10000001 represents -1 This is the usual representation for the mantissa (significand) of a real (floating point) number. Note that for an integer, there are again two representations for zero. biased representation (sometimes called excess n representation.) A bias is added to the representation to get the number being represented. The following shows a representation with a bias of 127 (or excess 127): 10000000 represents 1 (1 + 01111111) 01111110 represents -1 (-1 + 01111111) This is the usual representation for the exponent of a real (floating point) number. 148 Given that, for integer arithmetic, we will use a 2’s complement representation, and that we want to combine arithmetic and logic operations in one unit (the ALU), how would an ALU for the MIPS be implemented? So far, we have the following instructions to implement: add, addu, addi, sub, subu, addiu, and, andi, or, ori, slt, sltu, slti. Consider the following single-bit ALU: Operation Binvert Carryin a 0 1 Result b 0 + 2 1 Less 3 Carryout Note that there are three control bits; the single bit Binvert, and the two bit input to the MUX, labeled Operation. The ALU performs the operations and, or, add, and subtract. 149 This ALU implements all the integer arithmetic and logical instructions seen so far. Note that the control inputs must be set according to the particular operation required of the ALU. A controller will also be required, to determine what particular operation is required of the ALU for each individual instruction. The ALU will be a component of the datapath of the processor we will design; the high order bit should detect overflow, as below: Operation Binvert Carryin a 0 1 Result b 0 + 2 1 Less 3 Set overflow detection Overflow A set of 32 of these components can be used to implement a full 32 bit ALU, as shown in the next slide. It also produces an output, zero, which is set to 1 whenever the 32 bit output is 0. 150 The 32 bit ALU Binvert Operation a0 b0 Carryin ALU0 Less CarryOut a1 b1 0 Carryin ALU1 Less CarryOut a2 b2 0 Carryin ALU2 Less CarryOut a31 b31 0 Carryin ALU31 Less Result0 Result1 zero Result2 Result31 Set Overflow ALU control lines Function 0 00 and 0 01 or 0 10 add 1 10 subtract 1 11 set on less than 151 The ALU depicted on the previous slide uses ripple-carry adders. We have seen how to build a carry look-ahead adder which would permit faster arithmetic operations. Using the carry look-ahead units we considered earlier, the changes required are not difficult since they merely compute the Carryin inputs to the ALU. In order to handle 2’s complement arithmetic, however inverted inputs would be required. Binvert a0 b0 0 a0 b0 Operation a0 b0 Carryin ALU0 Less CarryOut a1 b1 0 Carryin ALU1 Less CarryOut a2 b2 0 Carryin ALU2 Less CarryOut a31 b31 0 Carryin ALU31 Less 1 Result0 c1 a1 b1 0 1 a1 Carry b1 look− ahead c2 a2 b2 0 a2 b2 1 Result1 Result2 c31 a31 b31 0 a31 b31 1 152 Result31 Set Overflow Integer Multiplication Integer multiplication is really repeated addition. 
The basic algorithm is a kind of “shift and add” of partial products, obtained by multiplying individual digits in the multiplier by the multiplicand, as follows: 1010 × 0110 0000 1010 1010 multiplicand multiplier 0 × 1010 1 × 1010, shifted left 1 bit 1 × 1010, shifted left 2 bits 0000 0 × 1010, shifted left 3 bits 0111100 product — sum of partial products Note that the product of 2 n-bit numbers requires 2n bits. This multiplication algorithm can be implemented in a number of ways. Recalling that single bit binary multiplication is the same as the AND function, then simply ANDing the multiplicand with the digits of the multiplier, shifting, and adding them together is all that is required. This type of multiplier can be implemented using a single 32 bit adder, and 32 shift operations. One such implementation is shown in the following slide. 153 Hardware: Multiplicand Multiplier Shift right 32 bits 32 bit ALU Product 64 bits Shift right Write Control test Algorithm: Start =1 multiplier[i] =0? =0 Add multiplicand to left half of product. Place result in left half of product Shift product register right 1 bit Shift multiplier register right 1 bit 32 repetitions? Done 154 No Note that this implementation requires 32 shift and add operations, and requires that the multiplier, multiplicand, and product all be stored in separate registers. Noting that the product register “consumes” one additional bit on each iteration, and the multiplier register effectively removes one bit in the same iteration, we can reduce the hardware further by storing the multiplier in the right hand half of the product register. It will be shifted out at the same rate product digits are shifted in. 155 Hardware: Multiplicand 32 bit ALU Product 64 bits Shift right Write Control test Algorithm: Start =1 multiplier[i] =0? =0 Add multiplicand to left half of product. Place result in left half of product Shift product register right 1 bit 32 repetitions? Done 156 No Another possibility is to use an array of adders and AND gates to directly implement each partial product and sum all the partial products. This would require n2 adders for an n-bit multiplier. Following is a single multiply-add unit, consisting of an adder and an AND gate: xj Pi Ci yi Pi C i+1 xj yi A B C C S xj C i+1 Pi+1 157 xj Ci Pi+1 A 4 bit parallel multiplier: x3 x2 x1 x0 y0 xj yi Pi C i+1 xj xj yi Pi C i+1 xj Ci Pi+1 xj yi Pi C i+1 xj Ci Pi+1 yi Pi C i+1 xj Ci Pi+1 xj Ci Pi+1 y1 xj yi Pi C i+1 xj Ci Pi+1 xj yi Pi C i+1 xj Ci Pi+1 xj yi Pi C i+1 xj yi Pi C i+1 xj Ci Pi+1 xj Ci Pi+1 y2 yi Pi C i+1 xj xj Ci Pi+1 yi Pi C i+1 xj xj Ci Pi+1 yi Pi C i+1 xj xj Ci Pi+1 yi Pi C i+1 xj xj Ci Pi+1 y3 yi Pi C i+1 xj P7 xj Ci Pi+1 P6 yi Pi C i+1 xj xj Ci Pi+1 P5 yi Pi C i+1 xj xj Ci Pi+1 P4 158 yi Pi C i+1 xj xj Ci Pi+1 P3 P2 0 P1 0 P0 0 0 Signed multiplication The previous algorithms work fine for positive numbers, but how could negative numbers be handled? One way is to convert the negative numbers to positive (2’s complementation), perform the multiplication, and adjust the sign by performing 2’s complementation again if necessary. It turns out that there is a more elegant solution. Recoded multiplication If the adders which are used to construct the multiplier can also subtract, another possibility for speeding up the multiplication process is to “recode” the multiplication operation. 
We can consider the following as a “recoding” of the multiply operation: ai ai−1 operation comment 0 0 — no action 0 1 1xM add 1 0 2xM shift and add 1 1 4xM - M shift 2 and subtract This recoding applied to the previous algorithms would allow the multiply operations to complete in 16 iterations, rather than 32, using the same hardware. Essentially, this algorithm does a 2-bit multiply in each step. 159 Booth’s algorithm If the multiplier contains a string of 1’s, then this can be rewritten as a string of 0’s, provided we can do a subtraction. For example, if we were to perform the following multiplication: 0 0 0 1 1 0 1 x 0 0 1 1 1 1 0 We could rewrite the multiplier as 0100000 - 0000010 to give the following equivalent operation: 0 0 0 1 1 0 1 x 0 1 0 0 0 -1 0 Note that this “recoding” of the multiplier allows many more shifts over 0’s than the previous multiplier. It is possible to use this observation to recode groups of 2 bits in a multiplier to use only shifts and add or subtract operations to implement the multiplication. The main idea behind Booth’s algorithm is to identify strings of 1’s and replace them with 0’s, and apply the above observation. This can be accomplished by examining pairs of bits: Left bit Right bit Explanation Example 1 0 Beginning of a run of 1’s 00011110 1 1 Middle of a run of 1’s 00011110 0 1 End of a run of 1’s 00011110 0 0 Middle of a run of 0’s 00011110 160 Booth’s algorithm simply examines pairs of bits in the multiplier and does the following, using the hardware for the multiplier shown previously: 1. Depending on the values of the current bit and previous bit, do the following: 00: Middle of a string of 0’s, so do nothing 01: End of a string of 1’s, so add multiplicand to the left half of the product register 10: Beginning of a string of 1’s, so subtract multiplicand from the left half of the product register 11: Middle of a string of 1’s, so do nothing 2. Shift the product register to the right by 1 bit. 3. Repeat from step 1 until all the multiplier bits have been consumed. The original purpose was to speed up multiplication, since shifting was much faster than adding. One of the major advantages of Booth’s algorithm is that it also works for 2’s complement numbers. In modern processors, the multiply operation is usually implemented directly in the hardware with a multiply unit as part of the datapath. 161 Division Division is similar to multiplication, in that it is based on repeated subtraction. The main difference is that the quotient of two integers is not necessarily an integer. The basic algorithm is a “shift and subtract” procedure, similar to multiplication. 1010 Quotient 1101101 Dividend -1010 subtract Divisor 11 difference 111 shift in next bit, compare to Divisor (set 0 in Quotient) 1110 shift in next bit, compare to Divisor -1010 subtract Divisor (set 1 in Quotient) 100 difference 1001 shift in next bit, compare to Divisor (set 0 in quotient) Remainder (last bit was shifted in) This algorithm is slightly more complex than multiplication, because of the comparison with the Divisor at each step. Only if the divisor is greater than the partial dividend is the subtraction actually done. In practice, the operations compare and subtract are essentially the same, so it is common just to subtract and check to see if the result is positive, otherwise the previous value is restored, a 0 set as the quotient bit, and the shift performed. Note that the divisor and remainder can be the same size. 
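The "subtract and check, otherwise restore" idea can also be written in software. The following C sketch (our own, assuming unsigned 32-bit operands and a non-zero divisor) shifts in one dividend bit per step, much as the hardware described next does with its shared register:

#include <stdint.h>
#include <stdio.h>

/* A software sketch of restoring division: shift in one dividend bit,
 * try the subtraction, and "restore" (skip the subtraction) when the
 * partial remainder is smaller than the divisor.  Assumes divisor != 0. */
static void divide(uint32_t dividend, uint32_t divisor,
                   uint32_t *quotient, uint32_t *remainder) {
    uint32_t q = 0, r = 0;
    for (int i = 31; i >= 0; i--) {
        r = (r << 1) | ((dividend >> i) & 1);  /* shift in the next dividend bit */
        q <<= 1;
        if (r >= divisor) {                    /* the compare is the trial subtraction */
            r -= divisor;
            q |= 1;                            /* set a 1 in the quotient */
        }                                      /* else: restore, quotient bit stays 0 */
    }
    *quotient = q;
    *remainder = r;
}

int main(void) {
    uint32_t q, r;
    divide(0x6D, 0xA, &q, &r);   /* 1101101 / 1010, the example above */
    printf("quotient = %u, remainder = %u\n", q, r);   /* prints 10 and 9 */
    return 0;
}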
162 In the same way that the multiplier and product could share the same 64 bit register, the quotient and remainder bits can share a single 64 bit register for division. Divisor 32 bit ALU Remainder 64 bits Shift right Shift left Write Control test Note the similarity between the hardware for multiplication and division — the same hardware can be used for both functions. The difference is only in the control algorithm for each. Following is a control algorithm for the hardware for division: 163 Start Shift the remainder register left 1 bit Subtract the divisor register from the left half of the remainder register and place the result in the left half of the remainder register >=0 remainder[i] >= 0 ? Shift the remainder register to the left, setting the new rightmost bit to 1 <0 Restore the original value by adding the divisor register to the left half of the remainder register and place the sum in the left half of the remainder register. Shift the remainder register to the left, setting the new rightmost bit to 0 32 repetitions? No Done. Shift left half of remainder 1 bit right 164 “Real” arithmetic — floating point numbers Often we want to represent numbers over a wider range than can be represented with 32 bit integers. Most computers support floating point numbers. These numbers are similar to the representation in scientific notation. They consist of two parts, a mantissa (significand) and an exponent. The representation is of the form (−1)s × M × 2E where s denotes the sign, M is the mantissa, and E is the exponent. The exponent is adjusted such that the mantissa is of the form 1.xxxxxx... There is a standard for representing binary floating point numbers, (IEEE 754 floating point standard) universally supported by manufacturers of computer systems. The single precision (32 bit) form of a floating point number is: 31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16 15 14 1312 11 10 9 8 7 6 5 4 3 2 1 0 s exponent mantissa ✛ ✲ ✛ ✲ 8 bits 23 bits The exponent is in excess 127 representation, and the mantissa is normalized so the leading digit is 1. (Since it is always 1, it is not stored explicitly, permitting an additional digit to be represented in the mantissa.) 165 There is a double precision (64 bit minimum) form for floating point numbers. Here the exponent is 11 bits, in excess 1023 form, and the mantissa is 52 bits. Single precision numbers are in the range 2.0 × 10−38 to 2.0 × 1038. Double precision numbers are in the range 2.0×10−308 to 2.0×10308. Note that floating point numbers are not distributed uniformly. The following shows the case for a 2-bit mantissa: ✛ ✲ -1 0 2 2 2 1 2 2 2 3 This shows that the separation between floating point numbers depends on the value of the exponent. This means that “computer arithmetic” with floating point numbers does not behave like “real” arithmetic. For example, the following property does not hold: (A + B) + C = A + (B + C) Also, subtracting two numbers which are nearly equal causes “loss of precision.” For example, (using decimal arithmetic) 1.11010110 × 102 - 1.11010000 × 102 0.00000110 × 102 = 1.10000000 × 10−4 which now really has only 3 significant digits. 166 The IEEE 754 floating point standard attempts to minimize problems with floating point arithmetic, by several means, including: • Providing several user-selected rounding modes, including 1. Round to nearest (the default mode) 2. Round towards +∞ 3. Round towards −∞ 4. Round towards 0 These rounding modes allow the user to determine the effect of rounding errors. 
• Provide representations for +∞, −∞, and “not a number” (NAN). • Insisting that all calculations produce the same result as if the floating point calculations were exact, and then reduced to the number of bits used in the mantissa. In order to do this, three additional bits are required when performing arithmetic operations. Two bits, called the guard and round bits are required to ensure that normal rounding is accurate. A third bit called the sticky bit which is set if any of the discarded bits in the exact calculation would have been 1. This is required for rounding towards ∞. 167 • Providing well-defined exceptions, and a provision for trapping and handling those exceptions. The five possible exceptions are: 1. Invalid operation; e.g., 0/0, 0 × ∞ 2. Overflow 3. Underflow 4. Divide by 0 (this produces ±∞ if the exception is not trapped). 5. Inexact — the rounded result is not the actual result The standard also provides for denormalized numbers — numbers where the leading digits in the mantissa are 0 (and the implied digit is not present before the decimal.) This allows a graceful underflow. Special combinations of the exponent (E) and fractional part of the mantissa (f ) represent denormalized numbers, 0, ∞, and NAN: • if E = Emax and f = 0 then the number represents ±∞, depending on the sign bit. • if E = Emax , its maximum value (255 for single precision numbers) and f 6= 0 then the number represents NAN. • if E = 0 and f = 0 then the number represents 0. • if E = 0 and f 6= 0 then the number is denormalized, and represents (−1)s × 2−(Emax −1) × 0.f 168 Another feature of the floating point standard was the provision for extended formats for both single and double precision floating point numbers. The format parameters are summarized in the following table: Parameter Single Single Double Extended Double Extended Mantissa (bits) 23 ≥ 32 52 ≥ 64 Exponent (bits) 8 ≥ 11 11 ≥ 15 Total width (bits) 32 ≥ 43 64 ≥ 79 Exponent bias 127 unspecified 1023 unspecified Emax 127 ≥ 1023 1023 ≥ 16383 Emin -126 ≤ −1022 -1022 ≤ 16382 Recall that the leading 1 of the mantissa for normalized numbers is not included in the table, so the actual precision of the mantissa is one more bit than indicated. The extended precision formats are used in the floating point processors designed by INTEL. It uses an 80 bit representation for floating point numbers in its floating point units. The IEEE floating point standard is presently being reviewed, and proposals have been made to combine this standard with another standard for decimal floating point arithmetic. 169 Implementation of floating point arithmetic Floating point addition: ✗ ✖ ✔ ✕ Start ❄ Compare exponents, and shift the smaller to the right until its exponent equals the larger exponent. ❄ Add the mantissas ✲ ❄ Normalize the sum, shifting either right or left, incrementing or decrementing the exponent with each shift. ❄ ✟✟❍❍❍ ❍❍ ✟✟ ✟ ❍❍ ✟ Overflow or ✟ ❍❍ ✟✟ ❍❍underflow?✟✟ ✟ ❍❍ ❍✟✟ ❄ no yes ✗ no ❄ ✟✟❍❍❍ ❍❍ ✟✟ ✟ ❍❍ ✟ Still ✟ ❍❍ ✟ normalized? ✟✟ ❍❍ ✟✟ ❍❍ ❍✟✟ ✗ ✖ ❄ yes Done ✔ ✕ 170 Exception ✖ Round the mantissa to the appropriate number of bits ❄ ✔ ✕ Floating point multiplication: ✗ ✖ ✔ ✕ Start ❄ Add the exponents. Subtract the bias from the sum to get the new biased exponent. ❄ Multiply the mantissas ✲ ❄ Normalize the product, shifting right and incrementing the exponent. ❄ ✟❍ ❍ ❍❍ ✟✟ ✟ ❍ ✟ ✟ Overflow or ❍❍ ✟ ❍ ✟ ❍❍ underflow? ✟✟ ✟ ❍❍ ✟ ❍❍✟✟ yes ❄ no ✗ no ❄ ✟❍ ❍ ✟ ❍❍ ✟ ✟ ❍❍ ✟ ✟ Still ✟ ❍ ❍ ✟ ❍❍ normalized? 
✟✟ ✟ ❍❍ ✟ ❍❍✟✟ yes ❄ Set the sign bit appropriately. ✗ ✖ ✔ ❄ Done ✕ 171 Exception ✖ Round the mantissa to the appropriate number of bits ❄ ✔ ✕ Hardware implementations of the basic floating point operations (addition, multiplication, and division) are provided in virtually all modern microprocessors. Some processors have independent units for multiplication and addition, so both operations can execute in parallel. The MIPS had a separate floating point unit which was used in combination with the processor chip. Later versions integrated the floating point unit with the processor. A similar evolution happened earlier with the INTEL 80x8x architecture — the floating point unit was a separate co-processor, and operated in parallel with the main processor. Internally, INTEL’s floating point processor (which was the first floating point unit to comply with the then new floating point standard) used 80 bit arithmetic, in a stack architecture. 172 How can we determine performance? Let us look at an example from the transportation industry: Aircraft Passenger Fuel Cruising Throughput Cost Capacity Capacity Range Speed Boeing 747-400 421 216,847 10,734 920 387,320 0.048 Boeing 767-300 270 91,380 10,548 853 230,310 0.032 Airbus 340-300 284 139,681 12,493 869 246,796 0.039 Airbus 319-100 120 23,859 4,442 837 100,440 0.045 77 11,750 2,406 708 54,516 0.063 132 119,501 6,230 2180 287,760 0.145 Dash-8 50 3,202 1,389 531 26,550 0.046 My car 5 60 700 100 500 0.017 BAE-146-200 Concorde Where fuel capacity is in litres, range is in Km., and speed is in Km/h, throughput is the (number of passengers) × (cruising speed) and cost is the (fuel) per (passenger - Km.) determined as (fuel capacity)/(passengers × range) 173 Which of these has the best “performance?” This depends on how you define the term “performance.” For raw speed, (getting from one place to another quickly) the Concorde is over twice as fast as its closest competitor. If we are interested in the rate at which people are carried (we call this throughput) then the Boeing 747-400 clearly has the best performance. Often we are interested in relating performance and cost. In this example, if we consider cost as the amount of fuel used per passengerKm., then the most economical plane is the Boeing 767-300. Clearly, though, the car is easily the most economical overall. Note that we could also define cost in many different ways. We can define similar measures of performance and cost for computers. In a computer system, we are interested in the number of computations done per unit time, as well as the cost of the computation. Typically, we are interested in several aspects of the cost; e.g. the initial purchase price, the operating cost, or the cost of training users of the system. 174 In a computer system, we may be interested in the amount of time it takes a program to complete, (speed or response time), or the rate at which a number of processes complete (throughput), or in the cost of the system, relative to its performance. Since a computer program is merely a set of instructions for the particular computer, one might think that comparing the average instruction speed for two computers would be a good measure of performance. This turns out not to be so, for a number of reasons; for example: • Different computers have different instruction sets; some have very powerful instructions and others very simple, so the number of instructions required for a program might be very different on two different computers. 
• The instructions themselves may be implemented differently, and have different execution times. This may even be true for two machines which have the same instruction set (e.g., the Pentium and the Pentium IV, or the AMD Athlon).
• Different compilers may produce very different machine code from the same source code.

175 Typically, a processor has a basic “clock speed” and instructions require some multiple of this clock speed to execute. In order to determine the time required to execute a particular program (TP) we might think that we could take each instruction (I) to be executed, multiply it by the number of clock cycles for that instruction (CPI_I), and sum the result:

TP = [ Σ CPI_I, summed over every instruction I executed ] × (time for one clock cycle)

This does not work for several reasons:
• Many processors have instructions with variable execution times
• Most processors today execute several instructions simultaneously

It is possible, however, to approximate the run time of a program if we can determine an average number of cycles per instruction for the particular processor (and the program to be run). In this case, the execution time can be approximated by

TP = [ Σ (average CPI), summed over every instruction executed ] × (time for one clock cycle)

176 This can be rewritten as

TP = (N × average CPI) × (time for one clock cycle)

where N is the number of instructions executed by the program. Note the following:
• The number of instructions executed by the program depends on the compiler used to generate the machine language code, and on the particular instruction set of the processor.
• The average CPI depends on the particular instruction mix of the program.
• The clock cycle time depends on the detailed implementation of the processor, including the speed of the underlying technology, the complexity of the individual instructions, and the degree of parallelism in the processor.

Improvements in compiler technology typically produce about a 5% speedup per year. Improvements in technology typically produce about a 50% speedup per year.

177 All of the previous discussion makes several assumptions:
• The process under consideration is the only process running on the machine. In a “real” computing environment, many processes may be running simultaneously. (In a Linux system, run the program top to see what processes are presently using resources.)
• The processor speed determines the rate at which instructions are executed. In reality, memory access can be much slower than the processor speed, especially for large programs where the entire data and instructions cannot fit in main memory. (We will discuss memory performance later in the course.)
• In high performance systems, several processors may work simultaneously on a single process. At present, most processes run on a single processor, but it is possible to break up a computation into several “threads” which can be executed on different, interconnected, processors. (We will discuss this later in the course.)

178 Why not use “typical” programs to measure performance? This seems reasonable, since our idea of performance for a computer system is related to the time required to run the programs in which we are interested.

Performance = 1 / (execution time)

If we can find a “typical set of programs” which fairly reflects the type of code we run, then comparing the time to run these on different machines may be a good measure of performance. Generally, though, we do not know exactly what programs will be run on a system throughout its lifetime.
Also, the typical load (set of programs to be run) usually changes over the useful life of a computer system. Usually, our goal is more modest — to determine the “best” processor for a particular set of programs, at a given price, at a given time. 179 Consider the following example: Program Time on Time on Machine A Machine B P1 10 s 20 s P2 50 s 25 s Here, if we consider P1, then Machine A is twice as fast as Machine B. If we consider program P2, Machine B is twice as fast as Machine A. It may be reasonable to use a weighted average of the programs, where the weight is the relative number of times each program is usually run. For example, if P1 is run 3 times as frequently as P2, then the relative time required for Machines A and B is: (3 × 10) + 50 (3 × 20) + 25 So, Machine A requires 80/85 the time of Machine B. Alternately, Machine A has 85/80 × the performance of Machine B. Note that, for different weightings of the two programs, the conclusion as to which machine has the higher performance could be different. 180 Performance benchmarks In order to compare different processors, or different implementations of a single processor, people use various measures of performance, or benchmarks. Many benchmarks exist, often providing contradictory information about various processors. Several “standard” benchmark suites are available, and many of these also specify how the benchmark programs are to be compiled and run. One of the most famous benchmark suites (and also one of the most useful) is the SPEC benchmark suite. Information about it can be found at URL http://www.spec.org/ The SPEC benchmark uses the weighted running times of a set of programs. The programs have changed with time; the present SPEC CPU (SPEC CPU2006) was preceded by SPEC CPU 2000, SPEC95, SPEC92, and SPEC89. There are now sets of SPEC benchmarks for different aspects of systems performance, including integer and floating point performance, and graphics processor performance. 181 SPEC 2006 Benchmarks Benchmark Language Category Integer 400.perlbench C Programming Language 401.bzip2 C Compression 403.gcc C C Programming Language Compiler 429.mcf C Combinatorial Optimization 445.gobmk C AI, Game Playing: Go 456.hmmer C Bioinf., Gene Sequence Search 458.sjeng C AI, Game Playing: Chess 462.libquantum C Physics/Quantum Computing 464.h264ref C Video Compression 471omnetpp C++ Discrete Event Simulation 473.astar C++ Path-finding Algorithms 483.xalancbmk C++ XML Processing 182 Benchmark Language Category 410.bwaves Fortran Fluid Dynamics 416.gamess Fortran Quantum Chemistry 433.milc C Physics/ Quantum Chromodynamics 434.zeusmp Fortran Physics/ Computational Fluid Dynamics 435.gromacs C, Fortran Biochemistry / Molecular Dynamics Float 436.cactusADM C, Fortran Physics / General Relativity 437.leslie3d Fortran Fluid Dynamics 444.deall C++ Finite Element Analysis 450.soplex C++ Linear Programming, Optimization 450.povray C++ Image Ray-tracing 454.calculix C, Fortran Structural Mechanics 459.GemsDFTD Fortran Computational Electromagnetics 465.tonto Fortran Quantum Chemistry 470.lbm C Fluid Dynamics 481.wrf C, Fortran Weather 482.sphinx3 C Speech Recognition 183 Determining the effect of performance “improvements”: Consider the case where some aspect of the performance of a processor is improved, without making other improvements. For example, consider a numerically intensive problem in which 25% of the time is spent doing floating point arithmetic. Suppose the floating point unit is improved to perform five times faster. 
How much faster does the program run now? Clearly, only the part of the program that has improved performance will run faster, and we can easily calculate by how much:

0.75 + 0.25/5 = 0.8 — it will require only 80% of the original time.

This observation can be expressed as

execution time after improvement = (execution time of unimproved part) + (execution time of improved part) / (amount of improvement)

This relationship is called Amdahl’s law. Note that the overall improvement is relatively small (a 20% reduction in execution time) even though the performance increase for part of the code was dramatic. Amdahl’s law has interesting consequences for parallel machines — ultimately, it is the serial, or unparallelizable, component of the code that determines its running time.

184 Brief summary of performance measures

The only meaningful measure of performance is execution time for your “job mix”. The time to execute a program depends on:
Clock speed (MHz)
Code size
Cycles per instruction (CPI)

Composite or other measures of performance — what problems arise from their use?
MIPS (Millions of Instructions Per Second, or Meaningless Indicator of Processor Speed)
MFLOPS (Millions of Floating Point Operations Per Second)
SPEC

185 Where are we now? We have built up a “toolbox” of components (logic gates, adders, ALUs, MUXes, registers, etc.) and skills (combinational logic design, state machine design), and want to use those to implement a small MIPS-like processor.

186 Design and implementation of the processor

We now have all the raw material to design a processor with the instruction set we examined earlier. We will actually design several implementations of the processor, each with different performance characteristics. The first implementation will be a “single cycle processor” in which each instruction will take exactly one clock period. (In other words, the CPI will be 1.) In the next implementation, each instruction will require several cycles for execution, and different instructions may require a different number of cycles. In this case, the clock period may be shorter than the single cycle machine, but the CPI will be larger.

We will begin by reviewing the instruction set and designing a data path. Earlier, when discussing the instruction set, we identified a rough structure for a computer system:

[Figure: block diagram connecting the CPU, MEMORY, and INPUT/OUTPUT]

187 Presently, we are interested in the CPU only, which we concluded would have a structure similar to the following:

[Figure: internal structure of the CPU — general registers and/or accumulator, ALU, instruction decode and control unit, PC, program control unit (PCU) with address generator, MDR, and MAR]

The memory address register (MAR) and memory data register (MDR) are the interface to memory. The ALU and register file are the core of the data path. The program control unit (PCU) fetches instructions and data, and handles branches and jumps. The instruction decode unit (IDU) is the control unit for the processor.

188 The “building blocks”

We have already designed many of the major components for the processor, or have at least identified how they could be implemented. For example, we have already designed an ALU, a data register, and a register file. A controller is merely a state machine, and we can implement one using, say, a PLA, after identifying the required states and transitions.
Following are some of the combinational logic components we will use: Adder A ALU B 32 A 32 ❄ ◗ ✑ ◗✑ ❅ ❅ ❅ Adder ❅ OP ❄ Sum Carry B ❄ S 32 ❄ ❄ A 32 ❄ ◗ ✑ ◗✑ ❅ ❅ ✲ ❅ ALU ❅ 32 ❄ B 32 ❄ Multiplexor ❄ Result Zero ✲ ❄ MUX ❄ Y Note that the diagram highlights the control signals (OP and S). 189 Following are some of the register components we will use: Counter Register Write enable ✲ 30 PC Data in ✲ 32 Data out ✲ ✲ 32 32 ✂✂❇❇ ✂✂❇❇ Clock Clock Register file Write enable 5 ✲ Read register 1 ✲ Read register 2 5 ✲ 5 32 ✲ Read data 1 ✲ 32 Registers Write Register Write data Read data 2 ✲ 32 ✂✂❇❇ Clock Note that the registers have a write enable input as well as a clock input. This input must be asserted in order for the register to be written. We have already seen how to construct a register file from simple D registers. 190 Timing considerations In a single-cycle implementation of the processor, a single instruction (e.g., add) may require that a register be read from and written into in the same clock period. In order to accomplish this, the register file (and other register elements) must be edge triggered. This can be done by using edge triggered elements directly, or by using a master-slave arrangement similar to one we saw earlier: master slave DQ > DQ > s ❍❍ ❞ ✟✟ Another observation about a single cycle processor — the memory for instructions must be different from the memory for data, because both must be addressed in the same cycle. Therefore, there must be two memories; one for instructions, and one for data. Data memory Instruction memory MemWr ✁ 32 ✁ 32 ✲ ✲ Address Write data Read data ✁ ✁ 32 ✲ 32 Data Memory ✲ Read Address Instruction [31-0] Instruction Memory MemRd 191 ✁ ✲ 32 The MIPS instruction set: Following is the MIPS instruction format: R-type (register) 31 26 25 21 20 16 15 11 10 op rs rt 6 bits 5 bits rd 6 5 0 shamt funct 5 bits 5 bits 5 bits 6 bits I-type (immediate) 31 26 25 21 20 16 15 op rs rt 0 immediate 6 bits 5 bits 5 bits 16 bits J-type (jump) 31 26 25 0 op target 6 bits 26 bits We will develop an implementation of a very basic processor having the instructions: R-type instructions add, sub, and, or, slt I-type instructions addi, lw, sw, beq J-type instructions j Later, we will add additional instructions. 192 Steps in designing a processor • Express the instruction architecture in a Register Transfer Language (RTL) • From the RTL description of each instruction, determine – the required datapath components – the datapath interconnections • Determine the control signals required to enable the datapath elements in the appropriate sequence for each instruction • Design the control logic required to generate the appropriate control signals at the correct time 193 A Register Transfer Language description of some operations: The ADD instruction add rd, rs, rt • mem[PC] Fetch the instruction from memory • R[rd] ← R[rs] + R[rt] Set register rd to the value of the sum of the contents of registers rs and rt • PC ← PC + 4 calculate the address of the next instruction All other R-type instructions will be similar. The addi instruction addi rs, rt, imm16 • mem[PC] Fetch the instruction from memory • R[rt] ← R[rs] + Set register rt to the value of SignExt(imm16) the sum of the contents of register rs and the immediate data word imm16 • PC ← PC + 4 calculate the address of the next instruction All immediate arithmetic and logical instructions will be similar. 
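To make the register-transfer descriptions concrete, here is a minimal C sketch (our own names and simplifications, not the actual control implementation) of the add and addi transfers, with the register file as an array and sign extension done by a small helper:

#include <stdint.h>
#include <stdio.h>

static uint32_t R[32];   /* the register file */
static uint32_t PC;      /* the program counter */

/* SignExt(imm16): replicate bit 15 into the upper 16 bits. */
static int32_t sign_ext16(uint16_t imm16) {
    return (int32_t)(int16_t)imm16;
}

static void do_add(int rd, int rs, int rt) {
    R[rd] = R[rs] + R[rt];              /* R[rd] <- R[rs] + R[rt] */
    PC += 4;                            /* PC <- PC + 4 */
}

static void do_addi(int rt, int rs, uint16_t imm16) {
    R[rt] = R[rs] + sign_ext16(imm16);  /* R[rt] <- R[rs] + SignExt(imm16) */
    PC += 4;
}

int main(void) {
    R[8] = 5; R[9] = 7;
    do_add(10, 8, 9);                   /* add  $10, $8, $9  */
    do_addi(11, 10, 0xFFFC);            /* addi $11, $10, -4 */
    printf("R[10]=%u R[11]=%u PC=%u\n", R[10], R[11], PC);
    return 0;
}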
194 The load instruction lw rs, rt, imm16 • mem[PC] Fetch the instruction from memory • Addr ← R[rs] + Set memory address to the value of SignExt(imm16) the sum of the contents of register rs and the immediate data word imm16 • R[rt] ← Mem[Addr] load the data at address Addr into register rt • PC ← PC + 4 calculate the address of the next instruction The store instruction sw rs, rt, imm16 • mem[PC] Fetch the instruction from memory • Addr ← R[rs] + Set memory address to the value of SignExt(imm16) the sum of the contents of register rs and the immediate data word imm16 • Mem[Addr] ← R[rt] store the data from register rt into memory at address Addr • PC ← PC + 4 calculate the address of the next instruction 195 The branch instruction beq rs, rt, imm16 • mem[PC] Fetch the instruction from memory • Cond ← R[rs] - R[rt] Evaluate the branch condition • if (Cond eq 0) calculate the address of the next in- PC ← PC + 4 + struction (SignExt(imm16) × 4) • else PC ← PC + 4 The jump instruction j target target is a memory address • mem[PC] Fetch the instruction from memory • PC ← PC + 4 increment PC by 4 • PC<31:2> ← PC<31:28> replace low order 28 bits with concat I<25:0> << 2 the low order 26 bits from the instruction left shifted by 2 196 The Instruction Fetch Unit Note that all instructions require that the PC be incremented. We will design a datapath which performs this function — the Instruction Fetch Unit. Its operation is described by the following: • mem[PC] Fetch the instruction from memory • PC ← PC + 4 Increment the PC Add 4 PC Read address Instruction [31−0] Instruction Memory Note that this does not yet handle branches or jumps. Since it is the same for all instructions, when describing individual instructions this component will normally be omitted. 197 Datapath for R-type instructions • R[rd] ← R[rs] op R[rt] Example: add rd, rs, rt Recall that this instruction type has the following format: R−type (register) 31 26 25 op 6 bits 21 20 rs 16 15 rt 5 bits 5 bits 11 10 rd 5 bits 6 5 0 shamt funct 5 bits 6 bits The datapath contains the 32 bit register file and and ALU capable of performing all the required arithmetic and logic functions. RegWr Inst[25−21] rs Inst[20−16] rt Inst Inst[15−11] rd Read clk Register 1 Read Register 2 Read data 1 Registers Write Read Register data 2 ALUCtr BusA 32 ALU BusB 32 Result 32 Write data Note that the register is read from and written to at the “same time.” This implies that the register’s memory elements must be edge triggered, or are read and written on different clock phases, to allow the arithmetic operation to complete before the data is written in the register. 198 This datapath contains everything required to implement the required instructions add, sub, and, or, slt. All that is required is that the appropriate values be provided for the ALUCtr input for the required operation. The register operands in the instruction field determine the registers which are read from and written to, and the funct field of the instruction determine which particular ALU operation is executed. Recalling the control inputs for the ALU seen earlier, the values for the control input are: ALU control lines Function 000 and 001 or 010 add 110 subtract 111 set on less than A control unit for the processor will be designed later. It will set all the required control signals for each instruction, depending both on the particular instruction being executed (the op code) and, for r-type instructions, the funct field. 
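As a behavioural check of the control values above, the following C sketch (ours; overflow detection is left out for brevity) computes the result and the Zero output of the 32-bit ALU for each control encoding:

#include <stdint.h>
#include <stdio.h>

/* Behavioural model of the 32-bit ALU: ctrl holds the 3-bit control value
 * from the table above; zero is set when the 32-bit result is 0. */
static uint32_t alu(uint8_t ctrl, uint32_t a, uint32_t b, int *zero) {
    uint32_t result;
    switch (ctrl) {
    case 0x0: result = a & b; break;                        /* 000: and      */
    case 0x1: result = a | b; break;                        /* 001: or       */
    case 0x2: result = a + b; break;                        /* 010: add      */
    case 0x6: result = a - b; break;                        /* 110: subtract */
    case 0x7: result = ((int32_t)a < (int32_t)b) ? 1 : 0;   /* 111: set on less than */
              break;
    default:  result = 0;     break;                        /* unused encodings */
    }
    *zero = (result == 0);
    return result;
}

int main(void) {
    int zero;
    uint32_t r = alu(0x6, 12, 12, &zero);   /* subtract; beq looks at Zero */
    printf("result=%u zero=%d\n", r, zero);
    return 0;
}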
199 Datapath for Immediate arithmetic and logical instructions • R[rt] ← R[rs] op imm16 Example: addi rt, rs, imm16 Recall that this instruction type has the following format: I−type (immediate) 31 26 25 op 21 20 rs 6 bits 16 15 rt 5 bits 0 immediate 5 bits 16 bits The main difference between this and an r-type instruction is that here one operand is taken from the instruction, and sign extended (for signed data) or zero extended (for logical and unsigned operations.) RegWr ALUCtr Inst[25−21] rs Inst[20−16] rt 0 M U Inst[15−11] X 1 Read Clk Register 1 Read Register 2 Registers Write Read Register data 2 Write data Inst[15−0] RegDst Read data 1 16 imm16 Sign extend BusA 32 ALU BusB 32 0 M U X 1 32 ALUSrc Note the use of MUX’s (with control inputs) to add functionality. 200 Datapath for the Load instruction lw rt, rs, imm16 • Addr ← R[rs] + Calculate the memory address SignExt(imm16) • R[rt] ← Mem[Addr] load the data into register rt This is also an immediate type instruction: I−type (immediate) 31 26 25 op 21 20 rs 6 bits 5 bits 16 15 rt 0 immediate 5 bits 16 bits RegWr AluSrc Inst[25−21] Inst[20−16] 0 M U Inst[15−11] X 1 Read Clk Register 1 Read Register 2 AluCtr MemtoReg Read BusA data 1 Registers Write Read BusB Register data 2 32 ALU 32 Write data RegDst Address 0 M U X 1 Read data Data In Inst[15−0] 16 Sign extend 32 32 Write data Data Memory 32 MemRd 201 32 1 M U X 0 Datapath for the Store instruction sw rt, rs, imm16 • Addr ← R[rs] + Calculate the memory address SignExt(imm16) • Mem[Addr] ← R[rt] Store the data from register rt to memory This is also an immediate type instruction: I−type (immediate) 31 26 25 op 21 20 rs 6 bits 5 bits 16 15 rt 0 immediate 5 bits 16 bits RegWr AluSrc Inst[25−21] Inst[20−16] Read Clk Register 1 Read Register 2 0 M U Inst[15−11] X 1 Registers Write Read BusB Register data 2 AluCtr MemWr MemtoReg Read BusA data 1 ALU Address 0 M U X 1 Write data RegDst Read data Data In Inst[15−0] 16 Sign extend 32 32 Write data Data Memory 32 32 MemRd 202 32 1 M U X 0 Datapath for the Branch instruction beq rt, rs, imm16 • Cond ← R[rs] - R[rt] Calculate the branch condition • if (Cond eq 0) calculate the address of the next in- PC ← PC + 4 + struction (SignExt(imm16) × 4) • else PC ← PC + 4 This is also an immediate type instruction. In the load and store instructions, the ALU was used to calculate the address for data memory. It is possible to do this for the branch instructions as well, but it would require first performing the comparison using the ALU, and then using the ALU to calculate the address. This would require two clock periods, in order to sequence the operations correctly. A faster implementation would be to provide another adder to implement the address calculation. This is what we will do, for the present example. 203 0 M U X 1 Add 4 Add Shift left 2 Branch 204 RegWr Inst[25−21] PC Read address Instruction [31−0] Instruction Memory ALUSrc Inst[20−16] Read Register 1 Read Register 2 0 M U Inst[15−11] X 1 Registers Write Read Register data 2 Read data 1 Write data RegDst Inst[15−0] 16 ALUCtr Sign extend 32 Zero ALU 0 M U X 1 PCSrc Datapath for the Jump instruction j target • PC<31:2> ← PC<31:28> Calculate the jump address by con- concat target<25:0> catenating the high order 4 bits of the PC with the target address Here, the address calculation is just obtained from the high order 4 bits of the PC and the 26 bits (shifted left by 2 bits to make 28) of the target address. The additions to the datapath are straightforward. 
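Before adding the jump instruction, the next-PC logic for beq can be summarized in a short C sketch (our own; it models the two adders and the PCSrc multiplexor as ordinary arithmetic):

#include <stdint.h>
#include <stdio.h>

/* The beq next-PC calculation: the branch target is PC+4 plus the
 * sign-extended immediate multiplied by 4 (word offset to byte offset);
 * the Zero output of the ALU drives the PCSrc multiplexor. */
static uint32_t next_pc_beq(uint32_t pc, uint16_t imm16, int zero) {
    uint32_t pc_plus_4 = pc + 4;
    int32_t  offset    = (int32_t)(int16_t)imm16 * 4;   /* SignExt(imm16) x 4 */
    return zero ? pc_plus_4 + offset : pc_plus_4;
}

int main(void) {
    /* taken branch, offset -2 words: 0x100 + 4 - 8 = 0xFC */
    printf("0x%x\n", next_pc_beq(0x100, (uint16_t)-2, 1));
    return 0;
}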
J−type (jump) 31 26 25 0 op target address 6 bits 26 bits 205 Shift left 2 0 M U X 1 Add 4 Add 1 M U X 0 Shift left 2 Jump Branch PCSrc 206 RegWr Inst[25−21] PC Read address Instruction [31−0] Instruction Memory ALUSrc Inst[20−16] Read Register 1 Read Register 2 0 M U Inst[15−11] X 1 Registers Write Read Register data 2 Read data 1 Write data RegDst Inst[15−0] 16 ALUCtr Sign extend 32 Zero ALU 0 M U X 1 Putting it together The datapath was shown in segments, some of which built on each other. Required control signals were identified, and all that remains is to: 1. Combine the datapath elements 2. Design the appropriate control signals Combining the datapath elements is rather straightforward, since we have mainly built up the datapath by adding functionality to accommodate the different instruction types. When two paths are required, we have implemented both and used multiplexors to choose the appropriate results. The required control signals are mainly the inputs for those MUX’s and the signals required by the ALU. The next slide shows the combined data path, and the required control signals. The actual control logic is yet to be designed. 207 Inst[25−0] Jump address[31−0] 26 Shift left 2 0 M U X 1 28 PC+4[31−28] Add Add RegDst 4 Inst [31−26] Shift left 2 Jump Control 1 M U X 0 PCSrc Branch MemRead MemtoReg ALUOp MemWrite ALUSrc RegWrite 208 PC Read address Instruction [31−0] Instruction Memory Inst[25−21] rs Inst[20−16] rt 0 M U Inst[15−11] X 1 rd Inst[15−0] Read Register 1 Read Register 2 Read BusA data 1 32 Registers Write Read BusB Register data 2 32 Write data 16 Inst[5−0] Sign extend Zero ALU 0 M U X 1 32 32 funct 32 ALU control Address Read data Write data 32 Data Memory 32 32 1 M U X 0 Designing the control logic The control logic depends on the details of the devices in the control path, and on the individual bits in the op code for the instructions. The arithmetic and logic operations for the r-type instructions also depend on the funct field of the instruction. The datapath elements we have used are: • a 32 bit ALU with an output indicating if the result is zero • adders • MUX’s (2 line to 1-line) • a 32 register × 32 bits/register register file • individual 32 bit registers • a sign extender • instruction memory • data memory 209 The ALU — a single bit Operation Binvert Carryin a 0 1 Result b 0 + 2 1 3 Less Carryout Note that there are three control bits; the single bit Binvert, and the two bit input to the MUX, labeled Operation. The ALU performs the operations and, or, add, and subtract. 210 The 32 bit ALU Binvert Operation a0 b0 Carryin ALU0 Less CarryOut a1 b1 0 Carryin ALU1 Less CarryOut a2 b2 0 Carryin ALU2 Less CarryOut a31 b31 0 Carryin ALU31 Less Result0 Result1 zero Result2 Result31 Set Overflow ALU control lines Function 000 and 001 or 010 add 110 subtract 111 set on less than 211 We will design the control logic to implement the following instructions (others can be added similarly): Name Op-code Op5 Op4 Op3 Op2 Op1 Op0 R-format 0 0 0 0 0 0 lw 1 0 0 0 1 1 sw 1 0 1 0 1 1 beq 0 0 0 1 0 0 j 0 0 0 0 1 0 Note that we have omitted the immediate arithmetic and logic functions. The funct field will also have to be decoded to produce the required control signals for the ALU. A separate decoder will be used for the main control signals and the ALU control. This approach is sometimes called local decoding. Its main advantage is in reducing the size of the main controller. 
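The fixed field boundaries in the three formats make the decoding itself simple masking and shifting. The following C sketch (our own helper names; the encoded example instruction is also ours) extracts the fields that the main controller and the ALU control examine:

#include <stdint.h>
#include <stdio.h>

/* Decode the MIPS instruction fields by shifting and masking, following the
 * R/I/J formats shown earlier.  The main controller looks only at the op
 * field (bits 31-26); the ALU control also looks at funct (bits 5-0). */
typedef struct {
    uint8_t  op, rs, rt, rd, shamt, funct;
    uint16_t imm16;
    uint32_t target;
} fields_t;

static fields_t decode(uint32_t inst) {
    fields_t f;
    f.op     = (inst >> 26) & 0x3F;
    f.rs     = (inst >> 21) & 0x1F;
    f.rt     = (inst >> 16) & 0x1F;
    f.rd     = (inst >> 11) & 0x1F;
    f.shamt  = (inst >>  6) & 0x1F;
    f.funct  =  inst        & 0x3F;
    f.imm16  =  inst        & 0xFFFF;
    f.target =  inst        & 0x3FFFFFF;
    return f;
}

int main(void) {
    fields_t f = decode(0x012A4020);   /* add $8, $9, $10 (op=0, funct=0x20) */
    printf("op=%u rs=%u rt=%u rd=%u funct=0x%x\n", f.op, f.rs, f.rt, f.rd, f.funct);
    return 0;
}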
212 The control signals The signals required to control the datapath are the following: • Jump — set to 1 for a jump instruction • Branch — set to 1 for a branch instruction • MemtoReg — set to 1 for a load instruction • ALUSrc — set to 0 for r-type instructions, and 1 for instructions using immediate data in the ALU (beq requires this set to 0) • RegDst — set to 1 for r-type instructions, and 0 for immediate instructions • MemRead — set to 1 for a load instruction • MemWrite — set to 1 for a store instruction • RegWrite — set to 1 for any instruction writing to a register • ALUOp (k bits) — encodes ALU operations except for r-type operations, which are encoded by the funct field For the instructions we are implementing, ALUOp can be encoded using 2 bits as follows: ALUOp[1] ALUOp[0] Instruction 0 0 memory operations (load, store) 0 1 beq 1 0 r-type operations 213 The following tables show the required values for the control signals as a function of the instruction op codes: Instruction Op-code RegDst ALUSrc MemtoReg Reg Write r-type 0 0 0 0 0 0 1 0 0 1 lw 1 0 0 0 1 1 0 1 1 1 sw 1 0 1 0 1 1 x 1 x 0 beq 0 0 0 1 0 0 x 0 x 0 j 0 0 0 0 1 0 x x x 0 Instruction Op-code Mem Mem Branch ALUOp[1:0] Jump Read Write r-type 0 0 0 0 0 0 0 0 0 10 0 lw 1 0 0 0 1 1 1 0 0 00 0 sw 1 0 1 0 1 1 0 1 0 00 0 beq 0 0 0 1 0 0 0 0 1 01 0 j 0 0 0 0 1 0 0 0 0 xx 1 This is all that is required to implement the control signals; each control signal can be expressed as a function of the op-code bits. For example, RegDst = Op5 · Op4 · Op3 · Op2 · Op1 · Op0 ALUSrc = Op5 · Op4 · Op2 · Op1 · Op0 All that remains is to design the control for the ALU. 214 The ALU control The inputs to the ALU control are the ALUOp control signals, and the 6 bit funct field. The funct field determines the ALU operations for the r-type operations, and ALUOp signals determine the ALU operations for the other types of instructions. Previously, we saw that if ALUOp[1] was 1, it indicated an r-type operation. ALUOp[0] was set to 0 for memory operations (requiring the ALU to perform an add operation to calculate the address for data) and to 1 for the beq operation, requiring a subtraction to compare the two operands. The ALU itself requires three inputs. The following table shows the required inputs and outputs for the instructions using the ALU: Instruction ALUOp funct ALU operation ALU control input lw 00 x x x x x x add 010 sw 00 x x x x x x add 010 beq 01 x x x x x x subtract 110 add 10 1 0 0 0 0 0 add 010 sub 10 1 0 0 0 1 0 subtract 110 and 10 1 0 0 1 0 0 AND 000 or 10 1 0 0 1 0 1 OR 001 slt 10 1 0 1 0 1 0 set on less than 111 215 Extending the instruction set What is necessary to add another instruction to the instruction set? First, the appropriate elements must be added to the datapath. Second, any control elements must be added, and appropriate control signals identified. Third, the control logic must be extended to enable the appropriate elements in the datapath. Let us consider adding the instruction or immediate (ori) It has the form ori $s1, $s2, imm Its function is to perform the logical OR of the contents of register $s2 with the zero extended immediate data field imm, storing the result in register $s1. $s1 ← $s2 | ZeroExtend[imm] It has op-code 0 0 1 1 0 1 and is an immediate type instruction. 216 First — add elements to the data path Examining the data path, the ALU can perform the OR operation, but the extender unit only supports sign extension. It can be replaced by a unit, sign or zero extend, which can perform both functions. 
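A behavioural sketch of this combined extender (in C, our own; the select input corresponds to the control signal introduced in the next step) is simply:

#include <stdint.h>

/* Sign or zero extend a 16-bit immediate to 32 bits.  select_sign = 1 gives
 * sign extension (lw, sw, beq, addi); select_sign = 0 gives zero extension (ori). */
uint32_t extend16(uint16_t imm16, int select_sign) {
    return select_sign ? (uint32_t)(int32_t)(int16_t)imm16   /* replicate bit 15 */
                       : (uint32_t)imm16;                    /* fill with zeros  */
}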
Second — add control elements This new unit requires a new control signal to select the zero extend function (0) or the sign extend function (1). We will label the new signal ExtOp. Also, the 2-bit control signal ALUOp only encodes the operations add and subtract. Adding a third bit would allow the encoding of the operations AND and OR. It can be encoded as follows: ALUOp[2] ALUOp[1] ALUOp[0] Instruction 0 0 0 memory operations (load, store) 0 0 1 beq 0 1 0 ori 1 x x r-type operations (subtract, in the ALU) The following diagram shows the changes required to the datapath: 217 Inst[25−0] 26 Shift left 2 Jump address[31−0] 0 M U X 1 28 PC+4[31−28] Add Add RegDst ExtOp 4 Inst [31−26] Shift left 2 Jump 1 M U X 0 PCSrc Branch MemRead MemtoReg Control ALUOp MemWrite ALUSrc RegWrite 218 PC Read address Instruction [31−0] Instruction Memory Inst[25−21] rs Inst[20−16] rt 0 M U Inst[15−11] X 1 rd Inst[15−0] Read Register 1 Read Register 2 Read data 1 BusA 32 Zero Registers Write Register Read data 2 Write data 16 Inst[5−0] Sign or zero extend BusB 32 ALU 0 M U X 1 32 32 funct 32 ALU control Address Read data Write data 32 Data Memory 32 32 1 M U X 0 Third - the control logic The truth table for the ALU control unit extends to: Instruction ALUOp funct ALU ALU control operation input lw 000 x x x x x x add 010 sw 000 x x x x x x add 010 beq 001 x x x x x x subtract 110 ori 010 x x x x x x OR 001 add 100 1 0 0 0 0 0 add 010 sub 100 1 0 0 0 1 0 subtract 110 and 100 1 0 0 1 0 0 AND 000 or 100 1 0 0 1 0 1 OR 001 slt 100 1 0 1 0 1 0 set on less than 111 For the ori instruction, the following settings are required for the remaining control signals: Jump 0 Branch 0 MemRead 0 MemWrite 0 MemtoReg 0 ALUSrc 1 ALU operand is from the extender RegDst 0 rt is the destination register RegWrite 1 result will be written in reg[rt] ExtOp 0 zero extend 219 The modified tables for the control signals are: Inst. Op-code RegDst ALUSrc MemtoReg Reg Write r-type 0 0 0 0 0 0 1 0 0 1 lw 1 0 0 0 1 1 0 1 1 1 sw 1 0 1 0 1 1 x 1 x 0 beq 0 0 0 1 0 0 x 0 x 0 j 0 0 0 0 1 0 x x x 0 ori 0 0 1 1 0 1 0 1 0 1 Inst. Op-code Mem Mem Branch Jump ALUOp ExtOp Read Write 2 1 0 r-type 0 0 0 0 0 0 0 0 0 0 1 0 0 x lw 1 0 0 0 1 1 1 0 0 0 0 0 0 1 sw 1 0 1 0 1 1 0 1 0 0 0 0 0 1 beq 0 0 0 1 0 0 0 0 1 0 0 0 1 1 j 0 0 0 0 1 0 0 0 0 1 x x x x ori 0 0 1 1 0 1 0 0 0 0 0 1 0 0 Some of the control logic may have to be modified. For example, the logic generating the signal ALUSrc would have to ensure that the value 1 was set for the ori instruction: ALUSrc = Op5·Op4·Op2·Op1·Op0+Op5 · Op4 · Op3 · Op2 · Op1 · Op0 The new control signal, ExtOp can be evaluated as: ExtOp = Op5 · Op4 · Op2 · Op1 · Op0 + Op5 · Op4 · Op3 · Op2 · Op1 · Op0 220 Other control logic implementations Because there are only a few instructions to be implemented, the control logic for this processor implementation is probably best implemented using simple logic functions as shown previously. It is quite common to implement simple controllers as a PLA. Following is a PLA implementation for the processor we have designed so far: op5 . . . op0 000000 R−type op5 . . . op0 001101 ori op5 . . . op0 op5 . . . op0 op5 . . . op0 op5 . . . op0 100011 101011 000100 000010 lw sw beq j RegDst ALUSrc MemtoReg RegWrite MemRead MemWrite Branch Jump ExtOp ALUSrc[2] ALUSrc[1] ALUSrc[0] 221 The controller for the ALU could be implemented similarly, although it is also probably best implemented using simple logic functions, as well. 
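Written out in C (a sketch of the same simple logic functions; the don't-care outputs are simply left at 0 here), the main control amounts to one test per op-code:

#include <stdint.h>
#include <stdio.h>

/* Main control as a function of the six op-code bits.  Each output is 1
 * exactly for the op-codes marked 1 in the tables above (R-type = 000000,
 * lw = 100011, sw = 101011, beq = 000100, j = 000010, ori = 001101). */
struct ctrl {
    unsigned RegDst:1, ALUSrc:1, MemtoReg:1, RegWrite:1, MemRead:1,
             MemWrite:1, Branch:1, Jump:1, ExtOp:1;
    unsigned ALUOp:3;
};

static struct ctrl main_control(uint8_t op) {
    struct ctrl c = {0};
    int rtype = (op == 0x00), lw = (op == 0x23), sw = (op == 0x2B),
        beq   = (op == 0x04), j  = (op == 0x02), ori = (op == 0x0D);

    c.RegDst   = rtype;
    c.ALUSrc   = lw | sw | ori;
    c.MemtoReg = lw;
    c.RegWrite = rtype | lw | ori;
    c.MemRead  = lw;
    c.MemWrite = sw;
    c.Branch   = beq;
    c.Jump     = j;
    c.ExtOp    = lw | sw | beq;                          /* sign extend; ori zero-extends */
    c.ALUOp    = rtype ? 0x4 : beq ? 0x1 : ori ? 0x2 : 0x0;
    return c;
}

int main(void) {
    struct ctrl c = main_control(0x23);  /* lw */
    printf("lw: ALUSrc=%d MemRead=%d MemtoReg=%d RegWrite=%d\n",
           c.ALUSrc, c.MemRead, c.MemtoReg, c.RegWrite);
    return 0;
}

Each of these op-code tests corresponds to one AND term in the PLA shown above.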
Note that, in the preceding controller, there was an AND term corresponding to each instruction. For a small number of instructions, this is effective. However, if the number of instructions is large (i.e., there are op codes for most of the 6 bit instruction combinations) then the controller could also be implemented as a read-only memory (ROM). In this case, the op codes would be used as address inputs to the ROM, and the outputs would be the values stored at those addresses. There are 12 output bits, and the total size of the memory would be 26 = 64 words of 12 bits. The encoding would be quite straightforward; merely the contents of the logic table for each control bit. This would not be an efficient implementation for the ALU control, however. The funct field has 6 bits, and the ALUOp control input has 3 bits, for a total of 9 bits, requiring 29 = 512 memory words of 3 bits. Another option for the ALU control bits is to use the funct field to generate the required three control signals, and have the main controller also generate these control signals directly. They could then be selected by a MUX, which would select the control signals evaluated from the funct field only if the instruction is r-type. The input to the MUX could be the logical OR of the instruction field, which evaluates to 0 only for r-type instructions. 222 The time required for single cycle instructions Arithmetic and logical instructions PC time Inst. Memory Reg. Read mux ALU mux Reg. Write Inst. Memory Reg. Read mux Sign ext. add ALU mux mux Branch PC (The sign extension and add occur in parallel with the other operations, register read and ALU comparision ) Load PC Inst. Memory Reg. Read mux ALU Data Memory mux Reg. Write The "critical path" Store PC Inst. Memory Reg. Read mux ALU Data Mem. Jump PC Inst. Memory mux The clock period must be at least as long as the time for the critical path. 
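To see how the critical path sets the clock period, here is a small back-of-the-envelope calculation in C. The delay values are purely hypothetical placeholders (the notes do not give component delays); only the structure of the sums follows the paths shown above:

#include <stdio.h>

int main(void) {
    /* hypothetical component delays, in picoseconds */
    int inst_mem = 200, reg_read = 100, mux = 30, alu = 120,
        data_mem = 200, reg_write = 100;

    /* load:   Inst. Memory -> Reg. Read -> mux -> ALU -> Data Memory -> mux -> Reg. Write */
    int load_path  = inst_mem + reg_read + mux + alu + data_mem + mux + reg_write;

    /* R-type: Inst. Memory -> Reg. Read -> mux -> ALU -> mux -> Reg. Write */
    int rtype_path = inst_mem + reg_read + mux + alu + mux + reg_write;

    printf("load (critical) path: %d ps -> minimum clock period\n", load_path);
    printf("R-type path:          %d ps (idle for the rest of the cycle)\n", rtype_path);
    return 0;
}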
223–228 [Datapath diagrams: the single-cycle datapath, repeated with the active elements highlighted for R-type operations, the branch instruction (beq), the load instruction, the store instruction, and the jump instruction]

Why is the single cycle implementation not used?

In order to have a single cycle implementation, each instruction in the instruction set must have all the operands and control signals available to implement the full instruction. This means, for example, that an instruction could not require two operands from memory. It also means that data and instructions cannot share the same memory. A multi-cycle implementation could have instructions and data stored in the same memory; instructions could be fetched in one cycle, and data fetched in another.
As well, every instruction will use exactly the same amount of time for its execution. Instructions like the jump instruction, which involves only a few datapath elements, require the same time as, say, the load instruction, which involves almost all the elements in the datapath.
With more than one clock cycle, instructions using few datapath elements could complete in fewer clock cycles than instructions using many elements. Also, there may be opportunities to reuse some datapath elements if instructions used more than one clock cycle. For example, the ALU could also be used to calculate branch addresses.

229 Considerations in a multi-cycle implementation
There may be many considerations in the design of a multi-cycle processor. For example, the first version of the IBM PC used a processor with an 8-bit path to memory, although the internal data paths were 16 bits. This meant that, for a full data word to be fetched from memory, two (8-bit) memory accesses were required. (At the time, external connections (pins on the integrated circuit "chip") were expensive, so a smaller path to memory made the processor cheaper to manufacture.)
Other operations could also be performed in several cycles to reduce hardware costs; for example, a 32-bit add function could be implemented using an 8-bit adder, but requiring four clock cycles to complete the add operation.
In general, a multi-cycle implementation attempts to find a compromise between the number of cycles required for a particular function and the hardware complexity for its implementation, at a given cost. It is a trade-off between resources and time.
The problem of designing a multi-cycle processor is therefore an optimization problem: for a given cost (i.e., amount of hardware, or logic), what is the fastest processor which can be implemented with the specified instruction set?

230
This is really a multi-dimensional problem, and at any given time, different manufacturers of similar hardware have had very different implementations of a processor, with different performance characteristics. (Consider INTEL and AMD today; they implement much the same instruction sets, but with different price and performance characteristics. Years ago, IBM and AMDAHL processors implemented the same instruction sets very differently, as well.)
We will consider the problem in two steps; first, decide on the hardware resources to be available, then decide the minimum clock period, and what operations should be done in each cycle.
For the hardware resources in our implementation we will have:
• a single memory, 32 bits wide, for instructions and data
• a single ALU similar to that designed earlier
• a full 32 bit internal datapath
• as few other arithmetic elements as possible (we will attempt to eliminate the adders required for addressing)

231 How are instructions broken down into cycles?
This is also a complex problem. A reasonable approach might be to:
• find the single indivisible operation which requires the longest time
• attempt to do as many of the shorter operations as possible in single cycles of the same length
In its simplest form, this is a "greedy algorithm." It is made more complex by the fact that the operations may have to be performed in some particular order. These are called precedence relations, and discovering them is important whenever looking for opportunities for parallelism. For example, an instruction must be fetched from memory before the arithmetic or logic function it specifies can be executed.
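The greedy packing can be sketched in a few lines of code. The following is only an illustration (Python, with invented operation times; precedence is taken to be simply the left-to-right order of the operations): the cycle length is fixed at the longest single operation, and each cycle is then filled with as many of the following operations as will fit.

    # Hypothetical operation times (ns) for a load instruction, in the order they must occur.
    ops = [("instruction fetch", 10), ("register read", 5), ("ALU add", 6),
           ("data memory read", 10), ("register write", 5)]

    cycle_time = max(t for _, t in ops)   # the longest indivisible operation sets the cycle length

    cycles, current, used = [], [], 0
    for name, t in ops:                   # operations are taken in precedence order
        if used + t > cycle_time:         # start a new cycle when the next operation does not fit
            cycles.append(current)
            current, used = [], 0
        current.append(name)
        used += t
    cycles.append(current)

    for i, c in enumerate(cycles, 1):
        print("cycle", i, ":", ", ".join(c))

With these (invented) numbers the load instruction packs into five cycles, which is the division arrived at below.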
In many processors, fetching a value (instructions or data) from memory is the operation which takes the longest time. In others, it is possible to divide even this operation into sub-operations; e.g., generate a memory address in one cycle and read or write the value in the next cycle. For our purposes, we will consider the fetching of an operand from memory as the single indivisible operation which will define our basic cycle time.

232
Looking back at the instruction timing for the single cycle processor, we see that the load instruction requires two memory accesses, and therefore will require at least two cycles.
Arithmetic and logical:       PC → Inst. Memory → Reg. Read → mux → ALU → mux → Reg. Write
Load (the "critical path"):   PC → Inst. Memory → Reg. Read → mux → ALU → Data Memory → mux → Reg. Write
Jump:                         PC → Inst. Memory → mux
Considering the option of using the ALU to increment the PC, note also that if the PC is read at the beginning of a cycle and loaded at the end of the cycle, then it can be incremented in parallel with the memory access. Also, if the diagram really represents the time for the various operations, the register and MUX operations together require approximately the same time as a memory operation, requiring five cycles in total.
Dividing the critical path (the load) into cycles of this length gives:
cycle 1 — Inst. Memory; cycle 2 — Reg. Read, mux; cycle 3 — ALU; cycle 4 — Data Memory; cycle 5 — mux, Reg. Write

233 A multi-cycle implementation
We will consider the design of a multi-cycle implementation of the processor developed so far. The processor will have:
• a single memory for instructions and data
• a single ALU for both addressing and data operations
• instructions requiring different numbers of cycles
There are now resource limitations — only one access to memory, one access to the register file, and one ALU operation can occur in each clock cycle.
It is clear that both the instruction and data would be required during the execution of an instruction. Additional registers, the instruction register (IR) and the memory data register (MDR), will be required to hold the instruction and data words from memory between cycles. Registers may also be required to hold the register operands from BusA and BusB (registers A and B, respectively). (Recall that the branch instructions require an arithmetic comparison before an address calculation.)
We will look at each type of instruction individually to determine if it can actually be done with the time and resources available.

234 The R-type instructions
• R[rd] ← R[rs] op R[rt]        Example: add rd, rs, rt
Cycles: 1 — Inst. Memory; 2 — Reg. Read, mux; 3 — ALU; 4 — mux, Reg. Write
Clearly, the instruction can be completed in four cycles, from the timing. We need only determine if the required resources are available.
• In the first cycle, the instruction is fetched from memory, and the ALU is used to increment the PC. The instruction must be saved in the instruction register (IR) so it can be used in the following cycles. (This may extend the cycle time.)
• In the second cycle, the registers are read, and the values from the registers to be used by the ALU must be saved in registers A and B (again, new registers).
• In the third cycle, the r-type operation is completed in the ALU, and the result saved in another new register, ALUOut.
• In the fourth cycle, the value in register ALUOut is written into the register file.
Four registers had to be added to preserve values from one cycle to the next, but there were no resource conflicts — the ALU was required only in the first and third cycle.
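These four cycles are just a sequence of register transfers, and can be mimicked in a few lines of code. The sketch below (Python; a toy model with an invented instruction encoding and register file, not a description of the hardware) steps one add instruction through the four cycles, using the new internal registers IR, A, B and ALUOut:

    # Toy machine state (illustrative only).
    memory = {0: ("add", 3, 1, 2)}     # instruction at address 0: R[3] = R[1] + R[2]
    regfile = {1: 10, 2: 32, 3: 0}
    PC = 0

    # Cycle 1: fetch the instruction; the ALU increments the PC in parallel.
    IR = memory[PC]
    PC = PC + 4

    # Cycle 2: read the register file; save the operands in the new registers A and B.
    op, rd, rs, rt = IR
    A = regfile[rs]
    B = regfile[rt]

    # Cycle 3: perform the r-type operation in the ALU; save the result in ALUOut.
    ALUOut = A + B                     # only 'add' is modelled here

    # Cycle 4: write the result back into the register file.
    regfile[rd] = ALUOut

    print(regfile)                     # {1: 10, 2: 32, 3: 42}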
235
We can capture these steps in an RTL description:
Cycle 1:  IR ← mem[PC]          Save instruction in IR
          PC ← PC + 4           increment PC
Cycle 2:  A ← R[rs]             save register values for next cycle
          B ← R[rt]
Cycle 3:  ALUOut ← A op B       calculate result and store in ALUOut
Cycle 4:  R[rd] ← ALUOut        store result in register file
This is really an expansion of the original RTL description of the R-type instructions, where the internal registers are also used. The original description was:
mem[PC]                         Fetch the instruction from memory
R[rd] ← R[rs] op R[rt]          Set register rd to the value of the operation applied to the contents of registers rs and rt
PC ← PC + 4                     calculate the address of the next instruction
When using a "silicon compiler" to design a processor, designers often refine the RTL description in a similar way in order to achieve a more efficient implementation for the datapath or control.

236 The Branch instruction — beq
• Cond ← R[rs] - R[rt]                                   Calculate the branch condition
• if (Cond eq 0) PC ← PC + 4 + (SignExt(imm16) × 4)      calculate the address of the next instruction
• else PC ← PC + 4
Cycles: 1 — Inst. Memory; 2 — Reg. Read, mux (sign ext. and add in parallel); 3 — ALU, mux, mux → PC
In this case, three arithmetic operations are required (incrementing the PC, comparing the register values, and adding the immediate field to the PC). Clearly, the comparison could not be done until the values have been read from the registers, so this must be done in cycle 3. The address calculation could be done in cycle 2, however, since it uses only data from the instruction (the immediate field) and the new value of the PC, and the ALU is not being used in this cycle. The result would have to be stored in a register, to be used in the next cycle. We could use the register ALUOut for this, since the R-type operations only require it at the end of cycle 3.
Recall that the ALU produced an output Zero which could be used to implement the comparison. It is available during the third cycle, and could be used to enable the replacement of the PC with the value stored in ALUOut in the previous cycle.

237
The original RTL for the beq was:
• mem[PC]                                                Fetch the instruction from memory
• Cond ← R[rs] - R[rt]                                   Evaluate the branch condition
• if (Cond eq 0) PC ← PC + 4 + (SignExt(imm16) × 4)      calculate the address of the next instruction
• else PC ← PC + 4
Rewriting the RTL code for the beq instruction, including the operations on the internal registers, we have:
Cycle 1:  IR ← mem[PC]                             Save instruction in IR
          PC ← PC + 4                              increment PC
Cycle 2:  A ← R[rs]                                save register values for next cycle (for comparison)
          B ← R[rt]
          ALUOut ← PC + signextend(imm16) << 2     calculate address for branch and place in ALUOut
Cycle 3:  if Zero then PC ← ALUOut                 Compare A and B; if Zero is set, replace PC with ALUOut, otherwise do not change PC
Note that this instruction now requires three cycles. Also, the first cycle is identical to that of the R-type instructions. The second cycle does the same as the R-type, and also does the address calculation. Note that, at this point, the instruction may not require the result of the address calculation, but it is calculated anyway.

238 The Load instruction
• Addr ← R[rs] + SignExt(imm16)       Calculate the memory address
• R[rt] ← Mem[Addr]                   load data into register rt
Cycles: 1 — Inst. Memory; 2 — Reg. Read, mux; 3 — ALU; 4 — Data Memory; 5 — mux, Reg. Write
Clearly, the first cycle is the same as in the previous examples.
For the second cycle, register R[rs] contains part of an address, and register R[rt] contains a value to be saved in memory (for store) or to be replaced from memory (for load). They must therefore be saved in registers (A and B) for future use, like the previous instructions.
In the third cycle, the address is calculated from the contents of A and the imm16 field of the instruction and stored in a register (ALUOut) for use in the next cycle.
This address (now in ALUOut) is used to access the appropriate memory location in the fourth cycle, and the contents of memory are placed in a register MDR, the memory data register.
In the fifth cycle, the contents of the MDR are stored in the register file in register R[rt].

239
The original RTL for load was:
• mem[PC]                            Fetch the instruction from memory
• Addr ← R[rs] + SignExt(imm16)      Set memory address to the value of the sum of the contents of register rs and the immediate data word imm16
• R[rt] ← Mem[Addr]                  load the data at address Addr into register rt
• PC ← PC + 4                        calculate the address of the next instruction
The RTL for this implementation is:
Cycle 1:  IR ← mem[PC]                        Save instruction in IR
          PC ← PC + 4                         increment PC
Cycle 2:  A ← R[rs]                           save address register for next cycle
          B ← R[rt]
Cycle 3:  ALUOut ← A + signextend(imm16)      calculate address for data and place in ALUOut
Cycle 4:  MDR ← Mem[ALUOut]                   store contents of memory at address ALUOut in MDR
Cycle 5:  R[rt] ← MDR                         store value originally from memory in R[rt]
Recall that this instruction was the longest instruction in the single cycle implementation.

240 The Store instruction
• Addr ← R[rs] + SignExt(imm16)      Calculate the memory address
• Mem[Addr] ← R[rt]                  store the contents of register rt in memory
Cycles: 1 — Inst. Memory; 2 — Reg. Read, mux; 3 — ALU; 4 — Data Memory
The store instruction is much like the load instruction, except that the value in register R[rt] is written into memory, rather than read from it. The main difference is that, in the fourth cycle, the address calculated from R[rs] and imm16 (and saved in ALUOut) is used to store the value from register R[rt] in memory. A fifth cycle is not required.

241
The original RTL for store was:
• mem[PC]                            Fetch the instruction from memory
• Addr ← R[rs] + SignExt(imm16)      Set memory address to the value of the sum of the contents of register rs and the immediate data word imm16
• Mem[Addr] ← R[rt]                  store the contents of register rt in memory at address Addr
• PC ← PC + 4                        calculate the address of the next instruction
The RTL for this implementation is:
Cycle 1:  IR ← mem[PC]                        Save instruction in IR
          PC ← PC + 4                         increment PC
Cycle 2:  A ← R[rs]                           save address register for next cycle
          B ← R[rt]                           save value to be written
Cycle 3:  ALUOut ← A + signextend(imm16)      calculate address for data and place in ALUOut
Cycle 4:  Mem[ALUOut] ← B                     store contents of register rt in memory at address ALUOut

242 The Jump instruction
• PC<31:2> ← PC<31:28> concat target<25:0>    Calculate the jump address by concatenating the high order 4 bits of the PC with the target address
Cycles: 1 — Inst. Memory; 2 — mux → PC
The first cycle, which fetches the instruction from memory and places it in IR, and increments PC by 4, is the same as other instructions.
The next operation is to concatenate the low order 26 bits of the instruction with the high order 4 bits of the PC. In the PC, the low order 2 bits are 0, so they are not actually loaded or stored. The shift of the bits from the instruction can be accomplished without any additional hardware, merely by connecting bit IR[25] to bit PC[27], etc.
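The concatenation is just wiring, but its effect is easy to check with a line or two of arithmetic. A small sketch (Python, with an arbitrary example PC and target value):

    PC = 0x40000404              # example value of the (already incremented) PC
    target = 0x012345            # example 26-bit target field from the instruction, IR[25:0]

    # Keep the high order 4 bits of the PC, and append the 26-bit target shifted left by 2.
    jump_address = (PC & 0xF0000000) | (target << 2)

    print(hex(jump_address))     # 0x40048d14

Since the low order two bits of the shifted target are always zero, no shifter is needed; the wiring described above (IR[25] to PC[27], and so on) produces the same result.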
Note that adding 4 to the PC may cause the four high order bits to change. Could this cause problems?

243
The original RTL for jump was:
• mem[PC]                                     Fetch the instruction from memory
• PC ← PC + 4                                 increment PC by 4
• PC ← PC<31:28> concat (I<25:0> << 2)        replace the low order 28 bits of the PC with the low order 26 bits from the instruction, left shifted by 2
The RTL for this implementation is:
Cycle 1:  IR ← mem[PC]                        Save instruction in IR
          PC ← PC + 4                         increment PC
Cycle 2:  (nothing is done in this cycle)
Cycle 3:  PC ← PC<31:28> concat (IR<25:0> << 2)    replace the low order 28 bits of the PC with the low order 26 bits from the instruction, left shifted by 2
Note that nothing is done for this instruction in cycle 2. There is no clear reason for this, except that cycle 2 is substantially the same for all other instructions, and following this gives a clearer distinction between the fetch–decode–execute cycles.

244 Changes to the datapath for a multi-cycle implementation
We have found that several additional registers are required in the multi-cycle datapath in order to save information from one cycle to the next. These were the registers IR, MDR, A, B, and ALUOut. The overall hardware complexity may be reduced, however, since the adders required for addressing have been replaced by the ALU.
Recall that the primary reason for choosing five cycles was the assumption that the time to obtain a value from memory was the single slowest operation in the datapath. Also, we assumed that the register file operations take a smaller, but comparable, amount of time. If either of these conditions were not true, then quite a different schedule of operations might have been chosen.

245 What is done during each cycle?
For this implementation, we have determined that instructions will be divided into five cycles. (Other divisions are possible, of course, but the original MIPS also used five cycles for the longest instruction.) These cycles are as follows:
1. Instruction fetch (IF)
The instruction is fetched and the next address calculated.
IR ← Memory[PC]
PC ← PC + 4
2. Instruction decode (ID)
The instruction is decoded, and the register values to be read (the contents of registers rs and rt) are stored in registers A and B respectively.
A ← reg[IR[25:21]]
B ← reg[IR[20:16]]
At this time, the target of a branch instruction can also be calculated, because both the PC and the instruction are available. It will have to be stored in a register (ALUOut) until it is used.
ALUOut ← PC + sign-extend(IR[15:0]) << 2
where << 2 means a left-shift of 2.

246
3. Execution (EX)
In this cycle, either
• the ALU operation is completed (for r-type and arithmetic immediate instructions),
ALUOut ← A op B, or ALUOut ← A op sign-extend(IR[15:0])
• or the memory address of a data word is calculated (for load or store),
ALUOut ← A + sign-extend(IR[15:0])
• or the branch instruction is completed if the conditional expression evaluates to TRUE,
if A = B then PC ← ALUOut
Note that if the target address were not calculated in the previous clock cycle, it would have to be calculated in the next one; the ALU is used for the comparison in this cycle.
• or the jump instruction is completed,
PC ← PC[31:28] || (IR[25:0] << 2)
where the operator || denotes concatenation.

247
4. Memory (MEM)
Only the load and store instructions require this cycle. In this cycle, data is read from memory,
MDR ← Memory[ALUOut]
or data is written to memory,
Memory[ALUOut] ← B
5. Writeback (WB)
In this cycle, a value is written to a register in the register file.
Either the r-type and immediate arithmetic operations write their results to the register file,
reg[IR[15:11]] ← ALUOut
or the value read from memory in the previous cycle (for a load instruction) is written into the register file,
reg[IR[20:16]] ← MDR
Note that not all instructions require every cycle. In particular, branch and jump instructions require only the first 3 cycles (IF, ID, and EX). The R-type instructions require 4 cycles (IF, ID, EX, and WB). Store also requires 4 cycles (IF, ID, EX, and MEM). Load requires all 5 cycles (IF, ID, EX, MEM, and WB).

248 The datapath for the multi-cycle processor
Fortunately, after our design of the single cycle processor, we have a good idea of the datapath elements required to implement each individual instruction. We can also seek opportunities to reuse functional blocks in different cycles, potentially reducing the number of hardware blocks (and hence the complexity and cost) of the datapath.
The datapath for the multi-cycle processor is similar to that of the single cycle processor, with
• the addition of the registers noted (IR, MDR, A, B, and ALUOut)
• the elimination of the adders for address calculation
• the extension of a MUX, because there are now three separate calculations for the next address (jump, branch, and the normal incrementing of the PC)
• additional control signals controlling the writing of the registers.
The following diagrams show the datapath for the multi-cycle implementation of the processor. The additions to the datapath for each cycle are shown in red. The required control signals are shown in green in the final figure.

249–255
[A sequence of diagrams building up the multi-cycle datapath, showing: the memory with the IorD address MUX and the instruction register (IR); the register file with the A and B registers; the ALU with the ALUSrcA and ALUSrcB MUXes and the ALUOut register; the memory data register (MDR); the jump address logic and the PCSource MUX; the RegDst and MemtoReg MUXes; and, in the final figure, the control unit with its output signals and the ALU control block.]
The control signals
The following control signals are identified in the datapath:
RegDst
  0 (deasserted): the register written is the rt field
  1 (asserted):   the register written is the rd field
RegWrite
  0: the register file will not be written into
  1: the register addressed by the instruction will be written into
ALUSrcA
  0: the first ALU operand is the PC
  1: the first ALU operand is register A
MemRead
  0: no memory read occurs
  1: the contents of memory at the specified address is placed on the data bus
MemWrite
  0: no memory write occurs
  1: the contents of register B is written to memory at the specified address
MemtoReg
  0: the value written to the register file comes from ALUOut
  1: the value written to the register file comes from the MDR

256
IorD
  0: the memory address comes from the PC (an instruction)
  1: the memory address comes from ALUOut (a data read)
IRWrite
  0: the IR is not written into
  1: the IR is written into (an instruction is read)
PCWrite
  0: none (see below)
  1: the PC is written into; the value comes from the MUX controlled by the signal PCSource
PCWriteCond
  0: if both it and PCWrite are not asserted, the PC is not written
  1: the PC is written if the ALU output Zero is active

257
Following are the 2-bit control signals:
ALUOp
  00: ALU performs ADD operation
  01: ALU performs SUBTRACT operation
  10: ALU performs operation specified by funct field
ALUSrcB
  00: the second ALU operand is from register B
  01: the second ALU operand is 4
  10: the second ALU operand is the sign extended low order 16 bits of the IR (imm16)
  11: the second ALU operand is the sign extended low order 16 bits of the IR shifted left by 2 bits
PCSource
  00: the PC is updated with the value PC + 4
  01: the PC is updated with the value in register ALUOut (the branch target address, for a branch instruction)
  10: the PC is updated with the jump target address
The control unit must now be designed. Since the instructions will now require several states, the control will be a state machine, with the instruction op codes as inputs and the control signals as outputs.

258 Review of instruction cycles and actions
Cycle  Instruction type  Action
IF     all               IR ← Memory[PC]
                         PC ← PC + 4
ID     all               A ← Reg[rs]
                         B ← Reg[rt]
                         ALUOut ← PC + (sign-extend(imm16) << 2)
EX     R-type            ALUOut ← A op B
       Load/Store        ALUOut ← A + sign-extend(imm16)
       Branch            if (A == B) then PC ← ALUOut
       Jump              PC ← PC[31:28] || (IR[25:0] << 2)
MEM    Load              MDR ← Memory[ALUOut]
       Store             Memory[ALUOut] ← B
WB     R-type            Reg[rd] ← ALUOut
       Load              Reg[rt] ← MDR
Note that the first two steps are required for all instructions, and all instructions require at least the first 3 cycles. The MEM step is required only by the load and store instructions.
The ALU control unit is still a combinational logic block, as before.
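This division of work can be summarized in a few lines of code. The sketch below (Python, purely illustrative) records which of the five cycles each class of instruction uses, and from it derives the number of clock cycles each class needs in the multi-cycle design:

    # Which of the five cycles each instruction class passes through.
    cycles_used = {
        "R-type": ["IF", "ID", "EX", "WB"],
        "load":   ["IF", "ID", "EX", "MEM", "WB"],
        "store":  ["IF", "ID", "EX", "MEM"],
        "branch": ["IF", "ID", "EX"],
        "jump":   ["IF", "ID", "EX"],
    }

    for kind, stages in cycles_used.items():
        print(f"{kind:7s} {len(stages)} cycles ({', '.join(stages)})")

So a load takes five clock cycles, R-type and store instructions take four, and branches and jumps take three, in contrast to the single cycle design, where every instruction was stretched to the length of the load.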
259 Design of the control unit
The control unit is a state machine, implementing the state sequencing for every instruction. Following is a partial state machine, detailing the IF and ID stages, which are the same for all instructions:
State 0 (IF, the start state): MemRead = 1, ALUSrcA = 0, IorD = 0, IRWrite = 1, ALUSrcB = 01, ALUOp = 00, PCWrite = 1, PCSource = 00
State 1 (ID): ALUSrcA = 0, ALUSrcB = 11, ALUOp = 00
From state 1, the next state is selected by the op code: OP = 'LW' or OP = 'SW', OP = 'R-type', OP = 'BEQ', or OP = 'J'.
The partial state machines which implement each of the instructions follow.

260 The memory reference instructions (Load and Store)
From state 1, when OP = 'LW' or OP = 'SW':
State 2: ALUSrcA = 1, ALUSrcB = 10, ALUOp = 00
then, if OP = 'LW':
State 3: MemRead = 1, IorD = 1
State 4: RegWrite = 1, MemtoReg = 1, RegDst = 0; to state 0 (instruction completed)
or, if OP = 'SW':
State 5: MemWrite = 1, IorD = 1; to state 0 (instruction completed)

261 R-type instructions
From state 1, when OP = 'R-type':
State 6: ALUSrcA = 1, ALUSrcB = 00, ALUOp = 10
State 7: RegDst = 1, MemtoReg = 0, RegWrite = 1; to state 0 (instruction completed)

262 Branch and Jump instructions
From state 1, when OP = 'BEQ':
State 8: ALUSrcA = 1, ALUSrcB = 00, ALUOp = 01, PCWriteCond = 1, PCSource = 01; to state 0 (instruction completed)
From state 1, when OP = 'J':
State 9: PCWrite = 1, PCSource = 10; to state 0 (instruction completed)
These can be combined into a single state diagram, and a state machine derived from this.

263 The combined control unit
[The complete state diagram, combining states 0–9 above into a single diagram.]

264 Implementing the control unit
All that remains to implement the control unit is to design the control logic itself. Inputs are the instruction op codes, as before, and the outputs are the control signals.
The following steps are typically followed in the implementation of any sequential device:
• Construct the state diagram or equivalent (done).
• Assign numeric (binary) values to the states.
• Choose a memory element for state memory. (Normally, these would be D flip flops or JK flip flops.)
• Design the combinational logic blocks to implement the next-state functions.
• Design the combinational logic blocks to implement the outputs.
The actual implementation can be done in a number of ways; as discrete logic, a PLA, read-only memory, etc. Typically, the control unit would be automatically generated from a description in some high level design language.

265
The control unit we have described is a Moore state machine, where the outputs are a function only of the state.
[Block diagram of a Moore machine: the primary inputs and the present state feed the combinational logic, the state memory holds the current state, and the primary outputs are derived from the state only.]
Following is a state table corresponding to the previous state diagram. Note that the outputs are missing, but they depend only on the state values, not the inputs.

266
Present state   Input            Next state
0 (0000)        X                1 (0001)
1 (0001)        lw   (100011)    2 (0010)
1 (0001)        sw   (101011)    2 (0010)
1 (0001)        R    (000000)    6 (0110)
1 (0001)        BEQ  (000100)    8 (1000)
1 (0001)        J    (000010)    9 (1001)
2 (0010)        lw   (100011)    3 (0011)
2 (0010)        sw   (101011)    5 (0101)
3 (0011)        X                4 (0100)
4 (0100)        X                0 (0000)
5 (0101)        X                0 (0000)
6 (0110)        X                7 (0111)
7 (0111)        X                0 (0000)
8 (1000)        X                0 (0000)
9 (1001)        X                0 (0000)
Note that the outputs are not shown in this table.
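The state table translates almost directly into a next-state function. A small Python sketch of it (illustrative only; the decimal state numbers are those of the table, and rows marked X become the default case that ignores the op code):

    # Transitions that depend on the op code (from states 1 and 2).
    dispatch = {
        1: {"lw": 2, "sw": 2, "R": 6, "BEQ": 8, "J": 9},
        2: {"lw": 3, "sw": 5},
    }
    # Transitions that do not depend on the op code.
    fixed = {0: 1, 3: 4, 4: 0, 5: 0, 6: 7, 7: 0, 8: 0, 9: 0}

    def next_state(state, op):
        return dispatch[state][op] if state in dispatch else fixed[state]

    # Example: the sequence of states traversed by a lw instruction.
    state, trace = 0, [0]
    for _ in range(4):
        state = next_state(state, "lw")
        trace.append(state)
    print(trace)        # [0, 1, 2, 3, 4] (the five cycles of the load)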
The notation X in the input column means that this state change does not depend on the particular instruction, only on the previous state.
The following figure shows an implementation of the next-state logic for the state machine shown previously.

267
[PLA implementation of the next-state logic: the op code bits OP5–OP0 and the present state bits S3–S0 drive an AND plane; the product terms (for states 0, 3 and 6, and for the lw, sw, R, beq and j op codes in states 1 and 2) feed an OR plane which produces the next-state bits; these are stored in D flip flops, the state memory.]

268 Other controller implementations
An alternative to a PLA implementation is an implementation using a read-only memory (or even a read-write memory). In this case, the inputs to the memory would be the OP codes (6 bits) and the state codes (4 bits in our case, but larger in a processor with a richer instruction set). The next-state values and outputs would be stored in the memory.
For this example, there would be 10 (6 + 4) address bits, so the size of the memory would be 2^10 = 1024 words of 16 bits (10 single bit control signals and three 2-bit control signals). This is a large memory for the simple control function we have implemented with a PLA.
[Block diagram of the ROM controller: the op code bits op5–op0 and the state bits form the address inputs; the outputs are the control signals and the next-state bits.]
A hybrid approach could be to use the PLA to generate the next-state values, and a memory for the outputs associated with each state. In this case, the memory size is 2^4 = 16 words of 16 bits.

269 Microprogrammed control
An alternative to designing a classical state machine is to design a microprogrammed control unit. A microprogrammed control unit is a simple processor that generates the control signals for a more complex processor.
[Block diagram: a microcode ROM supplies the datapath control outputs; a microprogram counter with an adder to increment it, together with address select logic driven by the op code, forms the microprogram sequencer that chooses the next microinstruction.]
It has storage for the microcode (values of the control outputs and microinstructions) and a microprogram sequencer which decides the next microprogram operation. (This is essentially a next-state generator.)

270 Next-state generation
Note in the previous state diagram that, in many cases (states 0, 3, 6 in the example), the next state is the numerically next state in the state machine; the state value is merely incremented.
For many other states (states 4, 5, 7, 8, 9 in the example), the next state is the first state (instruction fetch).
For the other states (states 1 and 2 in the example), there is a small subset of next-states reachable from those states. Typically, a dispatch table (stored in ROM) is associated with each such state. For the state machine described earlier, there would be two dispatch tables; one for state 1 and the other for state 2.
They would contain the next-state information as follows:
Dispatch ROM 1
OP       Name     Value      State
000000   R-type   Rformat1   0110
000010   j        JUMP1      1001
000100   beq      BEQ1       1000
100011   lw       Mem1       0010
101011   sw       Mem1       0010

271
Dispatch ROM 2
OP       Name     Value      State
100011   lw       LW2        0011
101011   sw       SW2        0101

The microprogram sequencer can be expanded to the following, where the Address Select Logic block has been expanded to include the four possible sources for the next microinstruction described earlier:
[Block diagram of the expanded microprogram sequencer: the microcode ROM supplies the datapath control outputs; the next microcode address is selected by a MUX from four sources — 0 (instruction fetch), dispatch ROM 1, dispatch ROM 2, and the incremented microprogram counter — with the dispatch ROMs addressed by the op code.]
Each microcode instruction will have to include a control word (input to the MUX above) to control the microprogram sequencer.

272 Designing the microcode
The basic function of a microprogram is to supply the control signals required to implement each instruction in the appropriate order. A microprogram is made up of microinstructions (microcode). The way the microcode is organized, or formatted, depends on a number of things.
Two extremes of microcode are horizontal microcode and vertical microcode.
Horizontal microcode usually requires more storage. It provides all the control signals for a single cycle directly.
Vertical microcode is more compact. Typically, operations are encoded so that the operations can be specified in fewer bits. This supports less parallelism, and the second level of decoding may extend the cycle time.
In either case, it is usual to group together the required outputs and control information into fields. These fields are merely collections of outputs that perform related functions. For example, it might be useful to group together all the signals that control memory, or the ALU. Often, the values in different fields are given labels, much as in assembly language programs.
We can identify the following fields for a microprogrammed implementation of the simple MIPS:

273
Field name        Function
ALU control       Specify the ALU operation for this clock cycle
SRC1              Specify the source for the first ALU operand
SRC2              Specify the source for the second ALU operand
Register control  Specify read or write for the register file, and the source for the write
Memory            Specify read or write and the source for the memory. For a read, specify the destination register
PCWrite control   Specify writing of the PC
Sequencing        Determines how to choose the next microinstruction
Note that the first six fields correspond to sets of control signals for the datapath; the last (Sequencing) determines the source for the address of the next micro-code instruction (next-state).
Typically, those fields would have symbolic values, which would later be translated to actual control signal values, somewhat like the translation of an assembly language program to machine code.
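To make these ideas concrete, here is a small Python sketch (illustrative only; the symbolic field values used are the ones defined on the following slides, and the dispatch tables use the state values from the dispatch ROMs above) of a microinstruction as a set of named fields, and of how the Sequencing field steers the microprogram sequencer:

    # Two microinstructions for instruction fetch and decode (fields left out are inactive).
    fetch0 = {"ALU control": "Add", "SRC1": "PC", "SRC2": "4",
              "Memory": "Read PC", "PCWrite control": "ALU", "Sequencing": "Seq"}
    fetch1 = {"ALU control": "Add", "SRC1": "PC", "SRC2": "Extshft",
              "Register control": "Read", "Sequencing": "Dispatch 1"}

    dispatch1 = {"R-type": 6, "j": 9, "beq": 8, "lw": 2, "sw": 2}   # from dispatch ROM 1
    dispatch2 = {"lw": 3, "sw": 5}                                  # from dispatch ROM 2

    def next_address(current, microinstruction, opcode):
        # Choose the address of the next microinstruction from the Sequencing field.
        seq = microinstruction["Sequencing"]
        if seq == "Seq":
            return current + 1       # simply increment the microprogram counter
        if seq == "Fetch":
            return 0                 # back to the instruction fetch microinstruction
        if seq == "Dispatch 1":
            return dispatch1[opcode]
        return dispatch2[opcode]     # "Dispatch 2"

    print(next_address(0, fetch0, "lw"))   # 1: the second fetch/decode microinstruction
    print(next_address(1, fetch1, "lw"))   # 2: dispatched to the memory address calculation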
274
The following tables show the values for each field:
Field name        Value      Function
ALU control       Add        Add, using the ALU
                  Subt       Subtract, using the ALU
                  Funct      Use the funct field to determine ALU control
SRC1              PC         Use PC as first ALU input
                  A          Use register A as first ALU input
SRC2              B          Use register B as second ALU input
                  4          Use the constant 4 as second ALU input
                  Extend     Use the sign extended imm16 field as the second ALU input
                  Extshft    Use the 2-bit left shifted sign extended imm16 field as the second ALU input
Register control  Read       Read two registers using rs and rt fields, placing results in A and B
                  Write ALU  Write the contents of ALUOut into the register file in register rd
                  Write MDR  Write the contents of MDR into the register file in register rt

275
Field name        Value          Function
Memory            Read PC        Read memory at the address in PC and write result in IR
                  Read ALU       Read memory at the address in ALUOut and write result in MDR
                  Write ALU      Write memory at the address in ALUOut using the contents of B as data
PCWrite control   ALU            Write the output of the ALU into the PC
                  ALUOut-cond    Write the contents of ALUOut into the PC if the Zero output of the ALU is active
                  Jump-address   Write the jump address from the instruction into the PC
Sequencing        Seq            The next microinstruction is the next one in sequence
                  Fetch          The next microinstruction is instruction fetch (state 0)
                  Dispatch i     The next microinstruction is obtained from dispatch ROM i (1 or 2)
Every line of microcode will have a value for each of these fields. Eventually, as in the translation of assembly language instructions, these (symbolic) values will be translated into the actual values of the control signals.

276 Creating a microprogram
Let us look at writing the microcode for a few operations.
The first thing done is the fetching and decoding of an instruction (states 0 and 1 in the state diagram):
Label   ALU control  SRC1  SRC2     Register control  Memory   PCWrite control  Sequencing
Fetch   Add          PC    4                          Read PC  ALU              Seq
        Add          PC    Extshft  Read                                        Dispatch 1
The first line describes the (now familiar) operations of fetching an instruction, storing it in the IR, adding 4 to the PC, and writing the value back to the PC. The second line describes the calculation of the branch address, and the storing of register values in registers A and B.
The Sequencing field determines where the next microcode instruction comes from. For the first microinstruction, it is the next in sequence. For the second, it depends on the op code (Dispatch ROM 1).

277
The memory access instructions lw and sw:
Label   ALU control  SRC1  SRC2    Register control  Memory     PCWrite control  Sequencing
Mem1    Add          A     Extend                                                Dispatch 2
LW2                                                  Read ALU                    Seq
                                   Write MDR                                     Fetch
SW2                                                  Write ALU                   Fetch
Note that the value in the Dispatch 2 table will cause a jump to either LW2 or SW2.
R-type instructions:
Label     ALU control  SRC1  SRC2  Register control  Memory  PCWrite control  Sequencing
Rformat1  Func code    A     B                                                Seq
                                   Write ALU                                  Fetch
Branch and jump instructions (beq and j):
Label   ALU control  SRC1  SRC2  Register control  Memory  PCWrite control  Sequencing
BEQ1    Subt         A     B                               ALUOut-cond      Fetch
JUMP1                                                      Jump address     Fetch

278
What remains is to translate these microinstructions into actual values to be stored in the microcode ROM.
In this case, it is fairly straightforward to identify the values in each field with appropriate values for the control signals:
Field name   Value      Signals active  Comment
ALU control  Add        ALUOp = 00      Cause the ALU to add
             Subt       ALUOp = 01      Cause the ALU to subtract
             Func code  ALUOp = 10      Use the funct field to determine ALU operation
SRC1         PC         ALUSrcA = 0     Use the PC as the ALU's first input
             A          ALUSrcA = 1     Use register A as the ALU's first input
SRC2         B          ALUSrcB = 00    Use register B as the second ALU input
             4          ALUSrcB = 01    Use 4 as the second ALU input
             Extend     ALUSrcB = 10    Use the sign extended imm16 field as the second ALU input
             Extshft    ALUSrcB = 11    The shifted sign extended imm16 field is the second ALU input

279
Field name        Value      Signals active                   Comment
Register control  Read                                        Place contents of registers referenced by rs, rt in registers A, B
                  Write ALU  RegWrite, RegDst=1, MemtoReg=0   Write the contents of ALUOut to register rd
                  Write MDR  RegWrite, RegDst=0, MemtoReg=1   Write the contents of MDR to register rt
Memory            Read PC    MemRead, IorD=0, IRWrite         Place the value in memory at the address referenced by the PC into the IR and MDR
                  Read ALU   MemRead, IorD=1                  Place the value in memory at address ALUOut into the MDR
                  Write ALU  MemWrite, IorD=1                 Write memory using ALUOut as address, B contents as data

280
Field name        Value         Signals active             Comment
PC write control  ALU           PCSource=00, PCWrite       Write ALU output to PC
                  ALUOut-cond   PCSource=01, PCWriteCond   If the ALU Zero output is active, write the contents of ALUOut to the PC
                  Jump address  PCSource=10, PCWrite       Write jump address from instruction to PC
Sequencing        Fetch         AddrCtl=00                 Go to the first microinstruction
                  Dispatch 1    AddrCtl=01                 Microcode address from ROM 1
                  Dispatch 2    AddrCtl=10                 Microcode address from ROM 2
                  Seq           AddrCtl=11                 Next microinstruction is sequential

281
It is now a matter of straightforward substitution to arrive at the microcode to be stored in the ROM:
State  ALU control  SRC1  SRC2  Register Control  Memory  PCWrite control  Sequencing
0      00           0     01    000               1001    0010             11
1      00           0     11    000               0000    0000             01
2      00           1     10    000               0000    0000             10
3      00           0     00    000               1010    0000             11
4      00           0     00    101               0000    0000             00
5      00           0     00    000               0110    0000             00
6      10           1     00    000               0000    0000             11
7      00           0     00    110               0000    0000             00
8      01           1     00    000               0000    0101             00
9      00           0     00    000               0000    1010             00
The 18 control signals here are, in order:
ALU control       ALUOp[2]
SRC1              ALUSrcA
SRC2              ALUSrcB[2]
Register Control  RegWrite, RegDst, MemtoReg
Memory            MemRead, MemWrite, IorD, IRWrite
PCWrite control   PCSource[2], PCWrite, PCWriteCond
Sequencing        AddrCtl[2]

282 Advantages/disadvantages of microprogram control
For large instruction sets:
• The control is easier to design — similar to programming
• The control is more flexible — easier to adapt or modify
• Changes to the instruction set can be made late in the design cycle
• Very powerful instruction sets can be implemented in different datapaths
Generality:
• Different instruction sets can be implemented on the same machine
• Instruction sets can potentially be adapted to the particular application
• Many different datapath organizations can be used with the same instruction set (cost/performance tradeoffs)
Microcode control can be slower than direct logic implementation of the control, and may require more circuitry (transistors). It also may encourage "instruction set bloat" — adding instructions because they can easily be provided.

283 Adding additional instructions
Clearly, adding an additional instruction can be accomplished by adding to the control unit, provided that the instruction can actually be implemented in the datapath.
For example, adding the ori instruction would require adding a third bit to the control signal ALUOp in order to be able to encode logic operations, and adding the capability to zero extend the imm16 field (with control signal ExtOp, as before). These additions are the same as those required for the single cycle implementation, and their truth tables can be referred to for the appropriate values for these control signals. Note that the new control signals may have to be added to the existing states, as well.
The additional control signals would also have to be generated by the controller. A microprogrammed control unit is usually easier to modify than a conventional controller. It may be slower, though, because of the (local) memory access time for the microinstructions.
The following diagram shows the modified datapath and controls for the processor with the ori instruction.

284–285
[The multi-cycle datapath and control, modified for the ori instruction: the control unit generates the additional signal ExtOp, the sign extender becomes a "sign or 0 extend" unit, and ALUOp is now 3 bits wide.]

The following shows the additions to the state diagram required to implement the ORI instruction:
From state 1, when OP = 'ORI':
State 14: ALUSrcA = 1, ALUSrcB = 10, ALUOp = 010, ExtOp = 0
State 15: RegDst = 0, MemtoReg = 0, RegWrite = 1; to state 0 (instruction completed)
Also, in states 0, 1, and 2, the control signal ALUOp would have to change from 00 to 000. In state 6, it would change from 10 to 100, and in state 8, from 01 to 001. The control signal ExtOp would have to be set to a value of 1 in states 1 and 2.

286 Modifying the microcode to add the ori instruction
The two additional control signals would have to be added. The third bit in the ALUOp control would naturally be added to the ALU control field, as would a label for the OR function. The control signal ExtOp would also have to be added to one of the fields, say, SRC2.
Field name   Value      Signals active         Comment
ALU control  Add        ALUOp = 000            Cause the ALU to add
             Subt       ALUOp = 001            Cause the ALU to subtract
             Or         ALUOp = 010            Cause the ALU to perform OR
             Func code  ALUOp = 100            Use the funct field to determine ALU operation
SRC2         B          ALUSrcB=00             Use register B as the second ALU input
             4          ALUSrcB=01             Use 4 as the second ALU input
             Extend     ExtOp=1, ALUSrcB=10    Sign extension of imm16; use the sign extended imm16 field as the second ALU input
             Extshft    ExtOp=1, ALUSrcB=11    Sign extension of imm16; the shifted sign extended imm16 field is the second ALU input
             UExtend    ExtOp=0, ALUSrcB=10    Unsigned extension of imm16; use the imm16 field as the second ALU input

287
Note that two labels have been added; OR, to specify an OR operation in the ALU, and UExtend to specify unsigned extension. Sign extension was also explicitly specified, where required.
Less obviously, another label, say, Write ALUi, has to be added to the Register control field, because the value to be written comes from the register ALUOut and is to be written to the register indexed by rt, which requires a new combination of control signals.
Field name        Value       Signals active                   Comment
Register control  Read                                         Read 2 registers using the rs and rt fields and save the results in registers A and B
                  Write ALU   RegWrite=1, RegDst=1, MemtoReg=0   Write to the register file using the rd field as destination and ALUOut as source
                  Write MDR   RegWrite=1, RegDst=0, MemtoReg=1   Write to the register file using the rt field as destination and MDR as the source
                  Write ALUi  RegWrite=1, RegDst=0, MemtoReg=0   Write to the register file using the rt field as destination and ALUOut as source

288
Since two states were added, two microcode instructions would also be required. The microcode is similar to that for the R-type instructions.
The ori instruction:
Label  ALU control  SRC1  SRC2     Register control  Memory  PCWrite control  Sequencing
ORi    OR           A     UExtend                                             Seq
                                   Write ALUi                                 Fetch
Note that the additional two signals would now automatically be provided for all instructions, since they are specified in the microcode fields.
One other change is required. The op code for the instruction ori has to be added to Dispatch ROM 1.
Dispatch ROM 1
OP       Name     Value
000000   R-type   Rformat1
000010   j        JUMP1
000100   beq      BEQ1
001101   ori      ORi
100011   lw       Mem1
101011   sw       Mem1

289 Exceptions and interrupts
A feature of virtually all processors is the capability to respond to error conditions, and to be "interrupted" by some external condition. These interruptions to the normal flow of events in the processor are called exceptions or interrupts.
We will call something with a cause external to the processor an interrupt, and an exception when the cause is internal to the processor (say, an illegal instruction). Note that this is by no means a standard nomenclature; the terms are often used interchangeably.
Normally, interrupts and exceptions are handled by a combination of hardware (the processor) and software (the operating system).
Three things are required when an exception occurs:
1. The cause of the exception must be recorded.
2. The exception must be "handled" in some way. Normally, the processor jumps to some location in memory where there is code for an "exception handler." The PC is set to this address by the processor hardware.
3. The processor must have some way to return to the code that was originally running, after handling the exception.

290 Adding exception handling
We will implement the hardware and control functions to handle two types of exceptions; undefined instruction and arithmetic overflow. Recall that the ALU had an overflow detection output, which can be used as an input to the controller.
1. We will use a register labeled Cause to store a number (0 or 1) to identify the type of exception (0 for undefined instruction, 1 for arithmetic overflow). It requires a control signal CauseWrite to be generated by the controller. The controller also must set the value written to the register, depending on whether or not the exception was an arithmetic overflow. The control signal IntCause is used to set this value.
2. The PC will be set to memory address C0000000 where the operating system is expected to provide an event handler. This is accomplished by adding another input (input 3) to the MUX which updates the PC address. The MUX is controlled by the 2-bit signal PCSource.
3. The address of the instruction which caused the exception is stored in the register EPC, a 32 bit register. Writing to this register is controlled by the new signal EPCWrite.

291
Storing the address of the instruction can be done several ways; for example, it could be stored at the beginning of each instruction.
This would require a change to the datapath, and a way to disable the storing of the address after each exception. It is possible to store the address with only a small change to the datapath (merely adding the EPC register to accept the output of the ALU).
Recall that the next address (PC + 4) is calculated in the ALU, and is written to the PC in the first cycle of every instruction. The ALU can be used to subtract the value 4 from the PC after an exception is detected, but before it is written into the EPC, so it contains the actual address of the present instruction. (Actually, there would be no real problem with saving the value PC + 4 in the EPC; the interrupt handler could be responsible for the subtraction.)
So, in order to handle these two exceptions, we have added two registers — EPC and Cause, and three control signals — EPCWrite, IntCause, and CauseWrite.
The changes to the processor datapath and control signals required for the implementation of the exceptions detailed above are shown in the following diagram.

292–293
[The multi-cycle datapath and control with exception handling added: the Cause and EPC registers, the control signals CauseWrite, IntCause, and EPCWrite, the overflow output of the ALU, and the constant C0000000 as input 3 of the PCSource MUX.]

Adding exception handling to the control unit
The exceptions overflow and undefined instruction can be implemented by the addition of only one state each:
State 10 (from state 1, when OP = 'other', i.e. an undefined instruction): IntCause = 0, CauseWrite = 1, ALUSrcA = 0, ALUSrcB = 01, ALUOp = 01, PCSource = 11, EPCWrite = 1, PCWrite = 1; to state 0.
State 11 (entered when the overflow input is asserted): IntCause = 1, CauseWrite = 1, ALUSrcA = 0, ALUSrcB = 01, ALUOp = 01, PCSource = 11, EPCWrite = 1, PCWrite = 1; to state 0.
The input overflow is an output from the ALU. It is a combinational logic output, produced while the ALU is performing the selected operation.

294
[The complete state diagram: the original states 0–9, together with state 10 (reached from state 1 when the op code is not one of the defined instructions) and state 11 (reached when the ALU overflow output is asserted during an R-type instruction).]
The control unit, with exception handling

295
The ALU operation which could result in an overflow is done in the EX cycle, and the overflow signal is only available then, unless it is saved in a register.

Adding interrupts and exceptions with microcode
It is not difficult to add exception handling with microcode. Note that three additional control signals were added. The simplest thing to do is to add another microcode field, called, say, Exception. It determines the values of the three control signals, EPCWrite, IntCause, and CauseWrite.
Field name  Value      Signals active  Comment
Exception   Overflow   EPCWrite=1      Save the output of the ALU (PC - 4) in EPC
                       IntCause=1      Select the cause input
                       CauseWrite=1    Write the selected value in the Cause register
            Undefined  EPCWrite=1      Save the output of the ALU (PC - 4) in EPC
                       IntCause=0      Select the cause input
                       CauseWrite=1    Write the selected value in the Cause register
The operations required are that the PC is decremented by 4, saved in the EPC, the appropriate value set in the Cause register, and a jump effected to the exception handler at address C0000000.

296
Another required modification to the microcode is to add a selection to the PC write control field to accommodate the jump to the exception handler:
Field name        Value         Signals active              Comment
PC write control  ALU           PCSource=00, PCWrite=1      Select the ALU output as the source for the PC; write into the PC
                  ALUOut-cond   PCSource=01, PCWritecond=1  Select the ALUOut register as the source for the PC; write into the PC if the Zero output of the ALU is set
                  jump-address  PCSource=10                 Select the jump address field as the source for the PC
                  Exception     PCSource=11                 Select address C0000000 as the value to be written to the PC
The other changes required are:
• Add the exception state address to Dispatch ROM 1
• Add Dispatch ROM 3 (with the Overflow output as address) to the microprogram sequencer. This requires adding another bit to the Sequencing field, as well.
• Changing the Sequencing field of the microcode for R-type instructions at label Rformat1 from Seq to Dispatch 3.

297
Microcode for the exceptions:
Label      ALU control  SRC1  SRC2  Register control  Memory  PCWrite control  Exception  Sequencing
Overflow   Subt         PC    4                               Exception        Overflow   Fetch
Undefined  Subt         PC    4                               Exception        Undefined  Fetch
Of course, the Exception field must be added to all the other microcode lines, but the values will all be the default (0) values.

298 More about interrupts
The ability to handle interrupts and exceptions is an important feature for processors. We have added the control logic to detect the two types of exceptions described earlier, but note that the Cause and the EPC register cannot be read. Instructions would have to be provided to allow these registers to be read and manipulated.
Processors usually have policies relating to exceptions. The MIPS processor had the policy that an instruction which causes an exception has no effect (e.g., nothing is written into a register). For some exceptions, if this policy is used, the operation may have to complete before the exception can be detected, and the result of the operation must then be "rolled back."
This makes the implementation of exceptions difficult — sometimes the state prior to an operation must be saved so it can be restored. This constraint alone sometimes results in instructions requiring more cycles for their implementation.

299 Exceptions and interrupts in other processors
A common type of interrupt is a vectored interrupt. Here, different interrupts or exceptions jump to different addresses in memory. The operating system places an appropriate interrupt handler for the particular interrupt at each of these locations.
A vectored interrupt both identifies the type of interrupt and provides the handler at the same time, since different interrupts or exceptions have different vectors.
In the INTEL processors, it is the responsibility of the interrupting device to provide the interrupt vector. (This is usually done by one of the peripheral controller chips, under control of the operating system.)
A major problem with the PC architecture is that only a small number of interrupts (typically 16) can be handled by the controller chip. This has led to many problems with hardware devices "sharing interrupts" — defeating the advantages of vectored interrupts.
We will look at interrupts again, later, when we discuss input and output devices.

300 Some questions about exceptions and interrupts
The following questions often have different answers for different processors:
• How does a processor return control of the program flow from the exception or interrupt handler to the interrupted program? Some processors have explicit instructions for this (e.g., the MIPS processors); others treat interrupts and exceptions as being similar to subprogram calls (INTEL processors do this).
• What happens when an exception or interrupt is itself interrupted? Some processors save the return addresses in a stack data structure, and successive levels of interrupts just increase the stack depth. Typically, this is the way subprogram return addresses are also stored. Some processors automatically turn off the interrupt capability at the beginning of an interrupt, and it must be explicitly turned back on by the interrupt or exception handler to accept another interrupt. Some processors have both features — instructions can turn the interrupt capability on and off, and can allow interrupts to be interrupted themselves. (This turns out to be important for implementing certain operating system functions.)

301 Comments on our implementation of exceptions
Note that our implementation has only one register for the address of the interrupting instruction, and no way to read that address and modify it to resume the program where the exception occurred. What changes would be required to the instruction set to accomplish this?
The simplest solution would probably be to allow only one interrupt at a time, by disabling the interrupt capability, and to provide:
1. An instruction to store the EPC in the register file.
2. An instruction to store the Cause register in the register file.
3. An instruction to turn on the interrupt capability after the next instruction has completed execution. (This assumes that the next instruction restores the PC to the address of the instruction following the one that caused the exception.)
Note that these would require changes to the datapath and control. This example was just to give the flavor of the problems involved with handling exceptions in the processor. More complex instruction sets and architectures exacerbate the problems.

302 Comments on handling interrupts
Although exception handling is complex, it is often simpler than the handling of external interrupts. Exceptions occur as a result of occurrences internal to the processor. Consequently, they are usually predictable, and they occur and are detected at known times in the execution of a particular instruction.
Interrupts are external events, and are not at all synchronized with the execution of instructions in the processor. Since interrupts may be notification of an urgent event, they usually require fast servicing. Decisions therefore have to be taken about exactly when in the execution of an instruction an interrupt will be detected and handled. Some of the considerations are:
• If the instruction is not allowed to complete, information must be retained in order to either continue or restart the interrupted instruction. How will this be done?
• If the interrupted instruction is allowed to complete, how will the processor return to the next instruction in the current program?

• Can the interrupt handler be interrupted?

• Can interrupts be prioritized, so that a high priority interrupt can interrupt a lower priority interrupt?

303 How can we “speed up” the processor?

One idea is to try to make the most frequently used instructions as fast as possible.

Instruction distributions (fraction of instructions) for some common program types:

Instruction type   LaTeX   C compiler   Fortran (numerical)
calls              0.012   0.006        0.010
branches           0.115   0.229        0.068
loads/stores       0.331   0.231        0.456
flops              0.001   0.000        0.163
data (R-type)      0.414   0.293        0.284
nops               0.127   0.241        0.059

304 Instruction counts for a 60 page LaTeX document (the GWM manual)
(The fraction column is the count divided by the total number of instructions.)

count         fraction   type
1387566431    (1.004)    cycles (55.5s @ 25.0MHz)
1382108615    (1.000)    instructions
206864803     (0.150)    basic blocks
19570428      (0.014)    calls
342200862     (0.248)    loads
181925435     (0.132)    stores
524126297     (0.379)    loads+stores
524252660     (0.379)    data bus use
50344780      (0.036)    partial word references
150139046     (0.109)    branches
316292645     (0.229)    nops
0             (0.000)    load interlock cycles
5292110       (0.004)    multiply/divide interlock cycles
124148        (0.000)    flops (0.00224 mflops/s @ 25.0MHz)

305 FORTRAN number crunching – hard-sphere molecular dynamics calculation

count         fraction   type
873855362     (1.050)    cycles (35s @ 25.0MHz)
832495305     (1.000)    instructions
81119362      (0.097)    basic blocks
8071192       (0.010)    calls
289695712     (0.348)    loads
112932164     (0.136)    stores
402627876     (0.484)    loads+stores
426704925     (0.513)    data bus use
258649        (0.000)    partial word references
56782868      (0.068)    branches
20814556      (0.025)    nops
0             (0.000)    load interlock cycles
751865        (0.001)    multiply/divide interlock cycles
124343032     (0.149)    flops (3.56 mflop/s @ 25.0MHz)
40496083      (0.049)    floating point data interlock cycles
8015          (0.000)    floating point add interlock cycles
46335         (0.000)    floating point multiply interlock cycles
0             (0.000)    floating point divide interlock cycles
57759         (0.000)    other floating point interlock cycles
24071112      (0.029)    1 cycle interlocks
24052679      (0.029)    overlapped floating point cycles

306 Other ideas for “speedup”

There are a number of ways of “speeding up” a multicycle processor — generally by doing certain operations in parallel.

For example, in the INTEL 80x86 processors, the fetching of instructions from memory is decoupled from the instruction execution. There is a logically separate bus interface unit which attempts to fill an instruction queue during the times when the execution unit is not receiving operands from memory. (The 80x86 is not a load/store processor.) Would this be a useful idea for our multicycle implementation of the MIPS?

Another possibility is performing operations in the datapath in parallel. For example, it is not unusual for a processor to have different adders for integer and floating point operations, and those operations can be performed simultaneously. (The MIPS R2000/R3000 performs floating point operations in parallel with integer operations.)

307 A Gantt chart showing a simple, multicycle implementation

[Figure: Gantt chart with stages IF, RD, ALU, MEM, and WB on the vertical axis and time (clock cycles 0 to 14) on the horizontal axis]

The thick lines indicate memory accesses. Not all instructions would require all the cycles shown.

308 A simple overlap implementation — here the instruction fetch proceeds during the WB clock phase, in which results are written into internal registers.
[Figure: Gantt chart of the overlapped implementation; the IF stage of the next instruction overlaps the WB stage of the current one]

Could this implementation be done with the multicycle datapath shown earlier? Are the resources used in the IF cycle also used in the WB cycle? What parts of the instruction register are required? When in the cycle are they used? Which instructions do not have a WB cycle, and how could they be handled?

309 An implementation which makes full use of a single memory bus — data reads and writes do not interfere with instruction fetch.

[Figure: Gantt chart in which data memory accesses (MEM) are scheduled in cycles when the memory bus is not being used for instruction fetch (IF)]

Note that the single memory is a bottleneck. In reality, not every instruction accesses data from memory; in the sample codes earlier, only 1/4 to 1/2 of the instructions were loads or stores. The Gantt chart for this situation would be more complex. Would the single cycle datapath be sufficient in this case? What instructions would cause problems with this implementation?

310 Pipelining

Pipelining is a technique which allows several instructions to overlap in time; different parts of several consecutive instructions are executed simultaneously.

The basic structure of a pipelined system is not very different from the multicycle implementation previously discussed. In the pipelined implementation, however, resources from one cycle cannot be reused by another cycle. Also, the results from each stage in the pipeline must be saved in a pipeline register for use in the next pipeline stage.

311 A pipelined implementation

[Figure: Gantt chart of a pipelined implementation; a new instruction enters the IF stage every clock cycle]

Note that two memory accesses may be required in each machine cycle (an instruction fetch, and a memory read or write.) How could this problem be reduced or eliminated?

312 What is required to pipeline the datapath?

Recall that when the multi-cycle implementation was designed, information which had to be retained from cycle to cycle was stored in a register until it was needed. In a pipelined implementation, the results from each pipeline stage must be saved if they will be required in the next stage.

In a multi-cycle implementation, resources could be “shared” by different cycles. In a pipelined implementation, every pipeline stage must have all the resources it requires on every clock cycle. A pipelined implementation will therefore require more hardware than either a single cycle or a multicycle implementation.

A reasonable starting point for a pipelined implementation would be to add pipeline registers to the single cycle implementation. We could have each pipeline stage do the operations in each cycle of the multi-cycle implementation. The next figure shows a first attempt at the datapath with pipeline registers added.

313-314 [Figure: the single cycle datapath with pipeline registers inserted between the IF, ID, EX, MEM, and WB stages]

It is useful to note the changes that have been made to the datapath. The most obvious change is, of course, the addition of the pipeline registers. The addition of these registers introduces some questions. How large should the pipeline registers be? Will they be the same size in each stage?

The next change is to the location of the MUX that updates the PC. This must be associated with the IF stage. In this stage, the PC should also be incremented.
The third change is to preserve the address of the register to be written in the register file. This is done by passing the address along the pipeline registers until it is required in the WB stage. The output of the MUX which provides the write address is now stored in a pipeline register.

315 Pipeline control

Since five instructions are now executing simultaneously, the controller for the pipelined implementation is, in general, more complex. It is not as complex as it appears at first glance, however.

For a processor like the MIPS, it is possible to decode the instruction in the early pipeline stages, and to pass the control signals along the pipeline in the same way as the data elements are passed through the pipeline. (This is what will be done in our implementation.) A variant of this would be to pass the instruction field (or parts of it) and to decode the instruction as needed for each stage.

For our processor example, since the datapath elements are the same as for the single cycle processor, the control signals required must be similar, and can be implemented in a similar way. All the signals can be generated early (in the ID stage) and passed along the pipeline until they are required.

316-317 [Figure: the pipelined datapath with control; the control signals (RegDst, ALUSrc, ALUop, MemRead, MemWrite, Branch, RegWrite, MemtoReg, PCSrc) are generated in the ID stage and carried forward in the EX, MEM, and WB fields of the pipeline registers]

Executing an instruction

In the following figures, we will follow the execution of an instruction through the pipeline. The instructions we have implemented in the datapath are those of the simplest version of the single cycle processor, namely:

• the R-type instructions
• load
• store
• beq

We will follow the load instruction, as an example.
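Before turning to the datapath figures, the timing benefit of pipelining can be illustrated with a small sketch (not part of the course materials) that prints which stage each instruction occupies in each clock cycle, assuming an ideal five-stage pipeline with one instruction issued per cycle and no stalls or hazards. The instructions listed are arbitrary, independent examples.

STAGES = ["IF", "ID", "EX", "MEM", "WB"]

def pipeline_chart(instructions):
    """Print a simple Gantt-style chart for an ideal 5-stage pipeline."""
    n_cycles = len(instructions) + len(STAGES) - 1
    header = "".join(f"{c:>5}" for c in range(1, n_cycles + 1))
    print(" " * 18 + header)
    for i, text in enumerate(instructions):
        cells = []
        for cycle in range(n_cycles):
            stage = cycle - i          # stage index this instruction occupies in this cycle
            cells.append(f"{STAGES[stage]:>5}" if 0 <= stage < len(STAGES) else " " * 5)
        print(f"{text:<18}" + "".join(cells))

pipeline_chart([
    "lw  $2, 100($5)",
    "add $4, $3, $6",
    "sw  $7, 400($8)",
])
# Three instructions finish in 3 + 4 = 7 cycles; executed one at a time,
# the same work would take 3 * 5 = 15 cycles (if each used all 5 stages).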
318-323 [Figures: the pipelined datapath (with the IF/ID, ID/EX, EX/MEM, and MEM/WB pipeline registers), repeated for successive clock cycles to show the load instruction progressing through the IF, ID, EX, MEM, and WB stages]

Representing a pipeline pictorially

These diagrams are rather complex, so we often represent a pipeline with simpler figures, as follows:

[Figure: pictorial representation; each instruction (LW, ADD, SW) is drawn as a row of stage boxes IM, REG, ALU, DM, REG, with each row offset by one clock cycle from the previous one]

Often an even simpler representation is sufficient:

[Figure: each instruction drawn simply as the stage sequence IF, ID, ALU, MEM, WB, again offset by one clock cycle per instruction]

The following figure shows a pipeline with several instructions in progress:

324-325 [Figure: pictorial pipeline with six instructions (LW, ADD, SW, SUB, BEQ, AND) in progress at the same time]

Pipeline “hazards”

There are three types of “hazards” in pipelined implementations — structural hazards, control hazards, and data hazards.

Structural hazards

Structural hazards occur when there are insufficient hardware resources to support the particular combination of instructions presently being executed. The present implementation has a potential structural hazard if there is a single memory for data and instructions. Other structural hazards cannot happen in a simple linear pipeline, but for more complex pipelines they may occur.

Control hazards

These hazards happen when the flow of control changes as a result of some computation in the pipeline. One question here is what happens to the rest of the instructions in the pipeline. Consider the beq instruction. The branch address calculation and the comparison are performed in the EX cycle, and the branch address is returned to the PC in the next cycle.

326 What happens to the instructions in the pipeline following a successful branch? There are several possibilities.
One possibility is to stall the instructions following a branch until the branch result is determined. (Some texts refer to a stall as a “bubble.”) This can be done by the hardware, by stopping (stalling) the pipeline for several cycles when a branch instruction is detected.

[Figure: beq followed by add and lw; the add and lw are held back by stall cycles until the branch result is known]

It can also be done by the compiler, by placing several nop instructions following a branch. (It is not called a pipeline stall then.)

[Figure: the same sequence with three nop instructions placed after the beq by the compiler, followed by the add and lw]

327 Another possibility is to execute the instructions in the pipeline. It is left to the compiler to ensure that those instructions are either nops or useful instructions which should be executed regardless of the branch test result. This is, in fact, what was done in the MIPS. It had one “branch delay slot”, which the compiler could fill with a useful instruction about 50% of the time.

[Figure: beq, the instruction in the branch delay slot, and the instruction at the branch target flowing through the pipeline with no stalls]

We saw earlier that branches are quite common, and inserting many stalls or nops is inefficient. For long pipelines, however, it is difficult to find useful instructions to fill several branch delay slots, so this idea is not used in most modern processors.

328 Branch prediction

If branches could be predicted, there would be no need for stalls. Most modern processors do some form of branch prediction.

Perhaps the simplest is to predict that no branch will be taken. In this case, the pipeline is flushed if the branch prediction is wrong, and none of the results of the instructions in the pipeline are written to the register file.

How effective is this prediction method? What branches are most common? Consider the most common control structure in most programs — the loop. In this structure, the most common result of a branch is that it is taken; consequently the next instruction in memory is a poor prediction. In fact, in a loop, the branch is not taken exactly once — at the end of the loop. A better choice may be to record the last branch decision (or the last few decisions) and make a prediction based on the branch history.

Branches are problematic in that they are frequent, and cause inefficiencies by requiring pipeline flushes. In deep pipelines, this can be computationally expensive.

329 Data hazards

Another common pipeline hazard is the data hazard. Consider the following instructions:

add $r2, $r1, $r3
add $r5, $r2, $r3

Note that $r2 is written in the first instruction, and read in the second. In our pipelined implementation, however, $r2 is not written until four cycles after the second instruction begins, and therefore three bubbles or nops would have to be inserted before the correct value would be read.

[Figure: add $r2,$r1,$r3 immediately followed by add $r5,$r2,$r3; the second instruction reads $r2 before the first instruction has written it — a data hazard]

The following would produce a correct result:

[Figure: the same pair of instructions with three nops between them, so the read of $r2 occurs after it has been written]

The following figure shows a series of pipeline hazards.

330-331 [Figure: add $2,$1,$3 followed by sub $5,$2,$3, and $7,$6,$2, beq $0,$2,-25, and sw $7,100($2); the following instructions all read register $2, which is written by the add]

Handling data hazards

There are a number of ways to reduce data hazards. The compiler could attempt to reorder instructions so that instructions reading recently written registers are not too close together, and insert nops where it is not possible to do so (see the sketch below). For deep pipelines, this is difficult.
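A minimal sketch (not part of the course materials) of this compiler approach: instructions are modelled as (text, destination, sources) tuples, and nops are inserted so that no instruction reads a register written by any of the three preceding instructions, which is the spacing the discussion above says this pipeline needs when there is no forwarding. A real compiler would first try to move useful, independent instructions into the gap before falling back to nops.

NOP = ("nop", None, ())
REQUIRED_GAP = 3     # nops needed between a write and a dependent read (no forwarding)

def insert_nops(program):
    scheduled = []
    for text, dest, sources in program:
        # Keep inserting nops while any of the last REQUIRED_GAP scheduled
        # instructions writes a register this instruction wants to read.
        while any(prev_dest is not None and prev_dest in sources
                  for _, prev_dest, _ in scheduled[-REQUIRED_GAP:]):
            scheduled.append(NOP)
        scheduled.append((text, dest, sources))
    return scheduled

program = [
    ("add $2, $1, $3", "$2", ("$1", "$3")),
    ("add $5, $2, $3", "$5", ("$2", "$3")),   # reads $2 too soon after it is written
]

for text, _, _ in insert_nops(program):
    print(text)
# prints: add $2, $1, $3 / nop / nop / nop / add $5, $2, $3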
Hardware could be constructed to detect hazards, and insert stalls in the pipeline where necessary. This also slows down the pipeline (it is equivalent to adding nops.)

An astute observer could note that the result of the ALU operation is stored in the pipeline register at the end of the ALU stage, two cycles before it is written into the register file. If instructions could take the value from the pipeline register, it could reduce or eliminate many of the data hazards. This idea is called forwarding. The following figure shows how forwarding would help in the pipeline example shown earlier.

332-333 [Figure: the instruction sequence add $2,$1,$3; sub $5,$2,$3; and $7,$6,$2; beq $0,$2,-25; sw $7,100($2), with arrows showing the value of $2 forwarded from the pipeline registers to the later instructions]

Note how forwarding eliminates the data hazards in these cases.

Implementing forwarding

Note from the previous examples that there are now two potential additional sources of operands for the ALU during the EX cycle — the EX/MEM pipeline register and the MEM/WB pipeline register. What additional hardware would be required to provide the data from the pipeline stages?

The data to be forwarded could be required by either of the inputs to the ALU, so two MUX’s would be required — one for each ALU input. The MUX’s would have three sources of data: the original data from the registers (in pipeline stage ID/EX), or the two pipeline registers to be forwarded from.

Looking only at the datapath for R-type operations, the additional hardware would be as follows:

334 [Figure: the EX-stage datapath with a MUX (controlled by ForwardA) on the first ALU input and a MUX (controlled by ForwardB) on the second ALU input, each selecting among the ID/EX register values, the EX/MEM ALU result, and the MEM/WB write-back value]

There would also have to be a “forwarding unit” which provides the control signals for these MUX’s.

335 Forwarding control

Under what conditions does a data hazard (for R-type operations) occur? It is when a register to be read in the EX cycle is the same register as one targeted to be written, whose value is held in either the EX/MEM pipeline register or the MEM/WB pipeline register. These conditions can be expressed as:

1. EX/MEM.RegisterRd = ID/EX.RegisterRs or ID/EX.RegisterRt
2. MEM/WB.RegisterRd = ID/EX.RegisterRs or ID/EX.RegisterRt

Some instructions do not write registers, so the forwarding unit should check to see if the register actually will be written. (If it is to be written, the control signal RegWrite, also carried in the pipeline, will be set.)

Also, an instruction may try to write some value in register 0. More importantly, it may try to write a non-zero value there, which should not be forwarded — register 0 is always zero. Therefore, register 0 should never be forwarded.
336 The control signals ForwardA and ForwardB have values defined as:

MUX control   Source    Explanation
00            ID/EX     Operand comes from the register file (no forwarding)
01            MEM/WB    Operand forwarded from a memory operation or an earlier ALU operation
10            EX/MEM    Operand forwarded from the previous ALU operation

The conditions for a hazard with a value in the EX/MEM stage are:

if (EX/MEM.RegWrite and (EX/MEM.RegisterRd ≠ 0)
    and (EX/MEM.RegisterRd = ID/EX.RegisterRs)) then ForwardA = 10

if (EX/MEM.RegWrite and (EX/MEM.RegisterRd ≠ 0)
    and (EX/MEM.RegisterRd = ID/EX.RegisterRt)) then ForwardB = 10

337 For hazards with the MEM/WB stage, an additional constraint is required in order to make sure the most recent value is used:

if (MEM/WB.RegWrite and (MEM/WB.RegisterRd ≠ 0)
    and (EX/MEM.RegisterRd ≠ ID/EX.RegisterRs)
    and (MEM/WB.RegisterRd = ID/EX.RegisterRs)) then ForwardA = 01

if (MEM/WB.RegWrite and (MEM/WB.RegisterRd ≠ 0)
    and (EX/MEM.RegisterRd ≠ ID/EX.RegisterRt)
    and (MEM/WB.RegisterRd = ID/EX.RegisterRt)) then ForwardB = 01

The datapath with the forwarding control is shown in the next figure.

338 [Figure: the pipelined datapath with a forwarding unit; the unit compares EX/MEM.RegisterRd and MEM/WB.RegisterRd with the rs and rt fields held in ID/EX and drives the ForwardA and ForwardB MUX’s on the ALU inputs]

For a datapath with forwarding, the hazards which are fixed by forwarding are not considered hazards any more.

339 Forwarding for other instructions

What considerations would have to be made if other instructions were to make use of forwarding?

The immediate instructions
The major difference is that the B input to the ALU comes from the instruction and the sign extension unit, so the present MUX controlled by the ALUSrc signal could still be used as input to the ALU. The major change is that one input to this MUX is now the output of the MUX controlled by ForwardB.

The load and store instructions
These will work fine for loads and stores following R-type instructions. There is a problem, however, for a store following a load:

lw $2, 100($3)
sw $2, 400($3)

[Figure: the lw and sw in the pipeline; the value loaded by the lw is needed by the sw in its MEM stage]

Note that this situation can also be resolved by forwarding. It would require another forwarding controller in the MEM stage.

340 There is a situation which cannot be handled by forwarding, however. Consider a load followed by an R-type operation:

lw  $2, 100($3)
add $4, $3, $2

[Figure: the lw and add in the pipeline; the data from the load is not available until the end of the lw MEM stage, after the add has already needed it in its EX stage]

Here, the data from the load is not ready when the R-type instruction requires it — we have a hazard. What can be done here?

[Figure: the same pair of instructions with a one-cycle stall between them; forwarding can then supply the loaded value to the add]

With a “stall”, forwarding is now possible. It is possible to accomplish this with a nop, generated by a compiler. Another option is to build a “hazard detection unit” in the control hardware to detect this situation.

341 The condition under which the “hazard detection circuit” is required to insert a pipeline stall is when an operation requiring the ALU follows a load instruction, and one of its operands comes from the register to be written by the load. The condition for this is simply:

if (ID/EX.MemRead and
    ((ID/EX.RegisterRt = IF/ID.RegisterRs) or (ID/EX.RegisterRt = IF/ID.RegisterRt))) then STALL

342 Forwarding with branches

For the beq instruction, if the comparison is done in the ALU, the forwarding already implemented is sufficient.
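Putting the forwarding and load-use stall conditions above together, here is a minimal software sketch (not part of the course materials) of the forwarding unit and hazard detection unit, written as plain Python functions. The pipeline registers are modelled as dictionaries whose keys mirror the signal names used in the notes (RegWrite, RegisterRd, RegisterRs, RegisterRt, MemRead); the example register numbers are arbitrary.

def forward_controls(id_ex, ex_mem, mem_wb):
    """Return (ForwardA, ForwardB) as the 2-bit codes from the table above."""
    forward_a, forward_b = "00", "00"

    # EX hazard: forward the previous ALU result held in EX/MEM.
    if ex_mem["RegWrite"] and ex_mem["RegisterRd"] != 0:
        if ex_mem["RegisterRd"] == id_ex["RegisterRs"]:
            forward_a = "10"
        if ex_mem["RegisterRd"] == id_ex["RegisterRt"]:
            forward_b = "10"

    # MEM hazard: forward from MEM/WB only when EX/MEM does not already
    # supply the same register, so the most recent value is used.
    if mem_wb["RegWrite"] and mem_wb["RegisterRd"] != 0:
        if (ex_mem["RegisterRd"] != id_ex["RegisterRs"] and
                mem_wb["RegisterRd"] == id_ex["RegisterRs"]):
            forward_a = "01"
        if (ex_mem["RegisterRd"] != id_ex["RegisterRt"] and
                mem_wb["RegisterRd"] == id_ex["RegisterRt"]):
            forward_b = "01"

    return forward_a, forward_b

def load_use_stall(id_ex, if_id):
    """True when the instruction in ID must stall behind a load in EX."""
    return (id_ex["MemRead"] and
            id_ex["RegisterRt"] in (if_id["RegisterRs"], if_id["RegisterRt"]))

# Example: add $2,$1,$3 is now in EX/MEM, sub $5,$2,$3 is in EX.
id_ex  = {"RegisterRs": 2, "RegisterRt": 3}
ex_mem = {"RegWrite": True, "RegisterRd": 2}
mem_wb = {"RegWrite": False, "RegisterRd": 0}
print(forward_controls(id_ex, ex_mem, mem_wb))     # ('10', '00')

# Example: lw $2,100($3) is in EX, add $4,$3,$2 is in ID.
print(load_use_stall({"MemRead": True, "RegisterRt": 2},
                     {"RegisterRs": 3, "RegisterRt": 2}))   # True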
[Figure: for the case just described, add $2,$3,$4 followed by beq $2,$3,25 in the pipeline; with the comparison done in the ALU, the existing forwarding supplies the value of $2]

In the MIPS processor, however, the branch instructions were implemented to require only two cycles. The instruction following the branch was always executed. (The compiler attempted to place a useful instruction in this “branch delay slot”, but if it could not, a nop was placed there.)

The original MIPS did not have forwarding, but it is useful to consider the kinds of hazards which could arise with this instruction. Consider the sequence:

add $2, $3, $4
beq $2, $5, 25

[Figure: the add and beq in adjacent pipeline slots]

Here, if the conditional test is done in the ID stage, there is a hazard which cannot be resolved by forwarding.

343 In order to correctly implement this instruction in a processor with forwarding, both forwarding and hazard detection must be employed. The forwarding must be similar to that for the ALU instructions, and the hazard detection similar to that for the load/ALU type instructions.

Presently, most processors do not use a “branch delay slot” for branch instructions, but use branch prediction. Typically, there is a small amount of memory contained in the processor which records information about the last few branch decisions for each branch. In fact, individual branches are not identified directly in this memory; the low order address bits of the branch instruction are used as an identifier for the branch. This means that sometimes several branches will be indistinguishable in the branch prediction unit. (The frequency of this occurrence depends on the size of the memory used for branch prediction.) We will discuss branch prediction in more depth later.

344 Exceptions and interrupts

Exceptions are a kind of control hazard. Consider the overflow exception discussed previously for the multicycle implementation. In the pipelined implementation, the exception will not be identified until the ALU performs the arithmetic operation, in stage 3.

The operations in the pipeline following the instruction causing the exception must be flushed. As discussed earlier, this can be done by setting the control signals (now in pipeline registers) to 0. The instruction in the IF stage can be turned into a nop. The control signals ID.Flush and EX.Flush control the MUX’s which zero the control lines.

The PC must be loaded with the memory address at which the exception handler resides (some fixed memory location). This can be done by adding another input to the PC MUX. The address of the instruction causing the exception must then be saved in the EPC register. (Actually, the value PC + 4 is saved.)

Note that the instruction causing the exception cannot be allowed to complete, or it may overwrite the register value which caused the overflow. Consider the following instruction:

add $1, $1, $2

The value in register 1 would be overwritten if the instruction finished.

345 The datapath, with exception handling for overflow:

[Figure: the pipelined datapath extended with IF.Flush, ID.Flush, and EX.Flush signals, a Cause register, an EPC register, and an extra input on the PC MUX for the exception handler address]

346 Interrupts can be handled in a way similar to that for exceptions. Here, though, the instruction presently being completed may be allowed to finish, and the pipeline flushed. (Another possibility is to simply allow all instructions presently in the pipeline to complete, but this will increase the interrupt latency.)
The value of PC + 4 is stored in the EPC, and this will be the return address from the interrupt, as discussed earlier.

Note that the effect of an interrupt on every instruction will have to be carefully considered — what happens if an interrupt occurs near a branch instruction?

347 Superscalar and superpipelined processors

Most modern processors have longer pipelines (superpipelined) and two or more pipelines (superscalar) with instructions sent to each pipeline simultaneously.

In a superpipelined processor, the clock speed of the pipeline can be increased, while the computation done in each stage is decreased. In this case, there is more opportunity for data hazards and control hazards. In the Pentium IV processor, pipelines are 20 stages long.

In a superscalar machine, there may be hazards among the separate pipelines, and forwarding can become quite complex. Typically, there are different pipelines for different instruction types, so two arbitrary instructions cannot be issued at the same time. Optimizing compilers try to generate instructions that can be issued simultaneously, in order to keep such pipelines full. In the Pentium IV processor, there are six independent pipelines, most of which handle different instruction types. In each cycle, an instruction can be issued for each pipeline, if there is an instruction of the appropriate type available.

348 Dynamic pipeline scheduling

Many processors today use dynamic pipeline scheduling to find instructions which can be executed while waiting for pipeline stalls to be resolved. The basic model is a set of independent state machines performing instruction execution: one unit fetching and decoding instructions (possibly several at a time), several functional units performing the operations (these may be simple pipelines), and a commit unit which writes results to registers and memory in program execution order. Generally, the commit unit also “kills off” results obtained from branch prediction misses and other speculative computation.

In the Pentium IV processor, up to six instructions can be issued in each clock cycle, while four instructions can be retired in each cycle. (This clearly shows that the designers anticipated that many of the instructions issued — on average 1/3 of them — would be aborted.)

349 [Figure: block diagram of dynamic pipeline scheduling; an instruction fetch and decode unit issues instructions in order to a set of reservation stations, which feed integer, floating point, and load/store functional units executing out of order, and a commit unit retires results in order]

Dynamic pipeline scheduling is used in the three most popular processors in machines today: the Pentium II, III, and IV, the AMD Athlon, and the Power PC.

350-351 [Figure: a generic view of the Pentium P-X and the Power PC pipeline; instruction cache, branch prediction, instruction queue, register file, decode/dispatch unit, reservation stations for branch, integer, complex integer, floating point, load, and store operations, the functional units, a reorder buffer, and a commit unit]

Speculative execution

One of the more important ways in which modern processors keep their pipelines full is by executing instructions “out of order” and hoping that the dynamic data required will be available, or that the execution thread will continue.
Two cases where speculative execution is common are the “store before load” case and branches. In the first case, a load that follows a store is executed speculatively because, normally, the element being loaded does not depend on the element being stored. In the second case, both threads following the branch may be executed before the branch decision is taken, but only the thread on the successful path is committed.

Note that the type of speculation in each case is different — in the first, the decision may be incorrect; in the second, one thread will be incorrect.

352
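As a concrete follow-up to the branch prediction scheme described earlier (a small memory, indexed by the low-order bits of the branch instruction's address, recording the last few decisions for each branch), here is a minimal sketch. It is not part of the course materials; the table size and branch address are hypothetical, and it uses the common 2-bit saturating counter scheme as one way of "recording the last few decisions".

TABLE_SIZE = 16          # hypothetical; real prediction tables are much larger

class BranchPredictor:
    def __init__(self):
        # 2-bit saturating counters: 0 and 1 predict not taken, 2 and 3 predict taken.
        self.counters = [1] * TABLE_SIZE

    def _index(self, branch_address):
        # Low-order bits of the word-aligned branch address select the entry,
        # so different branches can alias to the same counter.
        return (branch_address >> 2) % TABLE_SIZE

    def predict(self, branch_address):
        return self.counters[self._index(branch_address)] >= 2

    def update(self, branch_address, taken):
        i = self._index(branch_address)
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)

# A loop branch (hypothetical address 0x00400120) taken 9 times, then not taken once:
bp = BranchPredictor()
mispredictions = 0
for taken in [True] * 9 + [False]:
    if bp.predict(0x00400120) != taken:
        mispredictions += 1
    bp.update(0x00400120, taken)
print(mispredictions)    # 2: only the first iteration and the loop exit are mispredicted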