Module 3: Central Processing Unit and Memory Design

Commentary

Topics

I. The Basic Little Man Computer
II. Organization of the CPU and Memory
III. Instruction Set Architecture

I. The Basic Little Man Computer

A. Basic Computer Organization

The modern-day computer is based on an architecture defined in 1951 by John von Neumann at the Institute for Advanced Study in Princeton, New Jersey. Von Neumann's design was based on three concepts:

- memory contains both programs and data (the stored-program concept)
- the contents of memory are addressable by location, without regard to the type of data contained therein
- execution of instructions occurs in a sequential fashion unless that order is explicitly modified

The hardware of the computer is usually divided into three major components:

1. The central processing unit (CPU) contains an arithmetic and logic unit for manipulating data, a number of registers for storing data, and control circuits for fetching and executing instructions.
2. The memory subsystem of a computer contains storage for instructions and data. The CPU can access any location in memory at random and read and write data within a fixed interval of time.
3. The input/output (I/O) subsystem contains electronic circuits for communicating and controlling the transfer of information between the computer and the outside world. The I/O devices connected to the computer may include keyboards, printers, terminals, magnetic disk drives, and communication devices.

These major components are interconnected by a set of wires, called a bus, which carries information relating to addresses in memory, data, and control signals. We will discuss each of the major components in section II of this module.

The instruction set architecture, which we will discuss in section III of this module, describes the instructions the CPU can process. As part of instruction processing, the CPU uses a fetch-execute cycle that it repeats until it encounters and executes a "halt" instruction. This cycle consists of the following steps:

1. Fetch the next instruction from memory.
2. Decode the instruction.
3. Resolve memory addresses and determine the location of any operands.
4. Execute the instruction.
5. Go back to step 1.

We will use this cycle to help explain the operation of the Little Man Computer in the next section.

B. Little Man Computer Layout

A model often used to help explain the operation of a computer is the Little Man Computer (LMC), developed by Dr. Stuart Madnick at the Massachusetts Institute of Technology (MIT) in 1965, and revised in 1979. The LMC model looks at the operations of the computer from a decimal viewpoint, using easy-to-understand analogies. We will use the LMC at several key points throughout this module.

The general model of the LMC is that of a little circuit-board man inside a walled mailroom. See figure 3.1 below for the LMC model that we will use in this course. It has:

- an In Basket
- an Out Basket (the in and out baskets are the Little Man's only means of communication with the outside world)
- a simple Calculator with a display consisting of three decimal digits and a sign (+ or -). The display can show any number in the range of -999 to +999.
- an instruction location counter, or Location Counter
- an Instruction Display for the instruction being executed
- one hundred mailboxes, each uniquely identified by a two-digit address in the range of 00 to 99. Each mailbox can contain a three-digit number, that is, a number in the range of 000 to 999.
  There is no provision for storing a sign (+ or -) in a mailbox, so (unlike the calculator) a mailbox cannot contain a negative number.
- a comment box that explains what is happening during the current step
- navigation buttons that take the user, at the appropriate time, to the fetch and execute steps and demonstrate those steps (these buttons are not shown in the model in figure 3.1)

Figure 3.1 shows the components of the Little Man Computer.

Figure 3.1 Little Man Computer

Each mailbox can store three digits, which can represent either data or an instruction. When they represent an instruction, the three digits are broken into two parts:

- the first digit tells the LMC what to do; it is called an operation code, or op code
- the next two digits indicate the mailbox to be addressed (when required)

C. LMC Operation

To show how the LMC works, let's look at a simple program for subtraction, using the basic instruction set in table 3.1 below and the program in the step diagram of figure 3.2. In this table, "XX" in the Address column indicates a memory address.

Table 3.1 LMC Instructions

Instruction           Mnemonic    Op Code  Address  What the Little Man Does
Coffee Break (Halt)   COB or HLT  0        00       Self-explanatory. Do nothing.
Add                   ADD         1        XX       Add the contents of mailbox XX to the calculator. Add 1 to the counter.
Subtract              SUB         2        XX       Subtract the contents of mailbox XX from the calculator. Add 1 to the counter.
Store                 STO         3        XX       Copy the contents of the calculator into mailbox XX. The sign indication is discarded. Add 1 to the counter.
Load                  LDA         5        XX       Copy the contents of mailbox XX into the calculator. The sign in the calculator is set to +. Add 1 to the counter.
Input                 IN          9        01       Move the contents of the inbox to the calculator. The sign in the calculator is set to +. Add 1 to the counter. (If there is nothing in the inbox, just wait until an input is provided. Moving the inbox contents exposes the next input if there is one, or leaves the inbox empty.)
Output                OUT         9        02       Copy the contents of the calculator to the outbox. The sign indication is discarded. Add 1 to the counter.
Branch Unconditional  BR          6        XX       Copy XX into the counter.
Branch on Zero        BRZ         7        XX       If the calculator contains zero, copy XX into the counter; otherwise, add 1 to the counter. (The sign in the calculator is ignored.)
Branch on Positive    BRP         8        XX       If the value in the calculator is positive, copy XX into the counter; otherwise, add 1 to the counter. (Note that zero is considered to be a positive value.)

Now let's look at a simple program for subtraction using the LMC. The interactive, animated diagram below shows how the LMC calculates the positive difference between 123 and 456. Our diagram has:

- Fetch and Execute buttons to step through the instruction cycle
- Show Me buttons that demonstrate each fetch and execute step
- comments explaining what is happening during the current step

You can repeat the animation at any step by clicking on the Back button, then clicking again on the Show Me button. The Next button bypasses the animation and goes directly to the next step.

Figure 3.2 Finding the Positive Difference between 123 and 456

Note that program instructions 06 and 07 were skipped because the condition of the branching instruction was met. If, for example, the numbers had been 123 and 012, the branching condition would not have been met, and the program would have continued through program instructions 06 and 07, where the order of subtraction would have been reversed so that a positive answer would still be obtained.
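Because the LMC's behavior is completely specified by table 3.1, it is easy to mimic in software. Below is a minimal Python sketch of such a simulator, offered only to make the instruction set concrete. The mailbox assignments in the sample program are an illustrative layout for the positive-difference problem; they are not necessarily the exact assignments used in figure 3.2.

    # A minimal Little Man Computer simulator covering the instruction
    # set of table 3.1. The sample program (positive difference of two
    # inputs) uses an illustrative mailbox layout.

    def run_lmc(program, inbox):
        mem = program + [0] * (100 - len(program))  # 100 mailboxes, 000-999 each
        calc, pc, outbox = 0, 0, []                 # calculator, location counter
        while True:
            instruction = mem[pc]                   # fetch from mailbox pc
            pc += 1                                 # counter normally advances by 1
            op, addr = divmod(instruction, 100)     # decode: op code digit + address
            if op == 0:                             # COB/HLT: coffee break
                return outbox
            elif op == 1:                           # ADD
                calc += mem[addr]
            elif op == 2:                           # SUB
                calc -= mem[addr]
            elif op == 3:                           # STO (sign discarded in mailbox)
                mem[addr] = abs(calc) % 1000
            elif op == 5:                           # LDA (calculator sign set to +)
                calc = mem[addr]
            elif op == 6:                           # BR: copy XX into the counter
                pc = addr
            elif op == 7 and calc == 0:             # BRZ
                pc = addr
            elif op == 8 and calc >= 0:             # BRP (zero counts as positive)
                pc = addr
            elif op == 9 and addr == 1:             # IN (901)
                calc = inbox.pop(0)
            elif op == 9 and addr == 2:             # OUT (902), sign discarded
                outbox.append(abs(calc) % 1000)

    # 00 IN, 01 STO 90, 02 IN, 03 STO 91, 04 SUB 90, 05 BRP 08,
    # 06 LDA 90, 07 SUB 91, 08 OUT, 09 COB
    program = [901, 390, 901, 391, 290, 808, 590, 291, 902, 0]
    print(run_lmc(program, [123, 456]))   # -> [333]

With inputs 123 and 456 the BRP at mailbox 05 succeeds and mailboxes 06 and 07 are skipped, exactly as described above; with inputs 123 and 012 the branch fails and the subtraction is redone in the opposite order.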
II. Organization of the CPU and Memory

A. CPU Organization

The CPU is made up of three major parts:

1. The registers store intermediate data used during the execution of the instructions.
2. The arithmetic logic unit (ALU) performs the required operations for executing the instructions.
3. The control unit supervises the transfer of information among the registers and instructs the ALU as to which operation to perform.

Registers

In module 2, we saw that registers are composed of flip-flops, one for each bit in the register. The CPU uses registers as a place to temporarily store data, instructions, and other information. Different CPU designs have different numbers and types of registers. The following four types of registers, however, are found in all designs:

- PC, the program counter. The PC tells the CPU which memory location contains the next instruction to be executed. Typically, the PC is incremented by 1 so that it points to the next instruction in memory, but branch instructions can set the PC to another value. In the LMC, the program counter is represented as the Location Counter.
- IR, the instruction register. This register is important because it holds the instruction that is currently being executed. In the LMC, the instruction register is shown as the Instruction Display. To visualize the LMC operation, we can picture the Little Man holding the instruction written on a slip of paper he has taken from a mailbox, and reading what needs to be done.
- MAR, the memory address register. This register tells the CPU which location in memory is to be accessed. The MAR is not shown in the LMC.
- MDR, the memory data register. This register holds the data being put into or taken out of the memory location identified by the MAR. It is sometimes called the memory buffer register, or MBR. The MDR is not shown in the LMC. To visualize the LMC operation, we can picture the Little Man holding the data written on a slip of paper that is either being put into or taken out of a mailbox.

Other registers that may be included in a particular CPU design are:

- accumulators, or general-purpose registers, which process data in the CPU. Normally there are multiple accumulators in the CPU, and the number of accumulators is usually a power of 2 (e.g., 2, 4, 8, or 16) in order to optimize the addressing scheme. In the LMC, the accumulator is represented as the Calculator.
- the program status register. This register stores a series of one-bit pieces of information, called flags, that keep track of special conditions. The LMC does not have a program status register.
- IOR, the input and output interface registers, which are used to pass information to and from the peripheral devices. In the LMC, the input and output interface registers are represented as the In Basket and the Out Basket.
- the SP, the stack pointer, which points to the top of the stack. We will discuss the operation of a stack in section III of this module. The LMC does not have a stack pointer.

Here are some of the operations that can be performed on registers:

- reading: When data are read from a register, the data are copied to a new destination without changing the value in the register.
- writing: When data are written into a register, the new data overwrite and destroy the old data.
- arithmetic operations: When data are added or subtracted, the new results are stored in the register, destroying the old data.
- shifts and rotations: Registers can be shifted or rotated to the right or to the left. We will discuss these operations in section III of this module.
Arithmetic Logic Unit

We discussed the operation of a simple one-bit, four-operation ALU in module 2. A complete ALU handles all of the bits in a word and can perform a wide variety of operations. Signals from the control unit select the operation to be performed.

Control Unit and the Fetch-Execute Instruction Cycle

The control unit controls the execution (including the order of execution) of instructions through the use of the fetch-execute instruction cycle. The instruction to be executed is determined by reading the contents of the program counter (PC). Execution of each instruction includes setting a new value into the PC, thereby determining which instruction will be executed next. When an instruction other than a branching instruction is executed, the PC is incremented by 1. When a branching instruction is executed, the new value of the PC depends on the result of the test (if any) and the address field of the instruction. A diagram of the fetch-execute cycle is shown in figure 3.3 below.

In the fetch phase, the instruction is decoded using a decoder similar to the one shown in figure 2.7 (module 2, section I D). The inputs A2, A1, and A0 in figure 2.7 are controlled by the control unit. The output of the decoder is a "1" on one of the op code lines; that line causes the selected type of instruction to execute.

Figure 3.3 Fetch-Execute Cycle

We can use the LMC to demonstrate how the fetch-execute instruction cycle works. The LMC instructions listed in section I are composed of a series of data transfers and operations. To discuss these data manipulations, we use the following notations:

- A → B indicates that the contents of the A register are copied to the B register.
- IR[address] → MAR indicates that the address portion of the IR register is copied into the MAR.
- A + B → A indicates that the contents of A are replaced by the sum of A and B.

In order for a program to run on the LMC, each instruction must first be fetched and then executed. Thus, the instruction cycle is divided into two phases:

1. Fetch. The Little Man looks at the program counter (labeled Location Counter) to find where to go in the memory for the next instruction. Then, the Little Man goes to that location to fetch the instruction to be executed.
2. Execute. The Little Man executes the fetched instruction. Based on the instruction being executed, the Little Man puts the location of the next instruction into the program counter.

The fetch cycle, which we saw in figure 3.2 in section I, is achieved with the following register-transfer steps:

1. PC → MAR. The value of the program counter is loaded into the MAR. Remember, the program counter holds the location of the next instruction to be executed.
2. Mem[MAR] → MDR. The contents of memory location MAR are transferred to the MDR.
3. MDR → IR. The value in the MDR is copied into the IR.
4. IR[op code] → decoder. The op code portion of the instruction that is in the IR is transferred to the decoder. The decoder selects the instruction to be executed.

Once the instruction is fetched, the execution phase can start. The execution phase of the STORE instruction is given below:

1. IR[address] → MAR. The address portion of the IR register identifies the address in memory where data will be stored.
2. A → MDR. The data in A are put into the MDR.
3. MDR → Mem[MAR]. The data in the MDR are written into memory.
4. PC + 1 → PC. The program counter is incremented to point to the next instruction.

Every instruction will start with the same fetch cycle.
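The register-transfer steps above can be made concrete with a short sketch. The Python fragment below is an illustration of our own, not part of the LMC materials: it models the registers as named slots in a dictionary, and each statement mirrors one transfer, using the LMC's decimal instruction format (op code digit followed by a two-digit address).

    # Register-transfer sketch for fetching and executing one STORE
    # instruction. Each line mirrors one transfer from the text.

    def fetch(reg, mem):
        reg["MAR"] = reg["PC"]            # 1. PC -> MAR
        reg["MDR"] = mem[reg["MAR"]]      # 2. Mem[MAR] -> MDR
        reg["IR"] = reg["MDR"]            # 3. MDR -> IR
        return reg["IR"] // 100           # 4. IR[op code] -> decoder

    def execute_store(reg, mem):
        reg["MAR"] = reg["IR"] % 100      # 1. IR[address] -> MAR
        reg["MDR"] = reg["A"]             # 2. A -> MDR
        mem[reg["MAR"]] = reg["MDR"]      # 3. MDR -> Mem[MAR]
        reg["PC"] += 1                    # 4. PC + 1 -> PC

    # Store the accumulator value 42 using the instruction 399 (STO 99).
    registers = {"PC": 0, "MAR": 0, "MDR": 0, "IR": 0, "A": 42}
    memory = [399] + [0] * 99
    opcode = fetch(registers, memory)     # decoder would select STORE (op code 3)
    execute_store(registers, memory)
    print(memory[99], registers["PC"])    # -> 42 1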
The complete cycles of register transfers for two additional instructions, ADD and BR, are shown in the tables below:

Table 3.2 Register Transfers for ADD

Fetch                              Execute
Step 1  PC → MAR                   Step 5  IR[address] → MAR
Step 2  Mem[MAR] → MDR             Step 6  Mem[MAR] → MDR
Step 3  MDR → IR                   Step 7  A + MDR → A
Step 4  IR[op code] → decoder      Step 8  PC + 1 → PC

Table 3.3 Register Transfers for a BR Instruction

Fetch                              Execute
Step 1  PC → MAR                   Step 5  IR[address] → PC
Step 2  Mem[MAR] → MDR
Step 3  MDR → IR
Step 4  IR[op code] → decoder

B. Memory Organization

The memory of a computer consists of a large number of 1-bit storage devices organized into words. The number of bits in a word varies with the computer design. In modern computers, the memory word size is a power of two (8, 16, 32, or 64), but earlier computer designs contained memories with words of 12, 18, 24, 36, 48, or 60 bits. Thus, the structure of a memory can be described by specifying the number of bits in a word and the number of words.

Figure 3.4 shows the organization of a 32 x 16 memory: the number of words is 32, and the word length is 16. For each bit position in the word, a data line connects the memory to the memory data register in the CPU. The data line for a bit position carries data to or from one of the bits in the corresponding bit position. To specify a particular word in the memory, the address of the word is given in binary. In the example shown in the figure, the words are numbered 0 to 31, which is 00000 to 11111 in binary. Five lines, called address lines, carry the address from the CPU's memory address register to the memory.

Figure 3.4 Organization of a 32 Word x 16 Bits/Word Memory

Each word of this memory (shown by a row) contains 16 bits (two bytes). There are 16 data lines connecting the memory to the CPU. The outputs of the address decoder determine which of the 32 words is connected to the data lines. The input to the address decoder has 5 address lines because 2^5 = 32.

In general:

- If a memory contains words that are n bits long, then n data lines are required.
- If a memory contains 2^k words, then k address lines are required.

The following terminology is commonly used to refer to the sizes of computer memories:

- kilobyte (K) = 2^10 = 1,024 bytes
- megabyte (M) = 2^20 = 1,048,576 bytes
- gigabyte (G) = 2^30 = 1,073,741,824 bytes

We can use our understanding of decoders and addressing to explain how the MAR and MDR work together to access a specific address in memory.
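The two rules above, together with the sizing arithmetic worked out for the RAM and ROM chips in section II C below, can be checked with a few lines of code. The sketch below is illustrative; memory_lines is a helper of our own, not a standard function, and the sizes follow the binary convention used in this module (1K = 2^10).

    import math

    # Address- and data-line counts for a memory: an n-bit word needs
    # n data lines, and 2^k words need k address lines.

    def memory_lines(words, bits_per_word):
        address_lines = math.ceil(math.log2(words))
        data_lines = bits_per_word
        total_bytes = words * bits_per_word // 8
        return address_lines, data_lines, total_bytes

    print(memory_lines(32, 16))          # figure 3.4: (5, 16, 64)
    print(memory_lines(2 * 2**10, 16))   # the 2K x 16 RAM chip below: (11, 16, 4096)
    print(memory_lines(4 * 2**10, 8))    # the 4K x 8 ROM chip below: (12, 8, 4096)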
Little Endian, Big Endian

There is a convention that the bits in a word of any size are numbered from right to left; that is, bit 0 is the least significant bit. So a byte looks like this:

bit 7 | bit 6 | bit 5 | bit 4 | bit 3 | bit 2 | bit 1 | bit 0

When data consist of more than one byte, the CPU must determine the order in which the bytes will be stored in memory and must preserve the relative order of the bytes, so that information transmitted from one computer will be reassembled in the correct order by the receiving computer. The two possibilities are:

- big endian, where byte 0 of multi-byte data is on the left (or MSB) side of the word (see figure 3.5 for the storage of two 4-byte words)
- little endian, where byte 0 of multi-byte data is on the right (or LSB) side of the word (see figure 3.6 for the storage of two 4-byte words)

Figure 3.5 Big Endian Memory

         bit 31 (MSB)                     bit 0 (LSB)
Word 0:  | byte 0 | byte 1 | byte 2 | byte 3 |
Word 1:  | byte 4 | byte 5 | byte 6 | byte 7 |

In this view, the most significant bit of the 32-bit word 0 is bit 7 of byte 0, while the least significant bit of the 32-bit word 0 is bit 0 of byte 3.

Figure 3.6 Little Endian Memory

         bit 31 (MSB)                     bit 0 (LSB)
Word 0:  | byte 3 | byte 2 | byte 1 | byte 0 |
Word 1:  | byte 7 | byte 6 | byte 5 | byte 4 |

In this view, the most significant bit of the 32-bit word 0 is bit 7 of byte 3, while the least significant bit of the 32-bit word 0 is bit 0 of byte 0.

As an example of big endian versus little endian storage, consider the 32-bit hexadecimal word 89ABCDEF stored in memory starting at location 1000, as shown in table 3.4.

Table 3.4 Big Endian versus Little Endian Storage (all numbers are in hexadecimal)

Memory Location   Big Endian Value   Little Endian Value
1000              89                 EF
1001              AB                 CD
1002              CD                 AB
1003              EF                 89

The choice of endian system does not affect the performance of the CPU and memory, and both systems are in use today: Motorola and Sun processors use big endian, whereas Intel processors use little endian. The choice of endian does matter in data communications, because TCP and IP packets have 16- and 32-bit numeric fields (for example, the IP address), and the protocols specify that these be transmitted in big endian order.
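Python's struct module makes the two byte orders easy to inspect. The short sketch below reproduces table 3.4; the ">" and "<" format prefixes select big endian and little endian packing, respectively.

    import struct

    # The 32-bit value 0x89ABCDEF laid out in memory in both byte orders,
    # as in table 3.4.

    word = 0x89ABCDEF
    big = struct.pack(">I", word)      # b'\x89\xab\xcd\xef'
    little = struct.pack("<I", word)   # b'\xef\xcd\xab\x89'

    for offset in range(4):
        print(f"location {1000 + offset}: "
              f"big endian {big[offset]:02X}, little endian {little[offset]:02X}")
    # location 1000: big endian 89, little endian EF
    # location 1001: big endian AB, little endian CD ... and so on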
C. Types of Memory

The address space of a computer can contain both random access memory (RAM) and read-only memory (ROM). Below, we discuss both types of memory.

Random Access Memory (RAM)

Random access memory (RAM) refers to memory cells that can be accessed for information transfer from any desired random location. That is, no matter where the cells are located physically in memory, the process of locating a word in memory is the same and requires the same amount of time.

RAM Characteristics

The key characteristic of RAM is that data can be written into and read from any memory word without affecting any other word. When you first turn on your computer, the operating system is loaded into RAM so it can manage all the processes that your computer will perform. If you decide to do word processing, your operating system will load the word-processing application from your hard drive into RAM so that you can begin your word-processing tasks. As you use your application to create or edit a document, the operating system manages the program in execution. It will also retrieve documents from your hard drive or external disk (known as a read operation), or store a document on your hard drive or external disk (known as a write operation), when you so request.

Most RAM used in computers today is semiconductor dynamic RAM. Semiconductor dynamic RAM has one very significant weakness: if the power to the computer is interrupted, everything in RAM is lost and cannot be recovered. Much to the frustration of the unprepared user, if a document exists only in RAM and power is lost, the document no longer exists.

RAM Size

The two main factors that determine the size of RAM are the number of words that it stores and the length of those words. The size is usually stated as the number of words times the number of bits per word. Thus, the size of a memory that stores 2,048 (2K) words, where each word is 16 bits long, is 2K words x 16 bits/word.

For the RAM above, we must answer the following questions: How many address lines and data lines are required for a memory that stores 2K words where each word is 16 bits? How many bytes can this memory store? The calculations below provide the answers:

- 2K can be represented by 2^11, because 2K = 2 x K = 2^1 x 2^10 = 2^(1+10) = 2^11. Therefore, 11 address lines are required.
- 16 data lines are required for data input/output, one for each bit in the 16-bit data word.
- The number of bytes that can be stored is (2K words x 16 bits per word) / (8 bits/byte) = 4K bytes.

RAM Connections

The following connections are required:

- Read: an input control signal that tells the RAM when a READ is to be performed
- Write: an input control signal that tells the RAM when a WRITE is to be performed
- Select: an input that selects a particular RAM chip from the other RAM chips in memory

A block diagram of a typical 2K x 16 RAM chip is shown below:

Figure 3.7 Block Diagram of a 2K x 16 RAM Chip

Read-Only Memory (ROM)

Computers have another type of memory that almost never changes, called read-only memory (ROM). Changing ROM is beyond the capability of the average user. Permanently stored in the computer, ROM provides many of the lower-level instructions used by the operating system to interact with the components of the computer. ROM is specific to the processor in each computer and is independent of the operating system. When the computer is turned off or accidentally shut down, there is no loss of ROM. Because ROM is more expensive than RAM, computer manufacturers generally use the smallest capacity ROM possible.

As with RAM, the two main factors that determine the size of ROM are the number of words that it stores and the length of those words. ROM differs from RAM in the following ways:

- There is no user data input, because the contents of ROM are permanently stored.
- There is no need for read/write selects, because it is only possible to read from ROM.

The calculations for a ROM that stores 4K words, where each word has 8 bits, are shown below:

- 4K can be represented by 2^12 (4K = 4 x K = 2^2 x 2^10 = 2^12). Therefore, 12 address lines are required.
- Eight data lines are required for outputting data, one for each bit in the 8-bit data word.
- The number of bytes that can be stored is (4K words x 8 bits per word) / (8 bits/byte) = 4K bytes.

The following connection is required:

- Select: an input that selects a particular ROM chip from the other ROM chips in memory.

A block diagram of the 4K x 8 ROM chip is shown below:

Figure 3.8 Block Diagram of a 4K x 8 ROM Chip

The following types of ROMs are available:

- custom ROMs ("simple" ROMs): The factory programs the ROMs.
- PROMs (programmable ROMs): The customer purchases a "raw" ROM and (carefully!) does the programming. A PROM may be programmed only once. PROMs and ROMs are robust and fast.
- EPROMs (erasable PROMs): EPROMs give more flexibility in programming, at the expense of speed and robustness. Such ROMs are often called read-mostly ROMs, to signify that they are mostly going to be read, but can also be written occasionally. Writing should not be done too often, because it is slow and laborious. An EPROM may have its entire programming wiped clean by exposure to UV light for about 20 minutes.
We can also get electrically erasable PROMs (EEPROMs), where any individual PROM location may be reprogrammed in place using just electrical signals. The erasure operation takes much longer (by several microseconds) than a standard read. Finally, we have flash memory, which provides electrical erasure, but only of groups of words, not of individual words.

D. Input/Output (I/O)

The instructions we used for I/O in the LMC were IN and OUT. Each time we called one of those instructions, we could input or output one three-digit number, or one word. A similar instruction in a real computer will input or output one word. This type of I/O, called programmed I/O, is a relatively inefficient method of I/O. In this section, we will look at programmed I/O, and then at other, more efficient methods, such as programmed I/O with interrupts and direct memory access (DMA).

Programmed I/O

We start with a consideration of some of the I/O devices that might be connected to the CPU. They include:

- keyboard (input)
- mouse (input)
- voice (input and output)
- scanner (input)
- printers (output)
- graphics display (output)
- optical disk (input and output)
- magnetic tape (input and output)
- magnetic disk (input and output)
- modem (input and output)
- communications adaptor (wired or wireless)
- electronic instrumentation
- controls and indicators on an appliance, such as a microwave oven or a cell phone (when the processor is embedded in the appliance)

Some I/O devices operate at a low speed, such as keyboards or printers, and some operate at a high speed, such as magnetic or optical disks. Some transfer one character at a time, such as a mouse, and some transfer blocks of data at once, such as a printer or a graphics display. In general, multiple I/O devices are connected to the CPU through a set of I/O modules, as shown in the figure below. The CPU controls the process with an I/O address register similar to the MAR and an I/O data register similar to the MBR.

Figure 3.9 Use of Multiple I/O Modules

To handle I/O devices, computer systems must be able to:

- address different peripheral devices
- respond to I/O initiated by the peripheral devices
- handle block transfers of data between I/O and memory
- handle devices with different control requirements

The problem with the type of programmed I/O used by the LMC is that I/O can occur only when called for by the CPU; there is no way for the user or a peripheral to initiate commands to the CPU. One solution to this problem is polling, a technique in which the CPU uses programmed I/O to ask each I/O device in turn whether it has data to send, and to receive those data when they are sent (see the sketch below). The disadvantage of polling is its large overhead: each device must be polled frequently enough to ensure that data held by an I/O device awaiting transfer are not lost. If there are many devices to be polled, much of the CPU's time is wasted on polling instead of on other, more useful, processing. The use of interrupts, which we discuss next, is a better way of managing I/O requests.
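The polling loop just described can be sketched in a few lines of Python. The Device class below is hypothetical, standing in for the status and data registers a real I/O module would expose; the point of the sketch is that every device is interrogated on every pass, whether or not it has anything to transfer, and that repeated checking is exactly the overhead the text describes.

    # A sketch of programmed I/O with polling.

    class Device:
        def __init__(self, name, pending=None):
            self.name = name
            self.pending = pending or []   # data waiting to be sent to the CPU

        def has_data(self):                # the status check made by each poll
            return bool(self.pending)

        def read_word(self):
            return self.pending.pop(0)

    def poll_once(devices):
        for dev in devices:                # every device is polled every cycle,
            if dev.has_data():             # so CPU time is spent even on idle devices
                word = dev.read_word()
                print(f"{dev.name}: {word}")

    keyboard = Device("keyboard", pending=[ord("a")])
    mouse = Device("mouse")
    poll_once([keyboard, mouse])           # -> keyboard: 97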
Interrupts

Interrupts are signals sent on special control lines to the CPU. When an interrupt is received, the CPU stops its execution of the current program after completing the current instruction and jumps to a special interrupt-processing program. Before leaving the current program, however, the CPU stores all of its registers, either in a group of memory locations called a stack or in a table called the process control block (PCB). After the interrupt is serviced, the CPU restores the registers that had been stored in the stack or the PCB, and then returns to processing the original program. Figure 3.10 below shows the procedure for processing an interrupt. Successively clicking on the Step buttons will walk you through the interrupt process.

Figure 3.10 Servicing an Interrupt

An interrupt can be used in the following ways:

- as an external event notifier. The CPU is notified that an external event has occurred. For example, an interrupt is used to notify the system that there is keyboard input.
- as a completion signal. This type of interrupt can be used to control the flow of data to an output device. For example, because printers are slow, an interrupt can be used to let the CPU know that a printer is ready for more output.
- as a means to allocate CPU time. This type of interrupt is used in a multitasking system, where more than one program is running. Each program is allowed access to the CPU for a short period of time to execute some instructions; then another program is allowed to access the CPU. The interrupt is used to stop the current program from executing and to switch to a dispatcher program, which allocates a block of execution time to another program.
- as an abnormal event indicator. This type of interrupt is used to handle abnormal events that affect the operation of the computer system. Two examples of abnormal events are a power loss and an attempt to execute an illegal instruction, such as "divide by zero."

Interrupts have many sources. In fact, the standard I/O for today's PC has 15 interrupt lines, labeled IRQ1 through IRQ15 (where IRQ stands for Interrupt ReQuest). With 15 interrupt lines, it is possible (in fact, probable) that multiple interrupts will occur simultaneously from time to time. Thus, when an interrupt occurs, several questions must be answered before the interrupt is serviced:

- Are there other interrupts awaiting service, or is an interrupt currently being serviced?
- What is the relative priority of the interrupts?
- What is the source of each interrupt?

With multiple interrupts, there is a priority system in which the various interrupts are ranked by importance. For example, an abnormal-event interrupt caused by a power failure is more important than a completion interrupt telling the CPU that a printer is ready for more output. A sketch of this kind of prioritized servicing follows.
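The sketch below illustrates the save, prioritize, service, and restore sequence in miniature. The priority numbers and IRQ assignments are invented for the example; they are not the actual PC priority scheme.

    import heapq

    # Prioritized interrupt handling: pending requests are queued and the
    # most urgent (lowest priority number) is serviced first.

    pending = []   # entries are (priority, irq, description)

    def raise_interrupt(priority, irq, description):
        heapq.heappush(pending, (priority, irq, description))

    def service_interrupts(saved_registers):
        # By this point the CPU has finished its current instruction and
        # saved its registers (to a stack or a process control block).
        while pending:
            priority, irq, description = heapq.heappop(pending)
            print(f"servicing IRQ{irq} ({description})")
        return saved_registers   # registers restored; the original program resumes

    raise_interrupt(5, 7, "printer ready for more output")
    raise_interrupt(0, 1, "power failure")
    service_interrupts({"PC": 123, "A": 42})
    # -> servicing IRQ1 (power failure)      [the abnormal event wins]
    # -> servicing IRQ7 (printer ready for more output)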
Direct Memory Access (DMA)

With programmed I/O, the data are loaded directly into the CPU. It is more efficient, however, to move large blocks of data (including programs) directly to or from memory, rather than to move the data word by word into the CPU and then transfer them word by word into memory. For the LMC, it's like loading data directly through the rear of the mailboxes, bypassing the LMC I/O instruction procedures. Moving a block of data directly between memory and an I/O module is called direct memory access. Three primary conditions must be met for a DMA transfer:

- There must be a method to connect the I/O interface and memory.
- The I/O module involved must be capable of both reading and writing to memory without the participation of the CPU. In particular, the I/O module must have its own MAR and MBR.
- There must be a means to avoid conflict between the CPU and the I/O module when both are attempting to access memory.

The use of DMA is advantageous because:

- high-speed transfers are possible
- the CPU is available to perform other tasks during the I/O transfers

DMA transfers are possible in either direction; for example, CD music can be played on your PC while the computer is used for other tasks.

The procedure used by the CPU to initiate a DMA transfer requires four pieces of data to be provided to the I/O controller:

1. the location of the block of data on the I/O device
2. the starting location of the block of data in memory
3. the size of the block of data to be transferred
4. the direction of transfer: either a read (I/O → memory) or a write (memory → I/O)

Figure 3.11 below shows the DMA process during a data transfer. Successively clicking on the Step buttons will walk you through the DMA process.

Figure 3.11 The DMA Process

E. Buses

The CPU, memory, and I/O peripherals are interconnected in various ways. The basic components involved in the interface are:

- the CPU
- the I/O peripheral devices
- memory
- I/O modules
- the buses that connect the other components

The basic pathways for those interconnections are shown in figure 3.12.

Figure 3.12 Bus Connections for CPU-Memory-I/O Pathways

The two basic I/O system architectures are:

1. bus: one or more buses connect the CPU and memory to the I/O modules
2. channel: a separate I/O processor, a computer in its own right, performs complex input and output operations

Buses are characterized by:

- type (parallel or serial)
- configuration
- width (number of lines)
- speed
- use

Parallel buses have an individual line for each bit in the address, data, and control words. They run over short distances (up to a few feet) at high speeds, because all of the data bits for one word are transferred at the same time. As distance increases, however, it becomes more difficult to keep the bits of a word synchronized. Parallel buses are generally used for all internal buses, because speed is a critical factor in performance.

Serial buses transfer data sequentially, one bit at a time. They usually carry data over greater distances, on external buses, at a somewhat lower data rate. Generally speaking, serial buses have a single data pair and several control lines.

The characteristics of several common buses are listed below.

System buses:

- ISA (Industry Standard Architecture) is a parallel bus with a 16-bit data width and separate lines for addresses and data. It was used by the X86 and PowerPC families of computers, but is being phased out in favor of the PCI bus.
- PCI (Peripheral Component Interconnect) is a parallel bus with a 32- or 64-bit data width. The data and addresses share the same lines through multiplexing of the signals; that is, an address is sent and then the data follow on the same lines. This bus is replacing the ISA bus on the X86 and PowerPC families of computers.

External buses:

- SCSI (Small Computer System Interface) is a parallel bus that allows multiple I/O devices to be daisy-chained.
- USB (Universal Serial Bus) is a serial bus that is faster (up to 12 megabits per second of throughput) than the RS-232 bus. It has a four-wire cable, with two wires for address and data and two wires to carry power to the I/O device.
- IEEE 1394 (also known as FireWire) is a serial bus that can handle up to a 400 megabits-per-second data rate. It can either be daisy-chained or connected to hubs.

The signals in computer buses are usually divided into three categories:

1. data
2. addresses
3. control information, including interrupts
The parallel bus connecting a printer to a PC is one case where an address line is not required, because it is a point-to-point connection. When the printer is connected in a daisy chain with other devices (such as a fax machine, a scanner, and an external disk drive), however, the address must be added.

III. Instruction Set Architecture

A. Instruction Word Formats

Instructions are divided into two parts:

1. The op code, or operation code, tells the computer what operation to perform.
2. The operand (the address, in the LMC) provides the address of the source of the data to be operated on and the destination of the result of the operation.

In the Little Man Computer, the op code is one digit in length and the operand is two digits. In a real CPU, however, such a simple arrangement would not work.

Let's start with a discussion of the op code. The length of the op code is a function of the number of operations that the CPU can perform. An instruction set with a 4-bit op code allows up to 16 (2^4) different operations, with each operation requiring a unique op code. If 17 to 32 operations were required, a 5-bit op code (because 32 = 2^5) would be needed to provide a unique op code for each operation.

Next, let's look at the operands. Operands specify the location or address of data that are used in the instruction. Data instructions have one or two source operands and one destination operand. Operands may be either explicit or implicit: explicit operands are included in the statement of the instruction, whereas implicit operands are understood without being stated. Instructions may be written with zero to three operands. The number of bits required for each operand can vary, because some operands point to one of a limited set of CPU registers, whereas others point to memory locations. The former may be identified in 4 bits (for a CPU with 16 general-purpose registers), whereas the latter may require 32 bits or more (based on the size of memory).

The length of an instruction in a given computer can also vary. For instance, the IBM mainframes may have instructions that are two bytes (16 bits), four bytes, or six bytes long. On the other hand, the Sun SPARC RISC instructions are always four bytes long. The advantage of variable-length instructions is that they provide more flexibility to the system programmers. The advantage of fixed-length instructions is that the system design can be simplified, and it may be easier to optimize system performance because of the simpler design.

Zero-Operand Instructions

Zero-operand instructions have the form:

op code

These may be used for instructions that do not involve data, e.g., HLT. They may also be used as stack instructions. For example, the following instructions would put both A and B on the top of the stack and then multiply them:

PUSH A
PUSH B
MUL

We will discuss stack instructions further in the next section.

One-Operand Instructions

One-operand instructions have the form:

op code operand

where the accumulator and the operand contain the operands used in the calculation, and the result is placed in the accumulator. For example:

MUL A

would be written for the equation: accumulator = accumulator * A

Two-Operand Instructions

Two-operand instructions have the form:

op code operand 1 operand 2

where operand 1 and operand 2 are used in the calculation, and the result is placed in operand 1.
For example:

MUL A, B

would be written for the equation: A = A * B

Three-Operand Instructions

Three-operand instructions have the form:

op code operand 1 operand 2 operand 3

where operand 2 and operand 3 are used in the calculation, and the result is placed in operand 1. For example:

MUL A, B, C

would be written for the equation: A = B * C

B. Classes of Instructions

Instructions can be classified in the following ways:

- data-movement instructions. The LOAD and STORE instructions in the Little Man Computer are good examples of this class. Many real CPUs have additional forms of these basic operations. Another example of a data-movement instruction is MOVE.
- arithmetic instructions. The ADD and SUBtract instructions in the Little Man Computer are examples of this class. Most real CPUs also have instructions for floating-point and BCD arithmetic. Multiplication and division are often implemented in hardware, but can also be performed using addition, subtraction, and shifting instructions.
- Boolean-logic instructions. Boolean instructions were not included in the Little Man Computer instruction set. In a real CPU, the AND, OR, and EXCLUSIVE-OR instructions are implemented. We discuss Boolean-logic instructions in more detail below.
- single-operand manipulation instructions. These instructions include COMPLEMENT, INCREMENT, DECREMENT, and NEGATE. Negating means taking the 2's complement of the value in a register. We discuss single-operand manipulation instructions in more detail below.
- bit-manipulation instructions. These instructions operate on single bits in a data word.
- shift and rotate instructions. There are three basic instructions: LOGICAL SHIFT, ROTATE, and ARITHMETIC SHIFT. We discuss shift and rotate instructions in more detail below.
- program control instructions. These instructions include jumps, branches, subroutine calls, and returns. The Little Man Computer has three of these instructions: BR (branch), BRZ (branch on zero), and BRP (branch on positive).
- stack instructions. Stack instructions are used to access one of the most important structures in programming, the stack. We discuss stack instructions in more detail below.
- multiple-data instructions. Multiple-data instructions perform a single operation on multiple pieces of data. They are also known as SIMD (single instruction, multiple data) instructions and are used in multimedia applications.
- privileged instructions. Privileged instructions can be executed only by the operating system, because they affect the operating status of the computer and thus could affect other programs that are running. The HALT, INPUT, and OUTPUT instructions fall into this category. For example, a program can request an INPUT, but it is up to the operating system to perform that instruction in coordination with the INPUT and OUTPUT requests of other running programs.

Boolean-Logic Instructions

Boolean-logic operations are implemented on a bit-by-bit basis. Each bit position (the 0th, the 1st, the 2nd, ..., the nth) is operated on independently of the other bit positions. See the example below for the ANDing of two 8-bit words.

ANDing Two 8-Bit Words

A:      1011 0011
B:      1100 0101
A · B:  1000 0001

Note that a result bit is 1 only when the two bits being ANDed are both 1, and 0 when either of the two bits being ANDed is 0.
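The same bit-by-bit behavior can be verified directly, since most programming languages expose Boolean-logic instructions as bitwise operators. The Python lines below reproduce the AND example above and, for comparison, apply OR and EXCLUSIVE-OR to the same two words.

    # Bit-by-bit Boolean logic: bit i of the result depends only on
    # bit i of A and bit i of B.

    a = 0b1011_0011
    b = 0b1100_0101
    print(f"{a & b:08b}")   # AND          -> 10000001
    print(f"{a | b:08b}")   # OR           -> 11110111
    print(f"{a ^ b:08b}")   # EXCLUSIVE-OR -> 01110110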
Single-Operand Manipulation Instructions

As discussed in module 1, complementing a number involves changing all 1s to 0s and all 0s to 1s. Incrementing a number is simply adding 1 to it. A useful trick for incrementing a number is to change the rightmost 0 to a 1 and then change all of the bits to the right of that 0 from 1 to 0. The example below shows how to complement and then increment an 8-bit number.

Complementing and Then Incrementing an 8-Bit Word

A:                 0111 0000
complement of A:   1000 1111
plus 1:
A' incremented:    1001 0000

Note, referring to the trick described above for incrementing a number:

0000 0000 plus 1 = 0000 0001; i.e., decimal 0 plus 1 = decimal 1
0000 0001 plus 1 = 0000 0010; i.e., decimal 1 plus 1 = decimal 2
0000 0011 plus 1 = 0000 0100; i.e., decimal 3 plus 1 = decimal 4
0000 0111 plus 1 = 0000 1000; i.e., decimal 7 plus 1 = decimal 8
0000 1111 plus 1 = 0001 0000; i.e., decimal 15 plus 1 = decimal 16

Negating a number is the same as complementing and then incrementing it. Thus, to negate a number, take its 2's complement.

Shift and Rotate Instructions

There are three basic shift operations, and each can be performed in either the left or right direction:

1. logical shifts (LOGICAL SHIFT LEFT and LOGICAL SHIFT RIGHT)
2. arithmetic shifts (ARITHMETIC SHIFT LEFT and ARITHMETIC SHIFT RIGHT)
3. rotates (ROTATE LEFT and ROTATE RIGHT)

We explain these three operations below.

Logical Shifts

In logical shifts, the vacated position [the most significant bit (MSB) for a right shift, the least significant bit (LSB) for a left shift] is filled with a 0. Study the examples below.

Figure 3.14 Logical Shifts

Right shift: 1011 is shifted right logically by one position, and a 0 is placed in the MSB position, giving 0101.
Left shift: 1011 is shifted left logically by one position, and a 0 is placed in the LSB position, giving 0110.

Arithmetic Shifts

Arithmetic shifts have a numerical connotation, and it is essential that the sign of the number be preserved: a positive number must stay positive, and a negative number must stay negative, after an arithmetic shift (left or right). Thus, in all arithmetic shifts, the sign bit is retained.

Study the following example, which shows successive left shifts of the binary pattern 000011, which represents +3 in 2's complement. Each shift multiplies the number by 2, up to the point where overflow results. A left shift cannot be done when the two MSBs differ, because a further left shift would change the sign of the number. Verify for yourself that each left shift doubles the number. Successively click on the Step button to see this demonstration.

Figure 3.15 Arithmetic Shifts

Study the example below, which shows successive right shifts of the pattern 101000, which represents -24 in 2's complement. Each right shift performs an integer (truncating) division by 2. Start with 101000, the 2's complement of +24, and verify the result of each arithmetic right shift.

Figure 3.16 Right Shifts
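The shift behaviors just described are easy to express in code. In the Python sketch below, the masking steps emulate a fixed word size (Python integers are unbounded); the function names are our own, not standard library calls. A rotate is included as well, anticipating the next subsection.

    # Sketches of shift and rotate operations on fixed-width words.

    def logical_shift_left(value, bits=6):
        return (value << 1) & ((1 << bits) - 1)   # vacated LSB filled with 0

    def logical_shift_right(value, bits=6):
        return value >> 1                         # vacated MSB filled with 0

    def arithmetic_shift_right(value, bits=6):
        sign = value & (1 << (bits - 1))          # the sign bit is retained
        return (value >> 1) | sign

    def rotate_left(value, bits=4):
        msb = value >> (bits - 1)                 # the MSB wraps around to the LSB
        return ((value << 1) & ((1 << bits) - 1)) | msb

    x = 0b101000                                  # -24 in 6-bit 2's complement
    print(f"{arithmetic_shift_right(x):06b}")     # -> 110100, which is -12

    r = 0b1001
    for _ in range(4):
        r = rotate_left(r)
        print(f"{r:04b}")                         # 0011, 0110, 1100, 1001 (back to start)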
Rotate

Rotate operations move bits cyclically left or right; no particular arithmetic significance is attached to them. Rotates can be used to examine all the bits in a number or register without destroying the number itself: cyclically shifting an n-bit number n times gives the original number back, but with each shift a different bit "passes through" the LSB (or MSB) position, where it can be examined by doing a bit-wise AND with the number 1 (binary 000...01). See what happens when the word 1001 is rotated left.

Figure 3.17 Rotate Example

In a left rotation, all the bits except the MSB move left by one position, and the MSB rotates into the LSB position. Starting from 1001, successive left rotations give 0011, 0110, 1100, and finally 1001 again; after four rotations of the four-bit word, we are back at the starting position.

If you start with the word 1001 and rotate right, you should get the following succession, ending at the original position:

Figure 3.18 Rotate Right Example

Start with 1001.
1100 after rotating right by 1 position.
0110 after rotating right by 2 positions.
0011 after rotating right by 3 positions.
1001 after rotating right by 4 positions (back to the starting position).

Stack Instructions

A stack is a last-in, first-out (LIFO) memory: data items are pushed in at the top of the stack, and access is permitted only to the top of the stack. Stacks are used in modern machines to store return addresses from subroutines and interrupts.

To visualize how a stack works, picture the spring-loaded circular device that holds plates in some cafeterias. As plates are placed on the circular stand, the entire group of plates is lowered, compressing the powerful spring beneath the surface. Whether you remove one plate or ten, they all come off the top of the stack of plates: the last plates added are the first removed.

Two operations are associated with a stack:

- PUSH describes the operation of inserting one or more items onto the stack.
- POP describes the operation of removing one, some, or all of the items from the stack.

The stack pointer (SP) is a register that indicates the top of the stack. It is incremented or decremented, depending on whether a push or a pop occurs and on how many items are added to or removed from the stack. The advantage of a memory stack is that the CPU can always refer to it without having to specify a memory address, because the address of the top of the stack is automatically maintained and updated in the stack pointer.

Stacks may also be used to evaluate arithmetic expressions, as we illustrate below.

Stack Evaluation of (A + B) * (C + D)

Suppose we want to find the result of (A + B) * (C + D), where A, B, C, and D are operands (in this case, variables that are placeholders for data) and * and + are operators on the operands. (Note that, for clarity, we are using * to represent multiplication.) It is important to recognize that there are two types of information here: operands (the variables) and operators. Each type is governed by one of the following rules as we compute the value of the expression using a stack, after first putting the expression into postfix form:

Rule 1: If the symbol we encounter is an operand (in this case, a variable), push it onto the stack.

Rule 2: If the symbol we encounter is an operator (e.g., * or +), pop the top two items off the stack, perform the operation on those two items, and push the result onto the stack.

We can use a stack to compute the expression by:

1. converting the expression into postfix form so that it is amenable to stack manipulation.
   In this form, also known as reverse Polish notation, operators are placed after their operands instead of between them, and the conversion respects the usual order of operations: parentheses first, then multiplication and division, then addition and subtraction.
2. reading the postfix expression from left to right and, depending on the symbol we encounter, applying Rule 1 or Rule 2, until we come to the end of the expression.

If we have done all this correctly, there should be one item left on the stack when we reach the end of the expression, and that item is the result.

Let's see how this works in practice, first for the general case of the stack evaluation of the expression, and then for a specific numerical case. First, we must reorder the expression (A + B) * (C + D) into postfix form by going from left to right and (within parentheses) putting operators after their operands. The postfix form always takes operands in the sequence in which they appear in the original expression:

A + B becomes AB+
C + D becomes CD+
(AB+) * (CD+) becomes AB+CD+*

Now we can apply Rules 1 and 2 to evaluate the postfix expression AB+CD+*. Click successively on the Step button to see this demonstration of the use of stacks to evaluate the postfix expression.

Figure 3.19 Stack Evaluation of AB+CD+*

Stack Evaluation of (2 + 3) * (5 - 1)

Now, let's perform the stack evaluation of the arithmetic expression (2 + 3) * (5 - 1), where the operands are numbers rather than the variables in our previous example. First, reorder the expression into postfix form:

2 + 3 becomes 2 3 +
5 - 1 becomes 5 1 -
(2 3 +) * (5 1 -) becomes 2 3 + 5 1 - *

Now we can apply Rules 1 and 2 to evaluate the postfix expression 2 3 + 5 1 - *. Click successively on the Step button to see this demonstration.

Figure 3.20 Stack Evaluation of (2 + 3) * (5 - 1)
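Rules 1 and 2 translate almost verbatim into code. The Python sketch below evaluates the postfix form of the numeric example; evaluate_postfix is an illustrative helper of our own, not a library function.

    # Stack evaluation of a postfix expression using Rules 1 and 2.

    def evaluate_postfix(tokens):
        stack = []
        operators = {"+": lambda x, y: x + y,
                     "-": lambda x, y: x - y,
                     "*": lambda x, y: x * y}
        for token in tokens:
            if token in operators:             # Rule 2: pop two, operate, push
                right = stack.pop()            # the top item is the right operand
                left = stack.pop()
                stack.append(operators[token](left, right))
            else:                              # Rule 1: push the operand
                stack.append(token)
        result = stack.pop()
        assert not stack, "a well-formed expression leaves exactly one item"
        return result

    # (2 + 3) * (5 - 1) in postfix form is: 2 3 + 5 1 - *
    print(evaluate_postfix([2, 3, "+", 5, 1, "-", "*"]))   # -> 20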
C. Addressing Modes

In assembly language, each instruction may need some operands (data values). These operands may be specified by providing either a register address (i.e., the name of a register, such as R1 or R2) or a main memory address ("the operand is in location 3011"). Associated with each operand is an addressing mode, which determines how the specified address is interpreted. Each operand involved in an instruction could be addressed using any of these modes, and within a given instruction, different operands could use different modes.

Addressing modes determine the way operands are chosen during program execution. The use of various addressing modes provides the flexibility to:

- use programming facilities such as pointers to memory, counters for loop control, indexing of data, and program relocation
- reduce the number of bits in the addressing field of the instruction

The Little Man Computer uses only one addressing mode: the direct addressing mode, also known as absolute addressing, because the address given in the instruction is the actual memory location being addressed. That implies that there are also non-absolute modes of addressing. We will discuss two modes that use non-absolute addressing, base register addressing and relative addressing, later in this section.

We will start our discussion of addressing modes with a modified LMC, to look at three common types of addressing. The modified LMC uses a four-digit instruction, with the additional digit, representing the addressing mode, inserted as the second digit from the left. The three addressing modes are:

- immediate addressing (with an addressing mode digit of 1)
- direct addressing (with an addressing mode digit of 0)
- indirect addressing (with an addressing mode digit of 2)

The following discussions illustrate the different modes with the "load" instruction.

Immediate Addressing

Immediate addressing can be defined as an addressing mode in which the address field of the instruction contains the data itself. The fetch-execute cycle for a load instruction with immediate addressing would be:

1. PC → MAR. This is the first step of the fetch cycle.
2. Mem[MAR] → MDR. This is the second step of the fetch cycle, in which the contents of memory location MAR are transferred to the MDR.
3. MDR → IR. This is the third step of the fetch cycle, where the value in the MDR is copied into the IR.
4. IR[op code] → decoder. This is the last step of the fetch cycle, where the op code portion of the instruction in the IR is transferred to the decoder, and the decoder selects the instruction to be executed.
5. IR[address] → A. The data, which are the address portion of the instruction, are loaded into the A register.
6. PC + 1 → PC. The program counter is incremented.

Direct Addressing

Direct addressing can be defined as an addressing mode in which the address field of the instruction contains the address in memory where the data are located. The fetch-execute cycle for a load instruction with direct addressing is:

1. PC → MAR
2. Mem[MAR] → MDR
3. MDR → IR
4. IR[op code] → decoder
5. IR[address] → MAR. The address portion of the instruction points to the location of the data in memory.
6. Mem[MAR] → MDR
7. MDR → A. The data from memory are loaded into the A register.
8. PC + 1 → PC

Indirect Addressing

Indirect addressing can be defined as an addressing mode in which the address field of the instruction contains the address of a memory location that in turn contains the address in memory where the operand data are located. The fetch-execute cycle for a load instruction with indirect addressing would be:

1. PC → MAR
2. Mem[MAR] → MDR
3. MDR → IR
4. IR[op code] → decoder
5. IR[address] → MAR. The address portion of the instruction points to the location of another address in memory.
6. Mem[MAR] → MDR
7. MDR → MAR. The contents of the MDR point to the location of the data in memory.
8. Mem[MAR] → MDR
9. MDR → A. The data from memory are loaded into the A register.
10. PC + 1 → PC

Carefully read the three definitions above to see whether you understand the differences. Then, using the table of memory locations and values below, follow the interactive examples illustrating the load immediate 20, load direct 20, and load indirect 20 instructions in the LMC.

Table 3.5 Sample Memory Locations and Values

Memory Address   Data at Memory Address
20               40
30               50
40               60
50               70

The instruction load immediate 20 loads the value 20 into the register, because the operation specifies that the operand datum is in the operand field itself.

The instruction load direct 20 loads the value 40 into the register. The load direct 20 operation tells the computer that the operand datum is located at memory address 20; in our case, the value at location 20 is 40.

The instruction load indirect 20 loads the value 60 into the register. The load indirect 20 operation tells the computer to go to memory address 20 to find another memory address, 40. In our case, the actual operand datum, 60, is located at memory address 40.
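The three load variants can be summarized in a few lines of Python against the memory contents of table 3.5. The load function below sketches only the mode logic; it deliberately ignores the fetch steps, which are identical in all three cases.

    # The three load variants against the memory of table 3.5
    # (location 20 holds 40, location 40 holds 60, and so on).

    memory = {20: 40, 30: 50, 40: 60, 50: 70}

    def load(mode, field):
        if mode == "immediate":
            return field                    # the address field IS the datum
        if mode == "direct":
            return memory[field]            # the field points at the datum
        if mode == "indirect":
            return memory[memory[field]]    # the field points at another address

    print(load("immediate", 20))   # -> 20
    print(load("direct", 20))      # -> 40
    print(load("indirect", 20))    # -> 60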
Let's look at how all this would appear graphically:

Figure 3.21 Immediate, Direct, and Indirect Addressing (Load Immediate 20, Load Direct 20, and Load Indirect 20)

Register Direct Addressing

Register direct addressing can be defined as an addressing mode in which the address field of the instruction contains the address of the register where the data are located. Register direct addressing is similar to direct addressing; the difference is that the data are located in a CPU register rather than in memory. The figure below shows register direct addressing in the LMC.

Figure 3.22 Register Direct Addressing for a Load Instruction

Register Indirect Addressing

Register indirect addressing can be defined as an addressing mode in which the address field of the instruction contains the address of the register that contains the address in memory where the operand data are located. This is similar to indirect addressing; the difference is that the address of the data is held in a CPU register rather than in another memory location. The figure below shows register indirect addressing in the LMC.

Figure 3.23 Register Indirect Addressing for a Load Instruction

Base Register Addressing

Base register addressing, also known as base offset addressing, can be defined as an addressing mode in which the address field of the instruction is added to the contents of a base register to determine the final address of the data. Base addressing is an example of a non-absolute addressing mode. The advantage of this mode is that the entire program may be moved to a different location in memory and still be addressed correctly, simply by changing the contents of the base register.

The fetch-execute cycle for a load instruction using base register addressing would be:

1. PC → MAR
2. Mem[MAR] → MDR
3. MDR → IR
4. IR[op code] → decoder
5. IR[address] + BR → MAR. The sum of the address portion of the instruction register and the base register points to the final address.
6. Mem[MAR] → MDR
7. MDR → A
8. PC + 1 → PC

In figure 3.24 below, we introduce an additional modification to the LMC: a four-digit base register. This register contains a starting point in memory, while the operand value provides an offset from that starting point. In figure 3.24, part a, we have a program that starts at location 1000 and a current instruction with an address field (operand) of 10. Using base addressing, the actual location of the data in memory is 1000 (the base register value) + 10 (the operand) = 1010. In figure 3.24, part b, the program has been moved by the operating system (we will discuss moving programs in module 4) to start at location 2000. The new final address of the data is 2000 (the base register value) + 10 = 2010.

Figure 3.24 Base Register Addressing
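The relocation property comes down to one line of arithmetic. The sketch below mirrors figure 3.24: the same instruction, with the same address field of 10, finds its data at 1010 or at 2010 depending only on the value loaded into the base register.

    # Base register addressing: final address = base register + address field.

    def effective_address(base_register, address_field):
        return base_register + address_field   # IR[address] + BR -> MAR

    print(effective_address(1000, 10))   # program loaded at 1000 -> data at 1010
    print(effective_address(2000, 10))   # program moved to 2000  -> data at 2010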
Mem[MAR] → MDR
MDR → A
PC + 1 → PC

Indexed Addressing

Indexed addressing can be defined as: an addressing mode in which the address field is added to the contents of the index register to determine the final address of the data.

Indexed addressing is similar to base addressing in that the contents of two registers are added to determine the final address. The differences are philosophical. In base addressing, the base address is relatively large and is intended to locate a block of addresses in memory. The base register does not change until the program is moved to a new location in memory. In indexed addressing, the index is relatively small and is used as an offset for handling subscripting. Thus, the index register frequently changes during program execution.

The fetch-execute cycle for a load instruction using indexed addressing is:

PC → MAR
Mem[MAR] → MDR
MDR → IR
IR[op code] → decoder
IR[address] + X → MAR The sum of the address portion of the instruction register and the index register points to the final address.
Mem[MAR] → MDR
MDR → A
PC + 1 → PC

It is also possible—and quite common—to combine modes of addressing. Figure 3.25 a gives an example of indexed addressing. Figure 3.25 b gives an example of indexing a base offset address. In figure 3.25 a, the index register value of 5 is added to the address field of 100 to obtain the final address of 105. In figure 3.25 b, the index register value of 5 is added to the address field of 100 and the base register value of 1000 to get the final address of 1105.

Figure 3.25 a Indexed Addressing
Figure 3.25 b Indexing with a Base Register

Indirect Indexed Versus Indexed Indirect Addressing

When indirect and indexed addressing modes are combined, the sequence in which the modes are applied makes a difference. Figures 3.26 a and 3.26 b show how this sequence can affect the answer. In both figures, the address field has a value of 20, the index register has a value of 10, and indirect addressing is used in combination with indexed addressing.

In figure 3.26 a, the sequence is indexed, then indirect. Applying the index first gives an address of 20 + 10 = 30. Indirect addressing then uses address 30 to point to the final address, 50, where the data value of 70 is located.

Figure 3.26 a Using the Indexed Indirect Addressing Mode

In figure 3.26 b, the sequence is indirect, then indexed. Applying the indirect addressing first, we determine the pre-indexed address as 30 (the value at location 20), and then index 30 by 10 to find the final address of 40. The data value of 60 is the value at location 40.

Figure 3.26 b Using the Indirect Indexed Addressing Mode

The two orderings are contrasted in the sketch that follows.
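The following Python sketch contrasts the two orderings, using the values described for figures 3.26 a and b. The memory contents are inferred from the figure discussion, and the variable names are ours.

# Memory contents implied by the discussion of figures 3.26 a and b.
memory = {20: 30, 30: 50, 40: 60, 50: 70}
address_field = 20
index_register = 10

# Indexed indirect (figure 3.26 a): apply the index first, then indirect.
pointer_a = address_field + index_register   # 20 + 10 = 30
final_a = memory[pointer_a]                  # memory[30] = 50
print(memory[final_a])                       # data at 50 is 70

# Indirect indexed (figure 3.26 b): apply indirect first, then the index.
pointer_b = memory[address_field]            # memory[20] = 30
final_b = pointer_b + index_register         # 30 + 10 = 40
print(memory[final_b])                       # data at 40 is 60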
D. Instruction Set Architecture Comparisons

In this section, we look at the instruction set architectures of several computer families—the X86, PowerPC, and IBM mainframe—to show the fundamental similarities between all computer CPUs.

X86

The X86 family of computers includes the following:

8088, 8086, 80286, 80386, and 80486
Pentium, Pentium Pro, Pentium II, Pentium III, and Pentium 4
Itanium

Instructions with zero, one, or two operands are supported:

Zero-operand instructions take actions that do not require a reference to any memory or register data in the instruction. Stack operations, which do not refer to the data within the instruction, are also zero-operand instructions.
One-operand instructions modify data in place, e.g., they complement or negate data in a register.
Two-operand instructions use both a source and a destination for the data.

The registers for the above instructions vary from 16 bits on the early X86 versions to 32 bits on the later X86 versions. The later X86 versions also have 80-bit or 128-bit registers for floating-point instructions.

The classes of instructions that are supported include:

data transfer instructions
integer arithmetic instructions
branch instructions
bit manipulation, rotate, and shift instructions
input/output instructions
miscellaneous instructions such as HALT
floating-point instructions (on the later versions of the family)

The addressing modes include:

immediate mode
direct addressing mode
register mode (in these modules, referred to as register direct addressing)
register deferred addressing mode (in these modules, referred to as register indirect addressing)
base addressing mode
indexed addressing mode
base indexed addressing mode (in these modules, referred to as indexed with base register addressing)

We will discuss other aspects of the X86 family in module 4.

PowerPC

The PowerPC is a family of processors built around a specification for open system software defined jointly by Apple, Motorola, and IBM. Members of this family include the Apple Power Macintoshes and the IBM RS/6000.

Every instruction in the PowerPC is 32 bits in length. The instruction set is divided into six classes:

integer instructions
floating-point instructions
load/store instructions (to move data to and from memory)
flow-control instructions (for conditional and unconditional branching)
processor-control instructions (including moves to and from registers and system/interrupt calls)
memory-control instructions (to access cache memory and storage)

Three addressing modes are used:

register indirect
register indirect with indexing
register indirect with immediate offset

The last two modes are not exactly what we discussed in section III C, but are similar to the indexed addressing mode we presented. We will discuss other aspects of the PowerPC family in module 4.

IBM Mainframe

The IBM mainframe family includes the System 360/370/390/zSeries family. Here, we discuss aspects of the zSeries system.

The zSeries instructions have 0, 1, 2, or 3 operands. The 3-operand instructions operate on multiple quantities of data, with two of the operands used to describe the high and low ends of the range of data. For example, a group of all data between memory locations 10 and 20 would have the operand addresses 10 and 20.

Five classes of instructions are used:

general instructions (used for data transfer, integer arithmetic and logical instructions, branches, and shifts)
decimal instructions (used for simple arithmetic, including rounding and comparisons)
floating-point instructions
control instructions (used for system operations)
input/output instructions

Four types of addressing are used:

immediate addressing
register addressing (in these modules, referred to as register direct addressing)
storage addressing (in these modules, referred to as base addressing)
storage indexed addressing (in these modules, referred to as indexed with a base register addressing)

We will discuss other aspects of the zSeries in module 4.

Module 4: Advanced Systems Concepts
Commentary

Topics
I. CPU Design
II. Memory System Concepts
III. Performance Feature Comparisons

I. CPU Design

Like their automotive industry counterparts, computer designers try to obtain the highest possible performance from their designs.
In module 1, we saw how an additional measure of precision was obtained in the IEEE 754 standard by using an implied "1" in the significand. A second means of improving performance—through the use of various addressing methods—was investigated in module 3. In this module, we will look at several other approaches to improving performance, including the use of:

RISC processors in place of CISC processors
pipelining and superscalar processing
cache memory
virtual memory

Several additional steps can be taken to improve performance, including using:

processors with more than one CPU to assist in the computations
faster clock speeds
wider instruction and data paths, which allow the CPU to access more data in memory or fetch more instructions at one time
faster memory and disk accesses

A. RISC versus CISC

There are two different types of CPU design: Complex Instruction Set Computer (CISC) design and Reduced Instruction Set Computer (RISC) design. CISC is represented by the X86 (80286, 80386, 80486, and Pentium) family of processors, the IBM 360 (360, 370, 390, and zSeries) family of processors, and the Motorola 68000 series microprocessors. RISC is represented by the PowerPC (IBM RS/6000, IBM AS/400, and Apple Power Macintoshes) and the Sun SPARC processors.

The RISC CPU differs from the CISC CPU in five principal ways. It has:

1. a limited and simple instruction set made up of instructions that can be executed at high speeds. The high speeds are accomplished, in part, by using a hard-wired CPU with instructions that are pipelined, i.e., the fetch phase of the next instruction occurs during the execution phase of the current instruction.
2. register-oriented instructions with limited memory access. Most of the RISC instructions operate on data in registers; there are only a few memory access instructions (i.e., LOAD and STORE).
3. fixed-length and fixed-format instructions that make it easier to pipeline follow-on instructions. The SPARC RISC architecture has five instruction formats, each 32 bits long. In comparison, the IBM 360 architecture has nine formats that vary in length from 16 to 48 bits.
4. limited addressing modes. RISC has only one or two addressing modes, as compared with the various modes discussed in module 3. Less complicated addressing leads to simplified CPU design.
5. a large bank of registers. RISC CPUs have many registers—often 100 or more—that allow multiple programs to execute without moving data back and forth from memory.

Proponents of RISC and CISC architectures argue over the merits of these approaches. These arguments have become less meaningful, however, as improved technology has led to new CPU enhancements that combine the features of each design. Newer RISC designs have an increased number of instructions, made possible because today's technology provides faster processing techniques and more logic on the system chips. Newer CISC designs have an increased number of user registers and more register-oriented instructions. Thus, the features and capabilities of recent RISC and CISC CPUs are very similar. Two of these features—pipelining and superscalar processing, which we discuss in the next section—help compensate for the argued differences between the two architectures. Thus, the choice of RISC or CISC design simply reflects the preferences and specific goals of the CPU designers.

B. Pipelines and Superscalar Processing

In module 2, we briefly mentioned that clock pulses are used to trigger flip-flops in sequential circuits.
The entire computer is synchronized to the pulses of an electronic clock. This clock controls when each step of a computer instruction occurs. For example, figure 4.1 shows the series of pulses required for an ADD instruction in the Little Man Computer (LMC, discussed in section II A of module 3), where:

the first three pulses perform the fetch
the fourth pulse performs the decoding
the next two pulses execute the ADD instruction
the seventh pulse starts the next instruction

Note that the step

PC + 1 → PC

is performed early because it doesn't affect the calculation. Also, the steps

IR[op code] → decoder and IR[address] → MAR

are performed in parallel. Performing the steps in parallel reduces the number of clock cycles.

Figure 4.1 Timing for an ADD Instruction

When pipelining is used, the steps of instructions that are in a sequence are fetched, decoded, and executed in parallel. While one instruction is being decoded, the next instruction is being fetched, and while the first instruction is being executed, the second instruction is being decoded and a third instruction is being fetched. In the figure below, note that four different instructions are in progress at the same time after pulse 4:

instruction 1 is on step 4
instruction 2 is on step 3
instruction 3 is on step 2
instruction 4 is on step 1

Figure 4.2 Pipelining Instructions

A further improvement to pipelining is to separate the fetch, decode, and execute phases of the fetch-execute cycle into separate components, then use multiple execution units for the execution phase, and pipeline the execution portions of the instructions. With only one execution unit, the processor is considered scalar. With multiple execution units, the processor is considered superscalar.

Figure 4.3 shows a comparison of scalar and superscalar processing. Figure 4.3 a shows that, with scalar processing, the fetches, decodes, and executions are pipelined. Figure 4.3 b shows that, with two fetch-execute units, two complete sets of instructions are performed in parallel.

Figure 4.3 Scalar Processing versus Superscalar Processing

a. Scalar Processing
Instruction 1: fetch  decode  execute
Instruction 2:        fetch   decode  execute
Instruction 3:                fetch   decode  execute
Instruction 4:                        fetch   decode  execute

b. Superscalar Processing
Instruction 1: fetch  decode  execute
Instruction 2: fetch  decode  execute
Instruction 3:        fetch   decode  execute
Instruction 4:        fetch   decode  execute

A small simulation of both arrangements appears in the sketch below.
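As a hedged illustration of figure 4.3 (not a model of any real CPU), the Python sketch below advances every in-flight instruction one stage per clock tick. With one issue per tick, the pipeline behaves like figure 4.3 a; with two issues per tick, like figure 4.3 b.

STAGES = ["fetch", "decode", "execute"]

def pipeline(n_instructions, units):
    # Print which instruction occupies which stage on each clock tick.
    tick, in_flight, next_instr = 0, [], 1
    while next_instr <= n_instructions or any(s < 2 for _, s in in_flight):
        tick += 1
        # Advance every in-flight instruction one stage; finished ones retire.
        in_flight = [(i, s + 1) for i, s in in_flight if s + 1 < len(STAGES)]
        # Issue up to `units` new instructions into the fetch stage.
        while next_instr <= n_instructions and \
                sum(1 for _, s in in_flight if s == 0) < units:
            in_flight.append((next_instr, 0))
            next_instr += 1
        print(f"tick {tick}:", [(i, STAGES[s]) for i, s in in_flight])

pipeline(4, units=1)   # scalar: the four instructions finish on tick 6
pipeline(4, units=2)   # superscalar: the same four finish on tick 4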
We can use what we have learned thus far to show the benefits of pipelining for a computer using the LMC instruction set that was described in section I-C of module 3. For this analysis, we will make some simplifying assumptions:

1. The PC incrementing step is performed in parallel with another instruction, as shown on pulse 2 in figure 4.1.
2. An average program is composed of:
   70 percent STORE, LOAD, ADD, and SUBTRACT instructions, requiring six steps each
   30 percent IN, OUT, and BRANCH instructions, requiring five steps each
   (HLT is ignored because it occurs only once per program.)
3. The clock runs at a 10 MHz rate, i.e., 10,000,000 ticks per second.
4. A new step starts with each tick of the clock.
5. With a pipeline, the time for each instruction (after the first) is reduced to the time required for the first step.
6. There are no delays in the pipeline.

Example 4.1 shows the calculation of the number of instructions per second (IPS) for execution without pipelining. Example 4.2 shows the calculation for execution with pipelining.

Example 4.1—Instructions per Second Without Pipelining

Average number of steps per instruction
= .70 * 6 steps + .30 * 5 steps = 5.7 steps

Instructions per second (IPS)
= (10,000,000 ticks/sec * 1 step/tick) / (5.7 steps/instruction)
= 1.75 million IPS

Example 4.2—Instructions per Second With Pipelining

Average number of steps per instruction
= .70 * 1 step + .30 * 1 step = 1.0 step

Instructions per second (IPS)
= (10,000,000 ticks/sec * 1 step/tick) / (1.0 step/instruction)
= 10 million IPS

In a RISC architecture where a new instruction is started on each clock tick, the average number of steps per instruction would approach 1, and the number of instructions per second would approach 10 million IPS.

II. Memory System Concepts

A. Memory Hierarchy

An entire hierarchy of memories exists. Levels 1–4 of this hierarchy have traditionally been called primary storage.

1. CPU registers—Registers are by far the fastest and most expensive type of memory, with each register providing temporary storage for only one word while it is being processed in the CPU.
2. cache memory level 1—This type of memory is incorporated into the CPU using small, very high-speed static RAM (SRAM) chips and is the fastest to access.
3. cache memory level 2—This is a larger, but slower, cache, also built from SRAM, that backs up the first cache and has traditionally been located outside the CPU. Some modern CPUs contain both level-1 and level-2 cache memory.
4. main memory (MM)—MM, also known as RAM, is built from dynamic RAM (DRAM) chips. It contains programs and data, and has a slower access time than cache memory.
5. disk storage—Disk storage, which includes disks, tapes, CD-ROMs, bubble memories, and so on, offers much larger capacities at the expense of access time.

Throughout the memory hierarchy, there is an inverse relation between access time and size. Memory components grow larger as their distance from the CPU increases; the cost per byte decreases as the component moves further from the CPU, but the access time becomes longer.

We discussed main memory (MM) in module 3. We will discuss cache memory in detail in the following section.

B. Cache Memory

A cache memory (CM) is a small, high-speed memory that is placed between the CPU and MM. If we are lucky, most of the CPU accesses will be to CM, which has the effect of creating a faster memory cycle. The MM will be accessed only if what is being sought is not in CM. For such cases, the access time will actually increase slightly, but one hopes this will not happen frequently. A key measure of cache efficiency is the hit ratio, which is the fraction of memory accesses that succeed in finding an item within the cache. Hit ratios of around 80 percent are considered to be very good.

The success of a CM is predicated on the principle of locality of reference, which refers to the fact that, in typical programs, references to memory tend to cluster in localized areas, due to subroutines and loops within the program. You may wish to see how a CPU access to memory works when a cache is present.

Two terms we must understand to discuss cache memory are:

cache slot—This contains an MM address, or a portion of an address, and one or more data words. It is also known as a cache word.
tag—This is a field used to uniquely identify which MM address has been mapped into a cache word.

In addition, we must understand the following CM concepts: cache types, the valid bit, block size, and the address mapping policies described below. Before turning to address mapping, the sketch below shows how the hit ratio translates into an effective access time.
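Here is a quick, hedged sketch of why the hit ratio matters. It assumes a simple model in which a hit costs only the cache access and a miss costs the failed cache lookup plus the MM access (the slight increase mentioned above); the timing values are made up for illustration.

def effective_access_time(hit_ratio, t_cache, t_mm):
    # A hit costs the cache access; a miss costs the failed cache
    # lookup plus the MM access.
    return hit_ratio * t_cache + (1 - hit_ratio) * (t_cache + t_mm)

# With the 80 percent hit ratio described above, and illustrative
# times of 5 ns for CM and 50 ns for MM:
print(effective_access_time(0.80, 5, 50))   # 15.0 ns, versus 50 ns without a cache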
Address Mapping

The address mapping process used in the cache is by far the most important factor in determining performance. This process must occur with every single memory reference; therefore, it is crucial that all parts of the process be carried out by hardware rather than by software, because hardware has the advantage of speed. The three mapping methods described below differ in how they determine where a given MM reference may be placed in the cache:

1. associative mapping
2. direct mapping
3. set-associative mapping

We will compare these methods by applying them to the same situation—an MM of 32K x 12 (15-bit addresses). The CM size will be 512 words (9-bit addresses). The word size of the cache depends on the mapping method. This organization is shown in figure 4.4.

Figure 4.4 Location of Cache Memory

Associative Mapping

The associative mapping method allows full flexibility in determining where a given MM slot may be stored in cache. This mapping method uses a special kind of memory called an associative memory.

When the CPU generates an MM reference, the associative memory is first searched using the CPU-generated address as the pattern. If a match is found, the reference is a hit, and the data part of the matched location is read. If the reference results in a miss, then the MM is accessed using the CPU address. Clearly, if the required value is not found in the cache, the associative memory must also be loaded with the address and MM contents for future cache access.

Let's look at the example of associative mapping in figure 4.5. We will need an associative memory that is 15 + 12 = 27 bits wide. This memory will store the memory addresses generated by the CPU (15 bits) and the 12 bits of data stored at that MM address. The 15-bit memory address is the tag for the associative mapping cache.

Figure 4.5 Associative Mapping Cache (all values are in octal)

Address    Data
01000      2345
02777      1670
23450      3456
  :          :

When we include the valid bit, the cache word is 28 bits long and has this form:

Tag        Data       Valid Bit
15 bits    12 bits    1 bit

The associative cache described and shown above is the case when the block size is 1. If we have 8 locations per block (i.e., a block size of 8), we will need a tag field of 12 bits (15 bits minus the 3 bits that identify the word within the block) plus a word-within-the-block field of 3 bits. Each slot in the associative CM must store the tag of 12 bits and the block field of 3 bits, plus the contents of 8 memory locations (i.e., 8 x 12 bits) and 1 valid bit (for the entire cache word). Thus, this cache word is 112 bits long and has this form:

Tag       Block    Data 1    Data 2    Data 3   - - -   Data 8    Valid Bit
12 bits   3 bits   12 bits   12 bits   12 bits  - - -   12 bits   1 bit

As before, a miss means that the associative memory must be loaded with the address and the MM contents of the reference that caused the miss. If the associative memory is full, we use one of the replacement policies discussed at the end of this section. The sketch below walks through an associative lookup.
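The sketch below mimics an associative lookup with a block size of 1, using the octal values of figure 4.5. A Python dictionary stands in for the parallel hardware search, and the valid bit is implicit; both simplifications are ours.

cache = {}   # tag (the 15-bit MM address) -> 12-bit data word

def read(address, mm):
    if address in cache:        # in hardware, every tag is compared in parallel
        return cache[address]   # hit: the data part of the matched location
    data = mm[address]          # miss: access MM using the CPU address...
    cache[address] = data       # ...and load the cache for future references
    return data

mm = {0o01000: 0o2345, 0o02777: 0o1670, 0o23450: 0o3456}
print(oct(read(0o02777, mm)))   # miss: loads the cache, returns 0o1670
print(oct(read(0o02777, mm)))   # hit: found in the cache this time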
The advantage of associative mapping is that we have a lot of flexibility in determining which cache slots to replace, so we can act to keep frequently accessed slots in cache. The disadvantage is the need for associative memories, which are expensive.

Direct Mapping

The direct mapping approach uses a random access memory in which the CPU address is divided into two fields, a tag and an index. If there are 2^n words in main memory and 2^k words in cache, then the tag field is n – k bits, and the index field is k bits. The addressing relationship for the 32K x 12 memory and 512 x 12 cache is shown in figure 4.6 below.

Figure 4.6 Addressing Relationship Between MM and CM

Each MM location maps to a single specific cache location. In figure 4.7 below, every MM address has 15 bits (shown as octal digits), which are split into two parts as follows:

1. The lower-order 9 bits of an MM address, which are the index bits, are used to address the CM.
2. The remaining 6 bits, which are the tag bits, are used to uniquely identify which MM address has been mapped into a cache slot.

Thus, the CM has the value in MM location 00000 (2120) loaded at index address 000 with tag 00, and the value at MM location 02777 (1670) loaded at index address 777 with tag 02.

Figure 4.7 Direct Mapping Cache Storage

In the situation above, we considered only one word of data per cache location. But if we have a block size of 8, then the index is broken down into two parts:

Block     Word
6 bits    3 bits

The advantages of direct mapping are its relatively low cost and its simplicity. The principal disadvantage is inflexibility: because each MM address can occupy only one specific cache slot, two frequently used MM addresses that share an index will repeatedly evict each other from the cache.

Set-Associative Mapping

Set-associative mapping has some of the simplicity of direct mapping. Each MM address maps to a fixed set of cache "slots" (each slot can contain a data value), but a given MM address can be placed in any one of the slots in the set to which it maps. If the size of each set is k, then the mapping is said to be k-way set associative. You must keep clear the distinction between a cache set and a cache slot. When there is no match, we can choose which cache slot within the set to replace; thus, we do have some flexibility.

The k-way set-associative mapping differs from both the associative and direct mapping caches in that, in addition to a valid bit field, there is a count bit field that keeps track of when data were accessed and that is used for replacement purposes. Each index word refers to k data words and their associated tags. The number of count bits, m, is the base-2 logarithm of k (2^m = k). In addition, the tag is m bits longer and the index is m bits shorter, because there are fewer cache words (requiring fewer bits to address), and the shorter index requires a longer tag to adequately describe the full address in MM.

Using our example of a 32K x 12 memory, a two-way set-associative cache of size 512 x 12 would have 256 cache words (512 / 2) for an index of 8 bits (256 = 2^8) and a tag of 7 bits (15 – 8). Each way would have 7 tag bits, 12 data bits, 1 count bit, and 1 valid bit. The format of the complete cache word would be:

Way 1: Tag (7 bits)   Data (12 bits)   Count Bit (1 bit)   Valid Bit (1 bit)
Way 2: Tag (7 bits)   Data (12 bits)   Count Bit (1 bit)   Valid Bit (1 bit)

If we consider a four-way set-associative cache, we would have 128 cache words (512 / 4) for an index of 7 bits (128 = 2^7) and a tag of 8 bits (15 – 7). Each way would have 8 tag bits, 12 data bits, 2 count bits, and 1 valid bit. The format of the complete cache word would be:

Way 1: Tag (8 bits)   Data (12 bits)   Count Bits (2 bits)   Valid Bit (1 bit)
Way 2: Tag (8 bits)   Data (12 bits)   Count Bits (2 bits)   Valid Bit (1 bit)
Way 3: Tag (8 bits)   Data (12 bits)   Count Bits (2 bits)   Valid Bit (1 bit)
Way 4: Tag (8 bits)   Data (12 bits)   Count Bits (2 bits)   Valid Bit (1 bit)

The sketch below shows how the same 15-bit address splits under each of these arrangements.
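A hedged sketch of the tag/index split for the examples above, using the address from figure 4.7 (the helper name is ours):

def split(address, index_bits):
    # Low-order bits form the index; the remaining high-order bits form the tag.
    index = address & ((1 << index_bits) - 1)
    tag = address >> index_bits
    return oct(tag), oct(index)

address = 0o02777
print(split(address, 9))   # direct mapped: tag 0o2, index 0o777 (as in figure 4.7)
print(split(address, 8))   # two-way set associative: 7-bit tag, 8-bit index
print(split(address, 7))   # four-way set associative: 8-bit tag, 7-bit index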
Replacement Policies

An important design issue for cache memory is which replacement policy to use. In the case of a miss under the associative and set-associative mappings, we must have a replacement policy so that a cache slot can be vacated. Most processors use a least recently used (LRU) replacement policy, or proprietary variants of LRU, but several replacement policies are available:

1. FIFO (first-in first-out): The entire associative memory is treated as a "circular buffer," and slots are replaced in round-robin order. Note that this policy makes no concessions to the frequency of usage of an MM address.
2. LRU (least recently used): The slot that is replaced is the one that has not been used for the longest time. The LRU policy is difficult to implement, but it does favor frequently used addresses.
3. LFU (least frequently used): A count is kept of how many times a given cache location is accessed in a fixed period of time. When a miss occurs, the slot that has the smallest count is the one replaced. Note that a slot that was recently referenced for the first time will have a small count value, making it a candidate for exile to MM. For this reason, all the frequency counters are periodically reset.
4. Random: One slot is randomly picked from the list of those eligible. Some studies indicate that this policy is not far behind the best-performing one (LRU), but is easier to implement.

Write Methods

Another important design issue is how to write data to the cache. When a CPU instruction modifies (i.e., writes a new value into) a cache location, the cache will contain a different value than the corresponding MM location because, in the case of a hit, only the CM is accessed! Whenever we have two memories (or, in general, two copies of any database), we will have consistency problems if writes are not done simultaneously to both copies. You will study this problem in more detail if you take a course on databases.

Several write methods are available. If we use the write-through method, each operation that writes a new value into cache must simultaneously write that value into the corresponding MM location to guarantee memory integrity. The problem is that, if there are many writes, frequent accesses to MM must be made, slowing everything down.

With the write-back method, every cache slot is associated with a single bit, called the dirty bit, and writes are made only to cache. When any location within a cache slot is written (remember that a cache slot will, in general, contain an entire block of memory), the dirty bit for that slot is set. Thus, the dirty bit indicates that some location within the slot has been contaminated. When a cache slot must be replaced, it is necessary to actually write the slot back into memory only if its dirty bit has been set. At the conclusion of a program, any dirty cache contents must also be written out to MM for the final time. The sketch below combines LRU replacement with the write-back method.
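Below is a hedged Python sketch combining LRU replacement with the write-back method. A block size of 1 and an OrderedDict standing in for the LRU bookkeeping are our simplifications, not a description of real cache hardware.

from collections import OrderedDict

class WriteBackCache:
    def __init__(self, capacity, mm):
        self.slots = OrderedDict()   # address -> [data, dirty_bit]
        self.capacity = capacity
        self.mm = mm

    def read(self, address):
        if address not in self.slots:            # miss: fetch from MM
            self._load(address)
        self.slots.move_to_end(address)          # mark as most recently used
        return self.slots[address][0]

    def write(self, address, value):
        if address not in self.slots:
            self._load(address)
        self.slots[address] = [value, True]      # write to cache only; set dirty bit
        self.slots.move_to_end(address)

    def _load(self, address):
        if len(self.slots) >= self.capacity:
            old, (data, dirty) = self.slots.popitem(last=False)   # evict the LRU slot
            if dirty:
                self.mm[old] = data              # write back only if contaminated
        self.slots[address] = [self.mm[address], False]

mm = {20: 40, 30: 50, 40: 60, 50: 70}
cache = WriteBackCache(capacity=2, mm=mm)
cache.write(20, 99)      # mm[20] is still 40; the cache copy is dirty
cache.read(30)
cache.read(40)           # evicts address 20 and writes 99 back to MM
print(mm[20])            # 99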
C. Security and Memory Management

Nearly every modern computer is designed to support running multiple programs on a single CPU, a technique known as multitasking, or multiprogramming. The design of the computer must support multiprogramming in such a way that malicious or inadvertent code cannot shut down the computer. In order to meet this requirement, the computer hardware should:

1. limit any executing program to a specific portion of memory. Protection is provided for storage access such that address spaces can be completely isolated from each other. This keeps a program from accessing addresses being used by the system or by other programs.
2. limit the set of instructions that a user's program can execute. Multiple modes of protection are provided, including a privileged or supervisory mode for control of the system, and a user mode for application programmers. This protection keeps application programmers from using instructions that are designed for system programmers. Most instructions are unprivileged and available to application programmers.
3. eliminate the programmer's concern about exactly where his or her program will load in MM. Program addresses are referred to as logical addresses, whereas actual memory addresses are referred to as physical addresses. Logical addresses do not have any meaning outside the program itself. Physical addresses are like LMC mailboxes: they have physical reality; that is, they physically exist. Transforming from logical to physical addresses is known as mapping.

Memory management is performed by the memory management system, a collection of hardware and software used for managing programs residing in memory. The hardware associated with the memory management system is called the memory management unit and is located between the CPU and MM. The memory management software is usually part of the operating system.

In a multiprogramming environment with many programs in physical memory, it is necessary to:

move programs and data around the MM
alter the amount of memory a specific program employs
prevent a program from inadvertently modifying other programs

A memory management unit may:

map logical addresses into physical addresses
enable memory sharing of common programs by different users
enforce the security provisions of memory

Older computer systems used various means of partitioning a fixed-size memory to handle programs. Specific memory management systems included:

single-task systems. In a single-task system, the memory is allocated between the operating system and the executing program, with some unused memory remaining. Even if there is enough room in the unused portion of memory for additional programs, they cannot be loaded, because the operating system allows only one program in memory at a time.

fixed-partition multiprogramming. A computer that runs multiple programs must have a memory large enough for those programs to reside. One approach to managing multiple programs in memory is to establish a set of fixed and immovable regions, or partitions. One partition is reserved for the operating system, and the other partitions are available for user programs. The job of the operating system is then to decide which partition any one program can occupy, and to decide in which queue to place a program waiting for space in memory.

variable-partition multiprogramming. A more flexible approach to memory management is variable-partition multiprogramming, where the operating system is allowed to increase or decrease the size of partitions.
As programs are moved into and out of memory, holes are created. The holes are wasted space not being used by programs.

The fixed-partition and variable-partition multiprogramming schemes have a common goal of keeping many processes (programs) in memory simultaneously to allow multiprogramming. Both schemes, however, require the entire program to be in memory before the process can execute. We say that these schemes require contiguous memory, meaning that all parts of memory related to a given program are located adjacent to each other. The problem is that very large programs may still require more memory than is available. Virtual memory provides the solution to the problem of memory size by allowing the execution of programs that are larger than the physical memory. Thus, where partitioned systems require contiguous allocations of memory, virtual memory can work with noncontiguous allocations of memory.

Another problem with both fixed-partition and variable-partition memories is the inefficient use of memory, even if there are a lot of modest-sized programs. If there is enough free memory to load a program, but the free memory is not contiguous, the operating system must pause some programs and relocate them, modifying the current partition structure. Virtual memory solves this problem by enabling modest-sized programs to be scattered about MM, using whatever space is available.

A third problem with both partitioning schemes is that the final address in memory where a program will be loaded for execution is not known. One solution to that problem is to use base addressing, where the operating system uses a base register (discussed in section III-C of module 3) to hold the location that corresponds to the starting location of the program.

The overall effect of the partitioning forms of memory management is that they lead to memory fragmentation and other assorted inefficiencies.

D. Virtual Memory

The concept of virtual memory resolved the shortcomings of memory partitioning by incorporating hardware into the memory management unit to perform paging, the process of dividing both the user address space and MM into pages of a fixed size, the "page size." Thus, different pages of the program can be loaded into different pages of MM. Consecutive pages of the user address space need not be consecutive in MM, and not all pages of user space need to be present in MM at any time.

An example of virtual memory, also known as virtual storage, is shown in figure 4.8, where memory used by the program (the logical organization) is stored in three different memory locations and one disk drive location (the physical organization). This type of storage allows the computer to load and execute a program that is larger than memory.

Figure 4.8 Logical Organization versus Physical Organization

In a paged system, each equal-sized logical block is called a page (like a page of a book), and the corresponding physical block is called a frame. The size of a page is equal to the size of a frame. The size of the block is chosen in such a way that the bits of the memory address can be naturally divided into a page number and an offset. The offset is a pointer to a location on a page. A page table converts between the page number and the frame address.

An example of page translation from logical address 10A17 to physical address 3FF617 is shown in figure 4.9. Both addresses are shown as hexadecimal numbers.
Thus, the logical address is 20 bits, with a 12-bit page number (10A) and an 8-bit offset (17). The physical address is 24 bits, with a 16-bit frame number (3FF6) and the same 8-bit offset (17).

Figure 4.9 Page Translation in a Virtual Memory System

Let's consider how we could modify the LMC to support virtual memory. The LMC instruction set that we discussed in section I-C of module 3 has a two-digit logical address space that allows for 100 two-digit addresses and 100 physical mailboxes. If the page (and frame) size were 10, there would be 10 pages (numbered 0 to 9) and 10 frames (also numbered 0 to 9), and the offset within a page would run from 0 to 9. An address would then consist of a one-digit page number and a one-digit offset. Thus, two-digit memory address 26 would become page 2 with an offset of 6, as shown below:

Page Number    Offset
2              6

If we allowed the LMC to physically expand to 1,000 addresses, but kept the two-digit address allowed in the instruction set, a program would still be limited to 100 logical addresses, but those addresses could be spread over the 1,000 physical addresses. A translation table such as that shown in figure 4.10 below would be needed to convert from the logical space to the physical space.

Figure 4.10 LMC Page Table with a Large Physical Space

Virtual memory can also be used to execute two programs that have the same code but different data. As shown in figure 4.11, the two programs use:

five 20-unit pages with logical addresses 1 to 100 for program code (e.g., instructions)
three 20-unit pages with logical locations 101 to 160 for data

The program code is stored in five 20-unit frames with physical addresses from 200 to 300 in memory. The data for the first program are stored in three 20-unit frames with physical addresses from 401 to 460, and the data for the second program are stored in three 20-unit frames with physical addresses from 501 to 560.

Figure 4.11 Virtual Memory for Execution of Two Programs

Page Faults

Virtual memory simplifies the problem of memory management because finding memory partitions large enough to fit a program contiguously is no longer necessary. But to execute an instruction or to access data, it is necessary to have:

the instruction or data in memory
an entry in the page table that maps the logical address to the physical address

If these conditions are not both met, the CPU hardware causes a page fault. When a page fault occurs, the memory management software selects a memory frame to be removed (swapped out) and replaced (swapped in) with the needed page. By keeping in MM only those portions (i.e., referenced pages) of a program that are needed at a particular time, it is possible to "swap" needed pages in and out of MM at different times, enabling a large program to run in a small amount of MM space. Furthermore, these pages can be placed in any location in MM available at the time, so the program can exist in scattered (noncontiguous) locations. The page size is typically a hardware design feature of a computer system.

The concept of paging is similar to that of CM in that it is hoped that memory accesses will be found in pages already resident in MM, thus avoiding the time-consuming need to swap information into and out of auxiliary storage devices. The sketch below shows the LMC-style translation in code.
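Here is a minimal sketch of that LMC-style translation. The page-table contents below are made up for illustration (figure 4.10 shows the course's actual example), but the split of a two-digit address into page and offset follows the text.

# Hypothetical page table for the 1,000-mailbox LMC:
# logical page number -> physical frame number (frames 0 to 99).
page_table = {0: 12, 1: 45, 2: 78, 3: 3, 4: 90,
              5: 21, 6: 66, 7: 9, 8: 50, 9: 33}

def translate(logical_address):
    page, offset = divmod(logical_address, 10)   # e.g., 26 -> page 2, offset 6
    frame = page_table[page]
    return frame * 10 + offset                   # the physical mailbox number

print(translate(26))   # page 2 maps to frame 78, so physical address 786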
Segmentation

Another approach to virtual memory, called segmentation, is one in which the blocks have variable sizes. Unlike paging, segmentation gives the user control over the block size, so it is necessary to include the size of each variable-sized block in the translation table. Thus, the translation table, called a segment table, includes the logical and physical starting addresses and the block size for each segment. Figure 4.12 shows an example of a segmentation translation, where logical address 217 translates to physical memory address 492 as an offset of 17 from Block B.

Figure 4.12 Segmentation Translation in a Virtual Memory System

Segmentation is harder to operate and maintain than paging, and it is falling out of favor as a virtual-memory technique.

III. Performance Feature Comparisons

In this section, we continue the discussions we started in module 3 on features of the X86, PowerPC, and IBM mainframe computer families to show the fundamental similarities between all computer CPUs. We will discuss RISC versus CISC, pipelining, cache and virtual memory, and some security issues.

A. X86

Although the early X86 CPUs had a simple design, the X86 architecture has evolved into a sophisticated and powerful design with improved processing methods and features. Backward software compatibility with earlier family members has been maintained, so that each new model has been capable of executing the software built for previous models.

The X86 CPU incorporates CISC processing with pipelining and superscalar processing, two-level cache memory, and virtual storage (which we have called virtual memory). Floating-point instructions, virtual storage, and multiprogramming support are now part of the basic architecture. As a typical CISC processor, the X86 has relatively few general-purpose registers and a relatively large number of specialized instructions, along with a great variety of instruction formats.

CPU interrupts can be either emergency interrupts or normal interrupts, at one of thirty-two prioritized levels from IRQ0 to IRQ31. System security supports multiprogramming through the implementation of a protected mode for addressing. In this mode, the CPU provides four levels of access to memory. Application programs have the lowest level of access, and key portions of the operating system have the highest level of access.

Early versions of the X86 family required floating-point arithmetic to be performed in software or in a separate processor. Later versions added the capability for floating-point arithmetic.

B. PowerPC

The PowerPC family of processors is developed around the RISC concept. The architecture is used in the IBM RS/6000 workstations, Apple Power Macintoshes, and Nintendo GameCube systems. Floating-point arithmetic, memory caching, virtual memory, operating-system protection, and superscalar processing are standard.

As expected for a RISC design, every instruction is 32 bits long. The uniform instruction length and a consistent set of op codes simplify the fetch and execution pipelines and make superscalar processing practical and efficient.

There are two implementations: a 32-bit implementation with 32-bit registers, and a 64-bit implementation with 64-bit registers. Both have 32 general-purpose registers and 32 floating-point registers to support the RISC design. Programs written for the 32-bit processors can be run on the 64-bit processors.

Two levels of protection, supervisor (or privileged) and problem (or user), are provided for the PowerPC.
C. IBM zSeries Mainframe

The zSeries is a CISC multiprocessing computer that can perform simultaneous computations using the pipelining and superscalar techniques found in other CPUs. Its capabilities include floating-point arithmetic, virtual storage (virtual memory), and two-level cache memory. The basic building block of a zSeries system is the multichip module, which consists of either twelve or twenty CPUs. Programs written for older IBM System 360/370/390 computers will execute on the zSeries computers.

As expected in a mainframe computer, the zSeries computers provide excellent security. There are two protection states: a problem state for application programs, and a supervisory state for control of the system. Instructions executed in the problem state are not permitted to modify system parameters.