Chapter 2 Designing a CPU

2.1 Hardware and Software Processing Environment

We have learned that computers from desktop PCs to mobile phones, game consoles and iPods contain some form of processor, some memory, and connections to various input and output devices. We have also seen that there are various classes of processor. Some are general-purpose programmable engines (such as the various Intel and AMD families); others are programmable but specialised, such as Texas Instruments' Digital Signal Processing (DSP) processors; and others are application-specific integrated circuits (ASICs), which are embedded into dedicated products such as calculators, engine-management systems or microwave cookers, delivering application-specific functionality; they are hardwired and not malleable via programming.

In this chapter, we are interested in the general-purpose programmable processor like the one probably now at work in your PC. We shall discover which building blocks are needed to construct a simple, although meaningful, Central Processing Unit (CPU), and how these blocks are combined in a particular architecture. The details of the inner workings of our CPU are discussed, and you will be presented with a Java simulator of our CPU to investigate for yourself how it works; you'll even write some assembler programs.

As its name suggests, the CPU lies at the centre of other components such as program memory, data memory and input-output devices. Data memory holds numbers, words, whole text documents, music, images or any other information which can be distilled into bits and bytes, as we have already seen. The program memory holds a list of atomic instructions which the CPU fetches, then executes. Segments of code and data memory are often situated on the CPU chip itself, where they are referred to as code (program) cache and data cache. These are highlighted in Fig.1, which shows the layout of the Intel Pentium CPU die. You should know that the actual size of this die is ?? and that it contains over 3 million transistors! We shall discover how the functional blocks outlined on this chip can work together to provide the most fundamental level of processing, which ultimately supports our "User Application" such as an Excel spreadsheet or a computer game such as Unreal Tournament 2004.

Figure 1. Photograph of the Intel Pentium chip die, shown with an overlay of the main functional blocks. This chip contains 3.2 million transistors. Acknowledgement here.

Let's consider first how programs are represented, both to the user and, fundamentally, in program memory, ready to be fed into the Central Processing Unit. As so often in computing, it's useful to consider a hierarchy of levels (see Fig.2). At the top of the hierarchy is what you, the user, see when you open up, say, an Excel spreadsheet. There is a formula in cell B3 which adds the values of cells B2 and C3 together. This is how you write a "program" in Excel. But like most applications, Excel has been constructed in a "High-Level Language" (HLL), such as "C", which you can think of as sitting at a middle level. In this language, an addition is coded as the statement "w = x + y". C is a high-level language, which means it cannot be fed directly into the CPU hardware and executed. Instead it must be compiled into a series of atomic or primitive instructions which the CPU hardware can execute; this is the bottom of our hierarchy shown in Fig.2.
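To make the correspondence between the middle and bottom levels concrete, here is the same addition written as a tiny method in Java (the language in which the SAM simulator itself is written, and the language we shall use for all the code sketches in this chapter). The comments suggest, roughly, the kind of assembler a compiler could produce for it; this is an illustrative sketch, not the output of any particular compiler.

    // The addition "w = x + y" as it might appear at the High-Level-Language level.
    // The comments indicate, roughly, the assembler of Fig.2 that a compiler could emit.
    class AdditionAtTwoLevels {
        static int add(int x, int y) {
            int w = x + y;   // mov ax,[x]   - fetch x from memory into a register
                             // mov bx,[y]   - fetch y from memory into a register
                             // add ax,bx    - add the two registers inside the ALU
                             // mov [w],ax   - write the result back out to memory
            return w;
        }
    }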
This fundamental-level assembler program, shown at the bottom of Fig.2, consists of some "mov" instructions (we'll see what they are shortly) and an "add", which is clearly where the addition takes place.

Figure 2. The operation of addition shown at a number of levels. At the top, the computer user sets up an addition in the application Excel. In the middle, the operation of addition is coded within Excel in a High-Level Language such as "C" (w = x + y). At the bottom, this High-Level addition is coded as a number of atomic "mov" and "add" operations (mov ax,[x]; mov bx,[y]; add ax,bx; mov [w],ax) which are fed into the CPU hardware and executed.

The chip layout shown in Fig.1 is now hopefully beginning to make sense. The code cache holds these primitive instructions, which are fetched into the "execution unit" (also known as the "arithmetic logic unit", or "ALU") where they are invoked. Data involved in the processing is stored in a separate area of memory known as the "data cache".

2.2 A Minimalist CPU Architecture

Let's start designing our CPU, which will be able to run our exemplar Excel addition. How do we add, and for that matter subtract, multiply and perform other arithmetic functions? How do we check to see if one number is greater than another, or negative (these are called "logical" operations)? Clearly we need an "Arithmetic Logic Unit" (ALU), containing electronic circuits to perform these tasks. The electronic-engineering symbol for an ALU is shown in Fig.3(a), where it is performing an addition. There are two important points to note about this ALU symbol. First, data comes in at the top, where there are two inputs (since we must add two numbers), and the result exits at the bottom through a single output. The routing of data through the entire CPU chip is called the "data path", and consists of a complex engineered network of paths for data to follow. The second point to note is the signal arriving at the side of the ALU. This is a "control" signal which informs the inner ALU circuitry whether it should add or subtract (or perform some other operation). The routing of control signals is known as the "control path".

Figure 3. Steps in building a simple CPU: (a) shows the ALU adding two numbers (3 and 5 in, 8 out, under an add/subtract control signal); (b) the data cache which contains these numbers is added; the memory address register (MAR) points here to cell 4, which receives the result 8; (c) the code cache is now added, which contains the program code (0: mov ax,[0]; 4: mov bx,[1]; 8: add ax,bx; 12: mov [4],ax); the instruction pointer (IP) points to line 8, which contains the add instruction; (d) registers ax and bx, used in data movement, and the MDR (memory data register) are added, together with the instruction register (IR) used to decode the program code.

Where do all these inputs to the ALU come from? The data clearly originates in the data cache, but what about the control signal? This originates from the program code, which informs the ALU whether an add or a subtract (or other) operation must be performed. For example, the line of code "add ax,bx" could set a logic "1" on the add/subtract control input, while a "sub ax,bx" would put a logic "0" on this line. The ALU's internal electronic circuitry reads the control line state and effects the appropriate operation.
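The role of the control signal can be mimicked in a few lines of Java. The sketch below is purely behavioural; it is not how an ALU is built electronically, and it is not taken from the SAM simulator's source code.

    // Minimal ALU sketch: two data inputs arrive "at the top", a control signal
    // selects the operation, and a single result leaves "at the bottom".
    class Alu {
        static final int ADD = 1;   // control line high: add
        static final int SUB = 0;   // control line low: subtract

        int execute(int a, int b, int control) {
            if (control == ADD) {
                return a + b;       // e.g. 3 + 5 = 8, as in Fig.3(a)
            } else {
                return a - b;
            }
        }
    }

Calling execute(3, 5, Alu.ADD) returns 8, while execute(3, 5, Alu.SUB) returns -2; the same data path is used, and only the control signal differs.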
It's interesting to note that an ALU may be able to add more than just individual numbers. The Intel architecture has been extended to add images (well, at least parts of images). This is known as the "MultiMedia Extension", or "MMX", architecture: clearly an important response to market demand for powerful multimedia applications.

Let's now take our basic ALU and add a data cache and then a program cache block. First, the data cache. This is connected to the ALU as shown in Fig.3(b). In the example shown, two numbers are moved from the data cache into the ALU, which then forms the sum, and this in turn is written back into the data cache. The important point is that the ALU and the data cache need to communicate, and this communication needs to be coordinated in time. Also, the correct data has to be fetched, and since data is located in a cell with an address, the data cache circuits have to be provided with the correct address to retrieve the correct data. This address is stored in the "Memory Address Register" (MAR). Registers are small, high-speed memory cells with dedicated functions located on the CPU; the MAR is dedicated to pointing to the correct cell within the data cache when data is retrieved or stored.

Now let's add in the program cache (see Fig.3(c)). This contains the list of atomic assembler instructions which will ultimately send those control signals we mentioned above to the ALU (and most other components) to actually execute our program. Assembler instructions are stored in a list and are fetched into the CPU internals one at a time, so there needs to be a pointer to the instruction currently being read in and processed; in other words, the address of this instruction in code memory. This is the function of the "Instruction Pointer" (IP), which is stored in the IP register, a register which, just like the MAR, is dedicated to this pointing function. Note there is a nice symmetry here: both code and data cache have associated address registers pointing to either the code or the data currently being accessed.

Let's dwell for a moment on these registers. Registers are very useful CPU components, and a typical CPU will contain many of them. It's useful to think of them as "parking places" where data can be stored temporarily as it moves through the CPU circuits. A register may hold the result of an intermediate calculation, or just store or "buffer" a data element while it is waiting for the following CPU component to become available. Two important registers, "ax" and "bx", are located at the inputs to the ALU. In fact, when data is brought into the ALU, it is first loaded into registers "ax" and "bx". These registers are available to the assembler programmer, e.g. they appear in the instruction "add ax,bx" seen above. The Intel x86 (aka "Pentium") architecture contains a small number of registers such as "ax", "bx", "cx", "dx", "si", "di", etc.

There are other registers in the CPU which are not available to the programmer. One such register is used to buffer data going in and out of the data memory. This is the "Memory Data Register" (MDR), whose purpose is to coordinate the movement of data. The MAR, the IP and the IR (the instruction register, which we shall discuss later) are other examples.

Our minimalist CPU is almost complete. From the diagram in Fig.3(d) you can see how the components introduced above are connected together. These connections show the "data path" (how and where data is shunted around); the control signals are not explicitly shown. There is one more register shown here, the "Instruction Register", which we shall discuss at the end of this chapter. For the moment, glance forwards to Fig.??, which shows a screenshot of the Java simulator you will use in the Activities presented at the end of this chapter, and check out the location of the components introduced here.
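To summarise the components collected so far, the following Java sketch simply lists them as fields of a class. The names and memory sizes are chosen for illustration only; they echo Fig.3(d) rather than any real Intel part or the SAM simulator's internals.

    // The building blocks of the minimal CPU of Fig.3(d), written as a data structure.
    class MinimalCpu {
        int ax, bx;                 // programmer-visible registers at the ALU inputs
        int ip;                     // Instruction Pointer: address of the current instruction
        int mar;                    // Memory Address Register: which data-cache cell to access
        int mdr;                    // Memory Data Register: buffers data to/from the data cache
        String ir;                  // Instruction Register: holds the instruction being decoded
        String[] codeCache = new String[16];   // program memory: one instruction per cell
        int[] dataCache = new int[16];          // data memory: numbers (or icons in SAM)
    }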
This CPU is named "SAM" ("Simple Although Meaningful"); it's important to give a CPU an interesting name, to help with marketing, like "Pentium", "Athlon", "Itanium", "SAM".

2.3 The Instruction Set and its Function

Lines of assembler code such as "add ax,bx" form part of the Instruction Set of the CPU, a coherent set of atomic instructions which has been designed as an integrated whole to enable the CPU to perform useful high-level functions. The complete set of SAM's instructions is shown in Table 1. We have designed SAM to implement a subset of Intel's Pentium instructions, so as you experiment with the SAM simulator, you will also be learning how to program a real Intel Pentium in assembly language. Let's take a couple of these instructions individually and see how they can be combined to implement a program which adds two numbers in the data cache and stores the result back into data memory.

    Mnemonic  dest,src   Operation
    mov       ax,bx      moves the contents of register bx into register ax
    mov       ax,[1]     moves the contents of memory address 1 into ax
    add       ax,bx      adds the contents of bx to ax; the result is put into ax
    inc       ax         increases the contents of ax by 1
    jmp       c          go to the line of code at address c

Table 1. SAM's "Instruction Set". Each line comprises the instruction mnemonic, and the destination and source of the data to be processed. Intel's convention places the destination before the source.

Let's continue with our example of adding two numbers. The assembler program which does this is listed in Fig.5. Let's see what we need to do. First we have to get the numbers (data) from memory into the registers ax and bx, from where they will pass into the ALU.

    0   mov ax,[1]
    4   mov bx,[2]
    8   add ax,bx
    12  mov [4],ax

Figure 5. Simple assembler program to add two numbers located at data memory addresses 1 and 2, and to write the result back to data memory address 4. The numbers on the left are the addresses of the instructions in code memory.

This is the purpose of the two instructions "mov ax,[1]" and "mov bx,[2]". The first instruction, mov ax,[1], moves data into ax; the second moves data into bx. This may be a little surprising, but Intel designed all their instructions so that the destination of the data follows straight after the name of the instruction, here "mov". So in general our instructions will all have the form

    mnemonic destination, source

Consider the instruction mov ax,[1]. What's the source of the data being moved into ax? You may think that this instruction loads ax with the number 1, but that's not the case. The square brackets indicate that we are loading from memory, and that the "1" is the address in memory we're loading the data from. Therefore, as shown in Fig.3(d), these two lines of code move the numbers 3 and 5 into registers ax and bx respectively.

Now let's turn to the instruction add ax,bx. This does just what it says, and will add 3 and 5 within the ALU circuitry. But where is the result 8 which comes out of the ALU deposited? Following Intel's convention, the result is placed in the destination register, ax, which will now contain 8. Of course, if we had written the line of code "add bx,ax", we would still have got 8, but it would have been written into bx. But we didn't, so the situation we now have is shown in Fig.3(c).

Finally, to complete this example of addition we must move the result from ax back into memory, let's say to address 4. Again following Intel's convention, we need to use the instruction "mov [4],ax", which moves the contents of register ax (the source) into memory address 4 (the destination).
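As an aside, the destination-before-source convention and the square-bracket notation are easy to mechanise. The Java sketch below shows one hypothetical way a two-operand instruction line could be broken into its parts; SAM's own decoder is not written this way, and the class and method names are invented for illustration.

    // Split a two-operand line such as "mov ax,[1]" into mnemonic, destination and source,
    // and report whether the source names a memory address (square brackets) or not.
    class InstructionFormat {
        static void describe(String line) {
            String[] parts = line.trim().split("\\s+", 2);    // e.g. "mov" and "ax,[1]"
            String mnemonic = parts[0];
            String[] operands = parts[1].split(",");           // e.g. "ax" and "[1]"
            String dest = operands[0].trim();
            String src = operands[1].trim();
            String srcKind = src.startsWith("[") ? "the contents of memory address " + src
                                                 : "register or immediate value " + src;
            System.out.println(mnemonic + ": destination " + dest + ", source " + srcKind);
        }

        public static void main(String[] args) {
            describe("mov ax,[1]");   // source is memory address [1], not the number 1
            describe("add ax,bx");    // source is register bx
        }
    }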
That's about all you need to know to get a good feel for how to write assembler; more instructions will be introduced and explained in the Activities. Now we must move on to examine the details of this computational process. For the moment, it's worth reflecting on the general principle of this architecture: numbers are loaded from memory into registers, operated upon, and the result, dropped into a register, is then explicitly written back into memory. We can't add numbers directly in memory without passing them through the registers. We'll have more to say about this later.

2.4 The Fetch-Execute Cycle

When a program is running, there is potentially a large amount of data moving through the data path, and some organisation of control is required to orchestrate the desired computational result. Think of your town centre, with traffic lights, bus lanes and a sprinkling of one-way streets. The data elements (people?) move in complex patterns on paths controlled by traffic lights, bollards and other signs, policemen and common sense. But unlike most town centres which, despite their original planning, have grown organically, i.e. over time without global organisation, computer architects have had the freedom to design simple and efficient control mechanisms from scratch.

The control mechanism within a typical CPU uses a prescribed sequence of stages, each of which carries out a particular class of action. This sequence is known as the "Fetch-Execute" cycle. As we shall discover, each line of assembler takes a cycle of five stages to execute, whether it's a mov operation, an add operation, or anything else. In each stage of the cycle a specific movement of data or other action takes place; e.g. on stage 3 any ALU operations required by the assembler instruction are carried out, and on stage 4 any memory access is carried out. Let's run around this cycle for the add ax,bx instruction, noting the actions and data movement. Fig.5 gives a diagrammatic summary of the five stages in the cycle; Fig.6 and Fig.7 show two examples presented as screenshots from the SAM simulator you shall use in this chapter's Activities.

Let's first consider the "add ax,bx" instruction. Fig.5 presents the overview, while Fig.7 shows the details of the add ax,bx instruction's operation. Refer to both figures while reading the following commentary.

    1. Fetch the instruction from the code cache (load the instruction "add ax,bx")
    2. Decode the instruction; do any register reads (read ax and bx into the ALU)
    3. Do any ALU operations (do the add, ax + bx)
    4. Do any access to memory, read or write (nothing to do here)
    5. Do any writes to registers (write the result into ax)

Figure 5. Fetch-Execute cycle showing the five stages in the processing of the "add ax,bx" instruction.

Stage 1. Here the instruction is fetched from code memory and placed into the instruction register.

Stage 2. Here the instruction in this register is "decoded": it is converted to electrical control (and perhaps data) signals which spread through the CPU. Also, any register reads are performed. The instruction add ax,bx loads the contents of both ax and bx into the ALU; since these are register reads, they must happen now.

Stage 3. Here any ALU operations required by the instruction are performed.
Since our current instruction is an "add", the required addition is carried out in this stage.

Stage 4. Here any memory access is performed, which means either routing data into memory from a register, or reading data out of memory into a register. The add instruction makes no reference to memory, so nothing happens in this stage.

Stage 5. Here registers are written to if required: any data located in a parking place (e.g. the MDR, or the ALU output) is written into a register. Since our add ax,bx writes the result of the addition into ax, that write happens here.

It's important to realise that the classes of action carried out at each stage of the FE-cycle are always the same, irrespective of the actual assembler instruction being fetched and executed during those five stages. As mentioned above, all ALU operations happen at stage 3. We shall return to this point in ??.

You may have noticed that only four of the five stages were actually used during the fetch and execution of our add. Stage 4 did nothing; it was apparently a waste of time. Would it not have been more efficient to have used a four-stage cycle? Perhaps, but other instructions need the full five stages, and it makes the design of the control circuitry simpler (and therefore cheaper) if a common five-stage cycle is used, even though not all instructions need the full five stages.

Figure 6. Fetch-Execute cycle for the mov ax,[1] line of code in the SAM simulator. (1) The instruction is placed into the IR, and the address "1" is sent towards the MAR. (2) The instruction is decoded and the control signals generated (not shown). (3) There is no ALU operation to perform, but the address "1" is written into the MAR. (4) The memory access is performed: the data from the cell whose address is in the MAR (cell "1") is placed into the MDR. (5) The data in the MDR (Lisa) is written into the register ax.

Let's now take a second example of the fetch-execute cycle in operation, for the instruction mov ax,[1], which moves the contents of memory address 1 into register ax (see Fig.6).

Stage 1. The instruction mov ax,[1] is fetched from code memory into the instruction register.

Stage 2. Once in the instruction register, it is decoded and the control signals generated. Any register reads are done in this stage. We do not read a register in this instruction (ax is the register we shall write to), but we do need to get at the address 1. That is read out of the instruction register and placed onto the data path to start its journey to the MAR.

Stage 3. Any ALU operations are done in this stage. There are none to do for this "mov" operation, but here the address 1 is placed into the MAR parking place.

Stage 4. This is the opportunity to do any memory access. Here "mov ax,[1]" implies reading the data at memory address 1, and placing the result into the MDR parking place.

Stage 5. Finally, any register writes are performed. Here the data loaded into the MDR in the previous stage is written into the destination register, ax. The move is now complete.

Figure 7. Fetch-Execute cycle for the add ax,bx instruction: (1) the instruction is fetched into the IR and (2) decoded; here also the register reads are performed, and Lisa and Marge (which have already been moved into ax and bx) are read into the ALU. (3) Here the ALU operation of addition is performed; just as with Intel's MMX instructions, images are added. (4) There is nothing to do here on the memory access stage. (5) The result is taken from the ALU and written back into the destination register ax.
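The regularity of the cycle, the same five stages in the same order whatever the instruction, can be captured in a few lines of Java. The sketch below is a highly simplified model covering just the instructions met so far; it is not the SAM simulator's source code, the field and method names are invented, and the instruction pointer here steps by 1 through an array (SAM's real code addresses step in fours: 0, 4, 8, 12).

    // A toy model of the five-stage Fetch-Execute cycle for "mov ax,[n]",
    // "mov [n],ax", "mov bx,[n]" and "add ax,bx". Illustrative only.
    class FetchExecuteDemo {
        int ax, bx, ip, mar, mdr, aluResult;
        String ir;
        String[] code = { "mov ax,[1]", "mov bx,[2]", "add ax,bx", "mov [4],ax" };
        int[] data = { 0, 3, 5, 0, 0 };

        void step() {
            // Stage 1: fetch the instruction from code memory into the IR.
            ir = code[ip];
            ip = ip + 1;

            // Stage 2: decode; perform any register reads (needed by "add").
            String[] p = ir.split("[ ,]+");    // e.g. ["mov", "ax", "[1]"]
            int aluA = ax, aluB = bx;

            // Stage 3: any ALU operation; otherwise a memory address goes into the MAR.
            if (p[0].equals("add")) {
                aluResult = aluA + aluB;
            } else if (p[1].startsWith("[")) {
                mar = Integer.parseInt(p[1].replaceAll("\\D", ""));   // destination is memory
            } else if (p[2].startsWith("[")) {
                mar = Integer.parseInt(p[2].replaceAll("\\D", ""));   // source is memory
            }

            // Stage 4: any memory access, via the MAR and MDR.
            if (p[0].equals("mov") && p[2].startsWith("[")) {
                mdr = data[mar];                              // memory read, e.g. mov ax,[1]
            } else if (p[0].equals("mov") && p[1].startsWith("[")) {
                data[mar] = p[2].equals("ax") ? ax : bx;      // memory write, e.g. mov [4],ax
            }

            // Stage 5: any register write-back.
            if (p[0].equals("add")) {
                ax = aluResult;                               // result dropped into ax
            } else if (p[0].equals("mov") && p[2].startsWith("[")) {
                if (p[1].equals("ax")) ax = mdr; else bx = mdr;
            }
        }

        public static void main(String[] args) {
            FetchExecuteDemo cpu = new FetchExecuteDemo();
            for (int i = 0; i < 4; i++) cpu.step();           // four instructions, five stages each
            System.out.println("data[4] = " + cpu.data[4]);   // prints 8
        }
    }

Notice that every instruction passes through all five stages, even when a stage has nothing to do, exactly as argued above.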
There are some interesting observations we can make from these two examples. First, the data pathways and fetch-execute stages can be used to handle data from several sources: a register read and an address read can both take place at stage 2, and an ALU operation and the insertion of an address into the MAR can both occur at stage 3. The CPU architecture has been designed by the engineer to allow this to happen. While it is beyond the level of this text and the SAM-2 simulator to explain these issues, a more advanced SAM-4 simulation and tutorial is available from the author's web-site.

Second, while some parts of the data path are one-way (the inputs to the registers are at the top and the outputs at the bottom, and the same is true of the ALU, which means that no direct write from the MDR into a register along the short paths can occur; the data is forced to take the long way round), other parts of the data path are bidirectional, such as the long path connecting the instruction register, the registers ax and bx, the ALU output, the MAR and the MDR. The five-stage fetch-execute cycle has been designed so that data can only flow one way along these paths at any one time, and, of course, so that no two different data elements can be placed on the same part of the data path at the same time. So there's quite a lot of organisation packed into the fetch-execute cycle.

2.5 The Structure of the Intel Pentium

It is useful to revisit the photograph of the Intel Pentium chip die and identify where the activity associated with the five fetch-execute stages is located. This is shown in Fig.8. Of course this labelling is somewhat simplistic, but it nevertheless gives a good indication of where data is flowing.

Figure 8. Pentium chip die with an indication of the data flow during the five stages of the Fetch-Execute cycle, shown as labelled arrows 1-5.

2.6 Advanced Issues. Why the 5-stage Fetch-Execute Cycle?

2.7 Activities

In these Activities, we'll use a Java applet, "SAM-2", to simulate and investigate the working of a CPU plus memory. SAM ("Simple Although Meaningful") consists of a CPU based on an "Instruction Set Architecture", and contains a small set of instructions to allow storage and retrieval of data from memory, and simple mathematical operations in the ALU. Its instructions look very like Intel's Pentium instructions, so when you learn SAM's instructions, you are also learning the Pentium's. The simulation applet is located on the CD, together with Sun's Java runtime environment, which you need to install on your machine.

The first Activities are designed to help you get to grips with the fetch-execute cycle, subsequent ones teach you how to program in assembler, and finally there are a series of tasks where you are invited to write assembler programs to solve some problems. But first let's take a quick tour of SAM, to discover where the various elements of its user interface are located.

Figure ?? Screenshot of the SAM CPU simulator, showing the buttons to load prepared code and data, the stepping button to move through the 5 stages of the fetch-execute cycle, and the area where you can write your own program.

(1) At the left is the programming area ("Code Memory") where you will soon write your own assembler code. But first we shall load prepared examples using the (2) yellow C-for-code buttons in the menu bar.
(3) Green D-for-data buttons are used to load various data sets, either images or numbers, into the memory ("Data Memory"); SAM can work with both numbers and images. (4) The "cycle-step" button steps through each stage of the Fetch-Execute cycle. Remember, each assembler instruction takes five steps, i.e. one complete cycle.

Using SAM is easy: you select a program (or write one), you select a data set, and you run the program by repeatedly pressing the cycle-step button. As you progress, SAM responds by highlighting in red the line of code you are currently executing, and by displaying the stage you are at within the FE-cycle at any time. So you see a slow-motion execution of your program.

ACTIVITY 1

Question 1. Load the program "Yellow 1". The first instruction is mov ax,[1]. This means "load register ax with the contents of memory at address 1". Run through the 5 stages of the Fetch-Execute cycle and note down what happens at each stage (use vocabulary from the diagram above). Write down in simple English how the data from address 1 gets into register ax.

Question 2. Now step through the second line of code and note the difference in where the data is written: into bx.

Question 3. Finally, step through the third line of code (add ax,bx). Write down what happens on each of the 5 stages of the Fetch-Execute cycle. On which stage does nothing much happen?

Question 4. Load up program "Yellow 2". This contains some code like "mov [2],ax" which moves the contents of ax into memory. Investigate how this instruction works. Note down the 5 stages for this instruction.

Question 5. Load up program "Yellow 3", which contains a mixture of moves from registers to memory and from memory to registers. Make sure you understand what is happening. Don't write anything down. Try experimenting with numeric as well as iconic data.

Question 6. Program "Yellow 4" is intended to reveal to you how to move data from register to register, e.g. the mov ax,bx instruction. Experiment and learn. Note down anything you wish.

Question 7. Program "Yellow 5" concerns addition. Add it to your understanding. Mmm. Actually this program also introduces a new instruction, "move immediate", which is not the same as a move. The first line is "mvi ax,2", which does not load from memory at address 2. Find out what "move immediate" does.

ACTIVITY 2

Question 8. Write a program (first on paper) to load X into AX and Y into BX. That's all. (You will find these icons in dataset 1.) Now write the program in the applet and check that it works.

Question 9. Write a program (first on paper) to swap the positions of Bart and Crusty the Clown in memory. Yes, you must first mov them into registers. Yes, now enter your code in the applet and pray.

Question 10. Write (yes, first on paper) a program to add up the first three numbers of dataset 5, and to write the result back into memory. (Look at Yellow 1 for inspiration.)

Question 11. Now, having first written it on paper, write a program to mov in Crusty the Clown and to overwrite all the memory icons with Crusty the Clown.

Question 12. Write a program to reverse the order of the icons in any data set. Who needs paper?

ACTIVITY 3 Web Searches

Web Search 1. See how many images of CPU chips you can find. Try words like "chip die photo" in your search engine.

Web Search 2. Find out all you can about the latest "64-bit CPU chips" from Intel and AMD.

Web Search 3. Manufacturers give each chip a name, a sort of nickname. Here are some examples: "Sledgehammer", "Willamette", "Katmai".
Compile a list of nicknames and group them into manufacturers' families.