Chapter 5: Computer Systems Organization:

Having looked at the building of circuits, it is now time to look at the computer as a collection of these units that allows the processing of input data to produce results (output data), as well as the ability to store data. Note that all of the functional units of a computer are built from gates and circuits. As such, everything processed and/or stored in the computer is converted to its binary equivalent, because the gates/circuits only deal with binary. Because it requires millions of gates/circuits to build a computer, to understand better how a computer works we will group these gates/circuits into subsystems (each consisting of millions of such gates/circuits) and study the subsystems instead.

The Von Neumann Architecture:

Computers available in today's market vary in size, memory capacity, speed and cost – supercomputers, mainframes, minicomputers, workstations, laptops and tiny handheld digital assistants. Regardless of all of these differences, almost every computer in use today has the same basic design (one theoretical alternative is the Turing machine). This is known as the Von Neumann architecture, after John von Neumann, who first proposed the architecture in 1946. The architecture is based on the following three characteristics:

1. A computer constructed from 4 major subsystems called memory, input/output, the arithmetic/logic unit (ALU) and the control unit.
2. The stored program concept, in which the instructions to be executed by the computer are represented as binary values and stored in memory.
3. The sequential execution of instructions. One instruction at a time is fetched from memory to the control unit, where it is decoded and executed.

Memory and Cache:

All information stored in memory is in binary format. Memory is usually referred to as RAM. RAM is volatile and has the following characteristics:

1. Memory is divided into fixed-size units called cells, each of which is associated with a unique address (these addresses are the unsigned integers 0, 1, 2, ...).
2. All accesses to memory are to a specified address, and a complete cell is always fetched or stored (the cell is the minimum unit of access).
3. The time it takes to fetch or store a cell is the same for every cell.

ROM is nothing but a RAM with the ability to store information disabled.

The cell size or memory width is fixed, i.e. a fixed number of bits makes up a cell – typically 8 bits, or 1 byte. Thus the largest unsigned integer that can be stored in one cell is 255 (11111111 in 8 bits). Hence, computers with a cell width of 8 bits use multiple cells to store large numbers. For example, 2 or 4 bytes are used to store one whole number, 4 or 8 bytes to store one real number, and 1 byte to store one character. However, fetching a whole number from memory then involves two (or four) trips – one cell at a time.

The size of memory depends upon the number of bits available to represent the addresses of the cells. The maximum memory size is 2^n cells, where n is the number of binary digits available to represent an address. Most computers today have at least 32 bits available for memory addresses (about 4 billion addressable cells), with 1 GB or 2 GB of installed memory being very common.

Note carefully the difference between an address and the contents of that address:

    Address: 64        Contents: 11111111

Of course, the address value 64 would itself be represented in binary – as a 16-bit value, a 24-bit value, etc., depending on the addressing scheme.
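To make the address/contents distinction concrete, here is a minimal Python sketch of memory as a row of fixed-size cells; the variable names and the 16-bit address width are illustrative assumptions, not part of the architecture itself:

    # A minimal sketch of memory as 2^n fixed-size cells, assuming
    # 8-bit cells and a 16-bit addressing scheme (both illustrative).
    ADDRESS_BITS = 16                # n: bits available for an address
    NUM_CELLS = 2 ** ADDRESS_BITS    # maximum memory size is 2^n cells

    memory = [0] * NUM_CELLS         # each cell holds one byte (0..255)
    memory[64] = 0b11111111          # the cell at address 64 now holds 255

    print(NUM_CELLS)                 # 65536 cells for a 16-bit address
    print(memory[64])                # 255 - the contents, not the address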
As noted earlier, the two basic operations on memory are fetching and storing. Fetching means bringing back a copy of the contents of the memory cell with a particular address; the original contents of the memory cell remain unchanged. Storing means putting a specified value into the cell with a given address; the previous contents of that cell are lost.

The memory access time is the same for all 2^n addresses and is currently about 5 to 20 nanoseconds (1 nanosecond is 1 billionth of a second). This is the time required to carry out one memory operation – either a fetch or a store.

Note:
1 millisecond = 1/1,000 second = 10^-3 second (millisec or ms)
1 microsecond = 1/1,000,000 second = 10^-6 second (µsec or µs)
1 nanosecond = 1/1,000,000,000 second = 10^-9 second (nsec or ns)
1 picosecond = 1/1,000,000,000,000 second = 10^-12 second (psec or ps)
1 femtosecond = 1/1,000,000,000,000,000 second = 10^-15 second (fsec or fs)
1 attosecond = 10^-18 second (asec or as)
1 zeptosecond = 10^-21 second (zsec or zs)
1 yoctosecond = 10^-24 second (ysec or ys)

MAR and MDR (memory address register and memory data register):

These are the memory registers used to hold an address and data respectively. As indicated, a memory operation requires two pieces of information – the address of the cell and the contents of the cell. The MAR must be at least n bits wide, so that it can hold the address of any of the 2^n cells available. The size of the MDR is usually a multiple of 8, where 8 bits is the typical size of a cell. This is because, for example, a whole number requires 2 (or 4) bytes of storage, so the MDR would have to be at least 2 bytes wide in that case. Typical sizes for the MDR are 16, 32 or 64 bits. The operations are as follows:

Fetch (address)
1. Load the address into the MAR.
2. Decode the address in the MAR.
3. Copy the contents of that memory location into the MDR.

Store (address, value)
1. Load the address into the MAR.
2. Load the value into the MDR.
3. Decode the address in the MAR.
4. Store the contents of the MDR into that memory location.

Decode address in MAR: A decoder circuit could be used to decode the integer number in the MAR so that that particular cell is referenced. A decoder circuit has n inputs and 2^n outputs. For example, if n = 3 then there are 3 inputs and 8 outputs, as the following figure shows.

[Figure: a 3-to-8 decoder. The 3-bit MAR feeds the decoder's input lines A, B and C; the 8 output lines, numbered 0 to 7, correspond to the addresses 000 through 111.]

This is a 3-to-8 decoder circuit (there are 3 inputs and 8 outputs). For example, if the MAR contains the 3-bit address 101 (decimal 5), then only the 6th output line (line 5) will be ON; all of the other output lines will be OFF. Thus, such a decoder could be used to decode whatever exists in the MAR and hence select the cell in question. Of course, with an addressing scheme of 16 bits, a decoder of this nature would need 16 inputs and 2^16 outputs – a thoroughly impractical circuit. Instead, memory is arranged as a 2-dimensional structure in row-major order, and the desired memory cell is the one at the intersection of the selected row and column. The book gives a good example on page 197 and page 199 (for the complete diagram).
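The fetch and store sequences listed above can be sketched in Python as follows. The decode step, which a real machine performs with decoder hardware, reduces in software to simple indexing; all names here are illustrative:

    # A sketch of the fetch/store operations using MAR and MDR,
    # following the numbered steps above. Names are illustrative.
    memory = [0] * (2 ** 16)   # 2^n cells for a 16-bit MAR

    MAR = 0   # memory address register: holds the address
    MDR = 0   # memory data register: holds the data

    def decode(address):
        # A hardware decoder turns ON exactly one of its 2^n output
        # lines; in software this is simply selecting the index.
        return address

    def fetch(address):
        global MAR, MDR
        MAR = address            # 1. load the address into the MAR
        cell = decode(MAR)       # 2. decode the address in the MAR
        MDR = memory[cell]       # 3. copy the cell's contents into the MDR
        return MDR               # original contents remain unchanged

    def store(address, value):
        global MAR, MDR
        MAR = address            # 1. load the address into the MAR
        MDR = value              # 2. load the value into the MDR
        cell = decode(MAR)       # 3. decode the address in the MAR
        memory[cell] = MDR       # 4. overwrite the cell with the MDR

    store(64, 0b11111111)
    print(fetch(64))             # 255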
The fetch/store controller:

We need to know whether an operation is a fetch or a store, and the fetch/store controller determines this for us. The controller is like a traffic officer – it directs the movement of data between the MDR and memory. It receives an F-signal (from the processor) when the operation is to be a fetch and an S-signal when it is to be a store. As such, there ought never to be a 'traffic jam', and a very smooth operation results.

Cache Memory:

Even though computers can run at very high speeds, bottleneck situations still exist: the processor would have to sit and wait for information to be brought from memory. To ease this problem, the size of RAM could be increased, or another form of memory could be found. Thus cache memory was born. A cache is much smaller than RAM (often 512 KB to a few megabytes) but about 10 times faster. The cache is built on the Principle of Locality, which is simply the idea that if something is used now, there is a very good chance that it will be used again in the very near future. On this basis, information (instructions and data) that is currently in use is regarded as likely to be needed again soon, and so it is stored in the cache while the original copy remains in RAM. Note that not all data is kept in the cache; if the cache is full, the least used data (which is often the oldest data) is removed to make room for the new data. The process used to get information is as follows:

1. Look first in the cache to see whether the information needed is there. If it is, the access time will be much faster.
2. If it is not in the cache, go through the process described earlier – i.e. get the information from RAM at the slower speed.
3. Copy the data just fetched into the cache.

This scenario can reduce access time considerably. For example, if the RAM access time is 20 nsec, the cache access time is 5 nsec (quite realistic figures today) and the cache has a hit rate of 75% (again a realistic figure), then:

Average access time = (.75 x 5) + (.25 x (5 + 20)) = 10 nsec

which is much less than if all information were in RAM with an access time of 20 nsec.
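The average-access-time calculation above assumes that every access checks the cache first, so a miss pays the cache-check time plus the RAM access time. A small Python sketch of that model (function name is illustrative):

    # Average memory access time with a cache, under the model in
    # the example above: hits cost the cache time; misses cost the
    # cache-check time plus the RAM access time.
    def average_access_time(hit_rate, cache_time, ram_time):
        hit = hit_rate * cache_time
        miss = (1 - hit_rate) * (cache_time + ram_time)
        return hit + miss

    # Figures from the example: 5 nsec cache, 20 nsec RAM, 75% hit rate.
    print(average_access_time(0.75, 5, 20))   # 10.0 nsec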
Input/output and Mass storage:

Mass storage devices such as tapes and disks enable us to store and retrieve data. RAM loses all of its data after the power to the computer is turned off – not so mass storage devices. These devices tend to differ between manufacturers, but common principles prevail – the i/o access method and the i/o controllers. I/o devices come in basically two forms: those that represent information in human-readable form and those that represent information in machine-readable form. The former consists of such devices as the keyboard, screen and printers. The latter are usually the mass storage devices such as disks and tapes, which are grouped into direct access storage devices (DASDs) and sequential access storage devices (SASDs).

A direct access storage device is one in which the requirement of equal access time for every unit has been dropped. There is still a unique address for each unit, but the time needed to access a particular unit depends upon its physical location and the current state of the device. A disk stores information in units called sectors, each of which contains an address and a data block holding a fixed number of characters. A fixed number of these sectors are placed in concentric circles called tracks, and the surface of a disk contains many tracks, with a read/write head that can be moved in and out so as to position itself over any track on the surface. The access time of any sector on a track is made up of three components:

1. Seek time: the time needed to position the head over the correct track.
2. Latency: the time for the beginning of the sector to rotate under the head.
3. Transfer time: the time for the entire sector to pass under the head and have its contents read by the head, or the data written to the sector.

If we assume the following, we can calculate the seek time, latency and transfer time:

Rotation speed = 7200 rev/min = 1 rev/8.33 msec (milliseconds)
Arm movement time = .02 msec to move to an adjacent track
Number of tracks per surface = 1000 (0..999)
Number of sectors per track = 64
Number of characters per sector = 1024

1. Seek time:
Best case = 0 msec (no arm movement necessary)
Worst case = 999 x .02 = 19.98 msec (must move from track 0 to track 999)
Average case = 400 x .02 = 8.00 msec (assume that on average the head must move about 400 tracks)

2. Latency:
Best case = 0 msec (the sector is just about to come under the head)
Worst case = 8.33 msec (just missed the sector, so a full revolution is needed)
Average case = 4.17 msec (one-half revolution)

3. Transfer time: 1/64 x 8.33 msec = .13 msec (the time for one sector to pass under the head)

                 Seek time   Latency   Transfer   Total
Best case            0          0        .13        .13
Worst case        19.98       8.33       .13      28.44
Average case       8.00       4.17       .13      12.30

As can be seen, the average access time is about 12 milliseconds, and in some cases it could be even less.
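The table above can be reproduced with a short Python calculation, using the same assumed figures (7200 rpm, 1000 tracks, 64 sectors per track, .02 msec of arm movement per track); the function name is illustrative:

    # Disk access time = seek + latency + transfer, in milliseconds,
    # using the figures assumed in the example above.
    REV_TIME = 60_000 / 7200        # 8.33 msec per revolution at 7200 rpm
    TRACK_MOVE = 0.02               # msec to move the arm one track
    SECTORS_PER_TRACK = 64

    def access_time(tracks_moved, fraction_of_rev_waited):
        seek = tracks_moved * TRACK_MOVE
        latency = fraction_of_rev_waited * REV_TIME
        transfer = REV_TIME / SECTORS_PER_TRACK   # one sector passes the head
        return seek + latency + transfer

    print(access_time(0, 0))        # best case:    ~0.13 msec
    print(access_time(999, 1.0))    # worst case:   ~28.44 msec
    print(access_time(400, 0.5))    # average case: ~12.30 msec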
The other type of mass storage uses the sequential access method. In this case the data has no addresses, so in the worst case the head has to search the entire medium sequentially to find the required data. Most disks today are direct access in nature, which makes them faster than sequential access devices. Sequential access is good when we want to copy an entire disk of data to another (backup) medium.

Overall, i/o devices are very slow compared to RAM: with a RAM access time of about 20 nsec and a disk access time of about 12 msec, RAM is faster by a factor of about 600,000. If this were the entire picture, the processor would often have to sit and wait for long periods until an access completed – wasting time. The i/o controller is used to help the situation. This component handles the i/o details and helps compensate for the time differences. It has a small amount of memory called an i/o buffer, and it contains enough i/o control and logic processing to operate devices such as the disk head, paper feed and screen display. It can also transmit to the processor a special hardware signal called an interrupt signal.

For example, if a line of data is to be read from memory and displayed on the screen, the data is read and placed in the i/o buffer at the very high speed associated with access from RAM. The processor then instructs the i/o controller to output the data to the screen, which is much slower. But the processor does not sit and wait for the process to be completed; it is freed to do something else while the i/o controller outputs the data to the screen. When the i/o controller is finished, it sends an interrupt signal to the processor telling it that its task is done.

Arithmetic/Logic Unit:

This part of the computer performs the mathematical and logic operations. In all modern computers the ALU and the control unit have become fully integrated into a single unit called the processor. The ALU is made up of three parts:

1. The registers.
2. The interconnections between components.
3. The ALU circuitry.

Registers are storage cells that hold the operands and/or results of an arithmetic operation. They have the following characteristics:

1. They do not have addresses but are accessed by a special register designator such as A, X or R0.
2. They can be accessed much faster than RAM because there are so few of them (typically a computer may have 16 to 32 registers, compared with millions of addresses in RAM).
3. They are not used for general-purpose storage but for specific purposes, such as holding the operands of an upcoming arithmetic operation.

Why do we need registers? The main reason is that the data they contain can be accessed faster than RAM and even cache; less time is spent locating the data. An example of the use of registers is as follows. Suppose we want to evaluate the expression:

(a + b) x (c - d)

By storing the result of (a + b) in a register, rather than in memory, it can be accessed much faster when it is needed; this intermediate result is kept in a result register. The operation proceeds as follows: the operands are copied from memory to registers – the left operand's register connected to the left BUS and the right operand's register connected to the right BUS. After the operation is carried out, the result is placed in a register connected to the result BUS. The operation to be carried out is determined by the operator, e.g. +, =, <, >, etc. The circuits that carry out these operations are built into the ALU using something like a multiplexor. A multiplexor accepts many inputs (2^n data inputs plus n selector lines) and produces one output, as shown in Figures 5.11, 5.12 and 5.13, pp. 209-211.
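The register-based evaluation of (a + b) x (c - d) described above can be sketched in Python as follows; the register names R0..R3, the alu() helper and the sample values are all illustrative assumptions:

    # A sketch of evaluating (a + b) x (c - d) with registers:
    # operands are copied from memory into registers on the left and
    # right buses, the ALU applies the operator, and the result goes
    # to a register on the result bus. Names are illustrative.
    memory = {"a": 6, "b": 2, "c": 9, "d": 4}
    registers = [0] * 4                    # R0..R3

    def alu(op, left, right):
        ops = {"+": left + right, "-": left - right, "*": left * right}
        return ops[op]

    registers[0] = memory["a"]             # LOAD a into R0 (left bus)
    registers[1] = memory["b"]             # LOAD b into R1 (right bus)
    registers[2] = alu("+", registers[0], registers[1])   # R2 = a + b
    registers[0] = memory["c"]             # LOAD c into R0 (reused)
    registers[1] = memory["d"]             # LOAD d into R1 (reused)
    registers[3] = alu("-", registers[0], registers[1])   # R3 = c - d
    result = alu("*", registers[2], registers[3])         # (a+b) x (c-d)
    print(result)                          # 40

Because the intermediate result (a + b) stays in register R2, it never has to make the slower round trip to memory before the final multiplication.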
There are several different registers available in the ALU, as the handout shows. Note that registers have different names and different sizes according to the architecture of the computer, and that the language syntax may differ for the IBM architecture as well. The IBM architecture provides registers R0 to R15 for data and/or instructions, together with instructions such as:

LOAD ... move data from memory to a register.
STORE ... store the data into memory.
MOV ... move data into a register or memory.
ADD ... addition.
SUB ... subtraction.
MUL ... multiplication.
DIV ... division.
MOD ... modulo.
INC ... increment a register by 1.
DEC ... decrement a register by 1.
CMP ... compare the 1st value to the 2nd value.
JMP ... unconditional jump.
JE ... jump if equal.
JNE ... jump if not equal.
JG ... jump if greater than.
JGE ... jump if greater than or equal.
JL ... jump if less than.
JLE ... jump if less than or equal.

The Control Unit:

The most important characteristic of the Von Neumann architecture is the stored program concept – a sequence of machine language instructions stored as binary values in memory. It is the task of the control unit to:

1. Fetch from memory the next instruction to be executed.
2. Decode it – that is, determine what is to be done.
3. Execute it by issuing the appropriate commands to the ALU, memory and i/o controllers.

Machine Language Instructions:

These instructions are expressed in binary code in the following format:

Operation Code | Address Field 1 | Address Field 2 | ...

The operation code is a unique unsigned integer code assigned to each machine language operation recognized by the hardware. If this code contains n bits, then the total number of operations available is 2^n.

An example: if the operation code is decimal 5, the address in field 1 is 45 and the address in field 2 is 70, then using a 16-bit address size and an 8-bit op code size, the following will result:

00000101   0000000000101101   0000000001000110

Putting them together, the instruction looks like:

0000010100000000001011010000000001000110
(op code | address 1 | address 2)
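The packing of the fields in this example is easy to check in Python; the encode() helper is illustrative, not part of any real instruction set:

    # Packing the instruction from the example above: an 8-bit op
    # code followed by two 16-bit address fields.
    def encode(opcode, addr1, addr2):
        # format each field as fixed-width binary and concatenate
        return f"{opcode:08b}{addr1:016b}{addr2:016b}"

    word = encode(5, 45, 70)
    print(word)            # 0000010100000000001011010000000001000110
    print(len(word))       # 40 bits total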
The address fields hold the memory addresses of the values on which the operation will work. If the computer has a maximum capacity of 128 MB of RAM (2^27 cells), i.e. at most a 27-bit addressing scheme, then each address field must be 27 bits wide (to accommodate the largest possible address). The number of address fields varies from 0 to 3 depending upon the operation – for example, an addition of two operands may require two address fields, or three (the last one for storing the result).

The set of all instructions that can be executed by a computer is called its instruction set. The instruction set differs from one computer manufacturer to another, especially between IBM-compatible computers and the Macintosh. That is one of the main reasons why programs written for IBM compatibles will not run on a Mac. The current trend in computer manufacturing is to make as small an instruction set as possible – RISC (reduced instruction set computers, 30-50 instructions) vs. CISC (complex instruction set computers, 200-400 instructions). In the case of RISC, more program instructions may be required to complete a task, but this is compensated for by the fact that processors keep getting faster; in fact, a RISC machine today can run a program 1.5 to 2 times faster than the same program would run on a CISC machine.

Machine instructions can be grouped into 4 major categories:

1. Data transfer.
2. Arithmetic.
3. Compare.
4. Branch.

Data transfer: These are operations that move information between or within the different components of the computer. The possible movements when transferring information are:

memory cell --> ALU register
ALU register --> memory cell
memory cell --> memory cell
ALU register --> ALU register

Examples of data transfer instructions:

LOAD x      Load register R with the contents of memory location x.
STORE x     Store the value found in register R into memory location x.
MOVE x, y   Copy the contents of memory cell x into memory cell y.

Arithmetic: This involves the arithmetic operations +, -, x and /, and the logical operations AND, OR and NOT. These can operate on registers and/or memory values. Examples of arithmetic instructions:

ADD x, y, z   Add the value in x to the value in y and store the result in z (a 3-address instruction).
ADD x, y      Add the value in x to the value in y and leave the result in y (a 2-address instruction).
ADD x         Add the value in x to the value in register R and leave the result in R (a 1-address instruction).

Compare: Uses the 6 comparison operators – <, <=, =, >, >= and <>. When two values are compared, the result is either TRUE or FALSE, translated into 1 or 0, and a condition code inside the processor is set to reflect the result. An example of a compare instruction:

COMPARE x, y   Compare the value in x with the value in y and set the condition codes:
Con(x) > Con(y)   sets GT = 1
Con(x) < Con(y)   sets LT = 1
Con(x) = Con(y)   sets EQ = 1

Branch: This is generally a jump operation – an instruction meaning to jump to an indicated point in the program. The Von Neumann architecture is built to execute instructions sequentially by default; when there is a need to skip certain instructions, a jump instruction is used. Examples of branch instructions:

JUMP x     An unconditional jump: go straight to the instruction in memory cell x.
JUMPGT x   If GT = 1, continue with the instruction in memory cell x; otherwise, keep going sequentially with the next instruction.
HALT       Stop program execution.

The Control Unit Registers and Circuits:

The task of the control unit is to fetch and execute instructions. To do this it uses the Program Counter (PC), the Instruction Register (IR) and an instruction decoder circuit. The PC holds the address of the next instruction. The control unit sends the contents of the PC to the MAR, the fetch operation is carried out, and the PC is incremented by 1. The IR holds the instruction fetched from memory. The operation code is decoded (by the instruction decoder circuit) to determine what the instruction is, and the full operation is then carried out according to what is required. This process is repeated until the program is fully executed.
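The fetch-decode-execute cycle just described can be sketched as a Python loop. The tiny instruction set (LOAD/ADD/STORE/HALT), the single register R and the sample program are illustrative assumptions:

    # A sketch of the control unit's cycle using a program counter
    # (PC) and instruction register (IR). Names are illustrative.
    memory = {0: ("LOAD", "x"), 1: ("ADD", "y"), 2: ("STORE", "z"),
              3: ("HALT",), "x": 10, "y": 32, "z": 0}
    R = 0          # a single ALU register
    PC = 0         # program counter: address of the next instruction
    running = True

    while running:
        IR = memory[PC]            # fetch the instruction at address PC
        PC += 1                    # increment the PC
        op = IR[0]                 # decode the operation code
        if op == "LOAD":           # execute: R <- memory[x]
            R = memory[IR[1]]
        elif op == "ADD":          # execute: R <- R + memory[y]
            R += memory[IR[1]]
        elif op == "STORE":        # execute: memory[z] <- R
            memory[IR[1]] = R
        elif op == "HALT":         # stop program execution
            running = False

    print(memory["z"])             # 42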
Parallel Computing:

The Von Neumann architecture carries out instructions sequentially. Owing to the massive amounts of data and the size of modern programs, this can lead to the Von Neumann bottleneck (we are unable to process data fast enough). To prevent such bottlenecks, other architectures are considered; one of the most popular in today's world is the parallel architecture, and hence parallel computing. The idea behind parallel computing is to be able to do more than one operation at the same time, and to do so we must have the hardware for it. One solution is to incorporate multiple processors in the same "machine", i.e. build the computer with dual or quad (etc.) processors; when a program is executed, it is divided into sub-parts, with each part handled by one of the processors simultaneously. So if the program as a whole took 1 minute to execute on a single processor, on a quad-processor machine it could conceivably take 1/4 minute (15 seconds).

Another solution is to have a network of single-processor (or even multiple-processor) machines all connected together, broadcast the same instruction to all of the machines, and distribute the data among them (SIMD – single instruction stream/multiple data stream). In this way, all of the machines perform the single instruction, but each on its own data set, as the accompanying diagram shows. A very good application for this type of architecture is vector manipulation – adding two vectors, scaling a vector, etc. For example, if we had a vector of 4 numbers and we wanted to scale it by a factor of 3, we would give each of the 4 ALUs one data element, and all 4 ALUs would carry out the instruction to multiply their data by 3. The operations are done simultaneously, speeding up the time required. This type of parallel computing is known as distributed processing.

Another method is to distribute both multiple instructions and multiple data streams to the computers (MIMD – multiple instruction stream/multiple data stream). Here, the program is divided into separate processes, and each individual process is allocated to an individual machine. Each of these processes may be made up of many instructions and may require multiple data. The accompanying diagram shows this type of architecture. This type of distribution is also known as cluster computing. Note: one variation of the MIMD architecture is grid computing, whereby the processors do not all have to be located in the same "building" but can be located anywhere that communication is possible for networking. So computers from all parts of the country can be connected to solve a problem – for example, SETI (the Search for Extraterrestrial Intelligence) at Berkeley, where individuals have lent their computers while they are at work.
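The SIMD vector-scaling example above can be imitated in Python, with worker processes standing in for the separate ALUs; this is only an analogy under that assumption, since real SIMD hardware applies one instruction in lockstep rather than scheduling processes:

    # A sketch of the SIMD idea: one instruction ("multiply by 3")
    # is broadcast to several workers, each applying it to its own
    # element of the vector. Pool workers stand in for the ALUs.
    from multiprocessing import Pool

    def scale_by_3(x):          # the single broadcast instruction
        return x * 3

    if __name__ == "__main__":
        vector = [5, 1, 7, 2]   # one data element per ALU
        with Pool(4) as alus:   # 4 workers, one per element
            print(alus.map(scale_by_3, vector))   # [15, 3, 21, 6]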