A Simple Computer
A simple computer consists of a processor (CPU, Central Processing Unit), memory, and I/O.
[Figure: block diagram of a simple computer. Labels: Processor or CPU (Arithmetic Logic Unit, Control Unit, Registers), Memory, I/O (Input, Output).]

Basic Functional Units of a Computer
• Input accepts coded information from human operators, from electromechanical devices (such as keyboards), or from other digital media over digital communication lines.
• The information received is either stored in memory or used immediately by the arithmetic and logic unit (ALU) to perform the desired operations.
• The results are sent back out through the output medium.
• All actions are coordinated by the control unit.

The Information
• Categorized as either instructions or data
• Instructions (or machine instructions) are explicit commands that
  – Govern the transfer of information within a computer as well as between the computer and its I/O devices.
  – Specify the arithmetic and logic operations to be performed.

Programs
• A list of instructions that performs a task is called a program.
• Usually the program is stored in memory.
• The processor fetches the program's instructions from memory, one after another, and performs the desired operations.
• The computer is completely controlled by the stored program, except for possible interruption by an operator or by I/O devices connected to the machine.
• Data are numbers and encoded characters that are used as operands by the instructions.

Computer System Organization

Inside the CPU
• Control Unit (CU): coordinates the sequencing of steps involved in executing machine instructions
• Arithmetic Logic Unit (ALU): performs arithmetic and logical operations
• Registers: storage locations inside the CPU
• Clock: synchronizes the internal operations of the CPU with the other system components

Bus Structure
• Bus: a group of parallel wires that transfers information from one part of the computer to another.
  – Control Bus synchronizes the actions of all of the devices attached to the system bus.
  – Address Bus passes the addresses of instructions and data between the CPU and memory (or I/O).
  – Data Bus transfers instructions and data between the CPU and memory (or I/O).

Bus Sizes
• For the 8086 Processor
  – Address Bus: 20 bits
    • Can access 1 MB of memory (2^20 byte addresses)
    • Addresses run from $00000 to $FFFFF
  – Data Bus: 16 bits (16-bit processor)
    • A word is 16 bits
    • Memory is byte addressable, so each byte of a word has its own address

More Facts on The 8086 Processor
Generation                 P1
External Data Bus Width    16
Internal Register Width    16
Address Bus Width          20
Numeric Data Processor     External
L1 Cache                   None
L2 Cache                   None

The Intel CPU Family
Chip         Date     MHz      Transistors  Memory  Notes
4004         4/1971   0.108    2,300        640     First microprocessor on a chip
8008         4/1972   0.108    3,500        16KB    First 8-bit processor
8080         4/1974   2-3      6,000        64KB    First general-purpose CPU on a chip
8085         4/1976   3-8      6,500        64KB
8086         6/1978   5-10     29,000       1MB     First 16-bit CPU on a chip
8088         6/1979   5-8      29,000       1MB     Used in IBM PC
80286        2/1982   8-12     134,000      16MB    Memory protection present
80386        10/1985  16-33    275,000      4GB     First 32-bit CPU
80486        4/1989   25-100   1.2M         4GB     Built-in 8K cache memory
Pentium      3/1993   60-233   3.1M         4GB     Two pipelines; later models had MMX
Pentium Pro  3/1995   150-200  5.5M         4GB     Two levels of cache built in
Pentium II   5/1997   233-400  7.5M         4GB     Pentium Pro plus MMX
Pentium III  1998     550      9.5M                 Streaming SIMD extensions (SSE)

Notes from Intel Family Chart
• Notice that the 386 through the Pentium 4 are 32-bit processors (32-bit data bus, 4 bytes).
• Notice that the 386 and beyond have a 32-bit address bus and can access 4 GB of memory addresses.

Machine Cycle
• Most basic unit of time for machine instructions
• Equal to the time required for one complete clock cycle.
• Machine instructions require at least 1 clock cycle to execute. Most require more.
• Wait states: empty clock cycles added to machine execution time because memory access is slower than the clock.
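As a quick check on the bus-size figures quoted above (a 20-bit address bus reaching 1 MB, a 32-bit address bus reaching 4 GB), here is a minimal Python sketch. The function names `addressable_bytes` and `address_range` are made up for illustration, and the sketch assumes one byte per address, as the byte-addressable memory on these slides implies.

```python
from typing import Tuple


def addressable_bytes(address_bits: int) -> int:
    """Number of distinct byte addresses an address bus of this width can select."""
    return 2 ** address_bits


def address_range(address_bits: int) -> Tuple[str, str]:
    """Lowest and highest address as hex strings, padded to the bus width."""
    digits = (address_bits + 3) // 4          # hex digits needed for the width
    top = addressable_bytes(address_bits) - 1
    return (f"{0:0{digits}X}", f"{top:0{digits}X}")


if __name__ == "__main__":
    # Bus widths taken from the slides: 8086 = 20 bits, 80386 = 32 bits.
    for name, bits in [("8086", 20), ("80386", 32)]:
        low, high = address_range(bits)
        print(f"{name}: {bits}-bit address bus, "
              f"{addressable_bytes(bits)} bytes, ${low}-${high}")
```

Each extra address line doubles the reachable memory, which is why going from 20 to 32 address bits multiplies the address space by 4096.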
Instruction Execution Cycle
• If the instruction uses a memory operand (for example, mov ax, [0A69Bh], which reads the word stored at address 0A69Bh)
  – Calculate the address of the operand
  – Place the address of the operand on the address bus
  – Wait for memory to get the operand and pass it back on the data bus

The data path of a typical von Neumann Machine
[Figure: data path diagram. Labels: Registers, ALU input registers (A, B), ALU input bus, ALU, ALU output register (A+B).]

Instruction Execution Cycle
The CPU executes each instruction in a series of small steps:
1. Fetch the next instruction from memory into the instruction register.
2. Change the program counter to point to the next instruction.
3. Decode the instruction.
4. Fetch any necessary memory operands into a CPU register.
5. Execute the instruction.
6. Store the output operand into a CPU register.

Execution of von Neumann Machines
Fetching the next instruction while the current one is executing speeds up the machine.
Instructions are stored in prefetch buffers (registers), where they can be accessed more quickly than by waiting for a fetch from memory.
Prefetching divides instruction execution into two parts: fetching and actual execution.
Pipelining divides instruction execution into many parts, each one handled by a piece of dedicated hardware, all of which can run in parallel.

2-stage Pipelining
• Execution Unit: executes the microcode instructions.
• Bus Interface Unit: accesses memory and provides I/O

A Five-stage Pipeline
S1  Code Prefetch Unit
S2  Instruction Decode Unit
S3  Operand Fetch Unit
S4  Instruction Execution Unit
S5  Write Back Unit

The state of each stage as a function of time (entries are instruction numbers):
      t1  t2  t3  t4  t5  t6  t7  t8  t9
S1     1   2   3   4   5   6   7   8   9
S2         1   2   3   4   5   6   7   8
S3             1   2   3   4   5   6   7
S4                 1   2   3   4   5   6
S5                     1   2   3   4   5

How Fast Does This Machine Run?
• Suppose that the cycle time of this machine is 2 nsec.
• Then it takes 10 nsec for an instruction to progress all the way through the five-stage pipeline.
• Does the machine run at 100 MIPS (1/10 nsec)?
• No. Once the pipeline is full, a new instruction is completed every clock cycle (2 nsec), so the actual rate of processing is 500 MIPS.
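The timing diagram and the 10 nsec / 500 MIPS figures above can be reproduced with a short Python sketch. It assumes an idealized pipeline with no stalls, in which every stage takes exactly one cycle; the function names are illustrative and do not come from the slides.

```python
from typing import List, Optional


def pipeline_occupancy(stages: int, cycles: int) -> List[List[Optional[int]]]:
    """Return table[s][t]: the instruction number held by stage s+1 during
    cycle t+1, or None while that stage has not yet received any work."""
    table = []
    for s in range(stages):
        row = []
        for t in range(cycles):
            instr = t - s + 1  # instruction i reaches stage s+1 in cycle i+s
            row.append(instr if instr >= 1 else None)
        table.append(row)
    return table


def latency_ns(stages: int, cycle_time_ns: float) -> float:
    """Time for one instruction to pass through every stage."""
    return stages * cycle_time_ns


def throughput_mips(cycle_time_ns: float) -> float:
    """Completion rate once the pipeline is full: one instruction per cycle,
    so 1/T instructions per nsec, which is 1000/T million per second."""
    return 1000.0 / cycle_time_ns


if __name__ == "__main__":
    for s, row in enumerate(pipeline_occupancy(stages=5, cycles=9), start=1):
        cells = " ".join(f"{i:>2}" if i is not None else " ." for i in row)
        print(f"S{s}: {cells}")
    print("latency (nsec):", latency_ns(5, 2.0))
    print("throughput (MIPS):", throughput_mips(2.0))
```

The point of separating the two functions is that latency depends on the number of stages while throughput depends only on the cycle time.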
How many cycles are required to execute n instructions? (Pipelined Versus Non-Pipelined Systems)
• For a system with k stages
• In non-pipelined systems, n instructions require (n*k) cycles to process.
  – With k = 5, 5 instructions require (5*5) = 25 clock cycles
• Using a pipelined system with k pipeline stages, n instructions require (k + (n-1)) cycles to complete.
  – With k = 5, 5 instructions require (5 + (5-1)) = 9 clock cycles (*refer to slide #14)

Tradeoffs
Pipelining allows a tradeoff between
  – Latency
    • How long it takes to execute an instruction
    • Latency = nT nanosec (where the cycle time is T nanosec and the number of stages is n)
  – Processor Bandwidth
    • How many MIPS the CPU has
    • Bandwidth = 1000/T MIPS
* Logically we should measure CPU bandwidth in BIPS or GIPS since we are measuring T in nanosec, but nobody does this.

IA-32 Processor Pipelining (6-stage Execution Cycle)
• Bus Interface Unit: accesses memory and provides I/O
• Code Prefetch Unit: receives instructions from the BIU and inserts them into a holding area (instruction queue)
• Instruction Decode Unit: decodes machine instructions from the prefetch queue and translates them into microcode.
• Execution Unit: executes the microcode instructions.
• Segment Unit: translates logical addresses into linear addresses and performs protection checks
• Paging Unit: translates linear addresses into physical addresses, performs page protection checks, and keeps a list of recently accessed pages

Superscalar Architecture
• If one pipeline is good, then two pipelines must be better.
• Parallel paths exist through which different instructions can be executed in parallel.
• It is possible to start the execution of several instructions in every clock cycle.
• The logical correctness of programs must be maintained.

Dual five-stage pipelines with a common Code Prefetch Unit
The code prefetch unit fetches pairs of instructions together and puts each one into its own pipeline, complete with its own ALU for parallel operation.
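Returning to the cycle-count formulas for a single k-stage pipeline given above, the following Python sketch compares the two cases. It assumes an ideal pipeline with no stalls or hazards, and the function names are chosen only for this illustration.

```python
def cycles_non_pipelined(n: int, k: int) -> int:
    """Each of the n instructions occupies all k stages before the next starts."""
    return n * k


def cycles_pipelined(n: int, k: int) -> int:
    """The first instruction takes k cycles; each later one finishes 1 cycle after."""
    if n == 0:
        return 0
    return k + (n - 1)


def speedup(n: int, k: int) -> float:
    """Ratio of non-pipelined to pipelined cycle counts for n >= 1 instructions."""
    return cycles_non_pipelined(n, k) / cycles_pipelined(n, k)


if __name__ == "__main__":
    k = 5
    for n in (1, 5, 100, 1000):
        print(f"n={n}: non-pipelined={cycles_non_pipelined(n, k)}, "
              f"pipelined={cycles_pipelined(n, k)}, speedup={speedup(n, k):.2f}")
```

For a single instruction there is no gain, and as n grows the speedup approaches k, the number of stages.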
Superscalar processor with 5 functional units
Four pipelines would duplicate too much hardware. Instead, use a single pipeline and give it multiple functional units. This assumes that the S3 stage can issue instructions faster than the S4 stage can execute them. (Pentium II)

Parallelism
• So far we have dealt with instruction-level parallelism.
• There is also processor-level parallelism
  – Array processors
  – Multiprocessors
  – Multicomputers

CISC
Complex Instruction Set Computer
• A large number of variable-length instructions (more than 128)
• Multiple addressing modes
• A small number of internal processor registers
• Instructions that require multiple clock cycles to execute

8086 (A Real CISC)
• Over 3000 different instruction forms, each requiring anywhere from one to six bytes
• Nine different addressing modes are supported
• The processor has only eight general purpose registers
• Instruction execution times range from 2 clock cycles to more than 80 cycles for the ASCII adjust for multiplication instruction.

Intel's i860 RISC Processor
• 82 instructions, each 32 bits in length
• Four addressing modes
• 32 general purpose registers
• All instructions execute in one clock cycle

Why hasn't RISC won out?
• Backward compatibility (companies have spent billions of dollars on Intel processor software).
• Intel has built CPU cores with a RISC-like structure that execute the simplest and most common instructions in a single data path, while interpreting the more complex instructions in the usual CISC way.
Design Principles of Modern Computers
• All instructions are directly executed in hardware
• Maximize the rate at which instructions are issued
• Instructions should be easy to decode
• Only loads and stores should be able to reference memory
• Provide plenty of registers

Application Specific Microprocessors

Digital Signal Processors
• Previously, analog signals had to be handled with discrete circuits (op-amps, capacitors, inductors, and resistors forming filters, amplifiers, etc.)
• Now low-cost analog-to-digital and digital-to-analog converters are available.
• Thus we have digital signal processing systems.

DSP systems
• DSPs are used to perform repetitive, complex mathematical computations on the converted analog data.
• One computation may require as many as 500,000 add-multiply operations.

DSP Architecture
• Data and instructions are stored in two different memory areas, each with its own bus (Harvard Architecture)
• Hardware multipliers and adders are built into the processor and optimized to perform a calculation in a single clock cycle.
• Arithmetic pipelining is used so that several instructions can be operated on at once.
• Hardware DO loops are provided to speed up repetitive operations
• Multiple (serial) I/O ports are provided for communication with other processors.

DSP Applications
• Multimedia sound cards (used to compress speech and music signals)
• DSPs can be reprogrammed (which allows some sound cards to double as a modem)
• Cellular phones
• Speech and image compression
• Optical character recognition
• Video conferencing

Operating System
• A large program, or collection of programs, used to control the sharing of and interaction among various computer units as they execute application programs.
• Performs the tasks required to assign computer resources to individual application programs, including:
  – Assigning memory and magnetic disk space to program and data files
  – Moving data between memory and disk units
  – Handling I/O operations

Example of How an Operating System Manages the Execution of More Than One Application Program at the Same Time
• The application program has been compiled from high-level language form into machine language form and is stored on disk.
• Assume that somewhere in the program a data file must be read, some computation performed on the data, and the results printed.
  – Transfer the program file into memory
  – When the transfer is complete, begin execution
  – When the point in the program is reached where the data file is needed, the program requests the operating system to transfer the data file from the disk to memory.
• The OS performs this task and passes execution control back to the application program, which then proceeds to perform the required computation.
• When the computation is completed and the results are ready to be printed,

Can Multitasking be used for concurrent execution of application programs?
[Figure: timing diagram showing activity of the Program, OS Routines, Disk, and Printer over time points t0 to t5.]