OVERVIEW & COMPUTER PERFORMANCE
SCR1043 - Module 1

• Organization and Architecture
• Structure and Function
Reference: William Stallings, Computer Organization & Architecture

Computer architecture is those attributes visible to the programmer. Examples:
• the instruction set
• the number of bits used to represent various data types
• I/O mechanisms
• memory addressing techniques

Computer organization is how features are implemented:
• control signals
• interfaces between computer and peripherals
• the memory technology being used

So, for example, the fact that a multiply instruction is available is a computer architecture issue. How that multiply is implemented is a computer organization issue.

Many computer manufacturers offer a family of computer models, all with the same architecture but with differences in organization.
• All members of the Intel x86 family share the same basic architecture.
• The IBM System/370 architecture, first introduced in 1970, included a number of models that share the same basic architecture, and it has survived to this day as the architecture of IBM's mainframe product line.
• The newer models retained the same architecture so that the customer's software investment was protected (code compatibility).

A computer is a complex system: a hierarchy of interrelated subsystems at different levels. At each level, the designer is concerned with structure and function:
• Structure: the way in which the components are interrelated.
• Function: the operation of each individual component as part of the structure.
The computer system in this course will be described from the top down, instead of bottom-up.

Four main structural components:
• Central processing unit (CPU): controls the operation of the computer and performs its data processing functions. Its major structural components are:
    • Control unit: controls the operation of the CPU
    • Arithmetic and logic unit (ALU): performs the computer's data processing functions
    • Registers: provide storage internal to the CPU
    • CPU interconnection: some mechanism that provides for communication among the control unit, ALU, and registers
• Main memory: stores data
• I/O: moves data between the computer and its external environment
• System interconnection: some mechanism that provides for communication among CPU, main memory, and I/O

[Figure: the computer, top-level structure. Labels: peripherals, communication lines, central processing unit, main memory, input/output, systems interconnection.]

[Figure: the CPU. Labels: arithmetic and logic unit, registers, control unit, internal CPU interconnection; I/O, memory, system bus.]

[Figure: the control unit. Labels: sequencing logic, control unit registers and decoders, control memory; ALU, registers, internal bus.]

There are only four functions:
• Data processing: process data in a variety of forms and requirements
• Data storage: short- and long-term data storage for retrieval and update
• Data movement: move data between the computer and the outside world
• Control: control of processing, moving and storing data using instructions
How are these functions performed? Through a PROGRAM.

A program is a sequence of steps:
• For each step, a computer function is executed
• For each operation, a different/new set of control signals is needed
• For each operation, a unique code (instruction) is provided, e.g. ADD, MOVE
• A hardware segment accepts the code and issues the control signals

Approach 1: Hardwired program
• Connecting/combining various logic components to store data and perform arithmetic and logic operations
• Hardwired systems are inflexible

Approach 2: Software
• General-purpose hardware can do different tasks, given the correct control signals
• Instead of re-wiring, supply a new set of control signals through instruction codes

• A Brief History of Computers
• Designing for Performance
• Pentium and PowerPC Evolution
Reference: William Stallings, Computer Organization & Architecture

1943-1946: ENIAC (Electronic Numerical Integrator And Computer)
• First general-purpose electronic digital computer
• Designed by Mauchly and Eckert
• Designed to create ballistics tables for WWII, but finished too late; helped determine H-bomb feasibility instead. General purpose!
• 30 tons + 1,500 sq. ft. + 18,000 vacuum tubes + 140 kW = 5,000 additions/sec

1945: stored-program concept first proposed for EDVAC (Electronic Discrete Variable Automatic Computer). Key concepts:
• Data and instructions are stored in a single read-write memory.
• The contents of this memory are addressable by location, without regard to the type of data contained there.
• Execution occurs in a sequential fashion from one instruction to the next.

The IAS computer: prototype for all subsequent general-purpose computers. With rare exceptions, all of today's computers have this same general structure, and are referred to as von Neumann machines.
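The stored-program concepts listed above (one read-write memory for data and instructions, addressing by location, sequential execution) can be sketched as a toy machine. This is only an illustration: the opcodes LOAD, ADD, STORE and HALT, the (opcode, address) encoding and the function name `run` are assumptions made for this sketch, not the instruction set of EDVAC or the IAS computer.

```python
# Toy stored-program machine (illustrative only; the opcodes and the
# (opcode, address) encoding are assumptions, not a real instruction set).

def run(memory):
    """Execute the program held in `memory`, starting at address 0."""
    acc = 0  # accumulator register
    pc = 0   # program counter: address of the next instruction
    while True:
        opcode, address = memory[pc]  # fetch from the same memory that holds data
        pc += 1                       # sequential execution by default
        if opcode == "LOAD":
            acc = memory[address]
        elif opcode == "ADD":
            acc += memory[address]
        elif opcode == "STORE":
            memory[address] = acc
        elif opcode == "HALT":
            return memory
        else:
            raise ValueError(f"unknown opcode: {opcode}")

# Instructions occupy addresses 0-3 and data occupies addresses 4-6,
# all in the same list.
memory = [
    ("LOAD", 4),
    ("ADD", 5),
    ("STORE", 6),
    ("HALT", 0),
    2,  # address 4: first operand
    3,  # address 5: second operand
    0,  # address 6: the result is stored here
]

run(memory)
print(memory[6])
```

Changing the tuples at addresses 0-3 changes what the machine computes without touching `run`, which is the "software instead of re-wiring" point made earlier.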
General IAS structure consists of:
• A main memory, which stores both data and instructions
• An ALU capable of operating on binary data
• A control unit, which interprets the instructions in memory and causes them to be executed
• I/O equipment operated by the control unit

• 1950: UNIVAC, commissioned by the Census Bureau for 1950 calculations
• Late 1950s: UNIVAC II
    • Greater memory and higher performance
    • Same basic architecture as UNIVAC
    • First example of upward compatibility
• 1953: IBM 701, primarily for science
• 1955: IBM 702, primarily for business

• 1947: Transistor developed at Bell Labs
• Introduction of more complex ALU and control units
• High-level programming languages
• The data channel: an independent I/O module with its own processor and instruction set
• The multiplexor: a central termination point for data channels, CPU, and memory. Precursor to the idea of a data bus.
• DEC (Digital Equipment Corporation), founded in 1957, delivered its first computer, the PDP-1, which started the minicomputer phenomenon.

• 1958: Integrated circuit developed
• 1964: Introduction of IBM System/360, the first planned family of computer products. Characteristics of a family:
    • Similar or identical instruction set and operating system
    • Increasing speed
    • Increasing number of I/O ports
    • Increasing memory size
    • Increasing cost
• Different models could all run the same software, but with different price/performance.

Microelectronics: literally, "small electronics"
• A computer is made up of gates, memory cells and interconnections
• These can be manufactured on a semiconductor, e.g. a silicon wafer
• With microelectronics, the density of components on a chip keeps increasing

Moore's law (Gordon Moore, co-founder of Intel):
• Original form: the number of transistors on a chip will double every year
• Since the 1970s development has slowed a little; the modified law says the number of transistors on a chip doubles every 18 months
• Therefore, more circuits can be packed on the same size chip
• Higher packing density means shorter electrical paths, giving higher performance
• Smaller size gives increased flexibility
• Reduced power and cooling requirements
• Fewer interconnections increase reliability

[Figure: Moore's prediction versus actual.]

IBM System/360 (1964)
• Replaced (and was not compatible with) the 7000 series
• First planned "family" of computers
    • Similar or identical instruction sets
    • Similar or identical O/S
    • Increasing speed
    • Increasing number of I/O ports (i.e. more terminals)
    • Increased memory size
    • Increased cost

1964: First PDP-8 shipped
• First minicomputer
• Started the OEM market
• Introduced the bus structure
• Did not need an air-conditioned room
• Small enough to sit on a lab bench
• $16,000, compared to $100k++ for an IBM 360

• Semiconductor memory
    • Replaced bulky core memory
    • Goes through its own generations in size, increasing by a factor of 4 each time: 1K, 4K, 16K, 64K, 256K, 1M, 4M, 16M on a single chip, with declining cost and access time
• Microprocessor
• Personal computers
• Distributed computing
• Larger and larger scales of integration

Microprocessor: all CPU components on a single chip
• 1971: 4004, the first microprocessor, 4-bit
• 1972: 8008, 8-bit
• Both were designed for specific applications
• 1974: 8080, Intel's first general-purpose microprocessor, designed to be the CPU of a general-purpose microcomputer

• 8080: first general-purpose microprocessor; 8-bit data path; used in the first personal computer, the Altair
• 8086: much more powerful; 16-bit; instruction cache that prefetches a few instructions
• 8088: 8-bit external bus; used in the first IBM PC
• 80286: 16 MB of addressable memory
• 80386: first 32-bit design; support for multitasking (running multiple programs at the same time)

• 80486: sophisticated, powerful cache and instruction pipelining; built-in maths co-processor
• Pentium: superscalar technique, with multiple instructions executed in parallel
• Pentium Pro: increased superscalar organization; aggressive register renaming; branch prediction; data flow analysis; speculative execution

• Pentium II: MMX technology for graphics, video and audio processing
• Pentium III: additional floating-point instructions for 3D graphics
• Pentium 4: further floating-point and multimedia enhancements
• Itanium: 64-bit
• Core Duo: start of multicore processors

• 1975: 801 minicomputer project (IBM), RISC
• Berkeley RISC I processor
• 1986: IBM commercial RISC workstation product, the RT PC
    • Not a commercial success
    • Many rivals with comparable or better performance
• 1990: IBM RISC System/6000
    • RISC-like superscalar machine
    • POWER architecture
• IBM alliance with Motorola (68000 microprocessors) and Apple (used the 68000 in the Macintosh)
    • Result is the PowerPC architecture
    • Derived from the POWER architecture
    • Superscalar RISC
    • Apple Macintosh
    • Embedded chip applications

Price/performance
• Price drops every year
• Performance increases almost yearly
• Memory goes up by a factor of 4 every 3 years or so
• The basic building blocks for today's computers are the same as those of the IAS computer nearly 50 years ago.

• Density of integrated circuits increases by a factor of 4 every 3 years (e.g. memory evolution)
• This also results in a performance boost of 4-5 times every 3 years
• Requires more elaborate ways of feeding instructions quickly enough.
Some techniques:
• Branch prediction
• Data-flow analysis
• Speculative execution

Not all components increase in performance at the same rate as the processor. This results in a need to adjust the organization and architecture to compensate for the mismatch among the capabilities of the various components.

The path between processor and main memory must carry a constant flow of program instructions and data between memory chips and processor.
• Processor speed and memory capacity have grown rapidly
• The speed with which data can be transferred between processor and main memory has lagged badly
• DRAM density goes up faster than the amount of main memory needed
    • The number of DRAMs goes down
    • With fewer DRAMs, there is less opportunity for parallel data transfer

• Increase the number of bits retrieved at one time
    • Make DRAM "wider" rather than "deeper"
• Change the DRAM interface
    • Include a cache in the DRAM chip
• Reduce the frequency of memory access
    • More complex and efficient cache between processor and memory
    • Cache on chip/processor
• Increase interconnection bandwidth between processor and memory
    • High-speed buses
    • Hierarchy of buses

I/O devices also become increasingly demanding
• Peripherals with intensive I/O demands
• Large data throughput demands
• Processors can handle this
• The problem is moving the data
• Solutions:
    • Caching
    • Buffering
    • Higher-speed interconnection buses
    • More elaborate bus structures
    • Multiple-processor configurations
• Peripherals (I/O devices) have extremes:
    • in speed: < 1 Hz to GHz
    • in amount of data transferred: < 1 bit/sec to Gb/sec

Because of constant and unequal changes in processor components, main memory, I/O devices and interconnection structures, designers must constantly strive to balance their throughput and processing demands.
• Increase hardware speed of processor
    • Fundamentally due to shrinking logic gate size
    • More gates, packed more tightly, increasing clock rate
    • Propagation time for signals reduced
• Increase size and speed of caches
    • Dedicating part of the processor chip
    • Cache access times drop significantly
• Change processor organization and architecture
    • Increase effective speed of execution
    • Parallelism

• Power
    • Power density increases with density of logic and clock speed
    • Dissipating heat
• RC delay
    • The speed at which electrons flow is limited by the resistance and capacitance of the metal wires connecting them; with increased density:
    • Interconnect wires become thinner, increasing resistance (R)
    • Wires are closer together, increasing capacitance (C)
    • Therefore, delay increases as the RC product increases
• Memory latency
    • Memory speeds lag behind processor speeds
• Solution: more emphasis on organizational and architectural approaches
    • Better performance now comes more from improving the architecture of the CPU than from raw processing speed (technology)

• Typically two or three levels of cache between processor and main memory (L1, L2, L3)
• Chip density increased
    • More cache memory on chip
    • Faster cache access
• The Pentium chip devoted about 10% of chip area to cache
• The Pentium 4 devotes about 50%

• Enable parallel execution of instructions
• A pipeline works like an assembly line
    • Different stages of execution of different instructions at the same time along the pipeline
• Superscalar allows multiple pipelines within a single processor
    • Instructions that do not depend on one another can be executed in parallel

Both of these approaches are reaching a point of diminishing returns.
• Internal organization of processors is complex
    • Can get a great deal of parallelism
    • Further significant increases are likely to be relatively modest
• Benefits from cache are reaching a limit
• Increasing clock rate runs into the power dissipation problem
    • Some fundamental physical limits are being reached
• We can use Amdahl's law to estimate the maximum expected performance improvement to an overall system when only part of the system is improved.

• Within a processor, the increase in performance is proportional to the square root of the increase in complexity
• If software can use multiple processors, doubling the number of processors almost doubles performance
• So, use two simpler processors on the chip rather than one more complex processor
• Multiple processors on a single chip, with a large shared cache
    • With two processors, larger caches are justified
    • Power consumption of memory logic (for cache) is less than that of processing logic
• Example: IBM POWER4, two cores based on PowerPC

• CPU Performance and its Factors
• Evaluating Performance
Reference: David A. Patterson & John L. Hennessy, Computer Organization and Design

Hardware performance is often key to the effectiveness of an entire system of hardware and software. For different types of applications, different performance metrics may be appropriate, and different aspects of a computer system may be the most significant factor in determining overall performance. Understanding how best to measure performance, and the limitations of performance, is important when selecting a computer system.
To understand the issues of assessing performance, ask:
• Why does a piece of software perform as it does?
• Why can one instruction set be implemented to perform better than another?
• How does some hardware feature affect performance?

Performance is important!
• Identify HW/SW performance problems
• Comparisons:
    • Which machine is faster?
    • Which ISA is better?
    • Which implementation (of an ISA) is faster?
• Expose significant performance issues (enabling us to ignore unimportant issues)

Which of these airplanes has the best performance? How do we say one computer has better performance than another?
• Performance based on speed
    • To take a single passenger from one point to another in the least time: Concorde
• Performance based on throughput
    • To transport 450 passengers from one point to another: 747

Response Time and Throughput
• Response time: time to respond (complete an operation)
• Throughput: jobs completed per unit time
• Often one can be traded for the other

• MB/s, Mb/s: megabytes, megabits per second
• MIPS: millions of instructions per second
• CPI: clock cycles per instruction
• IPC: instructions per clock cycle
• Hz: (processor clock frequency) cycles per second
• LIPS: logical inferences per second
• FLOPS: floating-point arithmetic operations per second

• Real time: "wall clock" time, always ticking
• CPU execution time (CPU time): ticks only when the CPU is working for you
    • User: CPU time spent in the program
    • System: CPU time spent in the operating system performing tasks on behalf of the program
• Clock cycle: the time for one clock period, usually of the processor clock, which runs at a constant rate (e.g. 0.25 nanoseconds). Also called tick, clock tick, clock period, clock, cycle.
• Clock rate: the inverse of the clock cycle time; a frequency (e.g. 4 GHz)

CPU execution time for a program = seconds for the program
Clock cycle time = seconds per clock cycle
The clock ticks at a constant rate, so measure time in clock cycles:
\[ \frac{\text{Seconds}}{\text{Program}} = \frac{\text{Cycles}}{\text{Program}} \times \frac{\text{Seconds}}{\text{Cycle}} \]
Prefer clock frequency? Divide by Hz:
\[ \frac{\text{Seconds}}{\text{Program}} = \frac{\text{Cycles}/\text{Program}}{\text{Clock rate}} \]

A simple formula relates the most basic metrics (i.e., clock cycles and clock cycle time) to CPU time:
\[ \text{CPU time} = \text{CPU clock cycles} \times \text{Clock cycle time} = \frac{\text{CPU clock cycles}}{\text{Clock rate}} \]

Our favorite program runs in 10 seconds on computer A, which has a 4 GHz clock.
Computer B will run this program in 6 seconds, given that computer B requires 1.2 times as many clock cycles as computer A for this program. What is computer B's clock rate?

CPU Time(A) = CPU Clock Cycles(A) / Clock Rate(A)
10 s = CPU Clock Cycles(A) / 4 GHz
10 s = CPU Clock Cycles(A) / (4 × 10^9 Hz)
CPU Clock Cycles(A) = 40 × 10^9 cycles

CPU Time(B) = 1.2 × CPU Clock Cycles(A) / Clock Rate(B)
6 s = 1.2 × CPU Clock Cycles(A) / Clock Rate(B)
Clock Rate(B) = 1.2 × 40 × 10^9 cycles / 6 s
Clock Rate(B) = 48 × 10^9 cycles / 6 s
Clock Rate(B) = 8 × 10^9 cycles/s
Clock Rate(B) = 8 GHz

Instruction count = instructions executed for the program
Clock cycles per instruction (CPI) = average number of clock cycles per instruction
Programs are made of instructions:
\[ \frac{\text{Cycles}}{\text{Program}} = \frac{\text{Instructions}}{\text{Program}} \times \frac{\text{Cycles}}{\text{Instruction}} \]
Using CPI:
\[ \frac{\text{Cycles}}{\text{Program}} = \frac{\text{Instructions}}{\text{Program}} \times \text{CPI} \]
Or, using Instructions Per Clock (IPC):
\[ \frac{\text{Cycles}}{\text{Program}} = \frac{\text{Instructions}/\text{Program}}{\text{IPC}} \]

\[ \text{CPU time} = \frac{\text{Seconds}}{\text{Program}} = \frac{\text{Cycles}}{\text{Program}} \times \frac{\text{Seconds}}{\text{Cycle}} = \frac{\text{Instructions}}{\text{Program}} \times \frac{\text{Cycles}}{\text{Instruction}} \times \frac{\text{Seconds}}{\text{Cycle}} \]
In other words:
\[ \text{CPU time} = \text{Instruction count} \times \text{CPI} \times \text{Clock cycle time} \]

Suppose we have two implementations of the same instruction set architecture, running the same program. Which computer is faster, and by how much?
• Computer A: clock cycle time = 250 ps and CPI = 2.0
• Computer B: clock cycle time = 500 ps and CPI = 1.2

Say I = number of instructions for the program. Find the number of clock cycles for A and B:
CPU Clock Cycles(A) = I × CPI(A) = I × 2.0
CPU Clock Cycles(B) = I × CPI(B) = I × 1.2

Compute the CPU time for A and B:
CPU Time(A) = CPU Clock Cycles(A) × Clock Cycle Time(A) = I × 2.0 × 250 ps = I × 500 ps
CPU Time(B) = CPU Clock Cycles(B) × Clock Cycle Time(B) = I × 1.2 × 500 ps = I × 600 ps

Clearly A is faster. The amount faster is the ratio of execution time.
\[ \frac{\text{Performance(A)}}{\text{Performance(B)}} = \frac{\text{Execution time(B)}}{\text{Execution time(A)}} = \frac{I \times 600\ \text{ps}}{I \times 500\ \text{ps}} = 1.2 \]
We can conclude that A is 1.2 times faster than B for this program.

Sometimes it is possible to compute the CPU clock cycles by looking at the different types of instructions and using their individual clock cycle counts:
\[ \text{CPU clock cycles} = \sum_{i=1}^{n} (\text{CPI}_i \times C_i) \]
• C_i = count of the number of instructions of class i executed
• CPI_i = average number of cycles per instruction for that instruction class
• n = number of instruction classes
Remember that the overall CPI for a program depends on both the number of cycles for each instruction type and the frequency of each instruction type in the program execution.

A compiler designer is trying to decide between two code sequences for a particular computer. The hardware designers have supplied the following facts:

    Instruction class    A    B    C
    CPI                  1    2    3

For a particular high-level-language statement, the compiler writer is considering two code sequences that require the following instruction counts:

    Code sequence        A    B    C
    1                    2    1    2
    2                    4    1    1

• Which code sequence executes the most instructions?
• Which will be faster?
• What is the CPI for each sequence?

Sequence 1: Instruction Count(1) = 2 + 1 + 2 = 5 instructions
Sequence 2: Instruction Count(2) = 4 + 1 + 1 = 6 instructions
CPU Clock Cycles(1) = (2 × 1) + (1 × 2) + (2 × 3) = 10 cycles
CPU Clock Cycles(2) = (4 × 1) + (1 × 2) + (1 × 3) = 9 cycles
So code sequence 2 is faster, even though it executes 1 extra instruction.
Code sequence 2 uses fewer clock cycles for more instructions, so it must have a lower CPI:
CPI = CPU Clock Cycles / Instruction Count
CPI(1) = CPU Clock Cycles(1) / Instruction Count(1) = 10/5 = 2
CPI(2) = CPU Clock Cycles(2) / Instruction Count(2) = 9/6 = 1.5

The evolution of computers has been characterized by increasing processor speed, decreasing component size, increasing memory size, and increasing I/O capacity and speed.
All computer designers must balance performance and cost.
Execution time of real programs as the metric is a reliable method of determining and reporting performance.
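The three worked examples in this module can be re-checked with a few lines of Python. This is a minimal sketch, not part of the original slides: the function names `cpu_time` and `clock_cycles_by_class` are hypothetical, and the instruction class labels A, B and C are assumed for illustration. Times in the second example are per instruction, in picoseconds, because the instruction count I cancels in the ratio.

```python
def cpu_time(instruction_count, cpi, clock_cycle_time):
    """CPU time = instruction count x CPI x clock cycle time."""
    return instruction_count * cpi * clock_cycle_time


def clock_cycles_by_class(counts, cpi_per_class):
    """CPU clock cycles = sum over classes of (CPI_i x C_i)."""
    return sum(cpi_per_class[cls] * count for cls, count in counts.items())


# Example 1: computer A runs the program in 10 s at 4 GHz.
# Computer B needs 1.2 times as many cycles and takes 6 s.
cycles_a = 10 * 4e9                # seconds x cycles/second
clock_rate_b = 1.2 * cycles_a / 6  # cycles / seconds, in Hz
print(f"Clock rate of B: {clock_rate_b / 1e9:.1f} GHz")

# Example 2: same ISA and program, two implementations.
time_a = cpu_time(1, 2.0, 250)     # ps per instruction on A
time_b = cpu_time(1, 1.2, 500)     # ps per instruction on B
print(f"A is {time_b / time_a:.1f} times faster than B")

# Example 3: two code sequences over three instruction classes.
cpi_per_class = {"A": 1, "B": 2, "C": 3}
seq1 = {"A": 2, "B": 1, "C": 2}
seq2 = {"A": 4, "B": 1, "C": 1}
cycles_1 = clock_cycles_by_class(seq1, cpi_per_class)
cycles_2 = clock_cycles_by_class(seq2, cpi_per_class)
cpi_1 = cycles_1 / sum(seq1.values())
cpi_2 = cycles_2 / sum(seq2.values())
print(f"Sequence 1: {cycles_1} cycles, CPI = {cpi_1}")
print(f"Sequence 2: {cycles_2} cycles, CPI = {cpi_2}")
```

Keeping instruction count, CPI and clock cycle time as separate arguments mirrors the three factors of the CPU time equation, so the effect of changing any one of them can be tried in isolation.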