Chapter 2 Computer Evolution and Performance Yonsei University

Contents • A Brief History of Computers • Designing for Performance • Pentium and PowerPC Evolution • Performance Evaluation 2-2 Yonsei University

ENIAC A brief history of computers • Electronic Numerical Integrator And Computer • John Mauchly and John Presper Eckert • Trajectory tables for weapons • Started 1943 / Finished 1946 – Too late for the war effort • Used until 1955 • Decimal (not binary) • 20 accumulators of 10 digits • Programmed manually by switches • 18,000 vacuum tubes • 30 tons • 1,500 square feet • 140 kW power consumption • 5,000 additions per second 2-3 Yonsei University

von Neumann/Turing A brief history of computers • Stored Program concept • Main memory storing programs and data • ALU operating on binary data • Control unit interpreting instructions from memory and executing them • Input and output equipment operated by control unit • Princeton Institute for Advanced Studies – IAS • Completed 1952 2-4 Yonsei University

von Neumann Machine A brief history of computers [Figure: Input/Output Equipment, Arithmetic and Logic Unit, Main Memory, Program Control Unit] • If a program could be represented in a form suitable for storing in memory, the programming process could be facilitated • A computer could get its instructions from memory, and a program could be set or altered by setting the values of a portion of memory 2-5 Yonsei University

IAS Memory Formats A brief history of computers [Figure: (a) Number Word – sign bit, bits 1–39; (b) Instruction Word – left instruction: opcode bits 0–7, address bits 8–19; right instruction: opcode bits 20–27, address bits 28–39] • 1000 x 40 bit words – Binary number – 2 x 20 bit instructions – Each instruction consisting of an 8-bit opcode and a 12-bit address designating one of the words in memory 2-6 Yonsei University

IAS Registers A brief history of computers • Memory Buffer Register – Containing a word to be stored in memory, or used to receive a word from memory • Memory Address Register – Specifying the
address in memory of the word to be written from or read into the MBR • Instruction Register – Containing the 8-bit opcode of the instruction being executed • Instruction Buffer Register – Employed to hold temporarily the right-hand instruction from a word in memory • Program Counter – Containing the address of the next instruction-pair to be fetched from memory • Accumulator and Multiplier Quotient – Employed to hold temporarily operands and results of ALU operations 2-7 Yonsei University

Structure of IAS A brief history of computers [Figure: Central Processing Unit – Arithmetic and Logic Unit (Accumulator, MQ, Arithmetic & Logic Circuits, MBR) and Program Control Unit (PC, IBR, MAR, IR, Control Circuits); Main Memory; Input/Output Equipment; instruction & data and address paths] 2-8 Yonsei University

Partial Flowchart of IAS A brief history of computers [Flowchart: Fetch cycle – if the next instruction is already in the IBR (no memory access required), then IR ← IBR(0:7), MAR ← IBR(8:19); otherwise MAR ← PC, MBR ← M(MAR), then if the left instruction is required: IBR ← MBR(20:39), IR ← MBR(0:7), MAR ← MBR(8:19), else IR ← MBR(20:27), MAR ← MBR(28:39); finally PC ← PC + 1. Execution cycle – decode the instruction in IR; e.g. AC ← M(X): MBR ← M(MAR), AC ← MBR; go to M(X, 0:19): PC ← MAR; if AC ≥ 0 then go to M(X, 0:19): test AC, and if AC ≥ 0 then PC ← MAR; AC ← AC + M(X): MBR ← M(MAR), AC ← AC + MBR] 2-9 Yonsei University

The IAS Instruction Set 2-10 A brief history of computers Yonsei University

The IAS Instruction Set 2-11 A brief history of computers Yonsei University

The IAS Instruction Set A brief history of computers • Data transfer – Move data between memory and ALU registers or between two ALU registers • Unconditional branch – The sequence of execution can be changed by a branch instruction • Conditional branch – The branch can be made dependent on a condition, thus allowing decision points • Arithmetic – Operations performed by the ALU • Address modify – Permits addresses to be computed in the ALU and then inserted into instructions stored in memory.
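The IAS fetch cycle described above can be sketched in code. This is a rough, assumption-laden model: the register names and bit layouts follow the slides, but the `fetch` helper, the dictionary representation of machine state, and the PC-update timing are hypothetical simplifications, not the real IAS.

```python
# Hypothetical sketch of the IAS fetch cycle: each 40-bit word packs two
# 20-bit instructions (8-bit opcode + 12-bit address), so a fetch either
# consumes the buffered right-hand instruction (IBR) or reads a new word.

def fetch(memory, state):
    """One fetch step: load IR (opcode) and MAR (address) for the next instruction."""
    if state["ibr_valid"]:                     # next instruction already buffered
        word = state["IBR"]
        state["ibr_valid"] = False
    else:                                      # MAR <- PC, MBR <- M(MAR)
        state["MAR"] = state["PC"]
        state["MBR"] = memory[state["MAR"]]
        word = (state["MBR"] >> 20) & 0xFFFFF  # left instruction, bits 0..19
        state["IBR"] = state["MBR"] & 0xFFFFF  # right instruction, bits 20..39
        state["ibr_valid"] = True
        state["PC"] += 1                       # PC <- PC + 1 (simplified timing)
    state["IR"] = (word >> 12) & 0xFF          # 8-bit opcode
    state["MAR"] = word & 0xFFF                # 12-bit address
    return state["IR"], state["MAR"]

# Demo: one word holding opcode 0x01 @ address 2 (left), opcode 0x02 @ 3 (right).
state = {"PC": 0, "IBR": 0, "ibr_valid": False, "IR": 0, "MAR": 0, "MBR": 0}
memory = {0: (((0x01 << 12) | 0x002) << 20) | ((0x02 << 12) | 0x003)}
print(fetch(memory, state))   # left instruction:  (1, 2)
print(fetch(memory, state))   # right instruction: (2, 3)
```

Note how the second call never touches memory: the IBR buffering is exactly why the flowchart asks "Is next instruction in IBR?" before starting a memory access.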
2-12 Yonsei University

Commercial Computers A brief history of computers • 1947 - Eckert-Mauchly Computer Corporation • UNIVAC I (Universal Automatic Computer) • US Bureau of Census 1950 calculations • Became part of Sperry-Rand Corporation • Late 1950s - UNIVAC II – Faster – More memory 2-13 Yonsei University

IBM A brief history of computers • Had helped build the Mark I • Punched-card processing equipment • 1953 - the 701 – IBM's first stored-program computer – Scientific calculations • 1955 - the 702 – Business applications • Led to the 700/7000 series 2-14 Yonsei University

Computer Generations A brief history of computers
Generation | Approximate Dates | Technology | Typical Speed (operations per second)
1 | 1946-1957 | Vacuum tube | 40,000
2 | 1958-1964 | Transistor | 200,000
3 | 1965-1971 | Small- and medium-scale integration | 1,000,000
4 | 1972-1977 | Large-scale integration | 10,000,000
5 | 1978- | Very-large-scale integration | 100,000,000
2-15 Yonsei University

Transistors A brief history of computers • Replaced vacuum tubes • Smaller • Cheaper • Less heat dissipation • Solid-state device • Made from silicon (sand) • Invented 1947 at Bell Labs • William Shockley et al. 2-16
Yonsei University Transistor Based Computers A brief history of computers • Second generation machines • NCR & RCA produced small transistor machines • IBM 7000 • Digital Equipment Corporation (DEC) - 1957 – Produced PDP-1 2-17 Yonsei University IBM 700/7000 Series 2-18 Model Number First Delivery 701 1952 Vacuum Tubes 704 1955 709 A brief history of computers CPU Memory Technology Technology Cycle Time(㎲) Memory Size(K) ElectroStatic tubes 30 2-4 Vacuum Tubes Core 12 4-32 1958 Vacuum Tubes Core 12 32 7090 1960 Transistor Core 2.18 32 7094 I 1962 Transistor Core 2 32 7094 II 1964 Transistor Core 1.4 32 Yonsei University IBM 700/7000 Series A brief history of computers Number Hardwired I/O Instruction of Index Floating Overlap Fetch Registers Point (Channels) Overlap Speed (relative To 701) Model Number Number of Opcodes 701 24 0 No No No 1 704 80 3 Yes No No 2.5 709 140 3 Yes Yes No 4 7090 169 3 Yes Yes No 25 Yes Yes 30 Yes Yes 50 7094 I 185 7 7094 II 185 7 2-19 Yes (double Precision) Yes (double Precision) Yonsei University An IBM 7094 Configuration A brief history of computers Mag Tape Units CPU Data Channel Card Punch Line Printer Card Reader Multiplexor Drum Data Channel Disk Data Channel Disk Hypertapes Memory Data Channel 2-20 Teleprocessing Equipment Yonsei University The IBM 7094 A brief history of computers • The most important point is the use of data channels. A data channel is an independent I/O module with its own processor and its own instruction set. • Another new feature is the multiplexor, which is the central termination point for data channel, the CPU, and memory. 2-21 Yonsei University Microelectronics A brief history of computers • Literally - “small electronics” • A computer is made up of gates, memory cells and interconnections • These can be manufactured on a semiconductor • e.g. 
silicon wafer 2-22 Yonsei University Microelectronics A brief history of computers • Data storage – Provided by memory cells • Data processing – Provided by gates • Data movement – The paths between components are used to move data from memory to memory and from memory through gates to memory • Control – The paths between components can carry control signals. The memory cell will store the bit on its input lead when the WRITE control signal is ON and will place that bit on its output lead when the READ control signal is ON. 2-23 Yonsei University Wafer, Chip, and Gate A brief history of computers Wafer Chip Package Chip Gate • Small-scale integration (SSI) 2-24 Yonsei University Generations of Computer A brief history of computers • Vacuum tube - 1946-1957 • Transistor - 1958-1964 • Small scale integration - 1965 on – Up to 100 devices on a chip • Medium scale integration - to 1971 – 100-3,000 devices on a chip • Large scale integration - 1971-1977 – 3,000 - 100,000 devices on a chip • Very large scale integration - 1978 to date – 100,000 - 100,000,000 devices on a chip • Ultra large scale integration – Over 100,000,000 devices on a chip 2-25 Yonsei University Moore’s Law A brief history of computers • Increased density of components on chip • Gordon Moore - cofounder of Intel • Number of transistors on a chip will double every year • Since 1970’s development has slowed a little – Number of transistors doubles every 18 months • Cost of a chip has remained almost unchanged • Higher packing density means shorter electrical paths, giving higher performance • Smaller size gives increased flexibility • Reduced power and cooling requirements • Fewer interconnections increases reliability 2-26 Yonsei University Growth in CPU Transistor Count 2-27 A brief history of computers Yonsei University IBM 360 series A brief history of computers • 1964 • Replaced (& not compatible with) 7000 series • First planned “family” of computers – – – – – – Similar or identical instruction 
sets Similar or identical O/S Increasing speed Increasing number of I/O ports(i.e. more terminals) Increased memory size Increased cost • Multiplexed switch structure 2-28 Yonsei University Key Characteristics of 360 Family A brief history of computers • Many of its features have become standard on other large computers Characters 2-29 Model 30 Model 40 Model 50 Model 65 Model 75 Maximum memory size (bytes) 64K 256K 256K 512K 512K Data rate from memory (Mbytes/s) 0.5 0.8 2.0 8.0 16.0 Processor cycle time (㎲) 1.0 0.625 0.5 0.25 0.2 Relative speed 1 3.5 10 21 50 Maximum number of data channels 3 3 4 6 6 Maximum data rate on one channel (Mbytes/s) 250 400 800 1250 1250 Yonsei University DEC PDP-8 • • • • • A brief history of computers 1964 First minicomputer (after miniskirt!) Did not need air conditioned room Small enough to sit on a lab bench $16,000 – $100k+ for IBM 360 • Embedded applications & OEM • Later models of the PDP-8 used a bus structure that is now virtually universal for minicomputers and microcomputers 2-30 Yonsei University PDP-8/E Block Diagram A brief history of computers • Highly flexible architecture allowing modules to be plugged into the bus to create various configurations 2-31 Yonsei University Semiconductor Memory A brief history of computers • The first application of integrated circuit technology to computers – construction of the processor – also used to construct memories • 1970 • Fairchild • Size of a single core – i.e. 
1 bit of magnetic core storage • Holds 256 bits • Non-destructive read • Much faster than core • Capacity approximately doubles each year 2-32 Yonsei University

Evolution of Intel Microprocessors 2-33 A brief history of computers Yonsei University

Evolution of Intel Microprocessors 2-34 A brief history of computers Yonsei University

Evolution of Intel Microprocessors 2-35 A brief history of computers Yonsei University

Microprocessor Speed Design for performance • In memory chips, the relentless pursuit of speed has quadrupled the capacity of DRAM roughly every three years • Pipelining • On-board cache • On-board L1 & L2 cache • Branch prediction • Data flow analysis • Speculative execution 2-36 Yonsei University

Evolution of DRAM / Processor Characteristics Design for performance 2-37 Yonsei University

Performance Mismatch Design for performance • Processor speed increased • Memory capacity increased • Memory speed lags behind processor speed 2-38 Yonsei University

Performance Balance Design for performance • The interface between processor and main memory is the most crucial pathway in the entire computer → it is responsible for carrying a constant flow of program instructions and data between memory chips and the processor 2-39 Yonsei University

Trends in DRAM Use Design for performance 2-40 Yonsei University

Performance Balance Design for performance • On average, the number of DRAMs per system is going down.
• The solid black lines in the figure show that, for a fixed-size memory, the number of DRAMs needed is declining • The shaded bands show that, for a particular type of system, main memory size has slowly increased while the number of DRAMs has declined 2-41 Yonsei University

Solutions Design for performance • Increase the number of bits retrieved at one time – Make DRAM "wider" rather than "deeper" • Change the DRAM interface – Cache • Reduce the frequency of memory access – More complex cache and cache on chip • Increase interconnection bandwidth – High-speed buses – Hierarchy of buses 2-42 Yonsei University

Performance Balance Design for performance • Two constantly evolving factors to be coped with – The rate at which performance is changing in the various technology areas differs greatly from one type of element to another – New applications and new peripheral devices constantly change the nature of the demand on the system in terms of typical instruction profile and data access patterns 2-43 Yonsei University

Intel Pentium and PowerPC evolution • Pentium - result of design effort on CISCs • 1971 - 4004 – First microprocessor – All CPU components on a single chip – 4 bit • Followed in 1972 by the 8008 – 8 bit – Both designed for specific applications • 1974 - 8080 – Intel's first general-purpose microprocessor • 8086 – 16 bit, instruction cache (queue) • 80286 – addressing a 16-Mbyte memory 2-44 Yonsei University

Intel Pentium and PowerPC evolution • 80386 – 32 bit, multitasking • 80486 – built-in math coprocessor • Pentium – superscalar techniques • Pentium Pro • Pentium II – Intel MMX technology • Pentium III – additional floating-point instructions • Merced – 64-bit organization 2-45 Yonsei University

PowerPC Pentium and PowerPC evolution • RISC systems • PowerPC Processor Summary 2-46 Yonsei University

Two Notions of Performance Performance evaluation
Plane | DC to Paris | Speed | Passengers | Throughput (pmph)
Boeing 747 | 6.5 hours | 610 mph | 470 | 286,700
BAC/Sud Concorde | 3
hours | 1350 mph | 132 | 178,200
• Which has higher performance? – Time to do the task (Execution Time) • execution time, response time, latency – Tasks per day, hour, week, sec, ns... (Performance) • throughput, bandwidth – Response time and throughput are often in opposition 2-47 Yonsei University

To Assess Performance Performance evaluation • Response Time – Time to complete a task • Throughput – Total amount of work done per unit time • Execution Time (CPU Time) – User CPU time • Time spent in the program – System CPU time • Time spent in the OS • Elapsed Time – Execution time + time of I/O and time sharing 2-48 Yonsei University

Criteria of Performance Performance evaluation • Execution time seems to measure the power of the CPU • Elapsed time measures the performance of the whole system, including OS and I/O • The user is interested in elapsed time • Salespeople are interested in the highest performance number that can be quoted • A performance analyst is interested in both execution time and elapsed time 2-49 Yonsei University

Definitions Performance evaluation • Performance is in units of things-per-second – bigger is better • If we are primarily concerned with response time – performance(X) = 1 / execution_time(X) • "X is n times faster than Y" means n = Performance(X) / Performance(Y) 2-50 Yonsei University

Example Performance evaluation • Time of Concorde vs. Boeing 747? – bigger is better – Concorde is 1350 mph / 610 mph = 2.2 times faster (= 6.5 hours / 3 hours) • Throughput of Concorde vs. Boeing 747?
– Concorde is 178,200 pmph / 286,700 pmph = 0.62 times faster – Boeing is 286,700 pmph / 178,200 pmph = 1.6 times faster • Boeing is 1.6 times (60%) faster in terms of throughput • Concorde is 2.2 times (120%) faster in terms of flying time • We will focus primarily on execution time for a single job 2-51 Yonsei University

Basis of Evaluation Performance evaluation
• Actual target workload – Pros: representative – Cons: very specific, non-portable, difficult to run or measure, hard to identify cause
• Full application benchmarks – Pros: portable, widely used, improvements useful in reality – Cons: less representative
• Small kernel benchmarks – Pros: easy to run, early in design cycle – Cons: easy to "fool"
• Microbenchmarks – Pros: identify peak capability and potential bottlenecks – Cons: peak may be a long way from application performance
2-52 Yonsei University

MIPS Performance evaluation • Millions of Instructions (Executed) Per Second • Often-used measure of performance • Native MIPS = instruction count / (execution time × 10^6) = instruction count / (CPU clocks × cycle time × 10^6) = (instruction count × clock rate) / (CPU clocks × 10^6) = (instruction count × clock rate) / (instruction count × CPI × 10^6) = clock rate / (CPI × 10^6) 2-53 Yonsei University

MIPS Performance evaluation • Meaningless information – Run a program and time it – Count the number of executed instructions to get the MIPS rating • Problems – Cannot compare different computers with different instruction sets – Varies between programs executed on the same computer • Peak MIPS – This is what many manufacturers provide – Usually neglecting "peak" 2-54 Yonsei University

Relative MIPS Performance evaluation • Call the VAX 11/780 a 1-MIPS machine (not true) • Relative MIPS of machine A = (CPU time of VAX 11/780 / CPU time of machine A) × MIPS of VAX 11/780 = CPU time of VAX 11/780 /
CPU time of machine A • Makes the MIPS rating more independent of benchmark programs • The advantage of relative MIPS is small 2-55 Yonsei University

FLOPS Performance evaluation • Millions of Floating-Point Operations Per Second • Used for engineering and scientific applications where floating-point operations account for a high fraction of all executed instructions • Problems – Program dependent – Many programs do not use floating-point operations – Machine dependent – Depends on the relative mixture of integer and floating-point operations – Depends on the relative mixture of cheap (+, -) and expensive (×) floating-point operations • Normalized FLOPS (relative FLOPS) • Peak FLOPS 2-56 Yonsei University

SPEC Marks Performance evaluation • System Performance Evaluation Cooperative • Non-profit group initially founded by Apollo, HP, MIPSCo, and Sun • Now includes many more, such as IBM, DEC, AT&T, Motorola, etc. • Measures the ratio of execution time on the target machine to that on a VAX 11/780 • Summarizes performance by taking the geometric mean of the ratios 2-57 Yonsei University

SPEC95 Performance evaluation • Eighteen application benchmarks (with inputs) reflecting a technical computing workload • Eight integer – go, m88ksim, gcc, compress, li, ijpeg, perl, vortex • Ten floating-point intensive – tomcatv, swim, su2cor, hydro2d, mgrid, applu, turb3d, apsi, fpppp, wave5 • Must run with standard compiler flags – eliminates special undocumented incantations that may not even generate working code for real programs 2-58 Yonsei University

Metrics of Performance Performance evaluation Answers per month Application Useful operations per second Programming Language Compiler ISA (millions) of Instructions per second – MIPS (millions) of (F.P.)
operations per second – MFLOP/s Datapath Control Megabytes per second Function Units Transistors Wires Pins Cycles per second (clock rate) • Each metric has a place and a purpose, and each can be misused 2-59 Yonsei University

Aspects of CPU Performance Performance evaluation
CPU time = Seconds / Program = (Instructions / Program) × (Cycles / Instruction) × (Seconds / Cycle) = instruction count × CPI / clock rate
2-60 Yonsei University

Criteria of Performance Performance evaluation • CPU Time – (Instruction count) × (CPI) × (Clock cycle time) – number of instructions × (cycles / instruction) × (seconds / cycle) • Clock Rate – cycles / second – Depends on technology and organization • CPI – Cycles Per Instruction – Depends on organization and instruction set • Instruction Count – Depends on compiler and instruction set 2-61 Yonsei University

Criteria of Performance Performance evaluation • If CPI is not uniform across all instructions – CPU cycles = Σ_{i=1..n} (CPI_i × I_i) • n - number of instructions in the instruction set • CPI_i - CPI for instruction i • I_i - number of times instruction i occurs in the program • CPU Time = Σ_{i=1..n} (CPI_i × I_i × clock cycle time) • CPI = Σ_{i=1..n} (CPI_i × I_i) / (number of executed instructions) • This assumes that a given instruction always takes the same number of cycles to execute 2-62 Yonsei University

Aspects of CPU Performance Performance evaluation
CPU time = Seconds / Program = (Instructions / Program) × (Cycles / Instruction) × (Seconds / Cycle)
Factor | instr. count | CPI | clock rate
Program | X | |
Compiler | X | X |
Instr. Set | X | X |
Organization | | X | X
Technology | | | X
2-63 Yonsei University

CPI Performance evaluation • CPI = average cycles per instruction = (CPU Time × Clock Rate) / Instruction Count = Clock Cycles / Instruction Count • CPU time = Σ_{i=1..n} (Clock Cycle Time × CPI_i × I_i) • CPI = Σ_{i=1..n} (CPI_i × F_i), where F_i = I_i / Instruction Count ("instruction frequency") • Invest resources where time is spent!
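The weighted-sum formula CPI = Σ CPI_i × F_i is easy to check numerically. A minimal Python sketch, using the register/register instruction mix from the RISC example that follows (ALU 50% at 1 cycle, Load 20% at 5, Store 10% at 3, Branch 20% at 2):

```python
# CPI as a frequency-weighted average: CPI = sum over i of CPI_i * F_i,
# where F_i is the fraction of executed instructions belonging to class i.
mix = {              # class: (frequency F_i, cycles CPI_i)
    "ALU":    (0.50, 1),
    "Load":   (0.20, 5),
    "Store":  (0.10, 3),
    "Branch": (0.20, 2),
}
cpi = sum(f * c for f, c in mix.values())
print(f"CPI = {cpi:.1f}")                           # CPI = 2.2
for op, (f, c) in mix.items():
    # fraction of total execution time spent in this class: (F_i * CPI_i) / CPI
    print(f"{op:6s} time share = {f * c / cpi:.0%}")
```

The time shares it prints (ALU 23%, Load 45%, Store 14%, Branch 18%) show where "invest resources where time is spent" points: loads dominate time despite being only 20% of instructions.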
2-64 Yonsei University

Example of RISC Base Machine (Reg / Reg) Performance evaluation
Op | Freq | Cycles | CPI(i) | % Time
ALU | 50% | 1 | 0.5 | 23%
Load | 20% | 5 | 1.0 | 45%
Store | 10% | 3 | 0.3 | 14%
Branch | 20% | 2 | 0.4 | 18%
Typical mix: total CPI = 2.2
• How much faster would the machine be if a better data cache reduced the average load time to 2 cycles? • How does this compare with using branch prediction to shave a cycle off the branch time? • What if two ALU instructions could be executed at once? 2-65 Yonsei University

Amdahl's Law Performance evaluation • Speedup due to enhancement E: Speedup(E) = (ExTime without E) / (ExTime with E) = (Performance with E) / (Performance without E) • Suppose that enhancement E accelerates a fraction F of the task by a factor S and the remainder of the task is unaffected; then: ExTime(with E) = ((1 - F) + F/S) × ExTime(without E) Speedup(with E) = 1 / ((1 - F) + F/S) 2-66 Yonsei University

Cost Performance evaluation • Traditionally ignored by textbooks because of rapid change • Driven by the learning curve: manufacturing costs decrease with time • Understanding learning-curve effects on yield is key to cost projection • Yield – Fraction of manufactured items that survive the testing procedure • Testing and packaging – Big factors in lowering costs 2-67 Yonsei University

Cost Performance evaluation • Cost of chips – Cost = (manufacture + testing + packaging) / final yield – Cost of die = cost of wafer / (dies per wafer × die yield) – Wafer yield = dies / wafer • Cost vs. Price – Component cost: 15~33% – Direct cost: 6~8% – Gross margin: 34~39% – Average discount: 25~40% 2-68 Yonsei University
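Amdahl's Law is simple enough to sketch directly. This follows the formula on the slide (F = fraction of execution time affected, S = speedup of that fraction); the function name and the example numbers are my own, tied to the RISC mix above, where loads account for about 45% of execution time:

```python
# Amdahl's Law: Speedup(with E) = 1 / ((1 - F) + F / S), where enhancement E
# accelerates a fraction F of execution time by a factor S.
def speedup(f: float, s: float) -> float:
    return 1.0 / ((1.0 - f) + f / s)

# Example: loads take ~45% of time; a better data cache cutting the average
# load from 5 cycles to 2 (S = 2.5) gives only a modest overall gain.
print(f"speedup = {speedup(0.45, 2.5):.2f}")   # speedup = 1.37
```

Note how the unenhanced fraction bounds the gain: even with S → ∞, the overall speedup cannot exceed 1 / (1 - F), about 1.8 here. That is why the slide's cache question yields a far smaller answer than "2.5× faster loads" might suggest.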