Computer Architecture

“The architecture of a computer is the interface between the machine and the software”
- Andris Padegs, IBM 360/370 Architect

Course Outline

Computer Architecture
Quarter: Autumn 2006-7
Instructor: Muhammad Jahangir Ikram
Office: Room 424
e-mail: jikram@lums.edu.pk
Office Hours: Monday and Wednesday, 3:00 - 4:30pm

Description
This course focuses on the principles, practices and issues in Computer Architecture, while examining computer design tradeoffs both qualitatively and quantitatively. The course starts with a quick overview of computer design fundamentals and instruction set principles, material which the student has already covered in the prerequisite of this course. The following topics are covered in greater detail:
- Advanced Pipelining
- Instruction-level parallelism and Compiler Support
- Memory-hierarchy design
- SIMD, VLIW, Superscalar Architectures
- Code Optimization and Compiler Issues

Text Book
Hennessy, J. L., and Patterson, D. A., Computer Architecture: A Quantitative Approach, 2nd Edition. Morgan Kaufmann, 1996.

Lectures
There will be two 75-minute lectures per week, plus one 50-minute lecture or 100-minute lab.
TOTAL SESSIONS = 29
There will be four labs, during weeks 2, 3, 4 and 5.

Grading
- Quizzes & assignments: 17+3%
- Laboratory: 10% (Attendance 3 + Lab Task 3 + HW 4)
- Midterm exam: 30%
- Final exam: 40%

Schedule

Sessions 1-2 (Reading: 1.1 - 1.10)
Fundamentals of Computer Design
- Measuring and Reporting Performance
- Quantitative Principles of Computer Design

Sessions 3-5 (Reading: 2.1 - 2.8)
Instruction Set Principles and Examples
- Classifying Instruction Set Architectures
- Memory Addressing
- Operations in the Instruction Set
- Encoding an Instruction Set

Session 6
LAB 1: MIPS Instruction Format and Instruction Study

Sessions 7-14 (Reading: A.1 - A.10)
Pipelining Overview
- What Is Pipelining?
- Single Cycle Computer Study (session 9)
- LAB 2: Study of Pipelining (session 12)
- The Major Hurdle of Pipelining: Pipeline Hazards
- Data Hazards
- Control Hazards and Static Branch Prediction
- LAB 3: Pipeline Studies and Control Hazards
- Scoreboarding

Session 15
MIDTERM

Sessions 17-19 (Reading: 3.1 - 3.5)
ILP and Dynamic Exploitation
- Static Branch Prediction
- Tomasulo’s Dynamic Scheduling
- Dynamic Branch Prediction
- Superscalar and VLIW architectures

Sessions 20-22 (Reading: 3.6 - 3.10)
Advanced Pipelining and ILP (Cont’d.)
- Taking Advantage of More ILP with Multiple Issue
- P6 Architecture

Sessions 23-25 (Reading: 4.1, 4.7)
Advanced Pipelining and ILP (Cont’d.)
- Compiler Support for Exploiting ILP
- Hardware Support for Extracting More Parallelism
- Putting It All Together: The PowerPC 620, and Itanium

Sessions 26-29 (Reading: 5.1 - 5.7)
Memory-Hierarchy Design
- The ABCs of Caches
- Reducing Cache Misses
- Reducing Cache Miss Penalty
- Virtual Memory System

Session 30 (Reading: 6.1 - ?)
Computer I/O

Background
- Emergence of the microprocessor in the late 1970s
- Roughly 35% growth per year
- Important changes in the marketplace:
  - Virtual elimination of assembly language programming reduced the need for object code compatibility
  - Creation of standardized, vendor-independent operating systems, such as UNIX and Linux, lowered the risk of bringing out a new architecture

Development of RISC
- These changes led to the development of a new set of architectures, called RISC (Reduced Instruction Set Computer) architectures
- RISC uses two performance techniques:
  - Instruction-level parallelism (pipelining)
  - Use of caches

Growth in Microprocessor Performance

Moore’s Law

Technology Scaling
Scaling of Transistors
- Feature size has been reduced from 3 micron in 1985 to 0.09 micron.
- Reducing feature size means a quadratic increase in transistor count and better performance.
- But it also means higher routing delays and poor performance of long wires.
- It also means more power consumption (less load capacitance).

The Itanium Processor

Intel Microprocessor Die

IC Cost Trends (Source: IC Knowledge)

Measuring Performance
Definition of time:
- Response time (elapsed time): the latency to complete the task, including disk access, input/output, operating system overhead, etc.
- CPU time:
  - User CPU time: time spent in the program.
  - System CPU time: time spent by the operating system.

Unix time command:
90.7u 12.9s 2:39 65%
(user CPU time, system CPU time, elapsed time, CPU time as a percentage of elapsed time)
The elapsed time is 2:39 (159 s), so the percentage is (90.7 + 12.9) / 159 = 65%.

What Is a Benchmark?
A benchmark is "a standard of measurement or evaluation" (Webster’s II Dictionary).
A computer benchmark is typically a computer program that performs a strictly defined set of operations - a workload - and returns some form of result - a metric - describing how the tested computer performed.
Computer benchmark metrics usually measure speed: how fast was the workload completed; or throughput: how many workload units per unit time were completed.
Running the same computer benchmark on multiple computers allows a comparison to be made.
Source: Standard Performance Evaluation Corporation

Programs to Evaluate Performance
- Real applications: for example, compilers for C, text-processing software, etc.
- Modified (or scripted) applications: CPU-oriented benchmarks; I/O may be removed to minimize its impact on execution time.
- Kernels: used to isolate the performance of individual features of a machine.
- Toy benchmarks: produce a result that the user already knows.
- Synthetic benchmarks: try to match the average frequency of operations and operands of a large set of programs.

Benchmark Suites
- SPEC95, SPEC2000 (12 Integer, 14 FP), SPEC2006 (12 Integer, 17 FP); examples: C Compiler, Router, FEM
- Desktop (CPU and graphics intensive)
- Server (file servers, web servers, transaction processing)
- Embedded (EEMBC): 34 kernels

What Is SPEC?
SPEC is the Standard Performance Evaluation Corporation. SPEC is a non-profit organization whose members include computer hardware vendors, software companies, universities, research organizations, systems integrators, publishers and consultants.
SPEC's goal is to establish, maintain and endorse a standardized set of relevant benchmarks for computer systems. Although no one set of tests can fully characterize overall system performance, SPEC believes that the user community benefits from objective tests which can serve as a common reference point.

What Does a Benchmark Measure?
The SPEC CPU benchmarks emphasize the performance of:
- the computer processor (CPU),
- the memory architecture, and
- the compilers.
SPEC CPU2006 contains two components that focus on two different types of compute-intensive performance:
- The CINT2006 suite measures compute-intensive integer performance, and
- The CFP2006 suite measures compute-intensive floating-point performance.
Source: Standard Performance Evaluation Corporation

Reference Machine
SPEC uses a historical Sun system, the "Ultra Enterprise 2", which was introduced in 1997, as the reference machine. The reference machine uses a 296 MHz UltraSPARC II processor, as did the reference machine for CPU2000. But the reference machines for the two suites are not identical: the CPU2006 reference machine has substantially better caches, and the CPU2000 reference machine could not have held enough memory to run CPU2006.
It takes about 12 days to do a rule-conforming run of the base metrics for CINT2006 and CFP2006 on the CPU2006 reference machine.
Source: Standard Performance Evaluation Corporation
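The role of the reference machine can be illustrated with a small sketch. It assumes the usual SPEC CPU convention: each benchmark's ratio is the reference machine's run time divided by the measured run time, and the overall metric is the geometric mean of those ratios. The benchmark names and all timings below are hypothetical values chosen for illustration, not published results.

```python
import math


def spec_ratio(reference_seconds, measured_seconds):
    """Ratio for one benchmark: how many times faster than the reference machine."""
    return reference_seconds / measured_seconds


def overall_score(ratios):
    """Geometric mean of the per-benchmark ratios."""
    return math.exp(sum(math.log(r) for r in ratios) / len(ratios))


# Hypothetical run times in seconds (NOT real SPEC data).
reference_times = {"bench_a": 9000.0, "bench_b": 12000.0, "bench_c": 6000.0}
measured_times = {"bench_a": 900.0, "bench_b": 600.0, "bench_c": 1200.0}

ratios = [spec_ratio(reference_times[name], measured_times[name])
          for name in reference_times]
score = overall_score(ratios)

print("per-benchmark ratios:", ratios)
print("overall score (geometric mean):", round(score, 3))
```

With a geometric mean, the comparison between two machines does not depend on which machine is chosen as the reference, which is one reason a fixed historical system can serve as the baseline.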
SPEC2000 now takes less than a minute on the latest high-performance machines.

Example Result for SPEC 2000
Source: Standard Performance Evaluation Corporation

System | #CPU | Base | Peak
Intel SE440BX-2 (800 MHz Pentium III) | 1 core, 1 chip, 1 core/chip | 340 | 344
Intel D850GB motherboard (1.4 GHz, Pentium 4 processor) | 1 core, 1 chip, 1 core/chip | 502 | 512
Sun Blade 2500 (1.28GHz) | 1 core, 1 chip, 1 core/chip | 604 | 696
Intel D850EMV2 motherboard (2.0A GHz, Pentium 4 processor) | 1 core, 1 chip, 1 core/chip | 756 | 759
Dell PowerEdge 2650 (3.06 GHz Xeon) | 1 core, 1 chip, 1 core/chip (Hyper-Threading Technology disabled) | 1014 | 1056
Dell Precision WorkStation 350 (2.8 GHz P4) | 1 core, 1 chip, 1 core/chip | 1017 | 1061
SGI Altix 3000 (1300MHz, Itanium 2) | 1 core, 1 chip, 1 core/chip | 1019 | --

System | #CPU | Base | Peak
Precision Workstation 690 (Intel® Xeon® processor 5160, 3.0 | 4 cores, 2 chips, 2 cores/chip | 3057 | 3063
PowerEdge 1950 (Intel Xeon processor 5160, 3.00GHz) | 4 cores, 2 chips, 2 cores/chip | 3061 | 3065
Intel(R) DG965WH motherboard( 2.93 GHz, Intel(R) Core(TM) 2 | 2 cores, 1 chip, 2 cores/chip | 3099 | 3109
Intel(R) DG965WH motherboard( 2.93 GHz, Intel(R) Core(TM) 2 | 2 cores, 1 chip, 2 cores/chip | 3106 | 3111
Precision Workstation 390 (Intel Core 2 Extreme processor X6 | 2 cores, 1 chip, 2 cores/chip | 3108 | 3119

Summarizing Performance

Amdahl’s Law
The performance improvement to be gained from using a faster mode of execution is limited by the fraction of the time the faster mode can be used.

Amdahl’s Law: Law of Diminishing Returns

$$\text{Speedup} = \frac{\text{Performance for entire task with the enhancement when possible}}{\text{Performance for entire task without the enhancement}}$$

$$\text{Execution time}_{\text{new}} = \text{Execution time}_{\text{old}} \times \left( (1 - \text{Fraction}_{\text{enhanced}}) + \frac{\text{Fraction}_{\text{enhanced}}}{\text{Speedup}_{\text{enhanced}}} \right)$$

$$\text{Speedup}_{\text{overall}} = \frac{\text{Execution time}_{\text{old}}}{\text{Execution time}_{\text{new}}} = \frac{1}{(1 - \text{Fraction}_{\text{enhanced}}) + \dfrac{\text{Fraction}_{\text{enhanced}}}{\text{Speedup}_{\text{enhanced}}}}$$

CPU Performance Equations

$$\text{CPU time} = \frac{\text{Instructions}}{\text{Program}} \times \frac{\text{Clock cycles}}{\text{Instruction}} \times \frac{\text{Seconds}}{\text{Clock cycle}}$$

Example:
- Frequency of FP operations = 25%
- Average CPI of FP operations = 4.0
- Average CPI of other instructions = 1.33
- Frequency of FPSQR = 2%
- CPI of FPSQR = 20
Assume two design alternatives: decrease the CPI of FPSQR to 2, OR decrease the average CPI of all FP operations to 2.5. Compare these two designs using the CPU performance equations.

Example: Solution

$$\text{CPI}_{\text{original}} = \sum_{i=1}^{n} \text{CPI}_i \times \frac{\text{IC}_i}{\text{Instruction count}} = (4 \times 25\%) + (1.33 \times 75\%) = 2.0$$

CPI for enhanced FPSQR:

$$\text{CPI}_{\text{with new FPSQR}} = \text{CPI}_{\text{original}} - 2\% \times (\text{CPI}_{\text{old FPSQR}} - \text{CPI}_{\text{new FPSQR only}}) = 2.0 - 2\% \times (20 - 2) = 1.64$$

CPI for enhanced FP operations:

$$\text{CPI}_{\text{new FP}} = (75\% \times 1.33) + (25\% \times 2.5) = 1.625$$

Since the instruction count and clock cycle time are unchanged, the design with the lower CPI is faster: enhancing all FP operations (CPI 1.625) is slightly better than enhancing FPSQR alone (CPI 1.64).

$$\text{Speedup}_{\text{new FP}} = \frac{\text{CPU time}_{\text{original}}}{\text{CPU time}_{\text{new FP}}} = \frac{\text{IC} \times \text{Clock cycle} \times \text{CPI}_{\text{original}}}{\text{IC} \times \text{Clock cycle} \times \text{CPI}_{\text{new FP}}} = \frac{\text{CPI}_{\text{original}}}{\text{CPI}_{\text{new FP}}} = \frac{2.0}{1.625} = 1.23$$

Another Measure -- MIPS

$$\text{MIPS} = \frac{\text{Instruction count}}{\text{Execution time} \times 10^{6}}$$

Example: An Embedded Processor
- 120 MIPS for the single processor.
- 80 MIPS for the processor/co-processor combination (that is how they are measured for the combination).
- I = Number of integer instructions
- F = Number of floating-point instructions (8M)
- Y = Number of integer instructions to emulate one FP instruction (50)
- W = Time for choice 1 (4 seconds)
- B = Time for choice 2

End of Lecture 1

CINT 2006

Benchmark | Language | Application area
400.perlbench | C | PERL Programming Language
401.bzip2 | C | Compression
403.gcc | C | C Compiler
429.mcf | C | Combinatorial Optimization
445.gobmk | C | Artificial Intelligence: go
456.hmmer | C | Search Gene Sequence
458.sjeng | C | Artificial Intelligence: chess
462.libquantum | C | Physics: Quantum Computing
464.h264ref | C | Video Compression
471.omnetpp | C++ | Discrete Event Simulation
473.astar | C++ | Path-finding Algorithms
483.xalancbmk | C++ | XML Processing

CFP 2006

Benchmark | Language | Application area
410.bwaves | Fortran | Fluid Dynamics
416.gamess | Fortran | Quantum Chemistry
433.milc | C | Physics: Quantum Chromodynamics
434.zeusmp | Fortran | Physics/CFD
435.gromacs | C/Fortran | Biochemistry/Molecular Dynamics
436.cactusADM | C/Fortran | Physics/General Relativity
437.leslie3d | Fortran | Fluid Dynamics
444.namd | C++ | Biology/Molecular Dynamics
447.dealII | C++ | Finite Element Analysis
450.soplex | C++ | Linear Programming, Optimization
453.povray | C++ | Image Ray-tracing
454.calculix | C/Fortran | Structural Mechanics
459.GemsFDTD | Fortran | Computational Electromagnetics
465.tonto | Fortran | Quantum Chemistry
470.lbm | C | Fluid Dynamics
481.wrf | C/Fortran | Weather Prediction
482.sphinx3 | C | Speech recognition
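
As a numerical check of Amdahl's Law and the CPU performance equation example from this lecture, the following is a minimal sketch. The function names are my own, and the inputs to the Amdahl call (40% of the time enhanced, 10x faster) are hypothetical illustration values; the FP/FPSQR numbers are the ones given in the example. The code keeps full precision, whereas the slides round CPI_original to 2.0, so intermediate values differ slightly from the slide figures.

```python
def amdahl_speedup(fraction_enhanced, speedup_enhanced):
    """Overall speedup when only part of the execution time is enhanced."""
    return 1.0 / ((1.0 - fraction_enhanced)
                  + fraction_enhanced / speedup_enhanced)


def average_cpi(mix):
    """Weighted CPI for a list of (instruction frequency, CPI) pairs."""
    return sum(frequency * cpi for frequency, cpi in mix)


# Original design: 25% FP operations at CPI 4.0, 75% other at CPI 1.33.
cpi_original = average_cpi([(0.25, 4.0), (0.75, 1.33)])

# Alternative 1: FPSQR (2% of instructions) drops from CPI 20 to CPI 2.
cpi_new_fpsqr = cpi_original - 0.02 * (20.0 - 2.0)

# Alternative 2: all FP operations drop to an average CPI of 2.5.
cpi_new_fp = average_cpi([(0.25, 2.5), (0.75, 1.33)])

# Instruction count and clock cycle time are unchanged, so they cancel
# and the speedup is simply the ratio of CPIs.
speedup_new_fpsqr = cpi_original / cpi_new_fpsqr
speedup_new_fp = cpi_original / cpi_new_fp

print("CPI original:", round(cpi_original, 4))
print("CPI with new FPSQR:", round(cpi_new_fpsqr, 4))
print("CPI with new FP:", round(cpi_new_fp, 4))
print("Speedup (new FPSQR):", round(speedup_new_fpsqr, 2))
print("Speedup (new FP):", round(speedup_new_fp, 2))

# Hypothetical Amdahl example: 40% of the time can be made 10x faster.
print("Amdahl speedup:", round(amdahl_speedup(0.4, 10.0), 4))
```

The comparison comes out the same way as in the worked solution: improving all FP operations gives the lower CPI and therefore the larger speedup.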