Computer Architecture (“MAMAS”, 234267) Spring 2012 Lecturer: Dan Tsafrir Reception: Mon 18:30, Taub 611 12/3/2012 Presentation based on slides by David Patterson, Avi Mendelson, Lihu Rappoport, and Adi Yoaz 1 Computer Architecture 2012 – Introduction (lec1) General Info Grade 20% Exercise (mandatory) תקף 80% Final exam Textbook “Computer Architecture: A Quantitative Approach” (4th Edition) by: Patterson & Hennessy Other course information 2 Course web site: http://webcourse.cs.technion.ac.il/234267/Spring2012 Lectures will be upload to the web a day before the class Computer Architecture 2012 – Introduction (lec1) Computer System Structure 3 Computer Architecture 2012 – Introduction (lec1) Classical Motherboard Diagram Cache More to the “north” = closer to the CPU = faster CPU CPU BUS North Bridge External Graphics Card DDR2 or DDR3 Channel 1 PCI express 2.0 IOMMU On-board Memory Graphics controller Serial Port Parallel Port IO Controller 4 DDR2 or DDR3 Channel 2 PCI express ×1 South Bridge Floppy Drive Mem BUS keybrd USB controller mouse SATA controller DVD Drive Hard Disk PCI Sound Card speakers Lan Adap LAN Computer Architecture 2012 – Introduction (lec1) Intel Core 2 Northbridge = MCH = mem controller hub Notice bandwidths Southbridge = ICH = I/O controller hub 65 to 45 nm 5 Computer Architecture 2012 – Introduction (lec1) Intel Nehalem Core i3 i5 i7 For high-end i-Series chips, Northbridge functionality moved onto processor (=> made faster) 45 to 32 nm 6 Computer Architecture 2012 – Introduction (lec1) Intel Sandy Bridge Core i3 i5 i7 32 to 22 nm 7 The trend continues Computer Architecture 2012 – Introduction (lec1) 8 Computer Architecture 2012 – Introduction (lec1) Course Focus Start from CPU (=processor) Move on to Memory Hierarchy 9 Caching Main memory Virtual Memory Move on to PC Architecture Instruction set, performance Pipeline, hazards Branch prediction Out-of-order execution Motherboard & chipset, DRAM, I/O, Disk, peripherals End with some Advanced Topics Computer Architecture 2012 – Introduction (lec1) The Processor 10 Computer Architecture 2012 – Introduction (lec1) Architecture vs. Microarchitecture Architecture: = The processor features as seen by its user = Interface Microarchitecture: = Manner by which the processor is implemented = Implementation details 11 Caches size and structure, number of execution units, … Note: different processors with different u-archs can support the same arch Instruction set, number of registers, addressing modes,… Example: Intel Pentium-IV vs. Intel Core2 Duo We will address both Computer Architecture 2012 – Introduction (lec1) Why Should We Care? Abstractions enhance productivity, so: Same goes for arch Just details for a programmer of a high-level language Abstractions only work so long as what’s below works 12 If we know the arch (=interface), Why should we care about the u-arch (=internals)? The taxi story: http://vimeo.com/11478146 (4:50-6:00) Computer Architecture 2012 – Introduction (lec1) Recent Processor Trends Source: http://www.scidacreview.org/0904/html/multicore.html 13 Computer Architecture 2012 – Introduction (lec1) Well-Known Moore’s Law Graph taken from: http://www.intel.com/technology/mooreslaw/index.htm 14 Computer Architecture 2012 – Introduction (lec1) 15 Computer Architecture 2012 – Introduction (lec1) The Story in a Nutshell Transistors (1000s) clock speed (MHz) power (W) Instructions/cycle (ILP) 16 Computer Architecture 2012 – Introduction (lec1) Took the Industry by Surprise 17 Computer Architecture 2012 – Introduction (lec1) Dire Implications: Performance 18 Computer Architecture 2012 – Introduction (lec1) Dire Implications: Sales 19 Computer Architecture 2012 – Introduction (lec1) Dire Implications: Sales 20 Computer Architecture 2012 – Introduction (lec1) Dire Implications: Programmers 21 Computer Architecture 2012 – Introduction (lec1) Supercomputing: “Top 500 list” 22 Computer Architecture 2012 – Introduction (lec1) Dire Implications: Supercomputing 23 Computer Architecture 2012 – Introduction (lec1) Processor Performance 24 Computer Architecture 2012 – Introduction (lec1) Metrics: IC, CPI, IPC CPUs work according to a clock signal Instruction Count (IC) Clock cycle: measured in nanoseconds (10-9 of a second) Clock frequency = 1/|clock cycle|: in GHz (109 cycles/sec) Total number of instructions executed in the program Cycles Per Instruction (CPI) Average #cycles per Instruction (in a given program) CPI = 25 #cycles required to execute the program IC IPC (= 1/CPI) : Instructions per cycles. Can be > 1; see the “story in a nutshell slide” Computer Architecture 2012 – Introduction (lec1) Minimizing Execution Time CPU Time - time required to execute a program CPU Time = IC CPI clock cycle Our goal: minimize CPU Time (any of above components) Minimize clock cycle: increase GHz (processor design) Minimize CPI: u-arch (e.g.: more execution units) Minimize IC: arch + u-arch (e.g.: SSETM) SSE = streaming SIMD extension (Intel) 26 Computer Architecture 2012 – Introduction (lec1) Alternative Way to Calculate CPI ICi = #times instruction of type-i is executed in program n IC = #instruction executed in program = IC ICi i 1 Fi = relative frequency of type-i instruction = ICi/IC CPIi = #cycles to execute type-i instruction e.g.: CPIadd = 1, CPImul = 3 n #cycles required to execute the program: # cyc CPI i ICi i 1 CPI: n # cyc CPI IC 27 CPI IC i 1 i IC i n ICi n CPIi CPIi Fi IC i 1 i 1 Computer Architecture 2012 – Introduction (lec1) Performance Evaluation: How? No simple answer Performance depends on Mathematical analysis 28 Application Input Typically impossible What to do? Computer Architecture 2012 – Introduction (lec1) Benchmarks Use benchmarks & measure how long it takes Use real applications (=> no absolute answers) Preferably standardized benchmarks (+input), e.g., SPEC INT: integer apps • Compression, C complier, Perl, text-processing, … Sometimes you see FLOPS (“pick” or “sustained”) 29 SPEC FP: floating point apps (mostly scientific) TPC benchmarks: measure transaction throughput (DB) SPEC JBB: models wholesale company (Java server, DB) Supercomputers (top500 list), against LINPACK Computer Architecture 2012 – Introduction (lec1) Evaluating Performance Use a performance simulator to evaluate the performance of a new feature / algorithm Models the uarch to a great detail Run 100’s of representative applications Produce the performance s-curve Sort the applications according to the IPC increase Baseline (0%) is the processor without the new feature 3% Bad S-curve 2% 6% Positive outliers Good S-curve Positive outliers 4% 1% 0% 2% -1% -2% Negative outliers -3% -4% 0% Small negative outliers -2% 30 Computer Architecture 2012 – Introduction (lec1) Amdahl’s Law Suppose we accelerate the computation such that P = proportion of computation we make faster S = speedup experienced by the proportion we improved For example If an improvement can speedup 40% of the computation => P = 0.4 31 If the improvement makes the portion run twice as fast => S = 2 Then overall speedup = 1 (1 P) P S Computer Architecture 2012 – Introduction (lec1) Amdahl’s Law - Example FP operations improved to run 2x faster S = 2, but… P = only affects 10% of the program Speedup: 1 1 1.053 0.1 0.95 (1 P) P (1 0.1) S 2 Conclusion 32 1 Better to make common case fast… Computer Architecture 2012 – Introduction (lec1) Amdahl’s Law – Parallelism When parallelizing a program P = proportion of program that can be made parallel 1 - P = inherently serial N = number of processing elements (say, cores) Speedup: 1 (1 P) P N Serial component imposes a hard limit 1 1 lim N (1 P) (1 P) P N 33 Computer Architecture 2012 – Introduction (lec1) Instruction Set Design software The ISA is what the user & compiler see instruction set hardware 34 The HW implements the ISA Computer Architecture 2012 – Introduction (lec1) Considerations in ISA Design Instruction size Long instructions take more time to fetch from memory Longer instructions require a larger memory • Important for small (embedded) devices, e.g., cell phones Number of instructions (IC) 35 Reduce IC => reduce runtime (at a given CPI & frequency) Virtues of instructions simplicity Simpler HW allows for: higher frequency & lower power Optimization can be applied better to simpler code Cheaper HW Computer Architecture 2012 – Introduction (lec1) Basing Design Decisions on Workload Immediate argument’s size in bits (histogram) 30% Int. Avg. FP Avg. 20% 10% 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0 0% Immediate data bits 36 1% of data values > 16-bits Having 16 bits is likely good enough Computer Architecture 2012 – Introduction (lec1) CISC Processors CISC - Complex Instruction Set Computer Example: x86 The idea: a high level machine language • Once people programmed in assembly, CISC supposedly easier Characteristic Many instruction types, with a many addressing modes Some of the instructions are complex • Execute complex tasks • Require many cycles ALU operations directly on memory (e.g., arr[j] = arr[i]+n) • Registers not used (and, accordingly, only a few registers exist) Variable length instructions • common instructions get short codes save code length 37 Computer Architecture 2012 – Introduction (lec1) But it Turns Out… Rank instruction % of total executed 1 load 22% 2 conditional branch 20% 3 compare 16% 4 store 12% 5 add 8% 6 and 6% 7 sub 5% 8 move register-register 4% 9 call 1% 10 return 1% Total 96% Simple instructions dominate instruction frequency 38 Computer Architecture 2012 – Introduction (lec1) CISC Drawbacks Complex instructions and complex addressing modes complicates the processor slows down the simple, common instructions contradicts Make The Common Case Fast Compilers don’t use complex instructions / indexing methods Variable length instructions are real pain in the neck 39 Difficult to decode few instructions in parallel • As long as instruction is not decoded, its length is unknown It is unknown where the instruction ends It is unknown where the next instruction starts An instruction may be over more than a single cache line An instruction may be over more than a single page Computer Architecture 2012 – Introduction (lec1) RISC Processors RISC - Reduced Instruction Set Computer The idea: simple instructions enable fast hardware Characteristic A small instruction set, with only a few instructions formats Simple instructions • execute simple tasks • Most of them require a single cycle (with pipeline) A few indexing methods ALU operations on registers only • Memory is accessed using Load and Store instructions only • Many orthogonal registers • Three address machine: Add dst, src1, src2 40 Fixed length instructions Examples: MIPSTM, SparcTM, AlphaTM, PowerTM Computer Architecture 2012 – Introduction (lec1) RISC Processors (Cont.) Simple arch => simple u-arch Compiler can be smarter Better pipeline usage Better register allocation Existing RISC processor are not “pure” RISC 41 Room for larger on die caches Smaller => faster Easier to design & validate (=> cheaper to manufacture) Shorten time-to-market More general-purpose registers (=> less memory refs) e.g., support division which takes many cycles Computer Architecture 2012 – Introduction (lec1) Compilers and ISA Ease of compilation Orthogonality: • no special registers • few special cases • all operand modes available with any data type or instruction type Regularity: • no overloading for the meanings of instruction fields streamlined • resource needs easily determined Register assignment is critical too 42 Easier if lots of registers Computer Architecture 2012 – Introduction (lec1) Still, CISC Is Dominant x86 (CISC) dominates the processor market Legacy CISC internally arch emulates RISC 43 A vast amount of existing software Intel, AMD, Microsoft benefit But put lot of money to compensate for disadvantage Starting at Pentium II and K6, x86 processors translate CISC instructions into RISC-like operations internally Inside core looks much like that of a RISC processor Computer Architecture 2012 – Introduction (lec1) Software Specific Extensions Extend arch to accelerate exec of specific apps Example: SSETM – Streaming SIMD Extensions 128-bit packed (vector) / scalar single precision FP (4×32) Introduced on Pentium® III on ’99 8 new 128 bit registers (XMM0 – XMM7) Accelerates graphics, video, scientific calculations, … Packed: Scalar: 128-bits x3 x2 x1 128-bits x0 x3 x2 + y3 y2 x0 + y1 y0 x3+y3 x2+y2 x1+y1 x0+y0 44 x1 y3 y2 y1 y0 y3 y2 y1 x0+y0 Computer Architecture 2012 – Introduction (lec1) BACKUP 45 Computer Architecture 2012 – Introduction (lec1) Compatibility Backward compatibility (HW responsibility) When buying new hardware, it can run existing software: • i5 can run SW written for Core2 Duo, Pentium4, PentiumM, Pentium III, Pentium II, Pentium, 486, 386, 268 BTW: Forward compatibility (SW responsibility) Architecture-independent SW 46 For example: MS Word 2003 can open MS Word 2010 doc Commonly supports one or two generations behind Run SW on top of VM that does JIT (just in time compiler): JVM for Java and CLR for .NET Interpreted languages: Perl, Python Computer Architecture 2012 – Introduction (lec1)