Computer Structure (מבנה מחשבים) 0368-2159
Lecture 1: Introduction
Nathan Intrator and Yehuda Afek
Teaching assistants: Hillel Avni, Noa Ben-Amos

What is Computer Structure?
• Hardware: transistors
• Logic circuits
• Computer architecture

What we will talk about today:
• Introduction: Computer Architecture
• Administrative Matters
• History
• From conductors and electricity to basic binary operations in the computer:
  – Voltage
  – Conductors
  – Silicon: a semiconductor
  – The transistor
  – Binary operations in electronic components

Computing Devices Then…
EDSAC, University of Cambridge, UK, 1949

Computing Devices Now
Sensor nets, cameras, set-top boxes, games, media players, laptops, servers, routers, smart phones, automobiles, robots, supercomputers

Computer Structure: what is it?

Mother board

The paradigm (Patterson)
Every Computer Scientist should master the “AAA”:
• Architecture
• Algorithms
• Applications

Computer Architecture: GOAL
Fast, Effective and Cheap
The goal of Computer Architecture:
• To build “cost effective systems”
  – How do we calculate the cost of a system?
  – How do we evaluate the effectiveness of the system?
• To optimize the system
  – What are the optimization points?
Fact: most computer systems still use the Von Neumann principle of operation, even though internally they are very different from the computers of that time.
Anatomy: 5 components of any Computer (since 1946)
Personal Computer
• Computer
  – Processor: Control (“brain”) and Datapath (“brawn”)
  – Memory (where programs, data live when running)
• Devices
  – Input: Keyboard, Mouse
  – Output: Display, Printer
  – Disk (where programs, data live when not running)

Computer System Structure
[Figure: block diagram. Labels: CPU, Cache, CPU BUS, Bridge, Mem BUS, Memory, I/O BUS, SCSI/IDE adapter, LAN adapter, USB hub, Graphic adapter, SCSI bus, Hard Disk, LAN, Keyboard, Mouse, Scanner, Video Buffer]

The Instruction Set: a Critical Interface
[Figure: the instruction set drawn as the interface layer between software and hardware]

What is “Computer Architecture”?
Computer Architecture = Instruction Set Architecture + Machine Organization + …
                      = architecture + engineering

What are “Machine Structures”?
Layers of abstraction, from top to bottom:
• Application (ex: browser)
• Software: Compiler, Assembler, Operating System (Linux, Win, ..)
• Instruction Set Architecture
• Hardware: Processor, Memory, I/O system
• Datapath & Control
• Digital Design
• Circuit Design
• Transistors
• Physics
Figure label: Computer Structure (מבנה מחשבים)
* Coordination of many levels (layers) of abstraction

Levels of Representation
High Level Language Program:
    temp = v[k];
    v[k] = v[k+1];
    v[k+1] = temp;
  ↓ Compiler
Assembly Language Program:
    lw $15, 0($2)
    lw $16, 4($2)
    sw $16, 0($2)
    sw $15, 4($2)
  ↓ Assembler
Machine Language Program:
    0000 1010 1100 0101 1001 1111 0110 1000
    1100 0101 1010 0000 0110 1000 1111 1001
    1010 0000 0101 1100 1111 1001 1000 0110
    0101 1100 0000 1010 1000 0110 1001 1111
  ↓ Machine Interpretation
Control Signal Specification:
    ALUOP[0:3] <= InstReg[9:11] & MASK

Computer Architecture’s Changing Definition
• 1950s to 1960s, Computer Architecture Course: Computer Arithmetic
• 1970s to mid 1980s, Computer Architecture Course: Instruction Set Design, especially ISA appropriate for compilers
• 1990s, Computer Architecture Course: Design of CPU, memory system, I/O system, Multiprocessors, Networks
• 2000s, Computer Architecture Course: Special purpose architectures, Functionally reconfigurable, Special considerations for low power/mobile processing
• 2005 –
future (?): Multiprocessors, Parallelism
  – Synchronization, Speed-up, How to Program ??? !!!

Forces on Computer Architecture
[Figure: Computer Architecture at the center, shaped by Technology, Programming Languages, Applications, Operating Systems, History, Cleverness]

Computers in the News: Sony Playstation 2000
“The Playstation 3 will deliver nearly 2 teraflops overall performance,” said Ken Kutaragi, president and group CEO of Sony Computer Entertainment.
As reported in Microprocessor Report, Vol 13, No. 5:
• Emotion Engine: 6.2 GFLOPS, 75 million polygons per second
• Graphics Synthesizer: 2.4 billion pixels per second
• Claim: Toy Story realism brought to games!

Where are We Going??
[Figure: collage of the topics covered in Computer Structure (מבנה מחשבים): Arithmetic (Booth-encoded multiplier datapath with a 34-bit ALU and HI/LO registers), Single/multicycle Datapaths, Pipelining (IFetch, Dcd, Exec, Mem, WB stages), Memory Systems, I/O, Performance. Inset plot, Processor-Memory Performance Gap: performance (log scale, 1 to 1000) vs. time (1980-2000); µProc 60%/yr (2X/1.5 yr, “Moore’s Law”); DRAM 9%/yr (2X/10 yrs); the gap grows 50%/year.]
A slide from one of the lectures toward the end of the semester.

Course Administration
Instructors:
• Yehuda Afek (afek@post.tau.ac.il)
• Nathan Intrator (nin@post.tau.ac.il)
TA:
• Hillel Avni (hillel.avni@gmail.com)
• Noa Ben Amos (noaben4@post.tau.ac.il)
http://cs.tau.ac.il/~nin/Courses/CompStruct/CompStruct.htm
http://virtual.tau.ac.il
Books:
1. V. C. Hamacher, Z. G. Vranesic, S. G. Zaky, Computer Organization. McGraw-Hill, 1982
2. H. Taub, Digital Circuits and Microprocessors. McGraw-Hill, 1982
3.
Digital Systems (מערכות ספרתיות), published by the Open University
4. Patterson and Hennessy, Computer Organization and Design: The Hardware/Software Interface. Morgan Kaufmann, 1998

Grading
Grade: final exam 80%, exercises 20% (7 exercises)

Architecture & Microarchitecture Elements
Architecture:
• Registers data width (8/16/32/64)
• Instruction set
• Addressing modes
• Addressing methods (Segmentation, Paging, etc.)
µArchitecture:
• Physical memory size
• Cache size and structure
• Number of execution units, number of execution pipelines
• Branch prediction
• TLB
Timing is considered µArch (though it is user visible!)
Processors with the same Arch may have different µArch

Compatibility
Backward compatibility
– New hardware can run existing software
– Example: Pentium 4 can run software originally written for Pentium III, Pentium II, Pentium, 486, 386, 286
Forward compatibility
– New software can run on existing (old) hardware
– Example: new software written with MMX™ must still run on older Pentium processors which do not support MMX™
– Less important than backward compatibility
New ideas: architecture independent
– JIT (just-in-time compiler): Java and .NET
– Binary translation

How to compare between different systems?

Benchmarks: Programs for Evaluating Processor Performance
Toy Benchmarks
– 10-100 line programs
– e.g.: sieve, puzzle, quicksort
Synthetic Benchmarks
– Attempt to match average frequencies of real workloads
– e.g., Winstone, Dhrystone
Real programs
– e.g., gcc, spice
SPEC: System Performance Evaluation Cooperative
– SPECint (8 integer programs)
– SPECfp (10 floating point programs)

CPI: to compare systems with the same instruction set architecture (ISA)
The CPU is synchronous: it works according to a clock signal.
• Clock cycle is measured in nsec (10^-9 of a second).
• Clock rate (= 1/clock cycle) is measured in MHz (10^6 cycles/second).
CPI: cycles per instruction
• Average number of cycles per instruction (in a given program):
\[ \mathrm{CPI} = \frac{\#\text{cycles required to execute the program}}{\#\text{instructions executed in the program}} \]
• IPC (= 1/CPI): instructions per cycle
Clock rate is mainly affected by technology, CPI by the architecture.
CPI breakdown: how many cycles (on average) the program spends on different causes, e.g., executing, memory, I/O, etc.

CPU Time
CPU Time: the time required by the CPU to execute a given program:
\[ \text{CPU Time} = \text{clock cycle} \times \#\text{cycles} = \text{clock cycle} \times \mathrm{CPI} \times \mathrm{IC} \]
where IC is the instruction count (number of instructions executed).
Our goal: minimize CPU Time
– Minimize clock cycle: more MHz (process, circuit, µArch)
– Minimize CPI: µArch (e.g.: more execution units)
– Minimize IC: architecture (e.g.: MMX™ technology)

Speedup due to enhancement E
\[ \text{Speedup}_E = \frac{\text{ExTime}_{\text{w/o } E}}{\text{ExTime}_{\text{w/ } E}} = \frac{\text{Performance}_{\text{w/ } E}}{\text{Performance}_{\text{w/o } E}} \]

Amdahl’s Law
Suppose that enhancement E accelerates a fraction F of the task by a factor S, and the remainder of the task is unaffected. Then:
\[ \text{ExTime}_{\text{new}} = \text{ExTime}_{\text{old}} \times \left[ (1 - \text{Fraction}_{\text{enhanced}}) + \frac{\text{Fraction}_{\text{enhanced}}}{\text{Speedup}_{\text{enhanced}}} \right] \]
\[ \text{Speedup}_{\text{overall}} = \frac{\text{ExTime}_{\text{old}}}{\text{ExTime}_{\text{new}}} = \frac{1}{(1 - \text{Fraction}_{\text{enhanced}}) + \dfrac{\text{Fraction}_{\text{enhanced}}}{\text{Speedup}_{\text{enhanced}}}} \]

Amdahl’s Law: Example
• Floating point instructions are improved to run 2X faster, but only 10% of the actual instructions are FP:
\[ \text{ExTime}_{\text{new}} = \text{ExTime}_{\text{old}} \times (0.9 + 0.1/2) = 0.95 \times \text{ExTime}_{\text{old}} \]
\[ \text{Speedup}_{\text{overall}} = \frac{1}{0.95} = 1.053 \]
Corollary: Make The Common Case Fast

Instruction Set Design
• Software side: the ISA is what the user and the compiler see
• Hardware side: the ISA is what the hardware needs to implement

Why ISA is important?
Code size
• Long instructions may take more time to be fetched
• Long code requires a large memory (important in small devices, e.g., cell phones)
Number of instructions (IC)
• Reducing IC reduces execution time (assuming the same CPI and frequency)
Code “simplicity”
• Simple HW implementation, which leads to higher frequency and lower power
• Code optimization can be applied better to “simple code”

The impact of the ISA
RISC vs CISC

CISC Processors
CISC: Complex Instruction Set Computer
The idea: a high level machine language
Characteristics:
• Many instruction types, with many addressing modes
• Some of the instructions are complex:
  - Perform complex tasks
  - Require many cycles
• ALU operations directly on memory
  - Usually uses a limited number of registers
• Variable length instructions
  - Common instructions get short codes, which saves code length
Example: x86

CISC Drawbacks
Compilers do not take advantage of the complex instructions and the complex indexing methods.
Implementing complex instructions and complex addressing modes:
• complicates the processor
• slows down the simple, common instructions
• contradicts the corollary of Amdahl’s law: Make The Common Case Fast
Variable length instructions are a real pain in the neck:
• It is difficult to decode a few instructions in parallel
  - As long as an instruction is not decoded, its length is unknown
  - It is unknown where the instruction ends
  - It is unknown where the next instruction starts
• An instruction may not fit into the “right behavior” of the memory hierarchy (will be discussed in the next lectures)
Examples: VAX, x86 (!?!)

RISC Processors
RISC: Reduced Instruction Set Computer
The idea: simple instructions enable fast hardware
Characteristics:
• A small instruction set, with only a few instruction formats
• Simple instructions
  - execute simple tasks
  - require a single cycle (with pipeline)
• A few indexing methods
• ALU operations on registers only
  - Memory is accessed using Load and Store instructions only.
  - Many orthogonal registers
  - Three address machine: Add dst, src1, src2
• Fixed length instructions
Examples: MIPS™, SPARC™, Alpha™, PowerPC™

RISC Processors (Cont.)
Simple architecture ⇒ simple microarchitecture
• Simple, small and fast control logic
• Simpler to design and validate
• Room for on-die caches: instruction cache + data cache
  - Parallelize data and instruction access
• Shorter time-to-market
Using a smart compiler
• Better pipeline usage
• Better register allocation
Existing RISC processors are not “pure” RISC
• e.g., they support division, which takes many cycles

RISC and Amdahl’s Law (Example)
Compared to the CISC architecture:
• 10% of the static code, which executes 90% of the dynamic instructions, has the same CPI
• 90% of the static code, which accounts for only 10% of the dynamic instructions, has its CPI increased by 60%
• The number of instructions executed increases by 50%
• The speed of the processor (clock rate) is doubled
  - This was true at the time the RISC processors were invented
We get
\[ \frac{\mathrm{CPI}_{\text{new}}}{\mathrm{CPI}_{\text{old}}} = (1 - \text{Fraction}_{\text{enhanced}}) + \frac{\text{Fraction}_{\text{enhanced}}}{\text{Speedup}_{\text{enhanced}}} = 0.9 + 0.1 \times 1.6 = 1.06 \]
(here the “enhanced” fraction is 0.1 and its speedup is 1/1.6, i.e., a slowdown). And then
\[ \text{Speedup}_{\text{overall}} = \frac{\text{CPU Time}_{\text{old}}}{\text{CPU Time}_{\text{new}}} = \frac{\text{clock}_{\text{old}}}{\text{clock}_{\text{new}}} \times \frac{\mathrm{CPI}_{\text{old}}}{\mathrm{CPI}_{\text{new}}} \times \frac{\mathrm{IC}_{\text{old}}}{\mathrm{IC}_{\text{new}}} = \frac{2}{1.06 \times 1.5} = 1.26 \]
where clock denotes the clock cycle time.

So, what is better, RISC or CISC?
Today CISC architectures (x86) run as fast as RISC (or even faster).
The main reasons are:
• CISC instructions are translated into RISC-like instructions (ucode)
• CISC architectures use a “RISC-like engine”
We will discuss this kind of solution later on in this course.
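The two worked examples above (the floating-point enhancement and the RISC vs. CISC comparison) can be re-derived numerically. The following is a minimal Python sketch, not part of the original slides; the function names `amdahl_speedup` and `cpu_time` are illustrative, and all CISC quantities are normalised to 1.0 as an assumption of the sketch.

```python
def amdahl_speedup(fraction_enhanced: float, speedup_enhanced: float) -> float:
    """Overall speedup when a fraction of the execution time is sped up."""
    return 1.0 / ((1.0 - fraction_enhanced) + fraction_enhanced / speedup_enhanced)


def cpu_time(clock_cycle: float, cpi: float, instruction_count: float) -> float:
    """CPU Time = clock cycle x CPI x IC."""
    return clock_cycle * cpi * instruction_count


# Floating-point example: 10% of the instructions are FP, and FP runs 2x faster.
fp_speedup = amdahl_speedup(0.10, 2.0)

# RISC vs. CISC example, with every CISC quantity normalised to 1.0 (assumption).
cpi_ratio = 0.9 + 0.1 * 1.6            # 10% of dynamic instructions get 60% slower
cisc_time = cpu_time(1.0, 1.0, 1.0)
risc_time = cpu_time(0.5, cpi_ratio, 1.5)  # clock cycle halved, IC up by 50%
risc_speedup = cisc_time / risc_time

print(f"FP enhancement speedup: {fp_speedup:.3f}")
print(f"CPI ratio (new/old):    {cpi_ratio:.2f}")
print(f"RISC overall speedup:   {risc_speedup:.2f}")
```

Changing the 0.10 fraction in the first call shows how quickly the overall gain grows once the enhanced part becomes the common case.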
Technology Trends: Microprocessor Complexity
[Figure: transistors per chip (log scale, 1,000 to 100,000,000) vs. year (1970-2000). Points on the curve: i4004, i8080, i8086, i80286, i80386, i80486, Pentium. Labelled counts: Sparc Ultra: 5.2 million; Pentium Pro: 5.5 million; PowerPC 620: 6.9 million; Alpha 21164: 9.3 million; Alpha 21264: 15 million; Athlon (K7): 22 million; Itanium 2: 410 million.]
2X transistors/chip every 1.5 years: called “Moore’s Law”

Technology Trends: Processor Performance
[Figure: performance measure (0 to 900) vs. year (1987-1997), growing 1.54X/yr. Labelled points: IBM POWER 100, DEC Alpha 4/266, DEC Alpha 5/300, DEC Alpha 5/500, DEC Alpha 21264/600, Intel P4 2000 MHz (Fall 2001).]

Technology Trends: Memory Capacity (Single-Chip DRAM)
[Figure: bits per chip (log scale, 1,000 to 1,000,000,000) vs. year (1970-2000)]
year    size (Mbit)
1980    0.0625
1983    0.25
1986    1
1989    4
1992    16
1996    64
1998    128
2000    256
2002    512
• Now 1.4X/yr, or 2X every 2 years.
• 8000X since 1980!

Technology Trends Imply Dramatic Change
Processor
• Logic capacity: about 30% per year
• Clock rate: about 20% per year
Memory
• DRAM capacity: about 60% per year (4x every 3 years)
• Memory speed: about 10% per year
• Cost per bit: improves about 25% per year
Disk
• Capacity: about 60% per year
• Total data use: 100% per 9 months!
Network Bandwidth
• Bandwidth increasing more than 100% per year!

1980-2003, CPU-DRAM Speed gap
Q. How do architects address this gap?
A. Put smaller, faster “cache” memories between CPU and DRAM.
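The growth rates quoted on the trend slides above are stated in two forms (percent per year, and "NX every M years"), and converting between them is a short compounding calculation. The following is a minimal Python sketch, not part of the original slides; the helper names `annual_factor` and `doubling_time` are illustrative, and the per-year gap is computed from the 1.54X/yr processor figure and the 10%/yr memory-speed figure as an assumed pairing.

```python
import math


def annual_factor(factor: float, years: float) -> float:
    """Per-year growth factor equivalent to growing by `factor` every `years` years."""
    return factor ** (1.0 / years)


def doubling_time(annual: float) -> float:
    """Years needed to double at a per-year growth factor `annual`."""
    return math.log(2.0) / math.log(annual)


# "2X transistors/chip every 1.5 years" as a yearly factor (about 1.59).
moore = annual_factor(2.0, 1.5)

# "DRAM capacity: 4x every 3 years" as a yearly factor (about 1.59, i.e. ~60%/yr).
dram_capacity = annual_factor(4.0, 3.0)

# "1.4X/yr" expressed as a doubling time (close to 2 years).
dram_doubling_years = doubling_time(1.4)

# DRAM table: 0.0625 Mbit (1980) to 512 Mbit (2002), the "8000X" on the slide.
capacity_ratio = 512 / 0.0625

# Assumed pairing: processor performance 1.54X/yr vs. memory speed 1.10X/yr.
gap_per_year = 1.54 / 1.10

print(f"Moore's Law per year:   {moore:.3f}")
print(f"DRAM capacity per year: {dram_capacity:.3f}")
print(f"Doubling time at 1.4X:  {dram_doubling_years:.2f} years")
print(f"DRAM growth 1980-2002:  {capacity_ratio:.0f}X")
print(f"CPU/memory gap growth:  {gap_per_year:.2f}X per year")
```

Compounding the per-year gap over a decade is what makes the cache hierarchy mentioned in the answer above necessary.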
[Figure: performance (1/latency) vs. year. CPU: 60% per year (2X in 1.5 years); DRAM: 9% per year (2X in 10 years); the gap grew 50% per year. Annotation: “The power wall”.]

Dimensions
2006: 0.04 × 10^-6; 2005: 0.12 × 10^-6 = 1.2 × 10^-7 (meters)
[Figure: length scale from 1 cm down to 1 Å]
• 1 cm: chip size (1 cm)
• 25 µm: diameter of a human hair
• 0.35 µm: 1996 devices
• 0.248 µm: deep UV wavelength
• 0.18 µm: 2001 devices
• 0.01 µm: 2007 devices
• 0.6 nm: X-ray wavelength
• 1.17 Å: silicon atom radius

Computer architecture in the coming years
• Past: energy / power consumption was a non-issue.
  Today: Power Wall. Power is expensive; transistors are free.
• Past: performance improved through instruction-level parallelism, smart compilers, and single-CPU architectures (pipelining, superscalar, out-of-order execution, speculation).
  Today: ILP Wall. Hardware enhancements for more performance no longer pay off.
• Past: multiplication was slow, memory access was fast.
  Today: Memory Wall. Multiplication is fast, memory accesses are slow (200 clock cycles for a DRAM access, 4 cycles for a multiply).
• Past: single-processor performance 2X every 1.5 years.
  Today: given all of the above, perhaps 2X every 5 years??
• But: 2X processors (cores) every two years. Today: 4 to 40 cores per processor.

Physics / Transistor’s History
1906: Audion (triode), Lee De Forest
1947: First point-contact transistor (germanium), John Bardeen and Walter Brattain, Bell Laboratories

History
1958: First integrated circuit (germanium), Jack S.
Kilby, Texas Instruments. Contained five components of three types: transistors, resistors and capacitors.
1997: Intel Pentium II. Clock: 233 MHz; number of transistors: 7.5 M; gate length: 0.35 µm

Annual Sales
10^18 transistors manufactured in 2003 alone
• 100 million for every human on the planet
[Figure: Global Semiconductor Billings (billions of US$, 0 to 200) vs. year (1982-2002)]

Integrated Circuits (2003 state-of-the-art)
[Figure: bare die; chip in package]
• Primarily crystalline silicon
• 1 mm - 25 mm on a side
• 2003 feature size ~ 0.13 µm = 0.13 × 10^-6 m
• 100 - 400M transistors (25 - 100M “logic gates”)
• 3 - 10 conductive layers
• “CMOS” (complementary metal oxide semiconductor) is the most common
Package provides:
• spreading of chip-level signal paths to board-level
• heat dissipation
Ceramic or plastic with gold wires.

Printed Circuit Boards
• fiberglass or ceramic
• 1-20 conductive layers
• 1-20 in on a side
• IC packages are soldered down

nMOS Transistor
Four terminals: gate, source, drain, body
Gate – oxide – body stack looks like a capacitor
• Gate and body are conductors
• SiO2 (oxide) is a very good insulator
• Called metal – oxide – semiconductor (MOS) capacitor
• Even though the gate is no longer made of metal
[Figure: nMOS cross-section. Source and drain are n+ regions in a p-type body (bulk Si); polysilicon gate over the SiO2 gate oxide (good insulator, ε_ox = 3.9 ε_0); dimensions W, L and t_ox marked; switch symbols for the Off and On states.]

nMOS Operation
Body is commonly tied to ground (0 V)
When the gate is at a low voltage:
• P-type body is at low voltage
• Source-body and drain-body diodes are OFF
• No current flows, transistor is OFF
[Figure: nMOS cross-section with gate at 0; switch between S and D shown Off]

nMOS Operation Cont.
When the gate is at a high voltage:
• Positive charge on gate of MOS capacitor
• Negative charge attracted to body
• Inverts a channel under the gate to n-type
• Now current can flow through n-type silicon from source through channel to drain, transistor is ON
[Figure: nMOS cross-section with gate at 1; switch between S and D shown On]

pMOS Transistor
Similar, but doping and voltages reversed
• Body tied to high voltage (VDD)
• Gate low: transistor ON
• Gate high: transistor OFF
• Bubble indicates inverted behavior
[Figure: pMOS cross-section. Source and drain are p+ regions in n-type bulk Si; polysilicon gate over SiO2]

Example: Inverter

Example: NAND3
• Horizontal n-diffusion and p-diffusion strips
• Vertical polysilicon gates
• Metal1 VDD rail at top
• Metal1 GND rail at bottom
• 32 λ by 40 λ

CMOS Inverter
[Figure: pMOS between VDD and Y, nMOS between Y and GND, both gates driven by A]
A   pMOS   nMOS   Y
0   ON     OFF    1
1   OFF    ON     0

Multiplexers
2:1 multiplexer chooses between two inputs
S   D1   D0   Y
0   X    0    0
0   X    1    1
1   0    X    0
1   1    X    1
[Figure: mux symbol with inputs D0 (selected when S = 0) and D1 (selected when S = 1), select S, output Y]

Transmission Gate Mux
Nonrestoring mux uses two transmission gates
• Only 4 transistors
[Figure: D0 and D1 each pass through a transmission gate controlled by S and its complement; both gates drive Y]

What we learned today
• Computer Architecture: integrates several levels, from programming languages to logic design
• Instruction Set Architecture (ISA)
• Amdahl’s law
• Moore’s law
• Processor (CPU) - Memory speed gap
• History
• Transistors: what, and how
• From transistors to logic design
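The switch-level behavior of the transistors, the CMOS inverter and the transmission-gate multiplexer described in this lecture can be mimicked in a few lines. The following is a minimal Python sketch, not part of the original slides: it treats each transistor as an ideal switch (an assumption that ignores all electrical effects), and the function names are illustrative.

```python
from typing import Optional


def nmos_conducts(gate: int) -> bool:
    """An nMOS transistor is ON when its gate is high."""
    return gate == 1


def pmos_conducts(gate: int) -> bool:
    """A pMOS transistor is ON when its gate is low."""
    return gate == 0


def cmos_inverter(a: int) -> int:
    """pMOS pulls Y up to VDD, nMOS pulls Y down to GND; exactly one conducts."""
    pull_up = pmos_conducts(a)
    pull_down = nmos_conducts(a)
    assert pull_up != pull_down  # never both ON, never both OFF
    return 1 if pull_up else 0


def transmission_gate(data: int, enable: int) -> Optional[int]:
    """Pass `data` when enabled; None models a disconnected (floating) output."""
    n_on = nmos_conducts(enable)                  # nMOS gate driven by enable
    p_on = pmos_conducts(cmos_inverter(enable))   # pMOS gate driven by its complement
    return data if (n_on and p_on) else None


def mux2(s: int, d0: int, d1: int) -> int:
    """2:1 mux from two transmission gates: D0 passes when S = 0, D1 when S = 1."""
    y0 = transmission_gate(d0, cmos_inverter(s))
    y1 = transmission_gate(d1, s)
    return y0 if y0 is not None else y1


# Print the inverter and multiplexer truth tables.
for a in (0, 1):
    print(f"A={a} -> Y={cmos_inverter(a)}")
for s in (0, 1):
    for d1 in (0, 1):
        for d0 in (0, 1):
            print(f"S={s} D1={d1} D0={d0} -> Y={mux2(s, d0, d1)}")
```

The printed rows expand the "X" (don't care) entries of the multiplexer truth table into both of their possible values.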