EE382A: Advanced Processor Architecture
Lecture 1: Introduction to Advanced Processor Architecture
Christos Kozyrakis & John Shen
Department of Electrical Engineering, Stanford University
http://eeclass.stanford.edu/ee382a
EE382A – Spring 2009 Lecture 1 Christos Kozyrakis

A Few Words About Christos
• Associate professor of EE & CS
  – Ph.D. from U.C. Berkeley
  – B.Sc. from University of Crete
• Current research
  – Parallel systems (scheduling, TM)
  – Energy efficient data-centers
  – Security systems
  – More info at http://csl.stanford.edu/~christos
• Systems I have worked on
  – Networking chips: ATLAS & Telegraphos switches
  – Processor chips: VIRAM media-processor
  – FPGA prototypes: Raksha & Atlas
  – Server prototypes: CoolSort
[Figure labels: VIRAM media-processor; IRAM test chip; ATLAS ATM switch; Telegraphos DSM switch; Raksha security system; ATLAS TM system; 125 million transistors, 9.6 billion ops/sec]
EE282 – Autumn 2009 Lecture 1 Christos Kozyrakis

A Few Words About John
• Head of Nokia Research Center in Palo Alto
  – Ph.D. from USC
  – B.Sc. from University of Michigan
• Prior to Nokia
  – Director of the Microarchitecture Research Lab (MRL) at Intel
    • Superscalar architecture, speculative multithreading and memory prefetching, 3D die-stacking technology, and heterogeneous multi-sequencer architectures
  – Professor of Computer Engineering at CMU
• Author of the main textbook for EE382a

EE382a Team
• Instructors: Christos Kozyrakis & John Shen
• Teaching assistant: David Signiorelli
• Guest lectures: Ben Lee + one more
• Administrative support: Teresa Lynn
• Contact info & office hours: up-to-date info on class webpage
  – http://eeclass.stanford.edu/ee382a
  – Check frequently

You…
• Class participation is EXTREMELY important in EE382a
• Your goals
  – Ask questions
  – Offer answers
  – Suggest discussion topics
  – Make us learn your name
    • Will take and post photos of everyone next week

Class Basics
• Lectures: Mo & We, 11am-12.15pm, Hewlett 101
  – There will also be some discussion sessions on Fridays
    • Friday 2-3pm, Gates Hall 498
    • Discussion sessions will be explicitly announced
  – The class is not available on SCPD this quarter
• Web page: http://eeclass.stanford.edu/ee382a
  – Announcements, handouts, office hours, latest schedule, bulletin board
  – Check frequently
  – Sign up on the webpage for on-line access to grades
    • We will let you know when registration is open…

The Bulletin Board
• The preferred way to ask class-related questions
  – We promise to check & answer often, especially close to deadlines
  – We encourage you to contribute to answers & have on-line discussions on class material
• The bulletin rules
  – Before posting a new question
    • Check if the question has already been asked or even answered
      – Use the search capabilities of your web browser
    • Check the FAQ page for the assignment
  – Choose an appropriate subject for your question
    • E.g. “HW2, problem 3, definition of memory latency”
• For questions not appropriate for the public: send us an email

EE382a Topics
• Pipelining overview and analysis
• Architectures for instruction level parallelism
  – Superscalar: instruction fetch, branch prediction, dynamic scheduling & register renaming, memory disambiguation
  – VLIW and dynamic binary translation
• Architecture for task and data level parallelism
  – Multithreading, multi-core architectures, vector processing, GPUs, tradeoffs in designing multi-core chips, memory hierarchy for multi-core
• Cross-cutting issues
  – Checkpointed processors, phase-change memory, …

Textbooks and Papers
• Textbooks
  – Required: "Modern Processor Design: Fundamentals of Superscalar Processors", J.P. Shen and M. Lipasti, 1st edition, McGraw-Hill
    • Do not use/buy the beta edition!
  – Reference: “Computer Architecture: A Quantitative Approach”, J. Hennessy & D. Patterson, 4th edition, Morgan Kaufmann
  – Reference: “Computer Organization and Design: The Hardware/Software Interface”, D. Patterson & J. Hennessy, 4th edition, Morgan Kaufmann
• Papers (check handouts link on the webpage)
  – A few required papers
    • These papers are included in the exam materials
    • Have to submit a 1-page paper summary by the next lecture
  – Several optional papers
    • Further in-depth information, references for projects, …

Assignments, Exams, and Class Load
• Single exam and 1+2 homework assignments
• Large research project
  – On an open question in computer architecture
  – Work in groups of up to 3 students
  – See topic suggestions on-line or suggest your own project
  – Milestones: proposal, halfway review/status, presentation, paper…
• Grade breakdown (tentative)
  – Exam 40%, Project 40%, HW + summaries + participation 20%
  – All deadlines are final, no extensions, no exceptions
  – Remember the honor code (more info on web page)
• Warnings
  – This will be a loaded class!!
  – This class will be as good as your participation…

Prerequisites and Registration
• Prerequisites: EE108B or equivalent
  – Expected to know: simple pipelines, basic caching, virtual memory, main memory
  – EE282 is not a required prerequisite
• Class registration:
  – Limited to 30 students; all students must receive instructor’s approval
• Homework 1: prerequisite assessment
  – Due in class on Monday
  – Work on it on your own
  – Will send you email about your registration by Wednesday

Should I Take EE382A?
• Good reasons to take EE382A
  – Prepare for research in computer architecture
  – Broaden your Ph.D. research perspective
  – Become a digital systems architect in industry
  – Honest curiosity (how do Intel/AMD/… processors work?)
  – Want to take a class with a research project
• Not good reasons to take EE382A
  – Prepare for quals, comps, etc…
  – Need another course for your degree program
    • “EE382A is supposed to be an easy A, right?”
  – Learn about digital circuits and CAD tools

On Reading & Summarizing Papers
• Look for the following
  – The issue or problem addressed by the paper
  – The original contributions (real or claimed, you have to check)
  – Critique: what are the major strengths and weaknesses of the paper?
    • Look at the claims and assumptions, the methodology, the analysis of data, and the presentation style
  – Future work: what are the natural extensions or improvements to this work?
    • Or, can we apply a similar methodology to other problems of interest
• Do not submit the paper abstract as your summary :)
• Helpful tips
  – Read the abstract, introduction, and conclusions sections first.
  – Read the rest of the paper twice
    • First a quick pass to get a rough idea of the details, then a detailed reading
  – Underline/highlight the important parts of the paper
  – Keep notes in the paper margins about comments or questions
    • Important insights, questionable claims, relevance to other topics, ways to improve some technique etc.
  – Look up references that seem to be important or missing
    • In some cases, you may also want to check who cites this paper and how

Historical Perspectives on Processors
• The Decade of the 1970’s: “Birth of Microprocessors”
  – Programmable Controller
  – Single-Chip Microprocessors
  – Personal Computers (PC)
• The Decade of the 1980’s: “Quantitative Architecture”
  – Instruction Pipelining
  – Fast Cache Memories
  – Compiler Considerations
  – Workstations
• The Decade of the 1990’s: “Instruction-Level Parallelism”
  – Superscalar, Speculative Microarchitectures
  – Aggressive Compiler Optimizations
  – Low-Cost Desktop Supercomputing

Performance Growth
• Doubling every 18 months (1982-2000):
  – total of 3,200X
  – Cars travel at 176,000 MPH; get 64,000 miles/gal.
  – Air travel: L.A. to N.Y. in 5.5 seconds (MACH 3200)
  – Wheat yield: 320,000 bushels per acre
• Doubling every 24 months (1971-2001):
  – total of 36,000X
  – Cars travel at 2,400,000 MPH; get 600,000 miles/gal.
  – Air travel: L.A. to N.Y. in 0.5 seconds (MACH 36,000)
  – Wheat yield: 3,600,000 bushels per acre
Unmatched by any other industry!! [John Crawford, Intel, 1993]

Evolution of Single-Chip Processors
                    1970’s      1980’s      1990’s        2010
Transistor count    10K-100K    100K-1M     1M-100M       0.5-1B
Clock frequency     0.2-2MHz    2-20MHz     20MHz-1GHz    1-5GHz
Instructions/cycle  < 0.1       0.1-0.9     0.9-2.0       10
MIPS or MFLOPS      < 0.2       0.2-20      20-2,000      100,000
Watts               < 2         < 10        < 40          1-100+ (?)
CPUs/chip           1           1           1             4-10

Convergence of Key Enabling Technologies
• CMOS VLSI:
  – Submicron feature sizes: 0.3 µm → 0.25 µm → 0.18 µm → 0.13 µm → 90 nm → 65 nm → 45 nm …
  – Metal layers: 3 → 4 → 5 → 6 → 7 (copper) → 12 …
  – Power supply voltage: 5V → 3.3V → 2.4V → 1.8V → 1.3V → 1.1V …
• CAD Tools:
  – Interconnect simulation and critical path analysis
  – Clock signal propagation analysis
  – Process simulation and yield analysis/learning
• Architecture & Microarchitecture:
  – Superpipelined and superscalar machines
  – Speculative and dynamic microarchitectures
  – Simulation tools and emulation systems
• Compilers:
  – Extraction of instruction-level parallelism
  – Aggressive and speculative code scheduling
  – Object code translation and optimization

Aspects of Computer Architecture
• ARCHITECTURE (instruction set architecture)
  – programmer/compiler view
  – “Functional appearance to its immediate user/system programmer”
• IMPLEMENTATION (microarchitecture)
  – processor designer view
  – “Logical structure or organization that implements the instruction set”
• DESIGN (chip realization)
  – chip/system designer view
  – “Physical structure that embodies the implementation”

Our Objective for this Quarter
• The “What’s-How’s-Why’s” of Processor Design
  1. Knowledge (“what’s”)
     - Technology
     - Techniques
  2. Design Skills (“how’s”)
     - Critical Issues
     - Trade-off Intuitions
  3. Understanding (“why’s”)
     - Deeper Insights
     - Fundamental Principles

Basic Tools and Principles for Architects

Amdahl’s Law
• Speedup = $time_{\text{without enhancement}} / time_{\text{with enhancement}}$
• Suppose an enhancement speeds up a fraction $f$ of a task by a factor of $S$
  $time_{new} = time_{old} \cdot \left( (1-f) + \frac{f}{S} \right)$
  $S_{overall} = \frac{1}{(1-f) + \frac{f}{S}}$
[Diagram: $time_{old}$ is split into $(1-f)$ and $f$; $time_{new}$ is split into $(1-f)$ and $f/S$]

Amdahl’s Law (continued)
• Real life analogy: After driving through 60 minutes of traffic jam, how much time can you make up by speeding in the final mile?
• Applications in Computer Architecture
  – RISC - Reduced Instruction Set Computer
  – Optimized to execute frequently used instructions quickly
  – Infrequently used instructions take longer, or are even emulated in SW
We should concentrate efforts on improving frequently occurring events or frequently used mechanisms

Pipelining
• Latency: Elapsed time from start to completion of a particular task
• Throughput: How many tasks can be completed per unit of time
• A pipeline is like an assembly line!
  start → stage1 → stage2 → stage3 → stage4 → stage5 → finish
• Pipelining only improves throughput
  – Latency: each job still takes 5 cycles to complete
  – Throughput: 1 job per cycle if pipelined vs. 1 job per 5 cycles if not pipelined

Pipelining (continued)
• Real life analogy: Henry Ford’s automobile assembly line.
• Example in computer architecture:
  – 5-stage Instruction Execution Pipeline
  – Fetch-Decode-Execute-Memory-Writeback

Stage       t0  t1  t2  t3  t4  t5  t6  t7  ....
Fetch       I1  I2  I3  I4  I5
Decode          I1  I2  I3  I4  I5
Execute             I1  I2  I3  I4  I5
Memory                  I1  I2  I3  I4  I5
Writeback                   I1  I2  I3  I4  I5

Parallel Processing
• Parallelism - the amount of independent sub-tasks available
• If sub-tasks are independent, the order in which they are carried out does not matter
• Thus by executing the independent sub-tasks concurrently, we can finish the entire task faster
  Improve Speedup!!!

Parallel Processing (continued)
• Real life analogy: collaboration on problem sets (although not always encouraged)
• Examples in computer architecture:
  – Parallel computers
  – Superscalar processors
  – Multi-core processors

Out-of-order Execution
• Specification (or Program) Order vs Dataflow Order
• Dataflow: Data-driven scheduling of events
  – The start of an event should be enabled by the availability of its required input (data dependency)
  – The completion of an event will produce an output that will enable the start of other events
  x = a + b; y = b * 2; z = (x-y) * (x+y)
[Dataflow graph: a and b feed + (producing x); b feeds *2 (producing y); x and y feed both - and +; their results feed the final *]

Out-of-order Execution (continued)
• Real life analogy:
  – A tip on taking tests: work on the questions you know first
• Examples in computer architecture
  – Most modern microprocessors (Intel P4, Opteron etc) schedule instruction execution in dataflow order

Work and Critical Path
• Work
  $T_1$ - time to complete a computation on a sequential system
• Critical Path
  $T_\infty$ - time to complete the same computation on an infinitely-parallel system
• Average Parallelism
  $P_{avg} = T_1 / T_\infty$
• For a $p$ wide system
  $T_p \ge \max\{ T_1/p,\ T_\infty \}$
  $P_{avg} \gg p \Rightarrow T_p \approx T_1/p$
  x = a + b; y = b * 2; z = (x-y) * (x+y)
[Dataflow graph: same graph as on the Out-of-order Execution slide]
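As a worked illustration of the Work and Critical Path slide, the Python sketch below computes $T_1$, $T_\infty$, and the length of a greedy $p$-wide dataflow schedule for the slide's example `x = a + b; y = b * 2; z = (x-y) * (x+y)`. It assumes every operation takes one time unit, which the slide does not state, and the names `diff` and `sum` are labels introduced here for the two unnamed intermediate results.

```python
from functools import lru_cache

# Dataflow graph for: x = a + b; y = b * 2; z = (x - y) * (x + y)
# Each operation maps to the operations it depends on.
# The inputs a and b are assumed to be available at time 0.
DEPS = {
    "x": (),               # x = a + b
    "y": (),               # y = b * 2
    "diff": ("x", "y"),    # x - y   (hypothetical label)
    "sum": ("x", "y"),     # x + y   (hypothetical label)
    "z": ("diff", "sum"),  # final multiply
}


@lru_cache(maxsize=None)
def depth(op):
    """Longest dependence chain ending at op, assuming unit latency."""
    return 1 + max((depth(d) for d in DEPS[op]), default=0)


def schedule(p):
    """Greedy dataflow schedule on p execution units; returns cycle count."""
    done, cycles = set(), 0
    while len(done) < len(DEPS):
        # An operation is ready once all of its producers have completed.
        ready = [op for op in DEPS
                 if op not in done and all(d in done for d in DEPS[op])]
        done.update(ready[:p])  # issue at most p ready operations per cycle
        cycles += 1
    return cycles


t1 = len(DEPS)                          # work: one unit per operation
t_inf = max(depth(op) for op in DEPS)   # critical path length
print(f"T1 = {t1}, Tinf = {t_inf}, Pavg = {t1 / t_inf:.2f}")
for p in (1, 2, 4):
    print(f"p = {p}: Tp = {schedule(p)}, bound = {max(t1 / p, t_inf):.1f}")
```

Under these assumptions $T_1 = 5$ and $T_\infty = 3$, so $P_{avg} = 5/3$, and two execution units already reach the critical-path bound of 3 cycles; adding more units cannot help.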
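The Amdahl's Law formula from the earlier slide can be checked numerically. The sketch below applies it to the traffic-jam analogy; the 1-minute duration of the final mile is an assumed number, chosen only to make the analogy concrete, and `amdahl_speedup` is a name introduced here.

```python
def amdahl_speedup(f, s):
    """Overall speedup when a fraction f of the time is sped up by a factor s."""
    if not 0.0 <= f <= 1.0 or s <= 0:
        raise ValueError("need 0 <= f <= 1 and s > 0")
    return 1.0 / ((1.0 - f) + f / s)


# Traffic-jam analogy: 60 minutes in the jam, then one final mile.
# Assumption (not on the slide): the final mile takes 1 minute at normal speed.
jam_minutes, last_mile_minutes = 60.0, 1.0
f = last_mile_minutes / (jam_minutes + last_mile_minutes)

for s in (2, 10, 1e9):
    print(f"speeding {s:g}x in the final mile -> "
          f"overall speedup {amdahl_speedup(f, s):.4f}")
print(f"upper bound 1 / (1 - f) = {1 / (1 - f):.4f}")
```

With these numbers the overall speedup can never exceed 61/60, about 1.017, no matter how fast the final mile is driven, which is the point of the analogy: the unimproved fraction dominates.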
Work and Critical Path (continued)
• Real life analogy: undergraduate degree requirements
  – Work = unit requirement
  – Critical Path to graduation is determined by course sequences and their prerequisites
    • Added constraints: classes are only available in specific quarters…
• Applications to computer architecture
  – Parallel job scheduling
  – Given a collection of inter-dependent tasks:
    • How many resources should be allocated?
    • Which sequence of tasks should be given priority?

Speculation
Is it possible to parallelize the critical path? i.e. violate data dependence?
• Guess the outcome of an operation from its inputs without performing the operation
• Even better, guess the outcome of an operation before the inputs to the operation are even known
• Speculation techniques must also include mechanisms for
  1. Checking if the guesses are correct
  2. Undoing “speculative execution” after wrong guesses

Speculation (continued)
• Real life analogy:
  – Another tip on taking tests: You can often guess what is going to be on an exam by looking at lectures and HWs.
• Examples in computer architecture
  – Circuit-level speculations: Carry Select Adder
  – Architectural-level speculations
    • Branch target predictions
    • Load value predictions
    • Speculative loop execution

Locality Principle
• One’s recent past is a very good indication of one’s near future
  – Temporal Locality: If you just did something, it is very likely that you will do the same thing again soon
  – Spatial Locality: If you just did something, it is very likely you will do something related or similar next
• Locality == Patterns == Predictability
  – Converse:
    • Anti-locality: If you haven’t done something for a very long time, it is very likely you won’t do it in the near future either

Locality Principle (continued)
• Real life analogy:
  – spatial locality - where you choose to sit in a room
  – temporal locality - will you be here again next week?
• Examples in computer architecture:
  – Execution of program loops
    • Spatial locality - after you execute an instruction, with very good probability, you will execute the next instruction
    • Temporal locality - you are very likely to repeat the same instructions many times

Memoization
• If something is expensive to compute, you might want to remember the answer for a while, just in case you will need the same answer again
  Why does memoization work??
• Real life analogy:
  – Keeping a list of frequently used phone numbers by your telephone
• Examples in computer architecture
  – ?

Amortization
• Overhead cost: one-time cost to set something up
• Per-unit cost: cost per unit of operation
  $\text{total cost} = \text{overhead} + \text{per-unit cost} \times N$
• It is often okay to have a high overhead cost if the cost can be distributed over a large number of units, which lowers the average cost
  $\text{average cost} = \text{total cost} / N = (\text{overhead} / N) + \text{per-unit cost}$

Amortization (continued)
• Real life analogy: economy of scale
  – Why is pasta sauce cheaper when bought by the gallon?
• Examples in computer architecture: Cache Access Latency
  $T_{miss} = 50$ cycles, $T_{hit} = 1$ cycle
  If on average a cache line is reused $n$ times before being ejected:
  $T_{avg} = \left( T_{miss} + (n-1)\,T_{hit} \right) / n \approx T_{miss}/n + T_{hit}$
  $n = 50 \Rightarrow T_{avg} \approx 2$
  $n = 2 \Rightarrow T_{avg} \approx 25$

Basic Equations and Metrics
• Performance
  – $\text{CPU time} = \text{Instruction Count} \times \text{CPI} \times \text{Clock Cycle Time}$
  – $\text{AMAT} = \text{Hit Time} + \text{Miss Rate} \times \text{Miss Penalty}$
  – Amdahl’s law, amortization
• Cost
  – $\text{Processor cost} = f(\text{die area}^4)$
• Power Consumption
  – $\text{Power} = C \cdot V_{dd}^2 \cdot F + V_{dd} \cdot I_{shortcircuit} \cdot F + V_{dd} \cdot I_{leakage}$
  – $\text{Energy} = \text{Power} \times \text{Time}$
  – $E \cdot D$, $E \cdot D^2$, $E \cdot D^3$, …
• Fault tolerance: MTTF, MTTR, …
• Design complexity: ?

Ready to Learn More?
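To make the cache example on the Amortization (continued) slide concrete, the sketch below evaluates the exact average latency next to the approximation $T_{miss}/n + T_{hit}$, using the slide's values of 50 and 1 cycles. The function name and the extra values of `n` are illustrative choices, not part of the slide.

```python
def average_access_time(n, t_miss=50, t_hit=1):
    """Average latency per access when a line misses once, then hits n - 1 times."""
    if n < 1:
        raise ValueError("a cache line is accessed at least once")
    return (t_miss + (n - 1) * t_hit) / n


for n in (1, 2, 10, 50, 1000):
    exact = average_access_time(n)
    approx = 50 / n + 1  # T_miss / n + T_hit
    print(f"n = {n:4d}  exact = {exact:6.2f}  approx = {approx:6.2f}")
```

The exact formula gives 1.98 cycles for n = 50 and 25.5 cycles for n = 2, matching the slide's rounded figures of about 2 and 25: the one-time miss cost is amortized over the reuses of the line.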
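The performance and power equations on the Basic Equations and Metrics slide translate directly into code. In the sketch below all input numbers are hypothetical, picked only to exercise the formulas, and only the dynamic term $C \cdot V_{dd}^2 \cdot F$ of the power equation is modeled.

```python
def cpu_time(instruction_count, cpi, clock_cycle_time):
    """CPU time = Instruction Count * CPI * Clock Cycle Time (seconds)."""
    return instruction_count * cpi * clock_cycle_time


def amat(hit_time, miss_rate, miss_penalty):
    """Average memory access time = Hit Time + Miss Rate * Miss Penalty."""
    return hit_time + miss_rate * miss_penalty


def dynamic_power(c, vdd, f):
    """Dynamic term of the power equation only: C * Vdd^2 * F."""
    return c * vdd ** 2 * f


# Hypothetical workload: 1e9 instructions, CPI of 1.5, 0.5 ns cycle (2 GHz).
print(f"CPU time = {cpu_time(1e9, 1.5, 0.5e-9):.2f} s")

# Hypothetical cache: 1-cycle hit, 2% miss rate, 50-cycle miss penalty.
print(f"AMAT = {amat(1, 0.02, 50):.2f} cycles")

# Halving Vdd at fixed C and F cuts the dynamic term to one quarter.
ratio = dynamic_power(1.0, 0.5, 1.0) / dynamic_power(1.0, 1.0, 1.0)
print(f"dynamic power ratio after halving Vdd = {ratio:.2f}")
```

The quadratic dependence on $V_{dd}$ is one reason the supply-voltage scaling listed on the enabling-technologies slide matters so much for power.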