Status of Microprocessor Technology
Advanced Computer Architecture
Spring 2013, Kyushu University
Lecturer: Farhad Mehdipour
Email: farhad@ejust.kyushu-u.ac.jp
Web: http://www.c.csce.kyushu-u.ac.jp/~farhad

A Typical Computer Organization
CPU: Central Processing Unit
RF: Register File
ALU: Arithmetic & Logic Unit
I/O: Input/Output

Designing Computers
All computers are more or less based on the same basic design: the Von Neumann Architecture!

The Von Neumann Architecture
• Model for designing and building computers, based on the following three characteristics:
1) The computer consists of four main sub-systems:
• Memory
• ALU (Arithmetic/Logic Unit)
• Control Unit
• Input/Output System (I/O)
2) The program is stored in memory during execution.
3) Program instructions are executed sequentially.

The Von Neumann Architecture
The sub-systems are connected by a bus:
• Memory: stores data and program
• Processor (CPU), containing the Control Unit and the ALU: executes the program
• ALU: does the arithmetic/logic operations requested by the program
• Input/Output: communicates with the "outside world", e.g.
– Screen
– Keyboard
– Storage devices
– ...

Classes of Computers
• 1960s - large mainframes
– Costing millions of dollars
– Stored in computer rooms
– Multiple operators
– Typical applications: business data processing and large-scale scientific computing
• 1970s - the birth of the minicomputer
– A smaller-sized and cheaper computer
• Also the emergence of supercomputers
– High-performance computers for scientific computing

Classes of Computers
• 1980s - the rise of the desktop computer based on microprocessors
– Personal computers
– Workstations
• 1990s - the emergence of
– The Internet and the World Wide Web
– The first successful handheld computing devices (personal digital assistants or PDAs)
– High-performance digital consumer electronics
– Cell phones and smartphones

Personal Mobile Device (PMD)
• Wireless devices with multimedia interfaces such as cell phones, smartphones, tablet computers and ….
• Requirements
– Cost
– Energy efficiency
– Real-time performance
– Minimized memory

Desktop Computers
• One of the largest markets in dollar terms
• Low-end (<$500) to high-end ($5K) systems
• Optimized price-performance
– Performance is measured in the number of calculations and graphics operations
– Price is what matters to customers

Servers
• Provide large-scale and more reliable file and computing services (Web servers)
• Key requirements
– Dependability – effectively provide service 24/7/365 (Yahoo!, Google, eBay)
– Scalability – server systems grow over time, so the ability to scale up the computing capacity is crucial
– Performance – transactions per minute

Clusters/Warehouse-Scale Computers
• Software as a Service (SaaS)
– Search
– Social networking
– Video sharing
– Multiplayer games
• Each node runs its own OS, and nodes communicate using a network protocol.
• The largest clusters are called Warehouse-Scale Computers (WSCs); tens of thousands of servers can act as one.
• Power (80% of the cost of a $90M WSC is associated with power and cooling)
(Picture: Google’s data center)
• As clusters grow in popularity, the number of conventional supercomputers is shrinking.

Embedded Computers
• Computers as parts of other devices where their presence is not obviously visible
– e.g., home appliances, printers, smart cards, cell phones, set-top boxes, gaming consoles, network routers.
• Fastest growing portion of the market
• Wide range of processing power and cost
– $0.1 (8-bit, 16-bit processors), $10 (32-bit, capable of executing 50M instructions per second), $100-$200 (high-end video gaming consoles and network switches)
• Requirements
– Real-time performance (e.g., time to process a video frame is limited)
– Minimized memory
– Minimized power
– Price, weight, size

Classes of Computers
• These changes in computer use have led to five different computing markets: personal mobile devices, desktop computers, servers, clusters/warehouse-scale computers, and embedded computers.

Exciting Change
It impacts every aspect of human life.
ENIAC, 1946
Occupied a 17 m x 10 m room, weighed 30 tons, contained 18,000 electronic valves, and consumed 150 kW of electrical power; capable of performing 5K additions per second

PlayStation Portable (PSP)
Approx. 170 mm (L) x 74 mm (W) x 23 mm (D)
Weight: Approx. 260 g (including battery)
CPU: PSP CPU (clock frequency 1~333 MHz)
Main Memory: 32 MB
Embedded DRAM: 4 MB
Profile: Game, Audio, Video

Evolution of Computers
First generation (1939-1954) - vacuum tube
Second generation (1954-1959) - transistor
Third generation (1959-1971) - IC
Fourth generation (1971-present) - microprocessor

Technology Used in Computers
(Pictures: transistors, vacuum tube, integrated circuit (IC), microprocessor, VLSI* chips)
*VLSI: Very large-scale integration

Wafer & Die
(Figure labels: wafer, die; dimensions 20~30 cm, x mm (e.g. 100 mm), X nm (nanometer))

Evolution of Computers
First Generation: ENIAC, 1946 (U of Penn) – Vacuum Tubes
• The first programmable electronic digital computer
• 18,000 vacuum tubes
• 30 tons, 30 m x 2.5 m x 1 m
• 5000 additions per second
• 20 × 10-decimal-digit words
• Programmed by 3000 switches
• Cost: almost $500,000 (approximately $6,000,000 today)
(became stored program in 1948 following von Neumann's advice)

Evolution of Computers
Second generation (1954-1959) - Transistor
Manchester University Experimental Transistor Computer
http://history.acusd.edu/gen/recording/computer1.html
http://www.computer50.org/kgill/transistor/trans.html

Commercialization in the 50s
• UNIVAC, 1951, the first commercial computer
– contract price $400K, actual cost ~$1M, sold 48 copies
• IBM 701, 1952, shipped 19 copies
– leased at $12K per month
• IBM 650, 1953, mass produced ~2000 units
– $200K ~ $400K
• IBM System/360, 1964
– A family of binary-compatible computers
– 19 combinations of varying speed and memory capacity from $200K ~ $2M
– Still lives on today as the “highly-profitable” IBM z900 series

Evolution of Computers
Third generation (1959-1971) - IC
PDP-8, Digital Equipment Corporation
Thanks to the use of ICs,
the DEC PDP-8 was the least expensive general-purpose small computer of the 1960s
http://history.acusd.edu/gen/recording/computer1.html
http://www.piercefuller.com/collect/pdp8.html

Cheaper or Faster in 60s and 70s
• Minicomputers
– DEC PDP-8, 1965, $20K, size of a large refrigerator
– Less powerful than “mainframes”, 10x cheaper
– Departmental computers: PDP-11 and VAXs enjoyed extreme popularity in the 70s and 80s
• Supercomputers
– Performance at all costs!!
– Biggest customers: national security, nuclear weapons, cryptography (also aerospace, petroleum, automotive, pharmaceutical, sciences); check out www.top500.org

Evolution of Computers
Fourth generation (1971-present) - microprocessor
In 1971, Intel developed the 4-bit 4004 chip for calculator applications.
Block diagram of Intel 4004 (labels): ROM/RAM buffer, Timing, Reset, Control logic, Program counter, Instruction decoder, ALU, Reg., I/O, Refresh logic, System bus
4004 chip layout
http://www.intel.com
A good review article: The History of The Microprocessor, Bell Labs Technical Journal, 1997.
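
The block diagram above names a program counter, an instruction decoder, an ALU, and registers. A minimal sketch of how such parts cooperate in a stored-program, sequentially executing machine is given below. It is a toy model only: the three opcodes (`LOAD`, `ADD`, `HALT`), the single accumulator, and the function name `run` are invented for illustration and are not the real 4004 instruction set. The only property borrowed from the slide is the 4-bit data width.

```python
def run(memory):
    """Run a toy stored-program machine until HALT; return the accumulator.

    `memory` is a list of (opcode, operand) pairs. The opcodes are
    hypothetical and chosen only to illustrate fetch/decode/execute.
    """
    pc = 0   # program counter: index of the next instruction in memory
    acc = 0  # accumulator register, kept to 4 bits
    while True:
        opcode, operand = memory[pc]  # fetch the instruction the PC points at
        pc += 1                       # sequential execution: advance the PC
        if opcode == "LOAD":          # decode, then execute
            acc = operand & 0xF
        elif opcode == "ADD":
            acc = (acc + operand) & 0xF  # ALU result wraps at 4 bits
        elif opcode == "HALT":
            return acc
        else:
            raise ValueError(f"unknown opcode: {opcode}")


# The program is stored in the same memory the machine reads from.
program = [("LOAD", 9), ("ADD", 8), ("HALT", 0)]
result = run(program)  # 9 + 8 = 17, which wraps to 1 in 4 bits
```

The wrap-around in `ADD` shows why a 4-bit datapath suits calculator digits but little else: any value above 15 needs multiple instructions to handle.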
Early Examples
DEC PDP-8, 1965: an early mini
Xerox Alto, 1973: an early “PC” with mouse

Cray 3, 1993
• Up to 16 processors and up to 2 gigawords (16 GB) of memory
• Power consumption: 90 kW
• 15 GFLOPS (1 sec on Cray 3 ≈ 67 years on ENIAC)
• $30,000,000

Microprocessor Generations
• First generation: 1971-78
– Behind the power curve (16-bit, <50k transistors)
• Second generation: 1979-85
– Becoming “real” computers (32-bit, >50k transistors)
• Third generation: 1985-89
– Challenging the “establishment” (Reduced Instruction Set Computer/RISC, >100k transistors)
• Fourth generation: 1990-
– Architectural and performance leadership (64-bit, >1M transistors, Intel/AMD translate into RISC internally)

Intel 4004 @ 70s
• Intel 4004, the first single-chip CPU
– 4-bit processor for a calculator
– 2,300 transistors
– 16-pin DIP package
– 740 kHz (eight clock cycles per CPU cycle of 10.8 microseconds)
– ~100K ops per second

Intel Itanium 9500 Series
• 64-bit processor
• 3.1 billion transistors
• 2.53 GHz, issues up to 12 instructions per cycle
• 8 cores
• 54 MByte of cache!!
In ~40 years, about 1,000,000 times growth in transistor count and performance!

Key Architectural Trends
• Performance increases at 1.6x per year (2X/1.5yr)
– True from 1985-present
• Combination of technology and architectural enhancements
– Technology provides faster transistors; faster transistors lead to higher clock rates
– More transistors (“Moore’s Law”)
• Architectural ideas turn transistors into performance
– Responsible for about half the yearly performance growth
• Two key architectural directions
– Sophisticated memory hierarchies
– Exploiting instruction-level parallelism

Moore’s Law
Transistor count doubles every 18-24 months!

Transistor Count: Intel Processors
Transistor count doubles every 18-24 months

Processor Transistor Count
Intel 4004 – 2,300 tr. (1971)
Intel P4 – 55M tr. (2001)
Intel McKinley – 221M tr. (2001)
Intel Core 2 Extreme Quadcore – 2x291M tr.
(2006)

Microprocessors (Y2K-2014)
Year of 1st shipment     1997   1999   2002   2005   2008   2011   2014
Clock Frequency (GHz)    0.75   1.2    1.6    2      2.5    3      3.674
Chip Size (mm²)          300    340    430    520    620    750    901
Transistors per chip     11M    21M    76M    200M   520M   1.4B   3.62B

Towards RISCs
• Two significant changes:
– Virtual elimination of assembly language programming reduced the need for object-code compatibility
– The creation of standardized, vendor-independent operating systems (UNIX and Linux)
• These changes made possible
– A new set of architectures with simpler instructions, called RISC (Appendix I), in the early 1980s.
• RISC-based machines focused on
– the exploitation of pipelining (Appendix II) and instruction-level parallelism (Appendix III)
– the use of caches

Growth in Processor Performance
• Advances in technology
• Innovations in computer design

Growth in Processor Performance
RISC
• ILP (pipelining, multiple instruction issue)
• Use of caches

Growth in Processor Performance
RISC: forcing prior architectures to keep up or disappear
• Digital Equipment VAX was replaced by a RISC architecture
• Intel rose to the challenge, primarily by translating x86 (or IA-32) instructions into RISC-like instructions internally

Growth in Processor Performance
RISC
• Little ILP left to exploit efficiently (ILP Wall)
• Almost unchanged memory latency (Memory Wall, Appendix IV)
• Maximum power dissipation of air-cooled chips (Power Wall, Appendix V)

Growth in Processor Performance
Move to Multiprocessor
RISC
• Maximum power dissipation of air-cooled chips
• Little ILP left to exploit efficiently
• Almost unchanged memory latency

Multiprocessor
• “We are dedicating all of our future product development to multicore designs.
… This is a sea change in computing” Paul Otellini, President, Intel (2005)
• All microprocessor companies switch to MP (2X CPUs / 2 yrs)

Manufacturer/Year    AMD/’05   Intel/’06   IBM/’04   Sun/’05
Processors/chip      2         2           2         8
Threads/Processor    1         2           2         4
Threads/chip         2         4           4         32

Future of Computers
• End of Moore’s law
– Future of VLSI technology after 2015 is unknown
– Transistor size will be measured in atoms and node charge will be measured in electrons!!
– It doesn’t mean VLSI is finished, just no more scaling
• Non-von Neumann architectures toward:
– Parallel and distributed processing
– Reconfigurable hardware computing
• Non-silicon technologies
– Nanotechnologies: carbon nanotubes, molecular switches
– Biological/cellular computers: DNA, proteins and enzymes
– Quantum computers: magnetic resonance and quantum dots
• New ways of using computers!!!

Thank you!

Appendix I: RISC - Reduced Instruction Set Architectures
• Properties of RISC architectures:
– All ops on data apply to data in registers and typically change the entire register (32 bits or 64 bits).
– The only ops that affect memory are load/store operations: memory to register, and register to memory.
– Load and store ops on data smaller than the full size of a register (32, 16, 8 bits) are often available.
– Usually instructions are few in number (this can be relative) and are typically one size.

Appendix II: Pipelining
Single-Cycle CPU
Load: IF Dec EX Mem WB (all within one long cycle)

Multiple Cycle CPU
        Cycle 1  Cycle 2  Cycle 3  Cycle 4  Cycle 5
Load    IF       Dec      EX       Mem      WB

Pipelined CPU
        Cycle 1  Cycle 2  Cycle 3  Cycle 4  Cycle 5  Cycle 6  Cycle 7  Cycle 8
Load    IF       Dec      EX       Mem      WB
Load             IF       Dec      EX       Mem      WB
Load                      IF       Dec      EX       Mem      WB
Load                               IF       Dec      EX       Mem      WB

Appendix III: Instruction Level Parallelism
• Architectural technique that allows the overlap of individual machine operations (add, mul, load, store, …)
• Multiple operations execute in parallel (simultaneously)
• Goal: speed up the execution
• Example:
instr. 1: sub R1 ← R1, “1”
instr. 2: add R4 ← R1, R3
instr. 3: add R5 ← R3, R2
• Sequential execution (without ILP): each instruction takes one cycle
Total execution time: 3 cycles
• ILP execution (overlapped execution): instr. 1 or instr. 2 can run simultaneously with instr. 3
Total execution time: 2 cycles

Appendix IV: Memory Wall

Appendix V: Power Wall
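
The cycle counts in the Appendix III example can be checked with a short scheduling sketch. It is only an illustration under stated assumptions: every instruction takes one cycle, any number of instructions may issue in the same cycle, and only read-after-write dependences are tracked (instr. 2 reads the R1 that instr. 1 writes). The tuple layout and the function names `sequential_cycles` and `ilp_cycles` are hypothetical, not part of the lecture.

```python
# Each instruction: (mnemonic, destination register, source registers).
PROGRAM = [
    ("sub", "R1", ["R1"]),        # instr. 1: R1 <- R1 - 1
    ("add", "R4", ["R1", "R3"]),  # instr. 2: R4 <- R1 + R3 (needs instr. 1)
    ("add", "R5", ["R3", "R2"]),  # instr. 3: R5 <- R3 + R2 (independent)
]


def sequential_cycles(program):
    """Without ILP: one instruction per cycle, in program order."""
    return len(program)


def ilp_cycles(program):
    """With ILP: start each instruction as soon as its sources are ready.

    Assumes unlimited functional units and one-cycle latency, and tracks
    only read-after-write dependences.
    """
    ready = {}  # register -> cycle in which its newest value is produced
    last = 0
    for _mnemonic, dest, sources in program:
        # Earliest start is the cycle after the latest producer of a source.
        start = 1 + max((ready.get(reg, 0) for reg in sources), default=0)
        ready[dest] = start
        last = max(last, start)
    return last


print(sequential_cycles(PROGRAM))  # 3
print(ilp_cycles(PROGRAM))         # 2
```

Instr. 3 shares no register with the result of instr. 1, so it is scheduled in cycle 1 alongside it, and instr. 2 follows in cycle 2. A real processor would also have to respect a limited issue width and other hazard types, which this sketch ignores.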