Computer Architecture Research Overview
Rajeev Balasubramonian
School of Computing, University of Utah
http://www.cs.utah.edu/~rajeev

What is Computer Architecture?
• If the Intel Pentium4 has a faster clock speed than the IBM Power4, does it execute your programs faster?
[Figure: Case 1 and Case 2; labels: Completing instruction, Clock tick, Time]

What is Computer Architecture?
To a large extent, computer architecture determines:
• the number of instructions used to execute a program
• the time each instruction takes to execute
• the idle cycles when no work gets done
• the number of instructions that can execute in parallel

A Typical Microprocessor
[Figure: block diagram; labels: Branch Predictor, L1 Instr Cache, L2 Cache, Decode & Rename, L1 Data Cache, Issue Logic, ALU (x4), Register File]

Architecture Trends in the 90s
• Performance was the ultimate metric
• Transistors were a limiting factor
As on-chip transistors became available in the 90s, more functionality and more complex circuitry were added to boost performance; most of the low-hanging fruit has now been picked

Hitting the Wall
We have now hit the following walls:
• Single core performance
• Memory
• Complexity
• Power, temperature

Hitting the Power Wall
[Figure: from Shekhar Borkar, MICRO’99]
Power is as important a metric today as performance

The Advent of Multi-Core Chips
[Figure: labels: Core, Cache bank]
• In the past, performance magically increased by 50% every year
• In the future, this improvement will be only ~20% every year … unless … the application is multi-threaded!
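
As a rough worked example of the two growth rates quoted on the multi-core slide, the sketch below compounds a 50% and a ~20% yearly improvement over the same period. The function name and the ten-year horizon are illustrative assumptions, not figures from the talk.

```python
def compounded_speedup(annual_rate: float, years: int) -> float:
    """Total speedup after improving by `annual_rate` every year for `years` years."""
    return (1.0 + annual_rate) ** years


if __name__ == "__main__":
    years = 10  # illustrative horizon (assumption, not from the slides)
    past = compounded_speedup(0.50, years)    # "50% every year"
    future = compounded_speedup(0.20, years)  # "~20% every year"
    print(f"50% per year for {years} years: {past:.1f}x")
    print(f"20% per year for {years} years: {future:.1f}x")
    print(f"gap between the two trends: {past / future:.1f}x")
```

Over ten years the two rates compound to roughly 58x and 6x, which is why the drop from 50% to ~20% per year matters so much for single-threaded code.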

Upcoming Architecture Challenges
• Improving single core performance
• Functionalities in multi-core chips
• Simplifying the programmer’s task
• Efficient interconnects
• Power and temperature-efficient designs
• Designs tolerant of errors
For publications, see http://www.cs.utah.edu/~rajeev/research.html

Interconnects as a Bottleneck
• In the past, on-chip data transmission on wires cost almost nothing
• Interconnect speed and power have been improving, but not at the same rate as transistor speeds
Hence, relative to computation, communication is much more expensive
• In the near future, it will take 100 cycles to travel across the chip
• 50% of chip power can be attributed to interconnects

Interconnects in Multi-Core Chips
[Figure: labels: CPU 1, CPU 2, CPU 3, L1, L2 cache, L2 control, A]

Not all Wires are Created Equal
Wire type   Relative latency   Relative area   Dynamic power (W/m)   Static power (W/m)
B-Wires     1x                 1x              2.65α                 1.02
L-Wires     0.5x               4x              1.46α                 0.57
W-Wires     1.6x               0.5x            2.9α                  1.16
PW-Wires    3.2x               0.5x            0.87α                 0.31

Data Transfers have Varying Needs
• Example of a cache coherence transaction: read-exclusive request for a shared block

Other Interconnect Choices
• Optical interconnects: signals travel at the speed of light, but converting between the optical and electrical domains has a cost
• 3D chips: shorter communication distances and low-cost vertical signal transmission, but higher power density

3D Layouts
[Figure: labels: Cluster, Cache bank, Intra-die horizontal wire, Inter-die vertical wire, Die 1, Die 0; panels: (a) Arch-1 (cache-on-cluster), (b) Arch-2 (cluster-on-cluster), (c) Arch-3 (staggered)]

Upcoming Architecture Challenges
• Improving single core performance
• Functionalities in multi-core chips
• Simplifying the programmer’s task
• Efficient interconnects
• Power and temperature-efficient designs
• Designs tolerant of errors
Clustered architectures: relatively low complexity, scalable, easily handle multiple threads

Upcoming Architecture Challenges
• Improving single core performance
• Functionalities in multi-core chips
• Simplifying the programmer’s task
• Efficient interconnects
• Power and temperature-efficient designs
• Designs tolerant of errors
Heterogeneous perf/power; cores that execute the OS; cores that verify results

Upcoming Architecture Challenges
• Improving single core performance
• Functionalities in multi-core chips
• Simplifying the programmer’s task
• Efficient interconnects
• Power and temperature-efficient designs
• Designs tolerant of errors
Hardware to support transactional memory

Upcoming Architecture Challenges
• Improving single core performance
• Functionalities in multi-core chips
• Simplifying the programmer’s task
• Efficient interconnects
• Power and temperature-efficient designs
• Designs tolerant of errors
Faults are caused by high-energy particles that deposit enough charge to toggle bits
Variations in conditions may cause a circuit to not produce its result in time

Research Methodologies
It’s all about the simulators!
• Simplescalar & Wattch & Hotspot: about 10,000 lines of C code that model the flow of instructions through a modern processor
• Inputs: a configuration file that specifies processor parameters, and a benchmark program (say, gzip)
• Outputs: how long the program runs on the simulated processor (Simplescalar), how much power is consumed (Wattch), and the peak temperature (Hotspot)

Evaluating a New Idea
• Lots of reading (it’s better than waiting for divine inspiration)
• Identify bottlenecks, identify problems, develop an idea, repeatedly question that idea
• Understand the simulator
• Engineer a solution, modify simulator code (perhaps write fewer than 1000 lines of C code)
• Analyze data (things never work the first time), engineer/optimize/debug your solution
• Write papers
• Implement in silicon?
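
To make the simulator workflow from the Research Methodologies slide concrete, here is a deliberately tiny stand-in: a configuration and an instruction trace go in, a run time comes out. This is not Simplescalar. Every name, the trace format, and all parameter values are hypothetical, and a real simulator models far more (caches, branch prediction, out-of-order issue).

```python
from dataclasses import dataclass


@dataclass
class Config:
    """Hypothetical processor parameters (stand-in for a configuration file)."""
    clock_ghz: float    # clock frequency in GHz
    issue_width: int    # instructions that can start per cycle
    miss_penalty: int   # idle cycles charged for each cache miss


def simulate(config: Config, trace: list) -> tuple:
    """Return (cycles, seconds) for a trace whose entries are "op" or "miss".

    Toy model: every entry is one instruction, `issue_width` instructions
    start per cycle, and each "miss" entry adds `miss_penalty` idle cycles.
    """
    issue_cycles = -(-len(trace) // config.issue_width)  # ceiling division
    stall_cycles = config.miss_penalty * sum(1 for e in trace if e == "miss")
    cycles = issue_cycles + stall_cycles
    seconds = cycles / (config.clock_ghz * 1e9)
    return cycles, seconds


if __name__ == "__main__":
    # Hypothetical benchmark: 1000 instructions, 100 of which miss in the cache.
    trace = ["op"] * 900 + ["miss"] * 100
    fast_clock = Config(clock_ghz=3.0, issue_width=2, miss_penalty=300)
    slow_clock = Config(clock_ghz=1.5, issue_width=4, miss_penalty=100)
    for name, cfg in (("fast clock", fast_clock), ("slow clock", slow_clock)):
        cycles, seconds = simulate(cfg, trace)
        print(f"{name}: {cycles} cycles, {seconds * 1e6:.2f} microseconds")
```

With these made-up parameters the design with the slower clock finishes first, because it issues more instructions per cycle and loses fewer cycles to misses. That is the point of the Pentium4 versus Power4 question at the start of the talk: clock speed alone does not determine run time.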

To Learn More…
• CS/EE 3810: Computer Organization
• CS/EE 6810: Computer Architecture
• CS/EE 7810: Advanced Computer Architecture
• CS/EE 7820: Parallel Computer Architecture
• CS 7937 / 7940: Architecture Reading Seminar