INTRODUCTION
Jehan-François Pâris
jparis@uh.edu

An evolving field
• Computer architectures keep changing
  – Building faster computers
    • Supercomputers and data centers
  – Building cheaper, smaller computers
    • Laptops, notebooks, netbooks, smartbooks
  – Putting computer systems everywhere
    • Cars, cell phones, HDTV: embedded computers

An analogy
• Electrical motors
  – Replaced the single steam engine powering many machines through transmission belts and pulleys
  – One electrical motor per machine
  – Domestic appliances, car starters, …
  – Power tools
  – Power windows, electrical toothbrushes, …

The coming revolution
• Cannot increase CPU clock frequency much beyond 2 to 3 GHz without running into unsolvable heat dissipation problems
  – Switch to multicore architectures
    • Two, four, eight, … CPUs per chip
  – Creates new problems
    • Hardware: cache synchronization
    • Software: programming these beasts (Ouch!)

Other challenges
• Reducing power consumption of data centers
  – Often contain archival data that are very rarely accessed
• Finding new ways to keep increasing magnetic disk capacity
• Dealing with physical limits to SDRAM density
  – Will never get 8 TB SODIMM modules
• Finding a replacement for hard drives

Classical computer components
• Input
• Output
• Memory
• Datapath
• Control
  – Datapath + Control = Processor
• Storage subsystem is missing!

A laptop motherboard
[Figure: a laptop motherboard]

The course philosophy
• Showing you how computers work is fine
• Showing you how to make them faster is better!

PERFORMANCE ISSUES
• Defining performance
• Measuring it
  – Not an easy task
• Evaluating the impact of
  – Amount of work done by each instruction
  – Time instructions take to run
  – CPU clock speed

Measuring Performance
• Performance is the inverse of the execution time of a benchmark
  Performance = 1 / Execution Time
• If computers A and B are such that $\text{Execution Time}_A < \text{Execution Time}_B$ for the same benchmark, then $\text{Performance}_A > \text{Performance}_B$

SPEC CPU Benchmark
• SPEC CPU2006
  – Set of 12 integer and 17 floating-point benchmarks
  – Results are normalized: Execution time on a reference processor / Execution time on benchmarked processor
  – Single value is the geometric mean of these ratios

How it is computed (I)
• Two new processors P and Q compared to a reference processor R
• Execution times for n benchmarks
  – P1, P2, …, Pn
  – Q1, Q2, …, Qn
  – R1, R2, …, Rn

How it is computed (II)
• SPEC value for processor P is
  $\mathrm{SPEC}_P = \sqrt[n]{\prod_{i=1}^{n} \frac{R_i}{P_i}}$
• Observe that
  $\frac{\mathrm{SPEC}_P}{\mathrm{SPEC}_Q} = \sqrt[n]{\prod_{i=1}^{n} \frac{Q_i}{P_i}}$
• The reference times cancel out (property of the geometric mean)

Impact of Instruction Set
• Execution Time = Number of Instructions × Mean Instruction Execution Time
  – Gave birth to the idea of more complex instruction sets
    • Each instruction does more
    • Fewer instructions

Impact of Clock Speed
• Execution Time = Number of Clock Cycles × Clock Cycle Time
  same as
  Execution Time = Number of Clock Cycles / Clock Frequency

Putting everything together
• Execution Time = Number of Instructions × Number of Clock Cycles per Instruction × Clock Cycle Time
• Gives us three ways to reduce program execution time

1. Using fewer instructions
• VAX
  – Super minicomputer designed in the late 70’s
  – Had a complicated instruction set (CISC)
  – Idea was to use more powerful instructions in order to reduce the number of instructions used to perform the most frequent tasks
  – Poor pipelining performance

2. Using a faster clock
• Major reason for the explosion of CPU performance in the 80’s and 90’s
  – IBM PC (1981): Intel 8088 @ 4.77 MHz
  – IBM PC AT (1984): Intel 80286 @ 6 and 8 MHz
  – Nowadays up to 3 GHz
• Cannot get much higher!

3. Using better instructions
• Best strategy is to reduce the average number of clock cycles per instruction
  – Favoring fast instructions
  – Using fixed-size instructions to allow pipelining
  – Trying to execute as many tasks as possible in parallel

Amdahl’s Law (I)
• Examples:
  – Supersonic jet
    • Could fly from Houston to Washington in thirty minutes
    • Total travel time would be dominated by travel time to the airport and check-in procedures
  – Today's laptops:
    • Disk access times are the bottleneck

Amdahl’s Law (II)
• Assume that we have a technique for improving the performance of some part of a system.
• Let
  – $T_o$ be the time originally spent in the part of the system that can be improved
  – $T_i$ be the time spent in that part once the improvement has been applied
  – $T_n$ be the time spent in the part of the system that remains unaffected

Amdahl’s Law (III)
• The total speedup for the whole system will be
  $\text{Speedup} = \frac{T_n + T_o}{T_n + T_i}$
• The maximum possible speedup is reached when $T_i \to 0$
  $\text{Speedup}_{\max} = \frac{T_n + T_o}{T_n}$

An example
• Flying to Washington National Airport takes three hours
• Going to the airport and waiting for the flight takes a minimum of two hours
• Going from the airport to downtown Washington takes a minimum of 30 minutes
• What is the maximum speedup that could be achieved using much faster planes?
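
The speedup formula from the Amdahl's Law slides can be sketched in a few lines of Python. This is only an illustration: the function names and the sample timings (6 s unaffected, 4 s improvable, cut to 1 s) are made up for the sketch and do not come from the slides.

```python
def amdahl_speedup(t_unaffected, t_original, t_improved):
    """Overall speedup (Tn + To) / (Tn + Ti) from Amdahl's Law."""
    return (t_unaffected + t_original) / (t_unaffected + t_improved)


def amdahl_max_speedup(t_unaffected, t_original):
    """Upper bound on the speedup, reached as Ti approaches 0."""
    return (t_unaffected + t_original) / t_unaffected


# Hypothetical workload (assumed numbers, not from the slides):
# 6 s in the unaffected part, 4 s in the improvable part, reduced to 1 s.
print(f"speedup: {amdahl_speedup(6.0, 4.0, 1.0):.3f}")
print(f"maximum: {amdahl_max_speedup(6.0, 4.0):.3f}")
```

However small the improved part becomes, the unaffected time stays in the denominator, which is why the second value bounds the first.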

Answer
• Current travel time:
  – To airport and wait: 2 hours
  – Plane: 3 hours
  – To downtown by DC metro: 30 minutes
  – Total: 5 hours 30 minutes
• Assume plane travels at the speed of light:
  – To airport and wait: 2 hours
  – Plane: negligible
  – To downtown by DC metro: 30 minutes
  – Total: 2 hours 30 minutes
• Maximum speedup would be 5h30 / 2h30 = 2.2

Trains and buses
• Commuter trains and city buses spend a significant amount of trip time debarking and embarking travelers
  – Have wide doors
• Not true for Amtrak trains and intercity buses
  – Fewer, narrower doors

A problem
• Assume we have a technique that makes floating-point operations 20 percent faster (each one takes 0.8 of its original time)
• What will be the overall CPU speedup if we expect the CPU to spend 10 percent of its time executing floating-point operations?
• How would that speedup be affected if the CPU spends 30 percent of its time executing floating-point operations?

Solution (I)
• First case:
  – Baseline time = 0.9 × 1 + 0.1 × 1 = 1
  – After improvement = 0.9 × 1 + 0.1 × 0.8 = 0.98
  – Speedup = 1/0.98 = 1.02
• A 2 percent improvement!

Solution (II)
• Second case:
  – Baseline time = 0.7 × 1 + 0.3 × 1 = 1
  – After improvement = 0.7 × 1 + 0.3 × 0.8 = 0.94
  – Speedup = 1/0.94 = 1.064
• A 6.4 percent improvement!

REVIEW PROBLEMS

Problem
• Consider a huge program that consists of a purely sequential part that takes two hours and another part that takes eight hours. What is the maximum speedup we can achieve by parallelizing the second part of the program?
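
Before moving on, the two floating-point solutions above can be checked numerically. The sketch below restates Amdahl's Law in terms of fractions of the baseline time instead of absolute times. The function name `speedup_from_fraction` and its parameters are assumptions made for this illustration, and it reads "20 percent faster" the way the solutions do, as a time factor of 0.8.

```python
def speedup_from_fraction(fraction, time_factor):
    """Overall speedup when `fraction` of the baseline time is affected.

    time_factor is new time / old time for the affected part
    (0.8 in the floating-point problem, by assumption).
    """
    new_time = (1.0 - fraction) + fraction * time_factor
    return 1.0 / new_time


# The two cases from the problem: 10 percent and 30 percent floating point.
for fraction in (0.1, 0.3):
    speedup = speedup_from_fraction(fraction, 0.8)
    print(f"{fraction:.0%} floating point: speedup = {speedup:.3f}")
```

Tripling the share of floating-point work roughly triples the gain here, but only because both gains are small; the speedup can never exceed 1 / (1 - fraction).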

Answer
• Current run time:
  – Sequential part: 2 hours
  – Other part: 8 hours
  – Total: 10 hours
• Minimum run time:
  – Sequential part: 2 hours
  – Other part: negligible
  – Total: 2 hours
• Maximum speedup: 10 / 2 = 5

Problem
• Server motherboard A has a SPEC CPU2006 rating of 31.4 while server motherboard B has a rating of 29.7. Which one of the two motherboards is faster?

Answer
• Motherboard A, because a higher SPEC value is better

Fun problem
• Shanghai maglev train runs at 268 mph
• How does it compare to an airplane for going between Houston and Washington, DC?

Fun answer
• Current travel time:
  – To airport and wait: 2 hours
  – Plane: 3 hours
  – To downtown by DC metro: 30 minutes
  – Total: 5 hours 30 minutes
• With maglev:
  – To station: 1 hour
  – Train to downtown DC: 6 hours 30 minutes
  – Total: 7 hours 30 minutes
• Plane is still faster for very long trips
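
As a recap of how the SPEC ratings compared in the motherboard problem are obtained, the sketch below computes a rating as the geometric mean of reference-to-measured time ratios. The function name `spec_rating` and all execution times are invented for illustration; they are not real SPEC CPU2006 data, and a real run uses 12 or 17 benchmarks rather than three.

```python
import math


def spec_rating(reference_times, measured_times):
    """Geometric mean of the ratios R_i / P_i over all benchmarks."""
    ratios = [r / m for r, m in zip(reference_times, measured_times)]
    return math.prod(ratios) ** (1.0 / len(ratios))


# Made-up execution times in seconds for three benchmarks.
ref = [100.0, 200.0, 400.0]   # reference processor R
p = [10.0, 25.0, 20.0]        # processor P
q = [20.0, 20.0, 40.0]        # processor Q

spec_p = spec_rating(ref, p)
spec_q = spec_rating(ref, q)
print(f"SPEC_P = {spec_p:.2f}, SPEC_Q = {spec_q:.2f}")

# Property of the geometric mean: the reference times cancel out,
# so SPEC_P / SPEC_Q equals the geometric mean of Q_i / P_i.
print(math.isclose(spec_p / spec_q, spec_rating(q, p)))
```

The last line illustrates why the choice of reference processor does not change which of two processors ranks higher.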