Introduction to Computer Architecture (HM, 10/7/2010)

Synopsis
- Historical Perspective
- Evolution of µP Performance
- Processor Performance Growth
- Key Messages about Computer Architecture
- Code Sequences for Different Architectures
- Dependences
- Score Board
- Bibliography

Historical Perspective

Before 1940:
- 1643 Pascal's Arithmetic Machine
- About 1660 Leibniz's Four-Function Calculator
- 1710-1750 Punched cards by Bouchon, Falcon, Jacquard
- 1810 Babbage Difference Engine, unfinished
- 1835 Babbage Analytical Engine, also unfinished; Lady Ada, its 1st programmer
- 1890 Hollerith Tabulating Machine, used to help with the census in the USA

Decade of the 1940s:
- 1939-1942 John Atanasoff (with Clifford Berry) built an early electronic computer at Iowa State University
- 1936-1945 Konrad Zuse's Z3 and Z4, early electro-mechanical computers based on relays; a colleague advised using "tubes"
- 1946 Mauchly and Eckert built ENIAC at the University of Pennsylvania, modeled after Atanasoff's ideas: the Electronic Numerical Integrator and Computer, a 30-ton monster

Decade of the 1950s:
- Univac uniprocessor, based on ENIAC, commercially viable
- Commercial systems sold by Remington Rand
- Mark III computer

Decade of the 1960s:
- IBM's 360 family, co-developed with GE, Siemens, et al.
- Transistor replaces the vacuum tube
- Burroughs stack machines compete with GPR architectures
- All still von Neumann architectures, even the stack architectures
- 1969 ARPANET
- Cache and virtual memory management (VMM) developed

Decade of the 1970s:
- High point of mainframes, birth of the microprocessor
- High-end mainframes, e.g. CDC 6000 series, IBM 360/67 and 370 series
- Caches, VMM common on mainframes
- Intel 4004, Intel 8080, single-chip microprocessors
- Programmable controllers
- Minicomputers: PDP-11, HP 3000
- Expensive memories, still magnetic-core based
- Height of Digital Equipment Corp. (DEC)
- Birth of personal computers, which DEC misses

Decade of the 1980s:
- Decrease of minicomputer use
- 32-bit computing, even on minis
- Multitude of supercomputer manufacturers
- Architecture advances: fast caches, larger caches
- Compiler complexity: trace scheduling, VLIW
- Workstations common: Sun, Apollo, HP, and DEC trying to catch up

Decade of the 1990s:
- Architecture advances: superscalar pipelining, speculative execution, out-of-order execution
- Powerful desktops
- End of the minicomputer; numerous supercomputer manufacturers close
- Microprocessor as powerful as early supercomputers
- Cheaper memory technology
- Consolidation of computer companies into a small number of large ones

Decade of the 2000s:
- Architecture advances: multi-core CPUs
- Multi-threaded cores
- 64-bit computing and addressing
- Heterogeneous computer grids

Evolution of µP Performance

                            1970s       1980s      1990s        2000+
  Transistor count          10k-100k    100k-1M    1M-100M      1B
  Clock frequency           0.2-2 MHz   2-20 MHz   0.02-1 GHz   10 GHz
  Instructions/cycle (IPC)  <= 0.1      0.1-0.9    0.9-2.0      >= 10
  MFLOPs                    < 0.2       0.2-20     20-2,000     100,000

Processor Performance Growth

Moore's Law, see [1]: an observation made in 1965 by Gordon Moore, co-founder of Intel, that the number of transistors per square inch on integrated circuits had doubled every year since the integrated circuit was invented. Moore predicted that this trend would continue for the foreseeable future.
In subsequent years the pace slowed down a bit, but data density has doubled approximately every 18 months, and this is the current definition of Moore's Law, which Moore himself has blessed. Most experts, including Moore himself, expect Moore's Law to hold for another two decades. Others coin a more general law, stating that "circuit density increases predictably over time." So far (2010), Moore's Law has held true since about 1968. Some Intel fellows believe an end to Moore's Law will be reached around 2018, due to:
1. physical limitations in the process of manufacturing transistors from semiconductor material
2. limitations in adequately cooling small masses in confined areas
3. accessing the required number of pins on such small surfaces

This phenomenal growth is unknown in any other industry. For example, if a doubling of performance could be achieved every 18 months, then by 2001 other industries would have achieved the following:
- cars would travel at 2,400,000 MPH and get 600,000 MPG
- air travel from LA to NYC would be at Mach 36,000, i.e. take 0.5 seconds

Key Messages about Computer Architecture

1: Memory is Always Slow, Way Too Slow!

The inner core of the processor, the CPU or µP, is getting faster at a steady rate. Access to memory is also getting faster over time, but at a slower rate. This rate differential has existed for quite some time, with the strange effect that fast processors have to rely on slow memories. It is not uncommon on an MP server for the processor to wait more than 100 cycles before a memory access completes. On a multi-processor the bus protocol is more complex, due to snooping, backing off, arbitration, etc., which is why the number of cycles to complete an access can grow so high.

Discarding conventional memory altogether and relying only on cache-like memories is NOT yet an option, due to the price differential between cache and regular DRAM. Another way of seeing this: using solely reasonably-priced cache memories (say, at 10 times the cost of regular memory) is not possible, since the resulting physical address space would be too small.

Corollary 1: Almost all intellectual effort in high-performance computer architecture focuses on reducing the performance disparity between fast processors and slow memories. All else seems easy compared to this fundamental problem!

[Figure: processor vs. memory performance, 1980-2002, log scale from 1 to 1000. The µProc curve improves at ~60%/yr ("Moore's Law"), the DRAM curve at ~7%/yr, so the gap widens every year. Source: David Patterson, UC Berkeley]

2: Clustering of Events: What Happened Just Now Will Soon Happen Again!

A strange thing happens during program execution: seemingly unrelated events tend to cluster. For example, memory accesses tend to concentrate a majority of their referenced addresses onto a small domain of the total address space. Even if all of memory is accessed, during some periods of time this phenomenon of clustering seems immutable. While one memory access seems independent of another, both happen to fall onto the same page (or working set of pages, or cache line). We call this phenomenon Data Locality! We shall see later that architects exploit locality to speed up virtual memory management.
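Data locality can be observed directly at the source level. The sketch below is a minimal C illustration (the matrix size and loop structure are arbitrary choices, not taken from these notes): it sums the same matrix twice. The row-major walk touches consecutive addresses and reuses each fetched cache line; the column-major walk strides an entire row per access and defeats spatial locality, so on typical hardware it runs several times slower, even though both loops perform identical arithmetic.

    #include <stdio.h>
    #include <time.h>

    #define N 2048
    static double m[N][N];              /* C stores this matrix row by row */

    /* Stride-1 walk: consecutive addresses, cache friendly. */
    static double sum_row_major(void) {
        double s = 0.0;
        for (int r = 0; r < N; r++)
            for (int c = 0; c < N; c++)
                s += m[r][c];           /* next access lies in the same cache line */
        return s;
    }

    /* Stride-N walk: jumps N*8 bytes per access, cache hostile. */
    static double sum_col_major(void) {
        double s = 0.0;
        for (int c = 0; c < N; c++)
            for (int r = 0; r < N; r++)
                s += m[r][c];           /* almost every access misses the cache */
        return s;
    }

    int main(void) {
        clock_t t0 = clock();
        double s1 = sum_row_major();
        clock_t t1 = clock();
        double s2 = sum_col_major();
        clock_t t2 = clock();
        printf("row-major: %.3f s   col-major: %.3f s   (sums %g, %g)\n",
               (double)(t1 - t0) / CLOCKS_PER_SEC,
               (double)(t2 - t1) / CLOCKS_PER_SEC, s1, s2);
        return 0;
    }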
Similarly, hash functions tend to concentrate a disproportionately large number of keys onto a small number of hash values, i.e. table entries. Here the incoming search key (say, a source-program identifier "i") is mapped onto an index, and then the next, completely unrelated key happens to map onto the same index. In an extreme case, this may render a hash lookup slower than a sequential search.

This clustering happens in all disparate modules of the processor architecture. For example, when a data cache is used to speed up memory accesses by holding a copy of frequently used data in a faster memory unit, it turns out that a small cache suffices. This is again due to Data Locality (spatial and temporal): data that have been accessed recently will be accessed again in the near future, or at least data that live close by will be accessed in the near future, and thus happen to reside in the same cache line. Architects exploit this to speed up execution, while keeping the incremental cost of the hardware contained. Here clustering is exploited as a valuable opportunity.

Corollary 2: If this clustering of events (AKA locality) did not happen, the whole architectural edifice of caches (rendering slow memories fast), of branch predictors, and of VMM (making small memories appear large) would not work at all. It is due to great locality that major performance bottlenecks and resource limitations can be overcome.

3: Heat is Bad: Design the CPU for Low Voltage, Low Current, Low Heat

Clocking a processor fast (e.g. > 3-5 GHz) increases performance and thus generally "is good". Other performance parameters, such as memory access speed, peripheral access, etc., do not scale with the clock speed; still, increasing the clock to a higher rate is desirable. But this comes at the cost of higher current and thus more heat generated in the identical physical space, the geometry of the silicon processor or chipset. However, a silicon part acts like a resistor that conducts better as it gets warmer (a negative temperature coefficient, or NTC, resistor). Since the power supply is a constant-current source, a lower resistance causes a lower voltage, the VDroop effect (© Anandtech, see [2]). This in turn means the voltage has to be increased artificially to sustain the clock rate, creating more heat and ultimately leading to self-destruction of the part.

Great efforts are being made to increase the clock speed, which requires more voltage, while at the same time reducing heat generation. Current technologies include sleep states of the silicon part (processor as well as chipset) and Turbo Boost mode, to contain heat generation while boosting the clock speed just at the right time.
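The reasoning above follows the standard first-order model of CMOS switching power, a textbook approximation not specific to any one processor:

    P_dyn ≈ α · C · V² · f

where α is the switching-activity factor, C the switched capacitance, V the supply voltage, and f the clock frequency. Since sustaining a higher f generally requires a higher V, dynamic power grows roughly with the cube of the clock rate; this is why the artificial voltage increase described above is so costly in heat.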
Corollary 3: It is good that, to date (2010), silicon manufacturing technologies allow the shrinking of transistors and thus of whole dies. Otherwise CPUs would become larger, more expensive, and above all hotter.

Code Sequences for Different Architectures

Goals, Core Ideas
- Analyze various levels of complexity: same source, various target architectures
- Interaction between high-level language, compiler, and target architecture
- Sample language rules: operator precedence, associativity, commutativity
- Derive measures of architectural quality: is some architecture (ISA) better than another?

Example 1: Object Code Sequence without Optimization
- Strict left-to-right translation, no optimization at all!
- Consider the non-commutative subtraction and division operators
- No common subexpression elimination (CSE), no register reuse
- Conventional operator precedence
- For Single-Accumulator (SAA), Three-Address GPR, and Stack architectures

Sample source snippet: d ← ( a + 3 ) * b - ( a + 3 ) / c

  No  Single-Accumulator    Three-Address GPR        Stack Machine
                            (dest ← op1 op op2)
   1  ld   a                add  r1, a, #3           push    a
   2  add  #3               mult r2, r1, b           pushlit #3
   3  mult b                add  r3, a, #3           add
   4  st   temp1            div  r4, r3, c           push    b
   5  ld   a                sub  d,  r2, r4          mult
   6  add  #3                                        push    a
   7  div  c                                         pushlit #3
   8  st   temp2                                     add
   9  ld   temp1                                     push    c
  10  sub  temp2                                     div
  11  st   d                                         sub
  12                                                 pop     d

Observations, Example 1
- Three-address code looks shortest w.r.t. the number of instructions
- Maybe an optical illusion: must also consider the number of bits per instruction, and ask: how many registers are available?
- Must consider the number of instruction fetches and operand fetches
- Must consider the total number of stores
- Numerous memory accesses on the SAA due to temporary values held in memory
- Most memory accesses on the SAA, since everything requires a memory access, even multiple memory accesses for a single arithmetic computation!
- The architect considers designing a "reverse subtract" operation for the SAA, to save some stores and loads
- The Three-Address architecture is immune to the commutativity constraint, since operands may be placed in registers in either order
- No need for reverse-operation opcodes in the Three-Address architecture
- Must decide in the Three-Address architecture how to encode the operand types
- Numerous stack instructions, since each operand fetch is a separate instruction

Example 2: Using CSE + Register Reuse Optimization
- Eliminate the common subexpression
- The compiler handles left-to-right order for non-commutative operators on the SAA
- Best possible code for: d ← ( a + 3 ) * b - ( a + 3 ) / c

  No  Single-Accumulator    Three-Address GPR        Stack Machine
                            (dest ← op1 op op2)
   1  ld   a                add  r1, a, #3           push    a
   2  add  #3               mult r2, r1, b           pushlit #3
   3  st   temp1            div  r1, r1, c           add
   4  div  c                sub  d,  r2, r1          dup
   5  st   temp2                                     push    b
   6  ld   temp1                                     mult
   7  mult b                                         xch
   8  sub  temp2                                     push    c
   9  st   d                                         div
  10                                                 sub
  11                                                 pop     d

Observations, Example 2
- The optimized Single-Accumulator Architecture (SAA) still needs temporary storage; it uses temp1 for the common subexpression, having no other register!
- The SAA could use a negate instruction or a reverse subtract
- Register use is optimized on the Three-Address architecture
- The common subexpression is optimized on the Stack Machine by duplicating, exchanging, etc. (see the source-level sketch below)
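At the source level, the effect of CSE in Example 2 corresponds to introducing a temporary for the repeated subexpression. A minimal C rendering, illustrative only (a real compiler performs this rewrite on its intermediate representation, not on the source text):

    /* d = (a + 3) * b - (a + 3) / c, after common subexpression elimination */
    int expr(int a, int b, int c) {
        int t = a + 3;          /* the common subexpression, evaluated once    */
        return t * b - t / c;   /* t corresponds to r1 on the Three-Address
                                   machine, temp1 on the SAA, dup on the stack */
    }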
Example 3: Interaction of Operator Precedence and Architecture
- Analyze similar source expressions, but with reversed operator precedence
- One operator sequence, Expression 1, associates right-to-left due to precedence; the compiler uses commutativity
- The other, Expression 2, associates left-to-right due to explicit parentheses
- Use a simple-minded code model: no cache, no optimization
- Will there be advantages or disadvantages inherent in any architecture?

Expression 1 is: e ← a + b * c ^ d

  No  Single-Accumulator    Three-Address GPR        Stack Machine
                            (dest ← op1 op op2)      (implied operands)
   1  ld   c                expo r1, c, d            push a
   2  expo d                mult r1, b, r1           push b
   3  mult b                add  e,  a, r1           push c
   4  add  a                                         push d
   5  st   e                                         expo
   6                                                 mult
   7                                                 add
   8                                                 pop  e

Expression 2 is: f ← ( ( g + h ) * i ) ^ j; here the operators associate left-to-right

  No  Single-Accumulator    Three-Address GPR        Stack Machine
                            (dest ← op1 op op2)      (implied operands)
   1  ld   g                add  r1, g, h            push g
   2  add  h                mult r1, i, r1           push h
   3  mult i                expo f,  r1, j           add
   4  expo j                                         push i
   5  st   f                                         mult
   6                                                 push j
   7                                                 expo
   8                                                 pop  f

Observations, Interaction of Precedence and Architecture
- Software eliminates the constraints imposed by precedence, by looking ahead
- Execution times are identical, unless blurred by a secondary effect; see the cache example below
- Conclusion: all architectures handle arithmetic and logic operations well, except when register starvation causes superfluous spill code

Timing Analysis: For a Stack Machine with 2-Word Cache
- A stack machine with no registers is inherently slow: memory accesses!!!
- Implement a few top-of-stack elements via HW shadow registers: a cache
- Measure equivalent code sequences with and without consideration for the cache
- The top-of-stack register tos points to the last valid word of the physical stack
- Two shadow registers may hold 0, 1, or 2 true top-of-stack words
- The top-of-stack cache counter tcc specifies the number of shadow registers actually used, to know the "real" top of stack
- Thus tos plus tcc jointly specify what and where the true top of stack is

[Figure: the physical stack in memory, with tos pointing at its last valid word and the space above it free; 2 on-chip top-of-stack shadow registers; tcc = 0, 1, or 2 counts the shadow registers in use]

- Timings for push, pushlit, add-etc., and pop operations depend on tcc
- Operations in shadow registers are fastest, arbitrarily 1 cycle here, including the register access and the operation itself
- Each memory access adds 2 cycles; in reality it adds far more than 10 cycles
- For stack changes, use some defined policy, e.g. keep the shadow registers 50% full, or always full, or always empty
- The table below refines the timings for the stack with shadow registers

  Operation    Cycles   tcc before   tcc after   tos change   Comment
  add          1        tcc = 2      tcc = 1     no change
  add          1+2      tcc = 1      tcc = 1     tos--        underflow?
  add          1+2+2    tcc = 0      tcc = 1     tos -= 2     underflow?
  push x       2        tcc = 0, 1   tcc++       no change
  push x       2+2      tcc = 2      tcc = 2     tos++        overflow?
  pushlit #3   1        tcc = 0, 1   tcc++       no change
  pushlit #3   1+2      tcc = 2      tcc = 2     tos++        overflow?
  pop y        2        tcc = 1, 2   tcc--       no change
  pop y        2+2      tcc = 0      tcc = 0     tos--        underflow?

- "add" is representative of any arithmetic operation
- Code emission for: a + b * c ^ ( d + e * f ^ g )
- Let + and * be commutative, per programming-language rule
- The architecture here has 2 shadow registers, and the compiler knows this
- Note: no sub and no div operations, to avoid the operand-order question

   #  Blind left-to-right   Cycles      Smart cache use   Cycles
   1  push a                2           push f            2
   2  push b                2           push g            2
   3  push c                4           expo              1
   4  push d                4           push e            2
   5  push e                4           mult              1
   6  push f                4           push d            2
   7  push g                4           add               1
   8  expo                  1           push c            2
   9  mult                  3           expo              1
  10  add                   3           push b            2
  11  expo                  3           mult              1
  12  mult                  3           push a            2
  13  add                   3           add               1
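The cycle counts in the table above can be checked mechanically. Below is a minimal C sketch of the cost model (the function names and driver are inventions for illustration; the costs and the spill behavior at tcc = 2 are read off the timing table, not taken from real hardware). It reproduces both totals:

    #include <stdio.h>

    /* Cost model of the 2-shadow-register stack machine above:
       operations on shadow registers cost 1 cycle, each memory
       access adds 2.  tcc counts shadow registers in use (0..2). */
    static int tcc = 0, cycles = 0;

    static void push(void) {               /* push x: operand read from memory */
        cycles += (tcc == 2) ? 2 + 2 : 2;  /* full shadow: spill one word first */
        if (tcc < 2) tcc++;
    }

    static void op(void) {                 /* add/mult/expo: pops 2, pushes 1 */
        cycles += 1 + 2 * (2 - tcc);       /* operands not in shadow cost 2 each */
        tcc = 1;                           /* result lands in a shadow register */
    }

    int main(void) {
        /* blind left-to-right emission: 7 pushes, then 6 operations */
        tcc = cycles = 0;
        for (int i = 0; i < 7; i++) push();
        for (int i = 0; i < 6; i++) op();
        printf("blind: %d cycles\n", cycles);   /* prints 40 */

        /* smart emission: each operator issued as soon as its
           operands sit in the shadow registers                 */
        tcc = cycles = 0;
        push(); push(); op();                   /* f g expo      */
        push(); op(); push(); op();             /* e mult d add  */
        push(); op(); push(); op();             /* c expo b mult */
        push(); op();                           /* a add         */
        printf("smart: %d cycles\n", cycles);   /* prints 20 */
        return 0;
    }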
Observations, Stack Machine with 2-Word Cache
- Blind code emission costs 40 cycles; not taking advantage of the tcc knowledge costs performance
- Smart code emission, with the shadow registers in mind, costs 20 cycles
- Execution of the code without the cache would take 7 pushes at 4 cycles each (read the operand, write the stack) = 28 cycles, plus 6 memory operations at 6 cycles each (read two operands, write the result) = 36 cycles, a total of 64 cycles
- The true penalty for a memory access is much worse in practice, on the order of tens of cycles, and it will get worse over time
- Caveat: a tremendous speed-up is generally an indicator that you are dealing with a system exhibiting severe flaws
- Such is the case here: the return on investment for the 2-register hardware is enormous
- A stack machine can be fast, if the purity of top-of-stack access is sacrificed for performance
- Note that indexing, looping, indirection, and call/return are not addressed here

Dependences

Inter-instruction dependencies arise; they are known in the technical jargon as dependences: data dependence, anti-dependence, etc. One instruction computes a result that the other needs. Or one instruction uses data which, after the use, may be recomputed.

True Dependence, AKA Data Dependence (Read after Write, RAW):
    r3 ← r1 op r2
    r5 ← r3 op r4

Anti-Dependence (Write after Read, WAR):
    r3 ← r1 op r2
    r1 ← r5 op r4

Output Dependence (Write after Write, WAW, with a read in between):
    r3 ← r1 op r2
    r5 ← r3 op r4
    r3 ← r6 op r7

Control Dependence:
    if ( condition1 ) {
        r3 = r1 op r2;
    } else {
        r5 = r3 op r4;
    } // end if
    write( r3 );

Register Renaming

Only some dependences constitute a real dependence, AKA data dependence. The others are artifacts of insufficient resources, generally register resources. If more registers were available, then replacing the conflicting names with new (renamed) registers could make the conflict disappear. Anti- and output dependences are such false dependences. Assume all registers are live afterwards.

Compilers need to be aware of such dependences during code emission, to ensure data correctness. Similarly, the HW score board needs to be aware of such (register-to-register) dependences, to ensure accurate timing of register usage, i.e. to know at which moment a register is actually usable.

Original uses and dependences:

    L1: r1 ← r2 op r3
    L2: r4 ← r1 op r5
    L3: r1 ← r3 op r6
    L4: r3 ← r1 op r7

    L1, L2  true-dep with r1
    L1, L3  output-dep with r1
    L1, L4  anti-dep with r3
    L2, L3  anti-dep with r1
    L3, L4  true-dep with r1
    L3, L4  anti-dep with r3

Registers renamed (first r1 → r10, first r3 → r30; the registers live afterwards remain r1, r3, ...), new dependences:

    L1: r10 ← r2 op r30
    L2: r4  ← r10 op r5
    L3: r1  ← r30 op r6
    L4: r3  ← r1 op r7

    L1, L2  true-dep with r10
    L3, L4  true-dep with r1

Only the true dependences remain: L1 and L3 can now execute in parallel, as can L2 and L4, so the sequence runs in half the time with renamed registers!

Score Board

The purpose of the score board, an array of programmable bits sb[], is to manage HW resources, specifically registers. It is a single-bit array, with one bit associated with one specific register via index = name: sb[i] belongs to register ri.

Only if sb[i] = 0 does register ri hold valid data, i.e. the register is NOT in the process of being written with new data; instead, the current data are valid. If bit i is set, i.e. if sb[i] = 1, then register ri has stale data.

In-order execution of rd ← rs op rt:
- if sb[rs] or sb[rt] is set: RAW dependence, hence stall
- if sb[rd] is set: WAW dependence, hence stall
- else dispatch the instruction, setting sb[rd]

To allow out-of-order (ooo) execution, upon computing the value of rd: update rd and clear sb[rd].
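A minimal C sketch of this dispatch rule follows (the array size, function names, and driver are assumptions made for illustration; real score boards also track functional units):

    #include <stdbool.h>
    #include <stdio.h>

    #define NREGS 32
    static bool sb[NREGS];   /* sb[i] == true: ri being written, data stale */

    /* In-order dispatch test for "rd <- rs op rt".  Returns true if the
       instruction may issue; on issue, rd is marked busy until the
       producing operation completes.                                   */
    static bool try_dispatch(int rd, int rs, int rt) {
        if (sb[rs] || sb[rt]) return false;  /* RAW: stale operand, stall  */
        if (sb[rd])           return false;  /* WAW: rd still busy, stall  */
        sb[rd] = true;                       /* claim rd for the result    */
        return true;
    }

    /* Called on completion (possibly out of order): rd is valid again. */
    static void complete(int rd) {
        sb[rd] = false;
    }

    int main(void) {
        /* r3 <- r1 op r2 ; r5 <- r3 op r4 : second waits for first */
        printf("%d\n", try_dispatch(3, 1, 2));   /* 1: issues           */
        printf("%d\n", try_dispatch(5, 3, 4));   /* 0: RAW on r3, stall */
        complete(3);
        printf("%d\n", try_dispatch(5, 3, 4));   /* 1: r3 now valid     */
        return 0;
    }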
Bibliography

1. http://en.wikipedia.org/wiki/Moore's_law
2. http://www.tomshardware.com/de/foren/240300-6-intel-cpus-mythosstunde-wahrheit
3. Gibbons, P. B., and Steven Muchnick [1986]. "Efficient Instruction Scheduling for a Pipelined Architecture", ACM SIGPLAN Notices, Proceedings of the '86 Symposium on Compiler Construction, Volume 21, Number 7, July 1986, pp. 11-16.