CDA 5155 and 4150 Computer Architecture Week 2: 2 September 2014 Goals of the course • Advanced coverage of computer architecture – general purpose processors, embedded processors,historically significant processors, design tools. – Instruction set architecture – Processor microarchitecture – Systems architecture • Memory systems • I/O systems 2/49 Teaching Staff • Professor Gary Tyson – PhD: University of California – Davis – Faculty jobs: • • • • • California State University Sacramento: 1987 - 1990 University of California – Davis: 1993 - 1994 University of California – Riverside: 1995 - 1996 University of Michigan: 1997 - 2003 Florida State University: 2003 – present 3/49 Grading in 5155 (Fall’14) Programming assignments Exams (2 @ 25% each) In class, 75 minutes Team Project (20%) In-order pipeline simulation (10%) Out-of-order pipeline simulation (10%) 3 or 4 students per team Class Participation (10%) 4/49 Time Management • 3 hours/week lecture – This is probably the most important time • 2 hours/week reading – Hennessy/Patterson: – Computer Architecture: A Quantitative Approach • 3-5 hours/week exam prep • 5+ hours/week Project (1/3 semester) Total: ~10-15 hours per week. 5/49 Tentative Course Timeline Week Date Topic 1 Aug 28 Performance, ISA, Pipelining 2 Sept 2 Pipelining, Branch Prediction 3 Sept 9 Superscalar, Exceptions 4 Sept 16 Compilers, VLIW 5 Sept 23 Dynamic Scheduling 6 Sept 30 Dynamic Scheduling 7 Oct 7 Advanced pipelines 8 Oct 14 Advanced pipelines 9 Oct 21 Cache design 10 Oct 28 Cache design, VM 11 Nov 4 Multiprocessor, Multithreading 12 Nov 11 Embedded processors 13 Nov 18 Embedded processors 14 Nov 25 Research Topics 15 Dec 2 Research Topics Holidays Due Dates Notes Exa m Exam project 6/49 Web Resources Course Web Page: http://www.cs.fsu.edu/~tyson/courses/CDA5155 Wikipedia: http://en.wikipedia.org/wiki/Microprocessor Wisconsin Architecture Page: http://arch-www.cs.wisc.edu/home 7/49 Levels of Abstraction • • • • • • • • • Problem/Idea (English?) Algorithm (pseudo-code) High-Level languages (C, Verilog) Assembly instructions (OS calls) Machine instructions (I/O interfaces) Microarchitecture/organization (block diagrams) Logic level: gates, flip-flops (schematic, HDL) Circuit level: transistors, sizing (schematic, HDL) Physical: VLSI layout, feature size, cabling, PC boards. What are the abstractions at each level? 8/49 Levels of Abstraction • • • • • • • • • Problem/Idea (English?) Algorithm (pseudo-code) High-Level languages (C, Verilog) Assembly instructions (OS calls) Machine instructions (I/O interfaces) Microarchitecture/organization (block diagrams) Logic level: gates, flip-flops (schematic, HDL) Circuit level: transistors, sizing (schematic, HDL) Physical: VLSI layout, feature size, cabling, PC boards. At what level do I perform a square root? Recursion? 9/49 Levels of Abstraction • • • • • • • • • Problem/Idea (English?) Algorithm (pseudo-code) High-Level languages (C, Verilog) Assembly instructions (OS calls) Machine instructions (I/O interfaces) Microarchitecture/organization (block diagrams) Logic level: gates, flip-flops (schematic, HDL) Circuit level: transistors, sizing (schematic, HDL) Physical: VLSI layout, feature size, cabling, PC boards. Who/what translates from one level to the next? 10/49 Role of Architecture • Responsible for hardware specification: – Instruction set design • Also responsible for structuring the overall implementation – Microarchitectural design. • Interacts with everyone – mainly compiler and logic level designers. • Cannot do a good job without knowledge of both sides 11/49 Design Issues: Performance • Get acceptable performance out of system. – Scientific: floating point throughput, memory&disk intensive, predictable – Commercial: string handling, disk (databases), predictable – Multimedia: specific data types (pixels), network? Predictable? – Embedded: what do you mean by performance? – Workstation: Maybe all of the above, maybe not 12/49 Calculating Performance • • • Execution time is often the best metric Throughput (tasks/sec) vs. latency (sec/task) Benchmarks: what are the tasks? – – – – – What I care about! Representative programs (SPEC, Linpack) Kernels: representative code fragments Toy programs: useful for testing end-conditions Synthetic programs: does nothing but with a representative instruction mix. 13/49 Design Issues: Cost • Processor – Die size, packaging, heat sink? Gold connectors? – Support: fan, connectors, motherboard specifications, etc. • Calculating processor cost: – Cost of device = (die + package + testing) / yield – Die cost = wafer cost / good die yield • Good die yield related to die size and defect density – Support costs: direct costs (components, labor), indirect costs ( sales, service, R&D) – Total costs amortized over number of systems sold(PC vs NASA) 14/49 Other design issues • Some applications care about other design issues. • NASA deep space mission – Reliability: software and hardware (radiation hardening) • AMD – Code compatibility • ARM – Power 15/49 A Quantitative Approach • Hardware systems performance is generally easy to quantify – Machine A is 10% faster than Machine B – Of course Machine B’s advertising will show the opposite conclusion • Many software systems tend to have much more subjective performance evaluations. 16/49 Measuring Performance • Total Execution Time: – A is 3 times faster than B for programs P1,P2 1 n n Σ Time i i=1 – Issue: Emphasizes long running programs 17/49 Measuring Performance • Weighted Execution Time: n ∑ Weighti X Timei i=1 – What if P1 is executed far more frequently? 18/49 Measuring Performance • Normalized Execution Time: – Compare machine performance to a reference machine and report a ratio. • SPEC ratings measure relative performance to a reference machine. 19/49 Amdahl’s Law • Rule of Thumb: Make the common case faster http://en.wikipedia.org/wiki/Amdahl's_law (Attack longest running part until it is no longer) repeat 20/49 Instruction Set Design • Software Systems: named variables; complex semantics. • Hardware systems: tight timing requirements; small storage structures; simple semantics • Instruction set: the interface between very different software and hardware systems 21/49 Design decisions • How much “state” is in the microarchitecture? – Registers; Flags; IP/PC • How is that state accessed/manipulated? – Operand encoding • What commands are supported? – Opcode; opcode encoding 22/49 Design Challenges: or why is architecture still relevant? • Clock frequency is increasing – This changes the number of levels of gates that can be completed each cycle so old designs don’t work. – It also tend to increase the ratio of time spent on wires (fixed speed of light) • Power – Faster chips are hotter; bigger chips are hotter 23/49 Design Challenges (cont) • Design Complexity – More complex designs to fix frequency/power issues leads to increased development/testing costs – Failures (design or transient) can be difficult to understand (and fix) • We seem far less willing to live with hardware errors (e.g. FDIV) than software errors – which are often dealt with through upgrades – that we pay for!) 24/49 Techniques for Encoding Operands • Explicit operands: – Includes a field to specify which state data is referenced – Example: register specifier • Implicit operands: – All state data can be inferred from the opcode – Example: function return (CISC-style) 25/49 Accumulator • Architectures with one implicit register – Acts as source and/or destination – One other source explicit • Example: C = A + B – Load A – Add B – Store C // (Acc)umulator = A // Acc = Acc + B // C = Acc Ref: “Instruction Level Distributed Processing: Adapting to Shifting Technology” 26/49 Stack • Architectures with implicit “stack” – Acts as source(s) and/or destination – Push and Pop operations have 1 explicit operand • Example: C = A + B – – – – Push A Push B Add Pop C // Stack = {A} // Stack = {A, B} // Stack = {A+B} // C = A+B ; Stack = {} Compact encoding; may require more instructions though 27/49 Registers • Most general (and common) approach – Small array of storage – Explicit operands (register file index) • Example: C = A + B Register-memory load/store Load R1, A Load R1, A Load R2, B Add R3, R1, B Add R3, R1, R2 Store R3, C Store R3, C 28/49 Memory • Big array of storage – More complex ways of indexing than registers • Build addressing modes to support efficient translation of software abstractions • Uses less space in instruction than 32-bit immediate field A[i]; use base (i) + displacement (A) (scaled?) a.ptr; use base (a) + displacement (ptr) 29/49 Addressing modes Register Immediate Base/Displacement Register Indirect Indexed Direct Memory Indirect Autoincrement Add R4, R3 Add R4, #3 Add R4, 100(R1) Add R4, (R1) Add R4, (R1+R2) Add R4, (1001) Add R4, @(R3) Add R4, (R2)+ 30/49 Other Memory Issues What is the size of each element in memory? 0x000 0-255 Byte 0x000 0 - 65535 Half word 0x000 0 - ~4B Word 31/49 Other Memory Issues Big-endian or Little-endian? Store 0x114488FF Points to most significant byte 0x000 11 Points to least significant byte 0x000 FF 44 88 88 44 FF 11 32/49 Other Memory Issues Non-word loads? ldb R3, (000) 0x000 11 00 00 00 11 44 88 FF 33/49 Other Memory Issues Non-word loads? ldb R3, (003) 11 FF FF FF FF 44 Sign extended 88 0x003 FF 34/49 Other Memory Issues Non-word loads? ldbu R3, (003) 11 00 00 00 FF 44 Zero filled 88 0x003 FF 35/49 Other Memory Issues Alignment? Word accesses only address ending in 00 Half-word accesses only ending in 0 Byte accesses any address 11 44 0x002 ldw R3, (002) is illegal! 88 FF Why is it important to be aligned? How can it be enforced? 36/49 Techniques for Encoding Operators • Opcode is translated to control signals that – direct data (MUX control) – select operation for ALU – Set read/write selects for register/memory/PC • Tradeoff between how flexible the control is and how compact the opcode encoding. – Microcode – direct control of signals (Improv) – Opcode – compact representation of a set of control signals. • You can make decode easier with careful opcode selection 37/49 Handling Control Flow • • • • • • Conditional branches (short range) Unconditional branches (jumps) Function calls Returns Traps (OS calls and exceptions) Predicates (conditional retirement) 38/49 Encoding branch targets • PC-relative addressing – Makes linking code easier • Indirect addressing – Jumps into shared libraries, virtual functions, case/switch statements • Some unusual modes to simplify target address calculation – (segment offset) or (trap number) 39/49 Condition codes • Flags – Implicit: flag(s) specified in opcode (bgt) – Flag(s) set by earlier instructions (compare, add, etc.) • Register – Uses a register; requires explicit specifier • Comparison operation – Two registers with compare operation specified in opcode. 40/49 Higher Level Semantics: Functions • Function call semantics – – – – • Simple approach: – • Save PC + 1 instruction for return Manage parameters Allocate space on stack Jump to function Use a jump instruction + other instructions Complex approach: – Build implicit operations into new “call” instruction 41/49 Role of the Compiler • Compilers make the complexity of the ISA (from the programmers point of view) less relevant. – Non-orthogonal ISAs are more challenging. – State allocation (register allocation) is better left to compiler heuristics – Complex Semantics lead to more global optimization – easier for a machine to do. People are good at optimizing 10 lines of code. Compilers are good at optimizing 10M lines. 42/49 LC processor • Little Computer Fall 2011 – For programming projects • Instruction Set Design opcode regA regB destReg 43/49 LC processor R-type instructions opcode regA regB 24- 22 21- 19 18 –16 add: destReg 15 – 3 2-0 destReg = regA + regB nand: destReg = regA & regB 44/49 LC processor I-type instructions opcode regA 24- 22 regB 21- 19 18 –16 offsetField 15 – 0 lw: regB = Memory[regA + offsetField] sw: Memory[regA +offsetField] = regB beq: if (regA= = regB) PC = PC + 1 + offsetField 45/49 LC processor O-type instructions opcode unused 24- 22 21 – 0 noop: do nothing halt: halt the simulation 46/49 LC assembly example lw 0 1 lw 1 2 start add 1 beq 0 1 beq 0 0 noop done halt five .fill neg1 .fill stAddr .fill five 3 2 1 2 start load reg1 with 5 (uses symbolic address) load reg2 with -1 (uses numeric address) decrement reg1 goto end of program when reg1==0 go back to the beginning of the loop end of program 5 -1 start will contain the address of start (2) 47/49 LC machine code example (address (address (address (address (address (address (address (address (address (address 0): 1): 2): 3): 4): 5): 6): 7): 8): 9): 8454151 (hex 0x810007) 9043971 (hex 0x8a0003) 655361 (hex 0xa0001) 16842754 (hex 0x1010002) 16842749 (hex 0x100fffd) 29360128 (hex 0x1c00000) 25165824 (hex 0x1800000) 5 (hex 0x5) -1 (hex 0xffffffff) 2 (hex 0x2) Input for simulator: 8454151 9043971 655361 16842754 16842749 29360128 25165824 5 -1 2 48/49