ECE-3056-B Quiz-2 Topic Areas John Copeland March 28, 2014 05 Single-Cycle Datapath Asynchronous Logic –combinatorial logic Synchronous Logic – latches, registers Data Path – actual determined by control signals and mux'es Instruction Decoding – determines values of control signals 3 Types of Instruction Formats – how used to simplify decoding Opcode( bits 31-26) "funct" (bits (5-0) determines ALU operation PC Register – 3 possible inputs: pc+4, branch, jump Given a diagram like slide 05-28, or VHDL like 05-32; know how to figure which control lines are set (true) for: add, lw, sw, j, jr, beq See 05-33, 34, 35, or 37 Energy use depends on average number of gates that switch per cycle Which uses more: Register Set or ALU (during add). Given VHDL code, or combinatorial logic, for simple ALU, be able to modify to add another operation (e.g., shift, subtract, bit-wise logic, …) 2 Datapath With Jumps Added Slide 05-37 3 06 Multi-Cycle Datapath Divide Datapath into five phases – Fetch Instr., Decode Instr., Execute (calculate), Memory R/W, Write Back - (IF, ID, EX, MEM, WB) Given a diagram like slide 06-8, 11, 17, 18, 20, 22, 23,24; know how to figure which control lines (ALUOp, ALUSrcA, MemWrite, …) are set (true) for: add, lw, sw, j, jr, beq, … Interpret a state diagram like 06-10 or 06-30 – how many clock cycles for br, add, lw. For given MIPS assembler code: how many cycles to execute. State diagram for decoding instruction - ways to implement: ROM Implementation – very inefficient PLA " – Inputs: 5 Opcode bits and 4 state bits (“funct” bits decoded apart) Outputs: 1 for each control line (control signal) Vertical Lines: Minimized by choosing similar bits for closely related functions (06-32). CISC (complex instruction set computer) – uses microcode, more ticks/instr Unexpected change in flow of control: Exception (error) vs. Interrupt (I/O) Two registers just for Exception Handling: cause and status 4 Pipelining Example add $14, $5, $6 lw $13, 24($1) add $12, $3, $4 0 M u x 1 sub $11, $2, $3 lw $10, 20($1) Note what is happening in the register file IF/ID ID/EX EX/MEM MEM/WB Add Add Add result 4 PC Address Instruction memory Instruction Shift left 2 Read register 1 Read data 1 Read register 2 Registers Read Write data 2 register Write data 0 M u x 1 Zero ALU ALU result Address Data memory Write data 16 Instruction [20– 16] Instruction [15– 11] Pipeline stage execution time Sign extend Read data 1 M u x 0 32 0 M u x 1 RegDst Slide 08a-21 5 Pipelined Control Instruction R-format lw sw beq Execution/Address Calculation stage control lines Reg ALU ALU ALU Dst Op1 Op0 Src 1 1 0 0 0 0 0 1 X 0 0 1 X 0 1 0 Memory access stage control lines Mem Mem Branch Read Write 0 0 0 0 1 0 0 0 1 1 0 0 Write-back stage control lines Reg Mem write to Reg 1 0 1 1 0 X 0 X • Control signals derived from instruction – As in single-cycle implementation • Pass control signals along like data Slide 08a-23 07 Performance Determined by: Algorithm & Language, Compiler, (hardware) Architecture Different Metrics, and different numbers for different "bench mark" programs SPEC2005 – performance doing a standard set of bench mark programs "Throughput" – work done per unit time "Response Time" (latency) – time to do a task (e.g., display a Web page) Trading throughput versus latency (for multitasking or parallel processing) Instructions per second = clock rate / average cycles per instruction Programs with heavy I/O or memory R/W's may depend less on CPU speed. Amdahl's Law – know definition, and do calculations like 07-18. Corollary – "make the common case fast" Increased clock rate higher temperature and lower energy efficiency, 7 08a Pipelined Datapath Use clocked registers to divide Datapath's five stages – Fetch Instr., Decode Instr., Execute, Memory R/W, Write Back - (IF, ID, EX, MEM, WB) Use more clocked registers to delay control signals as instruction moves through stages (08a-23). Hazards: Structural, Control, Data Structural – need to separate Instruction and Data memory (or caches) Control – guess "to branch" or not – stall when wrong – how? Can branch be determined 2 or 3 instructions ahead - if so ? How many real br instructions are there? only bre (branch if equal) Jump/Br must have next instruction started before (kill if wrong) Data – add logic devices to "forward" data needed by next instruction. Forwarding unit – controls when to forward – from EXE or MEM stage Problem with" Load Word" and Use –requires a" nop" insertion unless order of following instructions can be swapped ("code scheduling"). MIPS can predict branch always taken (wrong when loop ends) More complicated CPU's have "dynamic" branch prediction. MIPS – Exception causes Jump to "Exception Handler” procedure in OS. 8 08b Pipelined – Thread Level Parallelism Flynn's Taxonomy - Single/Multiple Instructions Single/Multiple Data Multithreading – MIMD – parallel threads or LWPs (light-weight processes) Process – running program with state (register values, include PC, pipeline r's). Scheduled by OS –swapped out, restarted later. Threads have own state, stack memory (local data), but share static data Hardware assisted switching between threads – separate hardware registers instructions may be interleaved in pipeline ("Fine-Grain Multithreading") "Course-Grained Multithreading" – swap on long stall (like cache miss) Synchronization between threads – requires hardware "atomic commands" "Test and Set" without switching threads between the Test and the Set. When altering common data, thread must tell OS "start/end Critical Path" Simultaneous Multithreading on multiple cores (data paths). Amdahl's Law – increased multithreading reaches "point of diminishing return” Big time Parallelism – requires special many-core computers, high-level languages, shared and local memory. "Grid" has cores on 2-d grid of buses. Communication (assigning tasks, collecting results) – by passing messages. 9 09a Memory Different levels – large, slow, and low cost small, fast, and high cost. Hard disk (Terabyte) $ 0.25 per Gigabyte millisecond DRAM (Gigabyte) $ 20.00 per Gigabyte several nanoseconds Static Ram (Megabytes) $ 2000 per Gigabyte fraction of nanosecond Registers On CPU – limited area and power Use multiple levels to get low average cost, high-speed for data being used. Multicore computers need a fast cache memory for each core Data movement takes more energy than computation What to put in cache? Temporal (recent) and Spatial (nearby) How to locate data. Direct Cache: address parts: [ tag ][ cache address or index][ byte location ] Size of cache: = 2^(no. of index bits) x 2^((no. of byte location bits) = number of blocks x block size Each block also has valid bit, dirty bit, tag. To read or write data, look at indexed block, match tag, select byte location. If not found (tag did not match), put contents in write queue if "dirty" before getting new data from memory (all byte-addresses that start with tag + index) "Write Through" immediately update memory, "Write Back" just write cache. 10 Previous State of Cache Index V 000 N 001 N 010 Y 011 N 100 N 101 N 110 Y 111 N Tag Data 11 Mem[11010] 09a-20 10 Mem[10110] Then This Happens Word addr Binary addr Hit/miss Cache block 16 10 000 ? 000 3 00 011 ? 011 16 10 000 ? 000 What is new State of Cache? 09a-21 Answer on 09a-21 11