ITEC 352
Lecture 21
Pipelining

Review
• Questions?
• Homework 3 on Wed.
• JVM vs Assembly: similarities / differences

Outline
• Pipelining
  – Motivation
  – Fits with execution model
  – Examples

CISC Vs. RISC
• A long time ago, when memory was expensive:
  – The focus of most computer architects (e.g., Intel and Motorola):
    • Support instructions that each performed more complicated computations, so programs needed fewer of them.
    • E.g., addld [a], [b], [c] would be a complex instruction that replaced:
        ld [a], %r1
        ld [b], %r2
        addcc %r1, %r2, %r3
        st %r3, [c]
• Why? Complex instructions → shorter programs → smaller memory needed.
• However, memory became cheaper. So architects started thinking about techniques that would use more memory to speed up computations.

CISC Vs. RISC (2)
• Solution: RISC (Reduced Instruction Set Computer)
  – E.g., ARC instructions.
  – The instructions are similar in complexity, i.e., they take more or less the same number of clock cycles to execute.
• So architects thought: why not have every RISC instruction execute in one CPU cycle, using a technique called pipelining?
• To use pipelining, instructions must be of similar complexity, i.e., every instruction must require more or less the same number of CPU cycles to execute.
• However, what is the complexity of an instruction?
• Trivia:
  – Apple’s earlier PowerPC processors (developed with IBM and Motorola): RISC
  – Intel: stuck to CISC, even today!
Slide © Prem Uppuluri, Derived from Murdocca and Heuring

Pipelining and RISC
• The complexity of an instruction can be based on the number of steps it takes to execute the fetch-execute cycle.
• Recall the fetch-execute cycle (every instruction goes through a fetch-execute cycle):
  – Fetch instruction from memory to register.
  – Decode the opcode.
  – Fetch operands from memory to register.
  – Execute operation.
  – Store result back into memory.
• We said: “this is how every instruction is executed by the control unit”.
  – Ahem… this is not entirely true!
    It is almost true: each class of instruction has slightly different stages in the fetch-execute cycle.

[Figure: Complete ARC Instruction and PSR Formats]

Arithmetic Instructions
• Arithmetic instructions in RISC have the following 5 stages:
  – Fetch the instruction from memory.
  – Decode the instruction.
  – Fetch the operands from the register file.
  – Apply the operands to the ALU.
  – Write the result back to the register file.
• E.g., take addcc %r1, %r2, %r3
  – Trace the 5 stages on this instruction as an exercise.

RISC branch instruction
• Branch instructions have the following stages:
  – Fetch the instruction from memory.
  – Decode the instruction.
  – Fetch the components of the address from the instruction or register file.
  – Apply the components of the address to the ALU.
  – Copy the resulting effective address into the PC (program counter).
• Exercise: Trace the stages for the instruction: be 2048

Load/Store
• Load and store instructions have the following stages:
  – Fetch the instruction from memory.
  – Decode the instruction.
  – Fetch the components of the address from the instruction or register file.
  – Apply the components to the ALU.
  – Apply the resulting effective address to memory along with a read or write signal. For a write, the data item to be written must first be retrieved from the register file.
• Exercise: Trace the ld %r1, %r2, %r3 instruction through these stages.

Summarizing…
• The fetch-execute stages differ across the different instruction classes…
  – But they have similarities. All the instructions have the following stages:
    • Instruction fetch
    • Decode
    • Operand fetch
    • ALU operation
    • Result writeback (to memory, from memory, or to a register, depending on the type of instruction)
  – So computer architects decided to break the control unit into 5 parts, one part for each stage of the fetch-execute cycle.
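
The summary above can be sketched as a small lookup table. This is only an illustration: the step wording is copied from the slides, while the names GENERIC_STAGES, CLASS_STAGES and stages_by_unit are hypothetical and chosen for this example. It lines up each instruction class's step with the generic stage (and so the hardware part) that would handle it.

```python
# Stage names are taken from the slides; the data layout is an assumption
# made for illustration, not part of the ARC definition.
GENERIC_STAGES = [
    "Instruction fetch",
    "Decode",
    "Operand fetch",
    "ALU operation",
    "Result writeback",
]

CLASS_STAGES = {
    "arithmetic": [
        "Fetch the instruction from memory",
        "Decode the instruction",
        "Fetch the operands from the register file",
        "Apply the operands to the ALU",
        "Write the result back to the register file",
    ],
    "branch": [
        "Fetch the instruction from memory",
        "Decode the instruction",
        "Fetch the components of the address from the instruction or register file",
        "Apply the components of the address to the ALU",
        "Copy the resulting effective address into the PC",
    ],
    "load/store": [
        "Fetch the instruction from memory",
        "Decode the instruction",
        "Fetch the components of the address from the instruction or register file",
        "Apply the components to the ALU",
        "Apply the resulting effective address to memory with a read or write signal",
    ],
}


def stages_by_unit(class_stages=CLASS_STAGES, generic=GENERIC_STAGES):
    """Group each class's step under the generic stage that handles it."""
    table = {}
    for position, stage in enumerate(generic):
        table[stage] = {cls: steps[position] for cls, steps in class_stages.items()}
    return table


if __name__ == "__main__":
    for stage, per_class in stages_by_unit().items():
        print(stage)
        for cls, step in per_class.items():
            print(f"  {cls:<11} {step}")
```

Printing the table shows that the first two stages are identical for every class, and that the classes only diverge in what they fetch, compute and write back.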

RISC control unit
• RISC processors have 5 hardware units:
  – Each corresponds to one stage of the fetch-execute cycle.
• E.g., the “Fetch instruction” hardware part of the control unit fetches the instruction, while the “Fetch operand” hardware fetches the operands.
• These hardware units can execute in parallel.
• Each hardware unit takes 1 CPU tick to execute.
• How does this help?

RISC control unit
  Fetch Instr. → Decode opcode → Fetch operands → Execute Instr. → Store result
A pipeline of five hardware units (together they form the control unit). An instruction moves from left to right in the pipeline. Whenever the clock ticks, each unit passes its instruction to the next.
© Prem Uppuluri. Derived from Doug Comer

RISC pipelining
• E.g., life of an instruction:

  CPUcycle  Unit1  Unit2  Unit3  Unit4  Unit5
  1         inst1
  2                inst1
  3                       inst1
  4                              inst1
  5                                     inst1

An instruction moves through the pipeline in 5 clock ticks. So, in effect, it takes 5 CPU cycles for an instruction to execute. However, consider this: “all the units can work in parallel”. Can we use this fact to speed up execution, i.e., can we finish instructions faster than one every 5 cycles?

RISC pipelining: speeding up instructions
When Unit2 is executing inst1 in clock cycle 2, Unit1 is idle… so why not use this unit to start the next instruction, inst2?

  CPUcycle  Unit1  Unit2  Unit3  Unit4  Unit5
  1         inst1
  2         inst2  inst1
  3                       inst1
  4                              inst1
  5                                     inst1

Pipeline filling
• E.g.,

  CPUcycle  Unit1  Unit2  Unit3  Unit4  Unit5
  1         inst1
  2         inst2  inst1
  3         inst3  inst2  inst1
  4         inst4  inst3  inst2  inst1
  5         inst5  inst4  inst3  inst2  inst1
  6         inst6  inst5  inst4  inst3  inst2
  7         inst7  inst6  inst5  inst4  inst3
  8         inst8  inst7  inst6  inst5  inst4

After the pipeline is filled (in CPU cycle 5), one instruction finishes executing in every CPU cycle. This is called instruction pipelining, a form of instruction-level parallelism (ILP).
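
The fill pattern in the table above can be reproduced with a short simulation. This is a minimal sketch assuming an ideal pipeline (no stalls, one new instruction entering Unit1 every cycle); the function names pipeline_schedule and print_schedule are hypothetical and not from the slides.

```python
def pipeline_schedule(num_instructions, num_units=5):
    """Return one row per CPU cycle; row[u] is the instruction in unit u+1, or None."""
    # Assumption: ideal pipeline, so the last instruction leaves the final unit
    # in cycle num_units + num_instructions - 1.
    total_cycles = num_units + num_instructions - 1
    schedule = []
    for cycle in range(1, total_cycles + 1):
        row = []
        for unit in range(1, num_units + 1):
            # Instruction i enters Unit1 in cycle i, so during a given cycle
            # unit u holds instruction (cycle - u + 1), if that instruction exists.
            inst = cycle - unit + 1
            row.append(f"inst{inst}" if 1 <= inst <= num_instructions else None)
        schedule.append(row)
    return schedule


def print_schedule(schedule):
    """Print the schedule in the same layout as the slide tables."""
    num_units = len(schedule[0])
    header = ["CPUcycle"] + [f"Unit{u}" for u in range(1, num_units + 1)]
    print("  ".join(f"{h:<8}" for h in header))
    for cycle, row in enumerate(schedule, start=1):
        cells = [str(cycle)] + [cell or "" for cell in row]
        print("  ".join(f"{c:<8}" for c in cells).rstrip())


if __name__ == "__main__":
    schedule = pipeline_schedule(8)
    print_schedule(schedule[:8])  # the first 8 cycles, as on the slide
    print("cycles without pipelining:", 8 * 5)
    print("cycles with pipelining:   ", len(schedule))
```

Under these assumptions, 8 instructions need 5 + 8 - 1 = 12 cycles with pipelining, versus 8 × 5 = 40 cycles if each instruction had to finish before the next one started.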

Class Discussion
• Implement the pipeline for the following:
    srl %r3, %r5
    addcc %r1, 10, %r1
    ld %r2, %r4
    subcc %r3, %r1, %r4
    be label

Commercial processors
• Intel Pentium Pro: one of the first to provide speculative execution.
  – 12 pipeline stages.
• Intel Pentium 4: went from 10 to 20 pipeline stages.
  – Why did Intel do this?
    • Increase in pipelining → less work per clock cycle → the clock cycle time can be reduced → clock speeds are increased.
    • Hence, Intel could now support 1.2 GHz+ speeds.

Commercial Processors
• Video by Apple (Apple promotional material): http://www.youtube.com/watch?v=PKF9GOE2q38
• NEXT: Things that affect a pipeline’s performance.
  – Pipeline “bubbles”.

Discussion
• Pipelining is not always efficient. Sometimes an instruction depends on the previous instruction’s result.
  – Implement the pipeline for the following:
      srl %r3, %r5
      addcc %r1, 10, %r1
      ld %r1, %r2
      subcc %r2, %r4, %r4
• E.g.,

  CPUcycle  Unit1  Unit2  Unit3  Unit4  Unit5

Review
• Pipelining
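
As a starting point for the Discussion exercise, the dependencies in its four-instruction sequence can be found mechanically. This is a minimal sketch under a stated assumption: the last operand of each instruction is treated as the destination register and the earlier register operands as sources. The names parse, dependencies_on_previous and PROGRAM are hypothetical. The sketch only reports which instructions depend on their predecessor; how many idle cycles ("bubbles") each dependency costs depends on the hardware and is not modeled here.

```python
PROGRAM = [
    "srl %r3, %r5",
    "addcc %r1, 10, %r1",
    "ld %r1, %r2",
    "subcc %r2, %r4, %r4",
]


def parse(line):
    """Split an instruction into (mnemonic, source registers, destination)."""
    mnemonic, operands = line.split(None, 1)
    ops = [op.strip() for op in operands.split(",")]
    # Assumption: the last operand is the destination; earlier operands that
    # name a register (%r...) are sources. Constants such as 10 are ignored.
    dest = ops[-1]
    sources = [op for op in ops[:-1] if op.startswith("%r")]
    return mnemonic, sources, dest


def dependencies_on_previous(program):
    """Return (producer index, consumer index, register) for each instruction
    that reads the register written by the instruction just before it."""
    parsed = [parse(line) for line in program]
    found = []
    for i in range(1, len(parsed)):
        _, sources, _ = parsed[i]
        _, _, prev_dest = parsed[i - 1]
        if prev_dest in sources:
            found.append((i - 1, i, prev_dest))
    return found


if __name__ == "__main__":
    for producer, consumer, reg in dependencies_on_previous(PROGRAM):
        print(f"'{PROGRAM[consumer]}' needs {reg} from '{PROGRAM[producer]}'")
```

Under this operand convention, the ld depends on the addcc through %r1 and the subcc depends on the ld through %r2, which are the places where the pipeline cannot simply advance every unit on every tick.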