www.getmyuni.com

Central Processing Unit and Pipeline
Subject: Computer Organization And Architecture

CENTRAL PROCESSING UNIT
• Introduction
• General Register Organization
• Stack Organization
• Instruction Formats
• Addressing Modes
• Data Transfer and Manipulation
• Program Control
• Reduced Instruction Set Computer (RISC)

MAJOR COMPONENTS OF CPU
Storage Components: Registers, Flip-flops
Execution (Processing) Components: Arithmetic Logic Unit (ALU): arithmetic calculations, logical computations, shifts/rotates
Transfer Components: Bus
Control Components: Control Unit
[Figure: Register File, ALU, Control Unit]

GENERAL REGISTER ORGANIZATION
[Figure: registers R1-R7 with a common Input and Clock; two multiplexers (MUX), selected by SELA and SELB, drive the A bus and the B bus into the ALU; OPR selects the ALU operation; a 3x8 decoder, selected by SELD, drives the 7 Load lines; the ALU Output returns to the registers]

OPERATION OF CONTROL UNIT
The control unit directs the information flow through the ALU by:
- Selecting various components in the system
- Selecting the function of the ALU

Example: R1 <- R2 + R3
[1] MUX A selector (SELA): BUS A <- R2
[2] MUX B selector (SELB): BUS B <- R3
[3] ALU operation selector (OPR): ALU to ADD
[4] Decoder destination selector (SELD): R1 <- Out Bus

Control Word (14 bits):
SELA (3 bits)   SELB (3 bits)   SELD (3 bits)   OPR (5 bits)

Encoding of register selection fields
Binary Code   SELA    SELB    SELD
000           Input   Input   None
001           R1      R1      R1
010           R2      R2      R2
011           R3      R3      R3
100           R4      R4      R4
101           R5      R5      R5
110           R6      R6      R6
111           R7      R7      R7

ALU CONTROL
Encoding of ALU operations
OPR Select   Operation        Symbol
00000        Transfer A       TSFA
00001        Increment A      INCA
00010        ADD A + B        ADD
00101        Subtract A - B   SUB
00110        Decrement A      DECA
01000        AND A and B      AND
01010        OR A and B       OR
01100        XOR A and B      XOR
01110        Complement A     COMA
10000        Shift right A    SHRA
11000        Shift left A     SHLA

Examples of ALU Microoperations
                   Symbolic Designation
Microoperation     SELA    SELB   SELD   OPR     Control Word
R1 <- R2 - R3      R2      R3     R1     SUB     010 011 001 00101
R4 <- R4 OR R5     R4      R5     R4     OR      100 101 100 01010
R6 <- R6 + 1       R6      -      R6     INCA    110 000 110 00001
R7 <- R1           R1      -      R7     TSFA    001 000 111 00000
Output <- R2       R2      -      None   TSFA    010 000 000 00000
Output <- Input    Input   -      None   TSFA    000 000 000 00000
R4 <- shl R4       R4      -      R4     SHLA    100 000 100 11000
R5 <- 0            R5      R5     R5     XOR     101 101 101 01100

REGISTER STACK ORGANIZATION
Stack
- Very useful feature for nested subroutines and nested loop control
- Also efficient for arithmetic expression evaluation
- Storage which can be accessed in LIFO order
- Pointer: SP
- Only PUSH and POP operations are applicable

Register Stack
[Figure: 64-word register stack, addresses 0 to 63, holding items A, B, C; flags FULL and EMPTY; stack pointer SP; data register DR]

Push, Pop operations
/* Initially, SP = 0, EMPTY = 1, FULL = 0 */
PUSH                              POP
SP <- SP + 1                      DR <- M[SP]
M[SP] <- DR                       SP <- SP - 1
If (SP = 0) then (FULL <- 1)      If (SP = 0) then (EMPTY <- 1)
EMPTY <- 0                        FULL <- 0
(SP addresses 64 words, so incrementing past 63 wraps it back to 0; that is why SP = 0 after a PUSH means the stack is full.)

MEMORY STACK ORGANIZATION
Memory with Program, Data, and Stack Segments
[Figure: memory with a Program (instructions) segment starting at address 1000 and pointed to by PC, a Data (operands) segment pointed to by AR, and a stack segment starting at address 3000 and pointed to by SP; stack addresses 3997 to 4001 are shown; DR is the data register]
- A portion of memory is used as a stack, with a processor register as the stack pointer
- PUSH: SP <- SP - 1
        M[SP] <- DR
- POP:  DR <- M[SP]
        SP <- SP + 1
- Most computers do not provide hardware to check stack overflow (full stack) or underflow (empty stack)

REVERSE POLISH NOTATION
Arithmetic Expressions: A + B
A + B    Infix notation
+ A B    Prefix or Polish notation
A B +    Postfix or reverse Polish notation
- Reverse Polish notation is very suitable for stack manipulation

Evaluation of Arithmetic Expressions
Any arithmetic expression can be expressed in parenthesis-free Polish notation, including reverse Polish notation:
(3 * 4) + (5 * 6)  =>  3 4 * 5 6 * +

Stack contents after each symbol is scanned (top of stack on the right):
Symbol   3    4      *     5       6         *        +
Stack    3    3 4    12    12 5    12 5 6    12 30    42

INSTRUCTION FORMAT
Instruction Fields
OP-code field - specifies the operation to be performed
Address field - designates memory address(es) or processor register(s)
Mode field - specifies the way the operand or the effective address is determined

The number of address fields in the instruction format depends on the internal organization of the CPU
- The three most common CPU organizations:
Single accumulator organization:
    ADD X             /* AC <- AC + M[X] */
General register organization:
    ADD R1, R2, R3    /* R1 <- R2 + R3 */
    ADD R1, R2        /* R1 <- R1 + R2 */
    MOV R1, R2        /* R1 <- R2 */
    ADD R1, X         /* R1 <- R1 + M[X] */
Stack organization:
    PUSH X            /* TOS <- M[X] */
    ADD

THREE, and TWO-ADDRESS INSTRUCTIONS
Three-Address Instructions:
Program to evaluate X = (A + B) * (C + D):
    ADD R1, A, B      /* R1 <- M[A] + M[B] */
    ADD R2, C, D      /* R2 <- M[C] + M[D] */
    MUL X, R1, R2     /* M[X] <- R1 * R2 */
- Results in short programs
- Instruction becomes long (many bits)

Two-Address Instructions:
Program to evaluate X = (A + B) * (C + D):
    MOV R1, A         /* R1 <- M[A] */
    ADD R1, B         /* R1 <- R1 + M[B] */
    MOV R2, C         /* R2 <- M[C] */
    ADD R2, D         /* R2 <- R2 + M[D] */
    MUL R1, R2        /* R1 <- R1 * R2 */
    MOV X, R1         /* M[X] <- R1 */

ONE, and ZERO-ADDRESS INSTRUCTIONS
One-Address Instructions:
- Use an implied AC register for all data manipulation
- Program to evaluate X = (A + B) * (C + D):
    LOAD A            /* AC <- M[A] */
    ADD B             /* AC <- AC + M[B] */
    STORE T           /* M[T] <- AC */
    LOAD C            /* AC <- M[C] */
    ADD D             /* AC <- AC + M[D] */
    MUL T             /* AC <- AC * M[T] */
    STORE X           /* M[X] <- AC */

Zero-Address Instructions:
- Can be found in a stack-organized computer
- Program to evaluate X = (A + B) * (C + D):
    PUSH A            /* TOS <- A */
    PUSH B            /* TOS <- B */
    ADD               /* TOS <- (A + B) */
    PUSH C            /* TOS <- C */
    PUSH D            /* TOS <- D */
    ADD               /* TOS <- (C + D) */
    MUL               /* TOS <- (C + D) * (A + B) */
    POP X             /* M[X] <- TOS */

ADDRESSING MODES
Addressing Modes:
* Specify a rule for interpreting or modifying the address field of the instruction (before the operand is actually referenced)
* A variety of addressing modes exists
  - to give programming flexibility to the user
  - to use the bits in the address field of the instruction efficiently

TYPES OF ADDRESSING MODES
Implied Mode
Addresses of the operands are specified implicitly in the definition of the instruction
- No need to specify an address in the instruction
- EA = AC, or EA = Stack[SP] (EA: Effective Address)
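The zero-address program above is the plainest case of implied mode: ADD and MUL name no operands because both come from the top of the stack. A minimal Python sketch of such a stack machine follows. The function name `run_zero_address`, the tuple encoding of instructions, and the sample values for A, B, C, D are assumptions made for illustration; they are not part of the original slides.

```python
def run_zero_address(program, memory):
    """Run a list of zero-address instructions against a dict-based memory.

    Each instruction is a tuple: ("PUSH", name), ("POP", name), ("ADD",) or ("MUL",).
    This encoding is hypothetical and chosen only to keep the sketch short.
    """
    stack = []
    for instr in program:
        op = instr[0]
        if op == "PUSH":                      # TOS <- M[X]
            stack.append(memory[instr[1]])
        elif op == "POP":                     # M[X] <- TOS
            memory[instr[1]] = stack.pop()
        elif op == "ADD":                     # operands are implied: top two items
            b, a = stack.pop(), stack.pop()
            stack.append(a + b)
        elif op == "MUL":
            b, a = stack.pop(), stack.pop()
            stack.append(a * b)
        else:
            raise ValueError(f"unknown instruction: {op}")
    return memory


# X = (A + B) * (C + D), the same program as on the slide
PROGRAM = [("PUSH", "A"), ("PUSH", "B"), ("ADD",),
           ("PUSH", "C"), ("PUSH", "D"), ("ADD",),
           ("MUL",), ("POP", "X")]

result = run_zero_address(PROGRAM, {"A": 3, "B": 4, "C": 5, "D": 6})
print(result["X"])
```

The instruction sequence is the reverse Polish form A B + C D + * of the expression, which is why a stack machine can evaluate it without any address fields on the arithmetic instructions.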
Immediate Mode
Instead of specifying the address of the operand, the operand itself is specified
- No need to specify an address in the instruction
- However, the operand itself needs to be specified
- Sometimes requires more bits than the address would
- Fast to acquire an operand

Register Mode
The address specified in the instruction is the register address
- The designated operand needs to be in a register
- Shorter address than the memory address
- Saves address field bits in the instruction
- Faster to acquire an operand than with memory addressing
- EA = IR(R) (IR(R): register field of IR)

Register Indirect Mode
The instruction specifies a register which contains the memory address of the operand
- Saves instruction bits, since a register address is shorter than a memory address
- Slower to acquire an operand than either register addressing or memory addressing
- EA = [IR(R)] ([x]: content of x)

Auto-increment or Auto-decrement features:
Same as Register Indirect, but:
- When the address in the register is used to access memory, the value in the register is incremented or decremented by 1 (after or before the execution of the instruction)

Direct Address Mode
The instruction specifies the memory address, which is used directly to access physical memory
- Faster than the other memory addressing modes
- Too many bits are needed to specify the address for a large physical memory space
- EA = IR(address) (IR(address): address field of IR)

Indirect Addressing Mode
The address field of an instruction specifies the address of a memory location that contains the address of the operand
- When an abbreviated address is used, a large physical memory can be addressed with a relatively small number of bits
- Slow to acquire an operand because of the additional memory access
- EA = M[IR(address)]

Relative Addressing Modes
The address field of an instruction specifies part of the address (an abbreviated address), which is combined with a designated register to calculate the address of the operand

PC Relative Addressing Mode (R = PC)
- EA = PC + IR(address)
- The address field of the instruction is short
- A large physical memory can be accessed with a small number of address bits

Indexed Addressing Mode
XR: Index Register
- EA = XR + IR(address)

Base Register Addressing Mode
BAR: Base Address Register
- EA = BAR + IR(address)

ADDRESSING MODES - EXAMPLES
PC = 200, R1 = 400, XR = 100

Address   Memory
200       Load to AC | Mode
201       Address = 500
202       Next instruction
399       450
400       700
500       800
600       900
702       325
800       300

Addressing Mode      Effective Address                          Content of AC
Direct address       500                /* AC <- (500) */       800
Immediate operand    -                  /* AC <- 500 */         500
Indirect address     800                /* AC <- ((500)) */     300
Relative address     702                /* AC <- (PC+500) */    325
Indexed address      600                /* AC <- (XR+500) */    900
Register             -                  /* AC <- R1 */          400
Register indirect    400                /* AC <- (R1) */        700
Autoincrement        400                /* AC <- (R1)+ */       700
Autodecrement        399                /* AC <- -(R1) */       450

DATA TRANSFER INSTRUCTIONS
Typical Data Transfer Instructions
Name       Mnemonic
Load       LD
Store      ST
Move       MOV
Exchange   XCH
Input      IN
Output     OUT
Push       PUSH
Pop        POP

Data Transfer Instructions with Different Addressing Modes
Mode                 Assembly Convention   Register Transfer
Direct address       LD ADR                AC <- M[ADR]
Indirect address     LD @ADR               AC <- M[M[ADR]]
Relative address     LD $ADR               AC <- M[PC + ADR]
Immediate operand    LD #NBR               AC <- NBR
Index addressing     LD ADR(X)             AC <- M[ADR + XR]
Register             LD R1                 AC <- R1
Register indirect    LD (R1)               AC <- M[R1]
Autoincrement        LD (R1)+              AC <- M[R1], R1 <- R1 + 1
Autodecrement        LD -(R1)              R1 <- R1 - 1, AC <- M[R1]

DATA MANIPULATION INSTRUCTIONS
Three Basic Types:
- Arithmetic instructions
- Logical and bit manipulation instructions
- Shift instructions

Arithmetic Instructions
Name                      Mnemonic
Increment                 INC
Decrement                 DEC
Add                       ADD
Subtract                  SUB
Multiply                  MUL
Divide                    DIV
Add with Carry            ADDC
Subtract with Borrow      SUBB
Negate (2's Complement)   NEG
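The numerical addressing-mode example above (PC = 200, R1 = 400, XR = 100, address field = 500) can be checked mechanically. The Python sketch below recomputes the effective address and the value loaded into AC for each mode. The function name `load_ac`, the mode strings, and the constant names are assumptions for illustration. The relative mode uses PC = 202, on the assumption that the PC already points past the two-word instruction at 200-201 when the address is formed, which is what the table's effective address of 702 implies.

```python
# Memory contents taken from the example table (address -> content).
MEMORY = {399: 450, 400: 700, 500: 800, 600: 900, 702: 325, 800: 300}

ADDRESS_FIELD = 500     # second word of the instruction, stored at address 201
PC_AFTER_FETCH = 202    # assumption: PC = 200 plus the two instruction words
R1 = 400
XR = 100


def load_ac(mode):
    """Return (effective_address, ac_value); the address is None when no memory operand is used."""
    if mode == "immediate":
        return None, ADDRESS_FIELD            # the operand is the address field itself
    if mode == "register":
        return None, R1                       # the operand is in the register
    if mode == "direct":
        ea = ADDRESS_FIELD
    elif mode == "indirect":
        ea = MEMORY[ADDRESS_FIELD]            # one extra memory access
    elif mode == "relative":
        ea = PC_AFTER_FETCH + ADDRESS_FIELD
    elif mode == "indexed":
        ea = XR + ADDRESS_FIELD
    elif mode in ("register_indirect", "autoincrement"):
        ea = R1                               # autoincrement uses R1 first, then R1 <- R1 + 1
    elif mode == "autodecrement":
        ea = R1 - 1                           # R1 <- R1 - 1 happens before the access
    else:
        raise ValueError(f"unknown mode: {mode}")
    return ea, MEMORY[ea]


for mode in ("direct", "immediate", "indirect", "relative", "indexed",
             "register", "register_indirect", "autoincrement", "autodecrement"):
    print(mode, load_ac(mode))
```

The sketch does not model the side effect on R1 for the autoincrement and autodecrement modes; it only reproduces the address and AC columns of the table.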
Logical and Bit Manipulation Instructions
Name                Mnemonic
Clear               CLR
Complement          COM
AND                 AND
OR                  OR
Exclusive-OR        XOR
Clear carry         CLRC
Set carry           SETC
Complement carry    COMC
Enable interrupt    EI
Disable interrupt   DI

Shift Instructions
Name                      Mnemonic
Logical shift right       SHR
Logical shift left        SHL
Arithmetic shift right    SHRA
Arithmetic shift left     SHLA
Rotate right              ROR
Rotate left               ROL
Rotate right thru carry   RORC
Rotate left thru carry    ROLC

PROGRAM CONTROL INSTRUCTIONS
The PC is loaded from one of two sources:
- PC + 1: In-Line Sequencing (the next instruction is fetched from the next adjacent location in memory)
- Address from another source (current instruction, stack, etc.): Branch, Conditional Branch, Subroutine, etc.

Program Control Instructions
Name               Mnemonic
Branch             BR
Jump               JMP
Skip               SKP
Call               CALL
Return             RTN
Compare (by -)     CMP
Test (by AND)      TST
* CMP and TST instructions do not retain the results of their operations (- and AND, respectively). They only set or clear certain flags.

Status Flag Circuit
[Figure: 8-bit ALU with 8-bit inputs A and B and output F7 - F0; status bits V, Z, S, C are derived from the carries c7 and c8, the sign bit F7, and a check for zero output]

CONDITIONAL BRANCH INSTRUCTIONS
Mnemonic   Branch condition             Tested condition
BZ         Branch if zero               Z = 1
BNZ        Branch if not zero           Z = 0
BC         Branch if carry              C = 1
BNC        Branch if no carry           C = 0
BP         Branch if plus               S = 0
BM         Branch if minus              S = 1
BV         Branch if overflow           V = 1
BNV        Branch if no overflow        V = 0

Unsigned compare conditions (A - B)
BHI        Branch if higher             A > B
BHE        Branch if higher or equal    A >= B
BLO        Branch if lower              A < B
BLOE       Branch if lower or equal     A <= B
BE         Branch if equal              A = B
BNE        Branch if not equal          A != B

Signed compare conditions (A - B)
BGT        Branch if greater than       A > B
BGE        Branch if greater or equal   A >= B
BLT        Branch if less than          A < B
BLE        Branch if less or equal      A <= B
BE         Branch if equal              A = B
BNE        Branch if not equal          A != B

SUBROUTINE CALL AND RETURN
SUBROUTINE CALL is also known as:
- Call subroutine
- Jump to subroutine
- Branch to subroutine
- Branch and save return address

Two most important operations are implied:
* Branch to the beginning of the subroutine
  - Same as the Branch or Conditional Branch
* Save the return address, so that execution can resume at the correct location in the calling program upon exit from the subroutine
  - Locations for storing the return address:
    • Fixed location in the subroutine (memory)
    • Fixed location in memory
    • In a processor register
    • In a memory stack - most efficient way

CALL                 RTN
SP <- SP - 1         PC <- M[SP]
M[SP] <- PC          SP <- SP + 1
PC <- EA

PROGRAM INTERRUPT
Types of Interrupts:
External interrupts
External interrupts are initiated from outside the CPU and memory
- I/O Device -> Data transfer request or data transfer complete
- Timing Device -> Timeout
- Power Failure

Internal interrupts (traps)
Internal interrupts are caused by the currently running program
- Register or stack overflow
- Divide by zero
- OP-code violation
- Protection violation

Software Interrupts
Both external and internal interrupts are initiated by the computer hardware. Software interrupts are initiated by executing an instruction.
- Supervisor Call -> Switches from user mode to supervisor mode
                  -> Allows execution of a certain class of operations which are not allowed in user mode

INTERRUPT PROCEDURE
Interrupt Procedure and Subroutine Call
- The interrupt is usually initiated by an internal or an external signal rather than by the execution of an instruction (except for the software interrupt)
- The address of the interrupt service program is determined by the hardware rather than from the address field of an instruction
- An interrupt procedure usually stores all the information necessary to define the state of the CPU rather than storing only the PC.
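The difference between the two mechanisms can be sketched in a few lines of Python: CALL and RTN follow the microoperations listed earlier and save only the PC, while interrupt entry saves the PC, the status bits, and the registers. The class name `CPU`, the method names, and in particular the order in which state is pushed are assumptions for illustration only, since the save order depends on the architecture.

```python
class CPU:
    """Toy CPU with a memory stack that grows toward lower addresses."""

    def __init__(self, pc, sp, registers, status):
        self.pc = pc
        self.sp = sp
        self.registers = list(registers)
        self.status = status
        self.memory = {}

    def _push(self, value):
        self.sp -= 1                    # SP <- SP - 1
        self.memory[self.sp] = value    # M[SP] <- value

    def _pop(self):
        value = self.memory[self.sp]    # value <- M[SP]
        self.sp += 1                    # SP <- SP + 1
        return value

    def call(self, ea):
        self._push(self.pc)             # only the return address is saved
        self.pc = ea                    # PC <- EA

    def rtn(self):
        self.pc = self._pop()           # PC <- M[SP]

    def interrupt(self, service_address):
        # Hypothetical save order: PC, then status bits, then every register.
        self._push(self.pc)
        self._push(self.status)
        for value in self.registers:
            self._push(value)
        self.pc = service_address       # address supplied by hardware, not by an instruction

    def return_from_interrupt(self):
        # Restore in the reverse of the save order.
        for i in reversed(range(len(self.registers))):
            self.registers[i] = self._pop()
        self.status = self._pop()
        self.pc = self._pop()


cpu = CPU(pc=100, sp=4001, registers=[1, 2, 3], status=0b0101)
cpu.call(500)
print(cpu.pc, cpu.sp)    # one word was pushed
cpu.rtn()
cpu.interrupt(900)
print(cpu.pc, cpu.sp)    # PC, status, and three registers were pushed
cpu.return_from_interrupt()
```

After `return_from_interrupt` the interrupted program sees exactly the PC, status bits, and register contents it had before, which is the property a plain CALL does not guarantee.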
The state of the CPU is determined from:
- Content of the PC
- Content of all processor registers
- Content of status bits
There are many ways of saving the CPU state, depending on the CPU architecture.

RISC: REDUCED INSTRUCTION SET COMPUTERS
Historical Background
IBM System/360, 1964
- The real beginning of modern computer architecture
- Distinction between Architecture and Implementation
- Architecture: the abstract structure of a computer as seen by an assembly-language programmer
[Figure: layers from High-Level Language through Compiler and Instruction Set (Architecture) down to microprogram and Hardware (Implementation)]

Continuing growth in semiconductor memory and microprogramming
-> Much richer and more complicated instruction sets
=> CISC (Complex Instruction Set Computer)

- Arguments advanced at that time:
  Richer instruction sets would simplify compilers
  Richer instruction sets would alleviate the software crisis
    - move as many functions to the hardware as possible
    - close the semantic gap between machine language and the high-level language
  Richer instruction sets would improve the architecture quality

COMPLEX INSTRUCTION SET COMPUTERS: CISC
High Performance General Purpose Instructions
Characteristics of CISC:
1. A large number of instructions (usually from 100 to 250)
2. Some instructions that perform specialized tasks and are not used frequently
3. Many addressing modes (5 to 20)
4. Variable-length instruction format
5. Instructions that manipulate operands in memory
PHILOSOPHY OF RISC
Reduce the semantic gap between machine instruction and microinstruction

1-Cycle instruction
Most instructions complete their execution in 1 CPU clock cycle - like a microoperation

* Functions of the instruction (in contrast to CISC)
  - Very simple functions
  - Very simple instruction format
  - Similar to microinstructions
  => No need for microprogrammed control

* Register-Register Instructions
  - Avoid memory reference instructions except Load and Store instructions
  - Most of the operands can be found in the registers instead of main memory
  => Shorter instructions
  => Uniform instruction cycle
  => Requirement of a large number of registers

* Employ instruction pipeline

CHARACTERISTICS OF RISC
Common RISC Characteristics
- Operations are register-to-register, with only LOAD and STORE accessing memory
- The operations and addressing modes are reduced
- Instruction formats are simple

RISC Characteristics
- Relatively few instructions
- Relatively few addressing modes
- Memory access limited to load and store instructions
- All operations done within the registers of the CPU
- Fixed-length, easily decoded instruction format
- Single-cycle instruction execution
- Hardwired rather than microprogrammed control

More RISC Characteristics
- A relatively large number of registers in the processor unit
- Efficient instruction pipeline
- Compiler support: provides efficient translation of high-level language programs into machine language programs
Advantages of RISC
- VLSI Realization
- Computing Speed
- Design Costs and Reliability
- High Level Language Support

Parallel processing
A parallel processing system is able to perform concurrent data processing to achieve faster execution time.
The system may have two or more ALUs and be able to execute two or more instructions at the same time.
The goal is to increase throughput: the amount of processing that can be accomplished during a given interval of time.

Parallel processing classification
Single instruction stream, single data stream (SISD)
Single instruction stream, multiple data stream (SIMD)
Multiple instruction stream, single data stream (MISD)
Multiple instruction stream, multiple data stream (MIMD)

Single instruction stream, single data stream (SISD)
• A single computer containing a control unit, a processor unit, and a memory unit
• Instructions are executed sequentially. Parallel processing may be achieved by means of multiple functional units or by pipeline processing

Single instruction stream, multiple data stream (SIMD)
• Represents an organization that includes many processing units under the supervision of a common control unit.
• All processors receive the same instruction, but operate on different data.

Multiple instruction stream, single data stream (MISD)
• Theoretical only
• Processors receive different instructions, but operate on the same data.

Multiple instruction stream, multiple data stream (MIMD)
• A computer system capable of processing several programs at the same time.
• Most multiprocessor and multicomputer systems can be classified in this category

Pipelining: Laundry Example
A small laundry has one washer, one dryer, and one operator. It takes 90 minutes to finish one load:
- Washer takes 30 minutes
- Dryer takes 40 minutes
- "Operator folding" takes 20 minutes
There are four loads: A, B, C, D.

Sequential Laundry
[Figure: timeline from 6 PM to midnight; loads A, B, C, D in task order, each taking 30 + 40 + 20 = 90 min, run back to back]
This operator schedules loads to be delivered to the laundry every 90 minutes, which is the time required to finish one load. In other words, he does not start a new task until the previous task is done. The process is sequential.
Sequential laundry takes 6 hours for 4 loads.

Efficiently scheduled laundry: Pipelined Laundry
The operator starts work as soon as possible.
[Figure: timeline from 6 PM to 9:30 PM; successive intervals of 30, 40, 40, 40, 40, 20 min for loads A, B, C, D in task order]
Another operator asks for loads to be delivered to the laundry every 40 minutes.
Pipelined laundry takes 3.5 hours for 4 loads.

Pipelining Facts
[Figure: pipelined timeline from 6 PM to 9:30 PM for loads A, B, C, D; the washer waits for the dryer for 10 minutes]
- Multiple tasks operate simultaneously
- Pipelining doesn't help the latency of a single task; it helps the throughput of the entire workload
- Pipeline rate is limited by the slowest pipeline stage
- Potential speedup = number of pipe stages
- Unbalanced lengths of pipe stages reduce speedup
- Time to "fill" the pipeline and time to "drain" it reduce speedup

Pipelining
• Decomposes a sequential process into segments.
• Divides the processor into segment processors, each one dedicated to a particular segment.
• Each segment is executed in a dedicated segment processor that operates concurrently with all other segments.
• Information flows through these multiple hardware segments.
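The two laundry totals (6 hours sequential, 3.5 hours pipelined) can be reproduced with a short Python sketch. The function names `sequential_minutes` and `pipelined_minutes` are hypothetical, and the model assumes a finished load may wait in front of a busy stage for as long as needed; under that assumption each stage starts a load as soon as both the load and the stage are free.

```python
def sequential_minutes(stage_minutes, loads):
    """Total time when a new load starts only after the previous one is folded."""
    return loads * sum(stage_minutes)


def pipelined_minutes(stage_minutes, loads):
    """Total time when every stage starts the next load as soon as it can."""
    stage_free_at = [0] * len(stage_minutes)   # when each stage finishes its current load
    for _ in range(loads):
        ready = 0                              # when this load is ready for the next stage
        for s, duration in enumerate(stage_minutes):
            start = max(ready, stage_free_at[s])
            stage_free_at[s] = start + duration
            ready = stage_free_at[s]
    return stage_free_at[-1]


STAGES = [30, 40, 20]    # washer, dryer, folding (minutes)

print(sequential_minutes(STAGES, 4) / 60)   # hours for 4 loads, one at a time
print(pipelined_minutes(STAGES, 4) / 60)    # hours for 4 loads, pipelined
```

With these stage times the dryer is the bottleneck: after the first 30-minute wash, one load leaves the dryer every 40 minutes, and the final fold adds 20 minutes, giving 30 + 4 * 40 + 20 = 210 minutes.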
Pipelining
• Instruction execution is divided into k segments or stages
• An instruction exits pipe stage k-1 and proceeds into pipe stage k
• All pipe stages take the same amount of time, called one processor cycle
• The length of the processor cycle is determined by the slowest pipe stage

Pipelining
• Suppose we want to perform the combined multiply and add operations on a stream of numbers:
• Ai * Bi + Ci for i = 1, 2, 3, ..., 7

Pipelining
• The suboperations performed in each segment of the pipeline are as follows:
• Segment 1: R1 <- Ai, R2 <- Bi
• Segment 2: R3 <- R1 * R2, R4 <- Ci
• Segment 3: R5 <- R3 + R4

Pipeline Performance
n: number of instructions
k: number of stages in the pipeline
τ: clock cycle
Tk: total time
n is equivalent to the number of loads in the laundry example, k is the number of stages (washing, drying, and folding), and the clock cycle is the time of the slowest task.

Tk = (k + (n - 1)) * τ
T1 = n * k * τ
Speedup = T1 / Tk = n * k / (k + (n - 1))

SPEEDUP
• Consider a k-segment pipeline operating on n data sets. (In the laundry example above, k = 3 and n = 4.)
  > It takes k clock cycles to fill the pipeline and get the first result from the output of the pipeline. After that, the remaining (n - 1) results come out one per clock cycle.
  > It therefore takes (k + n - 1) clock cycles to complete the task.
• If we execute the same task sequentially in a single processing unit, it takes (k * n) clock cycles.
• The speedup gained by using the pipeline is:
• S = k * n / (k + n - 1)

SPEEDUP
S = k * n / (k + n - 1)
For n >> k (such as 1 million data sets on a 3-stage pipeline), S ~ k
So for a large number of data sets, the speedup approaches the number of functional units. This is because the multiple functional units can work in parallel except during the filling and draining cycles.
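The speedup formula can be evaluated directly to see how quickly S approaches k. The following Python sketch is only an illustration; the function names `pipeline_cycles` and `speedup` are not from the slides, and it assumes, as the formula does, that every stage takes exactly one clock cycle.

```python
def pipeline_cycles(k, n):
    """Clock cycles for n data sets on a k-segment pipeline: k to fill, then one result per cycle."""
    return k + n - 1


def speedup(k, n):
    """S = k * n / (k + n - 1): sequential cycles divided by pipelined cycles."""
    return (k * n) / pipeline_cycles(k, n)


# k = 3 segments, as in the laundry example, for a growing number of data sets
for n in (1, 4, 100, 1_000_000):
    print(n, pipeline_cycles(3, n), round(speedup(3, n), 4))
```

For n = 1 the speedup is exactly 1, since a single task gains nothing from pipelining; for k = 3 and n = 4 the pipeline needs 6 cycles against 12 sequential ones, so S = 2.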
Example: 6 tasks, divided into 4 segments

Clock cycle   1    2    3    4    5    6    7    8    9
Segment 1     T1   T2   T3   T4   T5   T6
Segment 2          T1   T2   T3   T4   T5   T6
Segment 3               T1   T2   T3   T4   T5   T6
Segment 4                    T1   T2   T3   T4   T5   T6

Reference
Reference Books
• Computer Organization & Architecture, 7e, by Stallings
• Computer System Architecture, by Mano