Chapter 1 An Introduction to Processor Design
Pusan National University, Department of Computer Engineering (PNU Computer Eng.), 2015-04-09

1.1 Processor Architecture & Organization
  - All modern general-purpose computers employ the "stored-program concept".
  - IAS computer, designed by von Neumann at the Institute for Advanced Study, Princeton (1946)
  - First implemented in the 'Baby' machine at the University of Manchester, England (1948)
  [Figure 1.1] The state in a stored-program digital computer
    (processor containing registers; memory containing instructions and data, addressed from 00..00 (hex) to FF..FF (hex); the processor sends an address to memory and exchanges instructions and data with it)

  - 50 years of development: processor performance up, cost down => cost-effective computers (the principles of operation have not changed much)
  - Most of the improvements come from:
    - Advances in electronics technology: vacuum tubes -> transistors -> ICs -> VLSI
    - New insights:
      - Virtual memory (early 1960s)
      - Cache memory
      - Pipelining
      - RISC

1.2 Abstraction in Hardware Design
  - Transistors (the elementary component)
    - Logically act as inverters
  - Logic gates
    - CMOS NAND gate (built from 4 transistors)
      - If A = B = Vdd, output = Vss
      - If either A or B (or both) = Vss, output = Vdd
      - => output = not(A.B)
  [Figure] Transistor circuit, logic symbol, and truth table of the NAND gate
    Truth table:
      A  B  Output
      0  0  1
      0  1  1
      1  0  1
      1  1  0

  The gate abstraction
  - Simplifies the design of circuits with a great number of transistors
  - Removes the need to know that the gate is built from transistors
  - At the functional level, the design is independent of the implementation technology (e.g., field-effect transistors, bipolar transistors)
    - However, performance differences exist
  Levels of abstraction (lowest to highest)
    1. Transistors
    2. Gates, memory cells
    3. Adders, MUXs, decoders, registers
    4. ALUs, shifters, memory blocks
    5. Processors, peripherals, memories
    6. ICs
    7. PCBs
    8. PCs, controllers, mobile phones
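The NAND behaviour and the gate abstraction from Section 1.2 can be sketched in a few lines of Python. This is only an illustrative model: logic levels Vss and Vdd are represented as 0 and 1, and the helper names `nand`, `inverter`, and `and_gate` are assumptions introduced here, not names from the slides.

```python
def nand(a: int, b: int) -> int:
    """Model of a 2-input NAND gate: output is 0 only when both inputs are 1."""
    return 0 if (a == 1 and b == 1) else 1


def inverter(a: int) -> int:
    """An inverter built purely from the NAND abstraction (both inputs tied together)."""
    return nand(a, a)


def and_gate(a: int, b: int) -> int:
    """An AND gate built as NAND followed by an inverter."""
    return inverter(nand(a, b))


# Enumerate all input combinations, in the same order as the truth table above.
truth_table = [(a, b, nand(a, b)) for a in (0, 1) for b in (0, 1)]

for a, b, out in truth_table:
    print(a, b, out)
```

Note that `inverter` and `and_gate` use only `nand` and never refer to transistors, which is the point of the gate abstraction: higher levels are designed without knowing how the level below is implemented.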
1.3 MU0 – a simple processor
  - A simple form of processor can be built from a few basic components:
    - PC (program counter)
    - ACC (accumulator)
    - ALU (arithmetic-logic unit)
    - IR (instruction register)
    - Instruction decoder, control logic
  The MU0 instruction set
  - A 16-bit machine with a 12-bit address space (4K x 2 bytes = 8K bytes of memory)
  - Instructions: 16 bits long (opcode: 4 bits, address field S: 12 bits)
    Format: | opcode (4 bits) | S (12 bits) |

  [Table 1.1] The MU0 instruction set
    Instruction  Opcode  Effect
    LDA S        0000    ACC := mem16[S]
    STO S        0001    mem16[S] := ACC
    ADD S        0010    ACC := ACC + mem16[S]
    SUB S        0011    ACC := ACC - mem16[S]
    JMP S        0100    PC := S
    JGE S        0101    if ACC >= 0 then PC := S
    JNE S        0110    if ACC != 0 then PC := S
    STP          0111    stop

  Datapath
  - A register transfer level (RTL) design style based on registers, MUXs, and so on
  [Figure 1.5] MU0 datapath example
    (PC, IR, ACC, ALU and control logic, connected to memory through the address bus and the data bus)

  RTL design
  [Figure 1.6] MU0 register transfer level organization
  - Control signals:
    - enables on all registers
    - function select lines to the ALU
    - select control lines for the two MUXs
    - control for a tri-state driver that sends the ACC value to memory
    - MEMrq (memory request) and RnW (read/write) control lines

1.4 Instruction set design
  - To build a high-performance processor (beyond the MU0 instruction set), instruction set design is important.
  4-address instructions (the most general form)
    Ex) add d, s1, s2, next_i ; d := s1 + s2
    Format: | function (f bits) | op 1 addr. (n bits) | op 2 addr. (n bits) | dest. addr. (n bits) | next_i addr. (n bits) |
  3-address instructions
    - Make the address of the next instruction implicit by using the PC (except for branches)
    Ex) add d, s1, s2 ; d := s1 + s2
    Format: | function (f bits) | op 1 addr. (n bits) | op 2 addr. (n bits) | dest. addr. (n bits) |
  2-address instructions
    - Make the destination register the same as one of the source registers
    Ex) add d, s1 ; d := d + s1
    Format: | function (f bits) | op 1 addr. (n bits) | dest. addr. (n bits) |
  1-address instructions
    - The accumulator (AC) is used as the destination
    Ex) add s1 ; AC := AC + s1
    Format: | function (f bits) | op 1 addr. (n bits) |
  0-address instructions (using a stack)
    Ex) add ; tos := tos + next on stack
    Format: | function (f bits) |

  Addressing modes
  - Immediate addressing: the instruction contains the data itself
  - Absolute addressing: the instruction contains the full address of the data
  - Indirect addressing: the instruction contains the address of a location that contains the address of the data
  - Register addressing: the data is in a register
  - Register indirect addressing
  - Index addressing
  - Stack addressing

  Control flow instructions
  - Branch, jump
  - Conditional branch
  - Subroutine calls & returns
  - System calls: branch to an operating system routine
  - Exceptions: error handling

1.5 Processor design trade-offs
  CISC vs RISC
  CISC
  - Aims to reduce the semantic gap between high-level languages and machine instructions
  - Instructions perform complex sequences of operations
  - Makes the compiler's job easy
  RISC
  - The "R" in ARM stands for RISC
  - Reducing the semantic gap is not the right way to make an efficient computer

  [Table 1.3] Typical dynamic instruction usage
    Instruction type       Dynamic usage
    Data movement          43%
    Control flow           23%
    Arithmetic operations  15%
    Comparisons            13%
    Logical operations      5%
    Other                   1%

  - Data movement between registers and memory: almost half
  - Control flow such as branches & procedure calls: almost a quarter
  - Arithmetic operations: only 15%
    - Complex arithmetic instructions do not help much
  - The most important techniques for making processors go faster: pipelining, cache memory

  Pipelines
  The stages of a typical instruction, in order:
    1. Fetch
    2. Decode
    3. REG: get operands from the register bank
    4. ALU
    5. MEM: access memory for an operand, if necessary
    6. RES: write the result back to the register bank
  [Figure 1.13] Pipelined instruction execution
    (rows: instructions 1 to 3, each passing through fetch, dec, reg, ALU, mem, res, with each instruction starting one step later than the one before; axes: instruction vs. time)

  Pipeline hazards
  Read-after-write hazard (data hazard)
  - The result from one instruction is used as an operand by the next instruction => instruction 2 must stall until the result is available
  [Figure 1.14] Read-after-write pipeline hazard
    (instruction 1: fetch, dec, reg, ALU, mem, res; instruction 2: fetch, dec, stall, reg, ALU, mem, res; axes: instruction vs. time)

  Branch hazard
  - Solutions:
    - Compute the branch target earlier (if possible)
    - The target may be computed speculatively
    - Delayed branch
  [Figure 1.15] Pipelined branch behavior
    (rows: instruction 1 (branch), 2, 3, 4, 5 (branch target), each passing through fetch, dec, reg, ALU, mem, res; axes: instruction vs. time)

  Pipeline efficiency
  - The deeper the pipeline, the worse these problems get: the RISC approach is better

1.6 RISC
  - In 1980, Patterson: the RISC I project
  RISC I architecture
  - Fixed (32-bit) instruction size with few formats
  - Load-store architecture:
    - Instructions that process data operate only on registers
    - Separate instructions access memory
  - A large register bank (32 32-bit registers) to allow the load-store architecture to operate efficiently
  RISC I organization
  - Hard-wired instruction decode logic
  - Pipelined execution
  - Single-cycle execution
  RISC I advantages
  - A smaller die size
  - A shorter development time
  - A higher performance (controversial)
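As a rough sketch of the pipeline timing described in Section 1.5 (Figures 1.13 and 1.14), the Python function below computes the cycle in which each instruction leaves the six-stage pipeline. The function name `completion_cycles`, the `stalls` dictionary, and the one-cycle stall used in the example are assumptions made for illustration. They are not taken from the slides, and the real stall length depends on the pipeline design.

```python
STAGES = ("fetch", "dec", "reg", "ALU", "mem", "res")


def completion_cycles(n_instructions, stalls=None):
    """Return the cycle (counting from 1) in which each instruction finishes 'res'.

    Simplified in-order model (an assumption, not the slides' exact timing):
    - with no hazards, a new instruction enters the pipeline every cycle;
    - stalls[i] is the number of bubble cycles instruction i waits, and every
      later instruction is delayed by the same amount.
    """
    stalls = stalls or {}
    finish = []
    bubbles = 0  # total stall cycles accumulated so far
    for i in range(n_instructions):
        bubbles += stalls.get(i, 0)
        finish.append(len(STAGES) + i + bubbles)
    return finish


# Ideal case: three instructions, no hazards.
ideal = completion_cycles(3)

# Hypothetical read-after-write hazard: instruction index 1 stalls for one cycle.
with_hazard = completion_cycles(3, {1: 1})

print(ideal)        # [6, 7, 8]
print(with_hazard)  # [6, 8, 9]
```

In this model, n instructions on a k-stage pipeline take k + (n - 1) cycles without hazards, and every stall cycle adds directly to the total, which is why hazards erode the benefit of pipelining.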