COMPUTER ORGANIZATION AND DESIGN
The Hardware/Software Interface

Chapter 2
Instructions: Language of the Computer (Part 2)

Outline
  2.8 Supporting Procedures in Computer Hardware
  2.9 Communicating with People
  2.10 RISC-V Addressing for Wide Immediates and Addresses
  2.11 Parallelism and Instructions: Synchronization
  2.12 Translating and Starting a Program

Chapter 2 (Part 2) — 2

2.8 Supporting Procedures in Computer Hardware

Procedure Call: An Example
  Procedure calls change control flow.
  Caller:
    ...
    c = sum(a,b);
    ...
  Callee:
    int sum(int x, int y) {
      int temp;
      temp = x + y;
      return temp;
    }
  Must specify:
    Procedure address
    Arguments
    Local variables
    Return value
    Return address
  Where to put these data? How to translate into machine code?

Procedure Call
  Required steps:
    1. Place arguments in registers (x10 to x17)
    2. Transfer control to procedure
    3. Acquire storage for procedure
    4. Perform procedure's operations
    5. Place result in register (x10, x11)
    6. Return to place of call (return address in x1)

Procedure Call Instructions
  Procedure call: jump and link
    jal x1, ProcedureLabel
    Address of following instruction put in x1
    Jumps to target address
  Procedure return: jump and link register
    jalr x0, 0(x1)
    Jumps to 0 + address in x1
    Use x0 as rd: we do not need the return address, and x0 cannot be changed (always 0)
    Can also be used for computed jumps, e.g., for case/switch statements

Procedure Call Instructions
  jal x1, ProcedureAddr
    Jump to ProcedureAddr and simultaneously save the address of the following instruction (PC + 4) in register x1
  call:   jal x1, ProcedureAddr
  return: jalr x0, 0(x1)

Execution of a Procedure
  Caller
    (1) Place arguments in registers (x10-x17)
        If more than 8 arguments, push them onto the stack (stack pointer: x2 or sp)
    (2) Transfer control to callee by jal x1, ProcedureAddr
  Callee
    (3) Allocate required storage on stack
    (4) Perform the desired task
    (5) Place the result in register (x10, x11)
    (6) Return control to caller by jalr x0, 0(x1)
  Caller
    (7) Get return result from register (x10, x11)

Leaf Procedure Example
  C code:
    long long int leaf_example (
        long long int g, long long int h,
        long long int i, long long int j) {
      long long int f;
      f = (g + h) - (i + j);
      return f;
    }
  Arguments g, ..., j in x10, ..., x13
  Local variable f in x20
  Temporaries x5, x6
  Save x5, x6, x20 on stack before use
  Result in x10

Leaf Procedure Example
  RISC-V code:
    leaf_example:
      addi sp,sp,-24
      sd   x5,16(sp)      # save x5, x6, x20 on stack
      sd   x6,8(sp)
      sd   x20,0(sp)
      add  x5,x10,x11     # x5 = g + h
      add  x6,x12,x13     # x6 = i + j
      sub  x20,x5,x6      # f = x5 - x6
      addi x10,x20,0      # copy f to result register x10
      ld   x20,0(sp)      # restore x5, x6, x20 from stack
      ld   x6,8(sp)
      ld   x5,16(sp)
      addi sp,sp,24
      jalr x0,0(x1)       # return to caller

Local Data on the Stack
  [Figure: stack contents (a) before the procedure call, (b) during the procedure call, (c) after the procedure call]

Register Name, Use, Calling Convention
  Register   ABI Name   Use                                  Saver
  x0         zero       Hard-wired zero                      -
  x1         ra         Return address                       Caller
  x2         sp         Stack pointer                        Callee
  x3         gp         Global pointer                       -
  x4         tp         Thread pointer                       -
  x5-7       t0-t2      Temporaries                          Caller
  x8         s0/fp      Saved register/frame pointer         Callee
  x9         s1         Saved register                       Callee
  x10-11     a0-a1      Function arguments/return values     Caller
  x12-17     a2-a7      Function arguments                   Caller
  x18-27     s2-s11     Saved registers                      Callee
  x28-31     t3-t6      Temporaries                          Caller

Procedure Call Convention
  ... sum(a,b) ...;
  long long int sum(long long int x, long long int y) {
    long long int temp;
    temp = x + y;
    return temp;
  }
  Return address:     ra (x1)
  Procedure address:  labels
  Arguments:          a0~a7 (x10~x17)
  Local variables:    t0~t6 (x5~x7, x28~x31)
  Return value:       a0, a1 (x10, x11)

Why Procedure Call Convention?
  A contract between caller and callee, so that people who have never seen or even communicated with each other can write procedures that work together
  Preserved registers: if the callee uses them, it saves them on the stack and restores them before returning
  Not preserved registers: the callee uses them freely without saving them
    So if the caller needs them after the call, the caller has to save them
  If the software relies on the global pointer register, it is also preserved
  Under this convention, x5 and x6 in the leaf procedure example need not be saved

Non-Leaf Procedures
  Procedures that call other procedures
  For nested calls, the caller needs to save on the stack:
    Its return address (x1)
    Any arguments (x10-x17) and temporaries (x5-x7, x28-x31) needed after the call
  Restore the saved registers from the stack after the call

Non-Leaf Procedure Example
  C code:
    long long int fact (long long int n) {
      if (n < 1) return 1;
      else return n * fact(n - 1);
    }
  Argument n in x10
  Result in x10

Non-Leaf Procedure Example
  RISC-V code:
    fact:
        addi sp,sp,-16      # make space on stack
        sd   x1,8(sp)       # save return address in x1 onto stack
        sd   x10,0(sp)      # save argument in x10 onto stack
        addi x5,x10,-1      # x5 = n - 1
        bge  x5,x0,L1       # if n >= 1, go to L1
        addi x10,x0,1       # else, set return value to 1
        addi sp,sp,16       # pop stack, don't bother restoring values
        jalr x0,0(x1)       # return
    L1: addi x10,x10,-1     # n = n - 1
        jal  x1,fact        # call fact(n - 1)
        addi x6,x10,0       # move return value of fact(n - 1) to x6
        ld   x10,0(sp)      # restore caller's n
        ld   x1,8(sp)       # restore return address
        addi sp,sp,16       # return space on stack
        mul  x10,x10,x6     # return value = n * fact(n - 1)
        jalr x0,0(x1)       # return

Local Data on the Stack
  [Figure: stack contents (a) before the procedure call, (b) during the procedure call, (c) after the procedure call]
  Local data allocated by callee
    e.g., C automatic variables
  Procedure frame (activation record)
    Used by some compilers to manage stack storage
    Frame pointer: x8 or fp

Memory Layout
  Text: program code
  Static data: global variables
    e.g., static variables in C, constant arrays and strings
    x3 (global pointer, gp) is initialized to an address allowing ±offsets into this segment
  Dynamic data: heap
    e.g., malloc() and free() in C, new in Java
  Stack: automatic storage (local variables)

Summary: Procedure Calls
  Compiler (or assembly programmer) and processor hardware work together to support and translate procedure calls in HLLs
  Processor hardware provides
    Registers: sp (stack pointer), ra (return address), ...
    Instructions: jal, jalr
  Compiler does
    Allocation of memory space for stack and local variables
    Generation of instructions for managing the stack, passing arguments and return values, and jumping to and returning from procedures

2.9 Communicating with People

Character Data
  Byte-encoded character sets
    ASCII: 128 characters (95 graphic, 33 control)
  Unicode: 32-bit character set (fixed-width form: UTF-32)
    Used in Java, ...
    Most of the world's alphabets, plus symbols
    UTF-8, UTF-16: variable-length encodings

Byte/Halfword/Word Operations
  RISC-V byte/halfword/word load/store
  Load byte/halfword/word: sign extend to 64 bits in rd
    lb rd, offset(rs1)
    lh rd, offset(rs1)
    lw rd, offset(rs1)
  Load byte/halfword/word unsigned: zero extend to 64 bits in rd
    lbu rd, offset(rs1)
    lhu rd, offset(rs1)
    lwu rd, offset(rs1)
  Store byte/halfword/word: store rightmost 8/16/32 bits
    sb rs2, offset(rs1)
    sh rs2, offset(rs1)
    sw rs2, offset(rs1)

String Copy Example
  C code (null-terminated string):
    void strcpy (char x[], char y[]) {
      size_t i;
      i = 0;
      while ((x[i]=y[i])!='\0')
        i += 1;
    }
  x: x10, y: x11 (argument registers)
  i: x19 (saved register, needs to be preserved)
  A compiler would actually use temporary registers (x5-x7, x28-x31) for i

String Copy Example
  RISC-V code:
    strcpy:
        addi sp,sp,-8       # adjust stack for 1 doubleword
        sd   x19,0(sp)      # push x19
        add  x19,x0,x0      # i = 0
    L1: add  x5,x19,x11     # x5 = addr of y[i]
        lbu  x6,0(x5)       # x6 = y[i]
        add  x7,x19,x10     # x7 = addr of x[i]
        sb   x6,0(x7)       # x[i] = y[i]
        beq  x6,x0,L2       # if y[i] == 0 then exit
        addi x19,x19,1      # i = i + 1
        jal  x0,L1          # next iteration of loop
    L2: ld   x19,0(sp)      # restore saved x19
        addi sp,sp,8        # pop 1 doubleword from stack
        jalr x0,0(x1)       # and return

2.10 RISC-V Addressing for Wide Immediates and Addresses

32-bit Constants
  I-format only allows a 12-bit immediate; what if we want to load a 32-bit constant?
  Use Load Upper Immediate (lui) + addi
    lui rd, constant
    U-type fields: immediate[31:12] (20 bits) | rd (5 bits) | opcode (7 bits)
    Copies the 20-bit constant to bits [31:12] of rd
    Extends bit 31 to bits [63:32], clears bits [11:0] of rd to 0
  Example:
    lui x19,976           # 976 = 0x003D0
      x19 = 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0011 1101 0000 0000 0000 0000
    addi x19,x19,1280     # 1280 = 0x500
      x19 = 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0011 1101 0000 0101 0000 0000

Branch Addressing
  RISC-V code:
    Loop: beq  x6,x0,End
          addi x19,x19,1
          jal  x0,Loop
    End:
  Branch instructions (SB-type) fields:
    imm[12] imm[10:5] | rs2 | rs1 | funct3 | imm[4:1] imm[11] | opcode
  Example:
    bne x10,x11,2000      # 2000 = 0 0111 1101 0000
    0 111110 | 01011 | 01010 | 001 | 1000 0 | 1100011
  But the immediate field provides only 12 bits for the target address
  How to get the full address for the target instruction?
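The SB-type field layout on the slide above can be sketched in C as follows. This is a minimal illustration, not part of the original slides: the function names encode_sb and decode_sb_offset are hypothetical, and the code assumes the offset is an even byte offset that fits in 13 signed bits.

```c
#include <stdint.h>

/* Hypothetical helper (name and signature are assumptions, not from the
   slides): pack an SB-type conditional branch into a 32-bit word.
   offset is the signed byte offset; its bit 0 is not stored. */
uint32_t encode_sb(uint32_t funct3, uint32_t rs1, uint32_t rs2, int32_t offset) {
    uint32_t imm = (uint32_t)offset;       /* two's complement bit pattern */
    return (((imm >> 12) & 0x1)  << 31)    /* imm[12]   */
         | (((imm >> 5)  & 0x3F) << 25)    /* imm[10:5] */
         | ((rs2 & 0x1F) << 20)            /* rs2       */
         | ((rs1 & 0x1F) << 15)            /* rs1       */
         | ((funct3 & 0x7) << 12)          /* funct3    */
         | (((imm >> 1)  & 0xF)  << 8)     /* imm[4:1]  */
         | (((imm >> 11) & 0x1)  << 7)     /* imm[11]   */
         | 0x63u;                          /* opcode 1100011 */
}

/* Hypothetical inverse: recover the signed byte offset from an SB-type word. */
int32_t decode_sb_offset(uint32_t word) {
    uint32_t imm = (((word >> 31) & 0x1)  << 12)
                 | (((word >> 7)  & 0x1)  << 11)
                 | (((word >> 25) & 0x3F) << 5)
                 | (((word >> 8)  & 0xF)  << 1);
    /* Sign-extend from bit 12 without relying on implementation-defined casts */
    return (int32_t)(imm ^ 0x1000u) - 0x1000;
}
```

With funct3 = 1 (bne), rs1 = 10, rs2 = 11 and offset 2000, encode_sb yields 0x7CB51863, which is the bit pattern 0 111110 01011 01010 001 1000 0 1100011 shown for bne x10,x11,2000.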
Branch Addressing
  Most branch targets are near the branch instruction, forward or backward
  Solution: PC-relative addressing
    The PC holds the 64-bit address of the branch instruction; the target is the PC plus or minus the immediate
    The 12-bit immediate is a signed two's complement integer that is added to the PC if the branch is taken
    The immediate counts halfwords, i.e., target address = PC + immediate × 2 = PC + {imm | 0}
    This keeps the flexibility of supporting 2-byte instructions in RISC-V: the branch immediate is the number of halfwords between the branch instruction and the branch target

Jump Addressing
  Jump and link (jal) target uses a 20-bit immediate for larger range
  UJ-type fields:
    imm[20] imm[10:1] imm[11] imm[19:12] | rd (5 bits) | opcode (7 bits)
  Example:
    jal x0,2000           # 2000 = 0 0000 0000 0111 1101 0000
    0 1111101000 0 00000000 | 00000 | 1101111
  For long jumps, e.g., to a 32-bit absolute address
    lui: load address[31:12] to temp register
    jalr: add address[11:0] and jump to target
      jalr x0,100(x19)

Target Addressing Calculation
  Assume Loop at location 80000
    Instruction                Address   Fields (decimal)
    Loop: slli x10,x22,3       80000       0   3  22   1  10  19
          add  x10,x10,x25     80004       0  25  10   0  10  51
          ld   x9,0(x10)       80008       0   0  10   3   9   3
          bne  x9,x24,Exit     80012       0  24   9   1  12  99
          addi x22,x22,1       80016       0   1  22   0  22  19
          beq  x0,x0,Loop      80020     127   0   0   0  13  99
    Exit: ...                  80024
  Branch offsets: bne at 80012 jumps +12 to Exit; beq at 80020 jumps -20 to Loop
  SB-type fields: imm[12] imm[10:5] | rs2 | rs1 | funct3 | imm[4:1] imm[11] | opcode
    80012: 0000000 11000 01001 001 01100 1100011    immediate = 0 0 000000 0110 0 = +12
    80020: 1111111 00000 00000 000 01101 1100011    immediate = 1 1 111111 0110 0 = -20

RISC-V Addressing Modes
  Immediate addressing:    addi x6,x21,4
  Register addressing:     add x6,x21,x22
  Base addressing:         ld x6,0(x21)
  PC-relative addressing:  beq x20,x21,L1

RISC-V Instruction Formats
  Bits:     31-25          24-20   19-15   14-12    11-7          6-0
  R-type    funct7         rs2     rs1     funct3   rd            opcode
  I-type    imm[11:0]              rs1     funct3   rd            opcode
  S-type    imm[11:5]      rs2     rs1     funct3   imm[4:0]      opcode
  SB-type   imm[12,10:5]   rs2     rs1     funct3   imm[4:1,11]   opcode
  U-type    imm[31:12]                              rd            opcode
  UJ-type   imm[20,10:1,11,19:12]                   rd            opcode

  R-type: arithmetic instructions
  I-type: loads and immediate arithmetic
  S-type: stores
  SB-type: conditional branch format
  UJ-type: unconditional jump format
  U-type: upper immediate format

RISC-V Encoding of Opcodes
  What is the assembly code corresponding to the machine instruction 00578833hex?
    0000 0000 0101 0111 1000 1000 0011 0011
    opcode: 0110011
    funct3: 000
    funct7: 0000000
    rd:  10000
    rs1: 01111
    rs2: 00101
    add x16, x15, x5

2.11 Parallelism and Instructions: Synchronization

Synchronization
  Two processors sharing an area of memory
    P1 writes, then P2 reads
    Data race if P1 and P2 don't synchronize
      Result depends on order of accesses
  Hardware support required
    Atomic read/write memory operation
    No other access to the location allowed between the read and write
  Could be a single instruction
    E.g., atomic swap of register ↔ memory
    Or an atomic pair of instructions

Synchronization in RISC-V
  Load reserved: lr.d rd,(rs1)
    Load from address in rs1 to rd
    Place reservation on memory address
  Store conditional: sc.d rd,rs2,(rs1)
    Stores from rs2 to address in rs1
    Succeeds if location not changed since the lr.d
      Returns 0 in rd
    Fails if location is changed
      Returns non-zero value in rd

Synchronization in RISC-V
  Example 1: atomic swap
    again: lr.d x10,(x20)
           sc.d x11,x23,(x20)    # x11 = status
           bne  x11,x0,again     # branch if store failed
           addi x23,x10,0        # x23 = loaded value
  Example 2: lock (acquire lock at location in x20; 0: lock is free, 1: lock is acquired)
           addi x12,x0,1         # copy locked value
    again: lr.d x10,(x20)        # read lock
           bne  x10,x0,again     # check if it is 0 yet
           sc.d x11,x12,(x20)    # attempt to store
           bne  x11,x0,again     # branch if fails
  Unlock:
           sd   x0,0(x20)        # free lock

2.12 Translating and Starting a Program

Translation and Startup
  [Figure: translation and startup flow]
  Many compilers produce object modules directly

Assembler: Producing an Object Module
  Assembler (or
  compiler) translates programs into object files
  Object file
    Header: describes the size and position of the other pieces of the object file
    Text segment: contains the machine code
    Static data segment: contains data allocated for the life of the program
    Relocation information: identifies instructions and data words that depend on absolute addresses when the program is loaded into memory
    Symbol table: contains the remaining labels that are not defined, such as external references
    Debug information: contains a description of how the modules were compiled so that a debugger can associate machine code with source code and make data structures readable

Assembler Pseudoinstructions
  Most assembly instructions represent machine instructions one-to-one
  Pseudoinstructions: assembly instructions defined by the assembler to ease assembly programming; the hardware does not implement them directly because they can be realized with true instructions
    Pseudoinstruction      True instruction
    li  x9,123             addi x9,x0,123
    j   L1                 jal  x0,L1
    mv  x10,x11            addi x10,x11,0
    and x9,x10,15          andi x9,x10,15

Linker: Linking Object Modules
  Produces an executable file that can be run on a computer
    Merge object modules by placing code and data modules symbolically in memory
    Determine addresses of data and instruction labels using relocation information and symbol table
    Patch internal and external references
  Determine the memory locations each module will occupy
    All absolute references must be relocated to reflect true locations
  Could leave location dependencies for fixing by a relocating loader
    But with virtual memory, no need to do this
    Program can be loaded into absolute location in virtual memory space

Example: Linking Object Files
  [Figures]

Loader: Loading a Program
  Load from an executable file on disk into memory
    1. Read header to determine segment sizes
    2. Create virtual address space
    3. Copy text and initialized data into memory
       Or set page table entries so they can be faulted in
    4. Set up arguments for main() on stack
    5. Initialize registers (including sp, fp, gp)
    6. Jump to startup routine
       Copies arguments to x10, ... and calls main()
       When main() returns, do exit() syscall

Dynamic Linking
  Problems with static libraries
    Library routines become part of the executable code
    Cannot use newer library versions unless the program is rebuilt
    All library routines that are called anywhere in the executable are loaded, even if those calls are never executed
  Dynamically linked libraries (DLLs): only link/load a library procedure when it is called
    Requires procedure code to be relocatable
    Avoids image bloat caused by static linking of all (transitively) referenced libraries
    Automatically picks up new library versions

Lazy Procedure Linkage
  [Figure labels:]
    Indirection table
    Stub: loads routine ID, jumps to linker/loader
    Linker/loader code
    Dynamically mapped code

Translation Hierarchy for Java
  [Figure labels:]
    Java bytecode: simple portable instruction set for the JVM
    JVM: interprets bytecodes
    Just-in-time compiler: compiles bytecodes of "hot" methods into native code for the host machine
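
The lazy procedure linkage idea from the earlier slide (indirection table, stub, linker/loader code) can be mimicked in plain C with function pointers. This is only a sketch under stated assumptions: all names here (lib_square, stub_square, resolve, indirection_table, call_square) are hypothetical, and a real dynamic linker patches the table using loader-specific mechanisms rather than a C array.

```c
/* Type of the single routine signature used in this sketch (assumption) */
typedef long (*routine_fn)(long);

/* Stand-in for a routine living in dynamically mapped library code */
long lib_square(long x) { return x * x; }

/* Routines the pretend linker/loader knows about, indexed by routine ID */
routine_fn library_routines[1] = { lib_square };

long stub_square(long x);   /* forward declaration of the stub */

/* Indirection table: every entry initially points at a stub */
routine_fn indirection_table[1] = { stub_square };

int resolve_count = 0;      /* how many times the resolver ran */

/* Plays the role of the linker/loader code: look up the routine by ID
   and patch the table entry so later calls bypass the stub. */
routine_fn resolve(int routine_id) {
    resolve_count++;
    indirection_table[routine_id] = library_routines[routine_id];
    return indirection_table[routine_id];
}

/* Stub: supplies the routine ID, enters the linker/loader, then
   continues into the routine that was just resolved. */
long stub_square(long x) {
    return resolve(0)(x);
}

/* Call site: always jumps through the indirection table */
long call_square(long x) {
    return indirection_table[0](x);
}
```

The first call_square call goes through the stub and pays the resolution cost once; every later call jumps straight to lib_square through the patched table entry, which is the point of resolving lazily.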