Chapter 2 - Iowa State University

CprE 381 Computer Organization and Assembly Level Programming, Fall 2013
Chapter 2
Instructions: Language of the Computer
Zhao Zhang, Iowa State University
Revised from original slides provided by MKP

Review of Week 4
(§2.8 Supporting Procedures in Computer Hardware)
• MIPS procedure/function call convention
• Leaf and non-leaf examples
• Clearing array example
• String copy example
• Other issues:
  • Load 32-bit immediate
  • Assembler, loader, and compiler effects
Chapter 2 — Instructions: Language of the Computer — 2

Announcements
• Exam 1 on Friday Oct. 4
• Course review on Wednesday Oct. 2
• HW4 is due on Sep. 27
• HW5 will be due on Oct. 11
  • Do HW5 as exercise before Exam 1
• No HW and quizzes next week
• Lab 2 demo is due this week and Lab 3 demo is due next week
• Lab 4 starts next week, due in one week
Chapter 1 — Computer Abstractions and Technology — 3

Exam 1
• Open book and open notes; calculators are allowed
• E-book readers are allowed
  • Must be put in airplane mode
• Coverage
  • Chapter 1, Computer Abstractions and Technology
  • Chapter 2, Instructions: Language of the Computer
  • Some contents from Appendix B
  • MIPS floating-point instructions

Exam Question Types
• Short conceptual questions
• Calculation: speedup, power saving, CPI, etc.
• MIPS assembly programming
  • Translate C statements to MIPS (arithmetic, load/store, branch and jump, others)
  • Translate C functions to MIPS (call convention)
• Among others
Suggestions:
• Review slides and textbook
• Review homework and quizzes

Overview for Week 5, Sep.
23 - 27
• Bubble sorting example
  • It will be used in Mini-Projects
• Floating point instructions
• ARM and x86 instruction set overview

Classic Bubble Sorting
• Bubble sort: swap two adjacent elements if they are out of order
• Pass over the array n times; each pass floats the largest remaining element to the top
• Look at the first pass over five elements:
    1st try: 5 3 8 2 7 => 3 5 8 2 7
    2nd try: 3 5 8 2 7 => 3 5 8 2 7
    3rd try: 3 5 8 2 7 => 3 5 2 8 7
    4th try: 3 5 2 8 7 => 3 5 2 7 8

Classic Bubble Sorting
• Pass i only has to check for (n-i) swaps
• In each pass, an element may float up until it meets a larger element
• The sorted sub-array grows by one element per pass
    1st pass: 5 3 8 2 7 => 3 5 2 7 8
    2nd pass: 3 5 2 7 8 => 3 2 5 7 8
    3rd pass: 3 2 5 7 8 => 2 3 5 7 8
    4th pass: 2 3 5 7 8 => 2 3 5 7 8

Revised Bubble Sorting
• The textbook bubble sort is optimized to reduce comparisons

    void sort (int v[], int n)
    {
      int i, j;
      for (i = 0; i < n; i++) {
        for (j = i - 1; j >= 0 && v[j] > v[j+1]; j--)
          swap(v, j);
      }
    }

Revised Bubble Sorting
• The classic version lets the largest element float to the top of the unsorted sub-array
• The revised version lets an element float to its right place in the sorted sub-array
    1st pass: 5 3 8 2 7 => 3 5 8 2 7
    2nd pass: 3 5 8 2 7 => 3 5 8 2 7
    3rd pass: 3 5 8 2 7 => 2 3 5 8 7
    4th pass: 2 3 5 8 7 => 2 3 5 7 8

The Swap Function
(§2.13 A C Sort Example to Put It All Together)
• The swap function is a leaf function

    void swap(int v[], int k)
    {
      int temp;
      temp = v[k];
      v[k] = v[k+1];
      v[k+1] = temp;
    }

• v in $a0, k in $a1, temp in $t0

The Swap Function

    swap: sll $t1, $a1, 2     # $t1 = k * 4
          add $t1, $a0, $t1   # $t1 = v + (k * 4) (address of v[k])
          lw  $t0, 0($t1)     # $t0 (temp) = v[k]
          lw  $t2, 4($t1)     # $t2 = v[k+1]
          sw  $t2, 0($t1)     # v[k] = $t2 (v[k+1])
          sw  $t0, 4($t1)     # v[k+1] = $t0 (temp)
          jr  $ra             # return to calling routine

The Sort Function

    for (i = 0; i < n; i++) {
      for (j = i - 1; j >= 0 && v[j] > v[j+1]; j--)
        swap(v, j);
    }

• Save $ra to the stack, as sort is a non-leaf function
• Assign i and j to $s0 and $s1
  • They must be preserved when calling swap()
• Move v and n from $a0 and $a1 to $s2 and $s3
  • They must be preserved, too
  • $a0 and $a1 are used when calling swap()
• We need a stack frame of 5 words or 20 bytes

Sort Prologue and Epilogue

    sort:  addi $sp, $sp, -20   # make room on stack for 5 registers
           sw   $ra, 16($sp)    # save $ra on stack
           sw   $s3, 12($sp)    # save $s3 on stack
           sw   $s2, 8($sp)     # save $s2 on stack
           sw   $s1, 4($sp)     # save $s1 on stack
           sw   $s0, 0($sp)     # save $s0 on stack
           ...                  # procedure body
    exit1: lw   $s0, 0($sp)     # restore $s0 from stack
           lw   $s1, 4($sp)     # restore $s1 from stack
           lw   $s2, 8($sp)     # restore $s2 from stack
           lw   $s3, 12($sp)    # restore $s3 from stack
           lw   $ra, 16($sp)    # restore $ra from stack
           addi $sp, $sp, 20    # restore stack pointer
           jr   $ra             # return to calling routine

• Entry: Get a frame, save $ra and $s3-$s0
• Exit: Restore $s0-$s3 and $ra, free the frame

Sort Function Body
• A new pseudo instruction: move rd, rs is equivalent to add rd, rs, $zero
• Example:

    move $s2, $a0   # $s2 = $a0
    move $s3, $a1   # $s3 = $a1

• No use of pseudo assembly instructions in Exam 1

Sort Function Body

             # Move params
             move $s2, $a0           # save $a0 into $s2
             move $s3, $a1           # save $a1 into $s3
             # Outer loop
             move $s0, $zero         # i = 0
    for1tst: slt  $t0, $s0, $s3      # $t0 = 0 if $s0 ≥ $s3 (i ≥ n)
             beq  $t0, $zero, exit1  # go to exit1 if $s0 ≥ $s3 (i ≥ n)
             # Inner loop
             addi $s1, $s0, -1       # j = i - 1
    for2tst: slti $t0, $s1, 0        # $t0 = 1 if $s1 < 0 (j < 0)
             bne  $t0, $zero, exit2  # go to exit2 if $s1 < 0 (j < 0)
             sll  $t1, $s1, 2        # $t1 = j * 4
             add  $t2, $s2, $t1      # $t2 = v + (j * 4)
             lw   $t3, 0($t2)        # $t3 = v[j]
             lw   $t4, 4($t2)        # $t4 = v[j + 1]
             slt  $t0, $t4, $t3      # $t0 = 0 if $t4 ≥ $t3
             beq  $t0, $zero, exit2  # go to exit2 if $t4 ≥ $t3
             # Pass params & call
             move $a0, $s2           # 1st param of swap is v (old $a0)
             move $a1, $s1           # 2nd param of swap is j
             jal  swap               # call swap procedure
             # Inner loop
             addi $s1, $s1, -1       # j -= 1
             j    for2tst            # jump to test of inner loop
             # Outer loop
    exit2:   addi $s0, $s0, 1        # i += 1
             j    for1tst            # jump to test of outer loop

Sort Function Optimized
Old version:

    void sort(int v[], int n)
    {
      int i, j;
      for (i = 0; i < n; i++) {
        for (j = i - 1; j >= 0 && v[j] > v[j+1]; j--)
          swap(v, j);
      }
    }

New version:

    void sort(int v[], int n)
    {
      int *pi, *pj;
      for (pi = v; pi < &v[n]; pi++)
        for (pj = pi - 1; pj >= v && swap(pj); pj--)
          {}
    }

New Swap Function
• A more efficient swap function that reduces memory loads

    // swap two adjacent elements if they are
    // out of order. Return 1 if swapped, 0
    // otherwise
    int swap(int *p)
    {
      if (p[0] > p[1]) {
        int tmp = p[0];
        p[0] = p[1];
        p[1] = tmp;
        return 1;
      }
      else
        return 0;
    }

New Swap Function
• The new swap function in MIPS

    swap: lw   $t0, 0($a0)        # load p[0]
          lw   $t1, 4($a0)        # load p[1]
          slt  $t2, $t1, $t0      # p[1] < p[0]?
          beq  $t2, $zero, else   # if not, skip the swap
          sw   $t1, 0($a0)        # swap
          sw   $t0, 4($a0)        # swap
          addi $v0, $zero, 1      # $v0 = 1
          jr   $ra
    else: addi $v0, $zero, 0      # $v0 = 0
          jr   $ra

New Sort Function
• The sort() function optimized
• Register usage
  • $s0: v
  • $s1: &v[n]
  • $s2: pi
  • $s3: pj
• Need a frame of 5 words to save $ra and $s0-$s3

Sort Prologue and Epilogue

    sort: addi $sp, $sp, -20   # frame of 5 words
          sw   $ra, 16($sp)
          sw   $s3, 12($sp)
          sw   $s2, 8($sp)
          sw   $s1, 4($sp)
          sw   $s0, 0($sp)
          ...                  # MIPS code for sort function body
          lw   $s0, 0($sp)
          lw   $s1, 4($sp)
          lw   $s2, 8($sp)
          lw   $s3, 12($sp)
          lw   $ra, 16($sp)
          addi $sp, $sp, 20    # release frame
          jr   $ra

New Sort: Outer Loop

    for (pi = v; pi < &v[n]; pi++)
      for (pj = pi - 1; pj >= v && swap(pj); pj--)   // C code for the inner loop
        {}

               add  $s0, $a0, $zero       # $s0 = v
               sll  $a1, $a1, 2           # $a1 = 4*n
               add  $s1, $s0, $a1         # $s1 = &v[n]
               add  $s2, $s0, $zero       # pi = v
               j    for1_tst
    for1_loop:
               ...                        # MIPS code for the inner loop
               addi $s2, $s2, 4           # pi++
    for1_tst:  slt  $t0, $s2, $s1         # pi < &v[n]?
               bne  $t0, $zero, for1_loop # yes? repeat

New Sort: Inner Loop

    for (pj = pi - 1; pj >= v && swap(pj); pj--)
      {}

               addi $s3, $s2, -4          # pj = pi - 1
               j    for2_tst
    for2_loop: addi $s3, $s3, -4          # pj--
    for2_tst:  slt  $t0, $s3, $s0         # pj < v?
               bne  $t0, $zero, for2_exit # yes? exit
               add  $a0, $s3, $zero       # $a0 = pj
               jal  swap                  # swap(pj)
               bne  $v0, $zero, for2_loop # ret 1? cont
    for2_exit:

Lab Mini-Projects
• You will use the sorting code to test your CPU design in the lab mini-projects
• Use the new sorting code
  • The new code is more optimized
  • It will simplify the debugging

FP Instructions in MIPS
Reading: Textbook Ch.
3.5 and B-71 to B-80
• FP hardware is coprocessor 1
  • Adjunct processor that extends the ISA
• Separate FP registers
  • 32 single-precision: $f0, $f1, … $f31
  • Paired for double-precision: $f0/$f1, $f2/$f3, …
    • Release 2 of MIPS ISA supports 32 × 64-bit FP registers
Chapter 3 — Arithmetic for Computers — 25

FP Instructions in MIPS
• FP instructions operate only on FP registers
  • Programs generally don't do integer ops on FP data, or vice versa
  • More registers with minimal code-size impact

FP Instructions in MIPS
• FP load and store instructions
  • lwc1, ldc1, swc1, sdc1
    • e.g., ldc1 $f8, 32($sp)
  • lwc1, swc1: load/store single-precision
  • ldc1, sdc1: load/store double-precision

FP Instructions in MIPS
• Single-precision arithmetic
  • add.s, sub.s, mul.s, div.s
    • e.g., add.s $f0, $f1, $f6
• Double-precision arithmetic
  • add.d, sub.d, mul.d, div.d
    • e.g., mul.d $f4, $f4, $f6

FP Instructions in MIPS
• Single- and double-precision comparison
  • c.xx.s, c.xx.d (xx is eq, lt, le, …)
  • Sets or clears the FP condition-code bit
    • e.g., c.lt.s $f3, $f4
• Branch on FP condition code true or false
  • bc1t, bc1f
    • e.g., bc1t TargetLabel

MIPS Call Convention: FP
• The first two FP parameters are passed in registers
  • 1st parameter in $f12 or $f12:$f13
    • A double-precision parameter takes two registers
  • 2nd FP parameter in $f14 or $f14:$f15
  • Extra parameters on the stack
• $f0 stores a single-precision FP return value
• $f0:$f1 stores a double-precision FP return value
• $f0-$f19 are FP temporary registers
• $f20-$f31 are FP saved temporary registers

FP Example: °F to °C
• C code:

    float f2c (float fahr) {
      return ((5.0/9.0) * (fahr - 32.0));
    }

• fahr in $f12, result in $f0
• Assume literals in global memory space, e.g. const5 for 5.0 and const9 for 9.0
• Can an FP immediate be encoded in MIPS instructions?

FP Example: °F to °C
• Compiled MIPS code:

    f2c: lwc1  $f16, const5($gp)
         lwc1  $f18, const9($gp)
         div.s $f16, $f16, $f18
         lwc1  $f18, const32($gp)
         sub.s $f18, $f12, $f18
         mul.s $f0, $f16, $f18
         jr    $ra

FP Example: Function Call

    extern float fahr, cel;
    cel = f2c(fahr);

Assume fahr is at 100($gp), cel is at 104($gp)

    lwc1 $f12, 100($gp)   # load 1st parameter
    jal  f2c
    swc1 $f0, 104($gp)    # save return value

FP Example: Max

    double max(double x, double y) {
      return (x > y) ? x : y;
    }

    max:  c.lt.d $f14, $f12   # y < x?
          bc1f   else         # if false, do else
          mov.d  $f0, $f12    # $f0:$f1 = x
          jr     $ra
    else: mov.d  $f0, $f14    # $f0:$f1 = y
          jr     $ra

FP Example: Max
• How to call max?
• Assume a, b, c at 100($gp), 108($gp), and 116($gp)

    extern double a, b, c;
    c = max(a, b);

    ldc1 $f12, 100($gp)   # $f12:$f13 = a
    ldc1 $f14, 108($gp)   # $f14:$f15 = b
    jal  max
    sdc1 $f0, 116($gp)    # c = $f0:$f1

FP Example: Search Value

    int search(double X[], int size, double value) {
      for (int i = 0; i < size; i++)
        if (X[i] == value)
          return 1;
      return 0;
    }

Note 1: There are integer and FP parameters, and the return value is an integer
Note 2: A real program may search for a value in a range, e.g. [value - delta, value + delta]

FP Example: Search Value

    search:   add    $t0, $zero, $zero     # i = 0
              j      for_cond
    for_loop: sll    $t1, $t0, 3           # $t1 = 8*i
              add    $t1, $a0, $t1         # $t1 = &X[i]
              ldc1   $f2, 0($t1)           # $f2:$f3 = X[i] (double, so ldc1 rather than lwc1)
              c.eq.d $f2, $f12             # X[i] == value?
              bc1f   endif                 # if false, skip
              addi   $v0, $zero, 1         # $v0 = 1
              jr     $ra                   # return
    endif:    addi   $t0, $t0, 1           # i++
    for_cond: slt    $t1, $t0, $a1         # i < size?
              bne    $t1, $zero, for_loop  # repeat if true
              add    $v0, $zero, $zero     # to return 0
              jr     $ra

FP Example: Array Multiplication
• X = X + Y × Z
  • All 32 × 32 matrices, 64-bit double-precision elements
• C code:

    void mm (double x[][32], double y[][32], double z[][32]) {
      int i, j, k;
      for (i = 0; i != 32; i = i + 1)
        for (j = 0; j != 32; j = j + 1)
          for (k = 0; k != 32; k = k + 1)
            x[i][j] = x[i][j] + y[i][k] * z[k][j];
    }

• Addresses of x, y, z in $a0, $a1, $a2, and i, j, k in $s0, $s1, $s2

FP Example: Array Multiplication
• MIPS code:

        li    $t1, 32         # $t1 = 32 (row size/loop end)
        li    $s0, 0          # i = 0; initialize 1st for loop
    L1: li    $s1, 0          # j = 0; restart 2nd for loop
    L2: li    $s2, 0          # k = 0; restart 3rd for loop
        sll   $t2, $s0, 5     # $t2 = i * 32 (size of row of x)
        addu  $t2, $t2, $s1   # $t2 = i * size(row) + j
        sll   $t2, $t2, 3     # $t2 = byte offset of [i][j]
        addu  $t2, $a0, $t2   # $t2 = byte address of x[i][j]
        l.d   $f4, 0($t2)     # $f4 = 8 bytes of x[i][j]
    L3: sll   $t0, $s2, 5     # $t0 = k * 32 (size of row of z)
        addu  $t0, $t0, $s1   # $t0 = k * size(row) + j
        sll   $t0, $t0, 3     # $t0 = byte offset of [k][j]
        addu  $t0, $a2, $t0   # $t0 = byte address of z[k][j]
        l.d   $f16, 0($t0)    # $f16 = 8 bytes of z[k][j]
        …

FP Example: Array Multiplication

        …
        sll   $t0, $s0, 5       # $t0 = i * 32 (size of row of y)
        addu  $t0, $t0, $s2     # $t0 = i * size(row) + k
        sll   $t0, $t0, 3       # $t0 = byte offset of [i][k]
        addu  $t0, $a1, $t0     # $t0 = byte address of y[i][k]
        l.d   $f18, 0($t0)      # $f18 = 8 bytes of y[i][k]
        mul.d $f16, $f18, $f16  # $f16 = y[i][k] * z[k][j]
        add.d $f4, $f4, $f16    # $f4 = x[i][j] + y[i][k] * z[k][j]
        addiu $s2, $s2, 1       # k = k + 1
        bne   $s2, $t1, L3      # if (k != 32) go to L3
        s.d   $f4, 0($t2)       # x[i][j] = $f4
        addiu $s1, $s1, 1       # j = j + 1
        bne   $s1, $t1, L2      # if (j != 32) go to L2
        addiu $s0, $s0, 1       # i = i + 1
        bne   $s0, $t1, L1      # if (i != 32) go to L1

• ARM: the most popular embedded core
• Similar
basic set of instructions to MIPS

ARM & MIPS Similarities
(§2.16 Real Stuff: ARM Instructions)

                             ARM             MIPS
    Date announced           1985            1985
    Instruction size         32 bits         32 bits
    Address space            32-bit flat     32-bit flat
    Data alignment           Aligned         Aligned
    Data addressing modes    9               3
    Registers                15 × 32-bit     31 × 32-bit
    Input/output             Memory mapped   Memory mapped

Compare and Branch in ARM
• Uses condition codes for the result of an arithmetic/logical instruction
  • Negative, zero, carry, overflow
  • Compare instructions set condition codes without keeping the result
• Each instruction can be conditional
  • Top 4 bits of instruction word: condition value
  • Can avoid branches over single instructions

Instruction Encoding
[Figure not recovered]

The Intel x86 ISA
(§2.17 Real Stuff: x86 Instructions)
• Evolution with backward compatibility
  • 8080 (1974): 8-bit microprocessor
    • Accumulator, plus 3 index-register pairs
  • 8086 (1978): 16-bit extension to 8080
    • Complex instruction set (CISC)
  • 8087 (1980): floating-point coprocessor
    • Adds FP instructions and register stack
  • 80286 (1982): 24-bit addresses, MMU
    • Segmented memory mapping and protection
  • 80386 (1985): 32-bit extension (now IA-32)
    • Additional addressing modes and operations
    • Paged memory mapping as well as segments

The Intel x86 ISA
• Further evolution…
  • i486 (1989): pipelined, on-chip caches and FPU
    • Compatible competitors: AMD, Cyrix, …
  • Pentium (1993): superscalar, 64-bit datapath
    • Later versions added MMX (Multi-Media eXtension) instructions
    • The infamous FDIV bug
  • Pentium Pro (1995), Pentium II (1997)
    • New microarchitecture (see Colwell, The Pentium Chronicles)
  • Pentium III (1999)
    • Added SSE (Streaming SIMD Extensions) and associated registers
  • Pentium 4 (2001)
    • New microarchitecture
    • Added SSE2 instructions

The Intel x86 ISA
• And further…
  • AMD64 (2003): extended architecture to 64 bits
  • EM64T, Extended Memory 64 Technology (2004)
    • AMD64 adopted by Intel (with refinements)
    • Added SSE3 instructions
  • Intel Core (2006)
    • Added SSE4 instructions, virtual machine support
  • AMD64 (announced 2007): SSE5 instructions
    • Intel declined to follow, instead…
  • Advanced Vector Extension (announced 2008)
    • Longer SSE registers, more instructions
• If Intel didn't extend with compatibility, its competitors would!
  • Technical elegance ≠ market success

Basic x86 Registers
[Figure not recovered]

Basic x86 Addressing Modes
• Two operands per instruction

    Source/dest operand    Second source operand
    Register               Register
    Register               Immediate
    Register               Memory
    Memory                 Register
    Memory                 Immediate

• Memory addressing modes
  • Address in register
  • Address = Rbase + displacement
  • Address = Rbase + 2^scale × Rindex (scale = 0, 1, 2, or 3)
  • Address = Rbase + 2^scale × Rindex + displacement

x86 Instruction Encoding
• Variable length encoding
  • Postfix bytes specify addressing mode
  • Prefix bytes modify operation
    • Operand length, repetition, locking, …

Implementing IA-32
• Complex instruction set makes implementation difficult
  • Hardware translates instructions to simpler microoperations
    • Simple instructions: 1–1
    • Complex instructions: 1–many
  • Microengine similar to RISC
  • Market share makes this economically viable
• Comparable performance to RISC
  • Compilers avoid complex instructions

Fallacies
(§2.18 Fallacies and Pitfalls)
• Powerful instruction ⇒ higher performance
  • Fewer instructions required
  • But complex instructions are hard to implement
    • May slow down all instructions, including simple ones
  • Compilers are good at making fast code from simple instructions
• Use assembly code for high performance
  • But modern compilers are better at dealing with modern processors
  • More lines of code ⇒ more errors and less productivity

Fallacies
• Backward compatibility ⇒ instruction set doesn't change
  • But they do accrete more instructions
[Figure: x86 instruction set]

Pitfalls
• Sequential words are not at sequential addresses
  • Increment by 4, not by 1!
• Keeping a pointer to an automatic variable after the procedure returns
  • e.g., passing a pointer back via an argument
  • Pointer becomes invalid when the stack is popped

Concluding Remarks
(§2.19 Concluding Remarks)
• Design principles
  1. Simplicity favors regularity
  2. Smaller is faster
  3. Make the common case fast
  4. Good design demands good compromises
• Layers of software/hardware
  • Compiler, assembler, hardware
• MIPS: typical of RISC ISAs
  • c.f. x86

Concluding Remarks
• Measure MIPS instruction executions in benchmark programs
  • Consider making the common case fast
  • Consider compromises

    Instruction class   MIPS examples                       SPEC2006 Int   SPEC2006 FP
    Arithmetic          add, sub, addi                      16%            48%
    Data transfer       lw, sw, lb, lbu, lh, lhu, sb, lui   35%            36%
    Logical             and, or, nor, andi, ori, sll, srl   12%            4%
    Cond. Branch        beq, bne, slt, slti, sltiu          34%            8%
    Jump                j, jr, jal                          2%             0%
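
Cross-checking the two sort versions in C

The index-based and pointer-based sort versions from the bubble sorting slides can be compared on a host machine before the MIPS code is run in the lab. The sketch below is not part of the original slides. The names swap_index, swap_if_unordered, sort_index, sort_ptr, and is_sorted are hypothetical. The pointer version also deviates slightly from the slide: it tests swap_if_unordered(pj - 1) with pj starting at pi, so that no pointer before the start of the array is ever formed (ISO C leaves that case undefined). The sequence of comparisons and swaps is intended to be the same as on the slide.

```c
#include <assert.h>

/* Swap from the slides: exchange v[k] and v[k+1]. */
void swap_index(int v[], int k) {
    int temp = v[k];
    v[k] = v[k + 1];
    v[k + 1] = temp;
}

/* Index-based sort (the "old version" on the slides). */
void sort_index(int v[], int n) {
    assert(n >= 0);
    for (int i = 0; i < n; i++)
        for (int j = i - 1; j >= 0 && v[j] > v[j + 1]; j--)
            swap_index(v, j);
}

/* New swap: exchange p[0] and p[1] only if out of order.
   Returns 1 if a swap happened, 0 otherwise. */
int swap_if_unordered(int *p) {
    if (p[0] > p[1]) {
        int tmp = p[0];
        p[0] = p[1];
        p[1] = tmp;
        return 1;
    }
    return 0;
}

/* Pointer-based sort. Assumption: pj is shifted up by one element
   relative to the slide, so the pair examined is (pj[-1], pj[0]). */
void sort_ptr(int v[], int n) {
    assert(n >= 0);
    if (n < 2)
        return;                      /* nothing to sort */
    for (int *pi = v + 1; pi < v + n; pi++)
        for (int *pj = pi; pj > v && swap_if_unordered(pj - 1); pj--) {
            /* empty body: the swap happens in the loop condition */
        }
}

/* Helper: 1 if v[0..n-1] is in non-decreasing order, else 0. */
int is_sorted(const int v[], int n) {
    for (int i = 1; i < n; i++)
        if (v[i - 1] > v[i])
            return 0;
    return 1;
}
```

Running both functions on copies of the slide example 5 3 8 2 7 and comparing the results element by element gives a quick reference for what the MIPS versions should leave in memory.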

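
Searching for a value within a range in C

Note 2 on the "FP Example: Search Value" slide observes that a real program may search for a value in a range rather than test exact equality. A minimal sketch of that variant follows. The function name search_range and the parameter delta are assumptions and do not appear in the slides.

```c
#include <assert.h>

/* Return 1 if some X[i] lies in [value - delta, value + delta], else 0.
   Hypothetical variant of the slide's search(); delta must be non-negative. */
int search_range(const double X[], int size, double value, double delta) {
    assert(delta >= 0.0);
    for (int i = 0; i < size; i++) {
        if (X[i] >= value - delta && X[i] <= value + delta)
            return 1;   /* found an element inside the tolerance band */
    }
    return 0;           /* no element within delta of value */
}
```

In MIPS, each bound check could plausibly use c.le.d followed by bc1f, in the same style as the c.eq.d test in the slide's search code.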