Machine Programming – Procedures and IA32 Stack
CENG331: Introduction to Computer Systems
6th Lecture
Instructor: Erol Sahin
Acknowledgement: Most of the slides are adapted from the ones prepared by R.E. Bryant, D.R. O’Hallaron of Carnegie-Mellon Univ.

IA32 Stack
  Region of memory managed with stack discipline
  Grows toward lower addresses
  Register %esp contains lowest stack address = address of “top” element
  [Figure: stack “bottom” at the highest address, stack “top” at the lowest; the stack grows down and %esp points at the top]

IA32 Stack: Push
  pushl Src
    Fetch operand at Src
    Decrement %esp by 4
    Write operand at address given by %esp

IA32 Stack: Pop
  popl Dest
    Read operand at address %esp
    Increment %esp by 4
    Write operand to Dest

Procedure Control Flow
  Use stack to support procedure call and return
  Procedure call: call label
    Push return address on stack
    Jump to label
  Return address: address of instruction beyond call
  Example from disassembly
    804854e: e8 3d 06 00 00    call   8048b90 <main>
    8048553: 50                pushl  %eax
    Return address = 0x8048553
  Procedure return: ret
    Pop address from stack
    Jump to address

Procedure Call Example
    804854e: e8 3d 06 00 00    call   8048b90 <main>
    8048553: 50                pushl  %eax

                  Before call    After call
    Mem[0x108]    123            123
    Mem[0x104]    (unused)       0x8048553
    %esp          0x108          0x104
    %eip          0x804854e      0x8048b90
  %eip: program counter

Procedure Return Example
    8048591: c3                ret

                  Before ret     After ret
    Mem[0x108]    123            123
    Mem[0x104]    0x8048553      0x8048553
    %esp          0x104          0x108
    %eip          0x8048591      0x8048553
  %eip: program counter

Stack-Based Languages
  Languages that support recursion
    e.g., C, Pascal, Java
    Code must be “reentrant”
      Multiple simultaneous instantiations of single procedure
    Need some place to store state of each instantiation
      Arguments
      Local variables
      Return pointer
  Stack discipline
    State for given procedure needed for limited time
      From when called to when return
    Callee returns before caller does
  Stack allocated in frames
    State for single procedure instantiation

Call Chain Example
    yoo(…)
    {
      • •
      who();
      • •
    }

    who(…)
    {
      • • •
      amI();
      • • •
      amI();
      • • •
    }

    amI(…)
    {
      • •
      amI();
      • •
    }
  Example call chain: yoo calls who; who calls amI twice; the first amI calls amI, which calls amI again
  Procedure amI is recursive

Stack Frames
  Contents
    Local variables
    Return information
    Temporary space
  Management
    Space allocated when procedure is entered
      “Set-up” code
    Deallocated on return
      “Finish” code
  [Figure: previous frame above the frame for proc; frame pointer %ebp at the start of the frame for proc, stack pointer %esp at the stack “top”]

Stack Example
  Frames on the stack as the call chain runs (stack “bottom” first; %ebp and %esp delimit the last frame listed)
    In yoo                       yoo
    yoo calls who                yoo, who
    who calls amI                yoo, who, amI
    amI calls amI                yoo, who, amI, amI
    amI calls amI                yoo, who, amI, amI, amI
    Innermost amI returns        yoo, who, amI, amI
    amI returns                  yoo, who, amI
    amI returns                  yoo, who
    who calls amI again          yoo, who, amI
    amI returns                  yoo, who
    who returns                  yoo

IA32/Linux Stack Frame
  Current stack frame (“top” to bottom)
    “Argument build”: parameters for function about to be called
    Local variables
      If can’t keep in registers
    Saved register context
    Old frame pointer
  Caller stack frame
    Return address
      Pushed by call instruction
    Arguments for this call
  [Figure: from higher to lower addresses: caller frame ending with arguments and return address; at %ebp, old %ebp; saved registers + local variables; argument build at %esp]

Revisiting swap
    int zip1 = 15213;
    int zip2 = 91125;

    void call_swap()
    {
      swap(&zip1, &zip2);
    }

    void swap(int *xp, int *yp)
    {
      int t0 = *xp;
      int t1 = *yp;
      *xp = t1;
      *yp = t0;
    }
  Calling swap from call_swap
    call_swap:
        • • •
        pushl $zip2    # Global Var
        pushl $zip1    # Global Var
        call swap
        • • •
  Resulting stack (higher to lower addresses): &zip2, &zip1, Rtn adr <- %esp

Revisiting swap
    swap:
        pushl %ebp             # Set Up
        movl %esp,%ebp
        pushl %ebx

        movl 12(%ebp),%ecx     # Body
        movl 8(%ebp),%edx
        movl (%ecx),%eax
        movl (%edx),%ebx
        movl %eax,(%edx)
        movl %ebx,(%ecx)

        movl -4(%ebp),%ebx     # Finish
        movl %ebp,%esp
        popl %ebp
        ret

swap Setup
  Entering stack: &zip2 (yp), &zip1 (xp), Rtn adr <- %esp; %ebp still points into the caller's frame
  #1 pushl %ebp        Old %ebp is pushed; %esp points at it
  #2 movl %esp,%ebp    %ebp now points at the saved Old %ebp
  #3 pushl %ebx        Old %ebx is pushed; %esp points at it
  Resulting stack, offsets relative to %ebp
      12   yp
       8   xp
       4   Rtn adr
       0   Old %ebp    <- %ebp
      -4   Old %ebx    <- %esp
    movl 12(%ebp),%ecx    # get yp
    movl 8(%ebp),%edx     # get xp
    . . .

swap Finish
  #1 movl -4(%ebp),%ebx   %ebx restored from the frame
  #2 movl %ebp,%esp       %esp points at Old %ebp again
  #3 popl %ebp            %ebp restored to the caller's value; %esp points at Rtn adr
  #4 ret                  Return address popped; %esp points at xp
  Observation
    Saved & restored register %ebx
    Didn’t do so for %eax, %ecx, or %edx

Disassembled swap
    080483a4 <swap>:
     80483a4: 55          push   %ebp
     80483a5: 89 e5       mov    %esp,%ebp
     80483a7: 53          push   %ebx
     80483a8: 8b 55 08    mov    0x8(%ebp),%edx
     80483ab: 8b 4d 0c    mov    0xc(%ebp),%ecx
     80483ae: 8b 1a       mov    (%edx),%ebx
     80483b0: 8b 01       mov    (%ecx),%eax
     80483b2: 89 02       mov    %eax,(%edx)
     80483b4: 89 19       mov    %ebx,(%ecx)
     80483b6: 5b          pop    %ebx
     80483b7: c9          leave
     80483b8: c3          ret
  Calling Code
     8048409: e8 96 ff ff ff    call   80483a4 <swap>
     804840e: 8b 45 f8          mov    0xfffffff8(%ebp),%eax
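The pushl, popl, call and ret rules from the procedure slides can be traced in a few lines of C. This is only a sketch of a toy model, not of real hardware: the names toy_cpu, toy_slot, toy_pushl, toy_popl, toy_call and toy_ret, the STACK_BASE and STACK_WORDS constants, and the insn_len parameter are all assumptions introduced here for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Toy model, not real hardware: a few 32-bit stack words plus %esp and %eip.
   STACK_BASE and STACK_WORDS are arbitrary values chosen for illustration. */
#define STACK_BASE  0x100u
#define STACK_WORDS 8u          /* models addresses 0x100 .. 0x11c */

typedef struct {
    uint32_t mem[STACK_WORDS];  /* mem[k] models the word at STACK_BASE + 4*k */
    uint32_t esp;               /* stack pointer: address of the "top" word */
    uint32_t eip;               /* program counter */
} toy_cpu;

/* Map a modelled address to its storage slot; reject anything outside the toy stack. */
uint32_t *toy_slot(toy_cpu *c, uint32_t addr)
{
    assert(addr >= STACK_BASE && addr < STACK_BASE + 4u * STACK_WORDS);
    assert(addr % 4u == 0u);
    return &c->mem[(addr - STACK_BASE) / 4u];
}

/* pushl: decrement %esp by 4, then write the operand at the address in %esp. */
void toy_pushl(toy_cpu *c, uint32_t value)
{
    c->esp -= 4u;
    *toy_slot(c, c->esp) = value;
}

/* popl: read the operand at %esp, then increment %esp by 4. */
uint32_t toy_popl(toy_cpu *c)
{
    uint32_t value = *toy_slot(c, c->esp);
    c->esp += 4u;
    return value;
}

/* call: push the address of the instruction after the call, then jump.
   insn_len is the encoded length of the call instruction (5 bytes for e8 rel32). */
void toy_call(toy_cpu *c, uint32_t target, uint32_t insn_len)
{
    toy_pushl(c, c->eip + insn_len);
    c->eip = target;
}

/* ret: pop the return address and jump to it. */
void toy_ret(toy_cpu *c)
{
    c->eip = toy_popl(c);
}
```

Starting from %esp = 0x108 and %eip = 0x804854e, toy_call(&c, 0x8048b90, 5) leaves 0x8048553 at address 0x104, which matches the call example in the slides; setting %eip to 0x8048591 and calling toy_ret(&c) restores %esp to 0x108 and jumps back to 0x8048553.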
Register Saving Conventions
  When procedure yoo calls who:
    yoo is the caller
    who is the callee
  Can a register be used for temporary storage?
    yoo:
        • • •
        movl $15213, %edx
        call who
        addl %edx, %eax
        • • •
        ret

    who:
        • • •
        movl 8(%ebp), %edx
        addl $91125, %edx
        • • •
        ret
    Contents of register %edx overwritten by who
  Conventions
    “Caller Save”
      Caller saves temporary in its frame before calling
    “Callee Save”
      Callee saves temporary in its frame before using

IA32/Linux Register Usage
  Caller-save temporaries: %eax, %edx, %ecx
    Caller saves prior to call if values are used later
  Callee-save temporaries: %ebx, %esi, %edi
    Callee saves if it wants to use them
  Special: %esp, %ebp
  %eax also used to return integer value

IA32 Procedure Summary
  The Stack Makes Recursion Work
    Private storage for each instance of procedure call
      Instantiations don’t clobber each other
      Addressing of locals + arguments can be relative to stack positions
    Managed by stack discipline
      Procedures return in inverse order of calls
  IA32 Procedures: Combination of Instructions + Conventions
    Call / Ret instructions
    Register usage conventions
      Caller / Callee save
      %ebp and %esp
    Stack frame organization conventions
  [Figure: caller frame ending with arguments and return address; at %ebp, old %ebp; saved registers + local variables; argument build at %esp]

Today
  Arrays
    One-dimensional
    Multi-dimensional (nested)
    Multi-level
  Structures

Basic Data Types
  Integral
    Stored & operated on in general (integer) registers
    Signed vs. unsigned depends on instructions used
      Intel          GAS   Bytes   C
      byte           b     1       [unsigned] char
      word           w     2       [unsigned] short
      double word    l     4       [unsigned] int
      quad word      q     8       [unsigned] long int (x86-64)
  Floating Point
    Stored & operated on in floating point registers
      Intel          GAS   Bytes      C
      Single         s     4          float
      Double         l     8          double
      Extended       t     10/12/16   long double

Array Allocation
  Basic Principle
    T A[L];
    Array of data type T and length L
    Contiguously allocated region of L * sizeof(T) bytes
  Examples (x is the starting address)
    char string[12];    occupies x to x+12
    int val[5];         elements at x, x+4, x+8, x+12, x+16; ends at x+20
    double a[3];        elements at x, x+8, x+16; ends at x+24
    char *p[3];         IA32: elements at x, x+4, x+8; ends at x+12
                        x86-64: elements at x, x+8, x+16; ends at x+24

Array Access
  Basic Principle
    T A[L];
    Array of data type T and length L
    Identifier A can be used as a pointer to array element 0: type T*
  int val[5];    contents 1, 5, 2, 1, 3 at x, x+4, x+8, x+12, x+16; ends at x+20
    Reference    Type     Value
    val[4]       int      3
    val          int *    x
    val+1        int *    x+4
    &val[2]      int *    x+8
    val[5]       int      ??
    *(val+1)     int      5
    val + i      int *    x+4i

Array Example
    typedef int zip_dig[5];

    zip_dig cmu = { 1, 5, 2, 1, 3 };
    zip_dig mit = { 0, 2, 1, 3, 9 };
    zip_dig ucb = { 9, 4, 7, 2, 0 };
  Layout
    zip_dig cmu;    1 5 2 1 3    at 16, 20, 24, 28, 32; ends at 36
    zip_dig mit;    0 2 1 3 9    at 36, 40, 44, 48, 52; ends at 56
    zip_dig ucb;    9 4 7 2 0    at 56, 60, 64, 68, 72; ends at 76
  Declaration “zip_dig cmu” equivalent to “int cmu[5]”
  Example arrays were allocated in successive 20 byte blocks
    Not guaranteed to happen in general

Array Accessing Example
    int get_digit(zip_dig z, int dig)
    {
      return z[dig];
    }
  IA32
    # %edx = z
    # %eax = dig
    movl (%edx,%eax,4),%eax    # z[dig]
  Register %edx contains starting address of array
  Register %eax contains array index
  Desired digit at 4*%eax + %edx
  Use memory reference (%edx,%eax,4)

Referencing Examples
  cmu at 16, mit at 36, ucb at 56 (layout as above)
    Reference    Address             Value    Guaranteed?
    mit[3]       36 + 4* 3 = 48      3        Yes
    mit[5]       36 + 4* 5 = 56      9        No
    mit[-1]      36 + 4*-1 = 32      3        No
    cmu[15]      16 + 4*15 = 76      ??       No
  No bound checking
  Out of range behavior implementation-dependent
    No guaranteed relative allocation of different arrays

Array Loop Example
  Original
    int zd2int(zip_dig z)
    {
      int i;
      int zi = 0;
      for (i = 0; i < 5; i++) {
        zi = 10 * zi + z[i];
      }
      return zi;
    }
  Transformed
    As generated by GCC
    Eliminate loop variable i
    Convert array code to pointer code
    Express in do-while form (no test at entrance)
    int zd2int(zip_dig z)
    {
      int zi = 0;
      int *zend = z + 4;
      do {
        zi = 10 * zi + *z;
        z++;
      } while (z <= zend);
      return zi;
    }

Array Loop Implementation (IA32)
  Registers
    %ecx    z
    %eax    zi
    %ebx    zend
  Computations
    10*zi + *z implemented as *z + 2*(zi+4*zi)
    z++ increments by 4
    # %ecx = z
      xorl %eax,%eax             # zi = 0
      leal 16(%ecx),%ebx         # zend = z+4
    .L59:
      leal (%eax,%eax,4),%edx    # 5*zi
      movl (%ecx),%eax           # *z
      addl $4,%ecx               # z++
      leal (%eax,%edx,2),%eax    # zi = *z + 2*(5*zi)
      cmpl %ebx,%ecx             # z : zend
      jle .L59                   # if <= goto loop

Nested Array Example
    #define PCOUNT 4
    zip_dig pgh[PCOUNT] =
      {{1, 5, 2, 0, 6},
       {1, 5, 2, 1, 3 },
       {1, 5, 2, 1, 7 },
       {1, 5, 2, 2, 1 }};
  Layout
    zip_dig pgh[4];    rows start at 76, 96, 116, 136; array ends at 156
  “zip_dig pgh[4]” equivalent to “int pgh[4][5]”
    Variable pgh: array of 4 elements, allocated contiguously
    Each element is an array of 5 int’s, allocated contiguously
  “Row-Major” ordering of all elements guaranteed

Multidimensional (Nested) Arrays
  Declaration
    T A[R][C];
    2D array of data type T
    R rows, C columns
    Type T element requires K bytes
  Array Size
    R * C * K bytes
  Arrangement
    Row-Major Ordering
    int A[R][C];
      A[0][0] ... A[0][C-1], A[1][0] ... A[1][C-1], ..., A[R-1][0] ... A[R-1][C-1]
      4*R*C bytes

Nested Array Row Access
  Row Vectors
    A[i] is array of C elements
    Each element of type T requires K bytes
    Starting address A + i * (C * K)
  int A[R][C];
    A[0] starts at A, A[i] starts at A+i*C*4, A[R-1] starts at A+(R-1)*C*4

Nested Array Row Access Code
    int *get_pgh_zip(int index)
    {
      return pgh[index];
    }
  What data type is pgh[index]? What is its starting address?
    # %eax = index
    leal (%eax,%eax,4),%eax    # 5 * index
    leal pgh(,%eax,4),%eax     # pgh + (20 * index)
  Row Vector
    pgh[index] is array of 5 int’s
    Starting address pgh+20*index
  IA32 Code
    Computes and returns address
    Compute as pgh + 4*(index+4*index)

Nested Array Element Access
  Array Elements
    A[i][j] is element of type T, which requires K bytes
    Address A + i * (C * K) + j * K = A + (i * C + j) * K
  int A[R][C];
    Row A[i] starts at A+i*C*4; element A[i][j] is at A+i*C*4+j*4

Nested Array Element Access Code
    int get_pgh_digit(int index, int dig)
    {
      return pgh[index][dig];
    }

    # %ecx = dig
    # %eax = index
    leal 0(,%ecx,4),%edx          # 4*dig
    leal (%eax,%eax,4),%eax       # 5*index
    movl pgh(%edx,%eax,4),%eax    # *(pgh + 4*dig + 20*index)
  Array Elements
    pgh[index][dig] is int
    Address: pgh + 20*index + 4*dig
  IA32 Code
    Computes address pgh + 4*dig + 4*(index+4*index)
    movl performs memory reference

Strange Referencing Examples
  zip_dig pgh[4];    rows start at 76, 96, 116, 136; array ends at 156
    Reference     Address                  Value
    pgh[3][3]     76+20*3+4*3  = 148       2
    pgh[2][5]     76+20*2+4*5  = 136       1
    pgh[2][-1]    76+20*2+4*-1 = 112       3
    pgh[4][-1]    76+20*4+4*-1 = 152       1
    pgh[0][19]    76+20*0+4*19 = 152       1
    pgh[0][-1]    76+20*0+4*-1 = 72        ??
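The row-major address formula A + (i*C + j)*K can be checked directly against the compiler's own indexing. The sketch below reuses pgh from the slides; the names ZLEN, pgh_offset and pgh_digit_by_offset are assumptions made up for this illustration. It stays inside the array bounds, because the out-of-range references in the table depend on the layout and are not something portable C promises.

```c
#include <assert.h>
#include <stddef.h>

#define PCOUNT 4
typedef int zip_dig[5];

/* ZLEN is a helper name introduced here: the number of ints in one row (C in the formula). */
enum { ZLEN = sizeof(zip_dig) / sizeof(int) };

zip_dig pgh[PCOUNT] =
    {{1, 5, 2, 0, 6},
     {1, 5, 2, 1, 3},
     {1, 5, 2, 1, 7},
     {1, 5, 2, 2, 1}};

/* Byte offset of pgh[index][dig] from the start of pgh: (index*C + dig) * K,
   with C = ZLEN columns and K = sizeof(int). */
size_t pgh_offset(int index, int dig)
{
    assert(index >= 0 && index < PCOUNT);
    assert(dig >= 0 && dig < ZLEN);
    return ((size_t)index * ZLEN + (size_t)dig) * sizeof(int);
}

/* Fetch the element through the hand-computed offset instead of pgh[index][dig]. */
int pgh_digit_by_offset(int index, int dig)
{
    const char *base = (const char *)pgh;
    return *(const int *)(base + pgh_offset(index, dig));
}
```

With 4-byte ints, pgh_offset(3, 3) is 72, the distance from 76 to 148 in the table, and pgh_digit_by_offset(3, 3) returns the same value as pgh[3][3].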
Strange Referencing Examples (Cont.)
  Guaranteed?
    Yes for pgh[3][3], pgh[2][5], pgh[2][-1], pgh[4][-1] and pgh[0][19], whose addresses all fall inside pgh
    No for pgh[0][-1], whose address (72) falls outside pgh
  Code does not do any bounds checking
  Ordering of elements within array guaranteed

Multi-Level Array Example
    zip_dig cmu = { 1, 5, 2, 1, 3 };
    zip_dig mit = { 0, 2, 1, 3, 9 };
    zip_dig ucb = { 9, 4, 7, 2, 0 };

    #define UCOUNT 3
    int *univ[UCOUNT] = {mit, cmu, ucb};
  Layout
    univ at 160: holds 36 (mit) at 160, 16 (cmu) at 164, 56 (ucb) at 168
    cmu    1 5 2 1 3    at 16, 20, 24, 28, 32; ends at 36
    mit    0 2 1 3 9    at 36, 40, 44, 48, 52; ends at 56
    ucb    9 4 7 2 0    at 56, 60, 64, 68, 72; ends at 76
  Variable univ denotes array of 3 elements
  Each element is a pointer
    4 bytes
  Each pointer points to array of int’s

Element Access in Multi-Level Array
    int get_univ_digit(int index, int dig)
    {
      return univ[index][dig];
    }

    # %ecx = index
    # %eax = dig
    leal 0(,%ecx,4),%edx       # 4*index
    movl univ(%edx),%edx       # Mem[univ+4*index]
    movl (%edx,%eax,4),%eax    # Mem[...+4*dig]
  Computation (IA32)
    Element access Mem[Mem[univ+4*index]+4*dig]
    Must do two memory reads
      First get pointer to row array
      Then access element within array

Array Element Accesses
  Nested array
    int get_pgh_digit(int index, int dig)
    {
      return pgh[index][dig];
    }
  Multi-level array
    int get_univ_digit(int index, int dig)
    {
      return univ[index][dig];
    }
  Access looks similar, but the element is at:
    Nested array:        Mem[pgh+20*index+4*dig]
    Multi-level array:   Mem[Mem[univ+4*index]+4*dig]

Strange Referencing Examples
  univ at 160 holds 36 (mit), 16 (cmu), 56 (ucb); cmu at 16, mit at 36, ucb at 56
    Reference      Address             Value    Guaranteed?
    univ[2][3]     56+4*3  = 68        2        Yes
    univ[1][5]     16+4*5  = 36        0        No
    univ[2][-1]    56+4*-1 = 52        9        No
    univ[3][-1]    ??                  ??       No
    univ[1][12]    16+4*12 = 64        7        No
  Code does not do any bounds checking
  Ordering of elements in different arrays not guaranteed

Using Nested Arrays
  Strengths
    C compiler handles doubly subscripted arrays
    Generates very efficient code
      Avoids multiply in index computation
  Limitation
    Only works for fixed array size
    #define N 16
    typedef int fix_matrix[N][N];

    /* Compute element i,k of fixed matrix product */
    int fix_prod_ele(fix_matrix a, fix_matrix b, int i, int k)
    {
      int j;
      int result = 0;
      for (j = 0; j < N; j++)
        result += a[i][j]*b[j][k];
      return result;
    }
  [Figure: i-th row of a multiplied by k-th column of b]

Dynamic Nested Arrays
  Strength
    Can create matrix of any size
  Programming
    Must do index computation explicitly
  Performance
    Accessing single element costly
    Must do multiplication
    int *new_var_matrix(int n)
    {
      return (int *) calloc(n*n, sizeof(int));
    }

    int var_ele(int *a, int i, int j, int n)
    {
      return a[i*n+j];
    }

    movl 12(%ebp),%eax         # i
    movl 8(%ebp),%edx          # a
    imull 20(%ebp),%eax        # n*i
    addl 16(%ebp),%eax         # n*i+j
    movl (%edx,%eax,4),%eax    # Mem[a+4*(i*n+j)]

Dynamic Array Multiplication
  Without Optimizations
    Multiplies: 3
      2 for subscripts
      1 for data
    Adds: 4
      2 for array indexing
      1 for loop index
      1 for data
    /* Compute element i,k of variable matrix product */
    int var_prod_ele(int *a, int *b, int i, int k, int n)
    {
      int j;
      int result = 0;
      for (j = 0; j < n; j++)
        result += a[i*n+j] * b[j*n+k];
      return result;
    }

Optimizing Dynamic Array Multiplication
  Optimizations
    Performed when optimization level is set to -O2
  Code Motion
    Expression i*n can be computed outside loop
  Strength Reduction
    Incrementing j has effect of incrementing j*n+k by n
  Operations count
    4 adds, 1 mult
  Compiler can optimize regular access patterns
  Before
    {
      int j;
      int result = 0;
      for (j = 0; j < n; j++)
        result += a[i*n+j] * b[j*n+k];
      return result;
    }
  After
    {
      int j;
      int result = 0;
      int iTn = i*n;
      int jTnPk = k;
      for (j = 0; j < n; j++) {
        result += a[iTn+j] * b[jTnPk];
        jTnPk += n;
      }
      return result;
    }

Today
  Structures
  Alignment
  Unions
  Floating point

Structures
    struct rec {
      int i;
      int a[3];
      int *p;
    };
  Memory Layout
    i at offset 0, a at offset 4, p at offset 16; structure ends at 20
  Concept
    Contiguously-allocated region of memory
    Refer to members within structure by names
    Members may be of different types

Accessing Structure Member
    void set_i(struct rec *r, int val)
    {
      r->i = val;
    }
  IA32 Assembly
    # %eax = val
    # %edx = r
    movl %eax,(%edx)    # Mem[r] = val

Generating Pointer to Structure Member
    struct rec {
      int i;
      int a[3];
      int *p;
    };

    int *find_a(struct rec *r, int idx)
    {
      return &r->a[idx];
    }
  What does it do?
    # %ecx = idx
    # %edx = r
    leal 0(,%ecx,4),%eax      # 4*idx
    leal 4(%eax,%edx),%eax    # r+4*idx+4
  Generating Pointer to Array Element
    Offset of each structure member determined at compile time
    Layout: i at r+0, a at r+4, p at r+16; the result is r+4+4*idx

Structure Referencing (Cont.)
  C Code
    struct rec {
      int i;
      int a[3];
      int *p;
    };

    void set_p(struct rec *r)
    {
      r->p = &r->a[r->i];
    }
  What does it do?
    # %edx = r
    movl (%edx),%ecx          # r->i
    leal 0(,%ecx,4),%eax      # 4*(r->i)
    leal 4(%edx,%eax),%eax    # r+4+4*(r->i)
    movl %eax,16(%edx)        # Update r->p
  [Figure: layout i at 0, a at 4, p at 16, end at 20; after set_p, p points to element i of a]

Alignment
  Aligned Data
    Primitive data type requires K bytes
    Address must be multiple of K
    Required on some machines; advised on IA32
      Treated differently by IA32 Linux, x86-64 Linux, and Windows!
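The member offsets and the "address must be a multiple of K" rule above can both be inspected from C itself. The following is a minimal sketch: find_a is the function from the slides, while find_a_by_offset and is_aligned are hypothetical helpers named here for illustration. The offsets 0, 4 and 16 quoted in the layout assume a 4-byte int, as on IA32 and x86-64.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct rec {
    int i;
    int a[3];
    int *p;
};

/* find_a from the slides: the compiler folds the member offset into the address. */
int *find_a(struct rec *r, int idx)
{
    return &r->a[idx];
}

/* Hypothetical helper: the same address built by hand as
   r + offsetof(struct rec, a) + sizeof(int) * idx. */
int *find_a_by_offset(struct rec *r, int idx)
{
    assert(idx >= 0 && idx < 3);
    char *base = (char *)r;
    return (int *)(base + offsetof(struct rec, a) + sizeof(int) * (size_t)idx);
}

/* Hypothetical helper: 1 if addr is a multiple of k, else 0. */
int is_aligned(const void *addr, size_t k)
{
    return (uintptr_t)addr % k == 0;
}
```

For any struct rec r, find_a_by_offset(&r, idx) and find_a(&r, idx) give the same pointer for idx in 0..2, and is_aligned(&r.p, _Alignof(int *)) holds because the compiler places each member at an offset that satisfies its alignment.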
  Motivation for Aligning Data
    Memory accessed by (aligned) chunks of 4 or 8 bytes (system dependent)
      Inefficient to load or store datum that spans quad word boundaries
      Virtual memory very tricky when datum spans 2 pages
  Compiler
    Inserts gaps in structure to ensure correct alignment of fields

Specific Cases of Alignment (IA32)
  1 byte: char, …
    No restrictions on address
  2 bytes: short, …
    Lowest 1 bit of address must be 0 (binary)
  4 bytes: int, float, char *, …
    Lowest 2 bits of address must be 00 (binary)
  8 bytes: double, …
    Windows (and most other OS’s & instruction sets): lowest 3 bits of address must be 000 (binary)
    Linux: lowest 2 bits of address must be 00 (binary)
      i.e., treated the same as a 4-byte primitive data type
  12 bytes: long double
    Windows, Linux: lowest 2 bits of address must be 00 (binary)
      i.e., treated the same as a 4-byte primitive data type

Satisfying Alignment with Structures
    struct S1 {
      char c;
      int i[2];
      double v;
    } *p;
  Within structure
    Must satisfy element’s alignment requirement
  Overall structure placement
    Each structure has alignment requirement K
      K = largest alignment of any element
    Initial address & structure length must be multiples of K
  Example (under Windows or x86-64)
    K = 8, due to double element
      p+0     c                  (multiple of 8)
      p+1     3 bytes padding
      p+4     i[0]               (multiple of 4)
      p+8     i[1]
      p+12    4 bytes padding
      p+16    v                  (multiple of 8)
      p+24    end of structure   (multiple of 8)

Different Alignment Conventions
    struct S1 {
      char c;
      int i[2];
      double v;
    } *p;
  x86-64 or IA32 Windows
    K = 8, due to double element
      c at p+0, 3 bytes padding, i[0] at p+4, i[1] at p+8, 4 bytes padding, v at p+16, end at p+24
  IA32 Linux
    K = 4; double treated like a 4-byte data type
      c at p+0, 3 bytes padding, i[0] at p+4, i[1] at p+8, v at p+12, end at p+20

Saving Space
  Put large data types first
    struct S1 {
      char c;
      int i[2];
      double v;
    } *p;

    struct S2 {
      double v;
      int i[2];
      char c;
    } *p;
  Effect (example x86-64, both have K=8)
    S1: c at p+0, 3 bytes padding, i[0] at p+4, i[1] at p+8, 4 bytes padding, v at p+16, end at p+24
    S2: v at p+0, i[0] at p+8, i[1] at p+12, c at p+16

Arrays of Structures
  Satisfy alignment requirement for every element
    struct S2 {
      double v;
      int i[2];
      char c;
    } a[10];
  Layout
    a[0] at a+0, a[1] at a+24, a[2] at a+48, • • •
    Within a[1]: v at a+24, i[0] at a+32, i[1] at a+36, c at a+40, 7 bytes padding, next element at a+48

Accessing Array Elements
    struct S3 {
      short i;
      float v;
      short j;
    } a[10];
  Compute array offset 12i
  Compute offset 8 within structure
  Assembler gives offset a+8
    Resolved during linking
  Layout of a[i], starting at a+12i
    i at a+12i, 2 bytes padding, v at a+12i+4, j at a+12i+8, 2 bytes padding
    short get_j(int idx)
    {
      return a[idx].j;
    }

    # %eax = idx
    leal (%eax,%eax,2),%eax     # 3*idx
    movswl a+8(,%eax,4),%eax

Union Allocation
  Allocate according to largest element
  Can only use one field at a time
    union U1 {
      char c;
      int i[2];
      double v;
    } *up;
  Layout: c, i[0] and v all start at up+0; i[1] is at up+4; union ends at up+8
    struct S1 {
      char c;
      int i[2];
      double v;
    } *sp;
  Layout: c at sp+0, 3 bytes padding, i[0] at sp+4, i[1] at sp+8, 4 bytes padding, v at sp+16, end at sp+24

Using Union to Access Bit Patterns
    typedef union {
      float f;
      unsigned u;
    } bit_float_t;
  Layout: u and f both occupy bytes 0 to 4
    float bit2float(unsigned u)
    {
      bit_float_t arg;
      arg.u = u;
      return arg.f;
    }
  Same as (float) u ?
    unsigned float2bit(float f)
    {
      bit_float_t arg;
      arg.f = f;
      return arg.u;
    }
  Same as (unsigned) f ?

Byte Ordering Revisited
  Idea
    Short/long/quad words stored in memory as 2/4/8 consecutive bytes
    Which is most (least) significant?
    Can cause problems when exchanging binary data between machines
  Big Endian
    Most significant byte has lowest address
    PowerPC, Sparc
  Little Endian
    Least significant byte has lowest address
    Intel x86

Byte Ordering Example
    union {
      unsigned char c[8];
      unsigned short s[4];
      unsigned int i[2];
      unsigned long l[1];
    } dw;
  [Figure: c[0] to c[7] occupy bytes 0 to 7; s[0] to s[3] overlay them two bytes at a time, i[0] and i[1] four bytes at a time, and l[0] starts at byte 0]

Byte Ordering Example (Cont.)
    int j;
    for (j = 0; j < 8; j++)
      dw.c[j] = 0xf0 + j;

    printf("Characters 0-7 == [0x%x,0x%x,0x%x,0x%x,0x%x,0x%x,0x%x,0x%x]\n",
           dw.c[0], dw.c[1], dw.c[2], dw.c[3],
           dw.c[4], dw.c[5], dw.c[6], dw.c[7]);

    printf("Shorts 0-3 == [0x%x,0x%x,0x%x,0x%x]\n",
           dw.s[0], dw.s[1], dw.s[2], dw.s[3]);

    printf("Ints 0-1 == [0x%x,0x%x]\n",
           dw.i[0], dw.i[1]);

    printf("Long 0 == [0x%lx]\n",
           dw.l[0]);

Byte Ordering on IA32
  Little Endian
    Bytes c[0] to c[7] hold f0 f1 f2 f3 f4 f5 f6 f7
    Within each s[k], i[k] and l[0], the least significant byte (LSB) is at the lowest address
  Print Output on IA32:
    Characters 0-7 == [0xf0,0xf1,0xf2,0xf3,0xf4,0xf5,0xf6,0xf7]
    Shorts     0-3 == [0xf1f0,0xf3f2,0xf5f4,0xf7f6]
    Ints       0-1 == [0xf3f2f1f0,0xf7f6f5f4]
    Long       0   == [0xf3f2f1f0]

Byte Ordering on Sun
  Big Endian
    Bytes c[0] to c[7] hold f0 f1 f2 f3 f4 f5 f6 f7
    Within each s[k], i[k] and l[0], the most significant byte (MSB) is at the lowest address
  Print Output on Sun:
    Characters 0-7 == [0xf0,0xf1,0xf2,0xf3,0xf4,0xf5,0xf6,0xf7]
    Shorts     0-3 == [0xf0f1,0xf2f3,0xf4f5,0xf6f7]
    Ints       0-1 == [0xf0f1f2f3,0xf4f5f6f7]
    Long       0   == [0xf0f1f2f3]

Byte Ordering on x86-64
  Little Endian
    Bytes c[0] to c[7] hold f0 f1 f2 f3 f4 f5 f6 f7
    Within each s[k], i[k] and l[0], the least significant byte (LSB) is at the lowest address
  Print Output on x86-64:
    Characters 0-7 == [0xf0,0xf1,0xf2,0xf3,0xf4,0xf5,0xf6,0xf7]
    Shorts     0-3 == [0xf1f0,0xf3f2,0xf5f4,0xf7f6]
    Ints       0-1 == [0xf3f2f1f0,0xf7f6f5f4]
    Long       0   == [0xf7f6f5f4f3f2f1f0]

Summary
  Arrays in C
    Contiguous allocation of memory
    Aligned to satisfy every element’s alignment requirement
    Pointer to first element
    No bounds checking
  Structures
    Allocate bytes in order declared
    Pad in middle and at end to satisfy alignment
  Unions
    Overlay declarations
    Way to circumvent type system

Today
  Structures
  Alignment
  Unions
  Floating point
    x87 (available with IA32, becoming obsolete)
    SSE3 (available with x86-64)

IA32 Floating Point (x87)
  History
    8086: first computer to
implement IEEE FP separate 8087 FPU (floating point unit) 486: merged FPU and Integer Unit onto one chip Becoming obsolete with x86-64 Summary Hardware to add, multiply, and divide Floating point data registers Various control & status registers Instruction decoder and sequencer Integer Unit FPU Floating Point Formats single precision (C float): 32 bits double precision (C double): 64 bits extended precision (C long double): 80 bits Memory FPU Data Register Stack (x87) FPU register format (80 bit extended precision) 79 78 s exp 0 64 63 frac FPU registers 8 registers %st(0) - %st(7) Logically form stack Top: %st(0) Bottom disappears (drops out) after too many pushs %st(3) %st(2) %st(1) “Top” %st(0) FPU instructions (x87) Large number of floating point instructions and formats ~50 basic instruction types load, store, add, multiply sin, cos, tan, arctan, and log Often slower than math lib Sample instructions: Instruction Effect Description fldz flds Addr fmuls Addr faddp push 0.0 push Mem[Addr] %st(0) %st(0)*M[Addr] %st(1) %st(0)+%st(1);pop Load zero Load single precision real Multiply Add and pop FP Code Example (x87) Compute inner product of two vectors Single precision arithmetic Common computation float ipf (float x[], float y[], int n) { int i; float result = 0.0; for (i = 0; i < n; i++) result += x[i]*y[i]; return result; } pushl %ebp movl %esp,%ebp pushl %ebx movl 8(%ebp),%ebx movl 12(%ebp),%ecx movl 16(%ebp),%edx fldz xorl %eax,%eax cmpl %edx,%eax jge .L3 .L5: flds (%ebx,%eax,4) fmuls (%ecx,%eax,4) faddp incl %eax cmpl %edx,%eax jl .L5 .L3: movl -4(%ebp),%ebx movl %ebp, %esp popl %ebp ret # setup # # # # # # %ebx=&x %ecx=&y %edx=n push +0.0 i=0 if i>=n done # # # # # push x[i] st(0)*=y[i] st(1)+=st(0); pop i++ if i<n repeat # finish # st(0) = result Inner Product Stack Trace eax = i ebx = *x ecx = *y Initialization 1. fldz 0.0 %st(0) Iteration 0 Iteration 1 2. flds (%ebx,%eax,4) 0.0 x[0] 5. flds (%ebx,%eax,4) %st(1) %st(0) 3. 
fmuls (%ecx,%eax,4)      %st(1) = 0.0          %st(0) = x[0]*y[0]
4. faddp                 %st(0) = 0.0+x[0]*y[0]
                         %st(1) = x[0]*y[0]    %st(0) = x[1]
6. fmuls (%ecx,%eax,4)   %st(1) = x[0]*y[0]    %st(0) = x[1]*y[1]
7. faddp                 %st(0) = x[0]*y[0]+x[1]*y[1]

Machine Programming – x86-64 extensions
CENG331: Introduction to Computer Systems
Instructor: Erol Sahin
Acknowledgement: Most of the slides are adapted from the ones prepared by R.E. Bryant, D.R. O’Hallaron of Carnegie-Mellon Univ.

x86-64 Integer Registers
  %rax  %eax      %r8   %r8d
  %rbx  %ebx      %r9   %r9d
  %rcx  %ecx      %r10  %r10d
  %rdx  %edx      %r11  %r11d
  %rsi  %esi      %r12  %r12d
  %rdi  %edi      %r13  %r13d
  %rsp  %esp      %r14  %r14d
  %rbp  %ebp      %r15  %r15d
  Twice the number of registers
  Accessible as 8, 16, 32, 64 bits

x86-64 Integer Registers
  %rax  Return value      %r8   Argument #5
  %rbx  Callee saved      %r9   Argument #6
  %rcx  Argument #4       %r10  Caller saved
  %rdx  Argument #3       %r11  Used for linking
  %rsi  Argument #2       %r12  C: Callee saved
  %rdi  Argument #1       %r13  Callee saved
  %rsp  Stack pointer     %r14  Callee saved
  %rbp  Callee saved      %r15  Callee saved

x86-64 Registers
  Arguments passed to functions via registers
    If more than 6 integral parameters, then pass rest on stack
    These registers can be used as caller-saved as well
  All references to stack frame via stack pointer
    Eliminates need to update %ebp/%rbp
  Other Registers
    6 callee saved (%rbx, %rbp, %r12 through %r15)
    2 or 3 have special uses

x86-64 Long Swap
  void swap(long *xp, long *yp)
  {
      long t0 = *xp;
      long t1 = *yp;
      *xp = t1;
      *yp = t0;
  }

  swap:
      movq  (%rdi), %rdx
      movq  (%rsi), %rax
      movq  %rax, (%rdi)
      movq  %rdx, (%rsi)
      ret

  Operands passed in registers
    First (xp) in %rdi, second (yp) in %rsi
    64-bit pointers
  No stack operations required (except ret)
  Avoiding stack
    Can hold all local information in registers

x86-64 Locals in the Red Zone
  /* Swap, using local array */
  void swap_a(long *xp, long *yp)
  {
      volatile long loc[2];
      loc[0] = *xp;
      loc[1] = *yp;
      *xp = loc[1];
      *yp = loc[0];
  }

  swap_a:
      movq  (%rdi), %rax
      movq  %rax, -24(%rsp)
      movq  (%rsi), %rax
      movq  %rax, -16(%rsp)
      movq  -16(%rsp), %rax
      movq  %rax, (%rdi)
      movq  -24(%rsp), %rax
      movq  %rax, (%rsi)
      ret

  Avoiding Stack Pointer Change
    Can hold all information within small window beyond stack pointer (the red zone: the 128 bytes below %rsp)

  Stack layout (offsets relative to %rsp):
       0   rtn Ptr   <-- %rsp
      −8   unused
     −16   loc[1]
     −24   loc[0]

x86-64 NonLeaf without Stack Frame
  long scount = 0;

  /* Swap a[i] & a[i+1] */
  void swap_ele_se(long a[], int i)
  {
      swap(&a[i], &a[i+1]);
      scount++;
  }

  No values need to be held while swap is being invoked
  No callee save registers needed

  swap_ele_se:
      movslq %esi,%rsi            # Sign extend i
      leaq   (%rdi,%rsi,8), %rdi  # &a[i]
      leaq   8(%rdi), %rsi        # &a[i+1]
      call   swap                 # swap()
      incq   scount(%rip)         # scount++;
      ret

x86-64 Call using Jump
  long scount = 0;

  /* Swap a[i] & a[i+1] */
  void swap_ele(long a[], int i)
  {
      swap(&a[i], &a[i+1]);
  }

  When swap executes ret, it will return from swap_ele
  Possible since swap is a “tail call” (no instructions afterwards)

  swap_ele:
      movslq %esi,%rsi            # Sign extend i
      leaq   (%rdi,%rsi,8), %rdi  # &a[i]
      leaq   8(%rdi), %rsi        # &a[i+1]
      jmp    swap                 # swap()

x86-64 Stack Frame Example
  long sum = 0;

  /* Swap a[i] & a[i+1] */
  void swap_ele_su(long a[], int i)
  {
      swap(&a[i], &a[i+1]);
      sum += a[i];
  }

  Keeps values of a and i in callee save registers
  Must set up stack frame to save these registers

Understanding x86-64 Stack Frame
  swap_ele_su:
      movq   %rbx, -16(%rsp)      # Save %rbx
      movslq %esi,%rbx            # Extend & save i
      movq   %r12, -8(%rsp)       # Save %r12
      movq   %rdi, %r12           # Save a
      leaq   (%rdi,%rbx,8), %rdi  # &a[i]
      subq   $16, %rsp            # Allocate stack frame
      leaq   8(%rdi), %rsi        # &a[i+1]
      call   swap                 # swap()
      movq   (%r12,%rbx,8), %rax  # a[i]
      addq   %rax, sum(%rip)      # sum += a[i]
      movq   (%rsp), %rbx         # Restore %rbx
      movq   8(%rsp), %r12        # Restore %r12
      addq   $16, %rsp            # Deallocate stack frame
      ret

  Stack before the frame is allocated (registers saved in the red zone):
       0   rtn addr   <-- %rsp
      −8   %r12
     −16   %rbx

  Stack after subq $16, %rsp:
     +16   rtn addr
      +8   %r12
       0   %rbx       <-- %rsp

Interesting Features of Stack Frame
  Allocate entire frame at once
    All stack accesses can be relative to %rsp
    Do by decrementing stack pointer
    Can delay allocation, since safe to temporarily use red zone
  Simple deallocation
    Increment stack pointer
    No base/frame pointer needed

Interesting Features of Stack Frame
  Many compiled functions do not require a stack frame other than saving their return address.
  A function does not require a stack frame if:
    All local variables can be held in registers
    The function does not call other functions (referred to as leaf procedures)
  A function requires a stack frame if it:
    Has too many local variables to hold in registers
    Has local variables that are arrays or structures
    Uses the & operator to compute the address of a local variable
    Must pass some arguments on the stack to another function
    Needs to save the state of a callee-save register

General Conditional Expression Translation
  C Code
    val = Test ? Then-Expr : Else-Expr;

    val = x>y ? x-y : y-x;

  Goto Version
        nt = !Test;
        if (nt) goto Else;
        val = Then-Expr;
    Done:
        . . .
    Else:
        val = Else-Expr;
        goto Done;

  Test is expression returning integer
    = 0 interpreted as false
    ≠ 0 interpreted as true
  Create separate code regions for then & else expressions
  Execute appropriate one

Conditionals: x86-64
  int absdiff(int x, int y)
  {
      int result;
      if (x > y) {
          result = x-y;
      } else {
          result = y-x;
      }
      return result;
  }

  absdiff:                    # x in %edi, y in %esi
      movl   %edi, %eax       # eax = x
      movl   %esi, %edx       # edx = y
      subl   %esi, %eax       # eax = x-y
      subl   %edi, %edx       # edx = y-x
      cmpl   %esi, %edi       # x:y
      cmovle %edx, %eax       # eax = edx if <=
      ret

  Conditional move instruction
    cmovC src, dest
    Move value from src to dest if condition C holds
    More efficient than conditional branching (simple control flow)
    But overhead: both branches are evaluated

General Form with Conditional Move
  C Code
    val = Test ? Then-Expr : Else-Expr;

  Conditional Move Version
    val1 = Then-Expr;
    val2 = Else-Expr;
    val1 = val2 if !Test;

  Both values get computed
  Overwrite then-value with else-value if condition doesn’t hold
  Don’t use when:
    Then or else expression has side effects
    Then or else expression is too expensive

Specific Cases of Alignment (x86-64)
  1 byte: char, …
    no restrictions on address
  2 bytes: short, …
    lowest 1 bit of address must be 0 (binary)
  4 bytes: int, float, …
    lowest 2 bits of address must be 00 (binary)
  8 bytes: double, char *, …
    Windows & Linux: lowest 3 bits of address must be 000 (binary)
  16 bytes: long double
    Linux: lowest 4 bits of address must be 0000 (binary), i.e., 16-byte aligned under the System V x86-64 ABI

Vector Instructions: SSE Family
  SIMD (single-instruction, multiple data) vector instructions
    New data types, registers, operations
    Parallel operation on small (length 2-8) vectors of integers or floats
    Example: “4-way” element-wise + and x (figure)
  Floating point vector instructions
    Available with Intel’s SSE (streaming SIMD extensions) family
    SSE starting with Pentium III: 4-way single precision
    SSE2 starting with Pentium 4: 2-way double precision
    All x86-64 have SSE3 (superset of SSE2, SSE)

SSE3 Registers
  All caller saved
  %xmm0 for floating
point return value
  128 bit = 2 doubles = 4 singles

  %xmm0  Argument #1      %xmm8
  %xmm1  Argument #2      %xmm9
  %xmm2  Argument #3      %xmm10
  %xmm3  Argument #4      %xmm11
  %xmm4  Argument #5      %xmm12
  %xmm5  Argument #6      %xmm13
  %xmm6  Argument #7      %xmm14
  %xmm7  Argument #8      %xmm15

SSE3 Registers
  Different data types and associated instructions (each register is 128 bits)
  Integer vectors:
    16-way byte
    8-way 2 bytes
    4-way 4 bytes
  Floating point vectors:
    4-way single
    2-way double
  Floating point scalars:
    single
    double

SSE3 Instructions: Examples
  Single precision 4-way vector add:
    addps %xmm0, %xmm1     (%xmm1 = %xmm0 + %xmm1, all four lanes)
  Single precision scalar add:
    addss %xmm0, %xmm1     (%xmm1 = %xmm0 + %xmm1, lowest lane only)

Extending to x86-64
  Pointers and long ints are 64 bits long.
  Integer arithmetic operations support 8, 16, 32 and 64-bit data types.
  The set of general purpose registers is expanded from 8 to 16.
  Much of the program state is held in registers rather than on the stack.
    Integer and pointer arguments (up to 6) to procedures are passed via registers.
    Some procedures do not need to access the stack at all.
  Conditional operations are implemented using conditional move instructions, when possible, yielding better performance than traditional branching.
  Floating point operations are implemented using register-oriented SSE2, rather than stack-based x87.

Procedures (x86-64): Optimizations
  No base/frame pointer
  Passing arguments to functions through registers (if possible)
  Sometimes: Writing into the “red zone” (below stack pointer)
       0   rtn Ptr   <-- %rsp
      −8   unused
     −16   loc[1]
     −24   loc[0]
  Sometimes: Function call using jmp (instead of call)
  Reason: Performance
    use stack as little as possible
    while obeying rules (e.g., caller/callee save registers)
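
The conditional-move translation mentioned in the summary can be sketched directly in C. The listing below is only an illustration: the names absdiff_branch, absdiff_select and absdiff_selfcheck are hypothetical and do not come from the slides, and whether a compiler actually emits a cmov instruction for the second function depends on the compiler and optimization level. It shows the "compute both values, then overwrite" pattern next to the ordinary branching form.

```c
#include <assert.h>

/* Branching form, following the absdiff slide:
 * only one of the two subtractions is executed. */
int absdiff_branch(int x, int y)
{
    int result;
    if (x > y) {
        result = x - y;
    } else {
        result = y - x;
    }
    return result;
}

/* Conditional-move form (hypothetical name): both candidate values are
 * computed first, then the then-value is overwritten with the else-value
 * when the test fails. This mirrors
 *     val1 = Then-Expr; val2 = Else-Expr; val1 = val2 if !Test;
 * A compiler may turn the final select into cmovle, but that is an
 * assumption about code generation, not a guarantee. */
int absdiff_select(int x, int y)
{
    int val1 = x - y;      /* Then-Expr */
    int val2 = y - x;      /* Else-Expr */
    int test = (x > y);    /* Test */
    if (!test) {
        val1 = val2;       /* overwrite when the condition does not hold */
    }
    return val1;
}

/* Hypothetical self-check: both forms agree on a few small inputs.
 * Inputs are kept small so neither subtraction overflows. */
void absdiff_selfcheck(void)
{
    assert(absdiff_branch(7, 3) == 4);
    assert(absdiff_branch(3, 7) == 4);
    assert(absdiff_select(7, 3) == 4);
    assert(absdiff_select(3, 7) == 4);
    assert(absdiff_select(5, 5) == 0);
}
```

Both subtractions in absdiff_select are cheap and free of side effects, which is exactly the case where the slides recommend the conditional-move form. If either expression dereferenced a possibly null pointer or called a function with side effects, the branching form would be the safe choice.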