
Machine Programming – Procedures and IA32 Stack
CENG331: Introduction to Computer Systems
6th Lecture
Instructor: Erol Sahin
Acknowledgement: Most of the slides are adapted from the ones prepared by R.E. Bryant, D.R. O’Hallaron of Carnegie-Mellon Univ.

IA32 Stack
- Region of memory managed with stack discipline
- Grows toward lower addresses
- Register %esp contains lowest stack address = address of “top” element
[Figure: stack “bottom” at the highest address, stack “top” at the lowest; the stack grows down; stack pointer %esp marks the top]

IA32 Stack: Push
- pushl Src
  - Fetch operand at Src
  - Decrement %esp by 4
  - Write operand at address given by %esp

IA32 Stack: Pop
- popl Dest
  - Read operand at address %esp
  - Increment %esp by 4
  - Write operand to Dest

Procedure Control Flow
- Use stack to support procedure call and return
- Procedure call: call label
  - Push return address on stack
  - Jump to label
- Return address:
  - Address of instruction beyond call
  - Example from disassembly
        804854e: e8 3d 06 00 00    call   8048b90 <main>
        8048553: 50                pushl  %eax
  - Return address = 0x8048553
- Procedure return: ret
  - Pop address from stack
  - Jump to address

Procedure Call Example
    804854e: e8 3d 06 00 00    call   8048b90 <main>
    8048553: 50                pushl  %eax

                 Before call    After call
    0x108        123            123
    0x104                       0x8048553
    %esp         0x108          0x104
    %eip         0x804854e      0x8048b90
  %eip: program counter

Procedure Return Example
    8048591: c3                ret

                 Before ret     After ret
    0x108        123            123
    0x104        0x8048553      0x8048553
    %esp         0x104          0x108
    %eip         0x8048591      0x8048553
  %eip: program counter

Stack-Based Languages
- Languages that support recursion
  - e.g., C, Pascal, Java
  - Code must be “Reentrant”
    - Multiple simultaneous instantiations of single procedure
  - Need some place to store state of each instantiation
    - Arguments
    - Local variables
    - Return pointer
- Stack discipline
  - State for given procedure needed for limited time
    - From when called to when return
  - Callee returns before caller does
- Stack allocated in Frames
  - State for single procedure instantiation

Call Chain Example
    yoo(…) { • • who(); • • }
    who(…) { • • • amI(); • • • amI(); • • • }
    amI(…) { • • amI(); • • }
- Example call chain: yoo -> who -> amI -> amI -> amI, then who -> amI
- Procedure amI is recursive

Stack Frames
- Contents
  - Local variables
  - Return information
  - Temporary space
- Management
  - Space allocated when enter procedure
    - “Set-up” code
  - Deallocated when return
    - “Finish” code
[Figure: previous frame above the frame for proc; frame pointer %ebp at the start of the frame, stack pointer %esp at the stack “top”]

Stack Example
Frames on the stack as the call chain executes (%ebp and %esp delimit the last frame listed):
     1. yoo running:                 yoo
     2. yoo calls who:               yoo, who
     3. who calls amI:               yoo, who, amI
     4. amI calls amI:               yoo, who, amI, amI
     5. amI calls amI:               yoo, who, amI, amI, amI
     6. innermost amI returns:       yoo, who, amI, amI
     7. amI returns:                 yoo, who, amI
     8. amI returns:                 yoo, who
     9. who calls amI a second time: yoo, who, amI
    10. amI returns:                 yoo, who
    11. who returns:                 yoo

IA32/Linux Stack Frame
- Current Stack Frame (“Top” to Bottom)
  - “Argument build:” Parameters for function about to call
  - Local variables
    - If can’t keep in registers
  - Saved register context
  - Old frame pointer
- Caller Stack Frame
  - Return address
    - Pushed by call instruction
  - Arguments for this call
[Figure: from higher to lower addresses: caller frame ending in Arguments and Return Addr; Old %ebp (frame pointer %ebp points here); Saved Registers + Local Variables; Argument Build (stack pointer %esp points here)]

Revisiting swap
    int zip1 = 15213;
    int zip2 = 91125;

    void call_swap()
    {
        swap(&zip1, &zip2);
    }

    void swap(int *xp, int *yp)
    {
        int t0 = *xp;
        int t1 = *yp;
        *xp = t1;
        *yp = t0;
    }

Calling swap from call_swap
    call_swap:
        • • •
        pushl $zip2     # Global Var
        pushl $zip1     # Global Var
        call swap
        • • •
Resulting stack (higher to lower addresses): &zip2, &zip1, Rtn adr <- %esp

    swap:
        pushl %ebp              # Set Up
        movl %esp,%ebp
        pushl %ebx

        movl 12(%ebp),%ecx      # Body
        movl 8(%ebp),%edx
        movl (%ecx),%eax
        movl (%edx),%ebx
        movl %eax,(%edx)
        movl %ebx,(%ecx)

        movl -4(%ebp),%ebx      # Finish
        movl %ebp,%esp
        popl %ebp
        ret

swap Setup
Entering stack: &zip2 (yp), &zip1 (xp), Rtn adr <- %esp; %ebp still points into the caller’s frame.
    pushl %ebp          # Old %ebp pushed; %esp points to it
    movl %esp,%ebp      # %ebp now points to Old %ebp as well
    pushl %ebx          # Old %ebx pushed; %esp points to it
Resulting stack, offsets relative to %ebp:
    12   yp (&zip2)
     8   xp (&zip1)
     4   Rtn adr
     0   Old %ebp   <- %ebp
    -4   Old %ebx   <- %esp

    movl 12(%ebp),%ecx  # get yp
    movl 8(%ebp),%edx   # get xp
    . . .
swap Finish
swap’s stack before the finish code, offsets relative to %ebp:
    12   yp
     8   xp
     4   Rtn adr
     0   Old %ebp   <- %ebp
    -4   Old %ebx   <- %esp

    movl -4(%ebp),%ebx  # restore %ebx; stack unchanged
    movl %ebp,%esp      # %esp now points to Old %ebp
    popl %ebp           # %ebp restored; %esp points to Rtn adr
    ret                 # pops Rtn adr; %esp points to xp
- Observation
  - Saved & restored register %ebx
  - Didn’t do so for %eax, %ecx, or %edx

Disassembled swap
    080483a4 <swap>:
     80483a4: 55          push   %ebp
     80483a5: 89 e5       mov    %esp,%ebp
     80483a7: 53          push   %ebx
     80483a8: 8b 55 08    mov    0x8(%ebp),%edx
     80483ab: 8b 4d 0c    mov    0xc(%ebp),%ecx
     80483ae: 8b 1a       mov    (%edx),%ebx
     80483b0: 8b 01       mov    (%ecx),%eax
     80483b2: 89 02       mov    %eax,(%edx)
     80483b4: 89 19       mov    %ebx,(%ecx)
     80483b6: 5b          pop    %ebx
     80483b7: c9          leave
     80483b8: c3          ret

Calling Code
     8048409: e8 96 ff ff ff    call   80483a4 <swap>
     804840e: 8b 45 f8          mov    0xfffffff8(%ebp),%eax

Register Saving Conventions
- When procedure yoo calls who:
  - yoo is the caller
  - who is the callee
- Can register be used for temporary storage?
    yoo:
        • • •
        movl $15213, %edx
        call who
        addl %edx, %eax
        • • •
        ret

    who:
        • • •
        movl 8(%ebp), %edx
        addl $91125, %edx
        • • •
        ret
  - Contents of register %edx overwritten by who
- Conventions
  - “Caller Save”
    - Caller saves temporary in its frame before calling
  - “Callee Save”
    - Callee saves temporary in its frame before using

IA32/Linux Register Usage
- %eax, %edx, %ecx (caller-save temporaries)
  - Caller saves prior to call if values are used later
- %eax
  - Also used to return integer value
- %ebx, %esi, %edi (callee-save temporaries)
  - Callee saves if wants to use them
- %esp, %ebp
  - Special

IA32 Procedure Summary
- The Stack Makes Recursion Work
  - Private storage for each instance of procedure call
    - Instantiations don’t clobber each other
    - Addressing of locals + arguments can be relative to stack positions
  - Managed by stack discipline
    - Procedures return in inverse order of calls
- IA32 Procedures: Combination of Instructions + Conventions
  - Call / Ret instructions
  - Register usage conventions
    - Caller / Callee save
    - %ebp and %esp
  - Stack frame organization conventions
[Figure: caller frame (Arguments, Return Addr), Old %ebp at %ebp, Saved Registers + Local Variables, Argument Build at %esp]

Today
- Arrays
  - One-dimensional
  - Multi-dimensional (nested)
  - Multi-level
- Structures

Basic Data Types
- Integral
  - Stored & operated on in general (integer) registers
  - Signed vs. unsigned depends on instructions used
        Intel         GAS   Bytes   C
        byte          b     1       [unsigned] char
        word          w     2       [unsigned] short
        double word   l     4       [unsigned] int
        quad word     q     8       [unsigned] long int (x86-64)
- Floating Point
  - Stored & operated on in floating point registers
        Intel         GAS   Bytes      C
        Single        s     4          float
        Double        l     8          double
        Extended      t     10/12/16   long double

Array Allocation
- Basic Principle: T A[L];
  - Array of data type T and length L
  - Contiguously allocated region of L * sizeof(T) bytes
        char string[12];   starts at x, ends at x+12
        int val[5];        elements at x, x+4, x+8, x+12, x+16; ends at x+20
        double a[3];       elements at x, x+8, x+16; ends at x+24
        char *p[3];        IA32:   elements at x, x+4, x+8; ends at x+12
                           x86-64: elements at x, x+8, x+16; ends at x+24

Array Access
- Basic Principle: T A[L];
  - Array of data type T and length L
  - Identifier A can be used as a pointer to array element 0: Type T*
        int val[5];   contents 1, 5, 2, 1, 3 at x, x+4, x+8, x+12, x+16; ends at x+20

        Reference    Type     Value
        val[4]       int      3
        val          int *    x
        val+1        int *    x+4
        &val[2]      int *    x+8
        val[5]       int      ??
        *(val+1)     int      5
        val + i      int *    x+4i
Array Example
    typedef int zip_dig[5];

    zip_dig cmu = { 1, 5, 2, 1, 3 };
    zip_dig mit = { 0, 2, 1, 3, 9 };
    zip_dig ucb = { 9, 4, 7, 2, 0 };

    Array   Addresses   Contents
    cmu     16 .. 36    1 5 2 1 3
    mit     36 .. 56    0 2 1 3 9
    ucb     56 .. 76    9 4 7 2 0
- Declaration “zip_dig cmu” equivalent to “int cmu[5]”
- Example arrays were allocated in successive 20 byte blocks
  - Not guaranteed to happen in general

Array Accessing Example
    int get_digit(zip_dig z, int dig)
    {
        return z[dig];
    }
- IA32
        # %edx = z
        # %eax = dig
        movl (%edx,%eax,4),%eax   # z[dig]
  - Register %edx contains starting address of array
  - Register %eax contains array index
  - Desired digit at 4*%eax + %edx
  - Use memory reference (%edx,%eax,4)

Referencing Examples
    Reference   Address            Value   Guaranteed?
    mit[3]      36 + 4* 3 = 48     3       Yes
    mit[5]      36 + 4* 5 = 56     9       No
    mit[-1]     36 + 4*-1 = 32     3       No
    cmu[15]     16 + 4*15 = 76     ??      No
- No bound checking
- Out of range behavior implementation-dependent
- No guaranteed relative allocation of different arrays

Array Loop Example
- Original
    int zd2int(zip_dig z)
    {
        int i;
        int zi = 0;
        for (i = 0; i < 5; i++) {
            zi = 10 * zi + z[i];
        }
        return zi;
    }
- Transformed
  - As generated by GCC
  - Eliminate loop variable i
  - Convert array code to pointer code
  - Express in do-while form (no test at entrance)
    int zd2int(zip_dig z)
    {
        int zi = 0;
        int *zend = z + 4;
        do {
            zi = 10 * zi + *z;
            z++;
        } while (z <= zend);
        return zi;
    }

Array Loop Implementation (IA32)
- Registers
        %ecx   z
        %eax   zi
        %ebx   zend
- Computations
  - 10*zi + *z implemented as *z + 2*(zi+4*zi)
  - z++ increments by 4

        # %ecx = z
        xorl %eax,%eax              # zi = 0
        leal 16(%ecx),%ebx          # zend = z+4
    .L59:
        leal (%eax,%eax,4),%edx     # 5*zi
        movl (%ecx),%eax            # *z
        addl $4,%ecx                # z++
        leal (%eax,%edx,2),%eax     # zi = *z + 2*(5*zi)
        cmpl %ebx,%ecx              # z : zend
        jle .L59                    # if <= goto loop

Nested Array Example
    #define PCOUNT 4
    zip_dig pgh[PCOUNT] =
      {{1, 5, 2, 0, 6},
       {1, 5, 2, 1, 3 },
       {1, 5, 2, 1, 7 },
       {1, 5, 2, 2, 1 }};

    Row      Addresses    Contents
    pgh[0]    76 ..  96   1 5 2 0 6
    pgh[1]    96 .. 116   1 5 2 1 3
    pgh[2]   116 .. 136   1 5 2 1 7
    pgh[3]   136 .. 156   1 5 2 2 1
- “zip_dig pgh[4]” equivalent to “int pgh[4][5]”
  - Variable pgh: array of 4 elements, allocated contiguously
  - Each element is an array of 5 int’s, allocated contiguously
- “Row-Major” ordering of all elements guaranteed

Multidimensional (Nested) Arrays
- Declaration: T A[R][C];
  - 2D array of data type T
  - R rows, C columns
  - Type T element requires K bytes
- Array Size
  - R * C * K bytes
- Arrangement
  - Row-Major Ordering
        int A[R][C];
        A[0][0] ... A[0][C-1], A[1][0] ... A[1][C-1], ..., A[R-1][0] ... A[R-1][C-1]
        4*R*C bytes in total

Nested Array Row Access
- Row Vectors
  - A[i] is array of C elements
  - Each element of type T requires K bytes
  - Starting address A + i * (C * K)
        int A[R][C];
        A[0] starts at A, A[i] starts at A+i*C*4, A[R-1] starts at A+(R-1)*C*4

Nested Array Row Access Code
    int *get_pgh_zip(int index)
    {
        return pgh[index];
    }
- What data type is pgh[index]? What is its starting address?
        # %eax = index
        leal (%eax,%eax,4),%eax     # 5 * index
        leal pgh(,%eax,4),%eax      # pgh + (20 * index)
- Row Vector
  - pgh[index] is array of 5 int’s
  - Starting address pgh+20*index
- IA32 Code
  - Computes and returns address
  - Compute as pgh + 4*(index+4*index)

Nested Array Element Access
- Array Elements
  - A[i][j] is element of type T, which requires K bytes
  - Address A + i * (C * K) + j * K = A + (i * C + j) * K
        int A[R][C];
        A[i] starts at A+i*C*4; A[i][j] is at A+i*C*4+j*4

Nested Array Element Access Code
    int get_pgh_digit(int index, int dig)
    {
        return pgh[index][dig];
    }

        # %ecx = dig
        # %eax = index
        leal 0(,%ecx,4),%edx        # 4*dig
        leal (%eax,%eax,4),%eax     # 5*index
        movl pgh(%edx,%eax,4),%eax  # *(pgh + 4*dig + 20*index)
- Array Elements
  - pgh[index][dig] is int
  - Address: pgh + 20*index + 4*dig
- IA32 Code
  - Computes address pgh + 4*dig + 4*(index+4*index)
  - movl performs memory reference
Strange Referencing Examples
    zip_dig pgh[4];   rows at 76, 96, 116, 136; ends at 156
    Contents: 1 5 2 0 6 | 1 5 2 1 3 | 1 5 2 1 7 | 1 5 2 2 1

    Reference     Address                 Value   Guaranteed?
    pgh[3][3]     76+20*3+4*3  = 148      2       Yes
    pgh[2][5]     76+20*2+4*5  = 136      1       Yes
    pgh[2][-1]    76+20*2+4*-1 = 112      3       Yes
    pgh[4][-1]    76+20*4+4*-1 = 152      1       Yes
    pgh[0][19]    76+20*0+4*19 = 152      1       Yes
    pgh[0][-1]    76+20*0+4*-1 = 72       ??      No
- Code does not do any bounds checking
- Ordering of elements within array guaranteed

Multi-Level Array Example
    zip_dig cmu = { 1, 5, 2, 1, 3 };
    zip_dig mit = { 0, 2, 1, 3, 9 };
    zip_dig ucb = { 9, 4, 7, 2, 0 };

    #define UCOUNT 3
    int *univ[UCOUNT] = {mit, cmu, ucb};

    univ (at 160):   univ[0] = 36 (at 160), univ[1] = 16 (at 164), univ[2] = 56 (at 168)
    cmu  (16 .. 36): 1 5 2 1 3
    mit  (36 .. 56): 0 2 1 3 9
    ucb  (56 .. 76): 9 4 7 2 0
- Variable univ denotes array of 3 elements
- Each element is a pointer
  - 4 bytes
- Each pointer points to array of int’s
Element Access in Multi-Level Array
    int get_univ_digit(int index, int dig)
    {
        return univ[index][dig];
    }

        # %ecx = index
        # %eax = dig
        leal 0(,%ecx,4),%edx        # 4*index
        movl univ(%edx),%edx        # Mem[univ+4*index]
        movl (%edx,%eax,4),%eax     # Mem[...+4*dig]
- Computation (IA32)
  - Element access Mem[Mem[univ+4*index]+4*dig]
  - Must do two memory reads
    - First get pointer to row array
    - Then access element within array

Array Element Accesses
- Nested array
    int get_pgh_digit(int index, int dig)
    {
        return pgh[index][dig];
    }
- Multi-level array
    int get_univ_digit(int index, int dig)
    {
        return univ[index][dig];
    }
- Access looks similar, but the element is:
  - Nested array:      Mem[pgh+20*index+4*dig]
  - Multi-level array: Mem[Mem[univ+4*index]+4*dig]

Strange Referencing Examples
    univ (at 160) = { 36 (mit), 16 (cmu), 56 (ucb) }
    cmu (16 .. 36): 1 5 2 1 3;  mit (36 .. 56): 0 2 1 3 9;  ucb (56 .. 76): 9 4 7 2 0

    Reference      Address            Value   Guaranteed?
    univ[2][3]     56+4*3  = 68       2       Yes
    univ[1][5]     16+4*5  = 36       0       No
    univ[2][-1]    56+4*-1 = 52       9       No
    univ[3][-1]    ??                 ??      No
    univ[1][12]    16+4*12 = 64       7       No
- Code does not do any bounds checking
- Ordering of elements in different arrays not guaranteed

Using Nested Arrays
- Strengths
  - C compiler handles doubly subscripted arrays
  - Generates very efficient code
    - Avoids multiply in index computation
- Limitation
  - Only works for fixed array size

    #define N 16
    typedef int fix_matrix[N][N];

    /* Compute element i,k of fixed matrix product */
    int fix_prod_ele(fix_matrix a, fix_matrix b, int i, int k)
    {
        int j;
        int result = 0;
        for (j = 0; j < N; j++)
            result += a[i][j]*b[j][k];
        return result;
    }
[Figure: i-th row of a multiplied by k-th column of b]

Dynamic Nested Arrays
- Strength
  - Can create matrix of any size
- Programming
  - Must do index computation explicitly
- Performance
  - Accessing single element costly
  - Must do multiplication

    int *new_var_matrix(int n)
    {
        /* n*n elements of sizeof(int) bytes each, zero-initialized */
        return (int *) calloc(n*n, sizeof(int));
    }

    int var_ele(int *a, int i, int j, int n)
    {
        return a[i*n+j];
    }

        movl 12(%ebp),%eax          # i
        movl 8(%ebp),%edx           # a
        imull 20(%ebp),%eax         # n*i
        addl 16(%ebp),%eax          # n*i+j
        movl (%edx,%eax,4),%eax     # Mem[a+4*(i*n+j)]

Dynamic Array Multiplication
- Without Optimizations
  - Multiplies: 3
    - 2 for subscripts
    - 1 for data
  - Adds: 4
    - 2 for array indexing
    - 1 for loop index
    - 1 for data

    /* Compute element i,k of variable matrix product */
    int var_prod_ele(int *a, int *b, int i, int k, int n)
    {
        int j;
        int result = 0;
        for (j = 0; j < n; j++)
            result += a[i*n+j] * b[j*n+k];
        return result;
    }

Optimizing Dynamic Array Multiplication
- Optimizations
  - Performed when set optimization level to -O2
- Code Motion
  - Expression i*n can be computed outside loop
- Strength Reduction
  - Incrementing j has effect of incrementing j*n+k by n
- Operations count
  - 4 adds, 1 mult
- Compiler can optimize regular access patterns

  Loop body before:
    {
        int j;
        int result = 0;
        for (j = 0; j < n; j++)
            result += a[i*n+j] * b[j*n+k];
        return result;
    }
  Loop body after:
    {
        int j;
        int result = 0;
        int iTn = i*n;
        int jTnPk = k;
        for (j = 0; j < n; j++) {
            result += a[iTn+j] * b[jTnPk];
            jTnPk += n;
        }
        return result;
    }

Today
- Structures
- Alignment
- Unions
- Floating point

Structures
    struct rec {
        int i;
        int a[3];
        int *p;
    };
- Memory Layout
        i at offset 0, a at offset 4, p at offset 16; structure ends at offset 20
- Concept
  - Contiguously-allocated region of memory
  - Refer to members within structure by names
  - Members may be of different types
- Accessing Structure Member
    void set_i(struct rec *r, int val)
    {
        r->i = val;
    }
  IA32 Assembly
        # %eax = val
        # %edx = r
        movl %eax,(%edx)    # Mem[r] = val

Generating Pointer to Structure Member
    struct rec {
        int i;
        int a[3];
        int *p;
    };

    int *find_a(struct rec *r, int idx)
    {
        return &r->a[idx];
    }
- Generating Pointer to Array Element
  - Offset of each structure member determined at compile time
  - r points to offset 0; &r->a[idx] is r+4+4*idx

        # %ecx = idx
        # %edx = r
        leal 0(,%ecx,4),%eax        # 4*idx
        leal 4(%eax,%edx),%eax      # r+4*idx+4

Structure Referencing (Cont.)
- C Code
    struct rec {
        int i;
        int a[3];
        int *p;
    };

    void set_p(struct rec *r)
    {
        r->p = &r->a[r->i];
    }
- What does it do? After the call, p (offset 16) points to element i of a (offsets 4 to 16).

        # %edx = r
        movl (%edx),%ecx            # r->i
        leal 0(,%ecx,4),%eax        # 4*(r->i)
        leal 4(%edx,%eax),%eax      # r+4+4*(r->i)
        movl %eax,16(%edx)          # Update r->p

Today
- Structures
- Alignment
- Unions
- Floating point

Alignment
- Aligned Data
  - Primitive data type requires K bytes
  - Address must be multiple of K
  - Required on some machines; advised on IA32
    - Treated differently by IA32 Linux, x86-64 Linux, and Windows!
- Motivation for Aligning Data
  - Memory accessed by (aligned) chunks of 4 or 8 bytes (system dependent)
    - Inefficient to load or store datum that spans quad word boundaries
    - Virtual memory very tricky when datum spans 2 pages
- Compiler
  - Inserts gaps in structure to ensure correct alignment of fields

Specific Cases of Alignment (IA32)
- 1 byte: char, …
  - No restrictions on address
- 2 bytes: short, …
  - Lowest 1 bit of address must be 0 (binary)
- 4 bytes: int, float, char *, …
  - Lowest 2 bits of address must be 00 (binary)
- 8 bytes: double, …
  - Windows (and most other OS’s & instruction sets): lowest 3 bits of address must be 000 (binary)
  - Linux: lowest 2 bits of address must be 00 (binary)
    - i.e., treated the same as a 4-byte primitive data type
- 12 bytes: long double
  - Windows, Linux: lowest 2 bits of address must be 00 (binary)
    - i.e., treated the same as a 4-byte primitive data type

Satisfying Alignment with Structures
    struct S1 {
        char c;
        int i[2];
        double v;
    } *p;
- Within structure:
  - Must satisfy element’s alignment requirement
- Overall structure placement
  - Each structure has alignment requirement K
    - K = Largest alignment of any element
  - Initial address & structure length must be multiples of K
- Example (under Windows or x86-64):
  - K = 8, due to double element
        p+0    c, then 3 bytes of padding   (p+0 is a multiple of 8)
        p+4    i[0]                         (multiple of 4)
        p+8    i[1], then 4 bytes of padding
        p+16   v                            (multiple of 8)
        p+24   end of structure             (multiple of 8)

Different Alignment Conventions
    struct S1 {
        char c;
        int i[2];
        double v;
    } *p;
- x86-64 or IA32 Windows:
  - K = 8, due to double element
        p+0 c, 3 bytes padding; p+4 i[0]; p+8 i[1], 4 bytes padding; p+16 v; p+24 end
- IA32 Linux
  - K = 4; double treated like a 4-byte data type
        p+0 c, 3 bytes padding; p+4 i[0]; p+8 i[1]; p+12 v; p+20 end

Saving Space
- Put large data types first
    struct S1 {
        char c;
        int i[2];
        double v;
    } *p;

    struct S2 {
        double v;
        int i[2];
        char c;
    } *p;
- Effect (example x86-64, both have K=8)
        S1: p+0 c, 3 bytes padding; p+4 i[0]; p+8 i[1], 4 bytes padding; p+16 v; p+24 end
        S2: p+0 v; p+8 i[0]; p+12 i[1]; p+16 c, then padding; p+24 end

Arrays of Structures
- Satisfy alignment requirement for every element
    struct S2 {
        double v;
        int i[2];
        char c;
    } a[10];

        a[0] at a+0, a[1] at a+24, a[2] at a+48, • • •
        Within a[1]: a+24 v; a+32 i[0]; a+36 i[1]; a+40 c, 7 bytes padding; a+48 end

Accessing Array Elements
    struct S3 {
        short i;
        float v;
        short j;
    } a[10];
- Compute array offset 12i
- Compute offset 8 within structure
- Assembler gives offset a+8
  - Resolved during linking
        a[i] starts at a+12i: i (2 bytes), 2 bytes padding, v at a+12i+4, j at a+12i+8, 2 bytes padding

    short get_j(int idx)
    {
        return a[idx].j;
    }

        # %eax = idx
        leal (%eax,%eax,2),%eax     # 3*idx
        movswl a+8(,%eax,4),%eax

Today
- Structures
- Alignment
- Unions
- Floating point

Union Allocation
- Allocate according to largest element
- Can only use one field at a time
    union U1 {
        char c;
        int i[2];
        double v;
    } *up;
        All members start at up+0: c occupies 1 byte, i[0] is up+0 .. up+4, i[1] is up+4 .. up+8, v is up+0 .. up+8

    struct S1 {
        char c;
        int i[2];
        double v;
    } *sp;
        sp+0 c, 3 bytes padding; sp+4 i[0]; sp+8 i[1], 4 bytes padding; sp+16 v; sp+24 end

Using Union to Access Bit Patterns
    typedef union {
        float f;
        unsigned u;
    } bit_float_t;
        u and f both occupy bytes 0 .. 4

    float bit2float(unsigned u)
    {
        bit_float_t arg;
        arg.u = u;
        return arg.f;
    }
- Same as (float) u ?

    unsigned float2bit(float f)
    {
        bit_float_t arg;
        arg.f = f;
        return arg.u;
    }
- Same as (unsigned) f ?

Byte Ordering Revisited
- Idea
  - Short/long/quad words stored in memory as 2/4/8 consecutive bytes
  - Which is most (least) significant?
  - Can cause problems when exchanging binary data between machines
- Big Endian
  - Most significant byte has lowest address
  - PowerPC, Sparc
- Little Endian
  - Least significant byte has lowest address
  - Intel x86

Byte Ordering Example
    union {
        unsigned char c[8];
        unsigned short s[4];
        unsigned int i[2];
        unsigned long l[1];
    } dw;
        c[0] .. c[7] are single bytes; s[0] .. s[3] each overlay two of them; i[0] and i[1] each overlay four; l[0] starts at c[0]

Byte Ordering Example (Cont.)
    int j;
    for (j = 0; j < 8; j++)
        dw.c[j] = 0xf0 + j;

    printf("Characters 0-7 == [0x%x,0x%x,0x%x,0x%x,0x%x,0x%x,0x%x,0x%x]\n",
           dw.c[0], dw.c[1], dw.c[2], dw.c[3],
           dw.c[4], dw.c[5], dw.c[6], dw.c[7]);

    printf("Shorts 0-3 == [0x%x,0x%x,0x%x,0x%x]\n",
           dw.s[0], dw.s[1], dw.s[2], dw.s[3]);

    printf("Ints 0-1 == [0x%x,0x%x]\n",
           dw.i[0], dw.i[1]);

    printf("Long 0 == [0x%lx]\n",
           dw.l[0]);

Byte Ordering on IA32
- Little Endian
        Bytes f0 f1 f2 f3 f4 f5 f6 f7 at c[0] .. c[7]
        In each of s[0] .. s[3], i[0], i[1] and l[0], the lowest-addressed byte is the LSB and the highest-addressed byte is the MSB
- Print Output on IA32:
        Characters 0-7 == [0xf0,0xf1,0xf2,0xf3,0xf4,0xf5,0xf6,0xf7]
        Shorts     0-3 == [0xf1f0,0xf3f2,0xf5f4,0xf7f6]
        Ints       0-1 == [0xf3f2f1f0,0xf7f6f5f4]
        Long       0   == [0xf3f2f1f0]

Byte Ordering on Sun
- Big Endian
        Bytes f0 f1 f2 f3 f4 f5 f6 f7 at c[0] .. c[7]
        In each of s[0] .. s[3], i[0], i[1] and l[0], the lowest-addressed byte is the MSB and the highest-addressed byte is the LSB
- Print Output on Sun:
        Characters 0-7 == [0xf0,0xf1,0xf2,0xf3,0xf4,0xf5,0xf6,0xf7]
        Shorts     0-3 == [0xf0f1,0xf2f3,0xf4f5,0xf6f7]
        Ints       0-1 == [0xf0f1f2f3,0xf4f5f6f7]
        Long       0   == [0xf0f1f2f3]

Byte Ordering on x86-64
- Little Endian
        Bytes f0 f1 f2 f3 f4 f5 f6 f7 at c[0] .. c[7]
        Same as IA32, except that l[0] covers all 8 bytes
- Print Output on x86-64:
        Characters 0-7 == [0xf0,0xf1,0xf2,0xf3,0xf4,0xf5,0xf6,0xf7]
        Shorts     0-3 == [0xf1f0,0xf3f2,0xf5f4,0xf7f6]
        Ints       0-1 == [0xf3f2f1f0,0xf7f6f5f4]
        Long       0   == [0xf7f6f5f4f3f2f1f0]

Summary
- Arrays in C
  - Contiguous allocation of memory
  - Aligned to satisfy every element’s alignment requirement
  - Pointer to first element
  - No bounds checking
- Structures
  - Allocate bytes in order declared
  - Pad in middle and at end to satisfy alignment
- Unions
  - Overlay declarations
  - Way to circumvent type system

Today
- Structures
- Alignment
- Unions
- Floating point
  - x87 (available with IA32, becoming obsolete)
  - SSE3 (available with x86-64)

IA32 Floating Point (x87)
- History
  - 8086: first computer to implement IEEE FP
    - Separate 8087 FPU (floating point unit)
  - 486: merged FPU and Integer Unit onto one chip
  - Becoming obsolete with x86-64
- Summary
  - Hardware to add, multiply, and divide
  - Floating point data registers
  - Various control & status registers
- Floating Point Formats
  - single precision (C float): 32 bits
  - double precision (C double): 64 bits
  - extended precision (C long double): 80 bits
[Figure: instruction decoder and sequencer feeding the Integer Unit and the FPU, both connected to Memory]

FPU Data Register Stack (x87)
- FPU register format (80 bit extended precision)
        bit 79: s    bits 78 .. 64: exp    bits 63 .. 0: frac
- FPU registers
  - 8 registers %st(0) - %st(7)
  - Logically form stack
  - Top: %st(0)
  - Bottom disappears (drops out) after too many pushes

FPU instructions (x87)
- Large number of floating point instructions and formats
  - ~50 basic instruction types
  - load, store, add, multiply
  - sin, cos, tan, arctan, and log
    - Often slower than math lib
- Sample instructions:
        Instruction    Effect                               Description
        fldz           push 0.0                             Load zero
        flds Addr      push Mem[Addr]                       Load single precision real
        fmuls Addr     %st(0) <- %st(0)*M[Addr]             Multiply
        faddp          %st(1) <- %st(0)+%st(1); pop         Add and pop

FP Code Example (x87)
- Compute inner product of two vectors
  - Single precision arithmetic
  - Common computation

    float ipf(float x[], float y[], int n)
    {
        int i;
        float result = 0.0;
        for (i = 0; i < n; i++)
            result += x[i]*y[i];
        return result;
    }

        pushl %ebp                  # setup
        movl %esp,%ebp
        pushl %ebx
        movl 8(%ebp),%ebx           # %ebx=&x
        movl 12(%ebp),%ecx          # %ecx=&y
        movl 16(%ebp),%edx          # %edx=n
        fldz                        # push +0.0
        xorl %eax,%eax              # i=0
        cmpl %edx,%eax
        jge .L3                     # if i>=n done
    .L5:
        flds (%ebx,%eax,4)          # push x[i]
        fmuls (%ecx,%eax,4)         # st(0)*=y[i]
        faddp                       # st(1)+=st(0); pop
        incl %eax                   # i++
        cmpl %edx,%eax
        jl .L5                      # if i<n repeat
    .L3:
        movl -4(%ebp),%ebx          # finish
        movl %ebp, %esp
        popl %ebp
        ret                         # st(0) = result

Inner Product Stack Trace
    %eax = i, %ebx = x, %ecx = y

    Step                        %st(1)        %st(0)
    Initialization
    1. fldz                                   0.0
    Iteration 0
    2. flds (%ebx,%eax,4)       0.0           x[0]
    3. fmuls (%ecx,%eax,4)      0.0           x[0]*y[0]
    4. faddp                                  0.0+x[0]*y[0]
    Iteration 1
    5. flds (%ebx,%eax,4)       x[0]*y[0]     x[1]
    6. fmuls (%ecx,%eax,4)      x[0]*y[0]     x[1]*y[1]
    7. faddp                                  x[0]*y[0]+x[1]*y[1]

Machine Programming – x86-64 extensions
CENG331: Introduction to Computer Systems
Instructor: Erol Sahin

x86-64 Integer Registers
    64-bit   32-bit      64-bit   32-bit
    %rax     %eax        %r8      %r8d
    %rbx     %ebx        %r9      %r9d
    %rcx     %ecx        %r10     %r10d
    %rdx     %edx        %r11     %r11d
    %rsi     %esi        %r12     %r12d
    %rdi     %edi        %r13     %r13d
    %rsp     %esp        %r14     %r14d
    %rbp     %ebp        %r15     %r15d
- Twice the number of registers
- Accessible as 8, 16, 32, 64 bits

x86-64 Integer Registers (usage)
    %rax   Return value        %r8    Argument #5
    %rbx   Callee saved        %r9    Argument #6
    %rcx   Argument #4         %r10   Caller saved
    %rdx   Argument #3         %r11   Used for linking
    %rsi   Argument #2         %r12   Callee saved
    %rdi   Argument #1         %r13   Callee saved
    %rsp   Stack pointer       %r14   Callee saved
    %rbp   Callee saved        %r15   Callee saved

x86-64 Registers
- Arguments passed to functions via registers
  - If more than 6 integral parameters, then pass rest on stack
  - These registers can be used as caller-saved as well
- All references to stack frame via stack pointer
  - Eliminates need to update %ebp/%rbp
- Other Registers
  - 6 callee saved (%rbx, %rbp, %r12 - %r15)
  - 2 or 3 have special uses

x86-64 Long Swap
    void swap(long *xp, long *yp)
    {
        long t0 = *xp;
        long t1 = *yp;
        *xp = t1;
        *yp = t0;
    }

    swap:
        movq (%rdi), %rdx
        movq (%rsi), %rax
        movq %rax, (%rdi)
        movq %rdx, (%rsi)
        ret
- Operands passed in registers
  - First (xp) in %rdi, second (yp) in %rsi
  - 64-bit pointers
- No stack operations required (except ret)
- Avoiding stack
  - Can hold all local information in registers

x86-64 Locals in the Red Zone
    /* Swap, using local array */
    void swap_a(long *xp, long *yp)
    {
        volatile long loc[2];
        loc[0] = *xp;
        loc[1] = *yp;
        *xp = loc[1];
        *yp = loc[0];
    }

    swap_a:
        movq (%rdi), %rax
        movq %rax, -24(%rsp)
        movq (%rsi), %rax
        movq %rax, -16(%rsp)
        movq -16(%rsp), %rax
        movq %rax, (%rdi)
        movq -24(%rsp), %rax
        movq %rax, (%rsi)
        ret
- Avoiding Stack Pointer Change
  - Can hold all information within small window beyond stack pointer
        %rsp ->  rtn Ptr
        -8       unused
        -16      loc[1]
        -24      loc[0]

x86-64 NonLeaf without Stack Frame
    long scount = 0;

    /* Swap a[i] & a[i+1] */
    void swap_ele_se(long a[], int i)
    {
        swap(&a[i], &a[i+1]);
        scount++;
    }
- No values held while swap being invoked
- No callee save registers needed

    swap_ele_se:
        movslq %esi,%rsi            # Sign extend i
        leaq (%rdi,%rsi,8), %rdi    # &a[i]
        leaq 8(%rdi), %rsi          # &a[i+1]
        call swap                   # swap()
        incq scount(%rip)           # scount++;
        ret

x86-64 Call using Jump
    long scount = 0;

    /* Swap a[i] & a[i+1] */
    void swap_ele(long a[], int i)
    {
        swap(&a[i], &a[i+1]);
    }
- When swap executes ret, it will return from swap_ele
- Possible since swap is a “tail call” (no instructions afterwards)

    swap_ele:
        movslq %esi,%rsi            # Sign extend i
        leaq (%rdi,%rsi,8), %rdi    # &a[i]
        leaq 8(%rdi), %rsi          # &a[i+1]
        jmp swap                    # swap()

x86-64 Stack Frame Example
    long sum = 0;

    /* Swap a[i] & a[i+1] */
    void swap_ele_su(long a[], int i)
    {
        swap(&a[i], &a[i+1]);
        sum += a[i];
    }
- Keeps values of a and i in callee save registers
- Must set up stack frame to save these registers

Understanding x86-64 Stack Frame
    swap_ele_su:
        movq %rbx, -16(%rsp)        # Save %rbx
        movslq %esi,%rbx            # Extend & save i
        movq %r12, -8(%rsp)         # Save %r12
        movq %rdi, %r12             # Save a
        leaq (%rdi,%rbx,8), %rdi    # &a[i]
        subq $16, %rsp              # Allocate stack frame
        leaq 8(%rdi), %rsi          # &a[i+1]
        call swap                   # swap()
        movq (%r12,%rbx,8), %rax    # a[i]
        addq %rax, sum(%rip)        # sum += a[i]
        movq (%rsp), %rbx           # Restore %rbx
        movq 8(%rsp), %r12          # Restore %r12
        addq $16, %rsp              # Deallocate stack frame
        ret
Understanding x86-64 Stack Frame

Stack layout in swap_ele_su before allocation (registers saved in the red zone):

    %rsp ->  rtn addr
      −8     %r12
      −16    %rbx

Stack layout after allocation (subq $16, %rsp):

             rtn addr
      +8     %r12
    %rsp ->  %rbx

Interesting Features of Stack Frame
- Allocate entire frame at once
  - All stack accesses can be relative to %rsp
  - Do by decrementing stack pointer
  - Can delay allocation, since safe to temporarily use red zone
- Simple deallocation
  - Increment stack pointer
  - No base/frame pointer needed

Interesting Features of Stack Frame
- Many compiled functions do not require a stack frame other than saving their return address.
- A function does not require a stack frame if:
  - All local variables can be held in registers
  - The function does not call other functions (referred to as leaf procedures)
- A function requires a stack frame if it:
  - Has too many local variables to hold in registers
  - Has some local variables that are arrays or structures
  - Uses the & operator to compute the address of a local variable
  - Must pass some arguments on the stack to another function
  - Needs to save the state of a callee-save register

General Conditional Expression Translation

C Code

    val = Test ? Then-Expr : Else-Expr;

    val = x>y ? x-y : y-x;

Goto Version

        nt = !Test;
        if (nt) goto Else;
        val = Then-Expr;
    Done:
        . . .
    Else:
        val = Else-Expr;
        goto Done;

- Test is expression returning integer
  - = 0 interpreted as false
  - ≠ 0 interpreted as true
- Create separate code regions for then & else expressions
- Execute appropriate one

Conditionals: x86-64

    int absdiff(int x, int y)
    {
        int result;
        if (x > y) {
            result = x-y;
        } else {
            result = y-x;
        }
        return result;
    }

    absdiff:                  # x in %edi, y in %esi
        movl   %edi, %eax     # eax = x
        movl   %esi, %edx     # edx = y
        subl   %esi, %eax     # eax = x-y
        subl   %edi, %edx     # edx = y-x
        cmpl   %esi, %edi     # x:y
        cmovle %edx, %eax     # eax=edx if <=
        ret

- Conditional move instruction
  - cmovC src, dest
  - Move value from src to dest if condition C holds
  - More efficient than conditional branching (simple control flow)
  - But overhead: both branches are evaluated

General Form with Conditional Move

C Code

    val = Test ? Then-Expr : Else-Expr;

Conditional Move Version

    val1 = Then-Expr;
    val2 = Else-Expr;
    val1 = val2 if !Test;

- Both values get computed
- Overwrite then-value with else-value if condition doesn’t hold
- Don’t use when:
  - Then or else expression has side effects
  - Then and else expressions are too expensive

Specific Cases of Alignment (x86-64)
- 1 byte: char, …
  - no restrictions on address
- 2 bytes: short, …
  - lowest 1 bit of address must be 0 (binary)
- 4 bytes: int, float, …
  - lowest 2 bits of address must be 00 (binary)
- 8 bytes: double, char *, …
  - Windows & Linux: lowest 3 bits of address must be 000 (binary)
- 16 bytes: long double
  - Linux: lowest 4 bits of address must be 0000 (binary), i.e., 16-byte alignment

Vector Instructions: SSE Family
- SIMD (single-instruction, multiple data) vector instructions
  - New data types, registers, operations
  - Parallel operation on small (length 2-8) vectors of integers or floats
  - Example: “4-way” + and × [figure not reproduced]
- Floating point vector instructions
  - Available with Intel’s SSE (streaming SIMD extensions) family
  - SSE starting with Pentium III: 4-way single precision
  - SSE2 starting with Pentium 4: 2-way double precision
  - All x86-64 have SSE3
    (superset of SSE2, SSE)

SSE3 Registers
- All caller saved
- %xmm0 for floating point return value
- 128 bit = 2 doubles = 4 singles

    %xmm0  Argument #1      %xmm8
    %xmm1  Argument #2      %xmm9
    %xmm2  Argument #3      %xmm10
    %xmm3  Argument #4      %xmm11
    %xmm4  Argument #5      %xmm12
    %xmm5  Argument #6      %xmm13
    %xmm6  Argument #7      %xmm14
    %xmm7  Argument #8      %xmm15

SSE3 Registers
- Different data types and associated instructions (each register is 128 bits)
- Integer vectors:
  - 16-way byte
  - 8-way 2 bytes
  - 4-way 4 bytes
- Floating point vectors:
  - 4-way single
  - 2-way double
- Floating point scalars:
  - single
  - double

SSE3 Instructions: Examples
- Single precision 4-way vector add: addps %xmm0, %xmm1
  - %xmm1 = %xmm0 + %xmm1, element by element on all four singles
- Single precision scalar add: addss %xmm0, %xmm1
  - %xmm1 = %xmm0 + %xmm1 on the lowest single only

Extending to x86-64
- Pointers and long ints are 64 bits long. Integer arithmetic operations support 8, 16, 32 and 64-bit data types
- The set of general purpose registers expanded from 8 to 16
- Much of the program state is held in registers rather than on stack.
  - Integer and pointer arguments (up to 6) to procedures are passed via registers.
  - Some procedures do not need to access the stack at all.
- Conditional operations are implemented using conditional move instructions, when possible, yielding better performance than traditional branching
- Floating point operations are implemented using register-oriented SSE2, rather than stack-based x87

Procedures (x86-64): Optimizations
- No base/frame pointer
- Passing arguments to functions through registers (if possible)
- Sometimes: Writing into the “red zone” (below stack pointer)

    %rsp ->  rtn Ptr
      −8     unused
      −16    loc[1]
      −24    loc[0]

- Sometimes: Function call using jmp (instead of call)
- Reason: Performance
  - use stack as little as possible
  - while obeying rules (e.g., caller/callee save registers)
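The conditional-move translation summarized in these slides can also be written out at the C level. The sketch below is only an illustration: absdiff is the function from the slides, while absdiff_cmov_style is a hypothetical rewrite (not from the lecture) that follows the "compute both values, then select" form. It also makes the stated restriction visible: both x-y and y-x are always evaluated, which is harmless here but would not be if either expression had side effects.

```c
#include <assert.h>

/* Branching form, as in the slides */
int absdiff(int x, int y) {
    int result;
    if (x > y) {
        result = x - y;
    } else {
        result = y - x;
    }
    return result;
}

/* Hypothetical rewrite mirroring the conditional-move version:
   both values are computed before the test decides which one is kept */
int absdiff_cmov_style(int x, int y) {
    int val1 = x - y;        /* then-value, always computed */
    int val2 = y - x;        /* else-value, always computed */
    int test = x > y;
    if (!test) {
        val1 = val2;         /* the selection a cmovle would perform */
    }
    return val1;
}
```

Whether a compiler actually emits cmovle for either function depends on the compiler and optimization level; the two C functions return the same value for inputs where the subtraction does not overflow.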
