Machine Programming – Procedures
and IA32 Stack
CENG331: Introduction to Computer Systems
6th Lecture
Instructor:
Erol Sahin
Acknowledgement: Most of the slides are adapted from the ones prepared
by R.E. Bryant, D.R. O’Hallaron of Carnegie-Mellon Univ.
IA32 Stack
Stack “Bottom”



Region of memory managed
with stack discipline
Grows toward lower addresses
Increasing
Addresses
Register %esp contains
lowest stack address
= address of “top” element
Stack Grows
Down
Stack Pointer: %esp
Stack “Top”
IA32 Stack: Push
Stack “Bottom”

pushl Src
 Fetch operand at Src
 Decrement %esp by 4
 Write operand at address given
Increasing
Addresses
by %esp
Stack Grows
Down
Stack Pointer: %esp
-4
Stack “Top”
IA32 Stack: Pop
Stack “Bottom”

popl Dest
 Read operand at address %esp
 Increment %esp by 4
 Write operand to Dest
Stack Pointer: %esp
Increasing
Addresses
Stack Grows
Down
+4
Stack “Top”
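The push and pop steps above can be sketched in C. This is a minimal model under stated assumptions, not how the hardware is implemented: the type toy_stack, the functions toy_pushl and toy_popl, and the address range 0x100 to 0x11f are all hypothetical names and values chosen for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical toy machine: a small stack region of 4-byte words. */
enum { STACK_BASE = 0x100, STACK_WORDS = 8 };  /* covers 0x100..0x11f */

typedef struct {
    uint32_t mem[STACK_WORDS]; /* mem[k] models the word at STACK_BASE + 4*k */
    uint32_t esp;              /* models %esp: address of the "top" element */
} toy_stack;

/* pushl Src: decrement %esp by 4, then write the operand at the new top. */
static void toy_pushl(toy_stack *s, uint32_t src)
{
    assert(s->esp >= STACK_BASE + 4);            /* room for one more word */
    s->esp -= 4;
    s->mem[(s->esp - STACK_BASE) / 4] = src;
}

/* popl Dest: read the operand at %esp, then increment %esp by 4. */
static uint32_t toy_popl(toy_stack *s)
{
    assert(s->esp < STACK_BASE + 4 * STACK_WORDS); /* stack is not empty */
    uint32_t value = s->mem[(s->esp - STACK_BASE) / 4];
    s->esp += 4;
    return value;
}
```

Starting with esp = 0x108, one toy_pushl leaves esp at 0x104, and the matching toy_popl returns the pushed value and restores esp to 0x108. The stack grows toward lower addresses, as in the slides.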
Procedure Control Flow


Use stack to support procedure call and return
Procedure call: call label
 Push return address on stack
 Jump to label

Return address:
 Address of instruction beyond call
 Example from disassembly
804854e: e8 3d 06 00 00    call 8048b90 <main>
8048553: 50                pushl %eax
 Return address = 0x8048553

Procedure return: ret
 Pop address from stack
 Jump to address
Procedure Call Example
804854e: e8 3d 06 00 00    call 8048b90 <main>
8048553: 50                pushl %eax

Before call: %esp = 0x108 (word at 0x108 holds 123), %eip = 0x804854e
After call:  %esp = 0x104 (word at 0x104 holds return address 0x8048553), %eip = 0x8048b90
%eip: program counter
Procedure Return Example
8048591: c3    ret

Before ret: %esp = 0x104 (word at 0x104 holds 0x8048553), %eip = 0x8048591
After ret:  %esp = 0x108 (word at 0x108 holds 123), %eip = 0x8048553
%eip: program counter
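The call and return examples can be replayed with a small C model. This is only a sketch: toy_cpu, toy_word, toy_call, and toy_ret are hypothetical names, and the model covers just the four stack words at 0x104 through 0x110 that the slides show.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical toy machine state; all names are illustrative only. */
typedef struct {
    uint32_t stack[4]; /* stack[k] models the word at address 0x104 + 4*k */
    uint32_t esp;      /* models %esp */
    uint32_t eip;      /* models %eip, the program counter */
} toy_cpu;

/* Map a modelled address to its storage slot. */
static uint32_t *toy_word(toy_cpu *c, uint32_t addr)
{
    assert(addr >= 0x104 && addr < 0x104 + 4 * 4 && addr % 4 == 0);
    return &c->stack[(addr - 0x104) / 4];
}

/* call: push the address of the instruction beyond the call, then jump.
   insn_len is the encoded length of the call (5 bytes for e8 + rel32). */
static void toy_call(toy_cpu *c, uint32_t target, uint32_t insn_len)
{
    c->esp -= 4;
    *toy_word(c, c->esp) = c->eip + insn_len;
    c->eip = target;
}

/* ret: pop the return address from the stack and jump to it. */
static void toy_ret(toy_cpu *c)
{
    c->eip = *toy_word(c, c->esp);
    c->esp += 4;
}
```

With eip = 0x804854e and esp = 0x108, toy_call(&c, 0x8048b90, 5) stores 0x8048553 at 0x104, matching the disassembly. A later toy_ret restores eip to 0x8048553 and esp to 0x108.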
Stack-Based Languages

Languages that support recursion
 e.g., C, Pascal, Java
 Code must be “Reentrant”
Multiple simultaneous instantiations of single procedure
 Need some place to store state of each instantiation
 Arguments
 Local variables
 Return pointer


Stack discipline
 State for given procedure needed for limited time
From when called to when return
 Callee returns before caller does


Stack allocated in Frames
 state for single procedure instantiation
Call Chain Example
yoo(…)
{
 •
 •
 who();
 •
 •
}
who(…)
{
 • • •
 amI();
 • • •
 amI();
 • • •
}
amI(…)
{
 •
 •
 amI();
 •
 •
}
Example Call Chain: yoo → who → amI → amI → amI, followed by a second call who → amI
Procedure amI is recursive
Stack Frames

Contents
 Local variables
 Return information
 Temporary space

Management
 Space allocated when enter procedure
 “Set-up” code
 Deallocated when return
 “Finish” code

Stack layout (higher addresses first): previous frame, then the frame for proc, which lies between Frame Pointer %ebp and Stack Pointer %esp (stack “Top”)
Stack Example
Frames on the stack after each step (oldest first); %ebp and %esp delimit the last frame listed.
 1. yoo executing:            yoo
 2. yoo calls who:            yoo, who
 3. who calls amI:            yoo, who, amI
 4. amI calls amI:            yoo, who, amI, amI
 5. amI calls amI:            yoo, who, amI, amI, amI
 6. innermost amI returns:    yoo, who, amI, amI
 7. amI returns:              yoo, who, amI
 8. amI returns to who:       yoo, who
 9. who calls amI again:      yoo, who, amI
10. amI returns to who:       yoo, who
11. who returns to yoo:       yoo
IA32/Linux Stack Frame

Current Stack Frame (“Top” to Bottom)
 “Argument build:”
 Parameters for function about to call
 Local variables
 If can’t keep in registers
 Saved register context
 Old frame pointer

Caller Stack Frame
 Return address
 Pushed by call instruction
 Arguments for this call

Stack layout (higher addresses first):
 Caller frame, ending with Arguments and Return Addr
 Old %ebp ← frame pointer %ebp
 Saved Registers + Local Variables
 Argument Build ← stack pointer %esp
Revisiting swap
Calling swap from call_swap
int zip1 = 15213;
int zip2 = 91125;
void call_swap()
{
swap(&zip1, &zip2);
}
void swap(int *xp, int *yp)
{
int t0 = *xp;
int t1 = *yp;
*xp = t1;
*yp = t0;
}
call_swap:
 • • •
 pushl $zip2 # Global Var
 pushl $zip1 # Global Var
 call swap
 • • •

Resulting Stack (higher addresses first):
 •••
 &zip2
 &zip1
 Rtn adr ← %esp
Revisiting swap
void swap(int *xp, int *yp)
{
int t0 = *xp;
int t1 = *yp;
*xp = t1;
*yp = t0;
}
swap:
 # Set Up
 pushl %ebp
 movl %esp,%ebp
 pushl %ebx
 # Body
 movl 12(%ebp),%ecx
 movl 8(%ebp),%edx
 movl (%ecx),%eax
 movl (%edx),%ebx
 movl %eax,(%edx)
 movl %ebx,(%ecx)
 # Finish
 movl -4(%ebp),%ebx
 movl %ebp,%esp
 popl %ebp
 ret
swap Setup #1
swap:
 pushl %ebp
 movl %esp,%ebp
 pushl %ebx
Entering stack (higher addresses first): •••, &zip2, &zip1, Rtn adr ← %esp; %ebp points into the caller’s frame
After pushl %ebp: •••, yp, xp, Rtn adr, Old %ebp ← %esp

swap Setup #2
After movl %esp,%ebp: •••, yp, xp, Rtn adr, Old %ebp ← %ebp and %esp

swap Setup #3
After pushl %ebx: •••, yp, xp, Rtn adr, Old %ebp ← %ebp, Old %ebx ← %esp
Offset relative to %ebp: yp at 12, xp at 8, Rtn adr at 4, Old %ebp at 0
 movl 12(%ebp),%ecx # get yp
 movl 8(%ebp),%edx # get xp
 . . .
swap Finish #1
 movl -4(%ebp),%ebx
swap’s stack (higher addresses first): •••, yp, xp, Rtn adr, Old %ebp ← %ebp, Old %ebx ← %esp
Resulting stack: unchanged; %ebx again holds the saved value Old %ebx
Observation: Saved and restored register %ebx

swap Finish #2
 movl %ebp,%esp
Resulting stack: •••, yp, xp, Rtn adr, Old %ebp ← %ebp and %esp

swap Finish #3
 popl %ebp
Resulting stack: •••, yp, xp, Rtn adr ← %esp; %ebp again points into the caller’s frame

swap Finish #4
 ret
Resulting stack: •••, yp, xp ← %esp; control returns to the caller

Observation
 Saved & restored register %ebx
 Didn’t do so for %eax, %ecx, or %edx
Disassembled swap
080483a4 <swap>:
 80483a4: 55          push %ebp
 80483a5: 89 e5       mov %esp,%ebp
 80483a7: 53          push %ebx
 80483a8: 8b 55 08    mov 0x8(%ebp),%edx
 80483ab: 8b 4d 0c    mov 0xc(%ebp),%ecx
 80483ae: 8b 1a       mov (%edx),%ebx
 80483b0: 8b 01       mov (%ecx),%eax
 80483b2: 89 02       mov %eax,(%edx)
 80483b4: 89 19       mov %ebx,(%ecx)
 80483b6: 5b          pop %ebx
 80483b7: c9          leave
 80483b8: c3          ret

Calling Code
 8048409: e8 96 ff ff ff    call 80483a4 <swap>
 804840e: 8b 45 f8          mov 0xfffffff8(%ebp),%eax
Register Saving Conventions

When procedure yoo calls who:
 yoo is the caller
 who is the callee

Can Register be used for temporary storage?
yoo:
• • •
movl $15213, %edx
call who
addl %edx, %eax
• • •
ret
who:
• • •
movl 8(%ebp), %edx
addl $91125, %edx
• • •
ret
 Contents of register %edx overwritten by who
Conventions
 “Caller Save”
Caller saves temporary in its frame before calling
 “Callee Save”
 Callee saves temporary in its frame before using

IA32/Linux Register Usage

%eax, %edx, %ecx (Caller-Save Temporaries)
 Caller saves prior to call if values are used later

%eax
 also used to return integer value

%ebx, %esi, %edi (Callee-Save Temporaries)
 Callee saves if wants to use them

%esp, %ebp (Special)
 special
IA 32 Procedure Summary

The Stack Makes Recursion Work
 Private storage for each instance of procedure call
 Instantiations don’t clobber each other
 Addressing of locals + arguments can be relative to stack positions
 Managed by stack discipline
 Procedures return in inverse order of calls

IA32 Procedures Combination of Instructions + Conventions
 Call / Ret instructions
 Register usage conventions
 Caller / Callee save
 %ebp and %esp
 Stack frame organization conventions

Stack frame (higher addresses first): caller frame ending with Arguments and Return Addr; Old %ebp ← %ebp; Saved Registers + Local Variables; Argument Build ← %esp
Today

Arrays
 One-dimensional
 Multi-dimensional (nested)
 Multi-level

Structures
Basic Data Types

Integral
 Stored & operated on in general (integer) registers
 Signed vs. unsigned depends on instructions used

Intel        GAS  Bytes  C
byte         b    1      [unsigned] char
word         w    2      [unsigned] short
double word  l    4      [unsigned] int
quad word    q    8      [unsigned] long int (x86-64)

Floating Point
 Stored & operated on in floating point registers

Intel     GAS  Bytes     C
Single    s    4         float
Double    l    8         double
Extended  t    10/12/16  long double
Array Allocation

Basic Principle
T A[L];
 Array of data type T and length L
 Contiguously allocated region of L * sizeof(T) bytes
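char string[12];   occupies x to x + 12
int val[5];        elements at x, x+4, x+8, x + 12, x + 16; ends at x + 20
double a[3];       elements at x, x+8, x + 16; ends at x + 24
char *p[3];        IA32: elements at x, x+4, x+8; ends at x + 12
                   x86-64: elements at x, x+8, x + 16; ends at x + 24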
Array Access

Basic Principle
T A[L];
 Array of data type T and length L
 Identifier A can be used as a pointer to array element 0: Type T*

int val[5];   values 1 5 2 1 3 at addresses x, x+4, x+8, x + 12, x + 16; ends at x + 20

Reference   Type    Value
val[4]      int     3
val         int *   x
val+1       int *   x+4
&val[2]     int *   x+8
val[5]      int     ??
*(val+1)    int     5
val + i     int *   x+4i
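The reference table can be checked with a few lines of C. This is a small sketch; the helper names byte_offset and deref are hypothetical, and the "x + 4i" entries in the table correspond to sizeof(int) == 4.

```c
#include <assert.h>
#include <stddef.h>

/* The array from the slide: int val[5] = {1, 5, 2, 1, 3}. */
static int val[5] = { 1, 5, 2, 1, 3 };

/* Byte distance from val (address x) to val + i. */
static ptrdiff_t byte_offset(int i)
{
    assert(i >= 0 && i <= 5);   /* one past the end is the last valid pointer */
    return (const char *)(val + i) - (const char *)val;
}

/* *(val + i) and val[i] name the same element. */
static int deref(int i)
{
    assert(i >= 0 && i < 5);
    return *(val + i);
}
```

Pointer arithmetic scales by the element size, so val + 1 is sizeof(int) bytes past val, not one byte. The helper refuses i == 5 for dereference because val[5] has no guaranteed value, as the ?? entry in the table indicates.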
Array Example
typedef int zip_dig[5];
zip_dig cmu = { 1, 5, 2, 1, 3 };
zip_dig mit = { 0, 2, 1, 3, 9 };
zip_dig ucb = { 9, 4, 7, 2, 0 };
zip_dig cmu;   values 1 5 2 1 3 at addresses 16, 20, 24, 28, 32; ends at 36
zip_dig mit;   values 0 2 1 3 9 at addresses 36, 40, 44, 48, 52; ends at 56
zip_dig ucb;   values 9 4 7 2 0 at addresses 56, 60, 64, 68, 72; ends at 76
Declaration “zip_dig cmu” equivalent to “int cmu[5]”
Example arrays were allocated in successive 20 byte blocks
 Not guaranteed to happen in general
Array Accessing Example
zip_dig cmu;   values 1 5 2 1 3 at addresses 16, 20, 24, 28, 32; ends at 36

int get_digit
(zip_dig z, int dig)
{
 return z[dig];
}

IA32
# %edx = z
# %eax = dig
 movl (%edx,%eax,4),%eax # z[dig]

Register %edx contains starting address of array
Register %eax contains array index
Desired digit at 4*%eax + %edx
Use memory reference (%edx,%eax,4)
Referencing Examples
zip_dig cmu;   values 1 5 2 1 3 at addresses 16, 20, 24, 28, 32; ends at 36
zip_dig mit;   values 0 2 1 3 9 at addresses 36, 40, 44, 48, 52; ends at 56
zip_dig ucb;   values 9 4 7 2 0 at addresses 56, 60, 64, 68, 72; ends at 76

Reference   Address            Value   Guaranteed?
mit[3]      36 + 4* 3 = 48     3       Yes
mit[5]      36 + 4* 5 = 56     9       No
mit[-1]     36 + 4*-1 = 32     3       No
cmu[15]     16 + 4*15 = 76     ??      No
 No bound checking
 Out of range behavior implementation-dependent
 No guaranteed relative allocation of different arrays
Array Loop Example

 Original
int zd2int(zip_dig z)
{
 int i;
 int zi = 0;
 for (i = 0; i < 5; i++) {
  zi = 10 * zi + z[i];
 }
 return zi;
}

 Transformed
 As generated by GCC
 Eliminate loop variable i
 Convert array code to pointer code
 Express in do-while form (no test at entrance)
int zd2int(zip_dig z)
{
 int zi = 0;
 int *zend = z + 4;
 do {
  zi = 10 * zi + *z;
  z++;
 } while (z <= zend);
 return zi;
}
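To see that the two forms compute the same value, both can be placed side by side. This is a sketch: the names zd2int_index and zd2int_ptr are hypothetical, and the pointer version also spells out the multiplication the way the generated code does, as *z + 2*(zi + 4*zi).

```c
#include <assert.h>

typedef int zip_dig[5];

/* Array form, as in the original for loop. */
static int zd2int_index(const int *z)
{
    int zi = 0;
    for (int i = 0; i < 5; i++)
        zi = 10 * zi + z[i];
    return zi;
}

/* Pointer form, mirroring the transformed do-while loop. */
static int zd2int_ptr(const int *z)
{
    assert(z != 0);
    int zi = 0;
    const int *zend = z + 4;
    do {
        zi = *z + 2 * (zi + 4 * zi);   /* same value as 10 * zi + *z */
        z++;
    } while (z <= zend);
    return zi;
}
```

For the cmu digits { 1, 5, 2, 1, 3 } both functions return 15213. The do-while form is safe here only because the array is known to hold at least one element.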
Array Loop Implementation (IA32)

Registers
%ecx z
%eax zi
%ebx zend

Computations
 10*zi + *z implemented as
*z + 2*(zi+4*zi)
 z++ increments by 4
int zd2int(zip_dig z)
{
int zi = 0;
int *zend = z + 4;
do {
zi = 10 * zi + *z;
z++;
} while(z <= zend);
return zi;
}
# %ecx = z
 xorl %eax,%eax # zi = 0
 leal 16(%ecx),%ebx # zend = z+4
.L59:
 leal (%eax,%eax,4),%edx # 5*zi
 movl (%ecx),%eax # *z
 addl $4,%ecx # z++
 leal (%eax,%edx,2),%eax # zi = *z + 2*(5*zi)
 cmpl %ebx,%ecx # z : zend
 jle .L59 # if <= goto loop
Nested Array Example
#define PCOUNT 4
zip_dig pgh[PCOUNT] =
{{1, 5, 2, 0, 6},
{1, 5, 2, 1, 3 },
{1, 5, 2, 1, 7 },
{1, 5, 2, 2, 1 }};
zip_dig pgh[4];
Rows: 1 5 2 0 6 | 1 5 2 1 3 | 1 5 2 1 7 | 1 5 2 2 1
Row start addresses: 76, 96, 116, 136; array ends at 156

“zip_dig pgh[4]” equivalent to “int pgh[4][5]”
 Variable pgh: array of 4 elements, allocated contiguously
 Each element is an array of 5 int’s, allocated contiguously

“Row-Major” ordering of all elements guaranteed
Multidimensional (Nested) Arrays

Declaration
T A[R][C];
 2D array of data type T
 R rows, C columns
 Type T element requires K bytes

Array Size
 R * C * K bytes

Arrangement
 Row-Major Ordering

Logical layout:
A[0][0]    • • •  A[0][C-1]
   •                  •
   •                  •
A[R-1][0]  • • •  A[R-1][C-1]

int A[R][C];
Memory order: A[0][0] • • • A[0][C-1], A[1][0] • • • A[1][C-1], • • •, A[R-1][0] • • • A[R-1][C-1]   (4*R*C Bytes)
Nested Array Row Access

Row Vectors
 A[i] is array of C elements
 Each element of type T requires K bytes
 Starting address A + i * (C * K)
int A[R][C];
Row A[0] starts at A:               A[0][0] • • • A[0][C-1]
Row A[i] starts at A+i*C*4:         A[i][0] • • • A[i][C-1]
Row A[R-1] starts at A+(R-1)*C*4:   A[R-1][0] • • • A[R-1][C-1]
Nested Array Row Access Code
int *get_pgh_zip(int index)
{
return pgh[index];
}
#define PCOUNT 4
zip_dig pgh[PCOUNT] =
{{1, 5, 2, 0, 6},
{1, 5, 2, 1, 3 },
{1, 5, 2, 1, 7 },
{1, 5, 2, 2, 1 }};
# %eax = index
leal (%eax,%eax,4),%eax # 5 * index
leal pgh(,%eax,4),%eax # pgh + (20 * index)

Row Vector
 pgh[index] is array of 5 int’s
 Starting address pgh+20*index

IA32 Code
 Computes and returns address
 Compute as pgh + 4*(index+4*index)
Nested Array Element Access

Array Elements
 A[i][j] is element of type T, which requires K bytes
 Address A + i * (C * K) + j * K = A + (i * C + j)* K
int A[R][C];
Row A[0] starts at A
Row A[i] starts at A+i*C*4; element A[i][j] is at A+i*C*4+j*4
Row A[R-1] starts at A+(R-1)*C*4
Nested Array Element Access Code
int get_pgh_digit
(int index, int dig)
{
return pgh[index][dig];
}
# %ecx = dig
# %eax = index
 leal 0(,%ecx,4),%edx # 4*dig
 leal (%eax,%eax,4),%eax # 5*index
 movl pgh(%edx,%eax,4),%eax # *(pgh + 4*dig + 20*index)

Array Elements
 pgh[index][dig] is int
 Address: pgh + 20*index + 4*dig

IA32 Code
 Computes address pgh + 4*dig + 4*(index+4*index)
 movl performs memory reference
Strange Referencing Examples
zip_dig pgh[4];
Rows: 1 5 2 0 6 | 1 5 2 1 3 | 1 5 2 1 7 | 1 5 2 2 1
Row start addresses: 76, 96, 116, 136; array ends at 156

Reference    Address                  Value   Guaranteed?
pgh[3][3]    76+20*3+4*3 = 148        2       Yes
pgh[2][5]    76+20*2+4*5 = 136        1       Yes
pgh[2][-1]   76+20*2+4*-1 = 112       3       Yes
pgh[4][-1]   76+20*4+4*-1 = 152       1       Yes
pgh[0][19]   76+20*0+4*19 = 152       1       Yes
pgh[0][-1]   76+20*0+4*-1 = 72        ??      No

 Code does not do any bounds checking
 Ordering of elements within array guaranteed
Multi-Level Array Example

zip_dig cmu = { 1, 5, 2, 1, 3 };
zip_dig mit = { 0, 2, 1, 3, 9 };
zip_dig ucb = { 9, 4, 7, 2, 0 };

#define UCOUNT 3
int *univ[UCOUNT] = {mit, cmu, ucb};
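univ (array of 3 pointers): address 160 holds 36 (mit), 164 holds 16 (cmu), 168 holds 56 (ucb)
cmu: values 1 5 2 1 3 at addresses 16, 20, 24, 28, 32; ends at 36
mit: values 0 2 1 3 9 at addresses 36, 40, 44, 48, 52; ends at 56
ucb: values 9 4 7 2 0 at addresses 56, 60, 64, 68, 72; ends at 76

Variable univ denotes array of 3 elements
Each element is a pointer
 4 bytes
Each pointer points to array of int’s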
Element Access in Multi-Level Array
int get_univ_digit
(int index, int dig)
{
return univ[index][dig];
}
# %ecx = index
# %eax = dig
 leal 0(,%ecx,4),%edx # 4*index
 movl univ(%edx),%edx # Mem[univ+4*index]
 movl (%edx,%eax,4),%eax # Mem[...+4*dig]

Computation (IA32)
 Element access Mem[Mem[univ+4*index]+4*dig]
 Must do two memory reads
 First get pointer to row array
 Then access element within array
Array Element Accesses
Nested array
int get_pgh_digit
(int index, int dig)
{
return pgh[index][dig];
}
Multi-level array
int get_univ_digit
(int index, int dig)
{
return univ[index][dig];
}
Access looks similar, but element:
Mem[pgh+20*index+4*dig]
Mem[Mem[univ+4*index]+4*dig]
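The difference between the two address computations can be made explicit in C. This is a sketch under the slide's assumptions: the helper names nested_by_hand and multilevel_by_hand are hypothetical, and the constants 20 and 4 in the formulas correspond to sizeof(zip_dig) and sizeof(int) on IA32.

```c
#include <assert.h>
#include <stddef.h>

typedef int zip_dig[5];

/* Nested array: 4 rows of 5 ints in one contiguous block. */
static zip_dig pgh[4] = {
    { 1, 5, 2, 0, 6 }, { 1, 5, 2, 1, 3 }, { 1, 5, 2, 1, 7 }, { 1, 5, 2, 2, 1 }
};

/* Multi-level array: 3 pointers, each to a separate array of ints. */
static zip_dig cmu = { 1, 5, 2, 1, 3 };
static zip_dig mit = { 0, 2, 1, 3, 9 };
static zip_dig ucb = { 9, 4, 7, 2, 0 };
static int *univ[3] = { mit, cmu, ucb };

/* Mem[pgh + 20*index + 4*dig] when sizeof(int) == 4: one memory read. */
static int nested_by_hand(int index, int dig)
{
    assert(index >= 0 && index < 4 && dig >= 0 && dig < 5);
    const char *base = (const char *)pgh;
    size_t offset = (size_t)index * sizeof(zip_dig) + (size_t)dig * sizeof(int);
    return *(const int *)(base + offset);
}

/* Mem[Mem[univ + 4*index] + 4*dig] on IA32: two memory reads. */
static int multilevel_by_hand(int index, int dig)
{
    assert(index >= 0 && index < 3 && dig >= 0 && dig < 5);
    const int *row = *(univ + index);   /* first read: fetch the row pointer */
    return *(row + dig);                /* second read: fetch the element */
}
```

Both helpers return the same values as pgh[index][dig] and univ[index][dig]. The nested form needs only arithmetic on a fixed base address, while the multi-level form must load a pointer from memory before it can reach the element.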
Strange Referencing Examples
univ (array of 3 pointers): address 160 holds 36 (mit), 164 holds 16 (cmu), 168 holds 56 (ucb)
cmu: values 1 5 2 1 3 at addresses 16, 20, 24, 28, 32; ends at 36
mit: values 0 2 1 3 9 at addresses 36, 40, 44, 48, 52; ends at 56
ucb: values 9 4 7 2 0 at addresses 56, 60, 64, 68, 72; ends at 76

Reference     Address          Value   Guaranteed?
univ[2][3]    56+4*3 = 68      2       Yes
univ[1][5]    16+4*5 = 36      0       No
univ[2][-1]   56+4*-1 = 52     9       No
univ[3][-1]   ??               ??      No
univ[1][12]   16+4*12 = 64     7       No

 Code does not do any bounds checking
 Ordering of elements in different arrays not guaranteed
Using Nested Arrays

Strengths
 C compiler handles doubly
subscripted arrays
 Generates very efficient code
 Avoids multiply in index
computation

Limitation
 Only works for fixed array size
#define N 16
typedef int fix_matrix[N][N];
/* Compute element i,k of
fixed matrix product */
int fix_prod_ele
(fix_matrix a, fix_matrix b,
int i, int k)
{
int j;
int result = 0;
for (j = 0; j < N; j++)
result += a[i][j]*b[j][k];
return result;
}
a
b
x
i-th row
j-th column
Dynamic Nested Arrays

Strength
 Can create matrix of any size

Programming
 Must do index computation
explicitly

Performance
 Accessing single element costly
 Must do multiplication
int * new_var_matrix(int n)
{
return (int *)
calloc(sizeof(int), n*n);
}
int var_ele
(int *a, int i, int j, int n)
{
return a[i*n+j];
}
 movl 12(%ebp),%eax # i
 movl 8(%ebp),%edx # a
 imull 20(%ebp),%eax # n*i
 addl 16(%ebp),%eax # n*i+j
 movl (%edx,%eax,4),%eax # Mem[a+4*(i*n+j)]
Dynamic Array Multiplication

Without Optimizations
 Multiplies: 3
2 for subscripts
 1 for data
 Adds: 4
 2 for array indexing
 1 for loop index
 1 for data

/* Compute element i,k of
variable matrix product */
int var_prod_ele
(int *a, int *b,
int i, int k, int n)
{
int j;
int result = 0;
for (j = 0; j < n; j++)
result +=
a[i*n+j] * b[j*n+k];
return result;
}
Optimizing Dynamic Array Multiplication

Original loop:
{
 int j;
 int result = 0;
 for (j = 0; j < n; j++)
  result +=
   a[i*n+j] * b[j*n+k];
 return result;
}

Optimizations
 Performed when set optimization level to -O2

Code Motion
 Expression i*n can be computed outside loop

Strength Reduction
 Incrementing j has effect of incrementing j*n+k by n

Optimized loop:
{
 int j;
 int result = 0;
 int iTn = i*n;
 int jTnPk = k;
 for (j = 0; j < n; j++) {
  result +=
   a[iTn+j] * b[jTnPk];
  jTnPk += n;
 }
 return result;
}

Operations count
 4 adds, 1 mult

Compiler can optimize regular access patterns
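The two loop bodies can be wrapped in complete functions and compared. This is a sketch: the name var_prod_ele_opt is hypothetical, and the hand-written version only imitates what the compiler does at -O2 rather than reproducing its output.

```c
#include <assert.h>

/* Straightforward version: recomputes i*n and j*n on every iteration. */
static int var_prod_ele(const int *a, const int *b, int i, int k, int n)
{
    int result = 0;
    for (int j = 0; j < n; j++)
        result += a[i * n + j] * b[j * n + k];
    return result;
}

/* Hand-optimized version: i*n hoisted out of the loop (code motion) and
   j*n+k replaced by a running sum (strength reduction). */
static int var_prod_ele_opt(const int *a, const int *b, int i, int k, int n)
{
    assert(n >= 0);
    int result = 0;
    int iTn = i * n;
    int jTnPk = k;
    for (int j = 0; j < n; j++) {
        result += a[iTn + j] * b[jTnPk];
        jTnPk += n;
    }
    return result;
}
```

For the 2x2 matrices a = {1, 2, 3, 4} and b = {5, 6, 7, 8} stored row by row, element (0,0) of the product is 1*5 + 2*7 = 19 with either function. The only multiplication left inside the optimized loop is the data multiply.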
Today




Structures
Alignment
Unions
Floating point
Structures
struct rec {
int i;
int a[3];
int *p;
};

Memory Layout
i at offset 0, a at offset 4, p at offset 16; total size 20 bytes
Concept
 Contiguously-allocated region of memory
 Refer to members within structure by names
 Members may be of different types

Accessing Structure Member
void
set_i(struct rec *r,
int val)
{
r->i = val;
}
IA32 Assembly
# %eax = val
# %edx = r
movl %eax,(%edx)
# Mem[r] = val
Generating Pointer to Structure Member
struct rec {
int i;
int a[3];
int *p;
};

Generating Pointer to
Array Element
 Offset of each structure
member determined at
compile time
Memory layout: i at offset 0 (address r), a at offset 4 (a[idx] at r+4+4*idx), p at offset 16; total size 20 bytes
int *find_a
(struct rec *r, int idx)
{
 return &r->a[idx];
}
# %ecx = idx
# %edx = r
 leal 0(,%ecx,4),%eax # 4*idx
 leal 4(%eax,%edx),%eax # r+4*idx+4
Structure Referencing (Cont.)

C Code
struct rec {
int i;
int a[3];
int *p;
};
Memory layout: i at offset 0, a at offset 4, p at offset 16; total size 20 bytes. After set_p, p points to element i of a.

void
set_p(struct rec *r)
{
 r->p =
  &r->a[r->i];
}

# %edx = r
 movl (%edx),%ecx # r->i
 leal 0(,%ecx,4),%eax # 4*(r->i)
 leal 4(%edx,%eax),%eax # r+4+4*(r->i)
 movl %eax,16(%edx) # Update r->p
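The offsets used by the assembly can be checked from C with offsetof. This is a sketch: the helper name a_offset is hypothetical, and the value r+4+4*idx from the slides corresponds to sizeof(int) == 4.

```c
#include <assert.h>
#include <stddef.h>

struct rec {
    int i;
    int a[3];
    int *p;
};

/* Byte offset of a[idx] from the start of the struct. */
static size_t a_offset(int idx)
{
    assert(idx >= 0 && idx < 3);
    return offsetof(struct rec, a) + (size_t)idx * sizeof(int);
}

/* Same as the slide's set_p: point r->p at element r->i of r->a. */
static void set_p(struct rec *r)
{
    assert(r->i >= 0 && r->i < 3);
    r->p = &r->a[r->i];
}
```

Every member offset is fixed at compile time, which is why the generated code can use constant displacements such as 4 and 16. Only the index r->i has to be read at run time.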
Alignment

Aligned Data
 Primitive data type requires K bytes
 Address must be multiple of K
 Required on some machines; advised on IA32


treated differently by IA32 Linux, x86-64 Linux, and Windows!
Motivation for Aligning Data
 Memory accessed by (aligned) chunks of 4 or 8 bytes (system
dependent)
 Inefficient to load or store datum that spans quad word
boundaries
 Virtual memory very tricky when datum spans 2 pages

Compiler
 Inserts gaps in structure to ensure correct alignment of fields
Specific Cases of Alignment (IA32)

1 byte: char, …
 no restrictions on address

2 bytes: short, …
 lowest 1 bit of address must be 0₂

4 bytes: int, float, char *, …
 lowest 2 bits of address must be 00₂

8 bytes: double, …
 Windows (and most other OS’s & instruction sets):
 lowest 3 bits of address must be 000₂
 Linux:
 lowest 2 bits of address must be 00₂
 i.e., treated the same as a 4-byte primitive data type

12 bytes: long double
 Windows, Linux:
 lowest 2 bits of address must be 00₂
 i.e., treated the same as a 4-byte primitive data type

Satisfying Alignment with Structures

Within structure:
 Must satisfy element’s alignment requirement

Overall structure placement
 Each structure has alignment requirement K
 K = Largest alignment of any element
 Initial address & structure length must be multiples of K

struct S1 {
 char c;
 int i[2];
 double v;
} *p;

Example (under Windows or x86-64):
 K = 8, due to double element

Offset  Contents
p+0     c, followed by 3 bytes of padding (p+0 is a multiple of 8)
p+4     i[0] (multiple of 4)
p+8     i[1], followed by 4 bytes of padding
p+16    v (multiple of 8)
p+24    end of structure (multiple of 8)
Different Alignment Conventions

struct S1 {
char c;
int i[2];
double v;
} *p;
x86-64 or IA32 Windows:
 K = 8, due to double element
 c at p+0, 3 bytes padding, i[0] at p+4, i[1] at p+8, 4 bytes padding, v at p+16, ends at p+24

IA32 Linux
 K = 4; double treated like a 4-byte data type
 c at p+0, 3 bytes padding, i[0] at p+4, i[1] at p+8, v at p+12, ends at p+20
Saving Space

Put large data types first
struct S1 {
char c;
int i[2];
double v;
} *p;

struct S2 {
double v;
int i[2];
char c;
} *p;
Effect (example x86-64, both have K=8)
S1: c at p+0, 3 bytes padding, i[0] at p+4, i[1] at p+8, 4 bytes padding, v at p+16, ends at p+24
S2: v at p+0, i[0] at p+8, i[1] at p+12, c at p+16, ends at p+24
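The padding a particular compiler inserts can be measured with sizeof and offsetof. This is a sketch; the helper names s1_padding, s2_padding, and is_aligned are hypothetical, and the numbers it reports depend on the platform's alignment rules, as the slides note.

```c
#include <assert.h>
#include <stddef.h>

struct S1 { char c; int i[2]; double v; };
struct S2 { double v; int i[2]; char c; };

/* Padding bytes in S1: total size minus the sum of the member sizes. */
static size_t s1_padding(void)
{
    return sizeof(struct S1) - (sizeof(char) + 2 * sizeof(int) + sizeof(double));
}

/* Same computation for S2. */
static size_t s2_padding(void)
{
    return sizeof(struct S2) - (sizeof(char) + 2 * sizeof(int) + sizeof(double));
}

/* True when a byte offset is a multiple of the given alignment. */
static int is_aligned(size_t offset, size_t alignment)
{
    assert(alignment != 0);
    return offset % alignment == 0;
}
```

On x86-64 Linux one would expect both helpers to report 7 bytes, 3 + 4 inside S1 and 7 at the end of S2, though that figure is an expectation for that platform rather than a guarantee. Calling is_aligned(offsetof(struct S1, v), 8) shows whether double gets 8-byte alignment on the machine at hand.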
Arrays of Structures

Satisfy alignment requirement
for every element
struct S2 {
 double v;
 int i[2];
 char c;
} a[10];

a[0] at a+0, a[1] at a+24, a[2] at a+48, •••

Layout of a[1]:
a+24   v
a+32   i[0]
a+36   i[1]
a+40   c, followed by 7 bytes of padding
a+48   start of a[2]
Accessing Array Elements



struct S3 {
 short i;
 float v;
 short j;
} a[10];

Compute array offset 12i
Compute offset 8 within structure
Assembler gives offset a+8
 Resolved during linking

a[0] at a+0, • • •, a[i] at a+12i, • • •

Layout of a[i]:
a+12i     i, followed by 2 bytes of padding
a+12i+4   v
a+12i+8   j, followed by 2 bytes of padding

short get_j(int idx)
{
 return a[idx].j;
}

# %eax = idx
 leal (%eax,%eax,2),%eax # 3*idx
 movswl a+8(,%eax,4),%eax
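The address arithmetic behind get_j can be written out in C. This is a sketch: the helper name j_offset is hypothetical, and the slide's figures 12 and 8 correspond to sizeof(struct S3) and offsetof(struct S3, j) on IA32.

```c
#include <assert.h>
#include <stddef.h>

struct S3 { short i; float v; short j; };

static struct S3 a[10];

/* Byte offset of a[idx].j from the start of the array, computed the way the
   assembly does: idx * sizeof(struct S3) + offsetof(struct S3, j). */
static size_t j_offset(int idx)
{
    assert(idx >= 0 && idx < 10);
    return (size_t)idx * sizeof(struct S3) + offsetof(struct S3, j);
}

/* Same as the slide's get_j. */
static short get_j(int idx)
{
    assert(idx >= 0 && idx < 10);
    return a[idx].j;
}
```

The assembly folds the constant part of this offset into the displacement a+8 and scales 3*idx by 4 in the addressing mode, which together give a + 12*idx + 8.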
Union Allocation


Allocate according to largest element
Can only use one field at a time
union U1 {
 char c;
 int i[2];
 double v;
} *up;
c, i[0], and v all start at up+0; i[1] starts at up+4; the union ends at up+8

struct S1 {
 char c;
 int i[2];
 double v;
} *sp;
sp+0    c, followed by 3 bytes of padding
sp+4    i[0]
sp+8    i[1], followed by 4 bytes of padding
sp+16   v
sp+24   end of structure
Using Union to Access Bit Patterns
typedef union {
 float f;
 unsigned u;
} bit_float_t;
u and f share bytes 0 to 4

float bit2float(unsigned u)
{
 bit_float_t arg;
 arg.u = u;
 return arg.f;
}
Same as (float) u ?

unsigned float2bit(float f)
{
 bit_float_t arg;
 arg.f = f;
 return arg.u;
}
Same as (unsigned) f ?
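The two questions can be answered by running the functions next to the casts. This is a sketch that assumes float is an IEEE 754 single and that unsigned is 32 bits wide, which holds on IA32 and x86-64 but is not required by C.

```c
#include <assert.h>

typedef union {
    float f;
    unsigned u;
} bit_float_t;

/* Reinterpret the bits of u as a float (no numeric conversion). */
static float bit2float(unsigned u)
{
    bit_float_t arg;
    arg.u = u;
    return arg.f;
}

/* Reinterpret the bits of f as an unsigned (no numeric conversion). */
static unsigned float2bit(float f)
{
    assert(sizeof(float) == sizeof(unsigned));  /* assumption of this sketch */
    bit_float_t arg;
    arg.f = f;
    return arg.u;
}
```

Under those assumptions float2bit(1.0f) yields 0x3f800000, whereas (unsigned) 1.0f yields 1, so the answer to both questions is no. The union copies the bit pattern, while the cast converts the numeric value.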
Byte Ordering Revisited

Idea
 Short/long/quad words stored in memory as 2/4/8 consecutive bytes
 Which is most (least) significant?
 Can cause problems when exchanging binary data between machines

Big Endian
 Most significant byte has lowest address
 PowerPC, Sparc

Little Endian
 Least significant byte has lowest address
 Intel x86
Byte Ordering Example
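union {
 unsigned char c[8];
 unsigned short s[4];
 unsigned int i[2];
 unsigned long l[1];
} dw;

All members overlay the same bytes:
c[0] c[1] c[2] c[3] c[4] c[5] c[6] c[7]
s[0] = c[0..1], s[1] = c[2..3], s[2] = c[4..5], s[3] = c[6..7]
i[0] = c[0..3], i[1] = c[4..7]
l[0] starts at c[0]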
Byte Ordering Example (Cont).
int j;
for (j = 0; j < 8; j++)
 dw.c[j] = 0xf0 + j;
printf("Characters 0-7 == "
 "[0x%x,0x%x,0x%x,0x%x,0x%x,0x%x,0x%x,0x%x]\n",
 dw.c[0], dw.c[1], dw.c[2], dw.c[3],
 dw.c[4], dw.c[5], dw.c[6], dw.c[7]);
printf("Shorts 0-3 == "
 "[0x%x,0x%x,0x%x,0x%x]\n",
 dw.s[0], dw.s[1], dw.s[2], dw.s[3]);
printf("Ints 0-1 == [0x%x,0x%x]\n",
 dw.i[0], dw.i[1]);
printf("Long 0 == [0x%lx]\n",
 dw.l[0]);
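What each kind of machine reads from the bytes 0xf0 through 0xf7 can be computed portably, without depending on the host's byte order. This is a sketch; read_u32_le and read_u32_be are hypothetical helper names.

```c
#include <assert.h>
#include <stdint.h>

/* Value a little-endian machine reads from 4 consecutive bytes:
   b[0] is the least significant byte. */
static uint32_t read_u32_le(const unsigned char b[4])
{
    assert(b != 0);
    return (uint32_t)b[0] | (uint32_t)b[1] << 8 |
           (uint32_t)b[2] << 16 | (uint32_t)b[3] << 24;
}

/* Value a big-endian machine reads: b[0] is the most significant byte. */
static uint32_t read_u32_be(const unsigned char b[4])
{
    assert(b != 0);
    return (uint32_t)b[0] << 24 | (uint32_t)b[1] << 16 |
           (uint32_t)b[2] << 8 | (uint32_t)b[3];
}
```

Applied to the first four bytes f0 f1 f2 f3, the little-endian reader gives 0xf3f2f1f0 and the big-endian reader gives 0xf0f1f2f3. Explicit shifts like these are a common way to exchange binary data between machines of different byte order.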
Byte Ordering on IA32
Little Endian

Bytes:  f0   f1   f2   f3   f4   f5   f6   f7
        c[0] c[1] c[2] c[3] c[4] c[5] c[6] c[7]
s[0] = c[0..1], s[1] = c[2..3], s[2] = c[4..5], s[3] = c[6..7]   (LSB first, MSB last)
i[0] = c[0..3], i[1] = c[4..7]   (LSB first, MSB last)
l[0] = c[0..3]   (LSB first, MSB last)

Output on IA32:
Characters 0-7 == [0xf0,0xf1,0xf2,0xf3,0xf4,0xf5,0xf6,0xf7]
Shorts 0-3 == [0xf1f0,0xf3f2,0xf5f4,0xf7f6]
Ints 0-1 == [0xf3f2f1f0,0xf7f6f5f4]
Long 0 == [0xf3f2f1f0]
Byte Ordering on Sun
Big Endian

Bytes:  f0   f1   f2   f3   f4   f5   f6   f7
        c[0] c[1] c[2] c[3] c[4] c[5] c[6] c[7]
s[0] = c[0..1], s[1] = c[2..3], s[2] = c[4..5], s[3] = c[6..7]   (MSB first, LSB last)
i[0] = c[0..3], i[1] = c[4..7]   (MSB first, LSB last)
l[0] = c[0..3]   (MSB first, LSB last)

Output on Sun:
Characters 0-7 == [0xf0,0xf1,0xf2,0xf3,0xf4,0xf5,0xf6,0xf7]
Shorts 0-3 == [0xf0f1,0xf2f3,0xf4f5,0xf6f7]
Ints 0-1 == [0xf0f1f2f3,0xf4f5f6f7]
Long 0 == [0xf0f1f2f3]
Byte Ordering on x86-64
Little Endian

Bytes:  f0   f1   f2   f3   f4   f5   f6   f7
        c[0] c[1] c[2] c[3] c[4] c[5] c[6] c[7]
s[0] = c[0..1], s[1] = c[2..3], s[2] = c[4..5], s[3] = c[6..7]   (LSB first, MSB last)
i[0] = c[0..3], i[1] = c[4..7]   (LSB first, MSB last)
l[0] = c[0..7]   (LSB first, MSB last)

Output on x86-64:
Characters 0-7 == [0xf0,0xf1,0xf2,0xf3,0xf4,0xf5,0xf6,0xf7]
Shorts 0-3 == [0xf1f0,0xf3f2,0xf5f4,0xf7f6]
Ints 0-1 == [0xf3f2f1f0,0xf7f6f5f4]
Long 0 == [0xf7f6f5f4f3f2f1f0]
Summary

Arrays in C
 Contiguous allocation of memory
 Aligned to satisfy every element’s alignment requirement
 Pointer to first element
 No bounds checking

Structures
 Allocate bytes in order declared
 Pad in middle and at end to satisfy alignment

Unions
 Overlay declarations
 Way to circumvent type system
Today




Structures
Alignment
Unions
Floating point
 x87 (available with IA32, becoming obsolete)
 SSE3 (available with x86-64)
IA32 Floating Point (x87)

History
 8086: first processor to implement IEEE FP, using separate 8087 FPU (floating point unit)
 486: merged FPU and Integer Unit onto one chip
 Becoming obsolete with x86-64

Summary
 Hardware to add, multiply, and divide
 Floating point data registers
 Various control & status registers

Floating Point Formats
 single precision (C float): 32 bits
 double precision (C double): 64 bits
 extended precision (C long double): 80 bits

Block diagram: instruction decoder and sequencer feeding the Integer Unit and the FPU, both connected to Memory
FPU Data Register Stack (x87)

FPU register format (80 bit extended precision)
 bit 79: s   bits 78–64: exp   bits 63–0: frac

FPU registers
 8 registers %st(0) - %st(7)
 Logically form stack
 Top: %st(0)
 Bottom disappears (drops out) after too many pushes

Stack (bottom to “Top”): %st(3), %st(2), %st(1), %st(0)
FPU instructions (x87)

Large number of floating point instructions and formats
 ~50 basic instruction types
 load, store, add, multiply
 sin, cos, tan, arctan, and log


Often slower than math lib
Sample instructions:
Instruction   Effect                          Description
fldz          push 0.0                        Load zero
flds Addr     push Mem[Addr]                  Load single precision real
fmuls Addr    %st(0) ← %st(0)*M[Addr]         Multiply
faddp         %st(1) ← %st(0)+%st(1);pop      Add and pop
FP Code Example (x87)

Compute inner product of two
vectors
 Single precision arithmetic
 Common computation
float ipf (float x[],
float y[],
int n)
{
int i;
float result = 0.0;
for (i = 0; i < n; i++)
result += x[i]*y[i];
return result;
}
 pushl %ebp # setup
 movl %esp,%ebp
 pushl %ebx
 movl 8(%ebp),%ebx # %ebx=&x
 movl 12(%ebp),%ecx # %ecx=&y
 movl 16(%ebp),%edx # %edx=n
 fldz # push +0.0
 xorl %eax,%eax # i=0
 cmpl %edx,%eax
 jge .L3 # if i>=n done
.L5:
 flds (%ebx,%eax,4) # push x[i]
 fmuls (%ecx,%eax,4) # st(0)*=y[i]
 faddp # st(1)+=st(0); pop
 incl %eax # i++
 cmpl %edx,%eax
 jl .L5 # if i<n repeat
.L3:
 movl -4(%ebp),%ebx # finish
 movl %ebp, %esp
 popl %ebp
 ret # st(0) = result
Inner Product Stack Trace
eax = i
ebx = *x
ecx = *y

Initialization
 1. fldz                    %st(0) = 0.0

Iteration 0
 2. flds (%ebx,%eax,4)      %st(1) = 0.0, %st(0) = x[0]
 3. fmuls (%ecx,%eax,4)     %st(1) = 0.0, %st(0) = x[0]*y[0]
 4. faddp                   %st(0) = 0.0+x[0]*y[0]

Iteration 1
 5. flds (%ebx,%eax,4)      %st(1) = x[0]*y[0], %st(0) = x[1]
 6. fmuls (%ecx,%eax,4)     %st(1) = x[0]*y[0], %st(0) = x[1]*y[1]
 7. faddp                   %st(0) = x[0]*y[0]+x[1]*y[1]
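The stack trace can be reproduced with a small C model of the register stack. This is a sketch with hypothetical names (toy_x87, toy_push, toy_faddp, toy_ipf). It stores float values, whereas the real registers hold 80-bit extended precision, so rounding may differ from the hardware.

```c
#include <assert.h>

/* Hypothetical model of the x87 register stack: st[0] plays %st(0). */
typedef struct {
    float st[8];
    int depth;          /* number of live entries */
} toy_x87;

/* Push a value; every existing entry moves one slot further from the top. */
static void toy_push(toy_x87 *f, float v)
{
    assert(f->depth < 8);
    for (int k = f->depth; k > 0; k--)
        f->st[k] = f->st[k - 1];
    f->st[0] = v;
    f->depth++;
}

/* faddp: %st(1) += %st(0), then pop, so the sum becomes the new %st(0). */
static void toy_faddp(toy_x87 *f)
{
    assert(f->depth >= 2);
    f->st[1] += f->st[0];
    for (int k = 0; k + 1 < f->depth; k++)
        f->st[k] = f->st[k + 1];
    f->depth--;
}

/* Inner product following the instruction sequence: fldz once, then
   flds / fmuls / faddp for each element. */
static float toy_ipf(const float *x, const float *y, int n)
{
    toy_x87 f = { {0}, 0 };
    toy_push(&f, 0.0f);                 /* fldz */
    for (int i = 0; i < n; i++) {
        toy_push(&f, x[i]);             /* flds  x[i] */
        f.st[0] *= y[i];                /* fmuls y[i] */
        toy_faddp(&f);                  /* faddp */
    }
    return f.st[0];
}
```

After each iteration the stack is back to a single entry holding the running sum, which matches steps 4 and 7 of the trace. For x = {1, 2, 3} and y = {4, 5, 6} the model returns 32.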
Machine Programming – x86-64
extensions
x86-64 Integer Registers
%rax  %eax      %r8   %r8d
%rbx  %ebx      %r9   %r9d
%rcx  %ecx      %r10  %r10d
%rdx  %edx      %r11  %r11d
%rsi  %esi      %r12  %r12d
%rdi  %edi      %r13  %r13d
%rsp  %esp      %r14  %r14d
%rbp  %ebp      %r15  %r15d
 Twice the number of registers
 Accessible as 8, 16, 32, 64 bits
x86-64 Integer Registers
%rax  Return value      %r8   Argument #5
%rbx  Callee saved      %r9   Argument #6
%rcx  Argument #4       %r10  Caller saved
%rdx  Argument #3       %r11  Used for linking
%rsi  Argument #2       %r12  Callee saved
%rdi  Argument #1       %r13  Callee saved
%rsp  Stack pointer     %r14  Callee saved
%rbp  Callee saved      %r15  Callee saved
x86-64 Registers

Arguments passed to functions via registers
 If more than 6 integral parameters, then pass rest on stack
 These registers can be used as caller-saved as well

All references to stack frame via stack pointer
 Eliminates need to update %ebp/%rbp

Other Registers
 6 callee saved (%rbx, %rbp, %r12–%r15)
 2 or 3 have special uses
x86-64 Long Swap
void swap(long *xp, long *yp)
{
long t0 = *xp;
long t1 = *yp;
*xp = t1;
*yp = t0;
}

swap:
 movq (%rdi), %rdx
 movq (%rsi), %rax
 movq %rax, (%rdi)
 movq %rdx, (%rsi)
 ret

Operands passed in registers
 First (xp) in %rdi, second (yp) in %rsi
 64-bit pointers

No stack operations required (except ret)
Avoiding stack
 Can hold all local information in registers
x86-64 Locals in the Red Zone
/* Swap, using local array */
void swap_a(long *xp, long *yp)
{
volatile long loc[2];
loc[0] = *xp;
loc[1] = *yp;
*xp = loc[1];
*yp = loc[0];
}

swap_a:
 movq (%rdi), %rax
 movq %rax, -24(%rsp)
 movq (%rsi), %rax
 movq %rax, -16(%rsp)
 movq -16(%rsp), %rax
 movq %rax, (%rdi)
 movq -24(%rsp), %rax
 movq %rax, (%rsi)
 ret

Avoiding Stack Pointer Change
 Can hold all information within small window beyond stack pointer

Stack (offsets from %rsp):
   0  rtn Ptr ← %rsp
  −8  unused
 −16  loc[1]
 −24  loc[0]
x86-64 NonLeaf without Stack Frame
long scount = 0;
/* Swap a[i] & a[i+1] */
void swap_ele_se
(long a[], int i)
{
swap(&a[i], &a[i+1]);
scount++;
}

No values held while swap being invoked
No callee save registers needed
swap_ele_se:
    movslq %esi,%rsi             # Sign extend i
    leaq   (%rdi,%rsi,8), %rdi   # &a[i]
    leaq   8(%rdi), %rsi         # &a[i+1]
    call   swap                  # swap()
    incq   scount(%rip)          # scount++;
    ret
x86-64 Call using Jump
long scount = 0;
/* Swap a[i] & a[i+1] */
void swap_ele(long a[], int i)
{
swap(&a[i], &a[i+1]);
}


When swap executes ret, it will return from swap_ele
Possible since swap is a “tail call” (no instructions afterwards)
swap_ele:
    movslq %esi,%rsi             # Sign extend i
    leaq   (%rdi,%rsi,8), %rdi   # &a[i]
    leaq   8(%rdi), %rsi         # &a[i+1]
    jmp    swap                  # swap()
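To experiment with the tail-call case, the sketch below collects swap and swap_ele from the slides into one file. The helper demo_swap_ele is hypothetical and exists only to make the effect observable. Whether a compiler really emits jmp instead of call here depends on the optimization level, so the comment about it is an assumption.

```c
/* Swap the two longs that xp and yp point to (from the slides). */
void swap(long *xp, long *yp)
{
    long t0 = *xp;
    long t1 = *yp;
    *xp = t1;
    *yp = t0;
}

/* Swap a[i] and a[i+1]. The call to swap is the last action, so an
 * optimizing compiler may (assumption) replace call+ret with a jmp. */
void swap_ele(long a[], int i)
{
    swap(&a[i], &a[i + 1]);
}

/* Hypothetical helper: swaps elements i and i+1 of {10, 20, 30}
 * and returns the element that ends up at index k. */
long demo_swap_ele(int i, int k)
{
    long a[3] = {10, 20, 30};
    swap_ele(a, i);
    return a[k];
}
```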
x86-64 Stack Frame Example
long sum = 0;
/* Swap a[i] & a[i+1] */
void swap_ele_su(long a[], int i)
{
swap(&a[i], &a[i+1]);
sum += a[i];
}


Keeps values of a and i in callee save registers
Must set up stack frame to save these registers
swap_ele_su:
    movq   %rbx, -16(%rsp)
    movslq %esi,%rbx
    movq   %r12, -8(%rsp)
    movq   %rdi, %r12
    leaq   (%rdi,%rbx,8), %rdi
    subq   $16, %rsp
    leaq   8(%rdi), %rsi
    call   swap
    movq   (%r12,%rbx,8), %rax
    addq   %rax, sum(%rip)
    movq   (%rsp), %rbx
    movq   8(%rsp), %r12
    addq   $16, %rsp
    ret
Understanding x86-64 Stack Frame
swap_ele_su:
    movq   %rbx, -16(%rsp)        # Save %rbx
    movslq %esi,%rbx              # Extend & save i
    movq   %r12, -8(%rsp)         # Save %r12
    movq   %rdi, %r12             # Save a
    leaq   (%rdi,%rbx,8), %rdi    # &a[i]
    subq   $16, %rsp              # Allocate stack frame
    leaq   8(%rdi), %rsi          # &a[i+1]
    call   swap                   # swap()
    movq   (%r12,%rbx,8), %rax    # a[i]
    addq   %rax, sum(%rip)        # sum += a[i]
    movq   (%rsp), %rbx           # Restore %rbx
    movq   8(%rsp), %r12          # Restore %r12
    addq   $16, %rsp              # Deallocate stack frame
    ret
Understanding x86-64 Stack Frame
Stack contents at two points in swap_ele_su (same code as on the previous slide):

Before "subq $16, %rsp" (registers saved below the stack pointer):
    %rsp →    0   rtn addr
             −8   %r12
            −16   %rbx

After "subq $16, %rsp" (stack frame allocated):
            +16   rtn addr
             +8   %r12
    %rsp →    0   %rbx
Interesting Features of Stack Frame

Allocate entire frame at once
 All stack accesses can be relative to %rsp
 Do by decrementing stack pointer
 Can delay allocation, since safe to temporarily use red zone

Simple deallocation
 Increment stack pointer
 No base/frame pointer needed
Interesting Features of Stack Frame


Many compiled functions do not require a stack frame
other than saving their return address.
A function does not require a stack frame if:
 All local variables can be held in registers
 The function does not call other functions (referred to as leaf
procedures)

A function would require a stack frame if the function:
 Has too many local variables to hold in registers
 Has some local variables that are arrays or structures
 Uses the & operator to compute the address of a local variable
 Must pass some arguments on the stack to another function
 Needs to save the state of a callee-save register
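The two cases above can be contrasted with a small sketch. All three functions are hypothetical and not from the slides; the comments describe what a compiler would typically do, which is an assumption, since inlining or optimization can change the outcome.

```c
/* Leaf procedure: its one local fits in a register and it calls
 * nothing, so it typically needs no stack frame. */
long scale_add(long x, long y)
{
    long t = 3 * x;          /* t can live in a register */
    return t + y;
}

/* Helper that writes its result through a pointer. */
void store_double(long v, long *out)
{
    *out = 2 * v;
}

/* Applies & to a local and calls another function, so tmp
 * generally needs a real memory location on the stack. */
long needs_frame(long x)
{
    long tmp;
    store_double(x, &tmp);
    return tmp + 1;
}
```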
General Conditional Expression Translation
C Code
val = Test ? Then-Expr : Else-Expr;
val = x>y ? x-y : y-x;
Goto Version
nt = !Test;
if (nt) goto Else;
val = Then-Expr;
Done:
. . .
Else:
val = Else-Expr;
goto Done;
 Test is an expression returning an integer
    = 0 interpreted as false
    ≠ 0 interpreted as true
 Create separate code regions for then & else expressions
 Execute appropriate one
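Applied to the example val = x>y ? x-y : y-x, the goto template above can be written as compilable C. This is a minimal sketch; the function name absdiff_goto is made up for illustration.

```c
/* Goto version of: val = x > y ? x - y : y - x; */
int absdiff_goto(int x, int y)
{
    int val;
    int nt = !(x > y);       /* nt = !Test */
    if (nt) goto Else;
    val = x - y;             /* Then-Expr */
Done:
    return val;
Else:
    val = y - x;             /* Else-Expr */
    goto Done;
}
```

Only one of the two assignments to val runs on any given call, which is the property that the conditional-move translation gives up.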
Conditionals: x86-64
int absdiff(int x, int y)
{
int result;
if (x > y) {
result = x-y;
} else {
result = y-x;
}
return result;
}

absdiff:                  # x in %edi, y in %esi
    movl   %edi, %eax     # eax = x
    movl   %esi, %edx     # edx = y
    subl   %esi, %eax     # eax = x-y
    subl   %edi, %edx     # edx = y-x
    cmpl   %esi, %edi     # x:y
    cmovle %edx, %eax     # eax=edx if <=
    ret
Conditional move instruction
 cmovC src, dest
 Move value from src to dest if condition C holds
 Often more efficient than conditional branching (simple control flow, no branch to mispredict)
 But overhead: both branches are evaluated
General Form with Conditional Move
C Code
val = Test ? Then-Expr : Else-Expr;
Conditional Move Version
val1 = Then-Expr;
val2 = Else-Expr;
val1 = val2 if !Test;
Both values get computed
Overwrite then-value with else-value if condition doesn’t hold
Don’t use when:
 Then or else expression has side effects
 Then and else expressions are too expensive
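The sketch below mirrors the conditional-move form in plain C and then shows why side effects rule it out. All names (absdiff_cmov, hits, misses, record_branch, record_eager) are hypothetical. The C code only models the "compute both, then select" pattern; it does not guarantee that a compiler emits cmov.

```c
/* Conditional-move style: compute both values, then select one. */
int absdiff_cmov(int x, int y)
{
    int val1 = x - y;              /* Then-Expr */
    int val2 = y - x;              /* Else-Expr */
    if (!(x > y)) val1 = val2;     /* val1 = val2 if !Test */
    return val1;
}

int hits = 0, misses = 0;

/* Branching version: exactly one counter is incremented. */
int record_branch(int found)
{
    return found ? ++hits : ++misses;
}

/* Naive "compute both" version: BOTH counters are incremented,
 * so the side effects differ from record_branch. */
int record_eager(int found)
{
    int val1 = ++hits;
    int val2 = ++misses;
    if (!found) val1 = val2;
    return val1;
}
```

absdiff_cmov is safe to evaluate eagerly because both expressions are cheap and free of side effects; record_eager changes program state that record_branch would have left alone.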
Specific Cases of Alignment (x86-64)

1 byte: char, …
 no restrictions on address
2 bytes: short, …
 lowest 1 bit of address must be 0₂
4 bytes: int, float, …
 lowest 2 bits of address must be 00₂
8 bytes: double, char *, …
 Windows & Linux: lowest 3 bits of address must be 000₂
16 bytes: long double
 Linux: lowest 4 bits of address must be 0000₂
 i.e., 16-byte aligned under the System V x86-64 ABI
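The "lowest bits must be zero" rule is the same as saying the address is a multiple of the alignment. The following sketch checks this for the fields of a structure; struct sample, is_aligned, and fields_aligned are hypothetical names, and _Alignof assumes a C11 compiler.

```c
#include <stddef.h>   /* size_t, offsetof */
#include <stdint.h>   /* uintptr_t */

struct sample {
    char   c;   /* 1 byte: any address */
    int    i;   /* typically 4 bytes */
    double d;   /* typically 8 bytes */
};

/* Returns 1 if address p is a multiple of align, else 0. */
int is_aligned(const void *p, size_t align)
{
    return ((uintptr_t)p % align) == 0;
}

/* Returns 1 if every field of *s sits at an address that
 * satisfies the alignment requirement of its type. */
int fields_aligned(const struct sample *s)
{
    return is_aligned(&s->c, _Alignof(char))
        && is_aligned(&s->i, _Alignof(int))
        && is_aligned(&s->d, _Alignof(double));
}
```

The compiler inserts padding between c and i, and between i and d, so that fields_aligned holds for any properly allocated struct sample.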

Vector Instructions: SSE Family

SIMD (single-instruction, multiple-data) vector instructions
 New data types, registers, operations
 Parallel operation on small (length 2-8) vectors of integers or floats
 Example: [Figure: “4-way” element-wise addition (+) and multiplication (x) of two vectors]
Floating point vector instructions
 Available with Intel’s SSE (streaming SIMD extensions) family
 SSE starting with Pentium III: 4-way single precision
 SSE2 starting with Pentium 4: 2-way double precision
 All x86-64 processors support SSE2 (a superset of SSE); all but the earliest also support SSE3
SSE3 Registers


All caller saved
%xmm0 for floating point return value
128 bit = 2 doubles = 4 singles
%xmm0   Argument #1    %xmm8
%xmm1   Argument #2    %xmm9
%xmm2   Argument #3    %xmm10
%xmm3   Argument #4    %xmm11
%xmm4   Argument #5    %xmm12
%xmm5   Argument #6    %xmm13
%xmm6   Argument #7    %xmm14
%xmm7   Argument #8    %xmm15
SSE3 Registers


Different data types and associated instructions (each register is 128 bits wide)
Integer vectors:
 16-way byte
 8-way 2 bytes
 4-way 4 bytes
Floating point vectors:
 4-way single
 2-way double
Floating point scalars (held at the least significant end, LSB, of the register):
 single
 double
SSE3 Instructions: Examples

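Single precision 4-way vector add: addps %xmm0, %xmm1
 [Figure: each of the four single-precision elements of %xmm0 is added to the corresponding element of %xmm1]
Single precision scalar add: addss %xmm0, %xmm1
 [Figure: only the lowest single-precision element of %xmm0 is added to the lowest element of %xmm1]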
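The difference between the two instructions can be modeled in portable C. This is only a sketch of the semantics, not real SSE code: the type xmm_t and the functions model_addps, model_addss, and lane_of are hypothetical, and a plain loop stands in for what the hardware does in a single instruction.

```c
/* Model of a 128-bit XMM register holding four singles;
 * lane[0] is the least significant element. */
typedef struct { float lane[4]; } xmm_t;

/* Models "addps src, dst": every lane of dst gets dst + src. */
xmm_t model_addps(xmm_t src, xmm_t dst)
{
    for (int i = 0; i < 4; i++)
        dst.lane[i] += src.lane[i];
    return dst;
}

/* Models "addss src, dst": only the lowest lane is added;
 * the upper three lanes of dst are left unchanged. */
xmm_t model_addss(xmm_t src, xmm_t dst)
{
    dst.lane[0] += src.lane[0];
    return dst;
}

/* Reads lane i of a modeled register. */
float lane_of(xmm_t r, int i)
{
    return r.lane[i];
}
```

On real hardware the same operations are available from C through compiler intrinsics, but the model above is enough to see which lanes each instruction touches.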
Extending to x86-64



Pointers and long ints are 64 bits long. Integer arithmetic operations support 8, 16, 32 and 64-bit data types
The set of general purpose registers is expanded from 8 to 16
Much of the program state is held in registers rather than on the stack.
 Integer and pointer arguments (up to 6) to procedures are passed via registers.
 Some procedures do not need to access the stack at all.
Conditional operations are implemented using conditional move instructions, when possible,
 yielding better performance than traditional branching
Floating point operations are implemented using register-oriented SSE2, rather than stack-based x87
Procedures (x86-64): Optimizations
 No base/frame pointer
 Passing arguments to functions through registers (if possible)
 Sometimes: Writing into the “red zone” (below stack pointer)
    %rsp →   0   rtn Ptr
            −8   unused
           −16   loc[1]
           −24   loc[0]
 Sometimes: Function call using jmp (instead of call)
 Reason: Performance
    use stack as little as possible
    while obeying rules (e.g., caller/callee save registers)