Machine Programming – IA32 memory
layout and buffer overflow
CENG331: Introduction to Computer Systems
7th Lecture
Instructor:
Erol Sahin
Acknowledgement: Most of the slides are adapted from the ones prepared
by R.E. Bryant, D.R. O’Hallaron of Carnegie-Mellon Univ.
Linux Memory Layout
[Figure: IA32 Linux address space (Red Hat v. 6.2, ~1920 MB memory limit), labeled by the upper 2 hex digits of the address. From top to bottom: FF, C0, BF Stack, 80, 7F Heap, 40 DLLs, 3F Heap, Data, 08 Text, 00.]
Stack
 Runtime stack (8MB limit)
Heap
 Dynamically allocated storage
 Allocated by calls to malloc, calloc, new
DLLs
 Dynamically Linked Libraries
 Library routines (e.g., printf, malloc)
 Linked into object code when first executed
Data
 Statically allocated data
 E.g., arrays & strings declared in code
Text
 Executable machine instructions
 Read-only
Linux Memory Allocation
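[Figure: four snapshots of the address space, labeled Initially, Linked, Some Heap, and More Heap. All four show the Stack below BF and Data and Text starting at 08. From "Linked" onward, DLLs appear at 40. "Some Heap" adds a heap region, and "More Heap" shows heap regions both above the DLLs and between Data and the DLLs.]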
Text & Stack Example
(gdb) break main
(gdb) run
Breakpoint 1, 0x804856f in main ()
(gdb) print $esp
$3 = (void *) 0xbffffc78

Main
 Address 0x804856f should be read 0x0804856f
Stack
 Address 0xbffffc78
[Figure: initial address space with the Stack below BF and Data and Text starting at 08]
Dynamic Linking Example
(gdb) print malloc
$1 = {<text variable, no debug info>}
0x8048454 <malloc>
(gdb) run
Program exited normally.
(gdb) print malloc
$2 = {void *(unsigned int)}
0x40006240 <malloc>

Initially
 Code in text segment that invokes dynamic linker
 Address 0x8048454 should be read 0x08048454

Final
 Code in DLL region
[Figure: address space after linking, with the Stack below BF, DLLs at 40, and Data and Text starting at 08]
Memory Allocation Example
#include <stdlib.h>

char big_array[1<<24];   /* 16 MB */
char huge_array[1<<28];  /* 256 MB */

int beyond;
char *p1, *p2, *p3, *p4;

int useless() { return 0; }

int main()
{
    p1 = malloc(1 << 28);  /* 256 MB */
    p2 = malloc(1 << 8);   /* 256 B */
    p3 = malloc(1 << 28);  /* 256 MB */
    p4 = malloc(1 << 8);   /* 256 B */
    /* Some print statements ... */
    return 0;
}
Example Addresses
$esp             0xbffffc78
p3               0x500b5008
p1               0x400b4008
Final malloc     0x40006240
p4               0x1904a640
p2               0x1904a538
beyond           0x1904a524
big_array        0x1804a520
huge_array       0x0804a510
main()           0x0804856f
useless()        0x08048560
Initial malloc   0x08048454
[Figure: address space with the Stack below BF, Heap above and below the DLLs at 40, and Data and Text starting at 08]
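The addresses above can be sorted into the bands of the layout diagram by looking only at their upper two hex digits. The following is a minimal sketch under that assumption; the helper names `upper_byte` and `region_band` are hypothetical, and the band boundaries (08, 40, 80, C0) are read off the slide's diagram for that particular Linux setup, not a general rule.

```c
#include <stdint.h>

/* Upper two hex digits (top byte) of a 32-bit IA32 address. */
unsigned upper_byte(uint32_t addr)
{
    return addr >> 24;
}

/* Coarse band of the address space, using the labels on the layout
   diagram. The band names are illustrative; the diagram does not
   label the ranges below 08 or above BF. */
const char *region_band(uint32_t addr)
{
    unsigned ub = upper_byte(addr);
    if (ub < 0x08) return "below text (unlabeled)";
    if (ub < 0x40) return "text/data/low heap";
    if (ub < 0x80) return "DLLs/high heap";
    if (ub < 0xC0) return "stack";
    return "above stack (unlabeled)";
}
```

For example, `region_band(0xbffffc78)` places `$esp` in the stack band, while `p3` at 0x500b5008 falls in the band that holds the DLLs and the upper heap.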
Internet Worm and IM War

November, 1988
 Internet Worm attacks thousands of Internet hosts.
 How did it happen?

July, 1999
 Microsoft launches MSN Messenger (instant messaging system).
 Messenger clients can access popular AOL Instant Messaging Service
(AIM) servers
[Figure: AIM clients and an MSN client connect to the AIM server; the MSN client also connects to the MSN server]
Internet Worm and IM War (cont.)

August 1999
 Mysteriously, Messenger clients can no longer access AIM servers.
 Microsoft and AOL begin the IM war:
 AOL changes server to disallow Messenger clients
 Microsoft makes changes to clients to defeat AOL changes.
 At least 13 such skirmishes.
 How did it happen?


The Internet Worm and AOL/Microsoft War were both based
on stack buffer overflow exploits!


Many Unix functions do not check argument sizes, which allows target buffers to overflow.
String Library Code
 Implementation of Unix function gets
 No way to specify limit on number of characters to read

/* Get string from stdin */
char *gets(char *dest)
{
    int c = getchar();
    char *p = dest;
    /* Copy until end of line or end of input; dest size is never checked */
    while (c != EOF && c != '\n') {
        *p++ = c;
        c = getchar();
    }
    *p = '\0';
    return dest;
}

 Similar problems with other Unix functions
 strcpy: Copies string of arbitrary length
 scanf, fscanf, sscanf, when given %s conversion specification
Vulnerable Buffer Code
/* Echo Line */
void echo()
{
    char buf[4];  /* Way too small! */
    gets(buf);
    puts(buf);
}

int main()
{
    printf("Type a string:");
    echo();
    return 0;
}
Buffer Overflow Executions
unix>./bufdemo
Type a string:123
123
unix>./bufdemo
Type a string:12345
Segmentation Fault
unix>./bufdemo
Type a string:12345678
Segmentation Fault
Buffer Overflow Stack
[Figure: the stack frame for main sits above the stack frame for echo. Within echo's frame, from higher to lower addresses: Return Address, Saved %ebp (where %ebp points), then buf as bytes [3][2][1][0].]

/* Echo Line */
void echo()
{
    char buf[4];  /* Way too small! */
    gets(buf);
    puts(buf);
}

echo:
    pushl %ebp              # Save %ebp on stack
    movl %esp,%ebp
    subl $20,%esp           # Allocate space on stack
    pushl %ebx              # Save %ebx
    addl $-12,%esp          # Allocate space on stack
    leal -4(%ebp),%ebx      # Compute buf as %ebp-4
    pushl %ebx              # Push buf on stack
    call gets               # Call gets
    . . .
Buffer Overflow
Stack Example
unix> gdb bufdemo
(gdb) break echo
Breakpoint 1 at 0x8048583
(gdb) run
Breakpoint 1, 0x8048583 in echo ()
(gdb) print /x *(unsigned *)$ebp
$1 = 0xbffff8f8
(gdb) print /x *((unsigned *)$ebp + 1)
$3 = 0x804864d
Before call to gets (bytes listed from most to least significant):

Stack frame for main
    Return Address        08 04 86 4d
    Saved %ebp            bf ff f8 f8    (%ebp = 0xbffff8d8 points here)
    buf [3][2][1][0]      xx xx xx xx
Stack frame for echo
8048648: call 804857c <echo>
804864d: mov 0xffffffe8(%ebp),%ebx # Return Point
Buffer Overflow Example #1
Input = “123”

Stack frame for main
    Return Address        08 04 86 4d
    Saved %ebp            bf ff f8 f8    (%ebp = 0xbffff8d8 points here)
    buf [3][2][1][0]      00 33 32 31
Stack frame for echo

No Problem
Buffer Overflow Stack Example #2
Input = “12345”

Stack frame for main
    Return Address        08 04 86 4d
    Saved %ebp            bf ff 00 35    (%ebp = 0xbffff8d8 points here)
    buf [3][2][1][0]      34 33 32 31
Stack frame for echo

Saved value of %ebp set to 0xbfff0035
Bad news when later attempt to restore %ebp

echo code:
    8048592: push %ebx
    8048593: call 80483e4 <_init+0x50>     # gets
    8048598: mov 0xffffffe8(%ebp),%ebx
    804859b: mov %ebp,%esp
    804859d: pop %ebp                      # %ebp gets set to invalid value
    804859e: ret
Buffer Overflow Stack Example #3
Input = “12345678”

Stack frame for main
    Return Address        08 04 86 00
    Saved %ebp            38 37 36 35    (%ebp = 0xbffff8d8 points here)
    buf [3][2][1][0]      34 33 32 31
Stack frame for echo

%ebp and return address corrupted
Invalid address: no longer pointing to desired return point
8048648: call 804857c <echo>
804864d: mov 0xffffffe8(%ebp),%ebx # Return Point
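The byte values in the three overflow examples follow from little-endian storage: gets writes the input characters plus a terminating zero upward from buf[0], straight into the saved %ebp and then the return address. The sketch below reproduces that arithmetic on a plain 12-byte array instead of a real stack, so it is safe to run. The type `frame_model` and all function names are hypothetical; the initial values 0xbffff8f8 and 0x0804864d are the ones printed by gdb on the earlier slide.

```c
#include <stdint.h>
#include <string.h>

/* Byte-level model of echo's frame, lowest address first:
   bytes 0-3 buf, bytes 4-7 saved %ebp, bytes 8-11 return address. */
typedef struct { uint8_t bytes[12]; } frame_model;

/* Store and load 32-bit values in little-endian byte order, as IA32 does. */
void store_le32(uint8_t *p, uint32_t v)
{
    for (int i = 0; i < 4; i++)
        p[i] = (uint8_t)(v >> (8 * i));
}

uint32_t load_le32(const uint8_t *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* Frame contents before the call to gets. */
frame_model frame_before_gets(void)
{
    frame_model f;
    memset(f.bytes, 0, 4);                  /* buf contents are arbitrary */
    store_le32(f.bytes + 4, 0xbffff8f8u);   /* saved %ebp */
    store_le32(f.bytes + 8, 0x0804864du);   /* return address */
    return f;
}

/* Mimic gets: copy the input and its '\0' starting at buf[0] without
   checking buf's size. The copy is clipped to the model's 12 bytes
   only so that this sketch itself stays in bounds. */
void model_gets(frame_model *f, const char *input)
{
    size_t n = strlen(input) + 1;
    if (n > sizeof f->bytes)
        n = sizeof f->bytes;
    memcpy(f->bytes, input, n);
}

uint32_t saved_ebp(const frame_model *f)   { return load_le32(f->bytes + 4); }
uint32_t return_addr(const frame_model *f) { return load_le32(f->bytes + 8); }
```

With input "12345" the model leaves the return address intact but turns the saved %ebp into 0xbfff0035; with "12345678" the saved %ebp becomes 0x38373635 and the terminating zero changes the return address to 0x08048600.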
Malicious Use of Buffer Overflow
void foo(){
    bar();
    ...
}

void bar() {
    char buf[64];
    gets(buf);
    ...
}

[Figure: stack after call to gets(). The foo stack frame ends with return address A. In the bar stack frame, the data written by gets() starts at buf (address B) with the exploit code, continues with padding, and ends by overwriting the return address with B.]

 Input string contains byte representation of executable code
 Overwrite return address with address of buffer
 When bar() executes ret, will jump to exploit code
Exploits Based on Buffer Overflows


Buffer overflow bugs allow remote machines to execute
arbitrary code on victim machines.
Internet worm
 Early versions of the finger server (fingerd) used gets() to read the
argument sent by the client:
 finger droh@cs.cmu.edu
 Worm attacked fingerd server by sending phony argument:
 finger “exploit-code padding new-return-address”
 exploit code: executed a root shell on the victim machine with a
direct TCP connection to the attacker.
IM War
 AOL exploited existing buffer overflow bug in AIM clients
 exploit code: returned 4-byte signature (the bytes at some location in
the AIM client) to server.
 When Microsoft changed code to match signature, AOL changed
signature location.

Date: Wed, 11 Aug 1999 11:30:57 -0700 (PDT)
From: Phil Bucking <philbucking@yahoo.com>
Subject: AOL exploiting buffer overrun bug in their own software!
To: rms@pharlap.com

Mr. Smith,

I am writing you because I have discovered something that I think you
might find interesting because you are an Internet security expert with
experience in this area. I have also tried to contact AOL but received
no response.
I am a developer who has been working on a revolutionary new instant
messaging client that should be released later this year.
...
It appears that the AIM client has a buffer overrun bug. By itself
this might not be the end of the world, as MS surely has had its share.
But AOL is now *exploiting their own buffer overrun bug* to help in
its efforts to block MS Instant Messenger.
....
Since you have significant credibility with the press I hope that you
can use this information to help inform people that behind AOL's
friendly exterior they are nefariously compromising peoples' security.
Sincerely,
Phil Bucking
Founder, Bucking Consulting
philbucking@yahoo.com
It was later determined that this email
originated from within Microsoft!
Code Red Worm

History
 June 18, 2001. Microsoft announces buffer overflow vulnerability
in IIS Internet server
 July 19, 2001. over 250,000 machines infected by new virus in 9
hours
 White house must change its IP address. Pentagon shut down
public WWW servers for day

When We Set Up CS:APP Web Site
 Received strings of form
GET
/default.ida?NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN....N
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN%u9090%u6858%ucbd
3%u7801%u9090%u6858%ucbd3%u7801%u9090%u6858%ucbd3%u7801%u
9090%u9090%u8190%u00c3%u0003%u8b00%u531b%u53ff%u0078%u000
0%u00=a
HTTP/1.0" 400 325 "-" "-"
Code Red Exploit Code
 Starts 100 threads running
 Spread self
Generate random IP addresses & send attack string
 Between 1st & 19th of month
 Attack www.whitehouse.gov
 Send 98,304 packets; sleep for 4-1/2 hours; repeat
– Denial of service attack
 Between 21st & 27th of month
 Deface server’s home page
 After waiting 2 hours

Code Red Effects

Later Version Even More Malicious
 Code Red II
 As of April, 2002, over 18,000 machines infected
 Still spreading

Paved Way for NIMDA
 Variety of propagation methods
 One was to exploit vulnerabilities left behind by Code Red II
Avoiding Overflow Vulnerability
/* Echo Line */
void echo()
{
char buf[4]; /* Way too small! */
fgets(buf, 4, stdin);
puts(buf);
}

Use Library Routines that Limit String Lengths
 fgets instead of gets
 strncpy instead of strcpy
 Don’t use scanf with %s conversion specification

Use fgets to read the string
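Two details of the length-limited routines are easy to miss: fgets keeps the newline when the line fits, and strncpy does not write a terminating zero when the source is too long. The sketch below wraps both in small helpers; the names `read_line_bounded` and `copy_bounded` are hypothetical and are not part of any library.

```c
#include <stdio.h>
#include <string.h>

/* Read at most size-1 characters of one line from in and drop the
   trailing newline if it was read. Returns 1 on success, 0 on end of
   input or error. Characters that did not fit stay in the stream. */
int read_line_bounded(FILE *in, char *buf, size_t size)
{
    if (size == 0 || fgets(buf, (int)size, in) == NULL)
        return 0;
    buf[strcspn(buf, "\n")] = '\0';
    return 1;
}

/* Copy src into dst, truncating to size-1 characters. strncpy leaves
   dst unterminated when src is too long, so terminate explicitly. */
void copy_bounded(char *dst, size_t size, const char *src)
{
    if (size == 0)
        return;
    strncpy(dst, src, size - 1);
    dst[size - 1] = '\0';
}
```

Called as `read_line_bounded(stdin, buf, sizeof buf)` with the 4-byte buf of the echo example, an input of 12345678 stores only "123"; passing `sizeof buf` instead of a literal 4 keeps the limit correct if the buffer size changes.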
Final Observations

Memory Layout
 OS/machine dependent (including kernel version)
 Basic partitioning: stack/data/text/heap/DLL found in most
machines

Type Declarations in C
 Notation obscure, but very systematic

Working with Strange Code
 Important to analyze nonstandard cases
E.g., what happens when stack corrupted due to buffer
overflow
 Helps to step through with GDB

Machine Programming – x86-64
x86-64 Integer Registers
64-bit   32-bit      64-bit   32-bit
%rax     %eax        %r8      %r8d
%rbx     %ebx        %r9      %r9d
%rcx     %ecx        %r10     %r10d
%rdx     %edx        %r11     %r11d
%rsi     %esi        %r12     %r12d
%rdi     %edi        %r13     %r13d
%rsp     %esp        %r14     %r14d
%rbp     %ebp        %r15     %r15d

 Twice the number of registers
 Accessible as 8, 16, 32, 64 bits
x86-64 Integer Registers
%rax   Return value        %r8    Argument #5
%rbx   Callee saved        %r9    Argument #6
%rcx   Argument #4         %r10   Callee saved
%rdx   Argument #3         %r11   Used for linking
%rsi   Argument #2         %r12   C: Callee saved
%rdi   Argument #1         %r13   Callee saved
%rsp   Stack pointer       %r14   Callee saved
%rbp   Callee saved        %r15   Callee saved
x86-64 Registers

Arguments passed to functions via registers
 If more than 6 integral parameters, then pass rest on stack
 These registers can be used as caller-saved as well

All references to stack frame via stack pointer
 Eliminates need to update %ebp/%rbp

Other Registers
 6+1 callee saved
 2 or 3 have special uses
x86-64 Long Swap
void swap(long *xp, long *yp)
{
long t0 = *xp;
long t1 = *yp;
*xp = t1;
*yp = t0;
}

swap:
    movq (%rdi), %rdx
    movq (%rsi), %rax
    movq %rax, (%rdi)
    movq %rdx, (%rsi)
    ret

Operands passed in registers
 First (xp) in %rdi, second (yp) in %rsi
 64-bit pointers
No stack operations required (except ret)
Avoiding stack
 Can hold all local information in registers
x86-64 Locals in the Red Zone
/* Swap, using local array */
void swap_a(long *xp, long *yp)
{
volatile long loc[2];
loc[0] = *xp;
loc[1] = *yp;
*xp = loc[1];
*yp = loc[0];
}

swap_a:
    movq (%rdi), %rax
    movq %rax, -24(%rsp)
    movq (%rsi), %rax
    movq %rax, -16(%rsp)
    movq -16(%rsp), %rax
    movq %rax, (%rdi)
    movq -24(%rsp), %rax
    movq %rax, (%rsi)
    ret

Avoiding Stack Pointer Change
 Can hold all information within small window beyond stack pointer

Volatile tells the compiler that the value of the variable may change at any time, without any action being taken by the code the compiler finds nearby. Useful when working with I/O devices and interrupts.

Stack layout relative to %rsp:
    %rsp    rtn Ptr
    −8      unused
    −16     loc[1]
    −24     loc[0]
x86-64 NonLeaf without Stack Frame
long scount = 0;

/* Swap a[i] & a[i+1] */
void swap_ele_se(long a[], int i)
{
    swap(&a[i], &a[i+1]);
    scount++;
}

No values held while swap being invoked
No callee save registers needed

swap_ele_se:
    movslq %esi,%rsi              # Sign extend i
    leaq (%rdi,%rsi,8), %rdi      # &a[i]
    leaq 8(%rdi), %rsi            # &a[i+1]
    call swap                     # swap()
    incq scount(%rip)             # scount++;
    ret
x86-64 Call using Jump
long scount = 0;

/* Swap a[i] & a[i+1] */
void swap_ele(long a[], int i)
{
    swap(&a[i], &a[i+1]);
}

When swap executes ret, it will return from swap_ele
Possible since swap is a “tail call” (no instructions afterwards)

swap_ele:
    movslq %esi,%rsi              # Sign extend i
    leaq (%rdi,%rsi,8), %rdi      # &a[i]
    leaq 8(%rdi), %rsi            # &a[i+1]
    jmp swap                      # swap()
x86-64 Stack Frame Example
long sum = 0;

/* Swap a[i] & a[i+1] */
void swap_ele_su(long a[], int i)
{
    swap(&a[i], &a[i+1]);
    sum += a[i];
}

Keeps values of a and i in callee save registers
Must set up stack frame to save these registers

swap_ele_su:
    movq %rbx, -16(%rsp)
    movslq %esi,%rbx
    movq %r12, -8(%rsp)
    movq %rdi, %r12
    leaq (%rdi,%rbx,8), %rdi
    subq $16, %rsp
    leaq 8(%rdi), %rsi
    call swap
    movq (%r12,%rbx,8), %rax
    addq %rax, sum(%rip)
    movq (%rsp), %rbx
    movq 8(%rsp), %r12
    addq $16, %rsp
    ret
Understanding x86-64 Stack Frame
swap_ele_su:
    movq %rbx, -16(%rsp)          # Save %rbx
    movslq %esi,%rbx              # Extend & save i
    movq %r12, -8(%rsp)           # Save %r12
    movq %rdi, %r12               # Save a
    leaq (%rdi,%rbx,8), %rdi      # &a[i]
    subq $16, %rsp                # Allocate stack frame
    leaq 8(%rdi), %rsi            # &a[i+1]
    call swap                     # swap()
    movq (%r12,%rbx,8), %rax      # a[i]
    addq %rax, sum(%rip)          # sum += a[i]
    movq (%rsp), %rbx             # Restore %rbx
    movq 8(%rsp), %r12            # Restore %r12
    addq $16, %rsp                # Deallocate stack frame
    ret
Understanding x86-64 Stack Frame
Stack for swap_ele_su before "subq $16, %rsp" (saves go into the red zone):
    %rsp    rtn addr
    −8      %r12
    −16     %rbx

Stack after "subq $16, %rsp" (frame allocated):
    +16     rtn addr
    +8      %r12
    %rsp    %rbx
Interesting Features of Stack Frame

Allocate entire frame at once
 All stack accesses can be relative to %rsp
 Do by decrementing stack pointer
 Can delay allocation, since safe to temporarily use red zone

Simple deallocation
 Increment stack pointer
 No base/frame pointer needed
x86-64 Procedure Summary

Heavy use of registers
 Parameter passing
 More temporaries since more registers

Minimal use of stack
 Sometimes none
 Allocate/deallocate entire block

Many tricky optimizations
 What kind of stack frame to use
 Calling with jump
 Various allocation techniques
IA32 Floating Point (x87)

History
 8086: first computer to implement IEEE FP (separate 8087 FPU, floating point unit)
 486: merged FPU and Integer Unit onto one chip
 Becoming obsolete with x86-64

Summary
 Hardware to add, multiply, and divide
 Floating point data registers
 Various control & status registers

[Figure: an instruction decoder and sequencer drives the Integer Unit and the FPU, both connected to Memory]

Floating Point Formats
 single precision (C float): 32 bits
 double precision (C double): 64 bits
 extended precision (C long double): 80 bits
FPU Data Register Stack (x87)

FPU register format (80 bit extended precision)
    bit 79        s (sign)
    bits 78-64    exp
    bits 63-0     frac

FPU registers
 8 registers %st(0) - %st(7)
 Logically form stack
 Top: %st(0)
 Bottom disappears (drops out) after too many pushes

[Figure: register stack with %st(3), %st(2), %st(1) above the “Top” entry %st(0)]
FPU instructions (x87)

Large number of floating point instructions and formats
 ~50 basic instruction types
 load, store, add, multiply
 sin, cos, tan, arctan, and log
 Often slower than math lib

Sample instructions:
Instruction    Effect                           Description
fldz           push 0.0                         Load zero
flds Addr      push Mem[Addr]                   Load single precision real
fmuls Addr     %st(0) ← %st(0)*M[Addr]          Multiply
faddp          %st(1) ← %st(0)+%st(1); pop      Add and pop
FP Code Example (x87)

Compute inner product of two vectors
 Single precision arithmetic
 Common computation

float ipf (float x[],
           float y[],
           int n)
{
    int i;
    float result = 0.0;
    for (i = 0; i < n; i++)
        result += x[i]*y[i];
    return result;
}

    pushl %ebp                 # setup
    movl %esp,%ebp
    pushl %ebx
    movl 8(%ebp),%ebx          # %ebx=&x
    movl 12(%ebp),%ecx         # %ecx=&y
    movl 16(%ebp),%edx         # %edx=n
    fldz                       # push +0.0
    xorl %eax,%eax             # i=0
    cmpl %edx,%eax             # if i>=n done
    jge .L3
.L5:
    flds (%ebx,%eax,4)         # push x[i]
    fmuls (%ecx,%eax,4)        # st(0)*=y[i]
    faddp                      # st(1)+=st(0); pop
    incl %eax                  # i++
    cmpl %edx,%eax             # if i<n repeat
    jl .L5
.L3:
    movl -4(%ebp),%ebx         # finish
    movl %ebp, %esp
    popl %ebp
    ret                        # st(0) = result
Inner Product Stack Trace
eax = i, ebx = *x, ecx = *y

Step                           %st(1)        %st(0)
Initialization
1. fldz                                      0.0
Iteration 0
2. flds (%ebx,%eax,4)          0.0           x[0]
3. fmuls (%ecx,%eax,4)         0.0           x[0]*y[0]
4. faddp                                     0.0+x[0]*y[0]
Iteration 1
5. flds (%ebx,%eax,4)          x[0]*y[0]     x[1]
6. fmuls (%ecx,%eax,4)         x[0]*y[0]     x[1]*y[1]
7. faddp                                     x[0]*y[0]+x[1]*y[1]
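The stack trace can be replayed with a small software model of the register stack. This is a sketch only: the type `x87_model` and the function names are hypothetical, the registers are modeled as C float values although the real FPU registers hold 80-bit extended precision, and overflow is modeled as on the earlier slide (the bottom entry drops out).

```c
/* Software model of the x87 register stack; st[0] plays %st(0). */
typedef struct {
    float st[8];
    int depth;      /* number of entries currently in use */
} x87_model;

/* flds / fldz: push a value; the bottom entry drops out when full. */
void fld(x87_model *m, float v)
{
    for (int i = 7; i > 0; i--)
        m->st[i] = m->st[i - 1];
    m->st[0] = v;
    if (m->depth < 8)
        m->depth++;
}

/* fmuls Addr: %st(0) <- %st(0) * M[Addr] */
void fmul_mem(x87_model *m, float v)
{
    m->st[0] *= v;
}

/* faddp: %st(1) <- %st(0) + %st(1), then pop */
void faddp(x87_model *m)
{
    m->st[1] += m->st[0];
    for (int i = 0; i < 7; i++)
        m->st[i] = m->st[i + 1];
    m->depth--;
}

/* Same instruction sequence as the loop in the assembly listing. */
float ipf_model(const float x[], const float y[], int n)
{
    x87_model m = { {0}, 0 };
    fld(&m, 0.0f);              /* fldz */
    for (int i = 0; i < n; i++) {
        fld(&m, x[i]);          /* flds (%ebx,%eax,4) */
        fmul_mem(&m, y[i]);     /* fmuls (%ecx,%eax,4) */
        faddp(&m);              /* faddp */
    }
    return m.st[0];             /* result is left in %st(0) */
}
```

After every faddp the model is back to a single live entry, which is why the running sum always sits in %st(0) at the top of each iteration.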
Vector Instructions: SSE Family


SSE: Streaming SIMD Extensions
SIMD (single-instruction, multiple data) vector instructions
 New data types, registers, operations
 Parallel operation on small (length 2-8) vectors of integers or floats
 Example: [Figure: “4-way” vector add (+) and multiply (x)]

Floating point vector instructions
 Available with Intel’s SSE (streaming SIMD extensions) family
 SSE starting with Pentium III: 4-way single precision
 SSE2 starting with Pentium 4: 2-way double precision
 All x86-64 have SSE3 (superset of SSE2, SSE)
Intel Architectures (Focus Floating Point)
Processors (in time order)   Architectures      Features
8086                         x86-16
286
386                          x86-32
486
Pentium
Pentium MMX                  MMX
Pentium III                  SSE                4-way single precision fp
Pentium 4                    SSE2               2-way double precision fp
Pentium 4E                   SSE3
Pentium 4F                   x86-64 / em64t
Core 2 Duo                   SSE4

Our focus: SSE3 used for scalar (non-vector) floating point
SSE3 Registers


All caller saved
%xmm0 for floating point return value
128 bit = 2 doubles = 4 singles

%xmm0   Argument #1      %xmm8
%xmm1   Argument #2      %xmm9
%xmm2   Argument #3      %xmm10
%xmm3   Argument #4      %xmm11
%xmm4   Argument #5      %xmm12
%xmm5   Argument #6      %xmm13
%xmm6   Argument #7      %xmm14
%xmm7   Argument #8      %xmm15
SSE3 Registers


Different data types and associated instructions (each register is 128 bits)

Integer vectors:
 16-way byte
 8-way 2 bytes
 4-way 4 bytes

Floating point vectors:
 4-way single
 2-way double

Floating point scalars (held at the LSB end of the register):
 single
 double
SSE3 Instructions: Examples

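Single precision 4-way vector add: addps %xmm0, %xmm1
[Figure: all four single precision slots of %xmm0 are added to the matching slots of %xmm1]

Single precision scalar add: addss %xmm0, %xmm1
[Figure: only the lowest single precision slots of %xmm0 and %xmm1 are added]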
SSE3 Instruction Names
                        single precision    double precision
packed (vector)         addps               addpd
single slot (scalar)    addss               addsd               (this course)
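The difference between the packed and the single slot forms can be sketched with a plain C model of a 128-bit register holding four floats. The type `xmm_model` and the function names are hypothetical; the functions only imitate the data flow of the instructions (destination is the second operand, as in the AT&T syntax used on these slides) and are not the compiler intrinsics.

```c
/* Model of one %xmm register as four single precision slots;
   lane[0] is the least significant slot. */
typedef struct { float lane[4]; } xmm_model;

/* addps src, dst: dst <- dst + src in all four slots. */
xmm_model addps(xmm_model src, xmm_model dst)
{
    for (int i = 0; i < 4; i++)
        dst.lane[i] += src.lane[i];
    return dst;
}

/* addss src, dst: only the lowest slot is added; the upper three
   slots of dst are left unchanged. */
xmm_model addss(xmm_model src, xmm_model dst)
{
    dst.lane[0] += src.lane[0];
    return dst;
}
```

Scalar floating point code uses only the lowest slot, which is why the scalar forms are the ones that appear in compiler output for ordinary float and double arithmetic.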
SSE3 Basic Instructions

Moves
Single    Double    Effect
movss     movsd     D ← S
 Usual operand form: reg → reg, reg → mem, mem → reg

Arithmetic
Single    Double    Effect
addss     addsd     D ← D + S
subss     subsd     D ← D – S
mulss     mulsd     D ← D x S
divss     divsd     D ← D / S
maxss     maxsd     D ← max(D,S)
minss     minsd     D ← min(D,S)
sqrtss    sqrtsd    D ← sqrt(S)
x86-64 FP Code Example

Compute inner product of two vectors
 Single precision arithmetic
 Uses SSE3 instructions

float ipf (float x[],
           float y[],
           int n) {
    int i;
    float result = 0.0;
    for (i = 0; i < n; i++)
        result += x[i]*y[i];
    return result;
}

ipf:
    xorps %xmm1, %xmm1             # result = 0.0
    xorl %ecx, %ecx                # i = 0
    jmp .L8                        # goto middle
.L10:                              # loop:
    movslq %ecx,%rax               # icpy = i
    incl %ecx                      # i++
    movss (%rsi,%rax,4), %xmm0     # t = y[icpy]
    mulss (%rdi,%rax,4), %xmm0     # t *= x[icpy]
    addss %xmm0, %xmm1             # result += t
.L8:                               # middle:
    cmpl %edx, %ecx                # i:n
    jl .L10                        # if < goto loop
    movaps %xmm1, %xmm0            # return result
    ret
SSE3 Conversion Instructions

Conversions
 Same operand forms as moves

Instruction    Description
cvtss2sd       single → double
cvtsd2ss       double → single
cvtsi2ss       int → single
cvtsi2sd       int → double
cvtsi2ssq      quad int → single
cvtsi2sdq      quad int → double
cvttss2si      single → int (truncation)
cvttsd2si      double → int (truncation)
cvttss2siq     single → quad int (truncation)
cvttsd2siq     double → quad int (truncation)
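In C these conversions are written as casts. The sketch below pairs a few casts with the instruction a compiler for x86-64 would typically choose; that pairing is an assumption about compiler output, the function names are hypothetical, and `long` is assumed to be 64 bits as on x86-64 Linux.

```c
/* Each function is a single cast; the comment names the conversion
   instruction from the table that typically implements it. */
double single_to_double(float f)  { return (double)f; }  /* cvtss2sd   */
float  double_to_single(double d) { return (float)d; }   /* cvtsd2ss   */
float  int_to_single(int i)       { return (float)i; }   /* cvtsi2ss   */
double long_to_double(long l)     { return (double)l; }  /* cvtsi2sdq  */
int    double_to_int(double d)    { return (int)d; }     /* cvttsd2si  */
long   single_to_long(float f)    { return (long)f; }    /* cvttss2siq */
```

The float-to-integer casts truncate toward zero, so `double_to_int(2.9)` gives 2 and `double_to_int(-2.9)` gives -2, matching the "(truncation)" note in the table.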
x86-64 FP Code Example
double funct(double a, float x, double b, int i)
{
    return a*x - b/i;
}

Argument    Register    Type
a           %xmm0       double
x           %xmm1       float
b           %xmm2       double
i           %edi        int

funct:
    cvtss2sd %xmm1, %xmm1      # %xmm1 = (double) x
    mulsd %xmm0, %xmm1         # %xmm1 = a*x
    cvtsi2sd %edi, %xmm0       # %xmm0 = (double) i
    divsd %xmm0, %xmm2         # %xmm2 = b/i
    movsd %xmm1, %xmm0         # %xmm0 = a*x
    subsd %xmm2, %xmm0         # return a*x - b/i
    ret
Constants
double cel2fahr(double temp)
{
    return 1.8 * temp + 32.0;
}

# Constant declarations
.LC2:
    .long 3435973837           # Low order four bytes of 1.8
    .long 1073532108           # High order four bytes of 1.8
.LC4:
    .long 0                    # Low order four bytes of 32.0
    .long 1077936128           # High order four bytes of 32.0

# Code
cel2fahr:
    mulsd .LC2(%rip), %xmm0    # Multiply by 1.8
    addsd .LC4(%rip), %xmm0    # Add 32.0
    ret

Here: Constants in decimal format
 compiler decision
 hex more readable
Checking Constant

Previous slide: Claim
.LC4:
    .long 0                    # Low order four bytes of 32.0
    .long 1077936128           # High order four bytes of 32.0

Convert to hex format:
.LC4:
    .long 0x0                  # Low order four bytes of 32.0
    .long 0x40400000           # High order four bytes of 32.0

Convert to double (blackboard?):
 Remember: e = 11 exponent bits, bias = 2^(e-1) - 1 = 1023
 Sign bit 0, exponent field 0x404 = 1028, fraction 0, so the value is 1.0 x 2^(1028-1023) = 2^5 = 32.0
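The check can also be done mechanically by gluing the two .long values back into one 64-bit pattern and reinterpreting it as a double. This is a minimal sketch, assuming the platform's double is the 64-bit IEEE format with the same byte order as its integers; the function names are hypothetical.

```c
#include <stdint.h>
#include <string.h>

/* Rebuild a double from its high and low 32-bit words. */
double double_from_words(uint32_t high, uint32_t low)
{
    uint64_t bits = ((uint64_t)high << 32) | low;
    double d;
    memcpy(&d, &bits, sizeof d);   /* reinterpret the bit pattern */
    return d;
}

/* 11-bit exponent field, taken from bits 30..20 of the high word. */
int biased_exponent(uint32_t high)
{
    return (int)((high >> 20) & 0x7ff);
}
```

Here `double_from_words(1077936128, 0)` yields 32.0 and `double_from_words(1073532108, 3435973837)` yields the double closest to 1.8; `biased_exponent(0x40400000)` is 1028, and subtracting the bias 1023 gives the exponent 5.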
Comments

SSE3 floating point
 Uses lower ½ (double) or ¼ (single) of vector
 Finally departure from awkward x87
 Assembly very similar to integer code

x87 still supported
 Even mixing with SSE3 possible
 Not recommended

For highest floating point performance
 Vectorization a must (but not in this course)
 See next slide
Vector Instructions

Starting with version 4.1.1, gcc can autovectorize to some extent
 -O3 or -ftree-vectorize
 No speed-up guaranteed
 Very limited
 icc as of now much better
 Fish machines: gcc 3.4

For highest performance vectorize yourself using intrinsics
 Intrinsics = C interface to vector instructions
 Learn in 18-645
Future
 Intel AVX announced: 4-way double, 8-way single