Machine Programming – IA32 memory layout and buffer overflow
CENG331: Introduction to Computer Systems
7th Lecture
Instructor: Erol Sahin
Acknowledgement: Most of the slides are adapted from the ones prepared by R.E. Bryant, D.R. O’Hallaron of Carnegie-Mellon Univ.

Linux Memory Layout
  [Figure: address space, labelled by the upper 2 hex digits of the address (Red Hat v. 6.2, ~1920 MB memory limit)]
    FF
    C0
    BF   Stack
    80
    7F   Heap
    40   DLLs
    3F   Heap
         Data
    08   Text
    00
  Stack
    Runtime stack (8 MB limit)
  Heap
    Dynamically allocated storage
    Allocated by calls to malloc, calloc, new
  DLLs
    Dynamically Linked Libraries
    Library routines (e.g., printf, malloc)
    Linked into object code when first executed
  Data
    Statically allocated data
    E.g., arrays & strings declared in code
  Text
    Executable machine instructions
    Read-only

Linux Memory Allocation
  [Figure: regions present at each stage, highest address first]
    Initially:   Stack | Data | Text
    Linked:      Stack | DLLs | Data | Text
    Some Heap:   Stack | DLLs | Heap | Data | Text
    More Heap:   Stack | Heap | DLLs | Heap | Data | Text

Text & Stack Example
    (gdb) break main
    (gdb) run
    Breakpoint 1, 0x804856f in main ()
    (gdb) print $esp
    $3 = (void *) 0xbffffc78
  Main
    Address 0x804856f should be read 0x0804856f
  Stack
    Address 0xbffffc78

Dynamic Linking Example
    (gdb) print malloc
    $1 = {<text variable, no debug info>} 0x8048454 <malloc>
    (gdb) run
    Program exited normally.
    (gdb) print malloc
    $2 = {void *(unsigned int)} 0x40006240 <malloc>
  Initially
    Code in text segment that invokes dynamic linker
    Address 0x8048454 should be read 0x08048454
  Final
    Code in DLL region

Memory Allocation Example
    char big_array[1<<24];   /* 16 MB */
    char huge_array[1<<28];  /* 256 MB */
    int beyond;
    char *p1, *p2, *p3, *p4;

    int useless() { return 0; }

    int main()
    {
        p1 = malloc(1 <<28);  /* 256 MB */
        p2 = malloc(1 << 8);  /* 256 B */
        p3 = malloc(1 <<28);  /* 256 MB */
        p4 = malloc(1 << 8);  /* 256 B */
        /* Some print statements ... */
    }

Example Addresses
    $esp             0xbffffc78
    p3               0x500b5008
    p1               0x400b4008
    Final malloc     0x40006240
    p4               0x1904a640
    p2               0x1904a538
    beyond           0x1904a524
    big_array        0x1804a520
    huge_array       0x0804a510
    main()           0x0804856f
    useless()        0x08048560
    Initial malloc   0x08048454

Internet Worm and IM War
  November, 1988
    Internet Worm attacks thousands of Internet hosts.
    How did it happen?
  July, 1999
    Microsoft launches MSN Messenger (instant messaging system).
    Messenger clients can access popular AOL Instant Messaging Service (AIM) servers
  [Figure: AIM clients and an MSN client connected to the AIM server; MSN client also connected to the MSN server]

Internet Worm and IM War (cont.)
  August 1999
    Mysteriously, Messenger clients can no longer access AIM servers.
    Microsoft and AOL begin the IM war:
      AOL changes server to disallow Messenger clients
      Microsoft makes changes to clients to defeat AOL changes.
      At least 13 such skirmishes.
    How did it happen?
  The Internet Worm and AOL/Microsoft War were both based on stack buffer overflow exploits!
    Many Unix functions do not check argument sizes.
    This allows target buffers to overflow.

String Library Code
  Implementation of Unix function gets
    No way to specify limit on number of characters to read

    /* Get string from stdin */
    char *gets(char *dest)
    {
        int c = getchar();
        char *p = dest;
        while (c != EOF && c != '\n') {
            *p++ = c;
            c = getchar();
        }
        *p = '\0';
        return dest;
    }

  Similar problems with other Unix functions
    strcpy: Copies string of arbitrary length
    scanf, fscanf, sscanf, when given %s conversion specification

Vulnerable Buffer Code
    /* Echo Line */
    void echo()
    {
        char buf[4];  /* Way too small! */
        gets(buf);
        puts(buf);
    }

    int main()
    {
        printf("Type a string:");
        echo();
        return 0;
    }

Buffer Overflow Executions
    unix>./bufdemo
    Type a string:123
    123

    unix>./bufdemo
    Type a string:12345
    Segmentation Fault

    unix>./bufdemo
    Type a string:12345678
    Segmentation Fault

Buffer Overflow Stack
  [Figure: stack layout, highest address first]
    Stack Frame for main
    Return Address
    Saved %ebp         <- %ebp
    [3][2][1][0]       buf
    Stack Frame for echo

    /* Echo Line */
    void echo()
    {
        char buf[4];  /* Way too small! */
        gets(buf);
        puts(buf);
    }

    echo:
        pushl %ebp            # Save %ebp on stack
        movl %esp,%ebp
        subl $20,%esp         # Allocate space on stack
        pushl %ebx            # Save %ebx
        addl $-12,%esp        # Allocate space on stack
        leal -4(%ebp),%ebx    # Compute buf as %ebp-4
        pushl %ebx            # Push buf on stack
        call gets             # Call gets
        . . .

Buffer Overflow Stack Example
    unix> gdb bufdemo
    (gdb) break echo
    Breakpoint 1 at 0x8048583
    (gdb) run
    Breakpoint 1, 0x8048583 in echo ()
    (gdb) print /x *(unsigned *)$ebp
    $1 = 0xbffff8f8
    (gdb) print /x *((unsigned *)$ebp + 1)
    $3 = 0x804864d

  Before call to gets
    Stack Frame for main
    Return Address     08 04 86 4d
    Saved %ebp         bf ff f8 f8    <- %ebp (0xbffff8d8)
    [3][2][1][0] buf   xx xx xx xx
    Stack Frame for echo

    8048648: call 804857c <echo>
    804864d: mov 0xffffffe8(%ebp),%ebx   # Return Point

Buffer Overflow Example #1
  Input = "123"
    Stack Frame for main
    Return Address     08 04 86 4d
    Saved %ebp         bf ff f8 f8    <- %ebp (0xbffff8d8)
    [3][2][1][0] buf   00 33 32 31
    Stack Frame for echo
  No Problem

Buffer Overflow Stack Example #2
  Input = "12345"
    Stack Frame for main
    Return Address     08 04 86 4d
    Saved %ebp         bf ff 00 35    <- %ebp (0xbffff8d8)
    [3][2][1][0] buf   34 33 32 31
    Stack Frame for echo
  Saved value of %ebp set to 0xbfff0035
  Bad news when later attempting to restore %ebp

  echo code:
    8048592: push %ebx
    8048593: call 80483e4 <_init+0x50>      # gets
    8048598: mov 0xffffffe8(%ebp),%ebx
    804859b: mov %ebp,%esp
    804859d: pop %ebp                       # %ebp gets set to invalid value
    804859e: ret

Buffer Overflow Stack Example #3
  Input = "12345678"
    Stack Frame for main
    Return Address     08 04 86 00
    Saved %ebp         38 37 36 35    <- %ebp (0xbffff8d8)
    [3][2][1][0] buf   34 33 32 31
    Stack Frame for echo
  %ebp and return address corrupted
  Invalid address
    No longer pointing to desired return point

    8048648: call 804857c <echo>
    804864d: mov 0xffffffe8(%ebp),%ebx   # Return Point

Malicious Use of Buffer Overflow
    void foo(){
        bar();
        ...          /* return address A */
    }

    void bar() {
        char buf[64];
        gets(buf);
        ...
    }

  [Figure: stack after call to gets(), highest address first]
    foo stack frame
    B               (return address A overwritten with B)
    pad             (data written by gets())
    exploit code    (data written by gets(); B points here)
    bar stack frame
  Input string contains byte representation of executable code
  Overwrite return address with address of buffer
  When bar() executes ret, will jump to exploit code

Exploits Based on Buffer Overflows
  Buffer overflow bugs allow remote machines to execute arbitrary code on victim machines.
  Internet worm
    Early versions of the finger server (fingerd) used gets() to read the argument sent by the client:
      finger droh@cs.cmu.edu
    Worm attacked fingerd server by sending phony argument:
      finger "exploit-code padding new-return-address"
      exploit code: executed a root shell on the victim machine with a direct TCP connection to the attacker.

Exploits Based on Buffer Overflows
  IM War
    AOL exploited existing buffer overflow bug in AIM clients
    exploit code: returned 4-byte signature (the bytes at some location in the AIM client) to server.
    When Microsoft changed code to match signature, AOL changed signature location.

    Date: Wed, 11 Aug 1999 11:30:57 -0700 (PDT)
    From: Phil Bucking <philbucking@yahoo.com>
    Subject: AOL exploiting buffer overrun bug in their own software!
    To: rms@pharlap.com

    Mr.
    Smith,

    I am writing you because I have discovered something that I think you might find interesting because you are an Internet security expert with experience in this area. I have also tried to contact AOL but received no response.

    I am a developer who has been working on a revolutionary new instant messaging client that should be released later this year.
    ...
    It appears that the AIM client has a buffer overrun bug. By itself this might not be the end of the world, as MS surely has had its share. But AOL is now *exploiting their own buffer overrun bug* to help in its efforts to block MS Instant Messenger.
    ....
    Since you have significant credibility with the press I hope that you can use this information to help inform people that behind AOL's friendly exterior they are nefariously compromising peoples' security.

    Sincerely,
    Phil Bucking
    Founder, Bucking Consulting
    philbucking@yahoo.com

  It was later determined that this email originated from within Microsoft!

Code Red Worm
  History
    June 18, 2001. Microsoft announces buffer overflow vulnerability in IIS Internet server
    July 19, 2001. Over 250,000 machines infected by new virus in 9 hours
    White House must change its IP address.
    Pentagon shut down public WWW servers for a day
  When We Set Up CS:APP Web Site
    Received strings of form
      GET /default.ida?NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN....N
      NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN%u9090%u6858%ucbd
      3%u7801%u9090%u6858%ucbd3%u7801%u9090%u6858%ucbd3%u7801%u
      9090%u9090%u8190%u00c3%u0003%u8b00%u531b%u53ff%u0078%u000
      0%u00=a HTTP/1.0" 400 325 "-" "-"

Code Red Exploit Code
  Starts 100 threads running
  Spread self
    Generate random IP addresses & send attack string
    Between 1st & 19th of month
  Attack www.whitehouse.gov
    Send 98,304 packets; sleep for 4-1/2 hours; repeat
      Denial of service attack
    Between 21st & 27th of month
  Deface server's home page
    After waiting 2 hours

Code Red Effects
  Later Version Even More Malicious
    Code Red II
    As of April, 2002, over 18,000 machines infected
    Still spreading
  Paved Way for NIMDA
    Variety of propagation methods
    One was to exploit vulnerabilities left behind by Code Red II

Avoiding Overflow Vulnerability
    /* Echo Line */
    void echo()
    {
        char buf[4];  /* Way too small! */
        fgets(buf, 4, stdin);
        puts(buf);
    }
  Use Library Routines that Limit String Lengths
    fgets instead of gets
    strncpy instead of strcpy
    Don't use scanf with %s conversion specification
      Use fgets to read the string

Final Observations
  Memory Layout
    OS/machine dependent (including kernel version)
    Basic partitioning: stack/data/text/heap/DLL found in most machines
  Type Declarations in C
    Notation obscure, but very systematic
  Working with Strange Code
    Important to analyze nonstandard cases
      E.g., what happens when stack corrupted due to buffer overflow
    Helps to step through with GDB

Machine Programming – x86-64
CENG331: Introduction to Computer Systems
7th Lecture
Instructor: Erol Sahin
Acknowledgement: Most of the slides are adapted from the ones prepared by R.E. Bryant, D.R. O’Hallaron of Carnegie-Mellon Univ.
x86-64 Integer Registers
    %rax  %eax        %r8   %r8d
    %rbx  %ebx        %r9   %r9d
    %rcx  %ecx        %r10  %r10d
    %rdx  %edx        %r11  %r11d
    %rsi  %esi        %r12  %r12d
    %rdi  %edi        %r13  %r13d
    %rsp  %esp        %r14  %r14d
    %rbp  %ebp        %r15  %r15d
  Twice the number of registers
  Accessible as 8, 16, 32, 64 bits

x86-64 Integer Registers
    %rax  Return value      %r8   Argument #5
    %rbx  Callee saved      %r9   Argument #6
    %rcx  Argument #4       %r10  Caller saved
    %rdx  Argument #3       %r11  Used for linking
    %rsi  Argument #2       %r12  Callee saved
    %rdi  Argument #1       %r13  Callee saved
    %rsp  Stack pointer     %r14  Callee saved
    %rbp  Callee saved      %r15  Callee saved

x86-64 Registers
  Arguments passed to functions via registers
    If more than 6 integral parameters, then pass rest on stack
    These registers can be used as caller-saved as well
  All references to stack frame via stack pointer
    Eliminates need to update %ebp/%rbp
  Other Registers
    6 callee saved (%rbx, %rbp, %r12 to %r15)
    2 or 3 have special uses

x86-64 Long Swap
    void swap(long *xp, long *yp)
    {
        long t0 = *xp;
        long t1 = *yp;
        *xp = t1;
        *yp = t0;
    }

    swap:
        movq (%rdi), %rdx
        movq (%rsi), %rax
        movq %rax, (%rdi)
        movq %rdx, (%rsi)
        ret
  Operands passed in registers
    First (xp) in %rdi, second (yp) in %rsi
    64-bit pointers
  No stack operations required (except ret)
  Avoiding stack
    Can hold all local information in registers

x86-64 Locals in the Red Zone
    /* Swap, using local array */
    void swap_a(long *xp, long *yp)
    {
        volatile long loc[2];
        loc[0] = *xp;
        loc[1] = *yp;
        *xp = loc[1];
        *yp = loc[0];
    }

    swap_a:
        movq (%rdi), %rax
        movq %rax, -24(%rsp)
        movq (%rsi), %rax
        movq %rax, -16(%rsp)
        movq -16(%rsp), %rax
        movq %rax, (%rdi)
        movq -24(%rsp), %rax
        movq %rax, (%rsi)
        ret
  Avoiding Stack Pointer Change
    Can hold all information within small window beyond stack pointer
  volatile tells the compiler that the value of the variable may change at any time, without any action being taken by the code the compiler finds nearby. Here it forces loc[] to be kept in memory. Useful when working with I/O devices and interrupts.
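The two swap variants above can be exercised from C to confirm that they behave identically at the source level. This is a minimal sketch: swaps_agree is a hypothetical helper that is not part of the slides, and the NULL checks are added only for the sketch. Whether a compiler actually keeps loc[] in the red zone depends on the compiler, its flags, and the ABI, so the assembly shown on the slides should be treated as one possible output (inspect your own with gcc -O1 -S).

```c
#include <assert.h>
#include <stddef.h>

/* Swap two longs through pointers, as on the "x86-64 Long Swap" slide. */
void swap(long *xp, long *yp)
{
    assert(xp != NULL && yp != NULL);  /* added for the sketch */
    long t0 = *xp;
    long t1 = *yp;
    *xp = t1;
    *yp = t0;
}

/* Same swap through a volatile local array, as on the red zone slide.
 * volatile forces every access to loc[] to go through memory. */
void swap_a(long *xp, long *yp)
{
    assert(xp != NULL && yp != NULL);  /* added for the sketch */
    volatile long loc[2];
    loc[0] = *xp;
    loc[1] = *yp;
    *xp = loc[1];
    *yp = loc[0];
}

/* Hypothetical helper (not in the slides): returns 1 when both
 * variants exchange a and b, 0 otherwise. */
int swaps_agree(long a, long b)
{
    long x1 = a, y1 = b;
    long x2 = a, y2 = b;
    swap(&x1, &y1);
    swap_a(&x2, &y2);
    return x1 == b && y1 == a && x2 == b && y2 == a;
}
```

Calling swaps_agree(3, 7) exercises both functions on the same inputs; the interesting difference between them is only visible in the generated assembly, not in the result.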
  [Figure: red zone layout for swap_a, offsets relative to %rsp]
    rtn Ptr    <- %rsp
    unused     -8
    loc[1]     -16
    loc[0]     -24

x86-64 NonLeaf without Stack Frame
    long scount = 0;

    /* Swap a[i] & a[i+1] */
    void swap_ele_se(long a[], int i)
    {
        swap(&a[i], &a[i+1]);
        scount++;
    }

    swap_ele_se:
        movslq %esi,%rsi            # Sign extend i
        leaq (%rdi,%rsi,8), %rdi    # &a[i]
        leaq 8(%rdi), %rsi          # &a[i+1]
        call swap                   # swap()
        incq scount(%rip)           # scount++;
        ret
  No values held while swap being invoked
  No callee save registers needed

x86-64 Call using Jump
    long scount = 0;

    /* Swap a[i] & a[i+1] */
    void swap_ele(long a[], int i)
    {
        swap(&a[i], &a[i+1]);
    }

    swap_ele:
        movslq %esi,%rsi            # Sign extend i
        leaq (%rdi,%rsi,8), %rdi    # &a[i]
        leaq 8(%rdi), %rsi          # &a[i+1]
        jmp swap                    # swap()
  When swap executes ret, it will return from swap_ele
  Possible since swap is a "tail call" (no instructions afterwards)

x86-64 Stack Frame Example
    long sum = 0;

    /* Swap a[i] & a[i+1] */
    void swap_ele_su(long a[], int i)
    {
        swap(&a[i], &a[i+1]);
        sum += a[i];
    }
  Keeps values of a and i in callee save registers
  Must set up stack frame to save these registers

Understanding x86-64 Stack Frame
    swap_ele_su:
        movq %rbx, -16(%rsp)        # Save %rbx
        movslq %esi,%rbx            # Extend & save i
        movq %r12, -8(%rsp)         # Save %r12
        movq %rdi, %r12             # Save a
        leaq (%rdi,%rbx,8), %rdi    # &a[i]
        subq $16, %rsp              # Allocate stack frame
        leaq 8(%rdi), %rsi          # &a[i+1]
        call swap                   # swap()
        movq (%r12,%rbx,8), %rax    # a[i]
        addq %rax, sum(%rip)        # sum += a[i]
        movq (%rsp), %rbx           # Restore %rbx
        movq 8(%rsp), %r12          # Restore %r12
        addq $16, %rsp              # Deallocate stack frame
        ret

  [Figure: stack before allocation, offsets relative to %rsp]
    rtn addr   <- %rsp
    %r12       -8
    %rbx       -16
  [Figure: stack after subq $16, %rsp]
    rtn addr   +16
    %r12       +8
    %rbx       <- %rsp

Interesting Features of Stack Frame
  Allocate entire frame at once
    All stack accesses can be relative to %rsp
    Do by decrementing stack pointer
    Can delay allocation, since safe to temporarily use red zone
  Simple deallocation
    Increment stack pointer
    No base/frame pointer needed

x86-64 Procedure Summary
  Heavy use of registers
    Parameter passing
    More temporaries since more registers
  Minimal use of stack
    Sometimes none
    Allocate/deallocate entire block
  Many tricky optimizations
    What kind of stack frame to use
    Calling with jump
    Various allocation techniques

IA32 Floating Point (x87)
  History
    8086: first computer to implement IEEE FP
      separate 8087 FPU (floating point unit)
    486: merged FPU and Integer Unit onto one chip
    Becoming obsolete with x86-64
  Summary
    Hardware to add, multiply, and divide
    Floating point data registers
    Various control & status registers
  [Figure: instruction decoder and sequencer feeding the Integer Unit and the FPU, both connected to Memory]
  Floating Point Formats
    single precision (C float): 32 bits
    double precision (C double): 64 bits
    extended precision (C long double): 80 bits

FPU Data Register Stack (x87)
  FPU register format (80 bit extended precision)
    bit 79: s    bits 78-64: exp    bits 63-0: frac
  FPU registers
    8 registers %st(0) - %st(7)
    Logically form stack
    Top: %st(0)
    Bottom disappears (drops out) after too many pushes
  [Figure: register stack %st(3), %st(2), %st(1), %st(0) with %st(0) marked "Top"]

FPU instructions (x87)
  Large number of floating point instructions and formats
  ~50 basic instruction types
    load, store, add,
    multiply
    sin, cos, tan, arctan, and log
      Often slower than math lib
  Sample instructions:
    Instruction    Effect                                 Description
    fldz           push 0.0                               Load zero
    flds Addr      push Mem[Addr]                         Load single precision real
    fmuls Addr     %st(0) ← %st(0)*M[Addr]                Multiply
    faddp          %st(1) ← %st(0)+%st(1); pop            Add and pop

FP Code Example (x87)
  Compute inner product of two vectors
    Single precision arithmetic
    Common computation

    float ipf(float x[], float y[], int n)
    {
        int i;
        float result = 0.0;
        for (i = 0; i < n; i++)
            result += x[i]*y[i];
        return result;
    }

        pushl %ebp                # setup
        movl %esp,%ebp
        pushl %ebx
        movl 8(%ebp),%ebx         # %ebx=&x
        movl 12(%ebp),%ecx        # %ecx=&y
        movl 16(%ebp),%edx        # %edx=n
        fldz                      # push +0.0
        xorl %eax,%eax            # i=0
        cmpl %edx,%eax
        jge .L3                   # if i>=n done
    .L5:
        flds (%ebx,%eax,4)        # push x[i]
        fmuls (%ecx,%eax,4)       # st(0)*=y[i]
        faddp                     # st(1)+=st(0); pop
        incl %eax                 # i++
        cmpl %edx,%eax
        jl .L5                    # if i<n repeat
    .L3:
        movl -4(%ebp),%ebx        # finish
        movl %ebp, %esp
        popl %ebp
        ret                       # st(0) = result

Inner Product Stack Trace
  %eax = i, %ebx = x, %ecx = y (x and y are pointers to the arrays)
    Step                        %st(1)        %st(0)
  Initialization
    1. fldz                                   0.0
  Iteration 0
    2. flds (%ebx,%eax,4)       0.0           x[0]
    3. fmuls (%ecx,%eax,4)      0.0           x[0]*y[0]
    4. faddp                                  0.0+x[0]*y[0]
  Iteration 1
    5. flds (%ebx,%eax,4)       x[0]*y[0]     x[1]
    6. fmuls (%ecx,%eax,4)      x[0]*y[0]     x[1]*y[1]
    7. faddp                                  x[0]*y[0]+x[1]*y[1]

Vector Instructions: SSE Family
  SSE: Streaming SIMD Extensions
  SIMD (single-instruction, multiple data) vector instructions
    New data types, registers, operations
    Parallel operation on small (length 2-8) vectors of integers or floats
    Example: "4-way" vector add and multiply
  [Figure: two 4-element vectors combined element by element with + and x]
  Floating point vector instructions
    Available with Intel's SSE (streaming SIMD extensions) family
    SSE starting with Pentium III: 4-way single precision
    SSE2 starting with Pentium 4: 2-way double precision
    All x86-64 have SSE3 (superset of SSE2, SSE)

Intel Architectures (Focus Floating Point)
  [Table: processors in order of introduction, oldest first]
    Processor      Architecture       Features
    8086           x86-16
    286
    386            x86-32
    486
    Pentium
    Pentium MMX    MMX
    Pentium III    SSE                4-way single precision fp
    Pentium 4      SSE2               2-way double precision fp
    Pentium 4E     SSE3
    Pentium 4F     x86-64 / em64t
    Core 2 Duo     SSE4
  Our focus: SSE3 used for scalar (non-vector) floating point

SSE3 Registers
  All caller saved
  %xmm0 for floating point return value
  128 bit = 2 doubles = 4 singles
    %xmm0  Argument #1      %xmm8
    %xmm1  Argument #2      %xmm9
    %xmm2  Argument #3      %xmm10
    %xmm3  Argument #4      %xmm11
    %xmm4  Argument #5      %xmm12
    %xmm5  Argument #6      %xmm13
    %xmm6  Argument #7      %xmm14
    %xmm7  Argument #8      %xmm15

SSE3 Registers
  Different data types and associated instructions (each register is 128 bits)
  Integer vectors:
    16-way byte
    8-way 2 bytes
    4-way 4 bytes
  Floating point vectors:
    4-way single
    2-way double
  Floating point scalars (held in the least significant part of the register):
    single
    double

SSE3 Instructions: Examples
  Single precision 4-way vector add: addps %xmm0, %xmm1
    Adds all four single precision slots: %xmm1 ← %xmm0 + %xmm1
  Single precision scalar add: addss %xmm0, %xmm1
    Adds only the lowest slot: %xmm1 ← %xmm0 + %xmm1

SSE3 Instruction Names
                            single precision    double precision
    packed (vector)         addps               addpd
    single slot (scalar)    addss               addsd      (this course)

SSE3 Basic Instructions
  Moves
    Single    Double    Effect
    movss     movsd     D ← S
    Usual operand form: reg → reg, reg → mem, mem → reg
  Arithmetic
    Single    Double    Effect
    addss     addsd     D ← D + S
    subss     subsd     D ← D - S
    mulss     mulsd     D ← D x S
    divss     divsd     D ← D / S
    maxss     maxsd     D ← max(D,S)
    minss     minsd     D ← min(D,S)
    sqrtss    sqrtsd    D ← sqrt(S)

x86-64 FP Code Example
  Compute inner product of two vectors
    Single precision arithmetic
    Uses SSE3 instructions

    float ipf(float x[], float y[], int n)
    {
        int i;
        float result = 0.0;
        for (i = 0; i < n; i++)
            result += x[i]*y[i];
        return result;
    }

    ipf:
        xorps %xmm1, %xmm1            # result = 0.0
        xorl %ecx, %ecx               # i = 0
        jmp .L8                       # goto middle
    .L10:                             # loop:
        movslq %ecx,%rax              # icpy = i
        incl %ecx                     # i++
        movss (%rsi,%rax,4), %xmm0    # t = y[icpy]
        mulss (%rdi,%rax,4), %xmm0    # t *= x[icpy]
        addss %xmm0, %xmm1            # result += t
    .L8:                              # middle:
        cmpl %edx, %ecx               # i:n
        jl .L10                       # if < goto loop
        movaps %xmm1, %xmm0           # return result
        ret

SSE3 Conversion Instructions
  Conversions
    Same operand forms as moves
    Instruction    Description
    cvtss2sd       single → double
    cvtsd2ss       double → single
    cvtsi2ss       int → single
    cvtsi2sd       int → double
    cvtsi2ssq      quad int → single
    cvtsi2sdq      quad int → double
    cvttss2si      single → int (truncation)
    cvttsd2si      double → int (truncation)
    cvttss2siq     single → quad int (truncation)
    cvttsd2siq     double → quad int (truncation)

x86-64 FP Code Example
    double funct(double a, float x, double b, int i)
    {
        return a*x - b/i;
    }

    Argument    Register    Type
    a           %xmm0       double
    x           %xmm1       float
    b           %xmm2       double
    i           %edi        int

    funct:
        cvtss2sd %xmm1, %xmm1    # %xmm1 = (double) x
        mulsd %xmm0, %xmm1       # %xmm1 = a*x
        cvtsi2sd %edi, %xmm0     # %xmm0 = (double) i
        divsd %xmm0, %xmm2       # %xmm2 = b/i
        movsd %xmm1, %xmm0       # %xmm0 = a*x
        subsd %xmm2, %xmm0       # return a*x - b/i
        ret

Constants
    double cel2fahr(double temp)
    {
        return 1.8 * temp + 32.0;
    }

    # Constant declarations
    .LC2:
        .long 3435973837    # Low order four bytes of 1.8
        .long 1073532108    # High order four bytes of 1.8
    .LC4:
        .long 0             # Low order four bytes of 32.0
        .long 1077936128    # High order four bytes of 32.0

    # Code
    cel2fahr:
        mulsd .LC2(%rip), %xmm0    # Multiply by 1.8
        addsd .LC4(%rip), %xmm0    # Add 32.0
        ret
  Here: constants in decimal format
    Compiler decision
    Hex more readable

Checking Constant
  Previous slide: claim
    .LC4:
        .long 0             # Low order four bytes of 32.0
        .long 1077936128    # High order four bytes of 32.0
  Convert to hex format:
    .LC4:
        .long 0x0           # Low order four bytes of 32.0
        .long 0x40400000    # High order four bytes of 32.0
  Convert to double (blackboard?):
    Remember: e = 11 exponent bits, bias = 2^(e-1) - 1 = 1023
    Full 64-bit pattern: 0x4040000000000000
    Sign bit 0, exponent field 0x404 = 1028, fraction field 0
    Value = 1.0 x 2^(1028 - 1023) = 2^5 = 32.0

Comments
  SSE3 floating point
    Uses lower 1/2 (double) or 1/4 (single) of vector
    Finally a departure from awkward x87
    Assembly very similar to integer code
  x87 still supported
    Even mixing with SSE3 possible
    Not recommended
  For highest floating point performance
    Vectorization a must (but not in this course)
    See next slide

Vector Instructions
  Starting with version 4.1.1, gcc can autovectorize to some extent
    -O3 or -ftree-vectorize
    No speed-up guaranteed
    Very limited
    icc as of now much better
    Fish machines: gcc 3.4
  For highest performance vectorize yourself using intrinsics
    Intrinsics = C interface to vector instructions
    Learn in 18-645
  Future
    Intel AVX announced: 4-way double, 8-way single
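Returning to the Checking Constant slide: the conversion from the two .long words to a double can also be checked mechanically. The sketch below assumes that double is a 64-bit IEEE 754 value stored with the same byte order as 64-bit integers (true on x86-64); double_from_longs and unbiased_exponent are hypothetical helper names, not something from the slides or from any library.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: rebuild a double from the two .long words the
 * compiler emitted, low-order word first (as in .LC2 and .LC4). */
double double_from_longs(uint32_t low, uint32_t high)
{
    /* Assumption: double is a 64-bit IEEE 754 value. */
    assert(sizeof(double) == sizeof(uint64_t));
    uint64_t bits = ((uint64_t)high << 32) | low;
    double d;
    memcpy(&d, &bits, sizeof d);  /* reinterpret the bit pattern */
    return d;
}

/* Hypothetical helper: extract the 11-bit exponent field from the
 * high-order word and subtract the bias 2^(11-1) - 1 = 1023. */
int unbiased_exponent(uint32_t high)
{
    return (int)((high >> 20) & 0x7FF) - 1023;
}
```

With the values from the Constants slide, double_from_longs(0, 1077936128) should give 32.0 and double_from_longs(3435973837, 1073532108) should give 1.8, while unbiased_exponent(0x40400000) gives the exponent 5 used in the manual conversion.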