Brief History of Computers

Year  Processor        Transistors  Max Address Space / Memory Size  Caches                                                      Speed
1972  8008             3500         16 KB                                                                                        0.05 MIPS
1978  8086             29 K         1 MB                                                                                         0.33 - 0.75 MIPS
1982  80286            134 K        16 MB                                                                                        0.9 - 2.66 MIPS
1985  i386             275 K        4 GB                                                                                         2.5 - 9.9 MIPS
1989  i486             1.2 M        4 GB, virtual 1 TB               8 KB cache on chip                                          25-50 MHz; 20-41 MIPS
1993  Pentium          3.1 M        4 GB, virtual 64 TB              8 KB instruction cache; 8 KB data cache                     60-100 MHz; 100-150 MIPS
1995  Pentium Pro      5.5 M        4 GB; virtual 64 TB              16 KB L1 cache; 256 KB L2 cache                             166-200 MHz
1997  Pentium II       7.5 M        5 GB; virtual 64 TB              32 KB L1 cache; 512 KB ext L2 cache                         233-450 MHz
1999  Pentium III      9.5 M        "                                                                                            450-600 MHz
2000  Pentium 4        42 M         "                                256KB-2MB L2 cache                                          1.6 - 1.8 GHz
2006  Core 2           291 M                                         64 KB L1 cache per core; 4MB L2 cache                       1.86 - 3.0 GHz
2008  Core i7          781 M                                         64 KB L1 cache per core; 256 KB L2 cache per core; 8MB L3   2.8-3.5 GHz
2010  Itanium Tukwila  2B                                            24 MB L3 cache                                              1.6-1.73 GHz
2012  Xeon Phi         5B                                                                                                        240-320 GHz; 1-1.2 teraflops double precision

Notes from the table:
  o 8008: Instr set designed by Datapoint in San Antonio
  o i386: With the speed of the i386 and the convenience of MS Windows 3.0, Microsoft grew significantly
  o i486: Floating Pt coprocessor on chip
  o Core 2: 2 cores
  o Core i7: 4 cores; hyper threading
  o Itanium Tukwila: 2-4 cores; instr level parallelism; up to 6 instr per clock cycle; HP Servers
  o Xeon Phi: 62 cores; HP & Cray Servers
  o Systems of the period: TX Instruments Computers, IBM PC, IBM PC-AT, Compaq suitcase computers
  o Software of the period: MS DOS; MS Windows, SCO Xenix; IBM OS/2; MS Windows 2.0; MS Windows 3.0 (1990)

Moore's Law (revised, 1975): The complexity for minimum component costs has increased at a rate of roughly a factor of two every two years.
Note: from 1978 to 2012, that would be 3.6B based on Moore's Law.

We will focus on the Intel Architecture 32 bit machines (IA32).
As a running example, suppose we need to add two integer variables (which are externals) and store the result in another integer variable (which is an external).
There are two different syntaxes for IA32 Assembly Language:
  o Intel (used by Microsoft)
  o AT&T (used by GNU; therefore we use this)

The addition of valx and valy, stored in valz, looks like this in each syntax:

Intel Asm Syntax    AT&T Asm Syntax     Meaning
mov eax, valx       movl valx, %eax     Load register eax with valx
add eax, valy       addl valy, %eax     Add valy to register eax
mov valz, eax       movl %eax, valz     Store the result in valz

The underlying machine code is fundamentally the same. AT&T Assembly language syntax uses Source-Destination operand order, whereas Intel uses Destination-Source. AT&T also places % in front of register names.

Machine Instructions
The actual machine code (which executes) is binary and is interpreted by the CPU. Assembly Language is a lot easier to read than machine code.
IA32 machine instructions vary in size from 1 to 15 bytes. Some other machine architectures use fixed length instructions.
We will discuss the actual IA32 machine instruction format later in the semester. Most instruction formats include:
  o Op Code - tells the CPU what needs to be done
  o Operand Type (if needed) - may include data type, whether the operand is a register, immediate (constant) or memory reference
  o Operand 1 (if needed) - reg value, immediate operand, memory reference info
  o Operand 2 (if needed) - reg value, immediate operand, memory reference info
Note: both operands usually will not be memory references.

IA32 Hardware Architecture Overview
Program Counter is called %eip (extended instruction pointer).
8 Integer Registers, each storing 32 bit values.
  o These are named: %eax, %ebx, %ecx, %edx, %esi, %edi, %esp, %ebp
  o The lower 2 bytes in the first 4 integer registers can be referenced as: %ax, %bx, %cx, %dx
  o We can further divide those 2 lower byte names into two single bytes: %al, %ah, %bl, %bh, %cl, %ch, %dl, %dh
  o Registers %esp and %ebp are for runtime stack manipulation
Based on the history of Intel chips, backward compatibility forced the inclusion of these 2-byte and 1-byte registers.

IA32 Hardware Architecture Overview Continued
Condition Code Registers are single bit flags which are set based on the outcome of the most recent arithmetic or logical instructions.
  o OF - overflow flag; set when the result of a signed arithmetic operation is either too large or too small to fit in the destination
  o CF - carry flag; set when the result of an unsigned arithmetic operation is too large to fit in the destination
  o ZF - zero flag; set when the result is zero; it is ON if a comparison shows values are equal
  o SF - sign flag; set when the result is a negative value
  o PF - parity flag; its parity is even (PE) when there is an even number of 1 bits in the 8 low order bits

Example of a signed overflow:
    short ix = 21234;
    short iy = 20841;
    short iresult;
    iresult = ix + iy;
    printf("Result is %d\n", iresult);
output:
    Result is -23461
Since the sum is greater than 32767, it overflows. Some languages would generate an error, but C assumes overflows are expected and does not generate a runtime error.
Note:
    21234 + 20841 = 42075
    2^16 = 65536
    42075 - 65536 = -23461

IA32 Hardware Architecture Overview Continued
Floating Point Registers are used for floating point arithmetic. There are 8 floating point registers, each having 80 bits. These are stack based:
  o Top of the stack: register ST0
  o Next: register ST1
  o Bottom: register ST7

IA32 Assembly Language AT&T Syntax
Comments begin with #.
Labels begin in column 1 and end with a colon. They are used to reference an instruction address for JMP and CALL instructions.
They are also used to reference external variables (basis), static variables, and string constants.
Dot directives begin with a dot and tell the assembler things like the name of your source code, variables which are external global basis (.globl), data types, lengths, and other assembler information.
Instruction Operators should not begin in column 1 for readability.
Instruction Operands might reference constants, symbol labels, registers or memory references, but are all dependent on the instruction operations.
See sample code below.

Notes on the hardware overview:
  o Condition codes: in addition to arithmetic operations, comparisons set those condition codes (see the notes on Flow Instructions).
  o Floating point: we will discuss floating point in detail after the midterm exam.

Operands
Operands have several different forms to help reference registers, constants, symbols, and memory.
    %reg         Register references begin with a %. They reference the 4 byte registers (begin with "e" for extended), 2 byte registers or 1 byte registers.
    $constant    Numeric constants can be base-10 or hexadecimal (begin with 0x).
    symbol       A symbol can be an external variable, an external function, a static variable or a .label.
    $symbol      The address of the specified symbol.

Some examples:
    %ax          register %ax
    %eax         register %eax
    $150         integer constant 150
    $0xAFF3      hexadecimal constant AFF3
    valx         a symbol for an external or static variable
    .L5          a label to an address of an instruction which could be used in a jump instruction. It can also be a label for a character string literal.
    8(%ebp)        the memory address which is 8 + the value of register %ebp
    (%ebp,%ebx)    the memory address which is the value of register %ebp + value of register %ebx
    studentData+4  the memory address which is 4 + the address of studentData

A memory reference (memRef) is typically the address of a variable. Memory references can take on many forms:
    symbol              Memory address based on the symbol's address
    off(%reg)           Memory address is an offset from the value of %reg
    (%reg1,%reg2)       Memory address is sum of the values of %reg1 and %reg2
    off(%reg1,%reg2)    Memory address is an offset from the sum of the values of %reg1 and %reg2
The offsets can be positive integers, negative integers, symbolics, or symbolics with a positive or negative offset.
There is another form of memory references using a scale which we will discuss later.

Overview of the Machine instruction categories
Move - move from source to destination
    movS source, dest
Load Effective Address - load the address instead of the value from an address
    leaS source, dest
Arithmetic - 2 byte and 4 byte
    addS operand1, operand2    add op1 to op2
    subS operand1, operand2    subtract op1 from op2
    imulS operand              multiply by operand
    idivS operand              divide by operand
    incS operand               increment the operand
    decS operand               decrement the operand
    negS operand               negate the operand
Shift
    salS k,reg                 shift arithmetic left
    sarS k,reg                 shift arithmetic right
    shlS k,reg                 shift logical left
    shrS k,reg                 shift logical right

Note: S is the size and must be one of
    b    byte (1 byte)
    w    word (2 bytes)
    l    long (4 bytes)
    q    quad words (8 bytes)
w for word is based on the old machines where a word was 2 bytes.

Examples:
    movl iValA,%edx       # Moves the long value of iValA to %edx
    movl $iValA,%edx      # Moves the address of iValA to %edx
    movl %edx,lresult     # Moves the long value in %edx to lresult
    addl 4(%ebp),%edx     # The value at the address computed by an
                          # offset of 4 plus the value of ebp is
                          # added to the value of %edx. The result
                          # is stored in %edx.
    incw %dx              # Increment the 2 byte value in %dx by 1.
    movl lresult,%edx     # Moves the long value found at lresult
                          # to %edx
    leal lresult,%edx     # Move the address of lresult to %edx
    sarl $3, %edx         # Arithmetic shift of the long value in
                          # %edx 3 bits to the right

Overview of the Machine instruction categories
Flow
    jmp label                  unconditional jump to label
    cmpS operand1, operand2    compare setting condition code flags
    jle label                  jump less than or equal
    jl label                   jump less than
    je label                   jump equal
    jne label                  jump not equal
    jge label                  jump greater than or equal
    jg label                   jump greater than
    call dest                  using calling convention to invoke the function at dest
    ret                        return to the caller based on the calling convention
Stack - these manipulate the runtime memory stack
    pushS operand              pushes the operand onto the runtime memory stack
    popS operand               pops the top of the stack and stores it in operand
    leave                      prepare to leave the subroutine based on calling convention
Note that call and ret also manipulate the stack.

Consider the following C statement snippet:
    if (iX > iY)
        true part
    else
        false part
In Assembly Language:
        movl iX, %edx     # load reg edx with the iX variable
        cmpl iY, %edx     # compare iX:iY (we are comparing the
                          # second operand (edx) with the first)
        jle .L3           # if <=, jump to .L3
        …                 # code for the true part
        jmp .L4           # jump over the false part
    .L3:
        …                 # code for the false part
    .L4:
        …                 # code following the entire if

C code for calculating the average using the final exam and the higher of the first two exams.
    int calculateAverage(int iExam1, int iExam2, int iFinalExam)
    {
        int iSum;
        if (iExam1 > iExam2)
            iSum = iExam1 + iFinalExam;
        else
            iSum = iExam2 + iFinalExam;
        return iSum / 2;
    }

Note on dividing by 2: integer division in C truncates toward zero, but an arithmetic right shift rounds down. For a negative sum, gcc therefore adds 1 (the sign bit) before shifting.
    3/2 = 1.5, truncating 1
    -3/2 = -1.5, truncating -1
    -3 + 1 = -2, if we divide by 2 we get -1
    -4 + 1 = -3, divide by 2 = -2

Corresponding assembly language code generated by gcc -O1 -S (comments added by me)
        .file "calculateAverage.c"
        .text
    .globl calculateAverage               # The Linker will need to know
                                          # this.
        .type calculateAverage, @function
    calculateAverage:
        pushl %ebp                        # tbd
        movl %esp, %ebp                   # tbd
        movl 8(%ebp), %edx                # load iExam1 in %edx
        movl 12(%ebp), %eax               # load iExam2 in %eax
        cmpl %eax, %edx                   # compare iExam1:iExam2
        jle .L2                           # if <=, jump to .L2
        addl 16(%ebp), %edx               # add iFinal to %edx (iExam1)
        jmp .L3                           # jump over false part
    .L2:
        movl 16(%ebp), %edx               # move iFinal to %edx
        addl %eax, %edx                   # add iExam2 to %edx (iFinal)
    .L3:
        movl %edx, %eax                   # move sum to %eax for sign
        shrl $31, %eax                    # shift makes this 0 or 1
        addl %edx, %eax                   # increase by 0 or 1
        sarl %eax                         # divide by 2 via shifting
        popl %ebp                         # tbd
        ret                               # tbd
        .size calculateAverage, .-calculateAverage
        .ident "GCC: (Ubuntu 4.3.3-5ubuntu4) 4.3.3"
        .section .note.GNU-stack,"",@progbits

C code for averageDriver.c
    #include <stdio.h>
    /* The Student and StudentData typedefs are shown under Slack Bytes (Alignment). */
    StudentData studentData;
    void readStudents();
    int calculateAverage(int iExam1, int iExam2, int iFinalExam);
    int main(int argc, char *argv[])
    {
        int i;
        studentData.iStudentCnt = 0;
        readStudents();
        for (i = 0; i < studentData.iStudentCnt; i++)
            printf("%s %d\n"
                , studentData.studentM[i].szStudentId
                , calculateAverage(studentData.studentM[i].iExam1
                    , studentData.studentM[i].iExam2
                    , studentData.studentM[i].iFinalExam)
                );
    }

Assembly Language for averageDriver.c using gcc -O1 -S
        .file "averageDriver.c"
        .section .rodata.str1.1,"aMS",@progbits,1
    .LC0:
        .string "%s %d\n"
        .text
    .globl main                           # linker will need to know about this
        .type main, @function
    main:
        leal 4(%esp), %ecx                # tbd
        andl $-16, %esp                   # tbd
        pushl -4(%ecx)                    # tbd
        pushl %ebp                        # tbd
        movl %esp, %ebp                   # tbd
        pushl %edi                        # tbd
        pushl %esi                        # tbd
        pushl %ebx                        # tbd
        pushl %ecx                        # tbd
        subl $24, %esp                    # reserve 24 bytes on the stack
        movl $studentData, %ebx           # address of studentData -> %ebx
        #
        # What is at studentData vs. studentData+4 ?
        #
        movl $0, (%ebx)                   # set iStudentCnt to 0
        call readStudents                 # call readStudents
        # gcc decided to just compare iStudentCnt:0 instead of using i
        cmpl $0, (%ebx)                   # compare iStudentCnt:0
        jle .L5                           # if iStudentCnt <= 0, jump to .L5
        movl %ebx, %esi                   # save addr of studentData
        movl $0, %ebx                     # move 0 to %ebx (gcc using %ebx for i)
        movl %esi, %edi                   # addr of studentCnt
    .L3:
        # offset of iExam1 is after iStudentCnt, szStudentId, and a
        #     slack byte. 7+4+1 = 12
        # offset of iExam2 is 4 past iExam1. 12+4 = 16
        # offset of iFinalExam is 4 past iExam2. 16+4 = 20
        # pass the parameters to calculateAverage by loading the stack
        movl 20(%esi), %eax               # iFinalExam -> %eax
        movl %eax, 8(%esp)                # load it onto the stack as a parm
        movl 16(%esi), %eax               # iExam2 -> %eax
        movl %eax, 4(%esp)                # load it onto the stack as a parm
        movl 12(%esi), %eax               # iExam1 -> %eax
        movl %eax, (%esp)                 # load it onto the stack as a parm
        call calculateAverage             # call calculateAverage
        # result of calculateAverage was returned in %eax
        # prepare the parameters for printf call
        movl %eax, 12(%esp)               # move calculateAverage result to stack
        # determine the address of the szStudentId[i]
        # since each element is 20 bytes long, we need to multiply
        # the subscript by 20.
        #     using leal (x,x,4) will multiply register x by 5
        #     using leal (,x,4) will multiply register x by 4
        # Doing both of those leal instructions is x*20
        #
        # First multiply i (which is in %ebx) by 5
        leal (%ebx,%ebx,4), %eax
        #
        # Now multiply that by 4 which effectively mult by 20
        # and add the address of the beginning of the array
        leal studentData+4(,%eax,4), %eax
        movl %eax, 8(%esp)                # load szStudentId[i] on the stack
        movl $.LC0, 4(%esp)               # load the address of the format string
                                          # onto the stack
        movl $1, (%esp)                   # move 1 onto the stack.
                                          # This is an error checking/
                                          # optimization flag
        call __printf_chk                 # call printf
        addl $1, %ebx                     # increment i
        # Increment ptr into array by size of one element
        addl $20, %esi                    # add element size to ptr
        cmpl %ebx, (%edi)                 # compare iStudentCnt:i
        jg .L3                            # if iStudentCnt > i, loop back to .L3
    .L5:
        addl $24, %esp                    # tbd
        popl %ecx                         # tbd
        popl %ebx                         # tbd
        popl %esi                         # tbd
        popl %edi                         # tbd
        popl %ebp                         # tbd
        leal -4(%ecx), %esp               # tbd
        ret                               # tbd
        .size main, .-main
        .comm studentData,404,32
        .ident "GCC: (Ubuntu 4.3.3-5ubuntu4) 4.3.3"
        .section .note.GNU-stack,"",@progbits

Slack Bytes (Alignment)
Why did it reserve 404 bytes for studentData?
    typedef struct
    {
        char szStudentId[7];
        int iExam1;
        int iExam2;
        int iFinalExam;
    } Student;
    typedef struct
    {
        int iStudentCnt;
        Student studentM[20];
    } StudentData;
If each Student is 19 bytes (7+4+4+4), 19 bytes * 20 students is 380 bytes. Adding 4 for iStudentCnt would be 384. Why is it 404 bytes?
It is because of slack bytes. It can be easier to read and write 4 byte numeric values if their addresses are a multiple of 4. If the compiler inserts a slack byte between szStudentId and iExam1, iExam1 would be aligned on an address which is a multiple of 4. If each student is 20 bytes, 20*20 is 400 bytes. Adding 4 bytes for iStudentCnt gives 404.

If a student ID is only 4 characters and we had declared it szStudentId[5], how many slack bytes would have been included?