IA32 programming for Linux Concepts and requirements for writing Linux assembly language

advertisement
IA32 programming for Linux
Concepts and requirements for
writing Linux assembly language
programs for Pentium CPUs
A source program’s format
•
•
•
•
•
•
•
Source-file: a pure ASCII-character textfile
It’s created using a text-editor (such as ‘vi’)
You cannot use a ‘word processor’ (why?)
Program consists of series of ‘statements’
Every program-statement fits on one line
Program-statements all have same layout
Designed in 1950s for IBM’s punch-cards
Statement Layout (1950s)
• Each ‘statement’ was comprised of four ‘fields’
• Fields appear in a prescribed left-to-right order
• These four fields were named (in order):
-- the ‘label’ field
-- the ‘opcode’ field
-- the ‘operand’ field
-- the ‘comment’ field
• In many cases some fields could be left blank
• Extreme case (very useful): whole line is blank!
Statement layout
Label:
opcode
operand(s)
# comment
• Parsing an assembly language statement
– A colon character (‘:’) terminates the label
– A white-space character (e.g., blank or tab)
separates the opcode from its operand(s)
– A hash character (‘#’) begins the comment
The ‘as’ program
•
•
•
•
•
•
•
The ‘assembler’ is a computer program
It accepts a specified text-file as its input
It must be able to ‘parse’ each statement
It can produce onscreen ‘error messages’
It can generate an ELF-format output file
(That file is known as an ‘object module’)
It can also generate a ‘listing file’ (optional)
The ‘label’ field
• A label is a ‘symbol’ followed by a colon (‘:’)
• The programmer invents his/her own ‘symbols’
• Symbols can use letters and digits, plus a very
small number of ‘special’ characters ( ‘.’, ‘_’, ‘$’ )
• A ‘symbol’ is allowed to be of arbitrarily length
• The Linux assembler (‘as’) was designed for
translating source-text produced by a high-level
language compiler (such as ‘cc’)
• But humans can also write such files directly
The ‘opcode’ field
• Opcodes are predefined symbols that are
recognized by the GNU assembler
• There are two categories of ‘opcodes’
(called ‘instructions’ and ‘directives’)
• ‘Instructions’ represent operations that the
CPU is able to perform (e.g., ‘add’, ‘inc’)
• ‘Directives’ are commands that guide the
work of the assembler (e.g., ‘.global’, ‘.int’)
Instructions vs Directives
• Each ‘instruction’ gets translated by ‘as’
into a machine-language statement that
will be fetched and executed by the CPU
when the program runs (i.e., at ‘run time’)
• Each ‘directive’ modifies the behavior of
the assembler (i.e., at ‘assembly time’)
• With GNU assembly language, these are
easy to distinguish: directives begin with ‘.’
A list of the Pentium opcodes
• An ‘official’ list of the instruction codes can
be found in Intel’s programmer manuals:
http://developer.intel.com
• But it’s three volumes, nearly 1000 pages
(it describes ‘everything’ about Pentiums)
• An ‘unofficial’ list of (most) Intel instruction
codes can fit on one sheet, front and back:
http://www.jegerlehner/intel/
The AT&T syntax
• The GNU assembler uses AT&T syntax
(instead of official Intel/Microsoft syntax)
so the opcode names differ slightly from
names that you will see on ‘official’ lists:
Intel-syntax
--------------ADD
INC
CMP



AT&T-syntax
---------------------addb/addw/addl
incb/incw/incl
cmpb/cmpw/cmpl
The UNIX culture
• Linux is intended to be a version of UNIX (so
that UNIX-trained users already know Linux)
• UNIX was developed at AT&T (in early 1970s)
and AT&T’s computers were built by DEC, thus
UNIX users learned DEC’s assembley language
• Intel was early ally of DEC’s competitor, IBM,
which deliberately used ‘incompatible’ designs
• Also: an ‘East Coast’ versus ‘West Coast’ thing
(California, versus New York and New Jersey)
Bytes, Words, Longwords
• CPU Instructions usually operate on data-items
• Only certain sizes of data are supported:
BYTE: one byte consists of 8 bits
WORD: consists of two bytes (16 bits)
LONGWORD: uses four bytes (32 bits)
• With AT&T’s syntax, an instruction’s name also
incorporates its effective data-size (as a suffix)
• With Intel syntax, data-size usually isn’t explicit,
but is inferred by context (e.g., from operands)
The ‘operand’ field
• Operands can be of several types:
-- a CPU register may hold the datum
-- a memory location may hold the datum
-- an instruction can have ‘built-in’ data
-- frequently there are multiple data-items
-- and sometimes there are no data-items
• An instruction’s operands usually are ‘explicit’,
but in a few cases they also could be ‘implicit’
Examples of operands
• Some instruction will have two operands:
movl
%ebx, %ecx
addl
$4, %esp
• Some instructions will have one operand:
incl
%eax
pushl
$fmt
• An instruction that lacks explicit operands:
ret
The ‘comment’ field
• An assembly language program often can
be hard for a human being to understand
• Even a program’s author may not be able
to recall his programming idea after awhile
• So programmer ‘comments’ can be vital!
• A comment begin with the ‘#’ character
• The assembler disregards all comments
(but they will appear in program listings)
‘Directives’
•
•
•
•
•
•
Sometimes called ‘pseudo-instructions’
They tell the assembler what to do
The assembler will recognize them
Their names begin with a dot (‘.’)
Examples: ‘.section’, ‘.int,’ ‘.string’, …
The names of valid directives appears in
the table-of-contents of the GNU manual
New program example
• Let’s look at a demo program (‘squares.s’)
• It will display a mathematical table showing
some numbers and their squares
• But it doesn’t use any multiplications!
• It uses an algorithm based on algebra:
(n+1)2 - n2 = n + n + 1
If you already know the square of a given
number n , you can get the square of the
next number n+1 by just doing additions
Visualizing the algorithm idea
n
n
(n + 1)2 = n2 + 2n + 1
Our program uses a ‘loop’
• Here’s our program idea (expressed in C++)
int
num = 0, val = 0;
do {
val += num + num + 1;
num += 1;
printf( “ %d %d \n”, num, val );
}
while ( num != 20 );
Some program ‘directives’
• ‘.equ’ – equates a symbol to a value:
.equ
MAX, 20
• ‘.string’ – just an alternative for ‘.asciz’:
body: .string “ %d %d \n”
• ‘.space’ – reserves uninitialized memory
num: .space 4
# to hold an ‘int’
Some new ‘instructions’
• ‘inc’ – adds one to the specified operand:
incl
arg
• ‘cmp’ – compares two specified operands:
cmpl
$MAX, arg
• ‘je’ – jumps (to a specified instruction) if
the condition ‘equal to’ is currently true:
je
again
Comparisons can be ‘tricky’
• It’s easy to get confused by AT&T syntax:
mov $5, %eax
while:
add $-1, %eax
cmp $5, %eax
jge while
(e.g., will this loop ever finish executing?)
• REMEMBER: ‘compare’ means ‘subtract’
The FLAGS register
O
F
D
F
Legend:
I
F
T
F
S
F
Z
F
0
A
F
ZF = Zero Flag
SF = Sign Flag
CF = Carry Flag
PF = Parity Flag
OF = Overflow Flag
AF = Auxiliary Flag
0
P
F
1
C
F
In-class exercise #1
• How would you modify the source-code for
the ‘squares’ program so that it prints out a
larger table (i.e., more than 20 lines)?
• Question: How many lines can you display
on the screen before your program starts
to show ‘wrong’ entries?
In-class exercise #2
• Can you write an enhanced version of our
‘squares.s’ program that would also show
the cube for each number (but using only
the ‘add’ and ‘inc’ arithmetic operations)?
• HINT: Remember these algebra formulas:
(n+1)3 - n3 = 3n2 + 3n + 1
3n = n + n + n
Download