CHAPTER 4   ELEMENTS OF THE A86 LANGUAGE

This chapter begins the description of the A86 language. It's a bit more tutorial in nature than the rest of the manual. I'll start by describing the elementary building blocks of the language.

General Categories of A86 Elements

The statements in an A86 source file can be classified in three general categories: instruction statements, data allocation statements, and assembler directives. An instruction statement uses an easily remembered name (a mnemonic) and possibly one or more operands to specify a machine instruction to be generated. A data allocation statement reserves, and optionally initializes, memory space for program data. An assembler directive is a statement that gives special instructions to the assembler. Directives are unlike the instruction and data allocation statements in that they do not specify the actual contents of memory. Examples of the three types of A86 statements are given below. These are provided to give you a general idea of what the different kinds of statements look like.

Instruction Statements

    MOV AX,BX
    CALL SORT_PROCEDURE
    ADD AL,7

Data Allocation Statements

    A_VARIABLE DW 0
    DB 'HELLO'

Assembler Directives

    CODE SEGMENT
    ITEM_COUNT EQU 5

The statements in an A86 source file are made up of reserved symbols, user symbols, numbers, strings, special characters, and comments.

Symbols are the "words" of the A86 language. All symbols are a collection of consecutive letters, numbers, and assorted special characters: _, @, $, and ?. Symbols cannot begin with digits: anything that begins with a digit is a number. Symbols can begin with any of the special characters just listed. Symbols can also begin with a period, which is the only place within the symbol name a period can appear.

Reserved symbols have a built-in meaning to the assembler: instruction mnemonics (MOV, CALL), directive names (DB, STRUC), register names, expression operators, etc.
User symbols have meanings defined by the programmer: program locations, variable names, equated constants, etc. A user symbol name can be up to 255 characters long, but only its first 127 characters are significant in telling one symbol from another. Examples of user symbols are: COUNT, L1, and A_BYTE.

Numbers in A86 may be expressed as decimal, hexadecimal, octal, binary, or decimal "K". These must begin with a decimal digit and, except in the case of a decimal or hexadecimal number, must end with "x" followed by a letter identifying the base of the number. A number without an identifying base is hexadecimal if the first digit is 0; decimal if the first digit is 1 through 9. Examples of A86 numbers are: 123 (decimal), 0ABC (hexadecimal), 1776xQ (octal), 10100110xB (binary), and 32K (decimal 32 times 1024).

Strings are characters enclosed in either single or double quotes. Examples of strings are: '1st string' and "SIGN-ON MESSAGE, V1.0". If you wish to include a quote mark within a string, you can double it; for example, 'that''s nice' specifies a single quote mark within the string.

The single quote and double quote are two of many special characters used in the assembly language. Others, run together in a list, are: ! $ ? ; : = , [ ] . + - ( ) * / >. The space and tab characters are also special characters, used as separators in the assembly language.

A comment is a sequence of characters used for program documentation only; it is ignored by the assembler. Comments begin with a semicolon (;) and run to the end of the line on which they are started. Examples of lines with comments are shown below:

    ; This entire line is a comment.
    MOV AX,BX  ; This is a comment next to an instruction statement.

Alternatively, for compatibility with other assemblers, I provide the COMMENT directive. The next non-blank character after COMMENT is the delimiter of a comment that can run across many lines; all text is ignored until a second instance of the delimiter is seen.
For example,

    COMMENT 'This comment
    runs across two lines'

I don't like COMMENT, because I think it's very dangerous. If, for example, you have two COMMENTs in your program, and you forget to close the first one, the assembler will happily ignore all source code between the comments. If that source code does not happen to contain any labels referenced elsewhere, the error may not be detected until your program blows up. For multiline comments, I urge you to simply start each line with a semicolon.

A86 statements are line oriented, which means that a statement may not be broken across line boundaries. A86 source lines may be entered in a free form fashion; that is, without regard to the column orientation of the symbols and special characters.

PLEASE NOTE: Because an A86 line is free formatted, there is no need for you to put the operands to your instructions in a separate column. You organize things into columns when you want to visually scan down the column; and you practically never scan operands separate from their opcodes. Realizing this, you may wish to separate your operands from the mnemonic with a space instead of a tab, making the line less disjointed and hence easier to read. You will also have room for a longer comment after the instruction.

Operand Typing and Code Generation

A86 is a strongly typed assembly language. What this means is that operands to instructions (registers, variables, labels, constants) have a type attribute associated with them which tells the assembler something about them. For example, the operand 4 has type "number", which tells the assembler that it is a numerical constant, rather than a register or an address in the code or data. The following discussion explains the types associated with instruction operands and how this type information is used to generate particular machine opcodes from general purpose instruction mnemonics.
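
As a rough preview of the operand kinds covered in the subsections that follow, the fragment below uses one operand of each kind. It is only a sketch, not a complete program: COUNT and START are hypothetical names chosen for illustration, and the data line sits next to the code purely to keep the example short.

    COUNT   DW 0            ; COUNT (hypothetical) is a word variable
    START:                  ; START (hypothetical) is a label
        MOV AX,4            ; 4 is a number, known at assembly time
        MOV AX,BX           ; AX and BX are word registers
        MOV AX,COUNT        ; variable: loads the word stored at COUNT
        MOV BX,START        ; label: loads the address of START as a constant

The same MOV mnemonic appears on every line; only the operand types differ.
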
Registers

The 8086 has 8 general purpose word (two-byte) registers: AX, BX, CX, DX, SI, DI, BP, and SP. The first four of those registers are subdivided into 8 general purpose one-byte registers AH, AL, BH, BL, CH, CL, DH, and DL. There are also 4 16-bit segment registers CS, DS, ES, and SS, used for addressing memory; and the implicit instruction-pointer register (referred to as IP, although "IP" is not part of the A86 assembly language).

My A386 assembler supports the two additional segment registers FS and GS, plus the 32-bit general registers EAX, EBX, ECX, EDX, ESI, EDI, EBP, and ESP. The lower 16 bits of each 32-bit register form the corresponding 16-bit register (without the E in its name).

Variables

A variable is a unit of program data with a symbolic name, residing at a specific location in 8086 memory. A variable is given a type at the time it is defined, which indicates the number of bytes associated with its symbol. Variables defined with a DB statement are given type BYTE (one byte), and those defined with the DW statement are given type WORD (two bytes). Examples:

    BYTE_VAR DB 0   ; A byte variable.
    WORD_VAR DW 0   ; A word variable.

Labels

A label is a symbol referring to a location in the program code. It is defined as an identifier followed by a colon (:), and it represents the location of a particular instruction or data structure. Such a label may be on a line by itself or it may immediately precede an instruction statement (on the same line). In the following example, LABEL_1 and LABEL_2 are both labels for the MOV AL,BL instruction.

    LABEL_1:
    LABEL_2: MOV AL,BL

In the A86 assembly language, labels have a type identical to that of constants. Thus, the instruction MOV BX,LABEL_2 is accepted, and the code to move the immediate constant address of LABEL_2 into BX is generated.

IMPORTANT: you must understand the distinction between a label and a variable, because you may generate a different instruction than you intended if you confuse them.
For example, if you declare XXX: DW ?, the colon following the XXX means that XXX is a label; the instruction MOV SI,XXX moves the immediate constant address of XXX into the SI register. On the other hand, if you declare XXX DW ?, with no colon, then XXX is a word variable; the same instruction MOV SI,XXX now does something different: it loads the run-time value of the memory word XXX into the SI register. You can override the definition of a symbol in any usage with the immediate-value operator OFFSET or the memory-variable operators B, W, D, Q, or T. Thus, MOV SI,OFFSET XXX loads the immediate value pointing to XXX no matter how XXX was declared; MOV SI,XXX W loads the word-variable at XXX no matter how XXX was declared.

Constants

A constant is a numerical value computed from an assembly-time expression. For example, 123 and 3 + 2 - 1 both represent constants. A constant differs from a variable in that it specifies a pure number, known by the assembler before the program is run, rather than a number fetched from memory when the program is running.

Generating Opcodes from General Purpose Mnemonics

My A86 assembly language is modeled after Intel's ASM86 language, which uses general purpose mnemonics to represent classes of machine instructions rather than having a different mnemonic for each opcode. For example, the MOV mnemonic is used for all of the following: move byte register to byte register, load word register from memory, load byte register with constant, move word register to memory, move immediate value to word register, move immediate value to memory, etc. This feature saves you from having to distinguish "move" from "load," "move constant" from "move memory," "move byte" from "move word," etc.

Because the same general purpose mnemonic can apply to several different machine opcodes, A86 uses the type information associated with an instruction's operands in determining the particular opcode to produce.
The type information associated with instruction operands is also used to discover programmer errors, such as attempting to move a word register to a byte register.

The examples that follow illustrate the use of operand types in generating machine opcodes and discovering programmer errors. In each of the examples, the MOV instruction produces a different 8086 opcode, or an error. The symbols used in the examples are assumed to be defined as follows: BVAR is a byte variable, WVAR is a word variable, and LAB is a label. As you examine these MOV instructions, notice that, in each case, the operand on the right is considered to be the source and the operand on the left is the destination. This is a general rule that applies to all two-operand instruction statements.

    MOV AX,BX     ; (8B) Move word register to word register.
    MOV AX,BL     ; ERROR: Type conflict (word,byte).
    MOV CX,5      ; (B9) Move constant to word register.
    MOV BVAR,AL   ; (A2) Move AL register to byte in memory.
    MOV AL,WVAR   ; ERROR: Type conflict (byte,word).
    MOV LAB,5     ; ERROR: Can't use label/constant as dest. to MOV.
    MOV WVAR,SI   ; (89) Move word register to word in memory.
    MOV BL,1024   ; ERROR: Constant is too large to fit in a byte.
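
The four rejected lines could be repaired along the following lines. This is a sketch under assumptions rather than a definitive fix: it reuses BVAR, WVAR, and LAB as defined above, it guesses at what the programmer meant in each case, and it assumes that the B and W override operators described under Labels can be applied here in the same postfix form as in the MOV SI,XXX W example.

    MOV AL,BL       ; both operands are now byte registers
    MOV AL,WVAR B   ; assumed intent: read the first (low-order) byte of WVAR as a byte
    MOV LAB W,5     ; assumed intent: treat LAB as a word variable and store 5 there
    MOV BX,1024     ; 1024 does not fit in a byte, so use a word register

In each case the change makes the operand types agree, or states the intended memory size explicitly, so that the assembler can select a single opcode.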