LEXICAL ELEMENTS OF PROGRAMMING LANGUAGES The lexical elements of a programming language consist of the following: the character set; the rules for grouping characters into words (or lexems); the use of reserved words or keywords; the manner in which blanks and line termination characters are handled; How comments are written. 1. Character set (vocabulary /alphabet): Two approaches are taken when deciding on the character set of a programming language: a) Choose all the characters deemed necessary Examples: APL, ALGOL 90 Drawbacks: b) - use special I/O equipments or - make changes to the published language when it is used on a computer. Choose only the characters commonly available with current I/O devices Examples: FORTRAN, COBOL, PASCAL - The most commonly used character sets are: the ASCII character set and the EBCDIC character set. - Nonprintable characters may be represented by their decimal equivalent or some other form Examples: escape sequences in C, and EOF for end of file character. In the definition of programming languages, characters that are treated identically by the compiler are often grouped into classes. Examples: Digits = (0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9) Letters = (A | B | C | ... | Z) White spaces = blanks, tabs, carriage return . . . etc A class may consist of a single character. Examples: 2. 3. PLUS is + EQUAL is = Any character not contained in a class is ignored by the scanner. It is also possible to define a class of illegal characters to force the scanner to produce an error message when an illegal character is encountered. White Spaces: - The class name given to spaces (blanks), tabs, carriage return, new line, and form feed. - The ways White spaces are handled by programming languages vary. - Some languages (C language) use them only to indicate where lexems start and end. Comments: any sequence of characters used to clarify a program code. 4. Tokens: - Are the classes (sets) of words (lexems) that are allowed in the text of a program written in a programming language. - The following are the tokens in most programming languages: Identifiers: are names given to the programming language objects by the programmer. Examples: num1, _sum in the C/C++ language. Keywords: are words with an assigned meaning in a language Examples: main, auto, if, while in the C/C++ language Reserved word: words that should not be used as a user’s identifier. In C/C++, every keyword is a reserved word; but in languages such as FORTRAN, keywords REAL and INTEGER are not reserved words. Constants: are words representing fixed numeric or character values: Examples: integer constants, character constants, real constants. String-literals: Examples: quoted strings in the C/C++ language. Operators: Examples: + (plus), - (minus), * (multiplication), / (division), % (modulus), = (Assignment) in the C/C++ language Punctuators: also known as separators. Examples: { } [ ] , : ; = ( ) * # in the C/C++ language. 5. The rules for grouping characters (in a program source code) into words (lexems). These rules are usually specified using regular expressions or context-free grammars Strings Given an alphabet V, a string over V is a finite sequence of characters of V. Examples 1. 01101 , 1010001 , 1111, 0 2. Jo , sum , number are strings over the alphabet {0 , 1} are strings over the roman alphabet The string with no symbol is called the empty string or null string It is denoted by: or The concatenation of strings s and t (over the same alphabet) is denoted by: s.t or st It is the string consisting of the characters of string s followed by those of string t. Examples: 011.001 = 011001; jo.ann = joann Given a string w and a natural number n ≥ 0, w0 = wn = w.w.w.w . . . w do0 = do1 = do do3 = dododo Examples: and (n times) for (n >0) Operations on Sets of Strings (Languages) L , L1, and L2 are languages over the same alphabet. S The union of L1 and L2 Denoted by L1 L2 or L1 | L2 is the language that contains all the strings that belong in either L1 or L2 S The intersection of L1 and L2 Denoted by L1 L2 is the language that contains all the strings that belong to both L1 and L2 S The concatenation of L1 and L2 Denoted by L1.L2 or L1L2 is the language that contains all the strings of the form uv where u belongs to L1 and v belongs to L2 . S The complement of L denoted by Not(L) is the language that contains all the strings that do not belong to L. S The closure or Kleene star of L denoted by L* is the language obtained by concatenation of zero, one or more strings of L. L* = { } L L2 L3 . . . . Note L+ = LL* = L L2 L3 . . . . Examples Let S = {a, aa, aaa } , T = {bb, bbb} , P = { ab, bb, ba} S T= {a, aa, aaa, bb, bbb} T P= { bb } S.T= {abb, abbb, aabb, aabbb, aaabb, aaabbb} P*= {, ab, bb, ba, abab, abbb, abba, bbab, bbbb, bbba, baab, babb, baba, . . . } P+ = {ab, bb, ba, abab, abbb, abba, bbab, bbbb, bbba, baab, babb, baba, . . . } Regular Expressions Regular expressions are formal systems that can be used to specify a class of languages (set of strings) called regular sets They are defined as follows: is a regular expression denoting the empty set is a regular expression denoting the set { } Every character a in the alphabet is a regular expression denoting the set {a} If A and B are regular expressions, then: AB is a regular expression denoting the set L(A).L(B) A|B is a regular expression denoting the set L(A) | L(B) A* is a regular expression denoting the set L(A)* Examples: for the following examples, the alphabet is {0,1} 1. ( 0 | 1)* defines the set of all strings of 0's and 1's 2. (0 | 1)*0(0 | 1)* defines the set of all strings of 0's and 1's with at least one 0. 3. 1*01*01* defines the set of all strings with exactly two 0's 4. 0*1* = defines the set of all strings consisting of the NULL string and any string of 0's followed by a string of 1's. Notes: 1. Parentheses may be used to clarify the meaning of a regular expression 2. ( , ) , | , * , and are called meta-symbols of the language. If one of these symbols is in the alphabet of a language, it must be quoted. 3. Any finite set of string { s1 , s2 , s3 , s4 , . . . , sk } regular expression s1 | s2 | s3 | s4 | . . . |sk Can be represented by the Examples: The set of digits { 0 , 1, 2 , 3 , 4 , 5 , 6 , 7 , 8 , 9 } is represented by D = The set { 00 , 11 } O|1|2|3|4|5|6|7|8|9 is represented by P = 00|11 Notations If A is a regular expression, then: - A+ = A A* - Not(A) is the regular expression denoting the language - Ak is the regular expression representing the language L(A)k Not(L(A)) Exercises 1. Give regular expressions for the following languages: a) Binary strings ending in 01. b) Decimal integers divisible by 5 c) Binary strings consisting of either an odd number of 1s or an odd number of 0s. d) Binary strings containing the string 010 e) binary strings that do not contain the string 010 2. 3. Given the alphabet languages: = { a, b}. Give the regular expressions for the following a) all strings with no more that three a’s b) all strings with the number of a’s divisible by three. c) all strings with exactly one occurrence of the substring aaa Which of the following are true? Explain. a) baa a* b* a* b* b) (b* a*) (a* b*) = a* | b* c) (a* b*) (b* a*) = d) abcd (a (cd)* b)* 4. What is the language represented by each of the following regular expressions? a) ((a* a) b) | b b) (1 | 01)+ c) 1* (0 |1)* 0+ d) (0 | 1)* 011(0 | 1)* 5. Which of the following strings belong to the language described by the regular expression 0* | (0* 1 0* 1 0* 1 0*)*? a) 1001101 f) 1001001 b) 000 g) 111 c) 110101 h) 11101101 d) 10001000 i) 11000101 e) 011001000111 Using regular Expressions to Specify Tokens and Other Words of a Programming Language Let D = 2. A positive or negative integer. 3. A fixed point decimal constant that requires a digit on both side of the decimal point: Lit 4. 0|1|2|3|4|5|6|7|8|9 and = = A|B|C|D|E|F| . . . |Z Int = ( | + | -)D+ D+.D+ A Pascal language real constant is either a string of digits or fixed point decimal constant: Pascal-Real = = 5. L D+ | ( D+ . D+) D+( | .D+) A C language floating point constant is a string of digits or requires digits on either side of the decimal point: C-Float = D+ | (D+ .) | (. D+) | (D+ . D+) = (D+(| .)) | (D* . D+) 6. A Pascal language identifier is composed of letters and digits, with the first character a letter: Pascal-ID 7. = L+(L | D)* = L (L | D)* A C language identifier is composed of letters, digits and the underscore, with the first character a letter or the underscore: C-ID = (L | _ )(L | _ | D)* A C language comment 9. A comment that begins with // and ends with the end of the line (EOL) Comment 10. C-Comment = /’*’(Not(‘*’/))*’*’/ 8. = //(Not(EOL)*EOL An identifier composed of letters, digits and the underscore, begin with a letter, end with a letter or a digit, and contain no consecutive underscores ID = L (L | D | _ (L | D))* = L(L | D)* ( _ (L | D)+)*) Note: The set { [i ]i | i 1 } = { [ ] , [[ ]] , [[[ ]]] , . . . } of balanced brackets is not regular: It cannot be specified using a regular expression. Exercise Note: You may use the following regular expressions D, Z, and L to write your answers. D = (0 | 1 | 2 | 3 | 4 | . . . | 9), Z = (1 | 2 | 3 | 4 | . . . | 9), and L = (A | B | C | . . . | Z). 1. Give a regular expression that describes the set of identifiers consisting of the letters and digits, and that start with a letter, and end with a digit. 2. Give a regular expression that describes the set of identifiers consisting of the letters, the decimal digits, and the underscore characters, and that start with a letter, and with no consecutive underscore. 3. Give a regular expression that describes the multiples of 100. 4. Give a regular expression that describes the set of identifiers composed of letters, digits, and the underscore, that begins with a letter or the underscore, and end with a letter. 5. Give a regular expression that describes C-like fixed-decimal constants with no superfluous leading or trailing zeros. (Note that a digit is not required on either side of the decimal point). That is 0.0, .25, 30. , 123.01, and 123005.0 are legal, but 00.0, 001.000 and 0002345.100 are illegal. 6. Describe in English the language generated by the regular expression: L+ ( D | L | _(D | L) )* D. Language Recognition Devices A language recognition device for a language (set of strings) described using a regular expression is an algorithm such that given the characters of a string s (one at a time from left to right) as input, it outputs "yes" if the string s belongs to the language, and "no" otherwise. It is in general specified using Finite state automata, or implemented in the code of a program. It is sometime necessary to add to a language recognition device the possibility to reconstruct the string from the characters read from the input. REMARKS The fact that reserved words look like identifiers in most programming languages complicates the specification of the tokens of a language using regular expressions. A simple solution is to treat reserved words as identifiers in the specification by regular expressions, and have a table of reserved words that is searched every time an identifier is recognized. If the identifier is in the table, then it is a reserved word; otherwise it is an identifier.