Lexical Elements

LEXICAL ELEMENTS OF PROGRAMMING LANGUAGES

The lexical elements of a programming language consist of the following:

the character set;

the rules for grouping characters into words (or lexemes);

the use of reserved words or keywords;

the manner in which blanks and line termination characters are handled;

the manner in which comments are written.
1. Character set (vocabulary / alphabet):
Two approaches are taken when deciding on the character set of a programming language:
a) Choose all the characters deemed necessary.
Examples: APL, ALGOL 60
Drawbacks: an implementation must either
- use special I/O equipment, or
- make changes to the published language when it is used on a computer.
b) Choose only the characters commonly available with current I/O devices.
Examples: FORTRAN, COBOL, PASCAL
- The most commonly used character sets are the ASCII character set and the EBCDIC character set.
- Nonprintable characters may be represented by their decimal equivalent or some other form.
Examples: escape sequences in C, and EOF for the end-of-file character.

In the definition of programming languages, characters that are treated identically
by the compiler are often grouped into classes.
Examples:

Digits = (0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9)
Letters = (A | B | C | ... | Z)
White spaces = blanks, tabs, carriage returns, etc.
A class may consist of a single character.
Examples:
PLUS is +
EQUAL is =

Any character not contained in a class is ignored by the scanner.

It is also possible to define a class of illegal characters to force the scanner to
produce an error message when an illegal character is encountered.
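The grouping of characters into classes can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names `classify` and `classes_of` are hypothetical, the Letters class is extended to lowercase letters, and the particular set of illegal characters is an arbitrary choice made for the example.

```python
import string

# Character classes, following the examples in the text.
DIGITS = set(string.digits)
LETTERS = set(string.ascii_letters)      # assumption: lowercase letters included
WHITE_SPACE = set(" \t\r\n\f")
SINGLE = {"+": "PLUS", "=": "EQUAL"}     # classes made of a single character
ILLEGAL = set("@$`")                     # hypothetical class of illegal characters


def classify(ch):
    """Return the class name of ch, or None if ch belongs to no class."""
    if ch in DIGITS:
        return "DIGIT"
    if ch in LETTERS:
        return "LETTER"
    if ch in WHITE_SPACE:
        return "WHITE_SPACE"
    if ch in SINGLE:
        return SINGLE[ch]
    if ch in ILLEGAL:
        # An explicit illegal class forces an error message.
        raise ValueError(f"illegal character: {ch!r}")
    return None  # not in any class: the scanner ignores it


def classes_of(text):
    """Classify every character of text, dropping the ignored ones."""
    return [c for c in map(classify, text) if c is not None]


print(classes_of("a1 +="))
```

A character such as `~`, which is in no class here, is silently skipped, while `@` raises an error because it was placed in the illegal class.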
2. White Spaces:
- The class name given to spaces (blanks), tabs, carriage returns, newlines, and form feeds.
- The ways white spaces are handled by programming languages vary.
- Some languages (such as C) use them only to indicate where lexemes start and end.
3. Comments: any sequence of characters used to clarify program code.
4. Tokens:
- Tokens are the classes (sets) of words (lexemes) that are allowed in the text of a program written in a programming language.
- The following are the tokens in most programming languages:
Identifiers: names given to the programming language objects by the programmer.
Examples: num1, _sum in the C/C++ language.
Keywords: words with an assigned meaning in a language.
Examples: auto, if, while in the C/C++ language.
Reserved words: words that may not be used as user-defined identifiers.
In C/C++, every keyword is a reserved word; but in languages such as FORTRAN, the keywords REAL and INTEGER are not reserved words.
Constants: words representing fixed numeric or character values.
Examples: integer constants, character constants, real constants.
String literals:
Examples: quoted strings in the C/C++ language.
Operators:
Examples: + (plus), - (minus), * (multiplication), / (division), % (modulus), = (assignment) in the C/C++ language.
Punctuators: also known as separators.
Examples: { } [ ] , : ; = ( ) * # in the C/C++ language.
5. The rules for grouping characters (in a program's source code) into words (lexemes).
These rules are usually specified using regular expressions or context-free grammars.
Strings
- Given an alphabet V, a string over V is a finite sequence of characters of V.
Examples:
1. 01101, 1010001, 1111, 0 are strings over the alphabet {0, 1}
2. Jo, sum, number are strings over the roman alphabet
- The string with no symbol is called the empty string or null string.
It is denoted by: ε or λ
- The concatenation of strings s and t (over the same alphabet) is denoted by: s.t or st
It is the string consisting of the characters of string s followed by those of string t.
Examples:
011.001 = 011001;
jo.ann = joann
- Given a string w and a natural number n ≥ 0,
w^0 = ε and
w^n = w.w.w.w . . . w (n times) for n > 0
Examples:
do^0 = ε
do^1 = do
do^3 = dododo
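The string operations above can be checked with a small Python sketch. The helper names `concat` and `power` are hypothetical and chosen only for this illustration; Python's own `+` and `*` on strings behave the same way.

```python
EMPTY = ""  # the empty (null) string


def concat(s, t):
    """Concatenation s.t: the characters of s followed by those of t."""
    return s + t


def power(w, n):
    """w^n: the empty string for n == 0, otherwise w concatenated n times."""
    if n < 0:
        raise ValueError("n must be a natural number")
    result = EMPTY
    for _ in range(n):
        result = concat(result, w)
    return result


print(concat("011", "001"))   # 011001
print(concat("jo", "ann"))    # joann
print(repr(power("do", 0)))   # ''
print(power("do", 3))         # dododo
```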
Operations on Sets of Strings (Languages)
L, L1, and L2 are languages over the same alphabet.
- The union of L1 and L2, denoted by L1 ∪ L2 or L1 | L2, is the language that contains all the strings that belong to either L1 or L2.
- The intersection of L1 and L2, denoted by L1 ∩ L2, is the language that contains all the strings that belong to both L1 and L2.
- The concatenation of L1 and L2, denoted by L1.L2 or L1L2, is the language that contains all the strings of the form uv where u belongs to L1 and v belongs to L2.
- The complement of L, denoted by Not(L), is the language that contains all the strings that do not belong to L.
- The closure or Kleene star of L, denoted by L*, is the language obtained by concatenation of zero, one or more strings of L:
L* = {ε} ∪ L ∪ L^2 ∪ L^3 ∪ . . .
Note:
L+ = LL* = L ∪ L^2 ∪ L^3 ∪ . . .
Examples:
Let S = {a, aa, aaa}, T = {bb, bbb}, P = {ab, bb, ba}
S ∪ T = {a, aa, aaa, bb, bbb}
T ∩ P = {bb}
S.T = {abb, abbb, aabb, aabbb, aaabb, aaabbb}
P* = {ε, ab, bb, ba, abab, abbb, abba, bbab, bbbb, bbba, baab, babb, baba, . . . }
P+ = {ab, bb, ba, abab, abbb, abba, bbab, bbbb, bbba, baab, babb, baba, . . . }
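The examples above can be reproduced with Python sets. This is a minimal sketch: the function names are hypothetical, and because L* is infinite whenever L contains a non-empty string, `star_up_to` computes only the finite part of the closure up to a chosen power k. The complement is left out for the same reason: it is an infinite set.

```python
from itertools import product


def concat(l1, l2):
    """L1.L2: every string uv with u in L1 and v in L2."""
    return {u + v for u, v in product(l1, l2)}


def power(lang, k):
    """L^k: L concatenated with itself k times (L^0 is {empty string})."""
    result = {""}
    for _ in range(k):
        result = concat(result, lang)
    return result


def star_up_to(lang, k):
    """Finite approximation of L*: the union of L^0, L^1, ..., L^k."""
    result = set()
    for i in range(k + 1):
        result |= power(lang, i)
    return result


S = {"a", "aa", "aaa"}
T = {"bb", "bbb"}
P = {"ab", "bb", "ba"}

print(sorted(S | T))            # union
print(sorted(T & P))            # intersection
print(sorted(concat(S, T)))     # concatenation
print(sorted(star_up_to(P, 2), key=lambda s: (len(s), s)))
```

With k = 2, `star_up_to(P, 2)` contains the empty string, the three strings of P, and the nine strings of length four listed in the example for P*.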
Regular Expressions

Regular expressions are formal systems that can be used to specify a class of languages (sets of strings) called regular sets.

They are defined as follows:
- ∅ is a regular expression denoting the empty set
- ε is a regular expression denoting the set {ε}
- Every character a in the alphabet is a regular expression denoting the set {a}
- If A and B are regular expressions, then:
AB is a regular expression denoting the set L(A).L(B)
A|B is a regular expression denoting the set L(A) | L(B)
A* is a regular expression denoting the set L(A)*
Examples: for the following examples, the alphabet is {0, 1}
1. (0 | 1)* defines the set of all strings of 0's and 1's.
2. (0 | 1)*0(0 | 1)* defines the set of all strings of 0's and 1's with at least one 0.
3. 1*01*01* defines the set of all strings with exactly two 0's.
4. 0*1* defines the set consisting of the null string ε and any string of 0's followed by a string of 1's (either part may be empty).
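The four examples can be tried out with Python's `re` module, used here as a stand-in for the formal notation: it offers many more operators than the definition above, but `|`, `*`, and parentheses have the same meaning. The dictionary keys and the helper `matches` are hypothetical names chosen for this sketch; `re.fullmatch` is used because membership in the language requires the whole string to match.

```python
import re

# The four example expressions over the alphabet {0, 1}.
EXAMPLES = {
    "any": r"(0|1)*",
    "at_least_one_0": r"(0|1)*0(0|1)*",
    "exactly_two_0s": r"1*01*01*",
    "0s_then_1s": r"0*1*",
}


def matches(name, s):
    """True if the whole string s belongs to the named language."""
    return re.fullmatch(EXAMPLES[name], s) is not None


for name in EXAMPLES:
    for s in ["", "111", "1101", "10101", "0011", "10"]:
        print(f"{name:16} {s!r:9} {matches(name, s)}")
```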
Notes:
1. Parentheses may be used to clarify the meaning of a regular expression.
2. ( , ) , | , * , and ε are called meta-symbols of the language. If one of these symbols is in the alphabet of a language, it must be quoted.
3. Any finite set of strings { s1 , s2 , s3 , s4 , . . . , sk } can be represented by the regular expression s1 | s2 | s3 | s4 | . . . | sk
Examples:
The set of digits { 0 , 1 , 2 , 3 , 4 , 5 , 6 , 7 , 8 , 9 } is represented by D = 0|1|2|3|4|5|6|7|8|9
The set { 00 , 11 } is represented by P = 00|11
Notations
If A is a regular expression, then:
- A+ = A A*
- Not(A) is the regular expression denoting the language Not(L(A))
- A^k is the regular expression representing the language L(A)^k
Exercises
1. Give regular expressions for the following languages:
a) Binary strings ending in 01.
b) Decimal integers divisible by 5.
c) Binary strings consisting of either an odd number of 1s or an odd number of 0s.
d) Binary strings containing the string 010.
e) Binary strings that do not contain the string 010.
2. Given the alphabet Σ = {a, b}, give the regular expressions for the following languages:
a) all strings with no more than three a's
b) all strings with the number of a's divisible by three
c) all strings with exactly one occurrence of the substring aaa
3. Which of the following are true? Explain.
a) baa ∈ a* b* a* b*
b) (b* a*) ∩ (a* b*) = a* | b*
c) (a* b*) ∩ (b* a*) = ∅
d) abcd ∈ (a (cd)* b)*
4. What is the language represented by each of the following regular expressions?
a) ((a* a) b) | b
b) (1 | 01)+
c) 1* (0 | 1)* 0+
d) (0 | 1)* 011(0 | 1)*
5. Which of the following strings belong to the language described by the regular expression 0* | (0* 1 0* 1 0* 1 0*)*?
a) 1001101
b) 000
c) 110101
d) 10001000
e) 011001000111
f) 1001001
g) 111
h) 11101101
i) 11000101
Using Regular Expressions to Specify Tokens and Other Words of a Programming Language
1. Let D = 0|1|2|3|4|5|6|7|8|9 and L = A|B|C|D|E|F| . . . |Z
2. A positive or negative integer:
Int = (ε | + | -)D+
3. A fixed point decimal constant that requires a digit on both sides of the decimal point:
Lit = D+.D+
4. A Pascal language real constant is either a string of digits or a fixed point decimal constant:
Pascal-Real = D+ | (D+ . D+)
            = D+(ε | .D+)
5. A C language floating point constant is either a string of digits or a decimal point with digits on at least one side:
C-Float = D+ | (D+ .) | (. D+) | (D+ . D+)
        = (D+(ε | .)) | (D* . D+)
6. A Pascal language identifier is composed of letters and digits, with the first character a letter:
Pascal-ID = L+(L | D)*
          = L (L | D)*
7. A C language identifier is composed of letters, digits and the underscore, with the first character a letter or the underscore:
C-ID = (L | _ )(L | _ | D)*
8. A C language comment:
C-Comment = /'*'(Not('*'/))*'*'/
9. A comment that begins with // and ends with the end of the line (EOL):
Comment = //(Not(EOL))*EOL
10. An identifier composed of letters, digits and the underscore, that begins with a letter, ends with a letter or a digit, and contains no consecutive underscores:
ID = L (L | D | _ (L | D))*
   = L(L | D)* ( _ (L | D)+)*
In items 8 and 9, Not(x) is used informally: the body of the comment is any text that does not contain x.
Note:
The set { [^i ]^i | i ≥ 1 } = { [ ] , [[ ]] , [[[ ]]] , . . . } of balanced brackets is not regular: it cannot be specified using a regular expression.
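The token definitions of this section can be transcribed into Python's `re` syntax and tested. The sketch below rests on several assumptions: the table name `TOKENS` and the helper `is_token` are hypothetical, L is extended to lowercase letters, EOL is taken to be the newline character, and the C comment body is written out as "any text that does not contain */", which is one common way to express that intent rather than a literal transcription of the Not(...) form.

```python
import re

D = r"[0-9]"
L = r"[A-Za-z]"   # assumption: lowercase letters are allowed as well

TOKENS = {
    "Int": rf"[+-]?{D}+",                        # (ε | + | -)D+
    "Lit": rf"{D}+\.{D}+",                       # D+.D+
    "Pascal-Real": rf"{D}+(\.{D}+)?",            # D+(ε | .D+)
    "C-Float": rf"({D}+\.?)|({D}*\.{D}+)",       # (D+(ε | .)) | (D* . D+)
    "Pascal-ID": rf"{L}({L}|{D})*",              # L (L | D)*
    "C-ID": rf"({L}|_)({L}|_|{D})*",             # (L | _)(L | _ | D)*
    "ID": rf"{L}({L}|{D})*(_({L}|{D})+)*",       # L(L | D)* ( _ (L | D)+)*
    "C-Comment": r"/\*([^*]|\*+[^*/])*\*+/",     # body never contains */
    "Comment": r"//[^\n]*\n",                    # // up to and including EOL
}


def is_token(name, s):
    """True if the whole string s is a word of the named token class."""
    return re.fullmatch(TOKENS[name], s) is not None


print(is_token("C-Float", "12."))        # digits before the point only
print(is_token("Pascal-Real", "12."))    # Pascal needs digits on both sides
print(is_token("ID", "a__b"))            # consecutive underscores
print(is_token("C-Comment", "/* a */ b */"))
```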
Exercise
Note: You may use the following regular expressions D, Z, and L to write your answers.
D = (0 | 1 | 2 | 3 | 4 | . . . | 9), Z = (1 | 2 | 3 | 4 | . . . | 9), and L = (A | B | C | . . . | Z).
1. Give a regular expression that describes the set of identifiers consisting of letters and digits that start with a letter and end with a digit.
2. Give a regular expression that describes the set of identifiers consisting of letters, decimal digits, and underscore characters that start with a letter and contain no consecutive underscores.
3. Give a regular expression that describes the multiples of 100.
4. Give a regular expression that describes the set of identifiers composed of letters, digits, and the underscore that begin with a letter or the underscore and end with a letter.
5. Give a regular expression that describes C-like fixed-decimal constants with no superfluous leading or trailing zeros. (Note that a digit is not required on either side of the decimal point.) That is, 0.0, .25, 30. , 123.01, and 123005.0 are legal, but 00.0, 001.000 and 0002345.100 are illegal.
6. Describe in English the language generated by the regular expression: L+ ( D | L | _(D | L) )* D.
Language Recognition Devices

A language recognition device for a language (set of strings) described using a regular
expression is an algorithm such that given the characters of a string s (one at a time from
left to right) as input, it outputs "yes" if the string s belongs to the language, and "no"
otherwise.

It is generally specified using finite-state automata, or implemented directly in the code of a program.

It is sometimes necessary to add to a language recognition device the ability to reconstruct the string from the characters read from the input.
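As an illustration, a recognition device for Int = (ε | + | -)D+ can be sketched as a small finite-state automaton in Python. The state names, the transition table, and the function `recognize` are hypothetical choices for this example. The device reads one character at a time from left to right, answers "yes" or "no", and also collects the characters accepted so far so that the string can be reconstructed.

```python
DIGITS = set("0123456789")

# Transition table of a finite-state automaton for Int = (ε | + | -) D+.
# States: "start", "signed" (a sign has been read), "digits" (accepting).
TRANSITIONS = {
    ("start", "sign"): "signed",
    ("start", "digit"): "digits",
    ("signed", "digit"): "digits",
    ("digits", "digit"): "digits",
}
ACCEPTING = {"digits"}


def char_class(ch):
    """Map a character to the class used by the transition table."""
    if ch in DIGITS:
        return "digit"
    if ch in ("+", "-"):
        return "sign"
    return "other"


def recognize(chars):
    """Read chars one at a time; return (answer, characters accepted so far)."""
    state = "start"
    lexeme = []
    for ch in chars:
        state = TRANSITIONS.get((state, char_class(ch)))
        if state is None:            # no transition: reject immediately
            return "no", "".join(lexeme)
        lexeme.append(ch)
    answer = "yes" if state in ACCEPTING else "no"
    return answer, "".join(lexeme)


print(recognize("-42"))
print(recognize("+"))
print(recognize("4a2"))
```

A table-driven design is used here because adding a token only changes the table; the same automaton could equally be hand-coded as nested conditionals.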
REMARKS

The fact that reserved words look like identifiers in most programming languages
complicates the specification of the tokens of a language using regular expressions.

A simple solution is to treat reserved words as identifiers in the specification by regular
expressions, and have a table of reserved words that is searched every time an identifier is
recognized. If the identifier is in the table, then it is a reserved word; otherwise it is an
identifier.
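The table lookup described above might look like the following Python sketch. The reserved-word table is a small illustrative subset of the C reserved words, and the names `RESERVED_WORDS`, `IDENTIFIER`, and `classify_word` are hypothetical.

```python
import re

# Illustrative subset of the C reserved words (not the full list).
RESERVED_WORDS = {"auto", "else", "if", "int", "return", "while"}

# Reserved words are specified as ordinary identifiers: (L | _)(L | _ | D)*
IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z_0-9]*")


def classify_word(word):
    """Classify a word that the scanner has recognized as an identifier."""
    if IDENTIFIER.fullmatch(word) is None:
        raise ValueError(f"not an identifier: {word!r}")
    # Search the table every time an identifier is recognized.
    if word in RESERVED_WORDS:
        return "RESERVED_WORD"
    return "IDENTIFIER"


for w in ["while", "while1", "_sum", "if"]:
    print(w, classify_word(w))
```

A set is used for the table so that each lookup takes constant time on average, which matters because the search runs for every identifier in the program.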