Role of Lexical Analyser:

- It is the first phase of a compiler.
- Its main task is to read the input characters and produce the sequence of tokens that
the parser uses for syntax analysis.
- It performs two tasks: separating the various tokens out of the input and associating
error messages with the input.
- Sometimes a lexical analyzer is divided into two phases: scanning and lexical analysis.
Issues with Lexical Analysis:
- If the parser had to read the input, identify the tokens and also parse them, the
parser’s task would be complex. Hence a separate lexical analysis phase is placed before
parsing to do this work.
- The compiler’s efficiency increases.
- The compiler’s portability increases: input-related problems are dealt with by the
lexical analyzer alone.
Tokens, Patterns, Lexemes
Token
A lexical token is a sequence of characters that can be treated as a unit in the grammar of
a programming language.
Examples of tokens:
- Type tokens (id, num, real, . . . )
- Keyword (alphabetic) tokens (IF, void, return, . . . )
- Punctuation tokens (such as ; and ,)
Examples of non-tokens:
- Comments, preprocessor directives, macros, blanks, tabs, newlines, . . .
Patterns
There is a set of strings in the input for which the same token is produced as output. This
set of strings is described by a rule called a pattern associated with the token.
Regular expressions are an important notation for specifying patterns.
For example, the pattern for the Pascal identifier token, id, is: id → letter (letter | digit)*.
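The identifier pattern can be tried out directly as a regular expression. The following is a minimal sketch in Python, assuming that "letter" means A-Z or a-z and "digit" means 0-9; the names ID_PATTERN and is_identifier are made up for this illustration.

```python
import re

# letter (letter | digit)* written as a regular expression.
# Assumption: "letter" is A-Z or a-z and "digit" is 0-9.
ID_PATTERN = re.compile(r"[A-Za-z][A-Za-z0-9]*")


def is_identifier(text):
    """Return True when the whole string matches the id pattern."""
    return ID_PATTERN.fullmatch(text) is not None


for candidate in ["a", "count1", "x2y", "1abc", ""]:
    print(repr(candidate), is_identifier(candidate))
```

The first three strings match the pattern; "1abc" is rejected because it starts with a digit, and the empty string is rejected because at least one letter is required.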
Lexeme
A lexeme is a sequence of characters in the source program that is matched by the pattern
for a token.
For example, the pattern for the RELOP token matches six lexemes (=, <>, <, <=, >, >=),
so the lexical analyzer should return a RELOP token to the parser whenever it sees any
one of the six.
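As a rough sketch of how one token can stand for several lexemes, the function below returns the same token name, RELOP, for each of the six operators. The function name match_relop and the (token, lexeme) pair it returns are assumptions made for this illustration. The two-character lexemes are listed first so that <= is not read as < followed by =.

```python
# The six lexemes of RELOP; two-character lexemes come first so the
# longest possible lexeme is always preferred.
RELOP_LEXEMES = ["<=", ">=", "<>", "=", "<", ">"]


def match_relop(text, pos=0):
    """Return ("RELOP", lexeme) if a relational operator starts at pos, else None."""
    for lexeme in RELOP_LEXEMES:
        if text.startswith(lexeme, pos):
            return ("RELOP", lexeme)
    return None


print(match_relop("a<=b", 1))  # ('RELOP', '<=')
print(match_relop("a<b", 1))   # ('RELOP', '<')
print(match_relop("abc", 1))   # None
```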
Consider the following line of Pascal text:
1:if a=0 then a:=b; (* entry := default *)
It contains 12 lexemes: 1, :, if, a, =, 0, then, a, :=, b, ;, and (* entry := default *).
The last lexeme matches the pattern for comments and does not have a corresponding token.
The tokens for the other lexemes might be as follows: label, colon, keyword, id, equal,
integer, keyword, id, assign, id, and semicolon.
We divide the task of the lexer into two subtasks: scanning and conversion. Scanning refers
to the act of identifying a source character sequence as an instance of a class of tokens,
i.e., finding a lexeme. Conversion refers to the act of obtaining the representation and
attributes of the particular lexeme that has been identified.
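The Pascal line above can be run through a small lexer to check the counts. This is only a sketch under simplifying assumptions: it knows just the keywords if and then, it treats an integer directly followed by a colon as a label, and it places the keyword and label decisions in the conversion step. The names TOKEN_SPEC, scan and convert are invented for the example.

```python
import re

KEYWORDS = {"if", "then"}  # assumption: only the keywords the example needs

# Ordered so that ":=" is tried before ":".
TOKEN_SPEC = [
    ("comment", r"\(\*.*?\*\)"),
    ("space", r"\s+"),
    ("integer", r"\d+"),
    ("id", r"[A-Za-z][A-Za-z0-9]*"),
    ("assign", r":="),
    ("colon", r":"),
    ("equal", r"="),
    ("semicolon", r";"),
]
MASTER = re.compile(
    "|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC), re.DOTALL
)


def scan(source):
    """Scanning: find each lexeme and the pattern it matches."""
    pos = 0
    while pos < len(source):
        m = MASTER.match(source, pos)
        if m is None:
            raise ValueError(f"unexpected character {source[pos]!r} at {pos}")
        if m.lastgroup != "space":  # blanks are non-tokens and are skipped
            yield m.lastgroup, m.group()
        pos = m.end()


def convert(lexemes):
    """Conversion: build the (token, lexeme) pairs handed to the parser."""
    lexemes = list(lexemes)
    tokens = []
    for i, (kind, text) in enumerate(lexemes):
        if kind == "comment":
            continue  # matches a pattern but has no corresponding token
        if kind == "id" and text.lower() in KEYWORDS:
            kind = "keyword"
        elif kind == "integer" and i + 1 < len(lexemes) and lexemes[i + 1][0] == "colon":
            kind = "label"  # simplification: an integer before ":" is a label
        tokens.append((kind, text))
    return tokens


line = "1:if a=0 then a:=b; (* entry := default *)"
lexemes = list(scan(line))
tokens = convert(lexemes)
print(len(lexemes))                    # 12
print([kind for kind, _ in tokens])
```

The printed token list has 11 entries, one for every lexeme except the comment, in the order given in the text.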


Tokens are the terminal symbols of the context-free grammar (CFG).
So when a lexeme such as pi is identified by the lexer as a token, it is sent to the
parser in the form of a token named id.
Attributes of the Tokens:
When more than one lexeme matches the same pattern, the token name alone does not say which
lexeme was seen. For example, if the source program contains many constant numbers that
all match the pattern num, then we have to associate the value of each number with its
num token in order to identify them separately.
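A minimal sketch of this idea follows, assuming a hypothetical Token record with a name and one value attribute. The regular expression used for num (a run of decimal digits) is also an assumption, chosen only to keep the example short.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class Token:
    name: str                    # token name seen by the parser, e.g. "num"
    value: Optional[int] = None  # attribute that tells num tokens apart


def num_tokens(source):
    """Return one num token per run of digits, carrying its numeric value."""
    return [Token("num", int(m.group())) for m in re.finditer(r"\d+", source)]


tokens = num_tokens("a := 10 + 250 - 7")
print([t.name for t in tokens])   # ['num', 'num', 'num']
print([t.value for t in tokens])  # [10, 250, 7]
```

The parser sees three identical num tokens; only the value attribute distinguishes 10, 250 and 7 for the later phases.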
Lexical Errors:
Input buffering: