Uploaded by St.Bontu Girma

Chapter Two : Lexical Analysis
o Token
o Language
o Regular Expressions and Finite Automata
o Conversion RE-NFA-DFA
o Lexical Analyzer Generator
Prepared by Bontu G.: 2014 E.C
o Lexical analysis is the first phase of a compiler.
o The role of the lexical analyzer is to read a sequence
of characters from the source program and produce
tokens to be used by the parser.
o The lexical analyzer breaks the source code into a
series of tokens, discarding any whitespace and
comments along the way.
o The main task of the lexical analyzer is to read the
input characters of the source program, group them
into lexemes, and produce as output one token
for each lexeme in the source program.
o The stream of tokens is sent to the parser for syntax
analysis.
o The lexical analyzer also interacts with the symbol
table. For example, when it discovers a lexeme
constituting an identifier, it needs to enter that
lexeme into the symbol table.
o In addition to identifying lexemes, the lexical
analyzer performs the following tasks:
  o Stripping out comments and whitespace (blank, newline,
  and tab)
  o Correlating error messages generated by the compiler
  with the source program by keeping track of line
  numbers (counting newline characters)
  o Expanding macros (in some lexical analyzers)
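
The tasks above can be sketched as a small scanner. This is a minimal illustration, not the slides' own implementation: the token names, the `//` comment syntax, and the helper `tokenize` are all assumptions made for the example. It discards whitespace and comments and counts newlines so that an error can be reported with its line number.

```python
import re

# Hypothetical token patterns for a toy language (not taken from the slides).
TOKEN_SPEC = [
    ("COMMENT", r"//[^\n]*"),            # line comment, discarded
    ("NEWLINE", r"\n"),                  # used only to count lines
    ("SKIP",    r"[ \t]+"),              # blanks and tabs, discarded
    ("NUM",     r"[0-9]+"),
    ("ID",      r"[A-Za-z][A-Za-z0-9]*"),
    ("OP",      r"[=+\-*/;()]"),
]
MASTER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))

def tokenize(source):
    """Yield (token_name, lexeme, line_number) triples."""
    line = 1
    pos = 0
    while pos < len(source):
        m = MASTER.match(source, pos)
        if m is None:
            # No pattern matches here: report the error with its line number.
            raise SyntaxError(f"line {line}: unexpected character {source[pos]!r}")
        kind = m.lastgroup
        if kind == "NEWLINE":
            line += 1
        elif kind not in ("SKIP", "COMMENT"):
            yield (kind, m.group(), line)
        pos = m.end()
```

Calling `list(tokenize("x = 1 // init\ny = x + 2\n"))` yields eight tokens; the comment and all whitespace are gone, and the tokens from the second line carry line number 2.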
Tokens, Patterns, and Lexemes
o A token is a pair consisting of a token name and an optional attribute
value. The token name is an abstract symbol representing a kind of
lexical unit, e.g., a keyword, an identifier, etc. The token names are the
input symbols that the parser processes. In a programming language,
keywords, operators, identifiers, constants, literals, and punctuation
symbols are tokens.
o A lexeme is a sequence of characters in the source program that
matches the pattern for a token. There are predefined rules that a lexeme
must satisfy to be identified as a valid token. These rules are expressed
by means of a pattern.
• A pattern is a description of the form that the lexemes of a token may take.
In the case of a keyword as a token, the pattern is the sequence of characters
that forms the keyword. For identifiers and some other tokens, the pattern
is a more complex structure that is matched by many strings.
• In a programming language, keywords, constants, identifiers, strings,
numbers, operators, and punctuation symbols can be considered
tokens.
Token   Some lexemes                    Informal pattern
begin   begin, Begin, BEGIN, beGin, …   "begin" in small or capital letters
if      if, IF, iF, If                  "if" in small or capital letters
ident   Distance, F1, x, Dist1, …       Letter followed by zero or more letters and/or digits
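
The table can be read as a mapping from patterns to token names. The sketch below is one possible encoding using Python's `re` module; the helper `classify` and the choice to try keywords before `ident` are assumptions for illustration, since the table itself does not say how overlaps are resolved.

```python
import re

# Patterns transcribed from the table; token names are the table's own.
# Keywords are listed first so that "begin" is not classified as ident
# (this priority rule is an assumption of the sketch).
PATTERNS = [
    ("begin", re.compile(r"begin", re.IGNORECASE)),
    ("if",    re.compile(r"if", re.IGNORECASE)),
    ("ident", re.compile(r"[A-Za-z][A-Za-z0-9]*")),
]

def classify(lexeme):
    """Return the first token whose pattern matches the whole lexeme, else None."""
    for token, pattern in PATTERNS:
        if pattern.fullmatch(lexeme):
            return token
    return None
```

With this encoding, `classify("beGin")` gives `"begin"`, `classify("Dist1")` gives `"ident"`, and `classify("1x")` gives `None` because no pattern matches a lexeme that starts with a digit.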
Attributes of tokens
o When more than one lexeme matches a pattern, the
scanner must provide additional information about
the particular lexeme to the subsequent phases of
the compiler.
o For example, both 0 and 1 match the pattern for the
token num, but the code generator needs to know
which number was recognized.
• The lexical analyzer collects information about tokens into
their associated attributes.
• In practice, a token usually has a single attribute: a pointer
to the symbol-table entry in which the information about the
token is kept.
• The symbol-table entry contains information about the token
such as the lexeme, the line number on which it was first
seen, …
• For example, consider the statement x = y + 2.
The tokens and their attributes are written as:
<id, pointer to symbol-table entry for x>
<assign_op, >
<id, pointer to symbol-table entry for y>
<plus_op, >
<num, integer value 2>
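
A rough sketch of how such a token stream could be produced follows. The function `scan`, the list-based symbol table, and the use of an index in place of a pointer are assumptions made for the example; only the token names `id`, `assign_op`, `plus_op`, and `num` come from the slide.

```python
import re

def scan(source):
    """Return (tokens, symbol_table) for a tiny expression language."""
    symbol_table = []   # each entry is a dict holding at least the lexeme
    index_of = {}       # lexeme -> index of its symbol-table entry
    tokens = []
    for lexeme in re.findall(r"[A-Za-z][A-Za-z0-9]*|[0-9]+|[=+]", source):
        if lexeme == "=":
            tokens.append(("assign_op", None))      # no attribute needed
        elif lexeme == "+":
            tokens.append(("plus_op", None))
        elif lexeme.isdigit():
            tokens.append(("num", int(lexeme)))     # attribute: integer value
        else:
            if lexeme not in index_of:              # first time this identifier is seen
                index_of[lexeme] = len(symbol_table)
                symbol_table.append({"lexeme": lexeme})
            tokens.append(("id", index_of[lexeme]))  # attribute: table index
    return tokens, symbol_table
```

For `scan("x = y + 2")` the tokens are `("id", 0)`, `("assign_op", None)`, `("id", 1)`, `("plus_op", None)`, `("num", 2)`, and the symbol table holds one entry each for `x` and `y`.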
Errors
o Very few errors are detected by the lexical analyzer.
o For example, if the programmer mistypes while as wihle,
the lexical analyzer cannot detect the error (why?)
o Nonetheless, if a certain sequence of characters
matches none of the specified patterns, the lexical
analyzer can detect the error.
o When an error occurs, the lexical analyzer recovers by:
  o skipping (deleting) successive characters from the remaining
  input until the lexical analyzer can find a well-formed token
  (panic mode recovery)
  o deleting extraneous (unimportant) characters
  o inserting missing characters
  o replacing an incorrect character by a correct character
  o transposing two adjacent characters
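
Panic mode recovery, the first strategy in the list, might look like the sketch below. The token set and the function `scan_with_recovery` are hypothetical; the point is only that the scanner records the offending characters, skips them, and resumes at the next position where a token can start.

```python
import re

# Hypothetical token patterns for the illustration.
TOKEN = re.compile(r"(?P<num>[0-9]+)|(?P<id>[A-Za-z][A-Za-z0-9]*)|(?P<op>[=+\-*/])")

def scan_with_recovery(source):
    """Return (tokens, errors); errors are (position, skipped_text) pairs."""
    tokens, errors = [], []
    pos = 0
    while pos < len(source):
        if source[pos].isspace():
            pos += 1
            continue
        m = TOKEN.match(source, pos)
        if m:
            tokens.append((m.lastgroup, m.group()))
            pos = m.end()
            continue
        # Panic mode: delete characters until a well-formed token can start.
        start = pos
        while (pos < len(source) and not source[pos].isspace()
               and not TOKEN.match(source, pos)):
            pos += 1
        errors.append((start, source[start:pos]))
    return tokens, errors
```

On the input `"x = 3 @# y"` the scanner reports the skipped text `"@#"` at position 6 and still returns the four valid tokens around it.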
Specifying and recognizing tokens
o Regular expressions are used to specify the patterns of
tokens.
o Each pattern matches a set of strings.
Example
letter → A|B|C|…|Z|a|b|c|…|z
digit → 0|1|…|9
identifier → letter (letter|digit)*
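
The same regular definitions can be composed in Python's `re` syntax, as a quick way to experiment with them. The variable names mirror the slide; the helper `is_identifier` is an addition for the example.

```python
import re

# Regular definitions from the example, written as Python regex fragments.
letter = r"[A-Za-z]"
digit = r"[0-9]"
identifier = re.compile(f"{letter}({letter}|{digit})*")

def is_identifier(s):
    """True when the whole string s matches letter (letter|digit)*."""
    return identifier.fullmatch(s) is not None
```

Here `is_identifier("Dist1")` is true, while `is_identifier("1Dist")` and `is_identifier("")` are false, because an identifier must begin with a letter.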
Languages
o Alphabet: a finite set of symbols.
o String: a string over an alphabet is a finite sequence of symbols
from that alphabet, usually written next to one another and
not separated by commas.
o The terms sentence and word are also used as synonyms for string.
o ε is the empty string.
o |s| is the length of string s.
o Substring: z is a substring of w if z appears consecutively within w.
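
These definitions map directly onto Python strings, which can make them easier to check by hand. The helpers `is_string_over` and `substrings` are hypothetical names introduced for this sketch; the empty Python string `""` plays the role of ε.

```python
def is_string_over(s, alphabet):
    """True when every symbol of s belongs to the alphabet (a set of characters)."""
    return all(ch in alphabet for ch in s)

def substrings(w):
    """All substrings of w: every run of consecutive symbols, including ε."""
    return {w[i:j] for i in range(len(w) + 1) for j in range(i, len(w) + 1)}

EPSILON = ""   # the empty string, of length 0
```

For the alphabet `{"a", "b"}`, the string `"aba"` has length 3 and its substrings are ε, `a`, `b`, `ab`, `ba`, and `aba`.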
Operations on Languages
o Concatenation: L1L2 = { s1s2 | s1 ∈ L1 and s2 ∈ L2 }
o Union: L1 ∪ L2 = { s | s ∈ L1 or s ∈ L2 }
o Exponentiation: L^0 = {ε}, L^1 = L, L^2 = LL, and in general L^i = L^(i-1)L
o Kleene Closure: L* = L^0 ∪ L^1 ∪ L^2 ∪ …, which always includes the empty string
o Positive Closure: L+ = L^1 ∪ L^2 ∪ …, which includes the empty string only if L itself does
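
For finite languages these operations can be tried out with Python sets. The function names below are chosen for the illustration, and because the closures are usually infinite, the sketch only builds them up to a caller-supplied power `max_power`, which is an assumption of the example rather than part of the definitions.

```python
def concat(l1, l2):
    """L1L2: every string of l1 followed by every string of l2."""
    return {s1 + s2 for s1 in l1 for s2 in l2}

def union(l1, l2):
    return l1 | l2

def power(l, n):
    """L^n, with L^0 = {ε} (ε is the empty Python string)."""
    result = {""}
    for _ in range(n):
        result = concat(result, l)
    return result

def kleene_closure(l, max_power):
    """Finite approximation of L*: the union of L^0 .. L^max_power."""
    result = set()
    for i in range(max_power + 1):
        result |= power(l, i)
    return result

def positive_closure(l, max_power):
    """Finite approximation of L+: the union of L^1 .. L^max_power."""
    result = set()
    for i in range(1, max_power + 1):
        result |= power(l, i)
    return result
```

With `L = {"a", "b"}`, `power(L, 2)` is `{"aa", "ab", "ba", "bb"}`; the empty string appears in `kleene_closure(L, 2)` but not in `positive_closure(L, 2)`.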
Automata—What is it?
• An automaton is an abstract model of a digital computer.
• An automaton has a mechanism to read input, which is a
string over a given alphabet. This input is written on an
“input file”, which the automaton can read but cannot
change.
• The automaton may have a temporary “storage”
device with an unlimited number of cells,
the contents of which can be altered by the
automaton.
• The automaton has a control unit, which is always
in one of a finite number of “internal states”.
Types of Automata
o Deterministic Automata
o Non-deterministic Automata
Deterministic Automata
o A deterministic automaton is one in which each move
(transition from one state to another) is uniquely
determined by the current configuration.
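
A deterministic finite automaton can be simulated with a transition table in which each (state, input class) pair leads to at most one next state. The sketch below recognizes identifiers of the form letter (letter|digit)*; the state names and the helper `classify_char` are assumptions for the example, and a missing table entry is treated as a move to an implicit dead state.

```python
def classify_char(ch):
    """Map a character to the input class used by the transition table."""
    if ch.isascii() and ch.isalpha():
        return "letter"
    if ch.isascii() and ch.isdigit():
        return "digit"
    return "other"

# Each (state, input class) pair has at most one successor: that is determinism.
TRANSITIONS = {
    ("start", "letter"): "in_id",
    ("in_id", "letter"): "in_id",
    ("in_id", "digit"): "in_id",
}
START = "start"
ACCEPTING = {"in_id"}

def accepts(s):
    """Run the automaton on s and report whether it ends in an accepting state."""
    state = START
    for ch in s:
        state = TRANSITIONS.get((state, classify_char(ch)))
        if state is None:   # no transition defined: reject
            return False
    return state in ACCEPTING
```

Here `accepts("F1")` is true, while `accepts("1F")` and `accepts("")` are false.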
Regular Expressions
o Regular expressions were designed as a mathematical tool for
representing regular languages, built from a set of
primitives and operations.
o This representation involves a combination of strings of
symbols from some alphabet Σ, parentheses, and the
operators + (union), · (concatenation), and * (star closure).
Languages defined by Regular Expressions
• There is a very simple correspondence between
regular expressions and the languages they denote:
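
One way to see the correspondence is to compute, for a given regular expression, the strings it denotes. The sketch below assumes the usual textbook rules (the empty set, ε, and single symbols as primitives; union, concatenation, and star as operations) and a hypothetical nested-tuple encoding of expressions. Since the denoted language is often infinite, it returns only the strings of length at most `max_len`.

```python
def language(regex, max_len):
    """Strings of length <= max_len in the language denoted by regex.

    regex is a nested tuple (an encoding assumed for this sketch):
      ("empty",)          the empty set
      ("eps",)            the empty string
      ("sym", a)          the single symbol a
      ("union", r1, r2)   r1 + r2
      ("concat", r1, r2)  r1 . r2
      ("star", r)         r*
    """
    kind = regex[0]
    if kind == "empty":
        return set()
    if kind == "eps":
        return {""}
    if kind == "sym":
        return {regex[1]} if max_len >= 1 else set()
    if kind == "union":
        return language(regex[1], max_len) | language(regex[2], max_len)
    if kind == "concat":
        left = language(regex[1], max_len)
        right = language(regex[2], max_len)
        return {x + y for x in left for y in right if len(x + y) <= max_len}
    if kind == "star":
        base = language(regex[1], max_len)
        result = {""}
        frontier = {""}
        while frontier:   # keep appending until no new short strings appear
            frontier = {x + y for x in frontier for y in base
                        if len(x + y) <= max_len} - result
            result |= frontier
        return result
    raise ValueError(f"unknown regex node: {kind!r}")
```

For the expression a·(a+b)*, encoded as `("concat", ("sym", "a"), ("star", ("union", ("sym", "a"), ("sym", "b"))))`, the strings of length at most 2 are `a`, `aa`, and `ab`.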
Exercise (revision)
• Construct a deterministic finite-state automaton (DFA)
equivalent to the given nondeterministic finite-state automaton (NFA).
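
A common way to approach this kind of exercise is the subset construction, in which each DFA state is a set of NFA states. The sketch below is a generic version under stated assumptions: the NFA is a dictionary mapping (state, symbol) to a set of states, the empty string `""` marks an ε-move, and the function names are hypothetical. It is not tied to the particular NFA of the exercise.

```python
from collections import deque

def epsilon_closure(states, nfa):
    """All NFA states reachable from `states` using only ε-moves."""
    stack = list(states)
    closure = set(states)
    while stack:
        q = stack.pop()
        for r in nfa.get((q, ""), set()):
            if r not in closure:
                closure.add(r)
                stack.append(r)
    return frozenset(closure)

def nfa_to_dfa(nfa, start, accepting, alphabet):
    """Subset construction; returns (dfa_transitions, dfa_start, dfa_accepting)."""
    dfa_start = epsilon_closure({start}, nfa)
    dfa = {}
    seen = {dfa_start}
    queue = deque([dfa_start])
    while queue:
        current = queue.popleft()
        for a in alphabet:
            moved = set()
            for q in current:
                moved |= nfa.get((q, a), set())
            target = epsilon_closure(moved, nfa)
            dfa[(current, a)] = target       # exactly one target per (state, symbol)
            if target not in seen:
                seen.add(target)
                queue.append(target)
    # A DFA state accepts if it contains at least one accepting NFA state.
    dfa_accepting = {s for s in seen if s & accepting}
    return dfa, dfa_start, dfa_accepting
```

As a usage example, take an illustrative NFA for strings over {a, b} that end in "ab": `{(0, "a"): {0, 1}, (0, "b"): {0}, (1, "b"): {2}}` with start state 0 and accepting state 2. The construction yields three reachable DFA states, {0}, {0, 1}, and {0, 2}, of which only {0, 2} is accepting.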