syntax

Language Translation Issues
Lecture 5:
Dolores Zage
Programming Language Syntax

The arrangement of words as elements in a sentence to show their relationship
 In C, X = Y + Z represents a valid sequence of symbols; XY +- does not

Syntax provides significant information for
 understanding a program
 translation into an object program
 rules: 2 + 3 x 4 is 14, not 20
 writing (2+3) x 4 specifies the other interpretation by syntax; syntax guides the translator
General Syntactic Criteria


Provide a common notation between the
programmer and the programming language
processor
the choice of notation is constrained only slightly by the necessity to communicate particular items of information
 for example: the fact that a variable is a real can be conveyed by an explicit declaration, as in Pascal, or by an implicit naming convention, as in FORTRAN

general criteria: easy to read, easy to write, easy to translate, and unambiguous
Readability








Algorithm is apparent from inspection of text
self-documenting
natural statement formats
liberal use of key words and noise words
provision for embedded comments
unrestricted length identifiers
mnemonic operator symbols
COBOL design emphasizes readability often
at the expense of ease of writing and
translation
Writeability



Enhanced by concise and regular structures (notice that readability, in contrast, favors verbose and distinct constructs that help us to distinguish programming features)
FORTRAN: implicit naming does not help us catch misspellings (indx and index are both valid integer variables, even though the programmer wanted indx to be index)
redundancy can be good
 it makes programs easier to read and allows for error checking
Ease of Translation

The key to easy translation is regularity of structure
LISP can be translated with a few short, easy rules, but it is a bear to read
COBOL has a large number of syntactic constructs -> hard to translate
Lack of ambiguity




Central problem in every language design!
An ambiguous construction allows two or more different interpretations
these usually arise not in the structure of individual program elements but in the interplay between structures
The dangling else is a classic example:
If then else

if (boolean) then
if (boolean) then
statement 1
else
statement 2

[Figure: two flowcharts for this statement, with tests B1 and B2 and statements S1 and S2. In one the else is paired with the inner if; in the other it is paired with the outer if.]
Resolve dangling else



Include a begin … end delimiter around the embedded conditional (ALGOL)
Use an explicit closing delimiter, end if (Ada)
C and Pascal -> the final else is paired with the nearest unmatched if
Character set






ASCII
26 letters in English -> other languages have hundreds of letters
identifiers, key words, and reserved words
blanks may be insignificant except in literal character-string data (FORTRAN), or they may be used as separators
delimiters -> begin ... end, { }
Other elements


Identifiers, operators, key words, reserved words
Free vs. fixed format
 free: statements may be written anywhere on a line
 fixed: FORTRAN, where the first five characters are reserved for labels

Statements
 simple: no embedding
 structured or nested: embedded
Overall Program-Subprogram
Structure






Separate subprogram definitions (Common blocks in FORTRAN)
separate data definitions (class mechanism)
nested subprogram definitions (Pascal: nesting one subprogram inside another)
separate interface definitions: package interface in Ada; in C you can do this with an include file
data descriptions separated from executable statements (COBOL data and environment divisions)
unseparated subprogram definitions: no organization (early BASIC and SNOBOL)
Stages in Translation



The process of translating a program from its original syntax into executable form is central to every programming language implementation
translation can be quite simple, as in LISP and Prolog, but more often it is quite complex
most languages could be implemented with only trivial translation if you wrote a software interpreter and were willing to accept slow execution speeds
Stages in Translation

The syntactic recognition parts of compiler theory are fairly standard
 Analysis of the Source Program
  the structure of the program must be laboriously built up character by character during translation
 Synthesis of the Object Program
  construction of the executable program from the output of the semantic analysis
Structure of a Compiler
[Figure: structure of a compiler.
Source program recognition phases: source program -> lexical analysis -> lexical tokens -> syntactic analysis -> parse tree -> semantic analysis -> intermediate code. A symbol table and other tables are shared by the phases.
Object code generation phases: intermediate code -> optimization -> optimized intermediate code -> code generation -> object code -> linking (together with object code from other compilations) -> executable code.]
Analysis of the Source Program



lexical analysis (tokenizing)
parsing (syntactic analysis)
semantic analysis
 symbol-table maintenance
 insertion of implicit information (default settings)
 macro processing and compile-time operations (#ifdefs)
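The first analysis step, lexical analysis, can be sketched in a few lines. This is only an illustration: the token classes (NUMBER, NAME, OP) and the helper name `tokenize` are assumptions made for the example, not something the slides define.

```python
import re

# Hypothetical token classes for a tiny expression language.
TOKEN_SPEC = [
    ("NUMBER", r"\d+(?:\.\d+)?"),
    ("NAME",   r"[A-Za-z_]\w*"),
    ("OP",     r"[-+*/=()]"),
    ("SKIP",   r"\s+"),
]
MASTER = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC))

def tokenize(text):
    """Return a list of (kind, lexeme) pairs, dropping whitespace."""
    tokens = []
    pos = 0
    while pos < len(text):
        match = MASTER.match(text, pos)
        if match is None:
            raise SyntaxError(f"unexpected character {text[pos]!r} at position {pos}")
        if match.lastgroup != "SKIP":
            tokens.append((match.lastgroup, match.group()))
        pos = match.end()
    return tokens

print(tokenize("X = Y + Z"))
# [('NAME', 'X'), ('OP', '='), ('NAME', 'Y'), ('OP', '+'), ('NAME', 'Z')]
```

The parser then works on these tokens rather than on individual characters.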
Synthesis of the Object Program



Optimization
code generation - internal representation must
be formed into assembly language statements,
machine code or other object form
linking and loading - references to external data or
other subprograms
Translator Groupings


Crudely grouped by the number of passes they make over the source code
standard: uses 2 passes
 the first pass decomposes the program into components and collects information such as variable name usage
 the second pass generates an object program from the collected information

one pass: fast compilation; Pascal was designed so that it could be compiled in one pass
three or more passes: used if execution speed is paramount
Formal Translation Models



Based on the context-free theory of
languages
the formal definition of the syntax of a
programming language is called a grammar
a grammar consists of a set of rules (productions) that specify the sequences of characters (lexical items) that form allowable programs in the language being defined
Chomsky Hierarchy


Language syntax was one of the earliest formal models to be applied to programming language design
in 1959 Chomsky outlined a model of
grammars
Classes of grammar and abstract machines

Chomsky Level | Grammar Class     | Machine Class
0             | Unrestricted      | Turing machine
1             | Context sensitive | Linear-bounded automaton
2             | Context free      | Pushdown automaton
3             | Regular           | Finite-state automaton

Type 2 grammars are our BNF grammars. Types 2 and 3 are what we use in programming languages.
A type n language is one that is generated by a type n grammar, where there is no grammar of type n + 1 that also generates it. Every grammar of type n is, by definition, also a grammar of type n - 1.
Grammar







To Chomsky it is a 4-tuple (V, T, P, Z) where
V is an alphabet
T, a subset of V, is an alphabet of terminal symbols
P is a finite set of rewriting rules
Z, the distinguished symbol, is a member of V - T (a nonterminal)
The language of a grammar is the set of terminal strings which can be derived from Z
The difference in the four types is in the form of the
rewriting rules allowed in P
Type 0 or phrase structure


Rules can have the form:
u ::= v with
 u in V+ and v in V*



That is, the left part u can also be a
sequence of symbols and the right part can
be empty
abc -> dca
a -> nil
Type 1 or context sensitive or
context dependent



Restrict the rewriting rules
xUy ::= xuy
we are allowed to rewrite U as u only in the context x…y
 for all productions a -> b,
 the length of the left side a must always be less than or equal to the length of the right side b
G = ( {S,B,C}, {a,b,c}, S, P)








P=
S -> aSBC
S -> abC
bB -> bb
bC -> bc
CB -> BC
cC -> cc
What language is generated by this context
sensitive grammar?
Deciding the language?



always start with the start symbol: in this case it is S, but it can be any nonterminal (look at the 4-tuple definition)
create a tree starting with the start symbol and apply the productions, finally finishing with all terminals
“generalize” the pattern
Identifying L
given G
P = 1. S -> aSBC
    2. S -> abC
    3. bB -> bb
    4. bC -> bc
    5. CB -> BC
    6. cC -> cc

S => abC => abc
S => aSBC => aabCBC => aabBCC => aabbCC => aabbcC => aabbcc
S => aSBC => aaSBCBC => aaabCBCBC => aaabBCCBC => aaabBCBCC => aaabBBCCC => aaabbBCCC => aaabbbCCC => aaabbbcCC => aaabbbccC => aaabbbccc

$L = \{ a^n b^n c^n \mid n \ge 1 \}$
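The procedure of applying productions until only terminals remain can also be run mechanically. The sketch below is a brute-force breadth-first search over sentential forms; the function name `generate` and the length bound are assumptions for illustration. It relies on the fact that no production of G shortens a string, so forms longer than the bound can be discarded.

```python
from collections import deque

# Productions of G, copied from the slide.
PRODUCTIONS = [
    ("S", "aSBC"), ("S", "abC"), ("bB", "bb"),
    ("bC", "bc"), ("CB", "BC"), ("cC", "cc"),
]

def generate(max_len):
    """Return all terminal strings of length <= max_len derivable from S."""
    seen = {"S"}
    queue = deque(["S"])
    results = set()
    while queue:
        form = queue.popleft()
        if form.islower():            # only the terminals a, b, c remain
            results.add(form)
            continue
        for lhs, rhs in PRODUCTIONS:
            start = form.find(lhs)
            while start != -1:
                new = form[:start] + rhs + form[start + len(lhs):]
                # No rule shortens a string, so longer forms can be pruned.
                if len(new) <= max_len and new not in seen:
                    seen.add(new)
                    queue.append(new)
                start = form.find(lhs, start + 1)
    return sorted(results, key=len)

print(generate(9))   # ['abc', 'aabbcc', 'aaabbbccc']
```

Some sentential forms, such as aabcBC, are dead ends that no rule can finish; the search simply abandons them.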
Type 2 or context free



U can be rewritten as u regardless of the context in which it appears
Each rule has only one symbol, a nonterminal, on the left-hand side
It also allows a rule to go to the empty string
Context Free Expression Grammar



E-> E + T | E - T | T
T -> T * F | T / F | F
F -> number | name | (E)
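One way to see how this grammar encodes precedence and left associativity is a small recursive-descent evaluator. This is a sketch under assumptions: the left-recursive rules are replaced by loops (a standard transformation, not something the slides prescribe), numbers are unsigned integers, the tokenizer silently skips characters it does not recognize, and the names `evaluate` and `env` are invented for the example.

```python
import re

def evaluate(text, env=None):
    """Evaluate an expression using the E / T / F grammar."""
    env = env or {}
    tokens = re.findall(r"\d+|[A-Za-z_]\w*|[-+*/()]", text)
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def advance():
        nonlocal pos
        pos += 1
        return tokens[pos - 1]

    def expr():                      # E -> E + T | E - T | T
        value = term()
        while peek() in ("+", "-"):
            if advance() == "+":
                value += term()
            else:
                value -= term()
        return value

    def term():                      # T -> T * F | T / F | F
        value = factor()
        while peek() in ("*", "/"):
            if advance() == "*":
                value *= factor()
            else:
                value /= factor()
        return value

    def factor():                    # F -> number | name | (E)
        token = peek()
        if token is None:
            raise SyntaxError("unexpected end of input")
        advance()
        if token == "(":
            value = expr()
            if peek() != ")":
                raise SyntaxError("expected ')'")
            advance()
            return value
        if token.isdigit():
            return int(token)
        if token[0].isalpha() or token[0] == "_":
            return env[token]        # names are looked up in env
        raise SyntaxError(f"unexpected token {token!r}")

    result = expr()
    if peek() is not None:
        raise SyntaxError(f"unexpected token {peek()!r}")
    return result

print(evaluate("2 + 3 * 4"))     # 14
print(evaluate("(2 + 3) * 4"))   # 20
```

Because T sits below E in the grammar, multiplication binds tighter than addition without any separate precedence table.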
Type 3 - regular grammars



Restrict the rules once more
all rules must have the form
u ::= N or u ::= WN
where u and W are nonterminals and N is a terminal
Grammars




As we moved from type 3 to type 2 to type 1
to type 0, the resulting languages became
more complex
type 2 and type 3 became important in
programming languages
type 3 provided a model (FSM) for building
lexical analyzers
type 2 (BNF) for developing parse trees of
programs
BNF Grammars

Consider the structure of an English sentence. We usually describe it as a sequence of categories
subject / verb / object
Examples:
The girl/ played / baseball.
The boy / cooked / dinner.
BNF Grammars


Each category can be further divided.
For example subject is represented by
article noun
article / noun / verb / object
There are other possible sentence structures besides
the simple declarative ones, such as questions.
auxiliary verb / subject / predicate
Is / the boy / cooking dinner?
Represent sentences by a set of
rules




<sentence> ::= <declarative> | <question>
<declarative> ::= <subject> <verb> <object>.
<subject> ::= <article><noun>
<question> ::= <auxiliary verb> <subject> <predicate>

This specific notation is called BNF (Backus-Naur form) and was developed in the late 1950s by John Backus as a way to express the syntactic definition of ALGOL. At about the same time Chomsky developed a similar grammatical form, the context-free grammar. BNF grammars and context-free grammars are equivalent in power; the differences are only in notation. For this reason the terms BNF grammar and context-free grammar are interchangeable.
Syntax


A BNF grammar is composed of a finite set
of BNF grammar rules, which define a
language
syntax is concerned with form rather than
meaning, a (programming) language consists
of a set of syntactically correct programs,
each of which is simply a sequence of
characters
Production Rules





A grammar -> set of production rules
<real-number> ::= <integer_part> . <fraction>
<integer_part> ::= <digit> | <integer_part> <digit>
<fraction> ::= <digit> | <digit> <fraction>
<digit> ::= 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
Names in angle brackets are nonterminals; the digits and the decimal point are tokens, or terminals
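The language defined by the <real-number> rules happens to be regular (one or more digits, a point, one or more digits), so a recognizer can be sketched with a regular expression. The function name `is_real_number` is an assumption for the example.

```python
import re

# <real-number> ::= <integer_part> . <fraction>
# Both <integer_part> and <fraction> expand to one or more digits.
REAL_NUMBER = re.compile(r"[0-9]+\.[0-9]+")

def is_real_number(text):
    """True if the whole string can be derived from <real-number>."""
    return REAL_NUMBER.fullmatch(text) is not None

for s in ["13.13", "0.5", "13.", ".5", "1.2.3"]:
    print(s, is_real_number(s))
# 13.13 True
# 0.5 True
# 13. False
# .5 False
# 1.2.3 False
```

Note that the grammar, and therefore this recognizer, rejects forms such as 13. and .5 that some languages accept.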
Doesn’t Have to Make Sense!

A syntactically correct program need not make any sense semantically.
 If it is executed it would not have to compute anything useful
 it could fail to compute anything at all

For example, look at our simple declarative sentences -> the syntax subject verb object is fulfilled, but the sentence doesn’t make any sense
The home / ran / the girl.
Parse Trees



Production rules are rules for building strings
of tokens
beginning with the starting nonterminal, you
can use the rules to build a tree
In a parse tree
 each leaf is labeled with a terminal or is empty
 nonleaf nodes are labeled with nonterminals
 the tree generates the string formed by reading the terminals at its leaves from left to right
 a string is in a language only if it is generated by some parse tree

Parse tree
<real-number> ::= <integer_part> . <fraction>
<integer_part> ::= <digit> | <integer_part> <digit>
<fraction> ::= <digit> | <digit> <fraction>
<digit> ::= 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9

String: 13.13

<real-number>
  <integer_part>
    <integer_part>
      <digit>
        1
    <digit>
      3
  .
  <fraction>
    <digit>
      1
    <fraction>
      <digit>
        3
Use of Formal Grammar




Important to the language user and language
implementor
user may consult to answer subtle questions
about program form, punctuation, and
structure
implementor may use it to determine all the
possible cases of input program structures
that are allowed
common agreed upon definition
BNF grammar or Context free




Assigns a structure to each string in the
language
the structure is always a tree because of the restrictions on BNF grammar rules
parse tree provides an intuitive semantic
structure
BNF does a good job in defining the syntax
of a language
Syntax not defined by BNF
notation





Despite the elegance, power and simplicity of BNF
grammars there are areas of language that cannot
be expressed (contextual dependence)
ex: the same identifier may not be defined twice in
the same scope
also every language can be defined by multiple
grammars
problem : ambiguity (the dangling else)
They /are /flying planes or They / are flying/ planes
Ambiguity



Ambiguity is often a property of a given
grammar
G : S -> SS | 0 | 1
this grammar, which generates binary strings, is ambiguous because there is a string in the language that has two distinct parse trees
Ambiguous Grammar
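[Figure: two distinct parse trees under G for the same three-symbol binary string. One tree groups the first two symbols under a single S; the other groups the last two.]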
Ambiguous Grammar
If every grammar for a given language is ambiguous, then the language is inherently ambiguous. However, the language of binary strings is not inherently ambiguous, because there is a grammar for it that is unambiguous:
G: T -> 0T | 1T | 0 | 1
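The degree of ambiguity of S -> SS | 0 | 1 can be counted directly. In the sketch below (the function name `trees_ambiguous` is an assumption), the number of parse trees depends only on the length of the string, since each symbol is produced by exactly one of S -> 0 or S -> 1 and the only choice is where S -> SS splits the string.

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def trees_ambiguous(n):
    """Number of parse trees under S -> SS | 0 | 1 for a string of length n."""
    if n == 1:
        return 1                      # S -> 0 or S -> 1, fixed by the symbol
    # S -> SS: choose where the string splits between the two subtrees.
    return sum(trees_ambiguous(k) * trees_ambiguous(n - k) for k in range(1, n))

print([trees_ambiguous(n) for n in range(1, 6)])   # [1, 1, 2, 5, 14]
```

Every string of length three or more has more than one tree under S -> SS | 0 | 1. Under T -> 0T | 1T | 0 | 1 each symbol forces exactly one rule, so every string has exactly one tree.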
Expressions



We need control structures for expressions
Implicit (default) control - are in effect unless
modified by the programmer through some
explicit structure
explicit - modify implicit sequence
Sequencing with Arithmetic
Expressions
$$\mathrm{Root} = \frac{-B \pm \sqrt{B^2 - 4 \cdot A \cdot C}}{2 \cdot A}$$
There are 15 separate operations in this formula. In a programming language this can be stated as a single expression
Sequencing with Arithmetic
Expressions


Expressions are a powerful and natural device for expressing sequences of operations; however, they raise new problems.
The sequence-control mechanisms that
operate to determine the order of operations
within an expression are complex and subtle
Tree-Structure Representation

Clarifies the control structure of the
expression
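[Figure: tree for (a+b) * (c-a). The root is *; its left subtree is + with operands a and b; its right subtree is - with its two operands.]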
Syntax for Expressions


For a programming language we must have
a notation for writing trees as linear
sequences of symbols
There are three common ones
 prefix
 postfix
 infix
Expression Notation
notation | form     | example
prefix   | op E1 E2 | +ab
postfix  | E1 E2 op | ab+
infix    | E1 op E2 | a+b

postfix and prefix are nice -> they do not have to use ()

infix     | postfix | prefix
(a+b)*c   | ab+c*   | *+abc
a+b*c     | abc*+   | +a*bc
a+b+c     | ab+c+   | ++abc
(a+b)+c   | ab+c+   | ++abc
a + (b+c) | abc++   | +a+bc
Which of the following is a valid
expression (either postfix or
prefix)?



BC*D-+
*ABCBBB**
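Validity in either notation can be checked by counting operands. The sketch below assumes that every operator is binary, that operators are drawn from + - * /, and that every other character is a one-letter operand; the function names are invented for the example.

```python
OPERATORS = set("+-*/")

def is_valid_postfix(expr):
    """Scan left to right, tracking how many operands are on the stack."""
    depth = 0
    for symbol in expr:
        if symbol in OPERATORS:
            if depth < 2:             # a binary operator needs two operands
                return False
            depth -= 1                # two operands are replaced by one result
        else:
            depth += 1
    return depth == 1                 # exactly one value must remain

def is_valid_prefix(expr):
    """A prefix expression read right to left behaves like a postfix one."""
    return is_valid_postfix(expr[::-1])

for s in ["ab+c*", "*+abc", "BC*D-+", "*ABCBBB**"]:
    print(s, is_valid_postfix(s), is_valid_prefix(s))
# ab+c* True False
# *+abc False True
# BC*D-+ False False
# *ABCBBB** False False
```

Under these assumptions neither of the two strings in the question is valid in either notation: the first runs out of operands at its final +, and the second has symbols left over after *AB is complete.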
Expression Notation - Infix



However, infix is familiar and easy to read
Infix is suited to binary operators; unary operators and multi-argument function calls must be treated as exceptions to the general infix property
But how to decode a+b*c?
 Precedence (order of operations)
 Associativity (normally left to right)
Precedence




Give operators precedence levels
higher precedence operators are evaluated
before lower precedence operators
without precedence rules, parentheses would
be needed in expressions
works well with all mathematical symbols but breaks down with new operators not from classical mathematics (?: in C)
Associativity






What if operators with the same precedence
are grouped together?
Operators + - / * are left associative
1+2+3+4 : left associative
a=b=c=2+3 : right associative
2^3^4 (exponentiation) : right associative
mixfix notation: symbols or keywords are interspersed with the components of an expression, as in IF a>b then a else b
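The effect of associativity is easy to observe in a language that has both kinds of operator. Python is used here only as a convenient example; its subtraction is left associative and its exponentiation operator ** is right associative.

```python
left = 1 - 2 - 3          # grouped as (1 - 2) - 3
right = 2 ** 3 ** 2       # grouped as 2 ** (3 ** 2)

print(left, (1 - 2) - 3, 1 - (2 - 3))        # -4 -4 2
print(right, 2 ** (3 ** 2), (2 ** 3) ** 2)   # 512 512 64
```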
Abstract Syntax Tree


Infix, postfix, and prefix use different notations, but all have the same meaningful components
an abstract syntax tree is a way to represent those components independently of the notation

infix   | postfix | prefix
(a+b)*c | ab+c*   | *+abc

    *
   / \
  +   c
 / \
a   b
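A single tree can be written out in all three notations by changing only the order in which a node and its children are visited. The `Node` class and the three function names below are assumptions for illustration; the infix printer parenthesizes every operation rather than using precedence rules.

```python
from dataclasses import dataclass
from typing import Union

@dataclass
class Node:
    op: str
    left: "Tree"
    right: "Tree"

Tree = Union[Node, str]   # a leaf is just an operand name

def prefix(tree):
    if isinstance(tree, str):
        return tree
    return tree.op + prefix(tree.left) + prefix(tree.right)

def postfix(tree):
    if isinstance(tree, str):
        return tree
    return postfix(tree.left) + postfix(tree.right) + tree.op

def infix(tree):
    if isinstance(tree, str):
        return tree
    return "(" + infix(tree.left) + tree.op + infix(tree.right) + ")"

ast = Node("*", Node("+", "a", "b"), "c")
print(prefix(ast))    # *+abc
print(postfix(ast))   # ab+c*
print(infix(ast))     # ((a+b)*c)
```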
Side Effects


The use of operations that have side effects
in expressions is the basis of a long-standing
controversy in programming language design
Side effects are implicit results. For example
an operation may return an explicit result, as
in the sum returned as the result of an
addition, but it may also modify the values
stored in other data objects.
a * fun(x) + a

First, we must fetch the r-value of a, and fun(x) must be evaluated.
Notice that the addition requires the value of a and the result of the multiplication.
It is clearly desirable to fetch a once and use it twice
Moreover, it should make no difference whether fun(x) is evaluated before or after the value of a is fetched
a * fun(x) + a

However, if fun has the side effect of changing the value of a, then the exact order of evaluation is critical!
If a has the initial value of 1 and fun(x) returns 3 and also changes the value of a to 2, then the possible values for this expression are:
 evaluate each term in sequence: 1 * 3 + 2 = 5
 evaluate a only once: 1 * 3 + 1 = 4
 call fun(x) before evaluating a: 2 * 3 + 2 = 8
 all are correct according to the syntax
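The three outcomes can be reproduced in a language that fixes the evaluation order. Python evaluates operands left to right, so the plain expression gives the first result; the other two orders are simulated by hand. The argument 0 passed to fun is arbitrary, since the slides do not say what x is.

```python
a = 1

def fun(x):
    """Return 3 and, as a side effect, set the global a to 2."""
    global a
    a = 2
    return 3

# Each term in sequence (Python's own left-to-right order): 1 * 3 + 2
a = 1
in_sequence = a * fun(0) + a

# Fetch a once and reuse the saved value: 1 * 3 + 1
a = 1
saved = a
once = saved * fun(0) + saved

# Call fun before fetching a: 2 * 3 + 2
a = 1
result = fun(0)
call_first = a * result + a

print(in_sequence, once, call_first)   # 5 4 8
```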
Positions on side effects in
expressions



Outlaw them! Disallow functions with side effects, or make their results undefined
allow them, but make it clear exactly what the order of evaluation is so the programmer can make proper use of it
The latter is the most general, but in many language definitions this question is ignored, and the result is that different implementations provide conflicting interpretations