Language Translation Issues
Lecture 5: Dolores Zage

Programming Language Syntax
- the arrangement of words as elements in a sentence to show their relationship
- in C, X = Y + Z is a valid sequence of symbols; XY +- is not
- syntax provides significant information both for understanding a program and for translating it into an object program
- rules: 2 + 3 * 4 is 14, not 20; to get 20 you must write (2 + 3) * 4 (illustrated in the short C example below)
- syntax specifies the interpretation, and syntax guides the translator

General Syntactic Criteria
- provide a common notation between the programmer and the programming language processor
- the choice is constrained only slightly by the necessity to communicate particular items of information
- for example, a variable may be marked as real either by an explicit declaration, as in Pascal, or by an implicit naming convention, as in FORTRAN
- general criteria: easy to read, easy to write, easy to translate, and unambiguous

Readability
- the algorithm is apparent from inspection of the text (self-documenting)
- natural statement formats
- liberal use of key words and noise words
- provision for embedded comments
- unrestricted-length identifiers
- mnemonic operator symbols
- the COBOL design emphasizes readability, often at the expense of ease of writing and translation

Writability
- enhanced by concise and regular structures (note that readability tends toward the verbose, so the two criteria pull in different directions); the syntax should help us distinguish different programming features
- FORTRAN's implicit naming does not help us catch misspellings: indx and index are both acceptable integer variables, even though the programmer wanted indx to be index
- redundancy can be good: redundant text is easier to read and allows for error checking

Ease of Translation
- the key to easy translation is regularity of structure
- LISP can be translated with a few short, easy rules, but it is a bear to read
- COBOL has a large number of syntactic constructs and is therefore hard to translate

Lack of Ambiguity
- a central problem in every language design!
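As a quick, concrete check of the earlier point that syntax rules fix a program's interpretation (2 + 3 * 4 is 14, not 20), here is a minimal C sketch; the program itself is purely illustrative and not part of the original slides.

    #include <stdio.h>

    int main(void) {
        /* Multiplication binds tighter than addition, so this is 2 + (3 * 4). */
        printf("2 + 3 * 4   = %d\n", 2 + 3 * 4);     /* prints 14 */
        /* Parentheses override the default precedence. */
        printf("(2 + 3) * 4 = %d\n", (2 + 3) * 4);   /* prints 20 */
        return 0;
    }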
- an ambiguous construction allows two or more different interpretations
- ambiguities usually do not arise in the structure of individual program elements but in the interplay between structures
- the dangling else is a classic example:

    if (B1) then if (B2) then S1 else S2

  the else can be read as belonging to either if:
  - if (B1) then { if (B2) then S1 else S2 }
  - if (B1) then { if (B2) then S1 } else S2

Resolving the dangling else
- include a begin ... end delimiter around the embedded conditional (ALGOL)
- Ada: a closing delimiter, end if
- C and Pascal: the final else is paired with the nearest then

Character set
- ASCII provides 26 letters; other natural languages have hundreds of letters
- the character set is used to build identifiers, key words, and reserved words
- blanks may be insignificant except in literal character-string data (FORTRAN), or may be used as separators
- delimiters: begin, end, { }

Other elements
- identifiers, operators, key words, reserved words

Free vs. fixed format
- free: statements may be written anywhere
- fixed: FORTRAN reserves the first five columns for labels

Statements
- simple: no embedding
- structured or nested: embedded statements

Overall Program-Subprogram Structure
- separate subprogram definitions (common blocks in FORTRAN)
- separate data definitions (the class mechanism)
- nested subprogram definitions (Pascal: nesting one subprogram inside another)
- separate interface definitions: package interfaces in Ada; in C you can approximate this with an include file
- data descriptions separated from executable statements (COBOL data and environment divisions)
- unseparated subprogram definitions, i.e. no organization: early BASIC and SNOBOL

Stages in Translation
- the process of translating a program from its original syntax into executable form is central to every programming language implementation
- translation can be quite simple, as in LISP and Prolog, but more often it is quite complex
- most languages could be implemented with only trivial translation if you wrote a software interpreter and were willing to accept slow execution speeds

Stages in Translation
- syntactic recognition: these parts of compiler theory are fairly standard
- analysis of the source program: the structure of the program must be laboriously built up character by character during translation
- synthesis of the object program: construction of the executable program from the output of the semantic analysis

Structure of a Compiler
- recognition phases: the source program goes through lexical analysis (producing lexical tokens), syntactic analysis (producing a parse tree), and semantic analysis (producing intermediate code), with the symbol table and other tables maintained throughout
- generation phases: optimization (producing optimized intermediate code), code generation (producing object code), and linking with object code from other compilations (producing the executable code)

Analysis of the Source Program
- lexical analysis (tokenizing); a small tokenizing sketch in C appears below, after the translator groupings
- parsing (syntactic analysis)
- semantic analysis
- symbol-table maintenance
- insertion of implicit information (default settings)
- macro processing and compile-time operations (#ifdefs)

Synthesis of the Object Program
- optimization
- code generation: the internal representation must be turned into assembly language statements, machine code, or another object form
- linking and loading: resolving references to external data or other subprograms

Translator Groupings
- translators can be crudely grouped by the number of passes they make over the source code
- standard: two passes; the first decomposes the program into components and records information such as variable name usage, the second generates an object program from the collected information
- one pass: fast compilation; Pascal was designed so that it could be compiled in one pass
- three or more passes: used when execution speed is paramount
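The lexical-analysis (tokenizing) sketch promised above: a minimal, hedged illustration of the first recognition phase, splitting a hardwired statement such as x = y + z into identifier and operator tokens. The function name tokenize and the token labels are inventions for this sketch, not taken from any particular compiler.

    #include <ctype.h>
    #include <stdio.h>

    /* Print one token per line: identifiers and single-character operators.
       A real lexer would also return token codes to the parser and record
       identifiers in the symbol table. */
    static void tokenize(const char *src) {
        const char *p = src;
        while (*p) {
            if (isspace((unsigned char)*p)) {          /* skip separators */
                p++;
            } else if (isalpha((unsigned char)*p)) {   /* identifier */
                const char *start = p;
                while (isalnum((unsigned char)*p)) p++;
                printf("IDENT    %.*s\n", (int)(p - start), start);
            } else {                                   /* operator or delimiter */
                printf("OPERATOR %c\n", *p);
                p++;
            }
        }
    }

    int main(void) {
        tokenize("x = y + z");
        return 0;
    }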
Formal Translation Models
- based on the context-free theory of languages
- the formal definition of the syntax of a programming language is called a grammar
- a grammar consists of a set of rules (productions) that specify the sequences of characters (lexical items) forming allowable programs in the language, together with a defined beginning (start) symbol

Chomsky Hierarchy
- language syntax was one of the earliest formal models to be applied to programming language design
- in 1959 Chomsky outlined a model of grammars: classes of grammars and the abstract machines that recognize them

    Chomsky level   Grammar class        Machine class
    0               Unrestricted         Turing machine
    1               Context sensitive    Linear-bounded automaton
    2               Context free         Pushdown automaton
    3               Regular              Finite-state automaton

- type 2 grammars are our BNF grammars; types 2 and 3 are what we use in programming languages
- a type n language is one that is generated by a type n grammar and by no grammar of type n + 1
- every grammar of type n is, by definition, also a grammar of type n - 1

Grammar
- to Chomsky, a grammar is a 4-tuple (V, T, P, Z) where
  - V is an alphabet
  - T, a subset of V, is the alphabet of terminal symbols
  - P is a finite set of rewriting rules
  - Z, the distinguished (start) symbol, is a member of V - T
- the language of a grammar is the set of terminal strings that can be derived from Z
- the difference between the four types lies in the form of the rewriting rules allowed in P

Type 0, or phrase structure
- rules can have the form u ::= v, with u in V+ and v in V*
- that is, the left part u can be a sequence of symbols and the right part can be empty
- examples: abc -> dca, a -> nil (the empty string)

Type 1, or context sensitive (context dependent)
- restrict the rewriting rules to the form xUy ::= xuy; we are allowed to rewrite U as u only in the context x ... y
- equivalently, in every production a -> b the length of a must be less than or equal to the length of b
- example: G = ({S,B,C}, {a,b,c}, S, P) with
    P:  S -> aSBC    S -> abC    bB -> bb    bC -> bc    CB -> BC    cC -> cc
- what language is generated by this context-sensitive grammar?

Deciding the language
- always start with the start rule; here it is S, but the start symbol can be any designated nonterminal (look at the 4-tuple definition)
- create a derivation tree starting with the start rule and apply the productions, finishing when only terminals remain
- then "generalize" the pattern

Identifying L given G
    P = 1. S -> aSBC   2. S -> abC   3. bB -> bb   4. bC -> bc   5. CB -> BC   6. cC -> cc

    Sample derivations:
    n = 1:  S => abC => abc
    n = 2:  S => aSBC => aabCBC => aabBCC => aabbCC => aabbcC => aabbcc
    n = 3:  S => aSBC => aaSBCBC => aaabCBCBC => aaabBCCBC => aaabBCBCC => aaabBBCCC
            => aaabbBCCC => aaabbbCCC => aaabbbcCC => aaabbbccC => aaabbbccc

    L = { a^n b^n c^n | n >= 1 }

Type 2, or context free
- U can be rewritten as u regardless of the context in which it appears
- such a grammar has only one (nonterminal) symbol on the left-hand side of each rule
- it also allows a rule to go to the empty string

Context-free expression grammar
    E -> E + T | E - T | T
    T -> T * F | T / F | F
    F -> number | name | ( E )

Type 3, or regular grammars
- restrict the rules once more: all rules must have the form U ::= W or U ::= WN, where W is a terminal string and N a single nonterminal (as in the grammar T -> 0T | 1T | 0 | 1 used later)

Grammars
- as we move from type 3 to type 2 to type 1 to type 0, the resulting languages become more complex
- types 2 and 3 are the important ones for programming languages
- type 3 provides a model (the finite-state machine) for building lexical analyzers
- type 2 (BNF) provides the model for developing parse trees of programs
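To make that last point concrete, here is a hedged sketch of how a type 2 (BNF) grammar turns into a parser: a recursive-descent recognizer in C for the expression grammar E -> E + T | E - T | T, T -> T * F | T / F | F, F -> number | name | (E) given above. Recursive descent cannot use the left-recursive rules directly, so each one is rewritten as a loop; treating names and numbers as single characters is an assumption made only to keep the sketch short.

    #include <ctype.h>
    #include <stdio.h>
    #include <stdlib.h>

    static const char *p;                 /* current position in the input */

    static void error(void) { printf("syntax error at '%c'\n", *p); exit(1); }

    static void expr(void);               /* forward declaration */

    /* F -> number | name | ( E )   (single characters stand in for tokens) */
    static void factor(void) {
        if (isdigit((unsigned char)*p) || isalpha((unsigned char)*p)) {
            p++;
        } else if (*p == '(') {
            p++; expr();
            if (*p == ')') p++; else error();
        } else error();
    }

    /* T -> T * F | T / F | F, rewritten iteratively as  T -> F { (*|/) F } */
    static void term(void) {
        factor();
        while (*p == '*' || *p == '/') { p++; factor(); }
    }

    /* E -> E + T | E - T | T, rewritten iteratively as  E -> T { (+|-) T } */
    static void expr(void) {
        term();
        while (*p == '+' || *p == '-') { p++; term(); }
    }

    int main(void) {
        p = "(a+b)*c-3";
        expr();
        if (*p == '\0') printf("accepted\n"); else error();
        return 0;
    }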
BNF Grammars
- consider the structure of an English sentence; we usually describe it as a sequence of categories: subject / verb / object
- examples: The girl / played / baseball.   The boy / cooked / dinner.
- each category can be further divided; for example, the subject is represented by an article and a noun: article / noun / verb / object
- there are other possible sentence structures besides the simple declarative ones, such as questions: auxiliary verb / subject / predicate, as in Is / the boy / cooking dinner?
- we can represent sentences by a set of rules:

    <sentence>    ::= <declarative> | <question>
    <declarative> ::= <subject> <verb> <object> .
    <subject>     ::= <article> <noun>
    <question>    ::= <auxiliary verb> <subject> <predicate>

- this notation is called BNF (Backus-Naur form); it was developed in the late 1950s by John Backus as a way to express the syntactic definition of ALGOL
- at about the same time Chomsky developed a similar grammatical form, the context-free grammar
- BNF and context-free grammars are equivalent in power; the differences are only in notation, so the terms BNF grammar and context-free grammar are used interchangeably

Syntax (in grammars)
- a BNF grammar is composed of a finite set of BNF grammar rules, which define a language
- syntax is concerned with form rather than meaning: a (programming) language consists of a set of syntactically correct programs, each of which is simply a sequence of characters

Production Rules
- a grammar is a set of production rules; the names in angle brackets are nonterminals, the remaining symbols are tokens (terminals):

    <real-number>  ::= <integer_part> . <fraction>
    <integer_part> ::= <digit> | <integer_part> <digit>
    <fraction>     ::= <digit> | <digit> <fraction>
    <digit>        ::= 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9

Doesn't Have to Make Sense!
- a syntactically correct program need not make any sense semantically
- if executed, it would not have to compute anything useful; it might not compute anything at all
- for example, with our simple declarative sentences the pattern subject / verb / object is satisfied, yet the result is nonsense: The home / ran / the girl.

Parse Trees
- production rules are rules for building strings of tokens
- beginning with the starting nonterminal, you can use the rules to build a tree
- in a parse tree, each leaf holds a terminal or is empty, and each nonleaf node is labeled with a nonterminal
- the tree generates the string formed by reading the terminals at its leaves from left to right
- a string is in the language only if it is generated by some parse tree

Parse tree for the string 13.13 under the <real-number> grammar above (indentation shows children):

    <real-number>
        <integer_part>
            <integer_part>
                <digit>
                    1
            <digit>
                3
        .
        <fraction>
            <digit>
                1
            <fraction>
                <digit>
                    3
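As a small illustration of the parse-tree idea (not something from the slides), the following C sketch builds the 13.13 tree by hand using the grammar's nonterminal names and then prints its leaves from left to right, reconstructing the derived string. The node layout and helper names are arbitrary choices made for this sketch.

    #include <stdio.h>

    /* A parse-tree node: a terminal leaf (no children) or a node labeled
       with a nonterminal and holding up to three children. */
    struct node {
        const char *label;
        int nkids;
        struct node *kid[3];
    };

    static struct node pool[32];          /* crude fixed allocator for the demo */
    static int used;

    static struct node *mk(const char *label, int nkids,
                           struct node *a, struct node *b, struct node *c) {
        struct node *n = &pool[used++];
        n->label = label; n->nkids = nkids;
        n->kid[0] = a; n->kid[1] = b; n->kid[2] = c;
        return n;
    }
    static struct node *leaf(const char *t)               { return mk(t, 0, 0, 0, 0); }
    static struct node *n1(const char *l, struct node *a) { return mk(l, 1, a, 0, 0); }
    static struct node *n2(const char *l, struct node *a, struct node *b) { return mk(l, 2, a, b, 0); }

    /* Reading the terminal leaves left to right yields the derived string. */
    static void print_leaves(const struct node *n) {
        if (n->nkids == 0) { printf("%s", n->label); return; }
        for (int i = 0; i < n->nkids; i++) print_leaves(n->kid[i]);
    }

    int main(void) {
        struct node *t =
            mk("<real-number>", 3,
               n2("<integer_part>",
                  n1("<integer_part>", n1("<digit>", leaf("1"))),
                  n1("<digit>", leaf("3"))),
               leaf("."),
               n2("<fraction>",
                  n1("<digit>", leaf("1")),
                  n1("<fraction>", n1("<digit>", leaf("3")))));
        print_leaves(t);                  /* prints 13.13 */
        printf("\n");
        return 0;
    }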
Use of Formal Grammar
- important to both the language user and the language implementor
- the user may consult it to answer subtle questions about program form, punctuation, and structure
- the implementor may use it to determine all the possible cases of input program structures that are allowed
- it serves as a common, agreed-upon definition

BNF grammar or context-free grammar
- assigns a structure to each string in the language
- the structure is always a tree, because of the restrictions on BNF grammar rules
- the parse tree provides an intuitive semantic structure
- BNF does a good job of defining the syntax of a language

Syntax not defined by BNF notation
- despite the elegance, power, and simplicity of BNF grammars, there are areas of language syntax that cannot be expressed (contextual dependence)
- example: "the same identifier may not be defined twice in the same scope"
- also, every language can be defined by multiple grammars
- problem: ambiguity (the dangling else), as in the English sentence They / are / flying planes versus They / are flying / planes

Ambiguity
- ambiguity is often a property of a given grammar
- G: S -> SS | 0 | 1, a grammar that generates binary strings, is ambiguous because there is a string in the language that has two distinct parse trees
- [Figure: two distinct parse trees for the string 001 under G, one grouping it as (0)(01) and the other as (00)(1)]
- if every grammar for a given language is ambiguous, then the language is inherently ambiguous
- however, the language of binary strings is not inherently ambiguous, because there is a grammar for it that is unambiguous: G: T -> 0T | 1T | 0 | 1

Expressions
- we need control structures for expressions
- implicit (default) control: in effect unless modified by the programmer through some explicit structure
- explicit control: modifies the implicit sequence

Sequencing with Arithmetic Expressions
- Root = (-B + sqrt(B^2 - 4*A*C)) / (2*A)
- there are 15 separate operations in this formula, yet in a programming language it can be stated as a single expression
- expressions are a powerful and natural device for expressing sequences of operations; however, they raise new problems
- the sequence-control mechanisms that determine the order of operations within an expression are complex and subtle

Tree-Structure Representation
- a tree clarifies the control structure of an expression
- for example, (a+b) * (c-a) corresponds to a tree whose root is *, whose left subtree is + with leaves a and b, and whose right subtree is - with leaves c and a

Syntax for Expressions
- for a programming language we must have a notation for writing trees as linear sequences of symbols
- there are three common ones: prefix, postfix, and infix

Expression Notation

    notation   form         example
    prefix     op E1 E2     + a b
    postfix    E1 E2 op     a b +
    infix      E1 op E2     a + b

- postfix and prefix are nice: they do not need parentheses (a small stack-based postfix evaluator in C appears at the end of this part, after the discussion of precedence)

    infix        postfix    prefix
    (a+b)*c      ab+c*      *+abc
    a+b*c        abc*+      +a*bc
    a+b+c        ab+c+      ++abc
    (a+b)+c      ab+c+      ++abc
    a+(b+c)      abc++      +a+bc

- which of the following is a valid expression (either postfix or prefix)?  BC*D-+    *ABCBBB**

Expression Notation - Infix
- infix, however, is familiar and easy to read
- infix is suited to binary operators; unary operators and multi-argument function calls must be exceptions to the general infix property
- but how do we decode a+b*c? precedence (order of operations) and associativity (normally left to right)

Precedence
- give operators precedence levels; higher-precedence operators are evaluated before lower-precedence operators
- without precedence rules, parentheses would be needed in expressions
- precedence works well for the classical mathematical symbols but breaks down with new operators not drawn from classical mathematics (?: in C)
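Picking up the earlier remark that postfix needs no parentheses (and, in fact, no precedence rules either): a postfix expression can be evaluated left to right with a single stack, which is one reason translators often use postfix-like intermediate forms. A minimal sketch in C; the fixed input string, single-digit operands, and array-based stack are assumptions made only for the illustration.

    #include <ctype.h>
    #include <stdio.h>

    /* Evaluate a postfix expression over single-digit operands and + - * /. */
    static int eval_postfix(const char *s) {
        int stack[64], top = 0;
        for (; *s; s++) {
            if (isdigit((unsigned char)*s)) {
                stack[top++] = *s - '0';          /* push operand */
            } else {
                int b = stack[--top];             /* right operand */
                int a = stack[--top];             /* left operand  */
                switch (*s) {
                case '+': stack[top++] = a + b; break;
                case '-': stack[top++] = a - b; break;
                case '*': stack[top++] = a * b; break;
                case '/': stack[top++] = a / b; break;
                }
            }
        }
        return stack[0];                          /* a single value remains */
    }

    int main(void) {
        /* (2+3)*4 in postfix is 23+4* : no parentheses, no precedence needed. */
        printf("%d\n", eval_postfix("23+4*"));    /* prints 20 */
        return 0;
    }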
Associativity
- what if operators with the same precedence are grouped together?
- operators + - * / are left associative: 1+2+3+4 is grouped left to right
- a = b = c = 2 + 3 : right associative
- 2 ** 3 ** 4 (exponentiation) : right associative
- mixfix notation: symbols or keywords interspersed with the components of the expression, e.g. IF a > b THEN a ELSE b

Abstract Syntax Tree
- infix, postfix, and prefix use different notations, but all have the same meaningful components
- an abstract syntax tree is a way to represent this common structure for all the notations
- for (a+b)*c (postfix ab+c*, prefix *+abc) the tree has * at the root, with a + node (children a and b) and the leaf c as its two children

Side Effects
- the use of operations that have side effects in expressions is the basis of a long-standing controversy in programming language design
- side effects are implicit results: an operation may return an explicit result, as in the sum returned by an addition, but it may also modify the values stored in other data objects
- consider a * fun(x) + a
  - first, we must fetch the r-value of a, and fun(x) must be evaluated; the addition then needs the value of a and the result of the multiplication
  - it is clearly desirable to fetch a once and use it twice; moreover, it should make no difference whether fun(x) is evaluated before or after the value of a is fetched
  - however, if fun has the side effect of changing the value of a, then the exact order of evaluation is critical!
  - if a has the initial value 1, and fun(x) returns 3 and also changes a to 2, the possible values for the expression are:
    - evaluate each term in sequence: 1 * 3 + 2 = 5
    - evaluate a only once: 1 * 3 + 1 = 4
    - call fun(x) before evaluating a: 2 * 3 + 2 = 8
  - all are correct according to the syntax

Positions on side effects in expressions
- outlaw them: disallow functions with side effects, or make their effect undefined
- or allow them, but state exactly what the order of evaluation is so the programmer can make proper use of it
- the latter is the most general, but in many language definitions the question is simply ignored, and the result is that different implementations provide conflicting interpretations
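To see the side-effect problem in the lecture's own terms, here is a small C sketch of a * fun(x) + a where fun has a side effect on a. C leaves the order in which the operands are evaluated unspecified, so different compilers may legitimately produce different results; the names a and fun follow the slide, everything else is illustrative.

    #include <stdio.h>

    int a = 1;

    /* fun returns an explicit result (3) but also has a side effect:
       it changes the value stored in a. */
    int fun(int x) {
        (void)x;
        a = 2;
        return 3;
    }

    int main(void) {
        /* C does not specify the order in which the two fetches of a and the
           call fun(0) happen, so the result depends on the compiler; the three
           scenarios discussed above give 5, 4, and 8 respectively. */
        int result = a * fun(0) + a;
        printf("%d\n", result);
        return 0;
    }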