Chapter 3: Describing Syntax and Semantics

Objectives:
- Introduction
- The General Problem of Describing Syntax
- Formal Methods of Describing Syntax
- Attribute Grammars
- Describing the Meanings of Programs: Dynamic Semantics

3.1 Introduction

A language is a set of sentences, or statements. The sentences are the valid strings of the language: sequences of characters from the language's alphabet arranged in a way that is consistent with the grammar of the language.

Who must use language definitions?
- Language designers
- Implementers
- Programmers (the users of the language)

A formal description of a language is essential for learning it, writing programs in it, and implementing it. The description must therefore be both precise and understandable.

Languages are described by their syntax and semantics.

What is syntax? It is the form or structure of the expressions, statements, and program units. The syntax rules of a language specify which strings of characters from the language's alphabet are in the language.

What is semantics? It is the meaning of the expressions, statements, and program units.

- Syntax describes what the language looks like.
- Semantics describes, in a formal way, what a particular construct actually does.
- Syntax is much easier to describe than semantics.

3.2 The General Problem of Describing Syntax: Terminology

Formal descriptions of the syntax of programming languages, for simplicity's sake, often do not include descriptions of the lowest-level syntactic units. These small units are called lexemes. A lexeme is the basic unit identified during lexical analysis. It is a sequence of characters from the language's alphabet that belong together, e.g. numeric literals, operators (+), and special words (begin).

Lexemes are partitioned into groups. For example, the names of variables, methods, classes, and so forth in a programming language form a group called identifiers. Each lexeme group is represented by a name, or token.
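The grouping of lexemes into tokens can be sketched with a small lexical analyzer. This is only an illustration, not part of the chapter: the token names and the regular expressions in `TOKEN_SPEC` are assumptions chosen for a tiny Java-like statement, and `tokenize` is a hypothetical helper.

```python
import re

# Token names and patterns are illustrative assumptions, not a full Java lexer.
TOKEN_SPEC = [
    ("int_literal", r"\d+"),
    ("identifier",  r"[A-Za-z_]\w*"),
    ("equal_sign",  r"="),
    ("plus_op",     r"\+"),
    ("mult_op",     r"\*"),
    ("semicolon",   r";"),
    ("skip",        r"\s+"),
]
MASTER = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC))

def tokenize(source):
    """Return (lexeme, token) pairs; raise ValueError on an unknown character."""
    pairs = []
    pos = 0
    while pos < len(source):
        match = MASTER.match(source, pos)
        if match is None:
            raise ValueError(f"unexpected character {source[pos]!r} at position {pos}")
        if match.lastgroup != "skip":      # whitespace separates lexemes but is not one
            pairs.append((match.group(), match.lastgroup))
        pos = match.end()
    return pairs

for lexeme, token in tokenize("index = 2 * count + 17;"):
    print(f"{lexeme:<6} {token}")
```

Each pair printed is one lexeme together with the token (category) it belongs to. Two different lexemes, `index` and `count`, map to the same token, `identifier`.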
A token of a language is a category of its lexemes, and a lexeme is an instance of a token. For example, identifier is a token that can have lexemes, or instances, such as counter and total (variable names).

Example: consider the following Java statement:

    index = 2 * count + 17;

The lexemes and tokens of this statement are shown in the following table:

    Lexeme   Token
    index    identifier
    =        equal_sign
    2        int_literal
    *        mult_op
    count    identifier
    +        plus_op
    17       int_literal
    ;        semicolon

In general, languages can be formally defined in two distinct ways: a language can be (1) generated or (2) recognized.

3.2.1 Language Recognizers

A recognizer of a language distinguishes the strings that are in the language from those that are not. The lexical analyzer and the parser of a compiler together form a recognizer for the language the compiler translates: the lexical analyzer recognizes tokens, and the parser recognizes syntactic structure.

3.2.2 Language Generators

A generator produces the valid sentences of a language. In some cases it is more useful than a recognizer, since we can "watch and learn" from the sentences it produces.

3.3 Formal Methods of Describing Syntax

The formal mechanisms used to describe the syntax of programming languages are called grammars.

3.3.1 BNF (Backus-Naur Form) and Context-Free Grammars

BNF is a widely accepted way to describe the syntax of a programming language.

3.3.1.1 Context-Free Grammars

Regular expressions and context-free grammars are both useful in describing the syntax of a programming language. Regular expressions describe how a token is built from characters of the alphabet, and a context-free grammar describes how tokens are put together to form a valid sentence.

3.3.1.2 Origins of Backus-Naur Form (BNF)

John Backus introduced the notation to describe ALGOL 58, and Peter Naur revised it to describe ALGOL 60. BNF is nearly identical to context-free grammars.

3.3.1.3 BNF Fundamentals

• A metalanguage is a language that is used to describe another language.
  BNF is a metalanguage for programming languages.
• A BNF description consists of rules (or productions). A rule has a left-hand side (LHS), which is the abstraction being defined, and a right-hand side (RHS), which is its definition. For example, a Java assignment statement could be defined as follows:

      <assign> → <var> = <expression>

• The abstractions in a BNF description, or grammar, are often called nonterminal symbols, or simply nonterminals.
• The lexemes and tokens in the rules are called terminal symbols, or simply terminals.
• The LHS of a rule is a single nonterminal, and the RHS is a string of terminals and/or nonterminals, i.e. a mixture of tokens, lexemes, and references to other abstractions.
• Nonterminals are often enclosed in angle brackets < >.
• A BNF description, or grammar, is a collection of rules. A rule says that wherever the nonterminal on its LHS appears, it can be replaced by the RHS, just like expanding a nonterminal node into its children in a tree.

An abstraction (or nonterminal symbol) can have more than one RHS, e.g.

    <if_stmt> → if ( <logic_expr> ) <stmt>
              | if ( <logic_expr> ) <stmt> else <stmt>

3.3.1.4 Describing Lists

Recursion is used in BNF to describe lists, e.g.

    <ident_list> → ident
                 | ident , <ident_list>

3.3.1.5 Grammars and Derivations

- BNF is a generator of the language: the sentences of a language can be generated from the start symbol by applying a sequence of rules.
- A derivation is a repeated application of rules, starting with the start symbol and ending with a sentence (all terminal symbols).
- Replacing a nonterminal with different RHSs may derive different sentences.
- Each string of symbols in a derivation is a sentential form. A sentence is a sentential form that has only terminal symbols.
- If the nonterminal replaced at every step is the leftmost one in the sentential form, the derivation is a leftmost derivation. The set of sentences generated by the grammar is not affected by the derivation order.
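A leftmost derivation can be sketched in a few lines of code. The sketch below uses the `<ident_list>` list grammar shown earlier; the dictionary encoding of the grammar and the helper names `leftmost_derivation` and `is_sentence` are assumptions made for illustration.

```python
# Grammar from the list example: <ident_list> -> ident | ident , <ident_list>
# Keys are nonterminals; each value lists the alternative right-hand sides.
GRAMMAR = {
    "<ident_list>": [
        ["ident"],
        ["ident", ",", "<ident_list>"],
    ],
}

def leftmost_derivation(start, choices):
    """Apply one rule per step to the leftmost nonterminal.

    `choices` gives, for each step, the index of the RHS to use.
    Returns every sentential form, beginning with the start symbol.
    """
    form = [start]
    steps = [form]
    for choice in choices:
        # Locate the leftmost nonterminal in the current sentential form.
        pos = next(i for i, symbol in enumerate(form) if symbol in GRAMMAR)
        rhs = GRAMMAR[form[pos]][choice]
        form = form[:pos] + rhs + form[pos + 1:]
        steps.append(form)
    return steps

def is_sentence(form):
    """A sentence is a sentential form with no nonterminals left."""
    return all(symbol not in GRAMMAR for symbol in form)

steps = leftmost_derivation("<ident_list>", [1, 1, 0])
for form in steps:
    print("=>", " ".join(form))
```

Choosing the recursive alternative twice and then the non-recursive one derives a list of three identifiers. Every intermediate line printed is a sentential form, and only the last one is a sentence.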
Example 3.1: a grammar for a small language

    <program>    → begin <stmt_list> end
    <stmt_list>  → <stmt> | <stmt> ; <stmt_list>
    <stmt>       → <var> = <expression>
    <var>        → A | B | C
    <expression> → <var> + <var> | <var> - <var> | <var>

A derivation of a program in this language follows:

    <program> => begin <stmt_list> end
              => begin <stmt> ; <stmt_list> end
              => begin <var> = <expression> ; <stmt_list> end
              => begin A = <expression> ; <stmt_list> end
              => begin A = <var> + <var> ; <stmt_list> end
              => begin A = B + <var> ; <stmt_list> end
              => begin A = B + C ; <stmt_list> end
              => begin A = B + C ; <stmt> end
              => begin A = B + C ; <var> = <expression> end
              => begin A = B + C ; B = <expression> end
              => begin A = B + C ; B = <var> end
              => begin A = B + C ; B = C end

Example 3.2: a grammar for simple assignment statements

    <assign> → <id> = <expr>
    <id>     → A | B | C
    <expr>   → <id> + <expr> | <id> * <expr> | ( <expr> ) | <id>

A = B * (A + C) is generated by the leftmost derivation:

    <assign> => <id> = <expr>
             => A = <expr>
             => A = <id> * <expr>
             => A = B * <expr>
             => A = B * ( <expr> )
             => A = B * ( <id> + <expr> )
             => A = B * ( A + <expr> )
             => A = B * ( A + <id> )
             => A = B * ( A + C )

3.3.1.6 Parse Trees

A derivation can be represented by a hierarchical structure called a parse tree. The root of the tree is the start symbol, and applying a rule corresponds to expanding a nonterminal node of the tree into its children.

A parse tree for the simple statement A = B * (A + C) (children are indented under their parent):

    <assign>
      <id>
        A
      =
      <expr>
        <id>
          B
        *
        <expr>
          (
          <expr>
            <id>
              A
            +
            <expr>
              <id>
                C
          )

3.3.1.7 Ambiguity

A grammar is ambiguous if it generates some sentence that has more than one parse tree, i.e. more than one leftmost derivation.
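To make the definition concrete, the sketch below counts the parse trees of a token string. It assumes a deliberately small expression grammar, `<expr> → <expr> + <expr> | <expr> * <expr> | <id>` with `<id> → A | B | C`, in which nothing fixes how two operators group. The function name `count_parse_trees` is hypothetical, and the counting strategy (try every operator as the root of the tree) is just one simple way to do it.

```python
from functools import lru_cache

def count_parse_trees(tokens):
    """Count parse trees for `tokens` under the assumed grammar
    <expr> -> <expr> + <expr> | <expr> * <expr> | <id>,  <id> -> A | B | C
    """
    tokens = tuple(tokens)

    @lru_cache(maxsize=None)
    def count(lo, hi):
        # Number of distinct ways tokens[lo:hi] can be derived from <expr>.
        if hi - lo == 1:
            return 1 if tokens[lo] in ("A", "B", "C") else 0
        total = 0
        for mid in range(lo + 1, hi - 1):
            if tokens[mid] in ("+", "*"):
                # Treat tokens[mid] as the operator at the root of the tree.
                total += count(lo, mid) * count(mid + 1, hi)
        return total

    return count(0, len(tokens))

print(count_parse_trees("B + C * A".split()))  # 2
```

A count greater than 1 for any sentence is enough to show that the grammar is ambiguous. Here the two trees correspond to grouping the operands as B + (C * A) and as (B + C) * A.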
Two distinct parse trees for the same sentence, A = B + C * A, both generated by the ambiguous grammar given after them (children are indented under their parent).

Tree 1, with + at the top of the expression:

    <assign>
      <id>
        A
      =
      <expr>
        <expr>
          <id>
            B
        +
        <expr>
          <expr>
            <id>
              C
          *
          <expr>
            <id>
              A

Tree 2, with * at the top of the expression:

    <assign>
      <id>
        A
      =
      <expr>
        <expr>
          <expr>
            <id>
              B
          +
          <expr>
            <id>
              C
        *
        <expr>
          <id>
            A

An Ambiguous Grammar

    <assign> → <id> = <expr>
    <id>     → A | B | C
    <expr>   → <expr> + <expr> | <expr> * <expr> | ( <expr> ) | <id>

3.3.1.8 Operator Precedence

Operator precedence can be built into a grammar by arranging the rules so that operators with higher precedence are grouped with their operands first and therefore appear lower in the parse tree. If the parse tree itself indicates the precedence levels of the operators, this source of ambiguity disappears.

    <expr> → <expr> - <term> | <term>
    <term> → <term> / const | const

An Unambiguous Grammar

    <assign> → <id> = <expr>
    <id>     → A | B | C
    <expr>   → <expr> + <term> | <term>
    <term>   → <term> * <factor> | <factor>
    <factor> → ( <expr> ) | <id>

3.3.1.9 Associativity of Operators

Operator associativity can also be indicated by a grammar.

    <expr> → <expr> + <expr> | const    (ambiguous)
    <expr> → <expr> + const | const     (unambiguous)

Parse tree for const + const + const under the unambiguous rule. The left recursion makes + left associative:

    <expr>
      <expr>
        <expr>
          const
        +
        const
      +
      const

3.3.2 Extended BNF

- EBNF has the same expressive power as BNF.
- Its additions are optional parts, repetition, and multiple choice, very much like regular expressions.
- Optional parts are placed in brackets [ ]:

      <if_stmt> → if ( <expression> ) <statement> [ else <statement> ]

- Alternative parts of an RHS are placed in parentheses and separated by vertical bars:

      <term> → <term> (* | / | %) <factor>

- Repetitions (0 or more) are placed inside braces { }:

      <ident_list> → <identifier> { , <identifier> }

BNF:

    <expr> → <expr> + <term> | <expr> - <term> | <term>
    <term> → <term> * <factor> | <term> / <factor> | <factor>

EBNF:

    <expr> → <term> { (+ | -) <term> }
    <term> → <factor> { (* | /) <factor> }

3.4 Attribute Grammars

An attribute grammar is an extension to a context-free grammar.
It is used to describe more of the structure of a programming language than can be described with a context-free grammar. The extension allows certain language rules, such as type compatibility, to be described.

3.4.1 Static Semantics

- Many restrictions on programming languages are difficult or impossible to describe in BNF, but they can be described once attributes are added to the terminals and nonterminals of the grammar.
- These attributes can be computed at compile time, hence the name static semantics.
- Consider the rule that a variable must be declared before it is referenced. It cannot be specified in a context-free grammar, but it can be checked at compile time.
- Some rules could be specified in the grammar of a language but would complicate it unnecessarily, e.g. the Java rule that a string literal cannot be assigned to a variable that was declared to be of type int.

3.4.2 Basic Concepts

An attribute grammar consists of, in addition to its BNF rules, the attributes of the grammar symbols, a set of attribute computation functions (or semantic functions), and predicate functions that determine whether a rule may be applied. The latter two are associated with the grammar rules. In summary:

- Attribute grammars are grammars to which attributes, attribute computation functions, and predicate functions have been added.
- Attributes are associated with grammar symbols and are similar to variables in the sense that they can have values assigned to them.
- Attribute computation functions, also called semantic functions, are associated with grammar rules. They specify how attribute values are computed.
- Predicate functions, which state some of the syntax and static semantic rules of the language, are associated with grammar rules.

3.4.3 Attribute Grammars Defined

Definition: an attribute grammar is a context-free grammar (CFG) with the following additions:

- Associated with each grammar symbol X is a set of attributes A(X).
- Each rule has a set of functions that define certain attributes of the nonterminals in the rule.
- Each rule has a set of predicates to check for attribute consistency.
- A(X) consists of two disjoint sets S(X) and I(X), called synthesized and inherited attributes, respectively.
- Synthesized attributes pass semantic information up a parse tree.
- Inherited attributes pass semantic information down and across a parse tree.
- Let X0 → X1 … Xn be a rule.
- Functions of the form S(X0) = f(A(X1), …, A(Xn)) define synthesized attributes of X0. Their values depend only on the attributes of the children nodes.
- Functions of the form I(Xj) = f(A(X0), …, A(Xn)), for 1 <= j <= n, define inherited attributes of Xj. Their values depend on the attributes of the parent node and the sibling nodes.

Note that, to avoid circularity, inherited attributes are often restricted to functions of the form

    I(Xj) = f(A(X0), …, A(X(j-1)))

This form prevents an inherited attribute from depending on itself or on attributes to its right in the parse tree.

3.4.4 Intrinsic Attributes

Intrinsic attributes are synthesized attributes of leaf nodes whose values are determined outside the parse tree. For example, the type of an instance of a variable in a program could come from the symbol table, which stores variable names and their types.

3.4.5 Examples of Attribute Grammars

Example: expressions of the form id + id.

- Each id can be either int_type or real_type.
- When the operand types are the same, the expression type is that of the operands.
- When the operand types are not the same, the type of the expression is always real.
- The type of the expression must match its expected type.
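The typing rules just listed can be sketched as a small checker. This is an illustration under stated assumptions, not the chapter's formal attribute grammar. The symbol table contents are invented, the expected type is assumed to come from the target variable of an assignment of the form `target = left + right`, and the names `SYMBOL_TABLE`, `expr_actual_type`, and `check_assignment` are hypothetical.

```python
INT, REAL = "int", "real"

# Hypothetical symbol table: the source of each variable's intrinsic type.
SYMBOL_TABLE = {"A": REAL, "B": INT, "C": INT}

def expr_actual_type(left, right):
    """Type of `left + right`, computed from the operand types (passed up)."""
    left_type = SYMBOL_TABLE[left]
    right_type = SYMBOL_TABLE[right]
    # Same types give that type; mixed int/real gives real.
    return INT if left_type == INT and right_type == INT else REAL

def check_assignment(target, left, right):
    """Type-check `target = left + right`.

    Returns (expected_type, actual_type, ok). The expected type is
    assumed to be the target's type, handed down to the expression.
    """
    expected = SYMBOL_TABLE[target]
    actual = expr_actual_type(left, right)
    return expected, actual, expected == actual

print(check_assignment("A", "A", "B"))  # ('real', 'real', True)
print(check_assignment("B", "A", "B"))  # ('int', 'real', False)
```

The value returned by `expr_actual_type` plays the role of a synthesized attribute, since it is computed from the operands. `expected` plays the role of an inherited attribute, and the final comparison acts as the predicate.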
BNF:

    <assign> → <var> = <expr>
    <expr>   → <var> + <var> | <var>
    <var>    → A | B | C

• Attributes:
  - actual_type: a synthesized attribute associated with the nonterminals <var> and <expr>.
  - expected_type: an inherited attribute associated with the nonterminal <expr>.

An attribute grammar for simple assignment statements

1. Syntax rule:   <assign> → <var> = <expr>
   Semantic rule: <expr>.expected_type ← <var>.actual_type

2. Syntax rule:   <expr> → <var>[2] + <var>[3]
   Semantic rule: <expr>.actual_type ←
                      if (<var>[2].actual_type = int) and (<var>[3].actual_type = int)
                      then int
                      else real
                      end if
   Predicate:     <expr>.actual_type == <expr>.expected_type

3. Syntax rule:   <expr> → <var>
   Semantic rule: <expr>.actual_type ← <var>.actual_type
   Predicate:     <expr>.actual_type == <expr>.expected_type

4. Syntax rule:   <var> → A | B | C
   Semantic rule: <var>.actual_type ← look-up(<var>.string)

The look-up function looks up a given name in the symbol table and returns the variable's type.

• Consider a parse tree of the sentence A = A + B generated by the grammar in the previous example.

3.4.6 Computing Attribute Values

1. <var>.actual_type ← look-up(A) (Rule 4)
2. <expr>.expected_type ← <var>.actual_type (Rule 1)
3. <var>[2].actual_type ← look-up(A) (Rule 4)
   <var>[3].actual_type ← look-up(B) (Rule 4)
4. <expr>.actual_type ← either int or real (Rule 2)
5. <expr>.expected_type == <expr>.actual_type is either TRUE or FALSE (Rule 2)

Figure 3.7: The flow of attributes in the tree

Figure 3.8: A fully attributed parse tree

3.5 Describing the Meanings of Programs: Dynamic Semantics

- The dynamic semantics of a program is the meaning of its expressions, statements, and program units. It describes what programming constructs mean when they are executed.
- There is no single widely accepted notation or formalism for describing semantics.
- A methodology and notation for semantics is needed for several reasons:
  - Programmers need to know what statements mean.
  - Compiler writers must know exactly what language constructs do.
- Three methods are used to describe semantics formally:
  - Operational semantics
  - Axiomatic semantics
  - Denotational semantics

3.5.1 Operational Semantics

The idea behind operational semantics is to describe the meaning of a statement or program by specifying the effects of running it on a machine.

[Table: C Statement | Operational Semantics]

3.5.2 Denotational Semantics

- It is the most rigorous widely known method for describing the meaning of programs.
- It is based on recursive function theory.
- It is the most abstract semantics description method.
- It was originally developed by Scott and Strachey (1970).

The process of building a denotational specification for a language (not necessarily easy):
• Define a mathematical object for each language entity.
• Define a function that maps instances of the language entities onto instances of the corresponding mathematical objects.

The method is named denotational because the mathematical objects denote the meaning of their corresponding syntactic entities.

The difference between denotational and operational semantics:
• In operational semantics, programming language constructs are translated into simpler programming language constructs.
• In denotational semantics, programming language constructs are mapped to mathematical objects.

3.5.2.1 Simple Example: Decimal Numbers

    <dec_num> → '0' | '1' | '2' | '3' | '4' | '5' | '6' | '7' | '8' | '9'
              | <dec_num> ('0' | '1' | '2' | '3' | '4' | '5' | '6' | '7' | '8' | '9')

    Mdec('0') = 0, Mdec('1') = 1, …, Mdec('9') = 9
    Mdec(<dec_num> '0') = 10 * Mdec(<dec_num>)
    Mdec(<dec_num> '1') = 10 * Mdec(<dec_num>) + 1
    …
    Mdec(<dec_num> '9') = 10 * Mdec(<dec_num>) + 9

3.5.2.2 The State of a Program

The state of a program is the set of current values of all its variables:

    s = {<i1, v1>, <i2, v2>, …, <in, vn>}

Let VARMAP be a function that, when given a variable name and a state, returns the current value of the variable:

    VARMAP(ij, s) = vj

3.5.3 Axiomatic Semantics

- Axiomatic semantics is based on mathematical logic.
- It was defined in conjunction with the development of a method for proving the correctness of programs, rather than to specify the meaning of a program directly.
- Each statement of a program is both preceded and followed by a logical expression that specifies constraints on program variables.
- Simple Boolean expressions are often adequate to express the constraints.

3.5.3.1 Assertions

The logical expressions are called predicates, or assertions. An assertion immediately preceding a program statement describes the constraints on the program variables at that point in the program (the precondition). An assertion immediately following a statement describes the new constraints on those variables (and possibly others) after execution of the statement (the postcondition).

Precondition and postcondition assertions are written in braces to distinguish them from program statements:

    {x > 10} sum = 2 * x + 1 {sum > 1}

Developing an axiomatic description or proof of a given program requires that every statement in the program have both a precondition and a postcondition.
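The triple {x > 10} sum = 2 * x + 1 {sum > 1} can be exercised in code. The sketch below only tests the triple on a finite set of sample integers, which is not a proof in the axiomatic sense. The function names are hypothetical and the sample range is an arbitrary assumption.

```python
def statement(x):
    """The statement under study: sum = 2 * x + 1."""
    return 2 * x + 1

def precondition(x):
    return x > 10

def postcondition(total):
    return total > 1

def holds_on_samples(samples):
    """Check {x > 10} sum = 2 * x + 1 {sum > 1} on sample inputs.

    Inputs that violate the precondition are skipped, because the
    triple promises nothing about them.
    """
    return all(
        postcondition(statement(x))
        for x in samples
        if precondition(x)
    )

print(holds_on_samples(range(-100, 1000)))  # True
```

The precondition x > 10 is stronger than this postcondition needs. Since 2 * x + 1 > 1 exactly when x > 0, the weaker precondition {x > 0} would also make the triple hold.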
Example: show that the program segment

    {x = 1}
    y = 2
    z = x + y
    {z = 3}

is correct with respect to the precondition {P}: x = 1 and the postcondition {Q}: z = 3.

Solution: suppose that P is true, so that x = 1 when the program begins. Then y is assigned 2, and z is assigned the sum of x and y, which is 3:

    x = 1
    y = 2
    z = 1 + 2 = 3

Hence the program segment is correct: {P} statement {Q} is true.

Summary

- BNF and context-free grammars are equivalent metalanguages, well suited for describing the syntax of programming languages.
- An attribute grammar is a descriptive formalism that can describe both the syntax and the semantics of a language.
- There are three primary methods of semantics description: operational, axiomatic, and denotational.