Parsing Techniques – A Practical Guide, by Dick Grune and Ceriel J. H. Jacobs

1 Book in slide format
• Fantastic book: Parsing Techniques
• I went through chapters 1 and 2 of the book and created slides of them.
• That is, the following slides are chapters 1 and 2, in slide form.
• Additionally, there are several slides from:
– Personal correspondence with one of the authors, Dick Grune.
– Material from other sources.
– Slides that I created, applying the concepts to XML.
Roger L. Costello
June 1, 2014

2 Why grammars, not automata?
• There is a close relationship between formal grammars and other abstract notions used in computer science, such as automata and algorithms.
• Indeed, since the results in one theory can often be translated into another, it seems to be an arbitrary decision as to which interpretation is primary.
• In these slides formal grammars are given preferential treatment because they are probably the most commonly known of the various theories among computer scientists. This is due to the success of context-free grammars in describing the syntax of programming languages.

3 Chapter 1: Defining Parsing and Grammars

4 Parsing
• Parsing is the process of structuring a linear representation in accordance with a given grammar.
• This definition has been kept abstract on purpose to allow as wide an interpretation as possible.
• The “linear representation” may be:
– a sentence
– a computer program
– a knitting pattern
– a sequence of geological strata
– a piece of music
– actions of ritual behavior
In short, any linear sequence in which the preceding elements in some way restrict the next element.
• For some of the examples the grammar is well known, for some it is an object of research, and for some our notion of a grammar is only just beginning to take shape.

5 Parsing
[diagram: a parser takes a grammar and a linear representation as inputs and produces a structure]
Parsing is the process of structuring a linear representation in accordance with a given grammar. A “linear representation” is any linear sequence in which the preceding elements in some way restrict the next element.

6 Grammar: a succinct summary
• For each grammar, there are generally an infinite number of linear representations (“sentences”) that can be structured with it.
• That is, a finite-sized grammar can supply structure to an infinite number of sentences.
• This is the main strength of the grammar paradigm and indeed the main source of the importance of grammars: they summarize succinctly the structure of an infinite number of objects of a certain class.

7 Reasons for parsing
There are several reasons to perform this structuring process called parsing.
1. One reason derives from the fact that the obtained structure helps us to process the object further. When we know that a certain segment of a sentence is the subject, that information helps in understanding or translating the sentence. Once the structure of a document has been brought to the surface, it can be processed more easily.
2. A second reason is related to the fact that the grammar in a sense represents our understanding of the observed sentences: the better a grammar we can give for the movement of bees, the deeper our understanding of them.
3. A third lies in the completion of missing information that parsers, and especially error-repairing parsers, can provide. Given a reasonable grammar of the language, an error-repairing parser can suggest possible word classes for missing or unknown words on clay tablets.
8 Grammatical inference
• Grammatical inference: given a (large) set of sentences, find the/a grammar which produces them.
• Grammatical inference is also known as grammar induction or syntactic pattern recognition.

9 XML Schema from an XML instance
The XML tool oXygen XML does grammatical inference when it creates an XML Schema from an XML instance document.

10 The science of parsing
• Parsing is no longer an arcane art.
• In the 1970s Aho, Ullman, Knuth, and many others put parsing techniques solidly on their theoretical feet.

11 Mathematician vs. Computer Scientist
• To a mathematician all structures are static. They have always existed and will always exist. The only time-dependence is that we have not discovered all the structures yet.
– Example: the Peano axioms create the integers without reference to time.
• The computer scientist is concerned with (and fascinated by) the continuous creation, combination, separation, and destruction of structures. Time is of the essence.
– Example: if the computer scientist uses the Peano axioms to implement integer addition, he finds they describe a very slow process, which is why he will look for a more efficient approach.

12 Many uses for parsing
Parsing is for anyone who has parsing to do:
– The compiler writer
– The linguist
– The database interface writer
– The geologist who wants to test grammatical descriptions of a sequence of geological strata
– The musicologist who wants to test grammatical descriptions of a music piece

13 Requirements for a parser developer
It requires a good ability to visualize, some programming experience, and the willingness and patience to follow non-trivial examples.

14 Chapter 2: Grammars as a Generating Device

15 Need to define some terms
• In computer science as in everyday parlance, a grammar serves to describe a language.
• To establish our terminology and to demarcate the universe of discourse, we shall examine these terms: language, grammar, and language descriptions.

16 Language
We examine three views of the word “language”:
– How the larger part of mankind views language
– How the computer scientist views language
– How the formal-linguist views language

17 Layman’s view of languages
• To the larger part of mankind, language is first and foremost a means of communication.
• Communication is brought about by sending messages, through air vibrations or through written symbols.
• Languages have three levels of composition:
– Messages fall apart into sentences,
– which are composed of words,
– which in turn consist of symbol sequences when written.

18 Computer scientist view of languages
• A language has sentences, and these sentences possess structure.
• Information may possibly be derived from the sentence’s structure; that information is called the meaning of the sentence.
• Sentences consist of words called tokens, each possibly carrying a piece of information, which is its contribution to the meaning of the whole sentence.

19 Computer scientist view of languages
• A language is a probably infinitely large set of sentences, each composed of tokens in such a way that it has structure.
• The tokens and structure cooperate to describe the semantics (meaning) of the sentence.
• To a computer scientist 3 + 4 × 5 is a sentence in the language of “arithmetics on single digits”. Its structure can be shown by inserting parentheses: (3 + (4 × 5)). Its semantics is 23.

20 Formal-linguist view of languages
• A language is a “set” of sentences, and each sentence is a “sequence” of “symbols”.
• There is no meaning, no structure. Either a sentence belongs to the language or it does not.
• The only property of a symbol is that it has an identity.
• In any language there are a certain number of different symbols – the alphabet – and that number must be finite. Just for convenience we write these symbols as a, b, c, …, but ◊, ▪, ⱴ, … would do equally well, as long as there are enough symbols.

21 Formal-linguist view of languages
• The word “sequence” means that the symbols in each sentence are in a fixed order and we should not shuffle them.
• The word “set” means an unordered collection with all the duplicates removed. A set can be written down by writing the objects in it, surrounded by curly braces.
• All this means is that to a formal-linguist the following is a language: {a, b, ab, ba}
• The formal-linguist also calls a sentence a “word”, and he says that “the word ab is in the language {a, b, ab, ba}”.

22 Formal-linguist vs. computer scientist
• The formal-linguist holds his view of language because he wants to study the fundamental properties of languages in their naked beauty. It gives him a grip on a seemingly chaotic and perhaps infinitely complex object: natural language.
• The computer scientist holds his view of language because he wants a clear, well-understood, and unambiguous means of describing objects in the computer and of communication with the computer (a most exacting communication partner).

23 Grammars
We examine three views of the word “grammar”:
– How the larger part of mankind views grammar
– How the formal-linguist views grammar
– How the computer scientist views grammar

24 Layman’s view of grammars
A grammar is a book of rules and examples which describes and teaches the language.

25 Formal-linguist’s view of grammars
• A generative grammar is an exact, finite-size recipe for constructing the sentences in the language.
• This means that, following the recipe, it must be possible to construct each sentence of the language (in a finite number of actions) and no others.
• This does not mean that, given a sentence, the recipe tells us how to construct that particular sentence, only that it is possible to do so.

26 Computer scientist’s view of grammars
The computer scientist has the same view as the formal-linguist, with the additional requirement that the recipe should imply how a sentence can be constructed.

27 Infinite sets from finite descriptions
A language is a possibly infinite set of sequences of symbols, and a grammar is a finite recipe to generate those sentences.

28 Example of an infinite set from a finite description
The description “the set of all positive integers” is a very finite-size description of a definitely infinite-size set.

29 Not all languages are describable
• Can all languages be described by finite descriptions?
• Answer: No.

30 Outline of the proof
• The proof that not all languages can be described by finite descriptions is not trivial. But it is very interesting and famous. We will present an outline of it.
• The proof is based on two observations and a trick.

31 Enumerate language descriptions
The language descriptions can be listed. This is done as follows:
1. Take all descriptions of size one, that is, those of only one letter long, and sort them alphabetically.
• Depending on what, exactly, we accept as a description, there may be zero descriptions of size one, or 27 (all letters + space), or 95 (all printable ASCII characters), or something similar.
2. Take all descriptions of size two, sort them alphabetically.
Do the same for lengths 3, 4, and further. This is observation number one.
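(Added sketch – mine, not from the book or the original slides.) The enumeration of slide 31 can be run mechanically on a toy three-letter description alphabet; every description then gets the well-defined position that the next slide calls L + P:

```python
# Sketch: enumerate all "descriptions" over a toy alphabet, shortest
# first and alphabetically within each length, then find the position
# of a given description on that list.
from itertools import count, product

ALPHABET = "abc"   # assumption: a tiny alphabet, just for illustration

def descriptions():
    for size in count(1):                               # sizes 1, 2, 3, ...
        for letters in product(ALPHABET, repeat=size):  # alphabetical order
            yield "".join(letters)

def position(desc):
    for pos, d in enumerate(descriptions(), start=1):
        if d == desc:
            return pos

print(position("ba"))   # 7: it follows a, b, c, aa, ab, ac
```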
32 Each description has a well-defined position
• Now we have a list of descriptions. Each describes a language.
• So each description has a position on the list.
• Example: our description “the set of all positive integers” is 32 characters long. To find its position on the list, we have to calculate how many descriptions there are with fewer than 32 characters, say L. We then have to generate all descriptions of size 32, sort them, and determine the position of our description in it, say P, and add the two numbers L and P. This will, of course, give a huge number, but it does ensure that the description is on the list in a well-defined position. This is observation number two.

33 Our example description is at position L + P
[diagram: the descriptions of sizes 1 through 31 occupy the first L positions of the list; our description “the set of all positive integers” sits at position P among the descriptions of size 32, so at position L + P overall]

34 Two things to note
• Note #1: Just listing all descriptions alphabetically, without reference to their lengths, would not do. There are already infinitely many descriptions starting with an “a”, so no description starting with a higher letter could get a number on the list.
• Note #2: There is no need to actually do all this. It is just a thought experiment that allows us to examine and draw conclusions about the behavior of a system in a situation which we cannot possibly examine physically.

35 Both nonsensical and meaningful descriptions
There will be many nonsensical descriptions on the list. This is immaterial to the argument. The important thing is that all meaningful descriptions are on the list, and the strategy ensures that.

36 Alphabet
• The words (sentences) in a language are composed of a finite set of symbols.
• This set of symbols is called the alphabet.
• We will assume the symbols in the alphabet are ordered.
• Then the words in the language can be ordered too.
• We shall indicate the alphabet by Σ.

37 Language that consists of all possible words
• The language that consists of all possible words that can be built from an alphabet is called Σ*.
• For the alphabet Σ = {a, b} we get the language { , a, b, aa, ab, ba, bb, aaa, …}
• The first element is the empty word (the word consisting of zero as and zero bs). It may be easily overlooked, so we shall write it as ε (epsilon), regardless of the alphabet. So, Σ* = {ε, a, b, aa, ab, ba, bb, aaa, …}

38 Words in Σ* can be enumerated
• Since the symbols in the alphabet Σ are ordered, we can list the words in the language Σ*, using the same technique as in the previous slides:
– First, list all words of size zero, sorted; then list all words of size one, sorted; and so on.
• This is actually the order already used in our set notation for Σ*.

39 Compare language L against Σ*
• Since Σ* contains all possible words, all languages using alphabet Σ are subsets of it.
• Let L be a language over Σ (the word “over” means “built out of”).
• We can go through the list of words in Σ* and put checkmarks on all words that are in L.
• Suppose our language L is “the set of all words that contain more as than bs”. Then L is {a, aa, aab, aba, baa, …}
[diagram: the list ε, a, b, aa, ab, ba, bb, aaa, aab, aba, abb, … with checkmarks on a, aa, aaa, aab, aba, …]

40 Encode languages using 0 and 1
• The list of blanks and checkmarks is sufficient to identify and describe a language.
• For convenience we write the blank as 0 and the checkmark as 1, as if they were bits in a computer.
• We can now write L = 01010001110…
– So, we have attached the infinite bit-string 01010001110… to the language description “the set of all words that contain more as than bs”.
• The set of all words over an alphabet is Σ* = 1111111…
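(Added sketch, not from the book.) The encoding of slides 37–40 is equally mechanical: enumerate Σ* in the order just described and emit one bit per word:

```python
# Sketch: the first bits of the bit-string for the language
# "all words that contain more as than bs" over Sigma = {a, b}.
from itertools import product

def sigma_star(alphabet, max_size):
    for size in range(max_size + 1):        # size 0 yields the empty word
        for w in product(alphabet, repeat=size):
            yield "".join(w)

def in_L(w):
    return w.count("a") > w.count("b")      # the language L of slide 39

bits = "".join("1" if in_L(w) else "0" for w in sigma_star("ab", 3))
print(bits)   # 010100011100100 -- starts with 01010001110, as on slide 40
```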
41 Languages are infinite bit-strings
• Any language can be encoded as an infinite bit-string, be it a formal language like L, a programming language like Java, or a natural language like English.
• For the English language the 1s in the bit-string will be very scarce, since hardly any arbitrary sequence of letters is a good English sentence.

42 List of languages
• We attached the infinite bit-string 01010001110… to the language description “the set of all words that contain more as than bs”.
• In the same way, we can attach bit-strings to all descriptions.
• Some descriptions may not yield a language, in which case we can attach an arbitrary infinite bit-string to them.
• Since all descriptions can be put on a single numbered list, we get, for example, this table:
Description      Language
Description #1   000000100…
Description #2   110010001…
Description #3   011011010…
Description #4   110011010…
Description #5   100000011…
Description #6   111011011…
…                …

43 The list is incomplete
• Many languages exist that are not on the list of languages above.
• The above list is far from complete, although the list of descriptions is complete.
• We shall prove this by using the diagonalization process (“Diagonalverfahren”) of Cantor.

44 Flip the bits along the diagonal
• Consider the language C = 100110…, which has the property that its n-th bit is unequal to the n-th bit of the language described by Description #n.
• The first bit of C is 1 because the first bit of Description #1 is 0. The second bit of C is 0 because the second bit of Description #2 is 1. And so on.

45 Create a language
So C is created by walking the top-left to bottom-right diagonal of the language table and copying the opposites of the bits we meet.
C = 100110…

46 It’s a new language!
• The language C cannot be on the list!
– C cannot equal line 1 since its first bit differs from that line.
– C cannot equal line 2 since its second bit differs from that line.
– And so forth.
• So, C cannot be on the list.
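(Added sketch, not from the book.) Diagonalization itself is a one-liner on the finite top-left corner of the table:

```python
# Sketch: construct the first bits of the diagonal language C from the
# table of slide 42. Flipping the n-th bit of row n guarantees that C
# differs from every language on the list.
table = ["000000100", "110010001", "011011010",
         "110011010", "100000011", "111011011"]

C = "".join("1" if row[n] == "0" else "0" for n, row in enumerate(table))
print(C)   # 100110 -- differs from row n in bit n, so it is on no row
```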
47 Infinite number of new languages
• So in spite of the fact that we exhaustively listed all possible finite descriptions, we have created a language that has no description on the list.
• There are many more languages not on the list:
– Construct, for example, the language whose (n+5)-th bit differs from the (n+5)-th bit in Description #n. Again, it cannot be on the list, since for each Description #n the new language differs in the (n+5)-th bit. That means that bits 1…5 play no role, and can be chosen arbitrarily; this yields another 2⁵ = 32 languages that are not on the list.
C+5 = xxxxx1101…

48 Even more new languages
And there are many more languages not on the list:
– Construct, for example, the language whose 2n-th bit differs from the 2n-th bit in Description #n. Again, it cannot be on the list, since for each Description #n the new language differs in the 2n-th bit. That means that the odd bits play no role and can be chosen freely.
2C = x1x1x0x0…

49 Infinitely many languages cannot be described
• We can create an infinite number of languages, none of which allows a finite description.
• For every language that can be described there are infinitely many that cannot.

50 Many languages beyond our reach
We can only describe a tiny subset (not even a fraction) of all possible languages. There are infinitely many languages out there, forever beyond our reach.

51 Unequal infinities
Although there are infinitely many descriptions and infinitely many languages, these infinities are not equal to each other.
# of languages = ℵ₁
# of descriptions of languages = ℵ₀

52 Generating a set of objects
• A good way to generate a set of objects is to start with a small object and to give rules for how to add to it and generate new objects from it.
• Example: start with these objects: integers and addition. Then define rules for generating new objects from the primitive objects:
even → 2
even → even + even
“2 is an even number, and the sum of two even numbers is again an even number.”
This generates the set of all even numbers.

53 Generate the set of even numbers
Primitives: the integers; the addition operator.
Rules: 2 is an even number; the sum of two even numbers is an even number.
Result: the set of all even numbers.
Examples: 4 is an even number since it is the sum of 2 and 2, and 2 is an even number. 6 is an even number since it is the sum of 4 and 2. 8 is an even number since it is the sum of 6 and 2.

54 Generative rules
even → 2
even → even + even
We can use those rules to generate the set of even numbers:
even → even + even → 2 + 2 → 4
even → even + even → 4 + 2 → 6
→ means “may be replaced by”

55 Generate infinite set from a finite set of rules
even → 2
even → even + even
• With those two rules we are able to generate an infinite set.
• Note the generative character of the recipe (rules).

56 Terminal, non-terminal symbols
even → 2
even → even + even
“2” is called a terminal. “even” is a symbol that stands for a number. It is called a non-terminal (a singularly uninspired term).

57 Naming convention
Even → Even + Even
The part before the arrow is the left-hand side (LHS); the part after it is the right-hand side (RHS).
Since we have identified terminal symbols and non-terminal symbols as technical objects, we shall write them in Cambria Math font. We write terminals in lower case letters and start non-terminals with upper case letters. Non-terminals are called variables or syntactic categories in linguistic contexts.
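(Added sketch, not from the book.) The two rules of slides 52–55 can be executed directly; each round applies rule 2 to everything generated so far:

```python
# Sketch: running the generative rules
#   even -> 2
#   even -> even + even
def even_numbers(rounds):
    evens = {2}                                          # rule 1
    for _ in range(rounds):
        evens |= {a + b for a in evens for b in evens}   # rule 2
    return sorted(evens)

print(even_numbers(2))   # [2, 4, 6, 8]; every even number appears eventually
```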
58 Generate regular expressions
Primitives: characters; the symbols |, *, (, and ).
Rules:
1. Any character is a regular expression.
2. r1 | r2 is a regular expression, where r1 and r2 are regular expressions.
3. r1 r2 is a regular expression.
4. r* is a regular expression.
Result: regular expressions.
Examples:
Using rule 1 we generate this regular expression: a, which denotes this set: {a}
Using rules 3 and 1 we generate this regular expression: ab, which denotes this set: {ab}
Using rules 2, 3, and 1 we generate this regular expression: ab | b, which denotes this set: {ab, b}
Using rules 4 and 3 we generate this regular expression: a*b, which denotes this infinite set: {b, ab, aab, aaab, …}

59 Rules for regular expressions
Regex → a
Regex → b
…
Regex → z
Regex → Regex | Regex
Regex → Regex Regex
Regex → Regex*

60 Tom, Dick and Harry language
• Generate the set of all lists of names, each of the form: tom, dick and harry.
• All names but the last two are separated by commas.
• Duplicate names are okay, such as: tom, tom and dick.
• Although these are not complete sentences in normal English, we shall call them “sentences”, since that is what they are in our midget language.

61 Simple recipe for generating the language
1. tom is a Name, dick is a Name, harry is a Name
2. A Name is a Sentence
3. A Sentence followed by a “,” and a Name is again a Sentence
4. Before finishing, if the Sentence ends in “, Name”, replace it by “and Name”

62 Problem with the recipe
• Clause 4 has trouble:
4. Before finishing, if the Sentence ends in “, Name”, replace it by “and Name”.
• A Sentence does not really end in “, Name”; it ends in “, harry” or such, and “Name” is just a symbol that stands for a real name. Such symbols cannot occur in a real sentence and must in the end be replaced by a real name as given in clause 1:
1. tom is a Name, dick is a Name, harry is a Name.
• Likewise, the word “Sentence” in the recipe is a symbol that stands for an actual sentence.

63 Two kinds of symbols
• There are two kinds of symbols:
– Real symbols, which occur in finished sentences, like “tom”, a comma, and “and”
– Intermediate symbols, like “Sentence” and “Name”, that cannot occur in finished sentences
• The first kind are called terminal symbols (terminals for short).
• The second kind are called non-terminals, a singularly uninspired term. In linguistic contexts they are called variables or syntactic categories.

64 The recipe generates sentences
To stress the generative character of the recipe, we shall replace “X is Y” by “Y may be replaced by X”. Instead of saying: tom is a Name, we say: Name may be replaced by tom.

65 Revised recipe
1. Name may be replaced by tom
   Name may be replaced by dick
   Name may be replaced by harry
2. Sentence may be replaced by Name
3. Sentence may be replaced by Sentence, Name
4. “, Name” at the end of a Sentence must be replaced by “and Name” before Name is replaced by any of its replacements
5. A sentence is finished only when it no longer contains non-terminals
6. We start our replacement procedure with Sentence

66 Different types of clauses
1. Name may be replaced by tom
   Name may be replaced by dick
   Name may be replaced by harry
2. Sentence may be replaced by Name
3. Sentence may be replaced by Sentence, Name
4. “, Name” at the end of a Sentence must be replaced by “and Name” before Name is replaced by any of its replacements
5. A sentence is finished only when it no longer contains non-terminals
6. We start our replacement procedure with Sentence
Clauses 1 through 4 describe replacements, but 5 and 6 are different:
– Clause 5 is not specific to this grammar. It is valid generally, and it is one of the rules of the game.
– Clause 6 tells us where to start generating. This symbol is called the start symbol, and it is required for every grammar.
67 Conventions
• For brevity we write → instead of “may be replaced by”. Instead of writing: Name may be replaced by tom, we write: Name → tom
• The part before the → is called the left-hand side (LHS); the part after it is called the right-hand side (RHS).

68 Finite recipe for generating strings in the t, d & h language
1. Name → tom
   Name → dick
   Name → harry
2. Sentence → Name
   Sentence → List End
3. List → Name
   List → List, Name
4. , Name End → and Name
5. the start symbol is Sentence

69 Transformation of the form of the recipe rules
Our initial expression of each rule took this form: tom is a Name
To emphasize the generative nature of the rules, we revised the rules to this form: Name may be replaced by tom
For brevity we replaced “may be replaced by” with an arrow: Name → tom
That form is strong enough to serve as the basis for formal languages.

70 Generating a sentence from the recipe
Sentence
→ List End              (by 2. Sentence → List End)
→ List, Name End        (by 3. List → List, Name)
→ List, Name, Name End  (by 3. List → List, Name)
→ Name, Name, Name End  (by 3. List → Name)
→ tom, Name, Name End   (by 1. Name → tom)
→ tom, dick, Name End   (by 1. Name → dick)
→ tom, dick and Name    (by 4. , Name End → and Name)
→ tom, dick and harry   (by 1. Name → harry)

71 Form is the foundation for formal grammars
• This form: Name → tom (“Name may be replaced by tom”) is strong enough to serve as a basis for formal grammars.
• Similar forms, often called “rewriting systems”, have a long history among mathematicians, and were already in use several centuries B.C. in India.

72 Chomsky
The specific form shown in the t,d&h grammar of slide 68 was first studied extensively by Chomsky. His analysis has been the foundation for almost all research and progress in formal languages, parsers, and a considerable part of compiler construction and linguistics.

73 Formal languages
• Formal languages are a branch of mathematics.
• The mathematics of formal languages uses a special notation that has to be learned. It allows a very concise expression of what and how, but gives very little information on why. This tutorial gives the why.

74 Formal definition of a grammar
• A generative grammar G is an ordered four-tuple (VN, VT, S, F) where VN and VT are finite alphabets with VN ∩ VT = ∅, S is a distinguished symbol of VN, and F is a finite set of ordered pairs (P, Q) such that P and Q are in (VN ∪ VT)* and P contains at least one symbol from VN.
• The symbols of VN are called non-terminal symbols or variables and will usually be denoted by capital letters.
• The symbols of VT are called terminal symbols and will usually be denoted by small letters.
• The sets VN and VT are disjoint in every grammar.
• The non-terminal symbol S is called the initial symbol and is used to start the derivations of the sentences of the language.
• The ordered pairs in F are called rewriting rules or productions and will be written in the form P → Q, where the symbol → is, of course, not in VN ∪ VT.
• Productions are used to derive new sentences from given ones by replacing a part equal to the left-hand side of a rule by the right-hand side of the same rule.
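(Added sketch, not from the book; symbols are modeled as Python strings and sentential forms as tuples.) The last bullet of slide 74 translates directly into code: a production step finds an occurrence of a left-hand side and replaces it by the corresponding right-hand side. Running the rule choices of slide 70 reproduces that derivation:

```python
# Sketch: one production step for a phrase structure grammar --
# replace an occurrence of a rule's LHS by its RHS.
RULES = [
    (("Sentence",), ("List", "End")),          # 0
    (("List",), ("List", ",", "Name")),        # 1
    (("List",), ("Name",)),                    # 2
    ((",", "Name", "End"), ("and", "Name")),   # 3
    (("Name",), ("tom",)),                     # 4
    (("Name",), ("dick",)),                    # 5
    (("Name",), ("harry",)),                   # 6
]

def step(form, rule):
    lhs, rhs = RULES[rule]
    for i in range(len(form) - len(lhs) + 1):
        if form[i:i + len(lhs)] == lhs:        # the LHS occurs here
            return form[:i] + rhs + form[i + len(lhs):]
    raise ValueError("rule not applicable")

form = ("Sentence",)
for r in [0, 1, 1, 2, 4, 5, 3, 6]:             # the derivation of slide 70
    form = step(form, r)
    print(" ".join(form))                      # ends: tom , dick and harry
```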
75 LHS must contain a non-terminal
In the definition of slide 74, F is a finite set of ordered pairs (P, Q) such that P and Q are in (VN ∪ VT)* and P contains at least one symbol from VN.
That is, in every production P → Q the left-hand side P must contain a non-terminal.

76 Phrase structure grammars
1. Name → tom
   Name → dick
   Name → harry
2. Sentence → Name
   Sentence → List End
3. List → Name
   List → List, Name
4. , Name End → and Name
5. the start symbol is Sentence
• The grammar above is in the form of what is known as a phrase structure grammar for the t,d&h language (often abbreviated to PS grammar).
• PS grammars have no restriction on the right-hand side of production rules, and on the left-hand side only the restriction that it contain at least one non-terminal.
• PS grammars are called Type 0 grammars.

77 The alternative ( | ) symbol
The grammar above can be written more compactly as:
1. Name → tom | dick | harry
2. Sentence → Name | List End
3. List → Name | List, Name
4. , Name End → and Name
5. the start symbol is Sentence
Several right-hand sides with the same left-hand side are grouped together and separated by vertical bars, |. This bar symbol belongs to the formalism, just as the arrow →, and can be read “or else”. The right-hand sides separated by vertical bars are also called alternatives.

78 Sentential forms
• In the process of generating a sentence from a grammar, a series of intermediate forms are produced, ultimately leading to the sentence.
• Each intermediate form is called a sentential form.
• The sentential forms are all the forms that occur from start symbol to final sentence.

79 Sentential Forms
Sentence → List End → List, Name End → List, Name, Name End → Name, Name, Name End → tom, Name, Name End → tom, dick, Name End → tom, dick and Name → tom, dick and harry
All the forms occurring in this derivation are sentential forms.

80 Terminology
• If a sentential form contains no non-terminals, it is called a sentence and belongs to the generated language.
• The transitions (separated by arrows →) are called production steps.
• The grammar rules are called production rules.

81 Example of terminology usage
We have seen that the sentential forms occurring in the production process for a finite-state grammar all contain only one non-terminal, except the last.

82 Terminal productions
The strings that are generated from the start symbol are called the terminal productions.

83 Production graph
Sentence → List End → List, Name End → List, Name, Name End → Name, Name, Name End → tom, Name, Name End → tom, dick, Name End → tom, dick and Name → tom, dick and harry
[diagram: the production graph of this derivation, with Sentence at the top fanning out through List, Name and End nodes down to the terminals tom, the commas, dick, and, and harry]

84 Graphs
• The production process can be made more visual by drawing connective lines between corresponding symbols, using a graph.
• A graph is a set of nodes connected by a set of edges.
• If the edges are arrows, the graph is a directed graph; if they are lines, the graph is undirected.
– Almost all graphs used in parsing techniques are directed.

85 Production graph
A graph corresponding to a production process is called a production graph or syntactic graph and depicts the syntactic structure (with regard to the given grammar) of the final sentence.
86 Production graph
[diagram: the production graph normally fans out and downwards; rewriting a group of symbols (as rule 4 does) results in a starlike construction where edges come together]

87 Graph, not tree
[diagram: the same production graph] Trees don’t fan out and then come back together. So this is a production graph, not a production tree.

88 Production graphs are acyclic
• A cycle in a graph is a path from a node N, following the arrows, leading back to N.
• A production graph cannot contain cycles. Here’s why: to get a cycle we would need a non-terminal node N in the production graph that has produced children that are directly or indirectly N again. But since the production process always makes new copies for the nodes it produces, it cannot produce an already existing node.
• So production graphs are always acyclic. Directed acyclic graphs are called dags.

89 Exercise
• Draw the production graph for this grammar:
A → B
B → C
C → A
• Assume A is the start symbol.

90 Ha! It’s a trick question
• You don’t draw a production graph for a grammar. You draw it for the process taken to generate a sentence.
• The grammar rules on the previous slide have no terminals. It loops. So it cannot generate a sentence.

91 Only legal sentences
1. Name → tom | dick | harry
2. Sentence → Name | List End
3. List → Name | List, Name
4. , Name End → and Name
5. the start symbol is Sentence
• It is impossible to generate: tom, dick, harry (note the comma before harry)
• If a sentence has more than one name, this rule must be used: Sentence → List End
• The only way to remove End is with this rule: , Name End → and Name

92 Implementing “must replace”
• Recall our recipe: “, Name” at the end of a Sentence must be replaced by “and Name” before Name is replaced by any of its replacements.
• Our formalism uses the arrow →, which means may replace.
• Amazingly, we have succeeded in implementing the notion must replace in a system that only uses may replace.
• We accomplished this by splitting must replace into may replace (List End may replace Sentence) and the requirement that a finished sentence must not contain a non-terminal (so End must eventually be removed, and rule 4 is the only rule that removes it).

93 Grammar produces many sentences
1. Name → tom | dick | harry
2. Sentence → Name | List End
3. List → Name | List, Name
4. , Name End → and Name
5. the start symbol is Sentence
The grammar produces many sentences: tom, dick and harry; harry and tom; harry; tom, tom, tom and tom; and an infinity of others.

94 Blind alley
Sentence
→ List End
→ List, Name End
→ List, Name, Name End
→ Name, Name, Name End
→ tom, Name, Name End
→ tom, dick, Name End
→ tom, dick, harry End
There is no rule for just the End non-terminal, so we can proceed no further with this sentential form. With the path we have taken, we have arrived at a blind alley.
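(Added sketch, not from the book.) Slides 93 and 94 can be demonstrated mechanically: a breadth-first search over sentential forms churns out the sentences of the language one by one, while blind alleys such as “tom, dick, harry End” simply die out:

```python
# Sketch: breadth-first generation of t,d&h sentences.
from collections import deque

RULES = [(("Sentence",), ("Name",)), (("Sentence",), ("List", "End")),
         (("List",), ("Name",)), (("List",), ("List", ",", "Name")),
         ((",", "Name", "End"), ("and", "Name")),
         (("Name",), ("tom",)), (("Name",), ("dick",)), (("Name",), ("harry",))]
NON_TERMINALS = {"Sentence", "List", "End", "Name"}

def successors(form):
    for lhs, rhs in RULES:
        for i in range(len(form) - len(lhs) + 1):
            if form[i:i + len(lhs)] == lhs:
                yield form[:i] + rhs + form[i + len(lhs):]

queue, produced = deque([("Sentence",)]), []
while queue and len(produced) < 8:
    form = queue.popleft()
    if NON_TERMINALS.isdisjoint(form):      # a finished sentence
        produced.append(" ".join(form))
    else:
        queue.extend(successors(form))      # blind alleys yield no successors
print(produced)   # ['tom', 'dick', 'harry', 'tom and tom', ...]
```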
95 Frugal framework
• The main properties of a formal grammar are:
– it has production rules, which may be used for rewriting part of the sentential form
– it has a start symbol, which is the mother of all sentential forms
• In the production rules we find non-terminals and terminals; finished sentences contain terminals only.
• That’s it! It’s a frugal framework.

96 The expressive power of this frugal framework
• Formal grammars are a framework of impressive frugality for generating sets.
• Question: Is it sufficient for generating sets?
• Answer: We do not have anything more expressive. All other methods known to mankind for generating sets have been proved to be equivalent to, or less powerful than, a phrase structure grammar.

97 Computer programs generate sets
• A program reads some data and outputs a result. That result is called a sentence in formal language theory.
• Given another input, the program generates another sentence.
• And so on.
• So a program generates a set of sentences (a language).

98 Are programs more expressive than phrase structure grammars?
It has been proved that any set (language) that can be generated by a program can be generated by a phrase structure grammar.

99 A stronger method might exist
• There is no proof that a stronger method cannot exist.
• But in view of the fact that many quite different methods all turn out to halt (in expressivity) at the same barrier, it is highly unlikely that a stronger method will ever be found.

100 Illustration of the expressive power of grammars
Below is a grammar for the movements of a Manhattan turtle. A Manhattan turtle moves in a plane and can only move north, east, south or west, in distances of one block. The grammar below produces all paths that return to their own starting point.
Move → north Move south | east Move west | ε
north east → east north
north south → south north
north west → west north
east north → north east
east south → south east
east west → west east
south north → north south
south east → east south
south west → west south
west north → north west
west east → east west
west south → south west

101 Production graph for the round trip: north east south west
[diagram: Move produces north Move south; the inner Move produces east Move west; the innermost Move dies out via the empty alternative in rule 1 (the ε); finally west south is rewritten to south west]
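(Added sketch, not from the book.) The semantic property that the turtle grammar captures is easy to state procedurally: a move sequence belongs to the language exactly when its displacements cancel out:

```python
# Sketch: a path is in the turtle language iff it returns to its start.
STEP = {"north": (0, 1), "south": (0, -1), "east": (1, 0), "west": (-1, 0)}

def returns_to_start(path):
    dx = sum(STEP[m][0] for m in path)
    dy = sum(STEP[m][1] for m in path)
    return (dx, dy) == (0, 0)

print(returns_to_start("north east south west".split()))   # True
print(returns_to_start("north north east south".split()))  # False
```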
102 The grammar for a set
• There can be infinitely many grammars for a set.
• By the grammar for a set we mean any grammar that does the job (generates the desired set) and is not obviously overly complicated.

103 Easy grammars and hard grammars
• Some grammars are easy to understand.
• Some simple grammars generate very complicated sets.
• The grammar for any given set is, however, usually far from simple.
• Theory says that if a set can be generated at all (for example, by a program), it can be generated by a phrase structure grammar.
• But theory does not say that it will be easy to do so, or that the grammar will be understandable.

104 Unmanageability of phrase structure grammars
Apart from the intellectual problems phrase structure grammars pose, they also exhibit fundamental and practical problems:
– No general parsing algorithm for them can exist.
– All known special parsing algorithms are either very inefficient or very complex.

105 Chomsky hierarchy
• The desire to restrict the unmanageability of phrase structure grammars, while keeping as much of their generative powers as possible, has led to the Chomsky hierarchy of grammars.
• The hierarchy has four types of grammars, numbered 0 to 3.
• It is useful to include a fifth type, called Type 4.

106 Chomsky hierarchy
Type 0 (phrase structure grammars)
⊃ Type 1 (context-sensitive grammars)
⊃ Type 2 (context-free grammars)
⊃ Type 3 (regular grammars)
⊃ Type 4 (finite-choice grammars)

107 Increasingly restricted grammars
• Type 0 grammars are the (unrestricted) phrase structure grammars.
• The other types originate from applying more and more restrictions to the allowed form of the rules of the grammar.
• Each of these restrictions has far-reaching consequences; the resulting grammars are gradually easier to understand and manipulate, but are also gradually less powerful.
• Fortunately, these less powerful types are still very useful, actually more useful even than Type 0.

108 Type 0: arbitrary number of symbols on LHS and RHS
The characteristic property of a Type 0 grammar is that it may contain rules that transform an arbitrary (non-zero) number of symbols into an arbitrary (possibly zero) number of symbols.
Example: , N E → and N
in which 3 symbols are replaced by 2.

109 Type 1 grammars
• Type 1 grammars restrict the freedom of the Type 0 grammars.
• There are two completely different definitions of Type 1 grammars, which can be easily proved to be equivalent:
– Type 1 monotonic
– Type 1 context-sensitive

110 Type 1 monotonic grammars
• A grammar is Type 1 monotonic if every rule has the same number of symbols or more on the right-hand side than on the left-hand side (the tree expands, it doesn’t contract).
• This forbids, for example, the rule , N E → and N: the LHS has 3 symbols, the RHS has 2 symbols, so the tree is contracting.

111 Type 1 context-sensitive grammars
• A grammar is Type 1 context-sensitive if all of its rules are context-sensitive.
• A rule is context-sensitive if only one non-terminal symbol in its left-hand side gets replaced by other symbols, while we find the others back, undamaged and in the same order, in the right-hand side.
Example: Name Comma Name End → Name and Name End
“The rule Comma → and may be applied if the left context is Name and the right context is Name End.”
The replacement must be at least one symbol long; thus context-sensitive grammars are always monotonic.

112 Example of a CS grammar rule
• Question: Is this a context-sensitive grammar rule: S1 a → S1 S a
• Answer: Yes, because S1 is replaced by S1 S.
• “The replacement must be at least one symbol long.” Thus, the replacement may be more than one symbol. In this case the replacement is two symbols.
Note: It is not the case that a is replaced by S a. Why? Because only a non-terminal may be replaced.

113 Key to writing monotonic grammars
• In writing monotonic grammars one has to be careful never to produce more symbols than will eventually appear in the finished sentence: a monotonic grammar can never shrink the sentential form again.
• This rule produces a symbol that must eventually be deleted: Sentence → List End
The End symbol does not produce anything and has to be deleted: , Name End → and Name
But that is not monotonic.

114 End symbol is deleted
[diagram: in the production graph, the three symbols “, Name End” are rewritten to the two symbols “and Name”, deleting End]

115 Monotonic grammar for the t,d&h language
We avoid the need to delete the End marker by incorporating End into the rightmost name:
Name → tom | dick | harry
Sentence → Name | List
List → EndName | Name , List
, EndName → and Name

116 Context-sensitive grammar for the t,d&h language
Name → tom | dick | harry
Sentence → Name | List
List → EndName | Name Comma List
Comma EndName → and EndName   (context is … EndName)
and EndName → and Name        (context is and …)
We had to introduce a new non-terminal, Comma. Here’s why. Notice that this isn’t correct:
Name → tom | dick | harry
Sentence → Name | List
List → EndName | Name , List
, EndName → and EndName
and EndName → and Name
The , is a terminal symbol. But in CS grammars only non-terminal symbols are replaced: “one non-terminal symbol in its left-hand side gets replaced by other symbols”.
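(Added sketch, not from the book.) The monotonic property of slide 110 is a purely mechanical check on the rules:

```python
# Sketch: a grammar is monotonic iff no rule has more symbols on the
# left than on the right.
def is_monotonic(rules):
    return all(len(lhs) <= len(rhs) for lhs, rhs in rules)

type0_rule = [((",", "Name", "End"), ("and", "Name"))]   # 3 -> 2 symbols
type1_rule = [((",", "EndName"), ("and", "Name"))]       # 2 -> 2 symbols
print(is_monotonic(type0_rule), is_monotonic(type1_rule))   # False True
```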
117 MT = CS and less powerful than PS
• Monotonic and context-sensitive grammars are equally powerful: for each language that can be generated by a monotonic grammar, a context-sensitive grammar exists that generates the same language, and vice versa.
• They are less powerful than the Type 0 grammars. There are languages that can be generated by Type 0 grammars that cannot be generated by any Type 1 (or Type 2, 3, 4) grammar.

118 No simple Type 0 grammars
Strangely enough, no simple examples of Type 0 languages are known; only their existence can be proved.

119 [diagram: the nested Chomsky hierarchy, Type 0 ⊃ Type 1 ⊃ Type 2 ⊃ Type 3 ⊃ Type 4, with an annotation on the region between Type 0 and Type 1: “It can be proven that there are languages in here, but we cannot create grammars that generate them.” Marginal note from correspondence: “Roger, not sure this is true”]

120 Contradiction?
• Two slides back it says: Strangely enough, no simple examples of Type 0 languages are known; only their existence can be proved.
• But hold on! Didn’t we already see a couple of Type 0 grammars? Here’s one:
1. Name → tom | dick | harry
2. Sentence → Name | List End
3. List → Name | List, Name
4. , Name End → and Name
5. the start symbol is Sentence
This is a PS grammar!

121 Key concept
The above grammar is in the form of a phrase structure grammar, but the language (set) it generates can be generated by less powerful grammars.

122 Type of a grammar is its smallest class
• Any Type 1 grammar is also a Type 0 grammar, since the class of Type 1 grammars is obtained from the class of Type 0 grammars by applying restrictions.
• But it would be confusing to call a Type 1 grammar a Type 0 grammar; it would be like calling a cat a mammal: correct but imprecise.
• A grammar is named after the smallest class (that is, the highest class number) in which it will still fit.

123 Using a Type 0 grammar for the t,d&h language was overkill
• We saw a Type 0 grammar that generates the t,d&h language:
Name → tom | dick | harry
Sentence → Name | List End
List → Name | List, Name
, Name End → and Name
• We saw two different Type 1 grammars that generate the t,d&h language; here’s one of them:
Name → tom | dick | harry
Sentence → Name | List
List → EndName | Name Comma List
Comma EndName → and EndName
and EndName → and Name

124 Type n language for a Type n grammar
• A Type n language can be generated by a Type n grammar or anything stronger, but not by a weaker Type n+1 grammar.
• If a language is generated by a Type n grammar, that does not necessarily mean that there is no (weaker) Type n+1 grammar for it.
– The t,d&h language can be generated by a Type 0 grammar, but it can also be generated by Type 1, 2, and 3 grammars.

125 Constructing a Type 1 grammar
• The standard example of a Type 1 language is the set of strings that consist of an equal number of as, bs, and cs, in that order:
aa…a bb…b cc…c   (n of each)
• We shall derive a grammar for this toy language.
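(Added sketch, not from the book.) As a set, this language is trivial to test for membership; the interesting part, which the next slides tackle, is generating it with a grammar:

```python
# Sketch: membership test for the language a^n b^n c^n (n >= 1).
def in_language(s):
    n = s.count("a")
    return n >= 1 and s == "a" * n + "b" * n + "c" * n

print(in_language("aabbcc"), in_language("aabbc"))   # True False
```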
126 Constructing a grammar for aⁿbⁿcⁿ
• Starting with the simplest case, we have the rule:
0. S → abc
• Having obtained one instance of S, we may want to prepend more as to the beginning. If we want to remember how many there were, we shall have to append something to the end as well, and it cannot be a b or c. We shall use a yet unknown symbol Q. The following rule both prepends and appends:
1. S → aSQ

127 Continued
• Now, to get aabbcc from this, each Q must be worth one b and one c, but we cannot just write: Q → bc, because that would allow bs after the first c.
• The above rule would, however, be all right if we were allowed to do the replacement only between a b on the left and a c on the right. There the newly inserted bc will do no harm:
2. bQc → bbcc
• Still, we cannot apply this rule, since normally the Qs are to the right of the c. This can be remedied by allowing a Q to hop left over a c:
3. cQ → Qc

128 Grammar for aⁿbⁿcⁿ
1. S → abc | aSQ
2. bQc → bbcc
3. cQ → Qc
Derivation of a³b³c³:
S          (start)
aSQ        (rule 1)
aaSQQ      (rule 1)
aaabcQQ    (rule 1)
aaabQcQ    (rule 3)
aaabbccQ   (rule 2)
aaabbcQc   (rule 3)
aaabbQcc   (rule 3)
aaabbbccc  (rule 2)

129 Derivation graph for a²b²c²
[diagram: the derivation graph — S produces a S Q; the inner S produces a b c; the Q hops left over the c (rule 3); then b Q c is rewritten to b b c c (rule 2)]

130 Starlike forms
[diagram: the same derivation graph, with the starlike forms highlighted — the places where a group of symbols (cQ, bQc) is rewritten as a group]

131 Monotonic or CS?
• Is the following grammar monotonic or context-sensitive?
1. S → abc | aSQ
2. bQc → bbcc
3. cQ → Qc
• Answer: it is monotonic. The last rule is not context-sensitive, since it does not conform to: only one non-terminal symbol in its left-hand side gets replaced by other symbols, while we find the others back, undamaged and in the same order, in the right-hand side.

132 The aⁿbⁿcⁿ language is Type 1
• It can be proved (using the pumping lemma for context-free grammars) that there is no Type 2 grammar for the aⁿbⁿcⁿ language.
• We have created a Type 1 grammar for it.
• Therefore it is of Type 1.

133 CS grammars, MT grammars
• Although only context-sensitive Type 1 grammars can by rights be called context-sensitive grammars (CS grammars), that name is used even if the grammar is actually monotonic Type 1.
• There are no standard initials for monotonic, but MT will do.

134 CF grammars
• Type 2 grammars are called context-free grammars (CF grammars).
• A CF grammar may contain only rules that have a single non-terminal on their left-hand side.

135 CS versus CF grammars
Whereas context-sensitive grammars have rules in which a non-terminal symbol changes within a context, the rules in context-free grammars are independent of context (the left and right contexts are absent/empty).
– The rules in context-free grammars have a single non-terminal on their left-hand side.

136 Independent production property
• Since there is always only one symbol on the left-hand side, each node in a production graph has the property that whatever it produces is independent of what its neighbors produce: the productive life of a non-terminal is independent of its context.
• Starlike forms cannot occur. Each node fans out; no nodes come together.
• Consequently the production graph has a pure tree form and is called a production tree.

137 Feature of XML that breaks production independence
In XML, if an element has an IDREF attribute, then it is dependent on what its neighbors produce.
<Book footnote_ref="RB">
  <Title>Illusions</Title>
</Book>
<Footnote id="RB">
  <Author>Richard Bach</Author>
</Footnote>
The Book element must reside in a context where there is an element with an ID attribute whose value matches the value of @footnote_ref. Consider an XML Schema that declares the Book element to have an IDREF footnote_ref attribute. That XML Schema is a context-free grammar. But the sentences that are generated (i.e., the XML instances) have additional semantics that imposes a context-sensitivity on the Book element.
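(Added sketch, not from the book, using Python’s standard xml.etree.ElementTree.) The context-sensitive part of slide 137 — every IDREF must resolve to some ID — is exactly the kind of check a validator performs on top of the context-free schema:

```python
# Sketch: the ID/IDREF check of slide 137, a layer of
# context-sensitivity on top of the context-free schema.
import xml.etree.ElementTree as ET

doc = ET.fromstring("""
<doc>
  <Book footnote_ref="RB"><Title>Illusions</Title></Book>
  <Footnote id="RB"><Author>Richard Bach</Author></Footnote>
</doc>""")

ids  = {e.get("id")           for e in doc.iter() if e.get("id")}
refs = {e.get("footnote_ref") for e in doc.iter() if e.get("footnote_ref")}
print(refs <= ids)   # True: every footnote_ref resolves to an id
```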
138 Sentences in formal languages have no semantics
• The sentences that you generate from a formal grammar have no semantics.
• The symbols in a sentence have no semantics.
• But in an XML instance document that conforms to an XML Schema, the symbols do have semantics.
– Example: an element with an attribute of type IDREF must reference an ID value. That element/attribute must reside in a context in which there is a matching ID value.
So while the XML Schema is simply a context-free grammar, the semantics of the data types imposes an additional layer of constraint on the XML instance. This is a key concept.

139 CF grammar for the t,d&h language
1. Name → tom | dick | harry
2. Sentence → Name | List and Name
3. List → Name , List | Name

140 A production tree for the CF t,d&h grammar
[diagram: the production tree for tom, dick and harry — Sentence at the top, List expanding through Name , List down to the terminals tom, dick, and, harry]
Notice that it is a tree: all nodes fan out; there are no starlike forms.

141 A rule “defines” a non-terminal
1. Name → tom | dick | harry
2. Sentence → Name | List and Name
3. List → Name , List | Name
• All right-hand sides for a non-terminal are collected in one grammar rule.
• Each grammar rule reads like a definition of the left-hand side:
– A Sentence is either a Name, or a List followed by and followed by a Name.
– A List is either a Name followed by a , followed by a List, or it is a Name.

142 Sentences are produced using two processes
Context-free grammars produce sentences by two processes:
– concatenation (“… followed by …”)
– choice (“either … or …”)

143 Identification mechanism
In addition to the concatenation and choice processes there is an identification mechanism, which links the name of a non-terminal used in a right-hand side to its defining rule (“… is a …”).
– Example: the List used in the right-hand side of rule 2 is identified with the List defined in rule 3 (“… is a List”).

144 Each non-terminal generates a language
• Earlier we identified a “language” as a set of strings: the set of terminal productions of the start symbol.
• The independent production property allows us to extend this definition to any non-terminal in the grammar: each non-terminal produces a set – a language – independent of the other non-terminals.
• If we write the set of strings produced by A as L(A), and A has a production rule with, say, two alternatives, A → α | β, then L(A) = L(α) ∪ L(β), where “∪” is the union operator on sets.
• If α then consists of, say, three members PqR, we have L(α) = L(P) ○ L(q) ○ L(R), where “○” is the concatenation operator on the strings in the sets.

145 Nullable/Empty
• A non-terminal whose language contains ε is called nullable.
• One also says that it produces empty.

146 In Type 0 or Type 1 grammars only the start symbol defines a language
• Recall the Type 1 grammar for aⁿbⁿcⁿ:
1. S → abc | aSQ
2. bQc → bbcc
3. cQ → Qc
• We cannot define a language L(Q), since Q does not produce anything meaningful by itself.

147 Only for Type 2 and lower
• Defining a language for a non-start symbol is possible only for Type 2 grammars and lower.
• Defining a non-start non-terminal as nullable is only possible for Type 2 grammars and lower.

148 Recursive non-terminals
A non-terminal A is recursive if an A in a sentential form can produce something that again contains an A.
Example: A → aA — the A is directly recursive.
Here is an example of indirect recursion:
A → aB
B → bA
A produces aB and B produces bA, which takes us back to the production for A.
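(Added sketch, not from the book.) For a CF grammar, recursion can be detected mechanically: compute, for each non-terminal, which non-terminals can appear in a derivation from it, and look for one that can reproduce itself:

```python
# Sketch: find the recursive non-terminals of a CF grammar via the
# transitive closure of the "can derive a form containing" relation.
GRAMMAR = {"A": [["a", "B"]], "B": [["b", "A"]]}   # the indirect example

def recursive_nonterminals(g):
    reach = {a: {s for alt in alts for s in alt if s in g}
             for a, alts in g.items()}
    changed = True
    while changed:                                 # transitive closure
        changed = False
        for a in g:
            for b in list(reach[a]):
                if not reach[b] <= reach[a]:
                    reach[a] |= reach[b]
                    changed = True
    return {a for a in g if a in reach[a]}

print(recursive_nonterminals(GRAMMAR))   # {'A', 'B'}
```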
149 Right recursion
A non-terminal A is right-recursive if it can produce something that has an A at the right end.
Example: A → abcA

150 Left recursion
A non-terminal A is left-recursive if it can produce something that has an A at the left end.
Example: A → Aabc

151 Self-embedding
A non-terminal A is self-embedding if there is a derivation in which A produces A with something, say α, before it and something, say β, after it.
Example: A → αAβ

152 Nesting
• Self-embedding describes nesting: α is the part produced when entering another level of nesting, β is the part produced when leaving that level.
• The best-known example of nesting is the use of parentheses in arithmetic expressions:
Arith_expression → … | Simple_expression
Simple_expression → Number | ‘(’ Arith_expression ‘)’

153 Both left- and right-recursive
A non-terminal can be left-recursive and right-recursive at the same time; it is then self-embedding.
Example: A → Ab | cA | d
A → Ab → cAb → ccAb → ccdb

154 Recursion is essential for infinite languages
• If no non-terminal in a grammar is recursive, each production step uses up one non-terminal, since that non-terminal will never occur again.
• So the production process cannot continue unlimitedly, and a finite language results.
• Recursion is essential for life in grammars.

155 Can create infinite languages using the repetition operator
• Thus far we have not used repetition operators in grammars.
• Later we will extend the grammar syntax to provide repetition operators.
• With the extended syntax we can create infinite languages without using recursion.

156 Advantage/disadvantage of CF grammars
• In the actual world, many things are defined in terms of other things.
• The advantage of CF grammars is that they are a very concise way to formulate such interrelationships.
• The disadvantage of CF grammars is that they can generate a lot of good-looking nonsense.

157 XML
• XML is a text-markup system.
• Markup is used to express and control the basic structure.
• An XML instance document is a parse tree!
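(Added sketch, not from the book.) The self-embedding rule of slide 152 corresponds directly to a recursive procedure: entering a nesting level consumes the α part, leaving it consumes the β part:

```python
# Sketch: recognizing the nesting of slide 152 -- a Number wrapped in
# zero or more balanced pairs of parentheses.
def simple_expression(s):
    if s.startswith("(") and s.endswith(")"):
        return simple_expression(s[1:-1])   # enter one nesting level
    return s.isdigit()                      # the Number at the bottom

print(simple_expression("((42))"), simple_expression("(42"))   # True False
```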
158 CF grammar for English
If we ignore enough detail we can recognize an underlying context-free structure in the sentences of a natural language, for example, English:
Sentence → Subject Verb Object
Subject → NounPhrase
Object → NounPhrase
NounPhrase → the QualifiedNoun
QualifiedNoun → Noun | Adjective QualifiedNoun
Noun → castle | caterpillar | cats
Adjective → well-read | white | wistful | …
Verb → admires | bark | criticize | …

159 … which produces sentences like:
the well-read cats criticize the wistful caterpillar
(produced with the grammar of the previous slide)

160 Since no context is incorporated, it will equally well produce this good-looking nonsense:
the cats admires the white well-read castle
(again with the grammar of slide 158)

161 For keeping context we could use a phrase structure grammar:
Sentence → Noun Number Verb
Number → Singular | Plural
Noun Singular → castle Singular | caterpillar Singular | …
Singular Verb → Singular admires | …
Singular → ε
Noun Plural → cats Plural
Plural Verb → Plural bark | Plural criticize | …
Plural → ε
The markers Singular and Plural control the production of English words. Still, this grammar allows the cats to bark … For a better way to handle context, see various sections in Chapter 15, especially Van Wijngaarden grammars (Section 15.2) and attribute and affix grammars (Section 15.3).

162 Programming languages are defined using CF grammars
• The bulk of examples of CF grammars originate from programming languages.
• Sentences in these languages (that is, programs) have to be processed automatically (by a compiler), and it was recognized early (around 1958) that this is much easier if the language has a well-defined formal grammar.
• The syntaxes of all programming languages in use today are defined through formal grammars.

163 XML
• XML Schemas specify XML languages using CF grammars.
• Sentences in XML languages (that is, XML instances) have to be validated automatically (by a validator), and it was recognized early (around 1999) that this is much easier if the XML language has a well-defined formal grammar.
• The syntaxes of nearly all XML languages in use today are defined through XML Schemas.

164 ε-rules, ε-free
• A grammar rule that has an empty right-hand side: A → ε is called an ε-rule. Read that rule as: A may be replaced by the empty string (which we denote by ε).
– An empty string is a string of length zero; it contains no characters from the alphabet Σ.
• A grammar that contains no such rules is called ε-free.

165 Non-monotonic CF grammar
The only way a CF rule can be non-monotonic is by having an ε-rule. A grammar containing this rule would not be monotonic: A → ε

166 Require monotonicity
• Some authors (for example, Chomsky) and some parsing algorithms require a CF grammar to be monotonic.
• This means that it must be ε-free.

167 Making a grammar ε-free
• Almost any CF grammar can be made ε-free by systematic substitution of the ε-rules.
• The exception is a grammar in which the start symbol produces ε.
• The transformation of a CF grammar that contains ε-rules into an ε-free grammar is explained in Section 4.2.3.1.
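(Added sketch, not from the book.) The first step of that transformation is to find the nullable non-terminals, those whose language contains ε; on the question-system grammar of the next slides it looks like this:

```python
# Sketch: computing the nullable non-terminals, the first step of the
# epsilon-removal transformation that Section 4.2.3.1 describes.
G = {
    "input": [["zero-or-more-bits-of-info", "question"]],
    "zero-or-more-bits-of-info":
        [["bit-of-info", "zero-or-more-bits-of-info"], []],  # [] is epsilon
}

def nullables(g):
    null, changed = set(), True
    while changed:
        changed = False
        for a, alts in g.items():
            if a not in null and any(all(s in null for s in alt)
                                     for alt in alts):
                null.add(a)
                changed = True
    return null

print(nullables(G))   # {'zero-or-more-bits-of-info'}
```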
Advantage of ε-free CF grammars
The proofs and parsers are less complicated, sometimes much less complicated. 169

Disadvantage of ε-free CF grammars
• The disadvantage of transforming a CF grammar to remove ε-rules is that the resulting grammar will almost always be more complicated.
• Example: Suppose we have a system that can be fed bits of information like: “Amsterdam is the capital of the Netherlands”, “Truffles are expensive”, and can then be asked a question. (continued) 170

Disadvantage of ε-free CF grammars
• On a superficial level we can define its input as:
input → zero-or-more-bits-of-info question
zero-or-more-bits-of-info → bit-of-info zero-or-more-bits-of-info | ε
• This definition of input neatly fits the user’s view of the problem.
• Here is an ε-free grammar for the input:
input → question-preceded-by-info
question-preceded-by-info → question | bit-of-info question-preceded-by-info
• This second definition does not fit the user’s view of the problem.
• As a grammar becomes more and more complicated, the requirement that it be ε-free becomes more and more of a nuisance: the grammar is working against us, not for us. 171

No problem theoretically
• Requiring grammars to be ε-free presents no problem from a theoretical point of view: any CF language can be described by an ε-free CF grammar, and ε-rules are never needed.
• Better still, any grammar with ε-rules can be mechanically transformed into an ε-free grammar for the same language.
• But the price we pay is that of any grammar transformation: it is no longer our grammar and it does not reflect the original structure as well. 172

The ε-rule is a useful tool
The bottom line is that the practitioner finds the ε-rule to be a useful tool. 173

ε-rules make parsing problematic
• Many parsing methods will in principle work for ε-free grammars only: if something does not produce anything, you can’t very well see if it’s there.
• Often the parsing method can be doctored to handle ε-rules, but that invariably increases the complexity of the method.
• If ε-rules did not exist, the topic of parsing would be 30% smaller – but then grammars would lose much more than 30% of their usefulness. 174

Advantage of using ε-rules
The advantage is that ε-rules are very convenient for the grammar writer and user. 175

Notational style: BNF
• There are several different styles of notation for CF grammars of programming languages.
• They are all functionally equivalent.
• The first is the Backus-Naur Form (BNF), which was first used to define ALGOL 60. Here is a sample:
<name> ::= tom | dick | harry
<sentence> ::= <name> | <list> and <name>
<list> ::= <name>, <list> | <name>
• Angle brackets are used to enclose non-terminals and ::= is used for “may produce”. 176
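Sketch: producing sentences from the BNF grammar
As a quick illustration of a grammar as a generating device, here is a Python sketch (not from the book) that produces random sentences from the BNF sample above:

import random

GRAMMAR = {
    "<name>":     [["tom"], ["dick"], ["harry"]],
    "<sentence>": [["<name>"], ["<list>", "and", "<name>"]],
    "<list>":     [["<name>", ",", "<list>"], ["<name>"]],
}

def produce(symbol):
    if symbol not in GRAMMAR:                     # a terminal: produce it
        return [symbol]
    alternative = random.choice(GRAMMAR[symbol])  # "may produce": pick one
    return [t for sym in alternative for t in produce(sym)]

print(" ".join(produce("<sentence>")))
# e.g.: tom , dick and harry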
Notational style: van Wijngaarden
• The second notational style is that of van Wijngaarden. Here is a sample:
name: tom symbol; dick symbol; harry symbol.
sentence: name, list, and symbol, name.
list: name, comma symbol, list; name.
• The names of symbols end in …symbol; their representations are hardware-dependent and are not defined in the grammar.
• Rules are terminated with a period. 177

van Wijngaarden grammars
name: tom symbol; dick symbol; harry symbol.
sentence: name, list, and symbol, name.
list: name, comma symbol, list; name.
• Punctuation is used in the traditional way. For example, the comma binds tighter than the semicolon.
• The punctuation can be read as follows:
:  is defined as a(n)
;  or as a(n)
,  followed by a(n)
.  and as nothing else.
• So this rule:
sentence: name, list, and symbol, name.
would be read as: A sentence is defined as a name followed by a list followed by an and-symbol followed by a name, and as nothing else. 178

van Wijngaarden grammars
• The van Wijngaarden notation achieves its full power only when applied to the two-level van Wijngaarden grammars.
• But it also has merits on its own: it is formal and still quite readable. 179

Extended CF grammars
• CF grammars are made more compact and more readable by introducing special shorthands for frequently used constructions.
• Rules like:
List → Item | Item List
are written in an extended CF grammar as:
List → Item+
Item+ means “one or more Items”.
• We do not need to give a rule for Item+; the rule:
Item+ → Item | Item Item+
is implicit.
• This notation for grammars is called Extended BNF (EBNF). 180

Extended CF grammars (cont.)
• Likewise, rules like:
List → ε | Item List
are written in an extended CF grammar as:
List → Item*
Item* means “zero or more Items”.
• The rule:
Item* → ε | Item Item*
is implicit. 181

Extended CF grammars (cont.)
• Finally, rules like:
Item → ε | Item
are written in an extended CF grammar as:
Item → Item?
Item? means “zero or one Item” (optional Item).
• The rule:
Item? → ε | Item
is implicit. 182

Repetition operators
+, *, and ? are called repetition operators. 183

Extending an operator’s range
• In the preceding examples the operators +, *, and ? work on the preceding symbol.
• Their range can be extended by using parentheses:
(Item ;)? means “optionally an Item followed by a ;” 184

Advantage of extended grammar
The advantage of the repetition operators and parentheses is that grammars can be written more efficiently, more compactly, and more readably. 185

Illustrate the gain in efficiency, compactness, and readability
Book → Preface ChapterSequence Conclusion
Preface → “PREFACE” ParagraphSequence
ChapterSequence → Chapter | Chapter ChapterSequence
Chapter → “CHAPTER” Number ParagraphSequence
ParagraphSequence → Paragraph | Paragraph ParagraphSequence
Paragraph → SentenceSequence
SentenceSequence → …
Conclusion → “CONCLUSION” ParagraphSequence
Using the extended operators:
Book → Preface Chapter+ Conclusion
Preface → “PREFACE” Paragraph+
Chapter → “CHAPTER” Number Paragraph+
Paragraph → Sentence+
Sentence → …
Conclusion → “CONCLUSION” Paragraph+ 186

Overdoing a good thing
Some styles even allow constructions like:
– Item+4 meaning “one or more Items, with a maximum of 4”
– Item+, meaning “one or more Items separated by commas” 187

No increase in expressive power
• The extensions of an EBNF grammar do not increase its expressive power.
• All implicit rules can be made explicit and then a normal CF grammar results.
• Their strength lies in their user-friendliness. 188

Kleene star
• The star in the notation X* is called the Kleene star.
• As we’ve seen, in a grammar X* should be read as “zero or more Xs”.
• If X is a set, X* should be read as “the set of zero or more elements of X concatenated”.
• We denote an alphabet by Σ and the set of all strings over the alphabet by Σ*. 189

Regular expressions
• Forms involving the repetition operators *, +, or ? and possibly the separators ( and ) are called regular expressions.
• EBNFs, which have regular expressions for their right-hand sides, are sometimes called regular right part grammars (RRP grammars), which is more descriptive than “extended context-free” but is more of a tongue twister. 190
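Sketch: making the implicit rules explicit
Since the implicit rules behind +, *, and ? are completely regular in shape, they can be produced mechanically. A small Python sketch (the (lhs, rhs) rule encoding is an assumption of the sketch):

def implicit_rules(symbol):
    """Return the implicit CF rules for X+, X*, or X?."""
    base, op = symbol[:-1], symbol[-1]
    if op == "+":
        return [(symbol, (base,)), (symbol, (base, symbol))]
    if op == "*":
        return [(symbol, ()), (symbol, (base, symbol))]
    if op == "?":
        return [(symbol, ()), (symbol, (base,))]
    return []                                    # no repetition operator

print(implicit_rules("Item+"))
# [('Item+', ('Item',)), ('Item+', ('Item', 'Item+'))]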
Structural meaning of a regular right-hand side
• There are two schools of thought about the structural meaning of a regular right-hand side.
• One school maintains that a rule like:
Book → Preface Chapter+ Conclusion
is an abbreviation of:
Book → Preface α Conclusion
α → Chapter | Chapter α
• That is, a right-recursive interpretation. 191

Right-recursive interpretation
• The advantage of a right-recursive interpretation is that it is easy to explain and the transformation to “normal” CF is simple.
• The disadvantages are:
– The transformation entails anonymous rules (identified by α).
– The lopsided production tree does not correspond to our idea of the structure of the Book (see figure on next slide). 192

Production tree for a right-recursive interpretation
[Figure: a lopsided tree in which Book branches into Preface, α, and Conclusion, and each α yields a Chapter plus possibly another α.]
Book → Preface α Conclusion
α → Chapter | Chapter α 193

Iterative interpretation
• The second school of thought claims that:
Book → Preface Chapter+ Conclusion
is an abbreviation of:
Book → Preface Chapter Conclusion
| Preface Chapter Chapter Conclusion
| Preface Chapter Chapter Chapter Conclusion
| …
• It has the advantage that it yields a beautiful production tree (see figure on next slide), but the disadvantages are that it involves an infinite number of production rules and that the nodes in the production tree have varying fan-out. 194

Production tree for the iterative interpretation
[Figure: a flat tree in which Book has Preface, the Chapters, and Conclusion as direct children.] 195

Which interpretation is commonly used?
Since the iterative interpretation is complex, most practical parser generators use the recursive interpretation in some form or another, whereas most research has been done on the iterative interpretation. 196

XML/XSD uses the iterative interpretation
Book → Preface Chapter+ Conclusion
is expressed in XSD like so:
<xs:element name="Book">
  <xs:complexType>
    <xs:sequence>
      <xs:element name="Preface">...</xs:element>
      <xs:element name="Chapter" maxOccurs="unbounded">...</xs:element>
      <xs:element name="Conclusion">...</xs:element>
    </xs:sequence>
  </xs:complexType>
</xs:element>
which is interpreted using the iterative interpretation:
<Book>
  <Preface>...</Preface>
  <Chapter>...</Chapter>
  <Chapter>...</Chapter>
  <Chapter>...</Chapter>
  <Conclusion>...</Conclusion>
</Book>
Wrong! The XML instance is a sentence of the grammar. The issue is: what is the parse tree (DOM tree) for this instance? 197

Basic property of CF grammars
• The basic property of CF grammars is that they describe things that nest: an object may contain other objects in various places, which in turn may contain … etc.
• When during the production process we have finished producing one of the objects, the right-hand side still “remembers” what has to come after it. 198

While working on expanding a non-terminal, the following symbols remain queued up
Sentence → Subject Verb Object
Subject → NounPhrase
Object → NounPhrase
NounPhrase → the QualifiedNoun
QualifiedNoun → Noun | Adjective QualifiedNoun
Noun → castle | caterpillar | cats
Adjective → well-read | white | wistful | …
Verb → admires | bark | criticize | …
After having descended into the depth of the non-terminal Subject to produce something like the wistful cat, the right-hand side Subject Verb Object still remembers that a Verb must follow.
While we are working on the Subject, the Verb and Object symbols remain queued at the right in the sentential form. For example:
the wistful QualifiedNoun Verb Object 199

Here’s what makes CF languages so useful
• It is the parsability that makes CF languages so useful, not the fact that they stem from the Chomsky hierarchy.
• Parsing is the task of converting a string to the production tree. 200

Type 3 grammars
• The restriction to Type 3 disallows the recollection of things that came before.
• A right-hand side may only contain one non-terminal and it must come at the end.
• This means that there are only two kinds of rules:
– A non-terminal produces zero or more terminals.
– A non-terminal produces zero or more terminals followed by one non-terminal.
• Example: the language a*b+ is generated by this Type 3 grammar:
S → a*B
B → b+ 201

Type 2 versus Type 3
• Type 2 allows queuing up whereas Type 3 does not.
• Type 2 allows recollection of things that came before whereas Type 3 does not.
This is a key concept. 202

Chomsky definition of Type 3
Our definition of Type 3:
– A non-terminal produces zero or more terminals.
– A non-terminal produces zero or more terminals followed by one non-terminal.
Chomsky definition of Type 3:
– A non-terminal produces one terminal.
– A non-terminal produces one terminal followed by one non-terminal.
• Our definition is equivalent and more convenient.
• It is not completely trivial to convert a Type 3 grammar under our definition to a grammar under the Chomsky definition. 203

Type 2.5 grammar (linear grammar)
• A Type 2.5 grammar allows a single non-terminal on the right-hand side, and it doesn’t have to be at the end.
• This kind of grammar is called a linear grammar. 204

Note the equivalence between a Type 3 grammar and a finite-state automaton
Type 3 grammar:
S → a*B
B → b+
[Figure: the equivalent finite-state automaton; state S loops on a and moves to state B on b, and state B loops on b.] 205

Why grammars, not automata?
• There is a close relationship between formal grammars and other abstract notions used in computer science, such as automata and algorithms.
• Indeed, since the results in one theory can often be translated into another, it seems to be an arbitrary decision as to which interpretation is primary.
• In these slides formal grammars are given preferential treatment because they are probably the most commonly known of the various theories among computer scientists.
• This is due to the success of the context-free grammars in describing the syntax of programming languages. 206

Type 3 = Regular grammars
• Type 3 grammars are also called regular grammars (RE grammars) or finite-state grammars (FS grammars).
• More precisely, the version defined below is called right-regular since the only non-terminal in a rule is found at the right end of the right-hand side.
Our definition of Type 3:
– A non-terminal produces zero or more terminals.
– A non-terminal produces zero or more terminals followed by one non-terminal. 207

Type 3 grammars can recurse
• Recall the rules on Type 3 grammars:
– A non-terminal produces zero or more terminals.
– A non-terminal produces zero or more terminals followed by one non-terminal.
• This grammar conforms to the rules; therefore it is Type 3:
A → a | aA (it is recursive) 208
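Sketch: running a Type 3 grammar as an automaton
The grammar–automaton equivalence shown a few slides back can be run directly; the states of the automaton are exactly the non-terminals of the grammar. A Python sketch for S → a*B, B → b+:

def accepts(s):
    """Finite-state recognizer for a*b+ (states S and B from the grammar)."""
    state = "S"
    for ch in s:
        if state == "S" and ch == "a":
            state = "S"              # still producing a's
        elif state == "S" and ch == "b":
            state = "B"              # the first b moves us to B
        elif state == "B" and ch == "b":
            state = "B"              # B -> b+ keeps accepting b's
        else:
            return False
    return state == "B"              # at least one b must have been seen

print(accepts("aaabb"), accepts("b"), accepts("aaa"))   # True True False

Being in state S means an S is still pending at the end of the sentential form; reaching B with no input left means the production process can terminate.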
Left-regular grammars
• The left-regular grammars are subject to the restriction that the only non-terminal in a rule must be at the left end of the right-hand side:
– A non-terminal produces zero or more terminals.
– A non-terminal produces one non-terminal followed by zero or more terminals.
• Example: this Type 3 left-regular grammar generates the language a*b*:
S → Ab*
A → a*
• Left-regular grammars are less intuitive than right-regular grammars, occur less frequently, and are more difficult to process, but they do occur occasionally. 209

Regular grammar = right-regular grammar
Given the prevalence of right-regular over left-regular, the term “regular grammar” is usually intended to mean “right-regular grammar.” 210

Right-recursive vs. right-regular
• A non-terminal A is right-recursive if it can produce a sentential form that has an A at the right end.
– Right-recursive means that rule A can be used again in the production process.
• That a rule is right-regular simply means that its non-terminal is at the right end, following any terminal symbols. Right-regular has nothing to do with recursion. 211

Non-nesting
Regular grammars don’t nest. 212

Common usage
• Regular grammars are used very often to describe the structure of text at the character level, in lexers.
• It is customary for the terminal symbols of a regular grammar to be single characters. 213

Type 3 grammars for the t,d&h language
Right-regular grammar:
Sentence → t | d | h | List
List → t ListTail | d ListTail | h ListTail
ListTail → , List | & t | & d | & h
Left-regular grammar:
Sentence → t | d | h | List
List → ListHead & t | ListHead & d | ListHead & h
ListHead → ListHead , t | ListHead , d | ListHead , h | t | d | h 214

Grammar and equivalent automaton
Right-regular grammar:
S → t | d | h | List
List → t ListTail | d ListTail | h ListTail
ListTail → , List | & t | & d | & h
[Figure: the equivalent automaton; a t, d, or h either finishes the sentence or enters the List state, from List a t, d, or h leads to ListTail, and from ListTail a comma returns to List while an & followed by t, d, or h finishes the sentence.] 215

Challenge of left-regular grammars
• Here is a Type 3 grammar (left-regular):
S → Ab*
A → a*
• Doing recognition with a left-regular grammar requires a non-deterministic automaton. See Section 5.3. 216

Production chain
• The production tree for a sentence from a Type 3 (right-regular) grammar degenerates into a production chain of non-terminals that drop a sequence of terminals on their left.
• See the next slide for an example. 217

Right-regular grammar:
Sentence → t | d | h | List
List → t ListTail | d ListTail | h ListTail
ListTail → , List | & t | & d | & h
[Figure: the production chain Sentence – List – ListTail – List – ListTail, dropping the terminals t , d & h on its left.] 218

The […] notational device
Sentence → t | d | h | List
List → t ListTail | d ListTail | h ListTail
ListTail → , List | & t | & d | & h
There is a lot of repeating in the above grammar. A notational device has been invented to abate this nuisance. Square brackets are used to indicate “one out of a set of characters”: [tdh] is an abbreviation for t | d | h.
Sentence → [tdh] | List
List → [tdh] ListTail
ListTail → , List | & [tdh] 219

The macro notational device
• A macro is a name for a piece of the grammar.
• The macro is referenced by preceding its name with a $ symbol.
• A referenced macro is replaced by the piece of grammar it names.
Sentence → t | d | h | List
List → t ListTail | d ListTail | h ListTail
ListTail → , List | & t | & d | & h
Using a macro:
Name → t | d | h
Sentence → $Name | List
List → $Name ListTail
ListTail → , List | & $Name 220

The lex parser
• lex is a popular tool for regular grammars (strictly speaking, it is a lexical-analyzer generator).
• It supports both notational devices: the […] device and the macro device. 221

Type 3 Chomsky definition of t,d&h
Chomsky definition of Type 3:
– A non-terminal produces one terminal.
– A non-terminal produces one terminal followed by one non-terminal.
The following grammar does not adhere to the Chomsky definition of Type 3:
Sentence → t | d | h | List
List → t ListTail | d ListTail | h ListTail
ListTail → , List | & t | & d | & h
(the alternatives & t, & d, and & h have 2 terminals – not allowed in the Chomsky definition) 222

Convert to the Chomsky definition
Chomsky definition of Type 3:
– A non-terminal produces one terminal.
– A non-terminal produces one terminal followed by one non-terminal.
If we adhere to the Chomsky definition of Type 3, our grammar will not be smaller than:
Our Type 3 definition of the t,d&h language:
Sentence → t | d | h | List
List → t ListTail | d ListTail | h ListTail
ListTail → , List | & t | & d | & h
Chomsky Type 3 definition of the t,d&h language:
Sentence → t | d | h | List
List → t ListTail | d ListTail | h ListTail
ListTail → , List | & Name
Name → t | d | h 223

Chomsky version of t,d&h
Sentence → t | d | h | List
List → t ListTail | d ListTail | h ListTail
ListTail → , List | & Name
Name → t | d | h
The Chomsky version is:
– Bigger (an additional rule is needed)
– Not as easy to read as the version that uses […] or macros
– But it is easier to process 224

Key point
There is variation in how people define Type 0, Type 1, Type 2, and Type 3. Depending on which definition you use, the grammars you create may vary in user-friendliness and in ease of processing. 225

Formal linguist vs. Computer scientist
• Chomsky’s definition of Type 3 has minimal mechanisms. The formal linguist is interested in and helped by minimally sufficient grammars.
• The computer scientist values a form in which the concepts underlying the grammar ($Name, etc.) are easily expressed, at the expense of additional processing. 226

Two observations about regular grammars
• The sentential forms will only contain one non-terminal, and this will always be at the end.
• The size of regular grammars can be reduced considerably by using the repetition operators *, +, and ? along with the grouping operators ( and ). 227

One non-terminal, at the end
Sentence → t | d | h | List
List → t ListTail | d ListTail | h ListTail
ListTail → , List | & t | & d | & h
Sentence → List → t ListTail → t , List → t , d ListTail → t , d & h
Notice in these sentential forms that there is one non-terminal and it is at the end. 228

Smaller and simpler
All regular grammars can be reduced considerably in size by using the repetition operators *, +, and ? for “zero or more”, “one or more” and “optionally one”, respectively.
Sentence → [tdh] | List
List → [tdh] ListTail
ListTail → , List | & [tdh]
Using the repetition operators along with ( and ) for grouping, we can simplify the grammar to:
Sentence → (( [tdh] , )* [tdh] & )? [tdh] 229

Regular expressions
Regular expressions exist for all Type 3 grammars.
Sentence → (( [tdh] , )* [tdh] & )? [tdh]
The right-hand side is a regular expression; the rule as a whole is a regular (Type 3) grammar that uses the repetition operators along with ( and ). 230
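Sketch: the regular expression in action
The regular expression just derived can be handed to any regular-expression engine. In Python's re syntax (whitespace removed):

import re

TDH = re.compile(r"(([tdh],)*[tdh]&)?[tdh]")

for s in ["t", "t,d&h", "t,d,h&t", "t,d", "&h"]:
    print(s, "->", bool(TDH.fullmatch(s)))
# t -> True, t,d&h -> True, t,d,h&t -> True, t,d -> False, &h -> False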
Type 4 grammars
• The last restriction we shall apply to what is allowed in a production rule is a pretty final one: no non-terminal is allowed in the right-hand side.
• This removes all generative power from the mechanism, except for the choosing of alternatives.
• The start symbol has a (finite) list of alternatives from which we are allowed to choose.
• Type 4 grammars are named finite-choice grammars (FC grammars). 231

Example of a FC grammar
• There is no FC grammar for the t,d&h language.
• If, however, we are willing to restrict ourselves to lists of names of finite length (say, no more than three), then there is a FC grammar, since one could enumerate all combinations. For three names we get:
S → [tdh] | [tdh] & [tdh] | [tdh] , [tdh] & [tdh]
for a total of 3 + 3×3 + 3×3×3 = 39 production rules. 232

Chomsky: No FC grammar
• FC grammars are not part of the official Chomsky hierarchy in that they are not identified by Chomsky.
• They are nevertheless very useful and are often required as a tail-piece in some process or reasoning.
• For example, the set of reserved words (keywords) in a programming language can be described by a FC grammar. 233

Parts of grammars are FC
• Although not many grammars are FC in their entirety, some of the rules in many grammars are finite-choice.
• For example, the first t,d&h grammar we looked at has FC rules (the start symbol is Sentence; the three Name rules are finite choice):
Name → tom
Name → dick
Name → harry
Sentence → Name
Sentence → List End
List → Name
List → List , Name
End → and Name 234

Summary
The table below summarizes the most complicated data structures that can occur in the production of a sentence, with respect to the grammar type used.
Chomsky type | Grammar type | Most complicated data structure
0/1 | PS/CS | production dag
2 | CF | production tree
3 | FS | production chain
4 | FC | production element
Legend – dag: directed acyclic graph; PS: phrase-structure; CS: context-sensitive; FS: finite-state; FC: finite-choice. 235

XML is a graph
• XML is a graph due to ID-IDREF.
• So it must be Type 0/1. 236

Symbology
Let:
VN denote the set of non-terminal symbols
VT the set of terminal symbols
S the start symbol
F the production rules 237

Formal definition of Type 0, 1, 2, 3 grammars
A generative grammar G = (VN, VT, S, F) is said to be of Type i if it satisfies the restrictions described in this list:
i = 0: No restrictions, except that the left-hand side must contain at least one non-terminal.
i = 1: Every rewriting rule in F has the form Q1AQ2 → Q1PQ2, with Q1, Q2, and P in (VN ∪ VT)*, A ∈ VN, and P ≠ ε, except possibly for the rule S → ε, which may occur in F, in which case S does not occur on the right-hand sides of the rules.
i = 2: Every rule in F has the form A → P, where A ∈ VN and P ∈ (VN ∪ VT)*.
i = 3: Every rule in F has the form A → PB or A → P, where A, B ∈ VN and P ∈ VT*. 238

Monotonic, but not necessarily a context-sensitive grammar
• A grammar is Type 1 monotonic if every rule has at least as many symbols on the right-hand side as on the left (the tree expands, doesn’t contract).
• A grammar is Type 1 context-sensitive if all of its rules are context-sensitive. A rule is context-sensitive if only one (non-terminal) symbol in its left-hand side gets replaced by other symbols, while we find the others back, undamaged and in the same order, in the right-hand side.
• The grammar below for aⁿbⁿcⁿ is Type 1 monotonic but not Type 1 context-sensitive:
1. S → abc | aSQ
2. bQc → bbcc
3. cQ → Qc
Every rule has at least as many symbols on the right-hand side as on the left, so it is monotonic. 239

Context-sensitive grammar for aⁿbⁿcⁿ
The grammar below for aⁿbⁿcⁿ is Type 1 context-sensitive:
CS grammar for aⁿbⁿcⁿ:
1. S → aSBC | aBC
2. CB → HB
3. HB → HC
4. HC → BC
5. aB → ab
6. bB → bb
7. bC → bc
8. cC → cc
Derivation of a²b²c²:
S (start)
aSBC (rule 1)
aaBCBC (rule 1)
aabCBC (rule 5)
aabHBC (rule 2)
aabHCC (rule 3)
aabBCC (rule 4)
aabbCC (rule 6)
aabbcC (rule 7)
aabbcc (rule 8)
http://www.answers.com/topic/context-sensitive-grammar 240

Generating sentences from a grammar
• Until now we have only produced single sentences from our grammars, in an ad hoc fashion.
• But the purpose of a grammar is to generate all of its sentences.
• Fortunately there is a systematic way to do so. 241

Production queue
S → abc | aSQ
bQc → bbcc
cQ → Qc
[Figure: the sentential form aSQ is taken from the queue; substituting for S puts aabcQ and aaSQQ back on the queue.]
We can systematically generate all sentences using a queue. 242

Systematic way to enumerate all sentences in a PS grammar
• Begin with S as the only sentential form in the queue. Now continue doing the following:
– Consider the first sentential form in the queue.
– Scan it from left to right, looking for a substring that matches the left-hand side of a production rule.
– For each such production rule found, make a copy of the sentential form, replace the substring with the production rule’s right-hand side, and add the revised sentential form to the end of the queue.
– If the original sentential form does not contain any non-terminals, write it down as a sentence in the language.
– Throw away the original sentential form; it has been fully processed.
• If no rule matched and the sentential form was not a finished sentence, it was a blind alley; such forms are removed automatically by the above process and leave no trace. 243

Recursively enumerable
• The procedure on the previous slide enumerates all strings in a PS language.
• Thus, PS languages are also called recursively enumerable sets, where “recursively” is taken to mean “by a possibly recursive algorithm.” 244

Non-recursively enumerable sets
There are sets that are not recursively enumerable; the set of all Type 0 grammars that do not produce the empty string is an example. There is no grammar for it, because this set cannot be generated (you cannot tell whether a Type 0 grammar produces the empty string). If phrase structure is not sufficient, only natural-language description will do, as shown here. 245

Language-generating procedure
The queue procedure is a systematic way of producing all the strings that a grammar is capable of generating. That is, it is a language-generating procedure. 246

Let’s see the procedure in action
For the aⁿbⁿcⁿ grammar:
S → abc | aSQ
bQc → bbcc
cQ → Qc
the first steps run as follows (front of the queue on the left):
Step 1: S — produce abc and aSQ
Step 2: abc, aSQ — abc is all terminals: result abc
Step 3: aSQ — produce aabcQ and aaSQQ
Step 4: aabcQ, aaSQQ — aabcQ produces aabQc
Step 5: aaSQQ, aabQc — aaSQQ produces aaabcQQ and aaaSQQQ
Step 6: aabQc, aaabcQQ, aaaSQQQ — aabQc produces aabbcc
Step 7: aaabcQQ, aaaSQQQ, aabbcc — aaabcQQ produces aaabQcQ
Step 8: aaaSQQQ, aabbcc, aaabQcQ — aaaSQQQ produces aaaabcQQQ and aaaaSQQQQ
Step 9: aabbcc, aaabQcQ, aaaabcQQQ, aaaaSQQQQ — aabbcc is all terminals: result aabbcc
… and so on (aaabQcQ produces aaabbccQ, aaaabcQQQ produces aaaabQcQQ, …). 247

Every sentence will be produced
• The table on the previous slide shows that we do not get a sentence each time we turn the crank (each time we process the item at the front of the queue).
• In fact, real sentences will get scarcer and scarcer. The reason is that during the process more and more side-lines develop which all require equal attention.
• Still, we can be certain that every sentence that can be produced, will be produced. 248

Breadth-first generator
This way of doing things is called breadth-first production. Computers are better at it than humans.
[Figure: the production process drawn as a tree grown breadth-first: S branches to abc (output) and aSQ; aSQ branches to aabcQ and aaSQQ; aabcQ leads to aabQc; aaSQQ leads to aaabcQQ and aaaSQQQ; and so on.] 249
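Sketch: the breadth-first generator in code
The queue procedure is easy to write down exactly as described. A Python sketch (the encoding of non-terminals as upper-case letters is an assumption of the sketch), run on the monotonic aⁿbⁿcⁿ grammar:

from collections import deque

RULES = [("S", "abc"), ("S", "aSQ"), ("bQc", "bbcc"), ("cQ", "Qc")]

def generate(rules, start="S", limit=3):
    """Breadth-first production; loops forever on an empty language."""
    queue, found = deque([start]), 0
    while queue and found < limit:
        form = queue.popleft()
        if not any(c.isupper() for c in form):   # all terminals: a sentence
            yield form
            found += 1
            continue
        for lhs, rhs in rules:                   # every match of every LHS
            i = form.find(lhs)
            while i != -1:
                queue.append(form[:i] + rhs + form[i + len(lhs):])
                i = form.find(lhs, i + 1)
        # a form that matches no LHS but still holds non-terminals is a
        # blind alley: it is silently dropped, leaving no trace

print(list(generate(RULES)))    # ['abc', 'aabbcc', 'aaabbbccc']

On the empty-language grammar S → AB, A → B, B → A the same function runs forever without yielding anything, which is exactly the behavior the following slides discuss.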
Replace all left-hand sides
S → AC
A → b
AC → ac
[Figure: the sentential form AC is taken from the queue; substituting both A and AC puts ac and bC on the queue.]
If we were to only substitute A, then the remainder is C, which is a blind alley. Doing both substitutions (replace A by b and AC by ac) also leads to a blind alley, but there will be an output, ac. 250

Sentential form provides a context
[Figure: somewhere in the queue sits the sentential form abcXYdef; the symbols abc and def around XY are its context.]
The sentential form provides a context. If you ignore a context you run the risk of creating false productions. Remember, this discussion is just for phrase-structure grammars. 251

Grammar that generates the empty set
• What language will this grammar generate?
S → AB
A → B
B → A
• Let’s show a few sentential forms:
S → AB → BB → AB → BB → AB → …
• Every new sentential form contains non-terminals.
• It generates no sentences.
• Therefore, it produces the empty set.
• The language generated is the empty set: L(G) = {} 252

PS grammar that generates the empty set
The language generated by this grammar is also empty:
S → aAB
bB → a
Ab → SBb
Aa → SaB
B → SA
B → ab 253

Undecidable
• We have seen how to systematically generate, for PS grammars, all sentential forms using a queue.
• It is not at all certain that the process will obtain a sentence.
• It is quite possible that the production process goes on forever without ever producing one (see example on previous slide).
• It is undecidable whether a PS grammar produces the empty set.
• “Undecidable” means that there cannot be an algorithm that will, for every PS grammar, correctly tell if the grammar produces at least one sentence. 254

[Diagram: a procedure that takes a PS grammar and answers “Yes (no), the language generated is the empty set” — no such procedure exists.] 255

This queue will run forever
S → AB
A → B
B → A
[Figure: the sentential form AB is taken from the queue; substituting puts AA and BB back on the queue, forever.] 256

No algorithm to determine if PS grammars will produce something
• There is no algorithm that can decide, given an arbitrary PS grammar, whether it will produce a sentence.
• This does not mean that we cannot prove for some given grammar that it generates nothing. It means that the proof method used will not work for all PS grammars.
• We could have a program that correctly says Yes in finite time if the answer is Yes but that takes infinite time if the answer is No. In fact, the queue procedure answers Yes in finite time but takes an infinite time if the answer is No. 257

Many special cases can be identified
• For many PS grammars we can prove whether they produce the empty set or not.
• For example, the grammar may have a rule S → a, or we may find it has no rule without a non-terminal in its right-hand side. 258

Can still get useful info
[Diagram: the non-existent exact procedure again; only approximate answers are possible.]
Even though we can’t get an exact answer, this does not prevent us from obtaining all sorts of useful information that gets close. The computer scientist is aware of but not daunted by the impossibilities from formal languages. 259

The Halting Problem for PS grammars
• The previous slides say there is no algorithm to determine if an arbitrary PS grammar will produce a sentence.
• Question: What would cause a grammar to not produce a sentence?
• Answer: Each sentential form must have a non-terminal (otherwise we have a sentence). The production rules must produce another sentential form with a non-terminal. So the production process never halts.
• There is no algorithm that can determine if an arbitrary PS grammar’s production graph will halt.
260

The halting problem is undecidable
• Problem: write a tool that, given an arbitrary PS grammar, determines whether the grammar will produce a string.
• What algorithm would you devise?
• You might use the algorithm presented earlier: use a queue, take the first item off the queue, substitute, add the new sentential forms to the end of the queue, repeat. Upon the first item generated that has no non-terminals (is a sentence), return Yes (the grammar does produce at least one sentence) and stop. Clearly this algorithm will run forever on those grammars that don’t produce a string.
• Perhaps there is another algorithm that will solve the problem? It turns out, there is no other algorithm. The halting problem is not decidable for PS grammars. 261

Example of a PS grammar?
• First, all Type 1-Type 4 grammars are Type 0 grammars too. And you can trivially rewrite (preserving the sets they generate) any of these so they are no longer Type 1-4.
• If we want interesting examples we will have to concentrate not on the form of the grammar, which is what Type N is concerned with, but on the sets they generate. 262

Fundamental difference between Type 0 and Type 1
The fundamental difference between Type 0 and Type 1 lies in the sets (languages) they can generate:
– For a set that is Type 1 (context-sensitive), we can determine in finite time whether any item is or isn’t in the set, whereas
– For a set that is Type 0 (phrase-structure), an item not in the set cannot be ruled out in finite time. That is, it may take an infinite amount of time to determine that an item is not in the set.
More formally, determining if a given item belongs to a set generated by a Type 0 grammar is undecidable. 263

Time required to determine if an item is in the set defined by a grammar
                        CS grammar   PS grammar
Item is in the set:     finite time  finite time
Item is not in the set: finite time  infinite time
The difference between PS grammars and CS grammars is that with a PS grammar it can take an infinite amount of time to determine that an item is not in the set. 264

Finding a PS grammar
• Finding a phrase-structure grammar that is not a context-sensitive grammar amounts to finding a set for which we can determine in finite time that an item belongs to the set, but for which an infinite amount of time may be required to determine that an item does not belong to the set.
• Here is a set that is well known to be undecidable: the set of all programs that terminate. 265

Creating a PS grammar
• Let L be a grammar for a simple but complete (with full Turing power) programming language.
• Write a breadth-first generator for all programs in L (generate the programs using the queue algorithm).
• Write an interpreter for L.
• Start interpreting the programs breadth-first as they come.
• When a program terminates, we produce it as part of the generated set.
• If the interpreter doesn’t terminate, the program is not a member of the set.
• So this Type 0 grammar generates just the set of all terminating programs in L, a set the membership of which is undecidable, so there cannot be a Type 1 grammar for this set. 266
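Sketch: enumerating the terminating programs
The construction above can be imitated in miniature. In this toy Python sketch the "programs" are generators (each next() is one execution step; returning means halting), and program i is arbitrarily made to loop forever when i is a multiple of 3, a stand-in for real non-termination. The dovetailing is real: every live program gets one step per round, so every halting program is eventually emitted, while the non-halting ones merely keep absorbing steps.

from itertools import count, islice

def make_program(i):
    def body():
        if i % 3 == 0:
            while True:            # this program never halts
                yield
        for _ in range(i):         # this one halts after i steps
            yield
    return body()

def terminating():
    running = []
    for i in count():
        running.append((i, make_program(i)))   # admit the next program
        for entry in running[:]:               # one step for every program
            pid, prog = entry
            try:
                next(prog)
            except StopIteration:              # it halted: it is in the set
                running.remove(entry)
                yield pid

print(list(islice(terminating(), 5)))          # [1, 2, 4, 5, 7]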
Expanding/shrinking PS grammars
Q → XYZ (Q is replaced by XYZ)
XYZ → Q (XYZ is replaced by Q)
Thus a sentential form can grow or shrink. 267

Expanding/shrinking PS sentential forms
[Figure: the length of the sentential form rises and falls over the course of the production process.] 268

Unsolvable
• When we do get sentences from the queue procedure, they may be produced in an unexplainable order.
• The sentential forms may grow for a while and then suddenly shrink, perhaps even to the empty string.
• It can be proven that there cannot be an algorithm that for all PS grammars produces their sentences in increasing length (actually, in non-decreasing length).
• In other words, the parsing problem for PS grammars is unsolvable. 269

Terminology: Undecidable vs. Unsolvable
• Undecidable is the term used for Yes/No questions.
– Example of an undecidable question: For an arbitrary PS grammar, does the grammar produce a sentence?
• Unsolvable is the term used for problems.
– Example of an unsolvable problem: For an arbitrary PS grammar, generate its sentences in increasing length.
• Note: in the literature these terms are used interchangeably. 270

Turn to CS grammars
• We have been discussing phrase-structure grammars:
– How to systematically generate their languages (use the queue procedure).
– Can we write a procedure to decide whether or not an arbitrary PS grammar will generate a sentence? (No.)
– We noted that PS sentential forms can expand and shrink during a production process.
• Now let’s address the same issues for CS grammars. 271

Language generation
The language-generating queue procedure is also applicable to CS grammars. 272

CS grammars don’t shrink
Q1AQ2 → Q1PQ2 (A is replaced by P, in the context Q1 … Q2)
P may be multiple symbols; thus a sentential form can grow.
P cannot be empty (ε); thus a sentential form cannot shrink.
Each production step leaves the sentential form the same length or longer. 273

Expanding CS sentential forms
[Figure: the length of the sentential form only ever grows (or stays the same) over the course of the production process.] 274

Decidable
• The sentential forms in CS grammars never shrink: the strings are produced in monotonic order of non-decreasing length.
• This means: if we want to know if a given string w is in the language, we can just wait until we see it come up, in which case the answer is Yes, or until we see a longer string come up, in which case the answer is No.
• It is decidable whether a CS grammar produces the string w. 275

A procedure exists for deciding if a string w is an element of a CS language
Given a CS grammar G:
1. Generate the next string q using the queue procedure.
2. If q = w: stop; w ∈ L(G).
3. If length(q) > length(w): stop; w ∉ L(G).
4. Otherwise continue with step 1. 276

Recursive sets
• Since the strings in a CS language can be recognized by a possibly recursive algorithm, CS languages are also called recursive sets.
• So, the term recursive set means there exists a procedure for determining if a string w is an element of the set generated by a CS grammar. 277
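Sketch: deciding membership for a monotonic grammar
The decision procedure above can be coded directly. A Python sketch, slightly more economical than the four steps: since sentential forms never shrink, any form longer than w is discarded immediately, so only finitely many forms exist and the search always terminates. (The seen set, an addition of this sketch, guards against same-length cycles such as cQ → Qc.)

from collections import deque

RULES = [("S", "abc"), ("S", "aSQ"), ("bQc", "bbcc"), ("cQ", "Qc")]

def member(w, rules, start="S"):
    queue, seen = deque([start]), {start}
    while queue:
        form = queue.popleft()
        if form == w:
            return True                        # w in L(G)
        for lhs, rhs in rules:
            i = form.find(lhs)
            while i != -1:
                new = form[:i] + rhs + form[i + len(lhs):]
                if len(new) <= len(w) and new not in seen:
                    seen.add(new)
                    queue.append(new)
                i = form.find(lhs, i + 1)
    return False                               # all forms exhausted: w not in L(G)

print(member("aabbcc", RULES), member("aabbc", RULES))   # True False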
Halting Problem Decidable?
• We can systematically generate, for CS grammars, all sentential forms using a queue.
• However, it is not at all certain that the process will obtain a sentence.
• It is quite possible that the production process never produces one.
• Is there a procedure for deciding whether a CS grammar produces the empty set? 278

Halting Problem Decidable?
• Clearly the queue procedure will not work – it will loop endlessly if a grammar’s language is empty.
• Perhaps there is some other procedure that could be applied to a CS grammar to decide if it will produce something?
• It turns out that there is no such procedure. The halting problem is undecidable for CS grammars. See http://www.cs.cmu.edu/~./FLAC/pdf/ContSens-6up.pdf, bottom of page 12. 279

Turn to CF grammars
• We have been discussing context-sensitive grammars:
– How to systematically generate their languages (use the queue procedure).
– A procedure to decide whether a string w is an element of the language generated by an arbitrary CS grammar (run the queue until a match is found or a longer string is encountered).
– We noted that CS sentential forms never shrink during a production process.
• Now let’s address the same issues for CF grammars. 280

There is an algorithm to determine if a CF grammar will produce something
With CF grammars it may still happen that a grammar will never produce a sentence, but we can determine that beforehand, as follows:
1) First, scan the grammar to find all non-terminals which have a right-hand side that contains terminals only or is empty. These non-terminals are guaranteed to produce something.
2) Now scan again to find non-terminals which have a right-hand side that consists of only terminals and non-terminals that are guaranteed to produce something. This will give us new non-terminals that are guaranteed to produce something.
3) Repeat 2) until we find no more new non-terminals.
4) If we have not met the start symbol this way, the grammar will not produce anything.
The halting problem is decidable for CF grammars. 281

Example
Determine that the CF grammar for the t,d&h language produces a sentence:
Sentence → t
Sentence → d
Sentence → h
List → t ListTail
List → d ListTail
List → h ListTail
ListTail → , List
ListTail → & t
ListTail → & d
ListTail → & h
Round 1 – guaranteed to produce something (terminal-only right-hand sides): Sentence → t, Sentence → d, Sentence → h, ListTail → & t, ListTail → & d, ListTail → & h.
Round 2 – now also guaranteed (ListTail is known to produce something): List → t ListTail, List → d ListTail, List → h ListTail.
Round 3 – now also guaranteed (List is known to produce something): ListTail → , List.
The start symbol Sentence is guaranteed to produce something, so the grammar produces a sentence. 282

Leftmost rewriting
Leftmost rewriting: in the production process, rewrite the leftmost non-terminal every time. 283

Rightmost rewriting
Rightmost rewriting: in the production process, rewrite the rightmost non-terminal every time. 284

Notation for: rule 2, second alternative
1. Name → tom | dick | harry
2. Sentence → Name | List and Name
3. List → Name , List | Name
Consider this derivation:
Sentence → List and Name
“Sentence” was rewritten using rule 2’s second alternative, i.e., 2b.
We will write the derivation like so:
Sentence →2b List and Name 285

Compare leftmost and rightmost rewriting
1. Name → tom | dick | harry
2. Sentence → Name | List and Name
3. List → Name , List | Name
Leftmost rewriting:
Sentence →2b List and Name →3a Name, List and Name →1a tom, List and Name →3b tom, Name and Name →1b tom, dick and Name →1c tom, dick and harry
Rightmost rewriting:
Sentence →2b List and Name →1c List and harry →3a Name, List and harry →3b Name, Name and harry →1b Name, dick and harry →1a tom, dick and harry
Notes:
a. The sequences of production rules are not as similar as we would expect. The sequences are neither equal nor each other’s mirror image, nor is there any obvious relationship.
b. In grand total the same rules and alternatives are used.
286

Show the order that non-terminals are rewritten in the production tree
Leftmost rewriting:
Sentence →2b List and Name →3a Name, List and Name →1a tom, List and Name →3b tom, Name and Name →1b tom, dick and Name →1c tom, dick and harry
[Production tree, with the order in which the non-terminals are rewritten: 1 Sentence, 2 List, 3 Name (tom), 4 List, 5 Name (dick), 6 Name (harry).] 287

Show the order that non-terminals are rewritten in the production tree
Rightmost rewriting:
Sentence →2b List and Name →1c List and harry →3a Name, List and harry →3b Name, Name and harry →1b Name, dick and harry →1a tom, dick and harry
[Production tree, with the order in which the non-terminals are rewritten: 1 Sentence, 2 Name (harry), 3 List, 4 List, 5 Name (dick), 6 Name (tom).] 288

Different order of rewriting
[The two trees side by side: leftmost and rightmost rewriting.]
Both rewrite-sequences define the same production tree. But the order of rewriting differs. 289

Leftmost derivation
Here is the sequence of production rules used in leftmost rewriting:
Sentence → List and Name → Name, List and Name → tom, List and Name → tom, Name and Name → tom, dick and Name → tom, dick and harry
This sequence of production rules is called the leftmost derivation of a sentence. 290

Indicating a leftmost production
A leftmost production step can be indicated by using an arrow marked with a small l, for example:
Name, List and Name →l tom, List and Name
The leftmost production sequence:
Sentence →l List and Name →l Name, List and Name →l tom, List and Name →l tom, Name and Name →l tom, dick and Name →l tom, dick and harry
can be abbreviated:
Sentence →l* tom, dick and harry 291

Rightmost derivation
Here is the sequence of production rules used in rightmost rewriting:
Sentence → List and Name → List and harry → Name, List and harry → Name, Name and harry → Name, dick and harry → tom, dick and harry
This sequence of production rules is called the rightmost derivation of a sentence. 292

Indicating a rightmost production
A rightmost production step can be indicated by using an arrow marked with a small r, for example:
List and Name →r List and harry
The rightmost production sequence:
Sentence →r List and Name →r List and harry →r Name, List and harry →r Name, Name and harry →r Name, dick and harry →r tom, dick and harry
can be abbreviated:
Sentence →r* tom, dick and harry 293

Indicating a production
The fact that Sentence produces tom, dick and harry in any way is written:
Sentence →* tom, dick and harry 294

Parsing (defined)
Parsing is the task of reconstructing the derivation tree (or graph) for a given input string.
[Diagram: the grammar and the input string go into the parser; the derivation tree (or graph) comes out.] 295

1. Name → tom | dick | harry
2. Sentence → Name | List and Name
3. List → Name , List | Name
[Diagram: the parser turns the input “tom, dick and harry” into the production tree for the sentence.] 296

Most parsers use leftmost or rightmost derivation
Some of the most efficient parsing techniques can be understood more easily if viewed as attempts to reconstruct a leftmost or rightmost derivation process of the input string.
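Sketch: a parser that reconstructs the tree
Here is a small Python sketch of a parser for the tom-dick-and-harry grammar. It is a simplified recursive-descent recognizer, not one of the book's algorithms; in this particular grammar a comma always continues a List, so no real backtracking is needed.

NAMES = {"tom", "dick", "harry"}

def tokens(s):
    return s.replace(",", " , ").split()

def parse_name(toks, i):
    if i < len(toks) and toks[i] in NAMES:
        return ("Name", toks[i]), i + 1
    return None

def parse_list(toks, i):
    r = parse_name(toks, i)
    if r is None:
        return None
    name, j = r
    if j < len(toks) and toks[j] == ",":          # List -> Name , List
        rest = parse_list(toks, j + 1)
        if rest is not None:
            sub, k = rest
            return ("List", name, ",", sub), k
    return ("List", name), j                      # List -> Name

def parse_sentence(toks):
    r = parse_list(toks, 0)                       # Sentence -> List and Name
    if r is not None:
        lst, j = r
        if j < len(toks) and toks[j] == "and":
            n = parse_name(toks, j + 1)
            if n is not None and n[1] == len(toks):
                return ("Sentence", lst, "and", n[0])
    n = parse_name(toks, 0)                       # Sentence -> Name
    if n is not None and n[1] == len(toks):
        return ("Sentence", n[0])
    return None

print(parse_sentence(tokens("tom, dick and harry")))
# ('Sentence', ('List', ('Name', 'tom'), ',', ('List', ('Name', 'dick'))),
#  'and', ('Name', 'harry'))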
297

The concept of zero is still not well accepted
Roughly 1500 years after the introduction of zero as a number by mathematicians in India, the concept is still not well accepted in computer science:
• Many programming languages do not support records with zero fields.
• Many programming languages do not support arrays with zero elements.
• Many programming languages do not support variable definitions with zero variables.
• In some programming languages the syntax for calling a routine with zero parameters differs from that for a routine with one or more parameters.
• XML provides a special syntax for empty elements.
• Many compilers refuse to compile a module that defines zero names.
• No parser generator can produce a parser for the empty language (the language with zero strings). 298

Empty language vs. a language that consists of the empty string
• Empty language: {}
• Language with only the empty string: {ε}
That language is easily generated by this grammar:
S → ε
• What would the grammar for the empty language look like? 299

Grammars that produce the empty language
• For a grammar to produce nothing, the production process cannot be allowed to terminate.
• Here’s one such grammar:
S → S
That grammar is ugly for two reasons:
– The generation process just loops and no information about the emptiness of the language is obtained.
– The use of the symbol S is arbitrary. 300

Force the production process to get stuck
• Another approach to force the production process to get stuck is by not having any production rules in the grammar.
• Recall that grammars are formally defined as G = (VN, VT, S, F), where F = the set of production rules.
• This approach produces G = ({S}, {}, S, {}).
• That is not very satisfactory either since:
– Now we have a non-terminal without a defining rule.
– The symbol S is still arbitrary. 301

Don’t allow the production process to get started
• A better way is to never allow the production process to get started: have no start symbol.
• This can be accommodated by allowing a set of start symbols in the definition of a grammar rather than a single start symbol.
• There are good reasons for having a set of start symbols: each global element declaration in an XML Schema is a potential start symbol.
• If we extend the definition of a CF grammar to use a set of start symbols, the grammar for the empty language obtains the elegant and satisfactory form:
({}, {}, {}, {}) 302

Rules with empty left-hand side
• It might be useful to have grammar rules in which the left-hand side is empty:
ε → djakdlsaiewp
• Terminal productions of the right-hand sides of such rules may appear anywhere in the input, thus modeling noise and other everyday but extraneous events. 303

Our preoccupation with empty is not frivolous
Our preoccupation with empty strings, sets, languages, etc. is not frivolous, since it is well known that the ease with which a system handles empty cases is a measure of its cleanliness and robustness. 304

CF grammars are limited
• Many things can be expressed using CF grammars.
• However, CF grammars have serious limitations. 305

Lineage of a symbol
Here is a CF grammar for the t,d&h language (t = Tom, d = Dick, h = Harry, S = Start, L = List, N = Name):
1. S → L & N
2. S → N
3. L → N , L
4. L → N
5. N → t
6. N → d
7. N → h
Here is the production tree for a derivation of d, h & h:
[Production tree for “d , h & h”: rule 1 at the top; rule 3 builds the list; rules 6, 4, and 7 produce the d and the first h; rule 7 produces the final h.]
When we have obtained a sentence from a CF grammar, we may look at each terminal symbol in it and ask: How did it get here?
For example, looking at the production tree, we see that “d” was produced as the 1st member of the right-hand side of rule number 6. The left-hand side of this rule, the parent of our symbol, was produced as the 1st member of rule 3. And so on, until we reach the start symbol. We can, in a sense, trace the lineage of the symbol in this way. 306

Express lineage as rule/member pairs
1. S → L & N
2. S → N
3. L → N , L
4. L → N
5. N → t
6. N → d
7. N → h
[The production tree for “d , h & h”, with each node annotated with its (rule, member) pair.]
The lineage of the d, as rule/member pairs: { (6,1), (3,1), (1,1) } 307

Original symbol, original sentence
• If all rule/member pairs in the lineage of a symbol are different, we call the symbol original.
– Example: the lineage of the first h is { (7,1), (4,1), (3,3), (1,1) }. Since all rule/member pairs are different, this h is original.
• If all the symbols in a sentence are original, we call the sentence original. 308

Same symbol, different lineage
[The same production tree; the final h has lineage { (7,1), (1,3) }.]
If a symbol occurs twice in an original sentence, both its lineages must be different: if they were the same, they would describe the same symbol in the same place. 309

Any CF grammar produces a finite set of original sentences
There is a finite number of rules and a finite number of members per rule, so the number of unique (rule, member) pairs is finite. Therefore the number of original symbols is finite and the number of original sentences is finite.
We arrive at the surprising conclusion that any CF grammar produces a finite-size kernel of original sentences and (probably) an infinite number of unoriginal sentences. 310

Unoriginal sentences
• What do “unoriginal” sentences look like?
• By definition, an unoriginal sentence has one or more symbols that are unoriginal.
• A symbol is unoriginal if it has two or more (rule, member) pairs that are the same.
• If two (rule, member) pairs are the same, the same grammar rule is repeated. 311

Original sentence
1. S → b A b
2. A → a A a
3. A → a
The sentence b a a a b, with the lineage of each symbol:
b: { (1,1) }
a: { (2,1), (1,2) }
a: { (3,1), (2,2), (1,2) }
a: { (2,3), (1,2) }
b: { (1,3) }
Each symbol is original, so baaab is an original sentence. 312

Unoriginal sentence
1. S → b A b
2. A → a A a
3. A → a
The sentence b a a a a a b: the middle a has rule/member pairs { (3,1), (2,2), (2,2), (1,2) }.
Duplicate rule/member pairs, so this symbol is not original, and so baaaaab is not an original sentence. 313

Repeated rule
[Production tree for baaaaab: rule 2 is repeated.] 314

Partition the sentence
• Let’s partition the sentence into parts.
• Let A = the non-terminal with the repeated rule (in this case it is rule 2):
– w: the part produced by the A that is furthest down the tree
– vwx: the part produced by the A that is furthest up the tree
– uvwxy: the entire unoriginal sentence
[Tree for baaaaab with the partition u = b, v = a, w = aaa, x = a, y = b.] 315

Pump up the sentence
We can get another unoriginal sentence by replacing the smaller A by the larger A.
[Tree for baaaaab again: the w produced by the lower A is replaced by the vwx produced by the upper A.] 316

Another unoriginal sentence
baaaaab = u v w x y
Replace w with vwx:
baaaaaaab = u v v w x x y = uv²wx²y 317

And another unoriginal sentence
baaaaaaab = u v v w x x y
Replace w with vwx again:
baaaaaaaaab = u v v v w x x x y = uv³wx³y 318

Family of nested sentences
• We can, in this way, construct a complete family of sentences uvⁿwxⁿy for all n ≥ 0.
• This form shows w nested in a number of v and x brackets, in an indifferent context of u and y.
[Tree for baaaaab, showing the nesting of v and x around w inside the context u … y.] 319

Original sentences become exhausted
When we examine longer and longer sentences in a CF language, the original sentences become exhausted and we meet only families of closely related sentences telescoping off into infinity. 320

uvwxy theorem
• uvwxy theorem: any sentence generated by a CF grammar that is longer than the longest original sentence from that grammar can be cut into five pieces u, v, w, x, and y, in such a way that uvⁿwxⁿy are sentences from that grammar for all n ≥ 0.
• The uvwxy theorem is also called the pumping lemma for context-free languages. 321

Language isn’t CF if long sentences don’t decay into nested sentences
• If a language keeps on providing longer and longer sentences without reducing to families of nested sentences, there cannot be a CF grammar for it.
• We have already encountered the context-sensitive language aⁿbⁿcⁿ, and it is easy to see that it does not decay into such nested sentences as sentences get longer and longer. Consequently, there is no CF grammar for it.
– A general technique for showing that a language is not context-free may be found in this article: Commun. ACM, April 1993, Using the context-free pumping lemma, by David Billington. 322

Increase the length of original sentences
• The longest original sentence is a property of the grammar, not the language.
• By making a more complicated grammar for a language we can increase the set of original sentences and push away the border beyond which we are forced to start nesting (repeating).
• If we make the grammar infinitely complicated, we can push the border to infinity and obtain a phrase structure language from it. How we can make a CF grammar infinitely complicated is described in the section on two-level grammars, 15.2.1. 323

Regular grammars are limited
A simpler form of the uvwxy theorem applies to regular (Type 3) languages. 324

Repeated non-terminals for long sentences
• We have seen that the sentential forms occurring in the production process for a regular grammar all contain only one non-terminal, which occurs at the end.
• During the production of a very long sentence, one or more non-terminals must occur two or more times, since there are only a finite number of non-terminals. 325

Example
Regular grammar:
S → sP | sA
P → pQ
Q → qA
A → aR | aT
R → rS
T → tU
U → u
Derivation:
S → sP → spQ → spqA → spqaR → spqarS → spqarsA → spqarsaT → spqarsatU → spqarsatu
Clearly the sequence from the first A to the second A can be repeated over and over. 326

uvⁿw
In the derivation above, u = spq is the part leading up to the first A, v = ars is the part between the first A and the second A (it can be repeated), and w = atu is the part after the last A that terminates the production process. 327

uvw theorem
• uvw theorem: any sufficiently long string from a regular language can be cut into three pieces u, v, and w, so that uvⁿw are strings in the language for all n ≥ 0.
• The uvw theorem is also called the pumping lemma for regular languages.
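Sketch: pumping the regular example
The uvw family from the derivation above can be generated and checked mechanically; here the check is against a regular expression encoding "go around the S → … → A cycle any number of times" (the expression is derived by hand from the grammar, an assumption of this sketch):

import re

u, v, w = "spq", "ars", "atu"
in_language = re.compile(r"spq(ars)*atu")     # u v* w

for n in range(4):
    s = u + v * n + w
    print(n, s, bool(in_language.fullmatch(s)))
# 0 spqatu True / 1 spqarsatu True / 2 spqarsarsatu True / 3 ... True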
328

CF grammars as transition graphs
• A transition graph is a directed graph in which the arrows are labeled with zero or more symbols from the grammar.
• As you follow the arrows in the graph you produce one of the associated symbols, if there is one, and nothing otherwise.
• The nodes, often unlabeled, are resting points between producing the symbols.
• If there is more than one outgoing arrow from a node you can choose any one to follow.
• Here is the transition graph for the tom, dick, and harry language:
[Transition graph: one path produces a single name (tom, dick, or harry); the other cycles through a name and a comma zero or more times, then produces a name, “and”, and a final name.] 329

Produce the same strings
CF grammar:
1. Sentence → Name | List and Name
2. Name → tom | dick | harry
3. List → Name , List | Name
[The same transition graph: it produces the same strings as the grammar.] 330

Turn a grammar into a set of transition graphs
It is easy to turn a CF grammar into a set of transition graphs, one for each non-terminal, as shown below.
Sentence → Name | List & Name
Name → tom | dick | harry
List → Name | Name , List
[Three transition graphs: the graph for Sentence produces either a Name, or a List followed by & and a Name; the graph for Name produces tom, dick, or harry; the graph for List produces either a Name, or a Name, a comma, and another List.] 331

Recursive transition network
• The transition graphs on the previous slide have non-terminal labels above the arrows.
• Upon encountering an arrow that points to node n2 and is labeled with non-terminal N: push n2 onto a stack and continue the walk at the entrance to the transition graph for N. When leaving the transition graph for N, pop n2 from the stack and continue at node n2.
• This is the recursive transition network interpretation of context-free grammars: the set of graphs is the transition network, and the stacking mechanism provides the recursion. 332

Regular grammars as transition graphs
The rules of a regular grammar can also be turned into transition graphs:
Sentence → [tdh] | List
List → [tdh] ListTail
ListTail → & [tdh] | , List
There is a non-terminal only when leaving a graph. No need for stacking: interpret an arrow marked with a non-terminal N as a jump to the transition graph for N. So a regular grammar corresponds to a non-recursive transition network.
[Three transition graphs, one each for Sentence, List, and ListTail.] 333
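Sketch: a recursive transition network, executed
The stacking mechanism just described can be written down almost literally; in Python the recursion of the interpreter plays the role of the stack. The node names (s0, s1, …, end) and the graph encoding are assumptions of this sketch; the graphs transcribe the Sentence/Name/List grammar above.

GRAPHS = {
    "Sentence": {"s0": [("Name", "end"), ("List", "s1")],
                 "s1": [("&", "s2")],
                 "s2": [("Name", "end")]},
    "Name":     {"s0": [("tom", "end"), ("dick", "end"), ("harry", "end")]},
    "List":     {"s0": [("Name", "s1")],
                 "s1": [(",", "s2"), (None, "end")],   # None: leave, no input
                 "s2": [("List", "end")]},
}

def walk(graph, node, toks, i):
    """Yield every input position reachable at the exit of this graph."""
    if node == "end":
        yield i
        return
    for label, nxt in GRAPHS[graph][node]:
        if label is None:                            # unlabeled arrow
            yield from walk(graph, nxt, toks, i)
        elif label in GRAPHS:                        # non-terminal: recurse
            for j in walk(label, "s0", toks, i):     # (implicit push of nxt)
                yield from walk(graph, nxt, toks, j) # (implicit pop)
        elif i < len(toks) and toks[i] == label:     # terminal: consume it
            yield from walk(graph, nxt, toks, i + 1)

toks = ["tom", ",", "dick", "&", "harry"]
print(any(j == len(toks) for j in walk("Sentence", "s0", toks, 0)))  # True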
335

Uselessness is undecidable for Type 0 and Type 1
It can be shown that in general it is undecidable whether a rule in a Type 0 or Type 1 grammar is useless: there cannot be an algorithm that decides it correctly in all cases.
[Figure: a box taking a grammar (Type 0 or 1) and a rule i, and answering "Is the rule useless? yes/no". Impossible to build this.]
336

Uselessness is decidable for Type 2
The problem of deciding whether a rule in a CF grammar is useless is easily solved.
[Figure: the same box, now for a Type 2 grammar. Easy to build this.]
337

Key Concept
• It is important to know what type of grammar you are dealing with (Type 0, 1, 2, or 3). Why?
• Because if you know that it is Type 0 or 1, then you can take advantage of a result from the field of Formal Languages and not attempt to build a program that decides whether a rule is useless. If you know that it is Type 2, then you can easily build the program.
338

3 causes for useless rules in CF grammars
A rule in a CF grammar can be useless through three causes:
1. It may contain an undefined non-terminal.
2. It may not be reachable from the start symbol.
3. It may fail to produce anything.
339

Useless Rules
[Figure: the three kinds of useless rules: undefined non-terminals, unreachable non-terminals, non-productive non-terminals.]
340

Undefined non-terminals
• The right-hand side of a rule may contain a non-terminal for which no production rule is given.
• Such a rule can be removed.
• However, that may result in another non-terminal becoming undefined.
• Example: if the A in the rule B → . . . A . . . is undefined, remove the rule. But now B may be undefined, so remove the rules with B on the right-hand side, and so forth.
341

Unreachable non-terminals
If a non-terminal cannot be reached from the start symbol, its defining rules will never be used, and it cannot contribute to the production of any sentence.
342

Non-productive rules
• A rule that gets stuck in an infinite loop cannot contribute anything to the sentences of the language of the grammar, since once the rule is entered, there is no way to get rid of it: the rule has a non-productive non-terminal.
• Example: the rule X → aX is non-productive, and any rule which has X in its right-hand side is non-productive.
• In an extreme case all non-terminals in a grammar are non-productive. This happens when all right-hand sides in the grammar contain at least one non-terminal. Then there is no way to get rid of the non-terminals, and the grammar itself is non-productive.
343

Loops
• Rules of the form A → A are called loops.
• Loops can be indirect:
A → B
B → C
C → A
• Loops can be hidden:
A → P A Q
P → ε
Q → ε
so that A ⇒* A.
344

Loops can legitimately occur
• A loop can legitimately occur in the production of a sentence, provided there is also a production that enables breaking out of the loop.
• Example: below, the first rule is a loop, but the second rule enables breaking out of it:
A → aA
A → a
345

Proper grammar
A grammar without useless non-terminals and loops is called a proper grammar.
346

Cleaning up a CF grammar
• Normally, grammars supplied by people do not contain undefined, unreachable, or non-productive non-terminals.
• If they do, it is almost certainly a mistake and we would like to detect and report them.
• Such anomalies can, however, occur in generated grammars or be introduced by grammar transformations, in which case we wish to detect them to "clean up" the grammar.
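Direct and indirect loops of the kind shown above can be found mechanically. The Python sketch below is my own and deliberately limited: it only finds loops through unit rules A → B, so hidden loops through ε-producing non-terminals would first require erasing the nullable symbols. It is an instance of the closure algorithms introduced a few slides further on.

    def unit_loops(grammar):
        """Non-terminals A with A =>+ A through unit rules alone.
        grammar maps each non-terminal to a list of right-hand sides
        (tuples of symbols)."""
        # initialization: the direct unit steps A -> B
        step = {a: {rhs[0] for rhs in rules if len(rhs) == 1 and rhs[0] in grammar}
                for a, rules in grammar.items()}
        # inference: whatever B reaches, A reaches too; repeat until stable
        changed = True
        while changed:
            changed = False
            for a in step:
                new = set().union(*(step[b] for b in step[a])) - step[a]
                if new:
                    step[a] |= new
                    changed = True
        return {a for a in step if a in step[a]}

    # The indirect loop shown above: A -> B, B -> C, C -> A
    print(unit_loops({"A": [("B",)], "B": [("C",)], "C": [("A",)]}))  # {'A','B','C'}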
347

Algorithm to detect and remove useless non-terminals and rules
• The algorithm to detect and remove useless non-terminals and rules from a context-free grammar consists of two steps:
1. Remove non-productive rules.
2. Remove unreachable non-terminals.
• It is not necessary to remove rules with undefined non-terminals, since the first step does this automatically.
348

Let's clean up this CF grammar
S → A B | D E
A → a
B → b C
C → c
D → d F
E → e
F → f D
The above grammar looks innocent: all its non-terminals are defined and it does not exhibit any suspicious constructions.
349

Step 1: Remove non-productive rules
• The following slides describe how to remove non-productive rules.
• Find the non-productive rules by finding the productive rules. After finding all productive rules, the other, remaining rules are the non-productive ones.
350

Algorithm to find productive rules
• A rule is productive if its right-hand side consists of symbols all of which are productive.
• Productive symbols:
– Terminal symbols are productive, since they produce terminals.
– Empty is productive, since it produces the empty string.
– A non-terminal is productive if there is a productive rule for it.
351

Initial knowledge
Go through the grammar, and for each rule for which we know that all its right-hand-side symbols are productive, mark the rule and the non-terminal it defines as Productive.
S → A B | D E
A → a        Productive
B → b C
C → c        Productive
D → d F
E → e        Productive
F → f D
352

Build on top of our knowledge
Now we know more. Apply this knowledge in a second round through the grammar.
S → A B | D E
A → a        Productive
B → b C      Productive (since b is productive and C is productive)
C → c        Productive
D → d F
E → e        Productive
F → f D
353

Round three
S → A B      Productive (since A is productive and B is productive)
S → D E
A → a        Productive
B → b C      Productive (since b is productive and C is productive)
C → c        Productive
D → d F
E → e        Productive
F → f D
354

Round four
A fourth round yields nothing new; the table is unchanged.
355

Recap
We now know that A, B, C, E and the rule S → A B are productive; D, F, and the rule S → D E are non-productive.
356

Remove non-productive rules
We have pursued all possible avenues for productivity and have found none for D, F, and the second rule for S. That means they are non-productive and can be removed from the grammar. The grammar after removing the non-productive rules:
S → A B
A → a
B → b C
C → c
E → e
357

Removing non-productive rules also removes undefined non-terminals
• Earlier we said: it is not necessary to remove rules with undefined non-terminals, since the first step [remove non-productive rules] does this automatically.
• Consider a rule R that contains an undefined non-terminal, U. The algorithm shown on the previous slides will never mark R as "Productive", and hence R will be removed.
Also, any rule whose right-hand side contains the non-terminal that R defines (and that has no other productive rule for it) will not be marked "Productive" and will be removed. And so forth.
• So an undefined non-terminal is just a special case of a non-productive non-terminal: it is non-productive because there is no rule for it.
358

Bottom-up process
Removing the non-productive rules is a bottom-up process: only at the bottom level, where the terminal symbols live, can we know what is productive.
359

Knowledge-improving algorithm
• In the previous slides we increased our knowledge with each round.
• This is our first example of a closure algorithm.
360

Closure algorithms
Closure algorithms are characterized by two components:
1. Initialization: an assessment of what we know initially. For our problem we knew:
– the grammar rules;
– terminals and empty are productive.
2. Inference rule: a rule telling how knowledge from several places is to be combined. The inference rule for our problem was: if all the right-hand-side symbols of a rule are productive, then the rule's left-hand-side non-terminal is productive.
The inference rule is repeated until nothing changes any more.
361

Step 2: Remove unreachable non-terminals
The second step in removing useless non-terminals and rules from a context-free grammar is to remove unreachable non-terminals.
362

Reachable non-terminals
• A non-terminal is called reachable or accessible if there exists at least one sentential form, derivable from the start symbol, in which it occurs.
• Example: a non-terminal A is reachable if S →* αAβ for some α and β.
• Find the unreachable non-terminals by finding the reachable ones.
363

Closure algorithm for finding reachable non-terminals
• Initialization: the start symbol is marked "reachable".
• Inference rule: for each rule in the grammar of the form A → α with A marked "reachable", all non-terminals in α are marked "reachable".
• Continue applying the inference rule until nothing changes any more.
• The remaining unmarked non-terminals are not reachable, and their rules can be removed.
364

Initialization
S → A B      S is reachable
A → a
B → b C
C → c
E → e
365

Round one
S → A B      S is reachable
A → a        A is reachable (from S)
B → b C      B is reachable (from S)
C → c
E → e
366

Round two
S → A B      S is reachable
A → a        A is reachable (from S)
B → b C      B is reachable (from S)
C → c        C is reachable (from B)
E → e
367

Round three
The third round produces no change. So the rule E → e is unreachable and can be removed.
368

Cleaned grammar
Initial grammar:
S → A B | D E
A → a
B → b C
C → c
D → d F
E → e
F → f D
After removing non-productive rules:
S → A B
A → a
B → b C
C → c
E → e
After removing unreachable non-terminals:
S → A B
A → a
B → b C
C → c
369

Top-down process
Removing unreachable non-terminals is a top-down process: only at the top level, where the start symbol lives, can we know what is reachable.
370

Order of cleaning matters
• The cleaning process must occur in this order:
– First, remove non-productive rules.
– Second, remove unreachable non-terminals.
• If the order is switched, the result may again contain unreachable non-terminals, and you will have to redo the algorithm for removing unreachable non-terminals.
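The whole two-step cleaning process fits in a few lines of code. The following Python sketch is my formulation (the dictionary representation of the grammar is an assumption of the sketch, not the book's notation); it runs both closure algorithms on the example grammar and reproduces the cleaned grammar shown above.

    GRAMMAR = {
        "S": [("A", "B"), ("D", "E")],
        "A": [("a",)],
        "B": [("b", "C")],
        "C": [("c",)],
        "D": [("d", "F")],
        "E": [("e",)],
        "F": [("f", "D")],
    }

    def remove_nonproductive(grammar):
        """Step 1, as a closure algorithm: terminals are productive; a
        non-terminal becomes productive once one of its rules has an
        all-productive right-hand side."""
        productive, changed = set(), True
        def ok(rhs):
            return all(s not in grammar or s in productive for s in rhs)
        while changed:
            changed = False
            for nt, rules in grammar.items():
                if nt not in productive and any(ok(rhs) for rhs in rules):
                    productive.add(nt)
                    changed = True
        return {nt: [rhs for rhs in rules if ok(rhs)]
                for nt, rules in grammar.items() if nt in productive}

    def remove_unreachable(grammar, start="S"):
        """Step 2: spread reachability from the start symbol."""
        reachable, todo = set(), [start]
        while todo:
            nt = todo.pop()
            if nt not in reachable:
                reachable.add(nt)
                todo += [s for rhs in grammar[nt] for s in rhs if s in grammar]
        return {nt: rules for nt, rules in grammar.items() if nt in reachable}

    print(remove_unreachable(remove_nonproductive(GRAMMAR)))
    # {'S': [('A', 'B')], 'A': [('a',)], 'B': [('b', 'C')], 'C': [('c',)]}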
371

Need to run the algorithms only once
• Suppose we remove non-productive rules and then unreachable rules.
• Consider a non-terminal N in a reachable rule X → αNβ.
• By removing unreachable rules, could N become undefined? Will we have to run the algorithm for removing non-productive rules again?
• No. In the process of removing non-productive rules we determined that all symbols on the right-hand side of X's rule are productive. That means that N is productive (and hence N is defined).
372

Cleaning may remove all rules
Cleaning a grammar may remove all rules, including those for the start symbol, in which case the grammar describes the empty language.
373

Set properties of context-free and regular languages
Since languages are sets, it is natural to ask whether the standard operations on sets can be performed on them, and if so, how. Set operations:
– union
– intersection
– negation (complement)
374

Set operations
• The union of two sets S1 and S2 contains the elements that are in either set; it is written S1 ∪ S2.
• The intersection contains the elements that are in both sets; it is written S1 ∩ S2.
• The negation of a set S contains the elements that are in Σ* but not in S; it is written ¬S.
375

Set operations on grammars
In the context of formal languages the sets are defined through grammars, so actually we want to perform the operations on the grammars rather than on the languages. (A grammar describes a language.)
376

Union of two grammars
• Constructing the grammar for the union of two languages is trivial for context-free and regular languages (and in fact for all Chomsky types): just construct a new start symbol S' → S1 | S2, where S1 and S2 are the start symbols of the two grammars that describe the two languages.
• Make sure the names S1 and S2 are different.
377

context-free ∪ context-free = context-free
The union of two context-free languages is a context-free language.
378

The intersection of two CF languages might not be CF
• The intersection of two context-free languages might not be context-free.
• Consider the two CF languages:
L1 = aⁿbⁿcᵐ (same number of as and bs, arbitrary number of cs)
L2 = aᵐbⁿcⁿ (arbitrary number of as, same number of bs and cs)
described by the CF grammars:
L1 → A P
A → a A b | ε
P → c P | ε
and
L2 → Q C
Q → a Q | ε
C → b C c | ε
• A string that occurs in both languages must have the same number of as and bs (per L1) and the same number of bs and cs (per L2). So the intersection language consists of the strings of the form aⁿbⁿcⁿ, and we know that that language is not context-free.
379

context-free ∩ context-free = …
• The intersection of two context-free languages might be a context-sensitive language.
– Example: the intersection of aⁿbⁿcᵐ and aᵐbⁿcⁿ is aⁿbⁿcⁿ, and the latter is context-sensitive.
• The intersection of two context-free languages might be a context-free language.
– Example: the intersection of a context-free language with itself is context-free.
380

Set theory vs. language theory
• When languages are treated as sets it is easy to obtain aⁿbⁿcⁿ:
aⁿbⁿcⁿ = aⁿbⁿcᵐ ∩ aᵐbⁿcⁿ
• Conversely, when languages are treated as grammars it is quite difficult to generate aⁿbⁿcⁿ:
1. S → abc | aSQ
2. bQc → bbcc
3. cQ → Qc
381

Easy to intersect two CF languages
Just enumerate them both (use the queue algorithm) and output the words that appear on both lists. The queue algorithm outputs the strings in order of increasing length. Suppose grammar 1 generates the string abc. We can determine whether grammar 2 generates abc by running the queue algorithm on grammar 2 until (a) it outputs abc, or (b) it outputs a string of length greater than 3 (the length of abc).
382

Easy to determine membership of the intersection of two XML languages
• Problem: is an XML instance document a member of the intersection of two XML Schemas?
• Validate the XML instance document twice, once for each XML Schema. The XML instance is a member of the intersection if and only if it conforms to both XML Schemas.
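A rough executable version of the enumeration idea (a stand-in of my own, not the book's queue algorithm itself): sentential forms are kept in a priority queue ordered by length, and the membership test stops at the cutoff described above. For the two example grammars the strings do come out shortest first, so the cutoff is safe; arbitrary ε-rules would need more care.

    import heapq

    def sentences(grammar, start="S"):
        """Enumerate the terminal strings of a CF grammar, shortest first.
        grammar maps each non-terminal to a list of right-hand sides,
        each right-hand side being a tuple of symbols."""
        heap, seen = [(1, (start,))], {(start,)}
        while heap:
            _, form = heapq.heappop(heap)            # shortest sentential form first
            nts = [i for i, s in enumerate(form) if s in grammar]
            if not nts:
                yield "".join(form)                  # all terminals: a sentence
                continue
            i = nts[0]                               # expand the leftmost non-terminal
            for rhs in grammar[form[i]]:
                new = form[:i] + rhs + form[i + 1:]
                if new not in seen:
                    seen.add(new)
                    heapq.heappush(heap, (len(new), new))

    def generates(grammar, word):
        """Membership by enumeration, stopping at the length cutoff."""
        for s in sentences(grammar):
            if s == word:
                return True
            if len(s) > len(word):
                return False
        return False

    # The grammars of the a^n b^n c^m / a^m b^n c^n example:
    L1 = {"S": [("A", "P")], "A": [("a", "A", "b"), ()], "P": [("c", "P"), ()]}
    L2 = {"S": [("Q", "C")], "Q": [("a", "Q"), ()], "C": [("b", "C", "c"), ()]}
    print(generates(L1, "aabbcc") and generates(L2, "aabbcc"))   # True: in the intersection
    print(generates(L1, "aabbc") and generates(L2, "aabbc"))     # False: not in L2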
383

The intersection of CF languages has weird properties
• The intersection of two CF languages might be a Type 1 (context-sensitive) language.
• The intersection of three CF languages is more powerful than the intersection of two of them.
• Remarkable phenomenon: any Type 1 language, and even any Type 0 language, can be constructed by intersecting just two CF languages, provided we are allowed to erase, in the resulting strings, all symbols that belong to a set of erasable symbols. (Continued)
384

erasure(L3 ∩ L4) = a CS language
• The CS language we will use to demonstrate this remarkable phenomenon is the set of all strings that consist of two identical parts: ww, where w is any string over the given alphabet.
• The two languages to be intersected are defined by:
L3 → A P
A → a A x | b A y | ε
P → a P | b P | ε
and
L4 → Q C
Q → a Q | b Q | ε
C → x C a | y C b | ε
x and y are the erasable symbols.
385

Intersecting two CF grammars with erasable symbols
• The grammar for L3 produces strings consisting of three parts: a sequence A1 of as and bs, followed by its "dark mirror" image M1, in which a corresponds to x and b to y, followed by an arbitrary sequence G1 of as and bs.
• The grammar for L4 produces strings consisting of an arbitrary sequence G2 of as and bs, followed by a "dark mirror" image M2, followed by a sequence A2 of as and bs.
• Example string (in both languages): abaxyxaba. Erasing the erasable symbols x and y from abaxyxaba gives abaaba.
386

Dark mirror
(Grammars and example string as on the previous slide.) In abaxyxaba, the middle segment xyx is the dark mirror image of the leading aba (per L3) and of the trailing aba (per L4).
387

Each grammar has 3 parts
(Grammars and example string as above.) An L3 string decomposes as A1 M1 G1; an L4 string decomposes as G2 M2 A2.
388

Corresponding parts must match for an intersection
(Grammars and example string as above.) For a string in the intersection, the two decompositions must coincide: the intersection forces A1 = G2, M1 = M2, and G1 = A2.
389

A1 equals A2
(Grammars and example string as above.) The intersection forces A1 = G2, M1 = M2, and G1 = A2. M1 is the dark mirror of A1, so M2 (= M1) is also the dark mirror of A1; but M2 is the dark mirror of A2, so A2 equals A1.
390

Erase the dark mirrors
(Grammars and example string as above.) After erasing the mirrors from abaxyxaba we have abaaba, which is (aba)², a string of the form ww.
391

Create Type 0 from the intersection of context-free languages
Using a massive application of the mirror-mirror trick, one can relatively easily prove that any Type 0 language can be constructed as the intersection of two CF languages, plus a set of erasable symbols. The construction is described extremely formally in:
Ginsburg, Seymour; Greibach, Sheila; Harrison, Michael A. "One-Way Stack Automata". J. ACM 14(2), April 1967, pp. 389-418 (see pp. 402-405).
Roughly the same explanation can be found in:
Harrison, Michael A. Introduction to Formal Language Theory. Addison-Wesley, 1978 (see pp. 307-311).
But like all text in that book the explanation is very dense, and it is severely complicated by the fact that inside the explanation the author also wants to prove that the two CF languages you need are deterministic.
392

**** Provisional Explanation of Creating a Type 0 Language by Intersecting Two CF Languages (Dick Grune) ****
The basic trick is for the string in the intersection to represent the complete production process of a terminal production of a Type 0 grammar G. The steps are encoded as follows:
… ( X1l alpha_1 X1r → -X1r -beta_1 -X1l ) ( X2l alpha_2 X2r → -X2r -beta_2 -X2l ) …   (1)
where →, ( and ) are unique markers; Xnl(eft) and Xnr(ight) are arbitrary strings; -S means the reverse of S; and alpha_1 → beta_1, alpha_2 → beta_2, etc. are rules in G. In fact this is what L1 generates, and that is easy to do: it is just a repetition of ( Xnl alpha_n Xnr → -Xnr -beta_n -Xnl ). It is easy to produce S x -S with a CF grammar, so apply that twice, once for the inner Xnr and once for the outer Xnl, and you're done.
But the above steps form a good derivation only when the output of one step is the input of the next, so -(-X1r -beta_1 -X1l) must be equal to X2l alpha_2 X2r. This is where L2 comes in. It generates
… → Y1 ) ( -Y1 → Y2 ) ( -Y2 → …   (2)
which again can be produced easily by a CF grammar, since its structure is again S x -S. Intersecting the two forces the reverse of -X1r -beta_1 -X1l (which itself was the reverse of the result of the step alpha_1 → beta_1) to be equal to X2l alpha_2 X2r, the input of the next step. This makes the intersection of string (1) and string (2) a representation of a valid Type 0 production process.
There are two more details to take care of. One is the start-up, which is next to trivial. The second is the close-down and the harvesting of the result. This is where the homomorphism (the erasing of the erasable symbols) comes in. Before we start the whole construction we replace all terminals in G by non-terminals with similar names, and declare all symbols in G erasable. This ensures that when we finally apply the homomorphism (the erasure act) the whole production process disappears. But of course we want to keep the final product, which consists exclusively of those non-terminals that represent terminals. We harvest them by letting the productions of L1 and L2 end in the language T ) #-T, where T is any string of the non-terminals created for the original terminals of G, and # replaces each of the non-terminals by its corresponding terminal. Again this is easy to do, since its structure is essentially S x -S once more. Now when we erase the erasable symbols, everything disappears except the final string of terminals, a production of G. Hurray!
393

context-free ∩ regular = context-free
• The intersection of a context-free and a regular language is always a context-free language.
• There is a simple algorithm to construct a grammar for that intersection language.
394

De Morgan's Law
The intersection of two sets equals the negation of the union of their negations:
L1 ∩ L2 = ¬((¬L1) ∪ (¬L2))
395

The negation of a context-free language might not be context-free
• De Morgan's law: L1 ∩ L2 = ¬((¬L1) ∪ (¬L2)).
• Suppose that negating a CF language always produced a CF language.
• Let L1 and L2 be CF languages. Then ¬L1 would be a CF language, as would ¬L2. We know that the union of two CF languages is a CF language, so (¬L1) ∪ (¬L2) would be a CF language, and negating it would again give a CF language. So ¬((¬L1) ∪ (¬L2)) would be a CF language. But that equals L1 ∩ L2, and we already know that the intersection of two CF languages might not be CF.
• Therefore, the negation of a CF language is not guaranteed to be a CF language.
396

Set properties of regular (Type 3) languages
regular-language ∪ regular-language = regular-language
regular-language ∩ regular-language = regular-language
¬regular-language = regular-language
397

Would there be programming languages? XML?
• It is interesting to speculate what would have happened if formal languages had been based on set theory, with all the set operations, right from the start, rather than on the Chomsky hierarchy.
• Would context-free languages still have been invented?
• CF languages guarantee only set union, not set intersection or set difference. If you insist on having set intersection (which is very tempting and convenient; see for example the ease with which you can construct aⁿbⁿcⁿ by intersection), you will never invent CF languages.
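For the regular case, the closure under intersection claimed above can be made concrete with automata instead of grammars (a deliberate swap on my part: the product construction on finite automata is the standard proof device, and the DFA encoding below is my own). Two DFAs are run in lockstep; the pair-state machine accepts exactly the intersection.

    def product_dfa(d1, d2, alphabet):
        """Product construction: each DFA is (start, accepting, delta) with
        delta[(state, symbol)] -> state. The product accepts the intersection."""
        start = (d1[0], d2[0])
        accepting, delta = set(), {}
        todo, seen = [start], {start}
        while todo:
            p, q = todo.pop()
            if p in d1[1] and q in d2[1]:
                accepting.add((p, q))
            for a in alphabet:
                nxt = (d1[2][(p, a)], d2[2][(q, a)])
                delta[((p, q), a)] = nxt
                if nxt not in seen:
                    seen.add(nxt)
                    todo.append(nxt)
        return start, accepting, delta

    def accepts(dfa, word):
        state = dfa[0]
        for ch in word:
            state = dfa[2][(state, ch)]
        return state in dfa[1]

    even = (0, {0}, {(0, "a"): 1, (1, "a"): 0})                  # even number of as
    mod3 = (0, {0}, {(0, "a"): 1, (1, "a"): 2, (2, "a"): 0})     # multiple-of-3 as
    both = product_dfa(even, mod3, {"a"})
    print([n for n in range(13) if accepts(both, "a" * n)])      # [0, 6, 12]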
398

Parsing for grammar conformance
• Scenario: you observe a pattern in the strings that you are dealing with, so you create a grammar to describe the pattern. Now you want to check that your grammar correctly describes the pattern. What is required of a parser?
• Parsing only needs to check that the string conforms to the grammar.
399

Parsing to determine the string's semantics
• Often we want to go further than simply checking that a string conforms to the grammar we have designed for it.
• We want to know the string's meaning, its semantics.
• The semantics of a string is directly related to the structure of its production tree. If it is not, we have the wrong grammar.
400

Attaching semantics to a grammar
Attaching semantics to a (context-free) grammar is done in a very simple and effective way: to each rule in the grammar a semantic clause is attached, which relates the semantics of the members of the right-hand side of the rule to the semantics of the left-hand side.
Sum → Digit {A0 := A1}
Here {A0 := A1} is the semantic clause.
401

Flow of the semantic info
Semantic information can flow up, down, or both ways:
• Up: semantic information flows from the leaves of the tree upward to the start symbol. The semantics of the members of the right-hand side of each rule are used to define the semantics of the left-hand side.
• Down: semantic information flows downward from the start symbol to the leaves. The semantics of the left-hand side of each rule are used to define the semantics of the members of the right-hand side.
• Both: semantic information flows up and down for a while until a stable situation is reached.
402

Inherited vs. derived semantic information
• Semantic information flowing down is called inherited: each rule inherits semantics from its parent in the tree.
• Semantic information flowing up is called derived: each rule derives semantics from its children.
– Derived information is also called synthesized information.
403

Expressing semantics
• There are many ways to express semantics.
• We will briefly describe two often-used and well-studied techniques:
1. Attribute grammars
2. Transduction grammars
404

Add semantic info to this grammar
• We will explain attribute grammars and transduction grammars using the language of sums of one-digit numbers.
• The semantics of a sentence in the language is the value of the sum.
• The language is generated by this grammar:
1. Sum → Digit
2. Sum → Sum + Digit
3. Digit → 0 | 1 | … | 9
Here is one of the strings in the language: 3 + 5 + 1. The semantics of that string is 9.
405

Attribute grammars
• The semantic clauses in an attribute grammar assume that each node in the production tree has room for one or more attributes, which are just values (numbers, strings, or anything else) sitting in the nodes of production trees.
• For simplicity we restrict ourselves to attribute grammars with only one attribute per node.
• The semantic clause of a rule in such a grammar contains some formulas which compute the values of the attributes of some of the non-terminals in that rule from those of other non-terminals in that same rule.
• These semantic actions connect only values that are local to the rule. The overall semantics is composed as the result of all the local computations.
– Local actions produce global results; cool!
• If the semantic action of a rule R computes the value of the left-hand side of R, that value is derived (synthesized). If it computes a value of one of the non-terminals in the right-hand side of R, say A, then that value is inherited by A.
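The "semantic clause per rule" idea can be shown in a few lines of code. In the Python sketch below (my own; the tuple representation of tree nodes is an assumption of the sketch), each function builds one kind of production-tree node for the Sum grammar above and immediately evaluates the rule's semantic clause, so the single synthesized attribute A0 percolates up as the tree is built.

    # Nodes are (label, children, A0); building a node evaluates its clause.
    def digit(d):                       # Digit -> 0 | ... | 9   {A0 := value of d}
        return ("Digit", [], int(d))

    def sum_digit(dig):                 # Sum -> Digit           {A0 := A1}
        return ("Sum", [dig], dig[2])

    def sum_plus(sm, dig):              # Sum -> Sum + Digit     {A0 := A1 + A3}
        return ("Sum", [sm, dig], sm[2] + dig[2])

    # The production tree for 3 + 5 + 1, built leaves first; A0 percolates up.
    tree = sum_plus(sum_plus(sum_digit(digit("3")), digit("5")), digit("1"))
    print(tree[2])   # 9, the semantics of "3 + 5 + 1"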
406

Naming the attributes
Sum → Sum + Digit
A0     A1      A3
Each non-terminal has an associated attribute. The attribute of the symbol on the left-hand side, Sum, is named A0. Each symbol of the right-hand side, including terminals, is indexed, so the attribute of the right-hand-side Sum is A1 and the attribute of Digit is A3 (the + is symbol 2).
407

Attribute grammar for the Sum grammar
1. Sum → Digit           {A0 := A1}
2. Sum → Sum + Digit     {A0 := A1 + A3}
3. Digit → 0             {A0 := 0}
4. Digit → 1             {A0 := 1}
5. Digit → 2             {A0 := 2}
6. Digit → 3             {A0 := 3}
7. Digit → 4             {A0 := 4}
8. Digit → 5             {A0 := 5}
9. Digit → 6             {A0 := 6}
10. Digit → 7            {A0 := 7}
11. Digit → 8            {A0 := 8}
12. Digit → 9            {A0 := 9}
408

Initial production tree
[Figure: the production tree for 3 + 5 + 1; only the leaf attributes are known: A0 = 3, A0 = 5, A0 = 1.]
The initial production tree for 3 + 5 + 1 is given above. Initially only the attributes of the leaves are known, but as soon as all attributes in the right-hand side of a production rule are known, we can use its semantic clause to compute the attribute of its left-hand side. This way the attribute values (the semantics) percolate up the tree, finally reaching the start symbol and providing us with the semantics of the whole string.
409

Attribute values percolate up the tree
[Figure: the same tree in successive stages: first the lower Sum node gets A0 = A1 + A3 = 3 + 5 = 8, then the top Sum node gets A0 = A1 + A3 = 8 + 1 = 9.]
Attribute grammars are a very powerful method of handling the semantics of a language. This is another example of a closure algorithm!
410

XML attributes
• An XML Schema creates a grammar.
• XML has "attributes".
• Are XML attributes in any way related to attribute grammars?
• Did the creators of XML create XML attributes simply for tucking away a name-value pair in a tidy fashion? Or did they have a deeper usage for XML attributes in mind: using them to define the semantics of the XML?
411

Transduction grammars
• Transduction grammars define the semantics of a string (the "input string") as another string, the "output string" or "translation".
• The semantic clause in a production rule is the string that should be output for the node.
• The string for a node is output after the strings for all its children.
412

Transduction grammar for Sum
Here is a transduction grammar which translates a sum of digits into instructions to compute the value of the sum:
1. Sum → Digit           "make it the result"
2. Sum → Sum + Digit     "add it to the previous result"
3. Digit → 0             "take a 0"
4. Digit → 1             "take a 1"
5. Digit → 2             "take a 2"
6. Digit → 3             "take a 3"
7. Digit → 4             "take a 4"
8. Digit → 5             "take a 5"
9. Digit → 6             "take a 6"
10. Digit → 7            "take a 7"
11. Digit → 8            "take a 8"
12. Digit → 9            "take a 9"
413

Sequence of outputs
[Figure: the production tree for 3 + 5 + 1 visited bottom-up; the nodes emit, in order: "take a 3", "make it the result", "take a 5", "add it to the previous result", and so on.]
414

Meaning of 3 + 5 + 1
The transduction grammar translates 3 + 5 + 1 into:
take a 3
make it the result
take a 5
add it to the previous result
take a 1
add it to the previous result
which is indeed what 3 + 5 + 1 "means".
415

Augmented Transition Network (ATN)
• Semantics can be introduced into a recursive transition network by attaching actions to the transitions in the graphs.
• These actions can set variables, construct data structures, etc.
• A recursive transition network thus augmented is known as an Augmented Transition Network (ATN).
[Figure: the recursive transition network for Sentence, with arrows labeled Name, List, &, Name.]
416
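The transduction grammar above is equally easy to animate. Here is a small Python sketch (my own; it flattens the left-recursive tree walk of the "Sequence of outputs" slide into a left-to-right loop over the digits, which visits the nodes in the same order):

    SPOKEN = {str(d): "take a " + str(d) for d in range(10)}   # the Digit clauses

    def transduce(expr):
        """Translate a sum of one-digit numbers, e.g. '3 + 5 + 1'."""
        out = []
        for i, d in enumerate(expr.split("+")):
            out.append(SPOKEN[d.strip()])                      # output of the Digit node
            out.append("make it the result" if i == 0          # Sum -> Digit
                       else "add it to the previous result")   # Sum -> Sum + Digit
        return "\n".join(out)

    print(transduce("3 + 5 + 1"))   # the six-line translation shown above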
Generative Power
• Grammars generate languages.
• By applying restrictions to grammar rules we reduce the generative power of the grammars.
417

Grammar power = language size?
• One often hears statements like these:
– Type n grammars are more powerful than Type n+1 grammars, for n = 0, 1, 2.
– A regular (Type 3) grammar is not powerful enough to match parentheses.
• What kind of power is meant?
• One might think that it is the power to generate larger and larger sets, but this is clearly incorrect: the largest possible set of strings, Σ* (the set of all strings over the alphabet), is easily generated by the (unpowerful) Type 3 grammar:
S → [Σ] S | ε
where [Σ] is an abbreviation for the symbols of the alphabet.
418

Power means restriction!
• Restricting Σ* requires more powerful grammars.
• More powerful grammars can define more complicated boundaries between correct and incorrect sentences. Example: a Type 1 (context-sensitive) grammar can define the sentences with equal numbers of as, bs, and cs (i.e., aⁿbⁿcⁿ), whereas the best that a Type 2 (context-free) grammar can do is define the sentences with equal numbers of as and bs, with no restriction on the number of cs (i.e., aⁿbⁿcᵐ).
• Some boundaries are so fine they cannot be described by any grammar (that is, by any generative process).
419

Power of a grammar
• A more powerful grammar does not mean that the grammar can generate larger languages (larger sets).
• More powerful means that the grammar can define more precise rules about which strings are allowed in the language.
420

Metaphor for grammar power: outlining a rose
• Imagine drawing a rose. It is approximated by increasingly finer outlines.
• In this metaphor, the rose corresponds to the language (imagine the strings of the language as molecules in the rose); the grammar serves to delineate its silhouette.
• A regular grammar allows us only straight horizontal and vertical line segments to describe the flower. Ruler and T-square suffice, but the result is a coarse and mechanical-looking picture.
• A CF grammar would approximate the outline by straight lines at any angle and by circle segments. The drawing could still be made using the classical tools of compass and ruler. The result is stilted but recognizable.
• A CS grammar would present us with a smooth curve tightly enveloping the flower, but the curve is too smooth: it cannot follow all the sharp turns, and it deviates slightly at complicated points. Still, a very realistic picture results.
• An unrestricted phrase structure grammar can represent the outline perfectly.
• The rose itself cannot be caught in a finite description. Its essence remains forever out of our reach.
421

Set of Java programs that can be generated by the grammar types
• A regular grammar can generate the set of all "lexically correct" Java programs. A Java program is lexically correct if there are no newlines inside strings, comments are terminated before end-of-file, all numerical constants have the right form, etc.
• A context-free grammar can generate the set of all "syntactically correct" Java programs. These programs conform to the CF grammar in the language manual.
• A context-sensitive grammar can generate the set of all semantically correct Java programs. These are the programs that pass through a Java compiler without drawing error messages.
• An unrestricted phrase structure grammar can generate the set of all Java programs that would terminate in finite time when run with a given input.
Such a grammar would, however, be very complicated, since it would incorporate detailed descriptions of the Java library routines and the Java run-time system.
• The set of all Java programs that solve a given problem (for example, play chess) cannot be generated by a grammar (although the description of the set is finite).
Note that each of the above sets is a subset of the previous set.
422

Sets generated by the grammar types
[Figure: the nested sets generated by the grammar types.]
423

Set versus Language
[Figure: the sets that can be generated by a grammar type versus the languages that can be generated by a grammar type.]
424

Set of XSLT programs that can be generated by the grammar types
The set of all "lexically correct" XSLT programs can be generated by a regular grammar. An XSLT <xsl:value-of> element is lexically correct if it starts with <xsl:value-of, optionally followed by one or more attribute/value pairs, followed by </xsl:value-of>.
1. Value-of → <xsl:value-of Rest
2. Rest → string = “ Expression
3. Rest → End-tag
4. Expression → string ” End-tag
5. End-tag → > </xsl:value-of >
425

Set of XSLT programs that can be generated by the grammar types
The set of all "syntactically correct" XSLT programs can be generated by a context-free grammar. These programs conform to the CF grammar in the XSLT specification.
1. Value-of → Start-tag End-tag
2. Start-tag → <xsl:value-of Select >
3. Start-tag → <xsl:value-of Separator >
4. Start-tag → <xsl:value-of Disable-output-escaping >
5. End-tag → </xsl:value-of >
6. Select → select = “ Expression ”
7. Separator → separator = “ AVT ”
8. Disable-output-escaping → disable-output-escaping = “ YESNO ”
9. YESNO → yes | no
426

Set of XSLT programs that can be generated by the grammar types
• The set of all semantically correct XSLT programs can be generated by a CS grammar. These are the programs that pass through an XSLT processor without drawing error messages.
• The set of all XSLT programs that would terminate in finite time when run with a given input can be generated by an unrestricted phrase structure grammar. Such a grammar would, however, be very complicated, since it would incorporate detailed descriptions of the XPath function routines and the XSLT run-time system (e.g. optimizations).
• The set of all XSLT programs that solve a given problem (for example, play chess) cannot be generated by a grammar (although the description of the set is finite).
427

The basis for the importance of context-free grammars
• A Chomsky grammar is a finite mechanism that produces a (usually) infinite set of strings, a "language".
• Unlike many other set-generation mechanisms, this production process assigns a structure to the produced string, which can be utilized to attach semantics to it.
– For context-free (Type 2) grammars this structure is a tree, which allows the semantics to be composed from the semantics of the branches. This is the basis of the importance of context-free grammars.
428

Tom, Dick and Harry example
The following slides illustrate each of the grammar types, using the tdh language.
429

Example of a phrase structure (PS) grammar, Type 0
1. Sentence → Name | List End
2. Name → tom | dick | harry
3. List → Name | Name , List
4. , Name End → and Name
Notice that rule 4 has more stuff on the left-hand side than on the right-hand side. That is what characterizes a PS grammar.
Key Point: this grammar is in the form of a phrase structure grammar, but the language (set) it generates can be generated by a mere regular grammar.
Here is what characterizes a set that is Type 0: an item not in the set cannot be ruled out in finite time.
That is, it may take an infinite amount of time to determine that an item is not in the set. More formally: determining whether a given item belongs to a set generated by a Type 0 grammar is undecidable.
430

PS grammar vs. PS language
• A PS grammar is a grammar that has the proper form: no restrictions on the LHS of each rule, other than that it must contain a non-terminal.
• A PS language is a language (set) that can only be generated by a PS grammar, not by a CS grammar, a CF grammar, or an FS grammar.
431

Example of a context-sensitive (CS) grammar, Type 1
1. Sentence → Name | List
2. Name → tom | dick | harry
3. List → EndName | Name Comma List
4. Comma EndName → and EndName
5. and EndName → and Name
6. Comma → ,
Notice that each rule has at least as much stuff on the right-hand side as on the left-hand side. Further, each RHS is exactly like its LHS, except that one non-terminal has been changed. That is what characterizes a CS grammar.
Key Point: this grammar is in the form of a context-sensitive grammar, but the language (set) it generates can be generated by a mere context-free grammar.
What distinguishes a Type 1 grammar from a Type 0 grammar is that an item not in the set can be determined to be not in the set in finite time.
432

Example of a context-free (CF) grammar, Type 2
1. Sentence → Name | List and Name
2. Name → tom | dick | harry
3. List → Name , List | Name
Notice that each rule has exactly one non-terminal on the LHS: the non-terminal is defined independently of context.
433

Example of a regular/finite-state (FS) grammar, Type 3
1. Sentence → tom | dick | harry | List
2. List → tom ListTail | dick ListTail | harry ListTail
3. ListTail → , List | and tom | and dick | and harry
This is a right-regular grammar: each rule's non-terminal is at the right end of the rule.
434

Example of a finite-choice (FC) grammar, Type 4
Sentence → [tdh] | [tdh] and [tdh] | [tdh] , [tdh] & [tdh]
Note: Type 4 is not part of the Chomsky hierarchy.
435

Standard example of a Type 1 language: aⁿbⁿcⁿ
CS grammar for aⁿbⁿcⁿ:
1. S → aSBC | aBC
2. CB → HB
3. HB → HC
4. HC → BC
5. aB → ab
6. bB → bb
7. bC → bc
8. cC → cc
Derivation of a²b²c² (replayed in code a few slides below):
S          (start)
aSBC       (rule 1)
aaBCBC     (rule 1)
aabCBC     (rule 5)
aabHBC     (rule 2)
aabHCC     (rule 3)
aabBCC     (rule 4)
aabbCC     (rule 6)
aabbcC     (rule 7)
aabbcc     (rule 8)
436

Formal summary of Type 0, 1, 2, 3 grammars
A generative grammar G = (VN, VT, S, F) is said to be of type i if it satisfies the restrictions in this list:
i = 0: No restrictions, except that the LHS must contain a non-terminal.
i = 1: Every rewriting rule in F has the form Q1 A Q2 → Q1 P Q2, with Q1, Q2, and P in (VN ∪ VT)*, A ∈ VN, and P ≠ ε, except possibly for the rule S → ε, which may occur in F, in which case S does not occur on the right-hand sides of the rules.
i = 2: Every rule in F has the form A → P, where A ∈ VN and P ∈ (VN ∪ VT)*.
i = 3: Every rule in F has the form A → P B or A → P, where A, B ∈ VN and P ∈ VT*.
437

Type 0-3 languages are in general infinite
• The characteristic languages of Types 0-3 are infinite.
• If a language is finite, we can simply enumerate its sentences using a Type 4 grammar.
438

Why does a parse "reconstruct"?
Parsing is the task of reconstructing the derivation tree (or graph) for a given input string.
• Why is parsing about "reconstructing"?
• That implies that at one time the input string was in the form of a tree (or graph), but somehow it lost that form and now we are reconstructing it.
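Returning to the CS grammar for aⁿbⁿcⁿ a few slides back: a context-sensitive derivation can be replayed mechanically. The Python sketch below is my own; it simply applies a scripted sequence of rules at the leftmost match, sidestepping the genuinely hard part, namely choosing which rule to apply where. It reproduces the derivation of a²b²c² step by step.

    def apply(form, lhs, rhs):
        """One rewriting step: replace the leftmost occurrence of lhs by rhs."""
        assert lhs in form, (form, lhs)
        return form.replace(lhs, rhs, 1)

    # The rule sequence of the derivation of a^2 b^2 c^2, replayed verbatim:
    steps = [("S", "aSBC"), ("S", "aBC"),                 # rule 1, twice
             ("aB", "ab"),                                # rule 5
             ("CB", "HB"), ("HB", "HC"), ("HC", "BC"),    # rules 2-4
             ("bB", "bb"), ("bC", "bc"), ("cC", "cc")]    # rules 6-8
    form = "S"
    for lhs, rhs in steps:
        form = apply(form, lhs, rhs)
        print(form)          # prints each sentential form, ending with aabbcc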
439

The brain creates parse trees
According to Chomsky, and I think he is right in this, sentences in a language, natural or artificial, are constructed according to a grammar. While being generated they obtain a structure, the generation tree (or graph in the case of PS grammars). This structure encodes the meaning. When the sentence is spoken or written, the terminal symbols (words) alone are transferred to the listener or reader, losing the structure (the sentence is linearized). But since the meaning is attached to that structure, the listener or reader will have to reconstruct the generation tree, now called the parse tree, to retrieve the meaning. That is why we need parsing.
Actually I do not think Chomsky is 100% right. CF grammars are not strong enough, and people don't use PS grammars. I think they use affix or attribute grammars (they are equivalent). But the above paragraph still holds.
Dick Grune
440

The brain creates parse trees
Sentences, linear sequences of symbols, are really just serializations of parse trees we humans grok natively in hardware. To get an idea across to someone, we have to conjure up the same parse tree in their head using a word stream.
The Definitive ANTLR 4 Reference, p. 11
441

The computer scientist is undaunted by undecidable problems
The issue addressed here is the formal linguist saying "you can't do this" (and he is correct) and the computer scientist saying "true, but I can handle an increasing number of instances with increasingly complicated algorithms". A good example is the package AmbiDexter (http://homepages.cwi.nl/~storm/publications/ambidexter.pdf), which "solves" an undecidable problem: is a given CF grammar ambiguous? (Impossible to decide, according to formal language theory.) The program does so by trying all kinds of tricks to make the grammar produce two identical sentences. (It was written by Bas Basten, one of those undaunted computer scientists.)
Dick Grune
442

OTHER SLIDES
443

Grammar-oriented programming
• Grammar-oriented programming (GOP) and Grammar-oriented Object Design (GOOD) are suited to designing and creating a domain-specific programming language (DSL) for a specific business domain.
• GOOD can be used to drive the execution of the application, or it can be used to embed the declarative processing logic of a context-aware component (CAC) or context-aware service (CAS). GOOD is a method for creating and maintaining dynamically reconfigurable software architectures driven by business-process architectures. The business compiler was used to capture business processes within real-time workshops for various lines of business and to create an executable simulation of the processes used.
http://en.wikipedia.org/wiki/Grammar-oriented_programming
444

Rodney Brooks
Once Turing came up with a formalism for computation we were able to make great progress fairly quickly. Now if you took any late 19th-century mathematicians, you could explain the fundamental ideas of computation to them in two or three days, lead them through the theorems, and they could understand it and they wouldn't find it mind-boggling in any way. It follows on from 19th-century mathematics. Once you have that notion of computation, you are able to do a lot with it.
http://www.edge.org/conversation/the-deep-question
445

CS grammar for generating the same number of as, bs, and cs
Shuffle the as, bs, and cs:
1. S → S' | ε
2. S' → S'ABC | ABC
3. AB → BA
4. AC → CA
5. BA → AB
6. BC → CB
7. CA → AC
8. CB → BC
9. A → a
10. B → b
11. C → c
446

Grammar for generating 1, 2, 4, 8, … as (2ⁱ as)
S → ACaB
Ca → aaC
CB → DB
CB → E
aD → Da
AD → AC
aE → Ea
AE → ε
Sample derivation of aa:
S → ACaB → AaaCB → AaaE → AaEa → AEaa → aa
Sample derivation of aaaa (each round trip of the C/D markers doubles the as):
S → ACaB → AaaCB → AaaDB → AaDaB → ADaaB → ACaaB → AaaCaB → AaaaaCB → AaaaaE → AaaaEa → AaaEaa → AaEaaa → AEaaaa → aaaa
447

CS languages
One of the simplest context-sensitive languages is the language of all strings consisting of n occurrences of the symbol "a", then n "b"s, then n "c"s (abc, aabbcc, aaabbbccc, etc.). A superset of this language, called the Bach language, is defined as the set of all strings in which "a", "b" and "c" (or any other set of three symbols) occur equally often (aabccb, baabcaccb, etc.), and it is also context-sensitive.
Another example of a context-sensitive language that is not context-free is L = { aᵖ : p is a prime number }. L can be shown to be a context-sensitive language by constructing a linear bounded automaton which accepts L. The language can easily be shown to be neither regular nor context-free by applying the respective pumping lemmas for each of the language classes to L.
http://en.wikipedia.org/wiki/Context-sensitive_language
448

Symbols
∈ = an element of
∉ = not an element of
ℕ = the set of natural numbers
ε = the empty string
Σ = the alphabet
Σ* = the set of all strings over the alphabet Σ
δ = transition function for a single token
δ* = transition function for a sequence of tokens
→ = produces (is rewritten as)
⊆ = subset of
L = a language
L̄ = the complement of language L
∩ = intersection
∪ = union
↦ = maps to
∅ = the empty set
≠ = not equal to
¬ = negation
449