Regular Languages & Finite Automata Lecture Notes

Lecture Notes on Regular Languages and Finite Automata
for Part IA of the Computer Science Tripos

Prof. Andrew M. Pitts
Cambridge University Computer Laboratory
© 2013 A. M. Pitts

Contents

Learning Guide

1 Regular Expressions
  1.1 Alphabets, strings, and languages
  1.2 Pattern matching
  1.3 Some questions about languages
  1.4 Exercises

2 Finite State Machines
  2.1 Finite automata
  2.2 Determinism, non-determinism, and ε-transitions
  2.3 A subset construction
  2.4 Summary
  2.5 Exercises

3 Regular Languages, I
  3.1 Finite automata from regular expressions
  3.2 Decidability of matching
  3.3 Exercises

4 Regular Languages, II
  4.1 Regular expressions from finite automata
  4.2 An example
  4.3 Complement and intersection of regular languages
  4.4 Exercises

5 The Pumping Lemma
  5.1 Proving the Pumping Lemma
  5.2 Using the Pumping Lemma
  5.3 Decidability of language equivalence
  5.4 Exercises

6 Grammars
  6.1 Context-free grammars
  6.2 Backus-Naur Form
  6.3 Regular grammars
  6.4 Chomsky and Greibach normal forms
  6.5 Exercises

7 Pushdown Automata
  7.1 Non-deterministic pushdown automata
  7.2 Behaviour of a NPDA
  7.3 Language accepted by a NPDA
  7.4 Toward computation theory
  7.5 Exercises

Learning Guide

The notes are designed to accompany eight lectures on regular languages and finite automata for Part IA of the Cambridge University Computer Science Tripos. The aim of this short course will be to introduce the mathematical formalisms of finite state machines, regular expressions and context-free grammars, and to explain their applications to computer languages. As such, it covers some basic theoretical material which Every Computer Scientist Should Know. Direct applications of the course material occur in the CST Part IB course on Compiler Construction and the CST Part II course on Natural Language Processing.
Further and related developments will be found in the CST Part IB courses Computation Theory and Semantics of Programming Languages.

This course contains the kind of material that is best learned through practice. The books mentioned below contain a large number of problems of varying degrees of difficulty, and some contain solutions to selected problems. A few exercises are given at the end of each section of these notes and relevant past Tripos questions are indicated there.

At the end of the course students should be able to explain how to convert between the three ways of representing regular sets of strings introduced in the course; and be able to carry out such conversions by hand for simple cases. They should also be able to use the Pumping Lemma to prove that a given set of strings is not a regular language. They should be able to design a pushdown automaton to accept strings for a given context-free grammar.

Recommended books

Textbooks which cover the material in this course also tend to cover the material you will meet in the CST Part IB courses on Computation Theory and Complexity Theory, and the theory underlying parsing in various courses on compilers. There is a large number of such books. Three recommended ones are listed below.

• J. E. Hopcroft, R. Motwani and J. D. Ullman, Introduction to Automata Theory, Languages, and Computation, Second Edition (Addison-Wesley, 2001).
• D. C. Kozen, Automata and Computability (Springer-Verlag, New York, 1997).
• T. A. Sudkamp, Languages and Machines (Addison-Wesley Publishing Company, Inc., 1988).

Note

The material in these notes has been drawn from several different sources, including the books mentioned above and previous versions of this course by the author and by others. Any errors are of course all the author’s own work. A list of corrections will be available from the course web page (follow links from www.cl.cam.ac.uk/Teaching/).
Andrew Pitts
Andrew.Pitts@cl.cam.ac.uk

1 Regular Expressions

Doubtless you have used pattern matching in the command-line shells of various operating systems (Slide 1) and in the search facilities of text editors. Another important example of the same kind is the ‘lexical analysis’ phase in a compiler during which the text of a program is divided up into the allowed tokens of the programming language. The algorithms which implement such pattern-matching operations make use of the notion of a finite automaton (which is Greeklish for finite state machine). This course reveals (some of!) the beautiful theory of finite automata (yes, that is the plural of ‘automaton’) and their use for recognising when a particular string matches a particular pattern.

Pattern matching
What happens if, at a Unix/Linux shell prompt, you type
  ls *
and press return?
Suppose the current directory contains files called regfla.tex, regfla.aux, regfla.log, regfla.dvi, and (strangely) .aux. What happens if you type
  ls *.aux
and press return?
Slide 1

1.1 Alphabets, strings, and languages

The purpose of Section 1 is to introduce a particular language for patterns, called regular expressions, and to formulate some important problems to do with pattern-matching which will be solved in the subsequent sections. But first, here is some notation and terminology to do with character strings that we will be using throughout the course.

Alphabets
An alphabet is specified by giving a finite set, Σ, whose elements are called symbols. For us, any set qualifies as a possible alphabet, so long as it is finite.
Examples:
  Σ1 = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9}, the 10-element set of decimal digits.
  Σ2 = {a, b, c, ..., x, y, z}, the 26-element set of lower-case characters of the English language.
  Σ3 = {S | S ⊆ Σ1}, the 2^10-element set of all subsets of the alphabet of decimal digits.
Non-example:
  N = {0, 1, 2, 3, ...}, the set of all non-negative whole numbers, is not an alphabet, because it is infinite.
Slide 2

Strings over an alphabet
A string of length n (≥ 0) over an alphabet Σ is just an ordered n-tuple of elements of Σ, written without punctuation.
Example: if Σ = {a, b, c}, then a, ab, aac, and bbac are strings over Σ of lengths one, two, three and four respectively.
Σ* ≝ set of all strings over Σ of any finite length.
N.B. there is a unique string of length zero over Σ, called the null string (or empty string) and denoted ε (no matter which Σ we are talking about).
Slide 3

Concatenation of strings
The concatenation of two strings u, v ∈ Σ* is the string uv obtained by joining the strings end-to-end.
Examples: If u = ab, v = ra and w = cad, then vu = raab, uu = abab and wv = cadra.
This generalises to the concatenation of three or more strings. E.g. uvwuv = abracadabra.
Slide 4

Slides 2 and 3 define the notions of an alphabet Σ, and the set Σ* of finite strings over an alphabet. The length of a string u will be denoted by length(u). Slide 4 defines the operation of concatenation of strings. We make no notational distinction between a symbol a ∈ Σ and the corresponding string of length one over Σ: so Σ can be regarded as a subset of Σ*. Note that Σ* is never empty: it always contains the null string, ε, the unique string of length zero. Note also that for any u, v, w ∈ Σ*

  uε = u = εu   and   (uv)w = uvw = u(vw)

and length(uv) = length(u) + length(v).

Example 1.1.1. Examples of Σ* for different Σ:
(i) If Σ = {a}, then Σ* contains ε, a, aa, aaa, aaaa, ...
(ii) If Σ = {a, b}, then Σ* contains ε, a, b, aa, ab, ba, bb, aaa, aab, aba, abb, baa, bab, bba, bbb, ...
(iii) If Σ = ∅ (the empty set, the unique set with no elements), then Σ* = {ε}, the set just containing the null string.

1.2 Pattern matching

Slide 5 defines the patterns, or regular expressions, over an alphabet Σ that we will use.
Each such regular expression, r, represents a whole set (possibly an infinite set) of strings in Σ* that match r. The precise definition of this matching relation is given on Slide 6. It might seem odd to include a regular expression ∅ that is matched by no strings at all, but it is technically convenient to do so. Note that the regular expression ε is in fact equivalent to ∅*, in the sense that a string u matches ∅* iff it matches ε (iff u = ε).

Regular expressions over an alphabet Σ
• each symbol a ∈ Σ is a regular expression
• ε is a regular expression
• ∅ is a regular expression
• if r and s are regular expressions, then so is (r|s)
• if r and s are regular expressions, then so is rs
• if r is a regular expression, then so is (r)*
Every regular expression is built up inductively, by finitely many applications of the above rules.
(N.B. we assume ε, ∅, (, ), |, and * are not symbols in Σ.)
Slide 5

Remark 1.2.1 (Binding precedence in regular expressions). In the definition on Slide 5 we assume implicitly that the alphabet Σ does not contain the six symbols

  ε   ∅   (   )   |   *

Then, concretely speaking, the regular expressions over Σ form a certain set of strings over the alphabet obtained by adding these six symbols to Σ. However it makes things more readable if we adopt a slightly more abstract syntax, dropping as many brackets as possible and using the convention that

  −* binds more tightly than − −, which in turn binds more tightly than −|−.

So, for example, r|st* means (r|s(t)*), not (r|s)(t)*, or ((r|st))*, etc.
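The precedence convention of Remark 1.2.1 can be tried out with a small sketch in Python. Note the assumptions: Python's `re` module is used here only as a stand-in for the notation of Slide 5 (its pattern language is much richer than ours, but star, concatenation and `|` have the same relative precedence), and the helper name `matches` is our own.

```python
import re

# "a|bc*" plays the role of r|st* with r = a, s = b, t = c.
# Under the convention of Remark 1.2.1 it is read as a|(b(c*)).
pattern = "a|bc*"
# The reading that the convention rules out, written with explicit brackets.
grouped = "(a|b)c*"

def matches(regex, u):
    """Return True iff the whole string u matches regex (hypothetical helper)."""
    return re.fullmatch(regex, u) is not None

# Print, for each test string, whether it matches each of the two readings.
for u in ["a", "b", "bcc", "ac"]:
    print(repr(u), matches(pattern, u), matches(grouped, u))
```

The string ac tells the two readings apart: it matches (a|b)c* but not a|bc*.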
Matching strings to regular expressions
• u matches a ∈ Σ iff u = a
• u matches ε iff u = ε
• no string matches ∅
• u matches r|s iff u matches either r or s
• u matches rs iff it can be expressed as the concatenation of two strings, u = vw, with v matching r and w matching s
• u matches r* iff either u = ε, or u matches r, or u can be expressed as the concatenation of two or more strings, each of which matches r
Slide 6

The definition of ‘u matches r*’ on Slide 6 is equivalent to saying

  for some n ≥ 0, u can be expressed as a concatenation of n strings, u = u1 u2 ... un, where each ui matches r.

The case n = 0 just means that u = ε (so ε always matches r*); and the case n = 1 just means that u matches r (so any string matching r also matches r*). For example, if Σ = {a, b, c} and r = ab, then the strings matching r* are ε, ab, abab, ababab, etc.

Note that we didn’t include a regular expression for the ‘*’ occurring in the UNIX examples on Slide 1. However, once we know which alphabet we are referring to, Σ = {a1, a2, ..., an} say, we can get the effect of * using the regular expression

  (a1|a2| ... |an)*

which is indeed matched by any string in Σ* (because a1|a2| ... |an is matched by any symbol in Σ).

Examples of matching, with Σ = {0, 1}
• 0|1 is matched by each symbol in Σ
• 1(0|1)* is matched by any string in Σ* that starts with a ‘1’
• ((0|1)(0|1))* is matched by any string of even length in Σ*
• (0|1)*(0|1)* is matched by any string in Σ*
• (ε|0)(ε|1)|11 is matched by just the strings ε, 0, 1, 01, and 11
• ∅1|0 is just matched by 0
Slide 7

Notation 1.2.2. The notation r + s is quite often used for what we write as r|s. The notation r^n, for n ≥ 0, is an abbreviation for the regular expression obtained by concatenating n copies of r. Thus:

  r^0 ≝ ε
  r^(n+1) ≝ r(r^n).

Thus u matches r* iff u matches r^n for some n ≥ 0. We use r^+ as an abbreviation for rr*.
Thus u matches r^+ iff it can be expressed as the concatenation of one or more strings, each one matching r.

1.3 Some questions about languages

Slide 8 defines the notion of a formal language over an alphabet. We take a very extensional view of language: a formal language is completely determined by the ‘words in the dictionary’, rather than by any grammatical rules. Slide 9 gives some important questions about languages, regular expressions, and the matching relation between strings and regular expressions.

Languages
A (formal) language L over an alphabet Σ is just a set of strings in Σ*. Thus any subset L ⊆ Σ* determines a language over Σ.
The language determined by a regular expression r over Σ is
  L(r) ≝ {u ∈ Σ* | u matches r}.
Two regular expressions r and s (over the same alphabet) are equivalent iff L(r) and L(s) are equal sets (i.e. have exactly the same members).
Slide 8

Some questions
(a) Is there an algorithm which, given a string u and a regular expression r (over the same alphabet), computes whether or not u matches r?
(b) In formulating the definition of regular expressions, have we missed out some practically useful notions of pattern?
(c) Is there an algorithm which, given two regular expressions r and s (over the same alphabet), computes whether or not they are equivalent? (Cf. Slide 8.)
(d) Is every language of the form L(r)?
Slide 9

The answer to question (a) on Slide 9 is ‘yes’. Algorithms for deciding such pattern-matching questions make use of finite automata. We will see this during the next few sections.

If you have used the UNIX utility grep, or a text editor with good facilities for regular expression based search, like emacs, you will know that the answer to question (b) on Slide 9 is also ‘yes’: the regular expressions defined on Slide 5 leave out some forms of pattern that one sees in such applications.
However, the answer to the question is also ‘no’, in the sense that (for a fixed alphabet) these extra forms of regular expression are definable, up to equivalence, from the basic forms given on Slide 5. For example, if the symbols of the alphabet are ordered in some standard way, it is common to provide a form of pattern for naming ranges of symbols: for example [a-z] might denote a pattern matching any lowercase letter. It is not hard to see how to define a regular expression (albeit a rather long one) which achieves the same effect. However, some other commonly occurring kinds of pattern are much harder to describe using the rather minimalist syntax of Slide 5. The principal example is complementation, ∼(r):

  u matches ∼(r) iff u does not match r.

It will be a corollary of the work we do on finite automata (and a good measure of its power) that every pattern making use of the complementation operation ∼(−) can be replaced by an equivalent regular expression just making use of the operations on Slide 5. But why do we stick to the minimalist syntax of regular expressions on that slide? The answer is that it reduces the amount of work we will have to do to show that, in principle, matching strings against patterns can be decided via the use of finite automata.

The answer to question (c) on Slide 9 is ‘yes’ and once again this will be a corollary of the work we do on finite automata. (See Section 5.3.)

Finally, the answer to question (d) is easily seen to be ‘no’, provided the alphabet Σ contains at least one symbol. For in that case Σ* is countably infinite; and hence the number of languages over Σ, i.e. the number of subsets of Σ*, is uncountable. (Recall Cantor’s diagonal argument.) But since Σ is a finite set, there are only countably many regular expressions over Σ. (Why?) So the answer to (d) is ‘no’ for cardinality reasons.
However, even amongst the countably many languages that are ‘finitely describable’ (an intuitive notion that we will not formulate precisely) many are not of the form L(r) for any regular expression r. For example, in Section 5.2 we will use the ‘Pumping Lemma’ to see that

  {a^n b^n | n ≥ 0}

is not of this form.

1.4 Exercises

Exercise 1.4.1. Write down an ML data type declaration for a type constructor ’a regExp whose values correspond to the regular expressions over an alphabet ’a.

Exercise 1.4.2. Find regular expressions over {0, 1} that determine the following languages:
(a) {u | u contains an even number of 1’s}
(b) {u | u contains an odd number of 0’s}

Exercise 1.4.3. For which alphabets Σ is the set Σ* of all finite strings over Σ itself an alphabet?

Tripos questions
2012.2.8   2005.2.1(d)   1999.2.1(s)   1997.2.1(q)   1996.2.1(i)

2 Finite State Machines

We will be making use of mathematical models of physical systems called finite automata, or finite state machines, to recognise whether or not a string is in a particular language. This section introduces this idea and gives the precise definition of what constitutes a finite automaton. We look at several variations on the definition (to do with the concept of determinism) and see that they are equivalent for the purpose of recognising whether or not a string is in a given language.

2.1 Finite automata

Example of a finite automaton
[Diagram: transition graph with states q0, q1, q2, q3 and edges labelled a and b]
States: q0, q1, q2, q3.
Input symbols: a, b.
Transitions: as indicated above.
Start state: q0.
Accepting state(s): q3.
Slide 10

The key features of this abstract notion of ‘machine’ are listed below and are illustrated by the example on Slide 10.

• There are only finitely many different states that a finite automaton can be in. In the example there are four states, labelled q0, q1, q2, and q3.
• We do not care at all about the internal structure of machine states. All we care about is which transitions the machine can make between the states. A symbol from some fixed alphabet Σ is associated with each transition: we think of the elements of Σ as input symbols. Thus all the possible transitions of the finite automaton can be specified by giving a finite graph whose vertices are the states and whose edges have both a direction and a label (drawn from Σ). In the example Σ = {a, b} and the only possible transitions from state q1 are

  q1 --b--> q0   and   q1 --a--> q2.

  In other words, in state q1 the machine can either input the symbol b and enter state q0, or it can input the symbol a and enter state q2. (Note that transitions from a state back to the same state are allowed: e.g. q3 --a--> q3 in the example.)

• There is a distinguished start state [1]. In the example it is q0. In the graphical representation of a finite automaton, the start state is usually indicated by means of an unlabelled arrow.

• The states are partitioned into two kinds: accepting states [2] and non-accepting states. In the graphical representation of a finite automaton, the accepting states are indicated by double circles round the name of each such state, and the non-accepting states are indicated using single circles. In the example there is only one accepting state, q3; the other three states are non-accepting. (The two extreme possibilities that all states are accepting, or that no states are accepting, are allowed; it is also allowed for the start state to be accepting.)

The reason for the partitioning of the states of a finite automaton into ‘accepting’ and ‘non-accepting’ has to do with the use to which one puts finite automata, namely to recognise whether or not a string u ∈ Σ* is in a particular language (= subset of Σ*). Given u we begin in the start state of the automaton and traverse its graph of transitions, using up the symbols in u in the correct order reading the string from left to right. If we can use up all the symbols in u in this way and reach an accepting state, then u is in the language ‘accepted’ (or ‘recognised’) by this particular automaton; otherwise u is not in that language. This is summed up on Slide 11.

[1] The term initial state is a common synonym for ‘start state’.
[2] The term final state is a common synonym for ‘accepting state’.

L(M), language accepted by a finite automaton M
consists of all strings u over its alphabet of input symbols satisfying q0 --u-->* q with q0 the start state and q some accepting state. Here

  q0 --u-->* q

means, if u = a1 a2 ... an say, that for some states q1, q2, ..., qn = q (not necessarily all distinct) there are transitions of the form

  q0 --a1--> q1 --a2--> q2 --a3--> ··· --an--> qn = q.

N.B.
  case n = 0: q --ε-->* q′ iff q = q′
  case n = 1: q --a-->* q′ iff q --a--> q′.
Slide 11

Example 2.1.1. Let M be the finite automaton pictured on Slide 10. Using the notation introduced on Slide 11 we have:

  q0 --aaab-->* q3                 (so aaab ∈ L(M))
  q0 --abaa-->* q iff q = q2       (so abaa ∉ L(M))
  q2 --baaa-->* q iff q = q3       (no conclusion about L(M)).

In fact in this case

  L(M) = {u | u contains three consecutive a’s}.

(For qi (i = 0, 1, 2) corresponds to the state in the process of reading a string in which the last i symbols read were all a’s.) So L(M) coincides with the language L(r) determined by the regular expression

  r = (a|b)* aaa (a|b)*

(cf. Slide 8).
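The acceptance procedure just described, and the three calculations of Example 2.1.1, can be replayed with a short Python sketch. It rests on an assumption: the transition table below is our reconstruction of the automaton of Slide 10 from the description in the text (reading b in q0, q1 or q2 leads back to q0, and q3 loops on both symbols). The names DELTA, run and accepts are our own.

```python
# Assumed transition table for the automaton M of Slide 10 (reconstructed
# from the text, not copied from the diagram).
DELTA = {
    ("q0", "a"): "q1", ("q0", "b"): "q0",
    ("q1", "a"): "q2", ("q1", "b"): "q0",
    ("q2", "a"): "q3", ("q2", "b"): "q0",
    ("q3", "a"): "q3", ("q3", "b"): "q3",
}
START = "q0"
ACCEPT = {"q3"}

def run(state, u):
    """Follow the transitions labelled by the symbols of u, left to right."""
    for symbol in u:
        state = DELTA[(state, symbol)]
    return state

def accepts(u):
    """u is in L(M) iff reading u from the start state ends in an accepting state."""
    return run(START, u) in ACCEPT

print(run("q0", "aaab"))   # state reached from q0 on input aaab
print(run("q0", "abaa"))   # state reached from q0 on input abaa
print(run("q2", "baaa"))   # state reached from q2 on input baaa
```

Because the table gives exactly one next state for each (state, symbol) pair, run never has to make a choice; this is the determinism discussed in Section 2.2.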
A non-deterministic finite automaton (NFA), M, is specified by
• a finite set States_M (of states)
• a finite set Σ_M (the alphabet of input symbols)
• for each q ∈ States_M and each a ∈ Σ_M, a subset Δ_M(q, a) ⊆ States_M (the set of states that can be reached from q with a single transition labelled a)
• an element s_M ∈ States_M (the start state)
• a subset Accept_M ⊆ States_M (of accepting states)
Slide 12

2.2 Determinism, non-determinism, and ε-transitions

Slide 12 gives the formal definition of the notion of finite automaton. Note that the function Δ_M gives a precise way of specifying the allowed transitions of M, via:

  q --a--> q′ iff q′ ∈ Δ_M(q, a).

The reason for the qualification ‘non-deterministic’ on Slide 12 is that in general, for each state q ∈ States_M and each input symbol a ∈ Σ_M, we allow the possibilities that there are no, one, or many states that can be reached in a single transition labelled a from q, corresponding to the cases that Δ_M(q, a) has no, one, or many elements. For example, if M is the NFA pictured on Slide 13, then

  Δ_M(q1, b) = ∅, i.e. in M, no state can be reached from q1 with a transition labelled b;
  Δ_M(q1, a) = {q2}, i.e. in M, precisely one state can be reached from q1 with a transition labelled a;
  Δ_M(q0, a) = {q0, q1}, i.e. in M, precisely two states can be reached from q0 with a transition labelled a.

Example of a non-deterministic finite automaton
Input alphabet: {a, b}.
States, transitions, start state, and accepting states as shown:
[Diagram: transition graph with states q0, q1, q2, q3 and edges labelled a and b]
The language accepted by this automaton is the same as for the automaton on Slide 10, namely {u ∈ {a, b}* | u contains three consecutive a’s}.
Slide 13

When each subset Δ_M(q, a) has exactly one element we say that M is deterministic.
This is a particularly important case and is singled out for definition on Slide 14.

The finite automaton pictured on Slide 10 is deterministic. But note that if we took the same graph of transitions but insisted that the alphabet of input symbols was {a, b, c} say, then we have specified an NFA not a DFA, since for example Δ_M(q0, c) = ∅. The moral of this is: when specifying an NFA, as well as giving the graph of state transitions, it is important to say what is the alphabet of input symbols (because some input symbols may not appear in the graph at all).

When constructing machines for matching strings with regular expressions (as we will do in Section 3) it is useful to consider finite state machines exhibiting an ‘internal’ form of non-determinism in which the machine is allowed to change state without consuming any input symbol. One calls such transitions ε-transitions and writes them as

  q --ε--> q′.

This leads to the definition on Slide 15. Note that in an NFAε, M, we always assume that ε is not an element of the alphabet Σ_M of input symbols.

A deterministic finite automaton (DFA)
is an NFA M with the property that for each q ∈ States_M and a ∈ Σ_M, the finite set Δ_M(q, a) contains exactly one element; call it δ_M(q, a).
Thus in this case transitions in M are essentially specified by a next-state function, δ_M, mapping each (state, input symbol)-pair (q, a) to the unique state δ_M(q, a) which can be reached from q by a transition labelled a:
  q --a--> q′ iff q′ = δ_M(q, a)
Slide 14

An NFA with ε-transitions (NFAε)
is specified by an NFA M together with a binary relation, called the ε-transition relation, on the set States_M. We write
  q --ε--> q′
to indicate that the pair of states (q, q′) is in this relation.
Example (with input alphabet = {a, b}):
[Diagram: transition graph of an NFAε with states q0, ..., q7, edges labelled a and b, and ε-transitions]
Slide 15

L(M), language accepted by an NFAε M
consists of all strings u over the alphabet Σ_M of input symbols satisfying q0 ==u==> q with q0 the initial state and q some accepting state. Here · ==−==> · is defined by:
  q ==ε==> q′ iff q = q′ or there is a sequence q --ε--> ··· q′ of one or more ε-transitions in M from q to q′
  q ==a==> q′ (for a ∈ Σ_M) iff q ==ε==> · --a--> · ==ε==> q′
  q ==ab==> q′ (for a, b ∈ Σ_M) iff q ==ε==> · --a--> · ==ε==> · --b--> · ==ε==> q′
  and similarly for longer strings
Slide 16

When using an NFAε M to accept a string u ∈ Σ* of input symbols, we are interested in sequences of transitions in which the symbols in u occur in the correct order, but with zero or more ε-transitions before or after each one. We write

  q ==u==> q′

to indicate that such a sequence exists from state q to state q′ in the NFAε. Then, by definition, u is accepted by the NFAε M iff q0 ==u==> q holds for q0 the start state and q some accepting state: see Slide 16. For example, for the NFAε on Slide 15, it is not too hard to see that the language accepted consists of all strings which either contain two consecutive a’s or contain two consecutive b’s, i.e. the language determined by the regular expression (a|b)*(aa|bb)(a|b)*.

2.3 A subset construction

Note that every DFA is an NFA (whose transition relation is deterministic) and that every NFA is an NFAε (whose ε-transition relation is empty). It might seem that non-determinism and ε-transitions allow a greater range of languages to be characterised as recognisable by a finite automaton, but this is not so.
We can use a construction, called the subset construction, to convert an NFAε M into a DFA PM accepting the same language (at the expense of increasing the number of states, possibly exponentially). Slide 17 gives an example of this construction. The name ‘subset construction’ refers to the fact that there is one state of PM for each subset of the set States_M of states of M. Given two subsets S, S′ ⊆ States_M, there is a transition S --a--> S′ in PM just in case S′ consists of all the M-states q′ reachable from states q in S via the · ==a==> · relation defined on Slide 16, i.e. such that we can get from q to q′ in M via finitely many ε-transitions followed by an a-transition followed by finitely many ε-transitions.

Example of the subset construction
M:
[Diagram: transition graph of an NFAε M with states q0, q1, q2, edges labelled a and b, and ε-transitions]
δ_PM:
                    a                 b
  ∅                 ∅                 ∅
  {q0}              {q0, q1, q2}      {q2}
  {q1}              {q1}              ∅
  {q2}              ∅                 {q2}
  {q0, q1}          {q0, q1, q2}      {q2}
  {q0, q2}          {q0, q1, q2}      {q2}
  {q1, q2}          {q1}              {q2}
  {q0, q1, q2}      {q0, q1, q2}      {q2}
Slide 17

By definition, the start state of PM is the subset of States_M whose elements are the states reachable by ε-transitions from the start state of M; and a subset S ⊆ States_M is an accepting state of PM iff some accepting state of M is an element of S. Thus in the example on Slide 17 the start state is {q0, q1, q2} and

  ab ∈ L(M) because in M: q0 --a--> q0 --ε--> q2 --b--> q2
  ab ∈ L(PM) because in PM: {q0, q1, q2} --a--> {q0, q1, q2} --b--> {q2}.

Indeed, in this case L(M) = L(a*b*) = L(PM). The fact that M and PM accept the same language in this case is no accident, as the Theorem on Slide 18 shows. That slide also gives the definition of the subset construction in general.

Theorem. For each NFAε M there is a DFA PM with the same alphabet of input symbols and accepting exactly the same strings as M, i.e. with L(PM) = L(M)
Definition of PM (refer to Slides 12 and 14):
• States_PM ≝ {S | S ⊆ States_M}
• Σ_PM ≝ Σ_M
• S --a--> S′ in PM iff S′ = δ_PM(S, a), where
  δ_PM(S, a) ≝ {q′ | ∃q ∈ S (q ==a==> q′ in M)}
• s_PM ≝ {q | s_M ==ε==> q}
• Accept_PM ≝ {S ∈ States_PM | ∃q ∈ S (q ∈ Accept_M)}
Slide 18

To prove the theorem on Slide 18, given any NFAε M we have to show that L(M) = L(PM). We split the proof into two halves.

Proof that L(M) ⊆ L(PM). Consider the case of ε first: if ε ∈ L(M), then s_M ==ε==> q for some q ∈ Accept_M, hence s_PM ∈ Accept_PM and thus ε ∈ L(PM). Now given any non-null string u = a1 a2 ... an, if u is accepted by M then there is a sequence of transitions in M of the form

  (1)  s_M ==a1==> q1 ==a2==> ··· ==an==> qn ∈ Accept_M.

Since it is deterministic, feeding a1 a2 ... an to PM results in the sequence of transitions

  (2)  s_PM --a1--> S1 --a2--> ··· --an--> Sn

where S1 = δ_PM(s_PM, a1), S2 = δ_PM(S1, a2), etc. By definition of δ_PM (Slide 18), from (1) we deduce

     q1 ∈ δ_PM(s_PM, a1) = S1
  so q2 ∈ δ_PM(S1, a2) = S2
  ...
  so qn ∈ δ_PM(S_{n−1}, an) = Sn

and hence Sn ∈ Accept_PM (because qn ∈ Accept_M). Therefore (2) shows that u is accepted by PM.

Proof that L(PM) ⊆ L(M). Consider the case of ε first: if ε ∈ L(PM), then s_PM ∈ Accept_PM and so there is some q ∈ s_PM with q ∈ Accept_M, i.e. s_M ==ε==> q ∈ Accept_M and thus ε ∈ L(M). Now given any non-null string u = a1 a2 ... an, if u is accepted by PM then there is a sequence of transitions in PM of the form (2) with Sn ∈ Accept_PM, i.e. with Sn containing some qn ∈ Accept_M. Now since qn ∈ Sn = δ_PM(S_{n−1}, an), by definition of δ_PM there is some q_{n−1} ∈ S_{n−1} with q_{n−1} ==an==> qn in M. Then since q_{n−1} ∈ S_{n−1} = δ_PM(S_{n−2}, a_{n−1}), there is some q_{n−2} ∈ S_{n−2} with q_{n−2} ==a_{n−1}==> q_{n−1}.
Working backwards in this way we can build up a sequence of transitions like (1) until, at the last step, from the fact that q1 ∈ S1 = δ_PM(s_PM, a1) we deduce that s_M ==a1==> q1. So we get a sequence of transitions (1) with qn ∈ Accept_M, and hence u is accepted by M.

2.4 Summary

The important concepts in Section 2 are those of a deterministic finite automaton (DFA) and the language of strings that it accepts. Note that if we know that a language L is of the form L = L(M) for some DFA M, then we have a method for deciding whether or not any given string u (over the alphabet of L) is in L or not: begin in the start state of M and carry out the sequence of transitions given by reading u from left to right (at each step the next state is uniquely determined because M is deterministic); if the final state reached is accepting, then u is in L, otherwise it is not. We also introduced other kinds of finite automata (with non-determinism and ε-transitions) and proved that they determine exactly the same class of languages as DFAs.

2.5 Exercises

Exercise 2.5.1. For each of the two languages mentioned in Exercise 1.4.2 find a DFA that accepts exactly that set of strings.

Exercise 2.5.2. The example of the subset construction given on Slide 17 constructs a DFA with eight states whose language of accepted strings happens to be L(a*b*). Give a DFA with the same language of accepted strings, but fewer states. Give an NFA with even fewer states that does the same job.

Exercise 2.5.3. Given a DFA M, construct a new DFA M′ with the same alphabet of input symbols Σ_M and with the property that for all u ∈ Σ_M*, u is accepted by M′ iff u is not accepted by M.

Exercise 2.5.4. Given two DFAs M1, M2 with the same alphabet Σ of input symbols, construct a third such DFA M with the property that u ∈ Σ* is accepted by M iff it is accepted by both M1 and M2. [Hint: take the states of M to be ordered pairs (q1, q2) of states with q1 ∈ States_M1 and q2 ∈ States_M2.]
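As a companion to Section 2.3, here is a minimal Python sketch of the subset construction of Slide 18, applied to the example of Slide 17. Two assumptions should be kept in mind: the transitions of M below are our reconstruction from the table δ_PM and the surrounding text (q0 --a--> q0, q1 --a--> q1, q2 --b--> q2, with ε-transitions from q0 to q1 and to q2, and q2 accepting), and all the Python names (STEP, EPS, eps_closure, delta_pm and so on) are our own.

```python
from itertools import combinations

# Assumed description of the NFAε M of Slide 17 (a reconstruction).
STATES = ["q0", "q1", "q2"]
ALPHABET = ["a", "b"]
STEP = {("q0", "a"): {"q0"}, ("q1", "a"): {"q1"}, ("q2", "b"): {"q2"}}
EPS = {"q0": {"q1", "q2"}}          # ε-transitions out of each state
START = "q0"
ACCEPT = {"q2"}

def eps_closure(states):
    """All states reachable from `states` by zero or more ε-transitions."""
    closure, todo = set(states), list(states)
    while todo:
        q = todo.pop()
        for nxt in EPS.get(q, set()) - closure:
            closure.add(nxt)
            todo.append(nxt)
    return frozenset(closure)

def delta_pm(S, a):
    """δ_PM(S, a): ε-transitions, then one a-transition, then ε-transitions."""
    after_a = set()
    for q in eps_closure(S):
        after_a |= STEP.get((q, a), set())
    return eps_closure(after_a)

def subsets(states):
    """Every subset of `states`, i.e. every state of PM."""
    return [frozenset(c) for n in range(len(states) + 1)
            for c in combinations(states, n)]

start_pm = eps_closure({START})
table = {(S, a): delta_pm(S, a) for S in subsets(STATES) for a in ALPHABET}

def accepts(u):
    """Run the DFA PM on u; accept iff the final subset meets Accept_M."""
    S = start_pm
    for a in u:
        S = table[(S, a)]
    return bool(S & ACCEPT)

# Print the eight rows of the table, one subset per row.
for S in subsets(STATES):
    print(sorted(S), [sorted(table[(S, a)]) for a in ALPHABET])
```

The sketch builds all eight subsets, as on Slide 17, rather than only those reachable from the start state; it is meant to mirror the definition, not to be efficient.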
Tripos questions
2010.2.9 2009.2.9 2004.2.1(d) 2001.2.1(d) 2000.2.1(b) 1998.2.1(s) 1995.2.19

3 Regular Languages, I

Slide 19 defines the notion of a regular language, which is a set of strings of the form $L(M)$ for some DFA $M$ (cf. Slides 11 and 14). The slide also gives the statement of Kleene's Theorem, which connects regular languages with the notion of matching strings to regular expressions introduced in Section 1: the collection of regular languages coincides with the collection of languages determined by matching strings with regular expressions. The aim of this section is to prove part (a) of Kleene's Theorem. We will tackle part (b) in Section 4.

Definition
A language is regular iff it is the set of strings accepted by some deterministic finite automaton.
Kleene's Theorem
(a) For any regular expression $r$, $L(r)$ is a regular language (cf. Slide 8).
(b) Conversely, every regular language is of the form $L(r)$ for some regular expression $r$.
Slide 19

3.1 Finite automata from regular expressions

Given a regular expression $r$, over an alphabet $\Sigma$ say, we wish to construct a DFA $M$ with alphabet of input symbols $\Sigma$ and with the property that for each $u \in \Sigma^*$, $u$ matches $r$ iff $u$ is accepted by $M$, so that $L(r) = L(M)$. Note that by the Theorem on Slide 18 it is enough to construct an NFA$^\varepsilon$ $N$ with the property $L(N) = L(r)$. For then we can apply the subset construction to $N$ to obtain a DFA $M = PN$ with $L(M) = L(PN) = L(N) = L(r)$. Working with finite automata that are non-deterministic and have $\varepsilon$-transitions simplifies the construction of a suitable finite automaton from $r$.

Let us fix on a particular alphabet $\Sigma$ and from now on only consider finite automata whose set of input symbols is $\Sigma$. The construction of an NFA$^\varepsilon$ for each regular expression $r$ over $\Sigma$ proceeds by recursion on the syntactic structure of the regular expression, as follows.
(i) For each atomic form of regular expression, $a$ ($a \in \Sigma$), $\varepsilon$, and $\emptyset$, we give an NFA$^\varepsilon$ accepting just the strings matching that regular expression.

(ii) Given any NFA$^\varepsilon$s $M_1$ and $M_2$, we construct a new NFA$^\varepsilon$, $\mathrm{Union}(M_1, M_2)$, with the property
$L(\mathrm{Union}(M_1, M_2)) = \{u \mid u \in L(M_1) \text{ or } u \in L(M_2)\}$.
Thus $L(r_1 | r_2) = L(\mathrm{Union}(M_1, M_2))$ when $L(r_1) = L(M_1)$ and $L(r_2) = L(M_2)$.

(iii) Given any NFA$^\varepsilon$s $M_1$ and $M_2$, we construct a new NFA$^\varepsilon$, $\mathrm{Concat}(M_1, M_2)$, with the property
$L(\mathrm{Concat}(M_1, M_2)) = \{u_1 u_2 \mid u_1 \in L(M_1) \text{ and } u_2 \in L(M_2)\}$.
Thus $L(r_1 r_2) = L(\mathrm{Concat}(M_1, M_2))$ when $L(r_1) = L(M_1)$ and $L(r_2) = L(M_2)$.

(iv) Given any NFA$^\varepsilon$ $M$, we construct a new NFA$^\varepsilon$, $\mathrm{Star}(M)$, with the property
$L(\mathrm{Star}(M)) = \{u_1 u_2 \ldots u_n \mid n \ge 0 \text{ and each } u_i \in L(M)\}$.
Thus $L(r^*) = L(\mathrm{Star}(M))$ when $L(r) = L(M)$.

Thus starting with step (i) and applying the constructions in steps (ii)–(iv) over and over again, we eventually build NFA$^\varepsilon$s with the required property for every regular expression $r$. Put more formally, one can prove the statement

for all $n \ge 0$, and for all regular expressions $r$ of size $\le n$, there exists an NFA$^\varepsilon$ $M$ such that $L(r) = L(M)$

by mathematical induction on $n$, using step (i) for the base case and steps (ii)–(iv) for the induction steps. Here we can take the size of a regular expression to be the number of occurrences of union ($-|-$), concatenation ($-\,-$), or star ($-^*$) in it.

Step (i) Slide 20 gives NFAs whose languages of accepted strings are respectively $L(a) = \{a\}$ (any $a \in \Sigma$), $L(\varepsilon) = \{\varepsilon\}$, and $L(\emptyset) = \emptyset$.

NFAs for atomic regular expressions
• [Diagram: start state $q_0$ with an $a$-transition to an accepting state $q_1$.] Just accepts the one-symbol string $a$.
• [Diagram: a single state $q_0$ that is both the start state and accepting, with no transitions.] Just accepts the null string, $\varepsilon$.
• [Diagram: a single non-accepting start state $q_0$ with no transitions.] Accepts no strings.
Slide 20

Step (ii) Given NFA$^\varepsilon$s $M_1$ and $M_2$, the construction of $\mathrm{Union}(M_1, M_2)$ is pictured on Slide 21. First, renaming states if necessary, we assume that $\mathrm{States}_{M_1}$ and $\mathrm{States}_{M_2}$ are disjoint.
Then the states of $\mathrm{Union}(M_1, M_2)$ are all the states in either $M_1$ or $M_2$, together with a new state, called $q_0$ say. The start state of $\mathrm{Union}(M_1, M_2)$ is this $q_0$ and its accepting states are all the states that are accepting in either $M_1$ or $M_2$. Finally, the transitions of $\mathrm{Union}(M_1, M_2)$ are given by all those in either $M_1$ or $M_2$, together with two new $\varepsilon$-transitions out of $q_0$ to the start states of $M_1$ and $M_2$.

Thus if $u \in L(M_1)$, i.e. if we have $s_{M_1} \stackrel{u}{\Rightarrow} q_1$ for some $q_1 \in \mathrm{Accept}_{M_1}$, then we get $q_0 \xrightarrow{\varepsilon} s_{M_1} \stackrel{u}{\Rightarrow} q_1$, showing that $u \in L(\mathrm{Union}(M_1, M_2))$. Similarly for $M_2$. So $L(\mathrm{Union}(M_1, M_2))$ contains the union of $L(M_1)$ and $L(M_2)$. Conversely if $u$ is accepted by $\mathrm{Union}(M_1, M_2)$, there is a transition sequence $q_0 \stackrel{u}{\Rightarrow} q$ with $q \in \mathrm{Accept}_{M_1}$ or $q \in \mathrm{Accept}_{M_2}$. Clearly, in either case this transition sequence has to begin with one or other of the $\varepsilon$-transitions from $q_0$, and thereafter we get a transition sequence entirely in one or other of $M_1$ or $M_2$ finishing in an accepting state for that one. So if $u \in L(\mathrm{Union}(M_1, M_2))$, then either $u \in L(M_1)$ or $u \in L(M_2)$. So we do indeed have
$L(\mathrm{Union}(M_1, M_2)) = \{u \mid u \in L(M_1) \text{ or } u \in L(M_2)\}$.

$\mathrm{Union}(M_1, M_2)$
[Diagram: a new start state $q_0$ with one $\varepsilon$-transition to the start state $s_{M_1}$ of $M_1$ and another to the start state $s_{M_2}$ of $M_2$.]
Set of accepting states is union of $\mathrm{Accept}_{M_1}$ and $\mathrm{Accept}_{M_2}$.
Slide 21

Step (iii) Given NFA$^\varepsilon$s $M_1$ and $M_2$, the construction of $\mathrm{Concat}(M_1, M_2)$ is pictured on Slide 22. First, renaming states if necessary, we assume that $\mathrm{States}_{M_1}$ and $\mathrm{States}_{M_2}$ are disjoint. Then the states of $\mathrm{Concat}(M_1, M_2)$ are all the states in either $M_1$ or $M_2$. The start state of $\mathrm{Concat}(M_1, M_2)$ is the start state of $M_1$. The accepting states of $\mathrm{Concat}(M_1, M_2)$ are the accepting states of $M_2$.
Finally, the transitions of $\mathrm{Concat}(M_1, M_2)$ are given by all those in either $M_1$ or $M_2$, together with new $\varepsilon$-transitions from each accepting state of $M_1$ to the start state of $M_2$ (only one such new transition is shown in the picture).

Thus if $u_1 \in L(M_1)$ and $u_2 \in L(M_2)$, there are transition sequences $s_{M_1} \stackrel{u_1}{\Rightarrow} q_1$ in $M_1$ with $q_1 \in \mathrm{Accept}_{M_1}$, and $s_{M_2} \stackrel{u_2}{\Rightarrow} q_2$ in $M_2$ with $q_2 \in \mathrm{Accept}_{M_2}$. These combine to yield
$s_{M_1} \stackrel{u_1}{\Rightarrow} q_1 \xrightarrow{\varepsilon} s_{M_2} \stackrel{u_2}{\Rightarrow} q_2$
in $\mathrm{Concat}(M_1, M_2)$, witnessing the fact that $u_1 u_2$ is accepted by $\mathrm{Concat}(M_1, M_2)$. Conversely, it is not hard to see that every $v \in L(\mathrm{Concat}(M_1, M_2))$ is of this form. For any transition sequence witnessing the fact that $v$ is accepted starts out in the states of $M_1$ but finishes in the disjoint set of states of $M_2$. At some point in the sequence one of the new $\varepsilon$-transitions occurs to get from $M_1$ to $M_2$ and thus we can split $v$ as $v = u_1 u_2$ with $u_1$ accepted by $M_1$ and $u_2$ accepted by $M_2$. So we do indeed have
$L(\mathrm{Concat}(M_1, M_2)) = \{u_1 u_2 \mid u_1 \in L(M_1) \text{ and } u_2 \in L(M_2)\}$.

$\mathrm{Concat}(M_1, M_2)$
[Diagram: $M_1$, entered at its start state $s_{M_1}$, with an $\varepsilon$-transition from an accepting state of $M_1$ to the start state $s_{M_2}$ of $M_2$.]
Set of accepting states is $\mathrm{Accept}_{M_2}$.
Slide 22

Step (iv) Given an NFA$^\varepsilon$ $M$, the construction of $\mathrm{Star}(M)$ is pictured on Slide 23. The states of $\mathrm{Star}(M)$ are all those of $M$ together with a new state, called $q_0$ say. The start state of $\mathrm{Star}(M)$ is $q_0$ and this is also the only accepting state of $\mathrm{Star}(M)$. Finally, the transitions of $\mathrm{Star}(M)$ are all those of $M$ together with new $\varepsilon$-transitions from $q_0$ to the start state of $M$ and from each accepting state of $M$ to $q_0$ (only one of this latter kind of transition is shown in the picture).

Clearly, $\mathrm{Star}(M)$ accepts $\varepsilon$ (since its start state is accepting) and any concatenation of one or more strings accepted by $M$.
Conversely, if $v$ is accepted by $\mathrm{Star}(M)$, the occurrences of $q_0$ in a transition sequence witnessing this fact allow us to split $v$ into the concatenation of zero or more strings, each of which is accepted by $M$. So we do indeed have
$L(\mathrm{Star}(M)) = \{u_1 u_2 \ldots u_n \mid n \ge 0 \text{ and each } u_i \in L(M)\}$.

$\mathrm{Star}(M)$
[Diagram: a new accepting start state $q_0$ with an $\varepsilon$-transition to the start state $s_M$ of $M$, and an $\varepsilon$-transition from an accepting state of $M$ back to $q_0$.]
The only accepting state of $\mathrm{Star}(M)$ is $q_0$.
Slide 23

This completes the proof of part (a) of Kleene's Theorem (Slide 19). Figure 1 shows how the step-by-step construction applies in the case of the regular expression $(a|b)^* a$ to produce an NFA$^\varepsilon$ $M$ satisfying $L(M) = L((a|b)^* a)$. Of course an automaton with fewer states and $\varepsilon$-transitions doing the same job can be crafted by hand. The point of the construction is that it provides an automatic way of producing automata for any given regular expression.

3.2 Decidability of matching

The proof of part (a) of Kleene's Theorem provides us with a positive answer to question (a) on Slide 9. In other words, it provides a method that, given any string $u$ and regular expression $r$, decides whether or not $u$ matches $r$. The method is:
• construct a DFA $M$ satisfying $L(M) = L(r)$;
• beginning in $M$'s start state, carry out the sequence of transitions in $M$ corresponding to the string $u$, reaching some state $q$ of $M$ (because $M$ is deterministic, there is a unique such transition sequence);
• check whether $q$ is accepting or not: if it is, then $u \in L(M) = L(r)$, so $u$ matches $r$; otherwise $u \notin L(M) = L(r)$, so $u$ does not match $r$.

Note. The subset construction used to convert the NFA$^\varepsilon$ resulting from steps (i)–(iv) of Section 3.1 to a DFA produces an exponential blow-up of the number of states.
($PM$ has $2^n$ states if $M$ has $n$.) This makes the method described above very inefficient. (Much more efficient algorithms exist.)

[Figure 1: Steps in constructing an NFA$^\varepsilon$ for $(a|b)^* a$. Panels: step of type (i), $a$; step of type (i), $b$; step of type (ii), $a|b$; step of type (iv), $(a|b)^*$; step of type (iii), $(a|b)^* a$.]

3.3 Exercises

Exercise 3.3.1. Why can't the automaton $\mathrm{Star}(M)$ required in step (iv) of Section 3.1 be constructed simply by taking $M$, making its start state the only accepting state and adding new $\varepsilon$-transitions back from each old accepting state to its start state?

Exercise 3.3.2. Work through the steps in Section 3.1 to construct an NFA$^\varepsilon$ $M$ satisfying $L(M) = L((\varepsilon|b)^* a a b^*)$. Do the same for some other regular expressions.

Exercise 3.3.3. Show that any finite set of strings is a regular language.

4 Regular Languages, II

The aim of this section is to prove part (b) of Kleene's Theorem (Slide 19).

4.1 Regular expressions from finite automata

Given any DFA $M$, we have to find a regular expression $r$ (over the alphabet of input symbols of $M$) satisfying $L(r) = L(M)$. In fact we do something more general than this, as described in the Lemma on Slide 24 (see footnote 1). Note that if we can find such regular expressions $r^Q_{q,q'}$ for any choice of $Q$, $q$, and $q'$, then the problem is solved. For taking $Q$ to be the whole of $\mathrm{States}_M$ and $q$ to be the start state, $s$ say, then by definition of $r^Q_{s,q'}$, a string $u$ matches this regular expression iff there is a transition sequence $s \xrightarrow{u}{}^{*} q'$ in $M$.
As $q'$ ranges over the finitely many accepting states, $q_1, \ldots, q_k$ say, then we match exactly all the strings accepted by $M$. In other words the regular expression $r^Q_{s,q_1} | \cdots | r^Q_{s,q_k}$ has the property we want for part (b) of Kleene's Theorem. (In case $k = 0$, i.e. there are no accepting states in $M$, then $L(M)$ is empty and so we can use the regular expression $\emptyset$.)

Lemma
Given an NFA $M$, for each subset $Q \subseteq \mathrm{States}_M$ and each pair of states $q, q' \in \mathrm{States}_M$, there is a regular expression $r^Q_{q,q'}$ satisfying
$L(r^Q_{q,q'}) = \{u \in (\Sigma_M)^* \mid q \xrightarrow{u}{}^{*} q' \text{ in } M \text{ with all intermediate states of the sequence in } Q\}$.
Hence $L(M) = L(r)$, where $r = r_1 | \cdots | r_k$ and
$k$ = number of accepting states,
$r_i = r^Q_{s,q_i}$ with $Q = \mathrm{States}_M$,
$s$ = start state,
$q_i$ = $i$th accepting state.
(In case $k = 0$, take $r$ to be the regular expression $\emptyset$.)
Slide 24

Footnote 1: The lemma works just as well whether $M$ is deterministic or non-deterministic; it also works for NFA$^\varepsilon$s, provided we replace $\xrightarrow{u}{}^{*}$ by $\stackrel{u}{\Rightarrow}$ (cf. Slide 16).

Proof of the Lemma on Slide 24. The regular expression $r^Q_{q,q'}$ can be constructed by induction on the number of elements in the subset $Q$.

Base case, $Q$ is empty. In this case, for each pair of states $q, q'$, we are looking for a regular expression to describe the set of strings
$\{u \mid q \xrightarrow{u}{}^{*} q' \text{ with no intermediate states}\}$.
So each element of this set is either a single input symbol $a$ (if $q \xrightarrow{a} q'$ holds in $M$) or possibly $\varepsilon$, in case $q = q'$. If there are no input symbols that take us from $q$ to $q'$ in $M$, we can simply take
$r^{\emptyset}_{q,q'} \stackrel{\mathrm{def}}{=} \begin{cases} \emptyset & \text{if } q \neq q' \\ \varepsilon & \text{if } q = q'. \end{cases}$
On the other hand, if there are some such input symbols, $a_1, \ldots, a_k$ say, we can take
$r^{\emptyset}_{q,q'} \stackrel{\mathrm{def}}{=} \begin{cases} a_1 | \cdots | a_k & \text{if } q \neq q' \\ a_1 | \cdots | a_k | \varepsilon & \text{if } q = q'. \end{cases}$

Induction step. Suppose we have defined the required regular expressions for all subsets of states with $n$ elements.
If $Q$ is a subset with $n + 1$ elements, choose some element $q_0 \in Q$ and consider the $n$-element set $Q \setminus \{q_0\} = \{q \in Q \mid q \neq q_0\}$. Then for any pair of states $q, q' \in \mathrm{States}_M$, by inductive hypothesis we have already constructed the regular expressions
$r_1 \stackrel{\mathrm{def}}{=} r^{Q \setminus \{q_0\}}_{q,q'}$, $r_2 \stackrel{\mathrm{def}}{=} r^{Q \setminus \{q_0\}}_{q,q_0}$, $r_3 \stackrel{\mathrm{def}}{=} r^{Q \setminus \{q_0\}}_{q_0,q_0}$, and $r_4 \stackrel{\mathrm{def}}{=} r^{Q \setminus \{q_0\}}_{q_0,q'}$.
Consider the regular expression
$r \stackrel{\mathrm{def}}{=} r_1 | r_2 (r_3)^* r_4$.
Clearly every string matching $r$ is in the set
$\{u \mid q \xrightarrow{u}{}^{*} q' \text{ with all intermediate states in this sequence in } Q\}$.
Conversely, if $u$ is in this set, consider the number of times the sequence of transitions $q \xrightarrow{u}{}^{*} q'$ passes through state $q_0$. If this number is zero then $u \in L(r_1)$ (by definition of $r_1$). Otherwise this number is $k \ge 1$ and the sequence splits into $k + 1$ pieces: the first piece is in $L(r_2)$ (as the sequence goes from $q$ to the first occurrence of $q_0$), the next $k - 1$ pieces are in $L(r_3)$ (as the sequence goes from one occurrence of $q_0$ to the next), and the last piece is in $L(r_4)$ (as the sequence goes from the last occurrence of $q_0$ to $q'$). So in this case $u$ is in $L(r_2 (r_3)^* r_4)$. So in either case $u$ is in $L(r)$. So to complete the induction step we can define $r^Q_{q,q'}$ to be this regular expression $r = r_1 | r_2 (r_3)^* r_4$.

4.2 An example

Perhaps an example will help to understand the rather clever argument in Section 4.1. The example will also demonstrate that we do not have to pursue the inductive construction of the regular expression to the bitter end (the base case $Q = \emptyset$): often it is possible to find some of the regular expressions $r^Q_{q,q'}$ one needs by ad hoc arguments.

Note also that at the inductive steps in the construction of a regular expression for $M$ we are free to choose which state $q_0$ to remove from the current state set $Q$. A good rule of thumb is: choose a state that disconnects the automaton as much as possible.
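The inductive construction of $r^Q_{q,q'}$ from Section 4.1 can be sketched as a recursive Python function. This is a minimal sketch under stated assumptions: the tuple encoding of regular expressions, the helper names, the brute-force matcher, and the five-transition example NFA are all hypothetical and chosen only for illustration. The sketch always removes the smallest state of $Q$ rather than following the rule of thumb above, and it pursues the induction all the way to the base case $Q = \emptyset$.

```python
from functools import lru_cache
from itertools import product

EMPTY, EPS = ("empty",), ("eps",)   # the regular expressions ∅ and ε


def union(r1, r2):
    if r1 == EMPTY:
        return r2
    if r2 == EMPTY:
        return r1
    return ("union", r1, r2)


def concat(r1, r2):
    if EMPTY in (r1, r2):
        return EMPTY
    if r1 == EPS:
        return r2
    if r2 == EPS:
        return r1
    return ("concat", r1, r2)


def star(r):
    return EPS if r in (EMPTY, EPS) else ("star", r)


def regex_for(delta, Q, q, q2):
    """r^Q_{q,q2}: strings taking q to q2 with all intermediate states in Q."""
    if not Q:                                   # base case, Q is empty
        r = EPS if q == q2 else EMPTY
        for (p, a), targets in sorted(delta.items()):
            if p == q and q2 in targets:
                r = union(r, ("sym", a))
        return r
    q0 = min(Q)                                 # any element of Q would do
    rest = Q - {q0}
    r1 = regex_for(delta, rest, q, q2)
    r2 = regex_for(delta, rest, q, q0)
    r3 = regex_for(delta, rest, q0, q0)
    r4 = regex_for(delta, rest, q0, q2)
    return union(r1, concat(r2, concat(star(r3), r4)))   # r1 | r2 (r3)* r4


@lru_cache(maxsize=None)
def matches(r, u):
    """Brute-force matcher for the tuple encoding (tries every split of u)."""
    kind = r[0]
    if kind == "empty":
        return False
    if kind == "eps":
        return u == ""
    if kind == "sym":
        return u == r[1]
    if kind == "union":
        return matches(r[1], u) or matches(r[2], u)
    if kind == "concat":
        return any(matches(r[1], u[:i]) and matches(r[2], u[i:])
                   for i in range(len(u) + 1))
    # star: u is empty, or a non-empty prefix matches r[1] and the rest matches r
    return u == "" or any(matches(r[1], u[:i]) and matches(r, u[i:])
                          for i in range(1, len(u) + 1))


def nfa_accepts(delta, start, accept, u):
    current = {start}
    for a in u:
        current = set().union(*(delta.get((q, a), set()) for q in current))
    return bool(current & accept)


# Hypothetical NFA with states 0, 1, 2; state 0 is the start and only accepting state.
DELTA = {(0, "a"): {0}, (0, "b"): {1}, (1, "a"): {2}, (2, "a"): {0}, (2, "b"): {1}}
r = regex_for(DELTA, frozenset({0, 1, 2}), 0, 0)
```

Because the expression returned is built mechanically, it is typically much larger than one found by inspection; the check in the test harness only compares it with the NFA on all short strings.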
Example
[Diagram: an NFA with states 0, 1 and 2, in which state 0 is both the start state and the only accepting state. Transitions: $0 \xrightarrow{a} 0$, $0 \xrightarrow{b} 1$, $1 \xrightarrow{a} 2$, $2 \xrightarrow{a} 0$, $2 \xrightarrow{b} 1$.]
Direct inspection yields:

$r^{\{0\}}_{i,j}$   j=0      j=1      j=2
i=1        ∅        ε        a
i=2        aa*      a*b      ε

$r^{\{0,2\}}_{i,j}$ j=0      j=1
i=0        a*       a*b
Slide 25

As an example, consider the NFA shown on Slide 25. Since the start state is 0 and this is also the only accepting state, the language of accepted strings is that determined by the regular expression $r^{\{0,1,2\}}_{0,0}$. Choosing to remove state 1 from the state set, we have

(3) $L(r^{\{0,1,2\}}_{0,0}) = L(r^{\{0,2\}}_{0,0} \,|\, r^{\{0,2\}}_{0,1} (r^{\{0,2\}}_{1,1})^* r^{\{0,2\}}_{1,0})$.

Direct inspection shows that $L(r^{\{0,2\}}_{0,0}) = L(a^*)$ and $L(r^{\{0,2\}}_{0,1}) = L(a^* b)$. To calculate $L(r^{\{0,2\}}_{1,1})$ and $L(r^{\{0,2\}}_{1,0})$, we choose to remove state 2:
$L(r^{\{0,2\}}_{1,1}) = L(r^{\{0\}}_{1,1} \,|\, r^{\{0\}}_{1,2} (r^{\{0\}}_{2,2})^* r^{\{0\}}_{2,1})$
$L(r^{\{0,2\}}_{1,0}) = L(r^{\{0\}}_{1,0} \,|\, r^{\{0\}}_{1,2} (r^{\{0\}}_{2,2})^* r^{\{0\}}_{2,0})$.
These regular expressions can all be determined by inspection, as shown on Slide 25. Thus
$L(r^{\{0,2\}}_{1,1}) = L(\varepsilon | a(\varepsilon)^*(a^* b))$
and it's not hard to see that this is equal to $L(\varepsilon | a a^* b)$; and
$L(r^{\{0,2\}}_{1,0}) = L(\emptyset | a(\varepsilon)^*(a a^*))$
which is equal to $L(a a a^*)$. Substituting all these values into (3), we get
$L(r^{\{0,1,2\}}_{0,0}) = L(a^* | a^* b (\varepsilon | a a^* b)^* a a a^*)$.
So $a^* | a^* b (\varepsilon | a a^* b)^* a a a^*$ is a regular expression whose matching strings comprise the language accepted by the NFA on Slide 25. (Clearly, one could simplify this to a smaller, but equivalent regular expression (in the sense of Slide 8), but we do not bother to do so.)

4.3 Complement and intersection of regular languages

We saw in Section 3.2 that part (a) of Kleene's Theorem allows us to answer question (a) on Slide 9. Now that we have proved the other half of the theorem, we can say more about question (b) on that slide.

Complementation Recall that on page 8 we mentioned that for each regular expression $r$ over an alphabet $\Sigma$, we can find a regular expression $\sim(r)$ that determines the complement of the language determined by $r$:
$L(\sim(r)) = \{u \in \Sigma^* \mid u \notin L(r)\}$.
As we now show, this is a consequence of Kleene's Theorem.

$\mathrm{Not}(M)$
• $\mathrm{States}_{\mathrm{Not}(M)} \stackrel{\mathrm{def}}{=} \mathrm{States}_M$
• $\Sigma_{\mathrm{Not}(M)} \stackrel{\mathrm{def}}{=} \Sigma_M$
• transitions of $\mathrm{Not}(M)$ = transitions of $M$
• start state of $\mathrm{Not}(M)$ = start state of $M$
• $\mathrm{Accept}_{\mathrm{Not}(M)} = \{q \in \mathrm{States}_M \mid q \notin \mathrm{Accept}_M\}$.
Provided $M$ is a deterministic finite automaton, then $u$ is accepted by $\mathrm{Not}(M)$ iff it is not accepted by $M$:
$L(\mathrm{Not}(M)) = \{u \in \Sigma^* \mid u \notin L(M)\}$.
Slide 26

Lemma 4.3.1. If $L$ is a regular language over alphabet $\Sigma$, then its complement $\{u \in \Sigma^* \mid u \notin L\}$ is also regular.

Proof. Since $L$ is regular, by definition there is a DFA $M$ such that $L = L(M)$. Let $\mathrm{Not}(M)$ be the DFA constructed from $M$ as indicated on Slide 26. Then $\{u \in \Sigma^* \mid u \notin L\}$ is the set of strings accepted by $\mathrm{Not}(M)$ and hence is regular.

Given a regular expression $r$, by part (a) of Kleene's Theorem there is a DFA $M$ such that $L(r) = L(M)$. Then by part (b) of the theorem applied to the DFA $\mathrm{Not}(M)$, we can find a regular expression $\sim(r)$ so that $L(\sim(r)) = L(\mathrm{Not}(M))$. Since
$L(\mathrm{Not}(M)) = \{u \in \Sigma^* \mid u \notin L(M)\} = \{u \in \Sigma^* \mid u \notin L(r)\}$,
this $\sim(r)$ is the regular expression we need for the complement of $r$.

Note. The construction given on Slide 26 can be applied to a finite automaton $M$ whether or not it is deterministic. However, for $L(\mathrm{Not}(M))$ to equal $\{u \in \Sigma^* \mid u \notin L(M)\}$ we need $M$ to be deterministic. See Exercise 4.4.2.

Intersection As another example of the power of Kleene's Theorem, given regular expressions $r_1$ and $r_2$ we can show the existence of a regular expression $(r_1 \& r_2)$ with the property: $u$ matches $(r_1 \& r_2)$ iff $u$ matches $r_1$ and $u$ matches $r_2$. This can be deduced from the following lemma.

Lemma 4.3.2. If $L_1$ and $L_2$ are regular languages over an alphabet $\Sigma$, then their intersection
$L_1 \cap L_2 \stackrel{\mathrm{def}}{=} \{u \in \Sigma^* \mid u \in L_1 \text{ and } u \in L_2\}$
is also regular.

Proof. Since $L_1$ and $L_2$ are regular languages, there are DFAs $M_1$ and $M_2$ such that $L_i = L(M_i)$ ($i = 1, 2$).
Let $\mathrm{And}(M_1, M_2)$ be the DFA constructed from $M_1$ and $M_2$ as on Slide 27. It is not hard to see that $\mathrm{And}(M_1, M_2)$ has the property that any $u \in \Sigma^*$ is accepted by $\mathrm{And}(M_1, M_2)$ iff it is accepted by both $M_1$ and $M_2$. Thus $L_1 \cap L_2 = L(\mathrm{And}(M_1, M_2))$ is a regular language.

$\mathrm{And}(M_1, M_2)$
• states of $\mathrm{And}(M_1, M_2)$ are all ordered pairs $(q_1, q_2)$ with $q_1 \in \mathrm{States}_{M_1}$ and $q_2 \in \mathrm{States}_{M_2}$
• alphabet of $\mathrm{And}(M_1, M_2)$ is the common alphabet of $M_1$ and $M_2$
• $(q_1, q_2) \xrightarrow{a} (q_1', q_2')$ in $\mathrm{And}(M_1, M_2)$ iff $q_1 \xrightarrow{a} q_1'$ in $M_1$ and $q_2 \xrightarrow{a} q_2'$ in $M_2$
• start state of $\mathrm{And}(M_1, M_2)$ is $(s_{M_1}, s_{M_2})$
• $(q_1, q_2)$ accepting in $\mathrm{And}(M_1, M_2)$ iff $q_1$ accepting in $M_1$ and $q_2$ accepting in $M_2$.
Slide 27

Thus given regular expressions $r_1$ and $r_2$, by part (a) of Kleene's Theorem we can find DFAs $M_1$ and $M_2$ with $L(r_i) = L(M_i)$ ($i = 1, 2$). Then by part (b) of the theorem we can find a regular expression $r_1 \& r_2$ so that $L(r_1 \& r_2) = L(\mathrm{And}(M_1, M_2))$. Thus $u$ matches $r_1 \& r_2$ iff $\mathrm{And}(M_1, M_2)$ accepts $u$, iff both $M_1$ and $M_2$ accept $u$, iff $u$ matches both $r_1$ and $r_2$, as required.

4.4 Exercises

Exercise 4.4.1. Use the construction in Section 4.1 to find a regular expression for the DFA $M$ whose state set is $\{0, 1, 2\}$, whose start state is 0, whose only accepting state is 2, whose alphabet of input symbols is $\{a, b\}$, and whose next-state function is given by the following table.

$\delta_M$:   a   b
0      1   2
1      2   1
2      2   1

Exercise 4.4.2. The construction $M \mapsto \mathrm{Not}(M)$ given on Slide 26 applies to both DFAs and NFAs; but for $L(\mathrm{Not}(M))$ to be the complement of $L(M)$ we need $M$ to be deterministic. Give an example of an alphabet $\Sigma$ and an NFA $M$ with set of input symbols $\Sigma$, such that $\{u \in \Sigma^* \mid u \notin L(M)\}$ is not the same set as $L(\mathrm{Not}(M))$.

Exercise 4.4.3. Let $r = (a|b)^* ab (a|b)^*$. Find a complement for $r$ over the alphabet $\Sigma = \{a, b\}$, i.e. a regular expression $\sim(r)$ over the alphabet $\Sigma$ satisfying $L(\sim(r)) = \{u \in \Sigma^* \mid u \notin L(r)\}$.
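The constructions $\mathrm{Not}(M)$ and $\mathrm{And}(M_1, M_2)$ of Slides 26 and 27 can be sketched in Python as follows. This is an illustrative sketch only: the `DFA` record, the function names and the two example automata are assumptions made here and do not come from the notes. The next-state function is assumed to be total, as it is for a DFA.

```python
from itertools import product
from typing import NamedTuple


class DFA(NamedTuple):
    states: frozenset
    alphabet: frozenset
    delta: dict          # (state, symbol) -> state, defined for every pair
    start: object
    accept: frozenset


def accepts(m, u):
    q = m.start
    for a in u:
        q = m.delta[(q, a)]
    return q in m.accept


def not_dfa(m):
    """Slide 26: same states, alphabet, transitions and start state; accepting states complemented."""
    return m._replace(accept=m.states - m.accept)


def and_dfa(m1, m2):
    """Slide 27: run m1 and m2 in parallel on ordered pairs of states."""
    assert m1.alphabet == m2.alphabet
    states = frozenset(product(m1.states, m2.states))
    delta = {((q1, q2), a): (m1.delta[(q1, a)], m2.delta[(q2, a)])
             for (q1, q2) in states for a in m1.alphabet}
    accept = frozenset((q1, q2) for (q1, q2) in states
                       if q1 in m1.accept and q2 in m2.accept)
    return DFA(states, m1.alphabet, delta, (m1.start, m2.start), accept)


# Two hypothetical DFAs over {a, b}.
AB = frozenset("ab")
EVEN_A = DFA(frozenset({0, 1}), AB,                      # even number of a's
             {(0, "a"): 1, (0, "b"): 0, (1, "a"): 0, (1, "b"): 1},
             0, frozenset({0}))
ENDS_B = DFA(frozenset({0, 1}), AB,                      # last symbol is b
             {(0, "a"): 0, (0, "b"): 1, (1, "a"): 0, (1, "b"): 1},
             0, frozenset({1}))

BOTH = and_dfa(EVEN_A, ENDS_B)
ODD_A = not_dfa(EVEN_A)
```

Complementing the accepting states is only correct here because each string has exactly one transition sequence in a DFA; that is the point of the Note after Lemma 4.3.1.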
Tripos questions
2003.2.9 2000.2.7 1995.2.20

5 The Pumping Lemma

In the context of programming languages, a typical example of a regular language (Slide 19) is the set of all strings of characters which are well-formed tokens (basic keywords, identifiers, etc) in a particular programming language, Java say. By contrast, the set of all strings which represent well-formed Java programs is a typical example of a language that is not regular. Slide 28 gives some simpler examples of non-regular languages. For example, there is no way to use a search based on matching a regular expression to find all the palindromes in a piece of text (although of course there are other kinds of algorithm for doing this).

Examples of non-regular languages
• The set of strings over $\{(, ), a, b, \ldots, z\}$ in which the parentheses '(' and ')' occur well-nested.
• The set of strings over $\{a, b, \ldots, z\}$ which are palindromes, i.e. which read the same backwards as forwards.
• $\{a^n b^n \mid n \ge 0\}$
Slide 28

The intuitive reason why the languages listed on Slide 28 are not regular is that a machine for recognising whether or not any given string is in the language would need infinitely many different states (whereas a characteristic feature of the machines we have been using is that they have only finitely many states). For example, to recognise that a string is of the form $a^n b^n$ one would need to remember how many $a$s had been seen before the first $b$ is encountered, requiring countably many states of the form 'just seen $n$ $a$s'. This section makes this intuitive argument rigorous and describes a useful way of showing that languages such as these are not regular.

The fact that a finite automaton has only finitely many states means that as we look at longer and longer strings that it accepts, we see a certain kind of repetition: the pumping lemma property given on Slide 29.
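The repetition referred to above can be observed concretely by running a DFA on a string at least as long as its number of states and recording the first state that is visited twice. The following Python sketch is illustrative only: the dictionary encoding, the function names and the two-state example DFA (accepting strings with an even number of $a$s) are hypothetical, not taken from the notes.

```python
def first_repeat(delta, start, w):
    """Return (i, j) with i < j such that the DFA is in the same state after
    reading i symbols and after reading j symbols of w, or None if no state repeats."""
    seen = {start: 0}            # state -> number of symbols read when first visited
    q = start
    for j, a in enumerate(w, start=1):
        q = delta[(q, a)]
        if q in seen:
            return seen[q], j
        seen[q] = j
    return None


def accepts(delta, start, accept, w):
    q = start
    for a in w:
        q = delta[(q, a)]
    return q in accept


# Hypothetical two-state DFA: state 0 = even number of a's seen (accepting), state 1 = odd.
EVEN_A = {(0, "a"): 1, (0, "b"): 0, (1, "a"): 0, (1, "b"): 1}

w = "abab"                                   # accepted, and at least as long as the 2 states
i, j = first_repeat(EVEN_A, 0, w)
u1, v, u2 = w[:i], w[i:j], w[j:]             # v labels a loop from a state back to itself
pumped = [u1 + v * n + u2 for n in range(5)]
```

Every string in `pumped` is accepted by the same DFA, because repeating or deleting `v` only adds or removes trips round the loop.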
The Pumping Lemma
For every regular language $L$, there is a number $\ell \ge 1$ satisfying the pumping lemma property: all $w \in L$ with $\mathrm{length}(w) \ge \ell$ can be expressed as a concatenation of three strings, $w = u_1 v u_2$, where $u_1$, $v$ and $u_2$ satisfy:
• $\mathrm{length}(v) \ge 1$ (i.e. $v \neq \varepsilon$)
• $\mathrm{length}(u_1 v) \le \ell$
• for all $n \ge 0$, $u_1 v^n u_2 \in L$ (i.e. $u_1 u_2 \in L$, $u_1 v u_2 \in L$ [but we knew that anyway], $u_1 v v u_2 \in L$, $u_1 v v v u_2 \in L$, etc).
Slide 29

5.1 Proving the Pumping Lemma

Since $L$ is regular, it is equal to the set $L(M)$ of strings accepted by some DFA $M$. Then we can take the number $\ell$ mentioned on Slide 29 to be the number of states in $M$. For suppose $w = a_1 a_2 \ldots a_n$ with $n \ge \ell$. If $w \in L(M)$, then there is a transition sequence as shown at the top of Slide 30. Then $w$ can be split into three pieces as shown on that slide. Note that by choice of $i$ and $j$, $\mathrm{length}(v) = j - i \ge 1$ and $\mathrm{length}(u_1 v) = j \le \ell$. So it just remains to check that $u_1 v^n u_2 \in L$ for all $n \ge 0$. As shown on the lower half of Slide 30, the string $v$ takes the machine $M$ from state $q_i$ back to the same state (since $q_i = q_j$). So for any $n$, $u_1 v^n u_2$ takes us from the initial state $s_M = q_0$ to $q_i$, then $n$ times round the loop from $q_i$ to itself, and then from $q_i$ to $q_n \in \mathrm{Accept}_M$. Therefore for any $n \ge 0$, $u_1 v^n u_2$ is accepted by $M$, i.e. $u_1 v^n u_2 \in L$.

Note. In the above construction it is perfectly possible that $i = 0$, in which case $u_1$ is the null string, $\varepsilon$.

If $n \ge \ell$ = number of states of $M$, then in
$s_M = \underbrace{q_0 \xrightarrow{a_1} q_1 \xrightarrow{a_2} q_2 \cdots \xrightarrow{a_\ell} q_\ell}_{\ell + 1 \text{ states}} \cdots \xrightarrow{a_n} q_n \in \mathrm{Accept}_M$
$q_0, \ldots, q_\ell$ can't all be distinct states. So $q_i = q_j$ for some $0 \le i < j \le \ell$. So the above transition sequence looks like
$s_M = q_0 \xrightarrow{u_1}{}^{*} q_i = q_j \xrightarrow{u_2}{}^{*} q_n \in \mathrm{Accept}_M$, with a loop $q_i \xrightarrow{v}{}^{*} q_j = q_i$,
where
$u_1 \stackrel{\mathrm{def}}{=} a_1 \ldots a_i$, $v \stackrel{\mathrm{def}}{=} a_{i+1} \ldots a_j$, $u_2 \stackrel{\mathrm{def}}{=} a_{j+1} \ldots a_n$.
Slide 30

Remark 5.1.1.
One consequence of the pumping lemma property of $L$ and $\ell$ is that if there is any string $w$ in $L$ of length $\ge \ell$, then $L$ contains arbitrarily long strings. (We just 'pump up' $w$ by increasing $n$.)

If you did Exercise 3.3.3, you will know that if $L$ is a finite set of strings then it is regular. In this case, what is the number $\ell$ with the property on Slide 29? The answer is that we can take any $\ell$ strictly greater than the length of any string in the finite set $L$. Then the pumping lemma property is trivially satisfied because there are no $w \in L$ with $\mathrm{length}(w) \ge \ell$ for which we have to check the condition!

5.2 Using the Pumping Lemma

The Pumping Lemma (Slide 29) says that every regular language has a certain property, namely that there exists a number $\ell$ with the pumping lemma property. So to show that a language $L$ is not regular, it suffices to show that no $\ell \ge 1$ possesses the pumping lemma property for the language $L$. Because the pumping lemma property involves quite a complicated alternation of quantifiers, it will help to spell out explicitly what its negation is. This is done on Slide 31. Slide 32 gives some examples.

How to use the Pumping Lemma to prove that a language $L$ is not regular
For each $\ell \ge 1$, find some $w \in L$ of length $\ge \ell$ so that
(†) no matter how $w$ is split into three, $w = u_1 v u_2$, with $\mathrm{length}(u_1 v) \le \ell$ and $\mathrm{length}(v) \ge 1$, there is some $n \ge 0$ for which $u_1 v^n u_2$ is not in $L$.
Slide 31

Examples
(i) $L_1 \stackrel{\mathrm{def}}{=} \{a^n b^n \mid n \ge 0\}$ is not regular.
[For each $\ell \ge 1$, $a^\ell b^\ell \in L_1$ is of length $\ge \ell$ and has property (†) on Slide 31.]
(ii) $L_2 \stackrel{\mathrm{def}}{=} \{w \in \{a, b\}^* \mid w \text{ a palindrome}\}$ is not regular.
[For each $\ell \ge 1$, $a^\ell b a^\ell \in L_2$ is of length $\ge \ell$ and has property (†).]
(iii) $L_3 \stackrel{\mathrm{def}}{=} \{a^p \mid p \text{ prime}\}$ is not regular.
[For each $\ell \ge 1$, we can find a prime $p$ with $p > 2\ell$ and then $a^p \in L_3$ has length $\ge \ell$ and has property (†).]
Slide 32

Proof of the examples on Slide 32. We use the method on Slide 31.
(i) For any $\ell \ge 1$, consider the string $w = a^\ell b^\ell$. It is in $L_1$ and has length $\ge \ell$. We show that property (†) holds for this $w$. For suppose $w = a^\ell b^\ell$ is split as $w = u_1 v u_2$ with $\mathrm{length}(u_1 v) \le \ell$ and $\mathrm{length}(v) \ge 1$. Then $u_1 v$ must consist entirely of $a$s, so $u_1 = a^r$ and $v = a^s$ say, and hence $u_2 = a^{\ell - r - s} b^\ell$. Then the case $n = 0$ of $u_1 v^n u_2$ is not in $L_1$ since
$u_1 v^0 u_2 = u_1 u_2 = a^r (a^{\ell - r - s} b^\ell) = a^{\ell - s} b^\ell$
and $a^{\ell - s} b^\ell \notin L_1$ because $\ell - s \neq \ell$ (since $s = \mathrm{length}(v) \ge 1$).

(ii) The argument is very similar to that for example (i), but starting with the palindrome $w = a^\ell b a^\ell$. Once again, the case $n = 0$ of $u_1 v^n u_2$ yields a string $u_1 u_2 = a^{\ell - s} b a^\ell$ which is not a palindrome (because $\ell - s \neq \ell$).

(iii) Given $\ell \ge 1$, since there are infinitely many primes $p$, we can certainly find one satisfying $p > 2\ell$. I claim that $w = a^p$ has property (†). For suppose $w = a^p$ is split as $w = u_1 v u_2$ with $\mathrm{length}(u_1 v) \le \ell$ and $\mathrm{length}(v) \ge 1$. Letting $r \stackrel{\mathrm{def}}{=} \mathrm{length}(u_1)$ and $s \stackrel{\mathrm{def}}{=} \mathrm{length}(v)$, so that $\mathrm{length}(u_2) = p - r - s$, we have
$u_1 v^{p-s} u_2 = a^r a^{s(p-s)} a^{p-r-s} = a^{sp - s^2 + p - s} = a^{(s+1)(p-s)}$.
Now $(s + 1)(p - s)$ is not prime, because $s + 1 > 1$ (since $s = \mathrm{length}(v) \ge 1$) and $p - s > 2\ell - \ell = \ell \ge 1$ (since $p > 2\ell$ by choice, and $s \le r + s = \mathrm{length}(u_1 v) \le \ell$). Therefore $u_1 v^n u_2 \notin L_3$ when $n = p - s$.

Remark 5.2.1. Unfortunately, the method on Slide 31 can't cope with every non-regular language. This is because the pumping lemma property is a necessary, but not a sufficient condition for a language to be regular. In other words there do exist languages $L$ for which a number $\ell \ge 1$ can be found satisfying the pumping lemma property on Slide 29, but which nonetheless are not regular. Slide 33 gives an example of such an $L$.

Example of a non-regular language that satisfies the 'pumping lemma property'
$L \stackrel{\mathrm{def}}{=} \{c^m a^n b^n \mid m \ge 1 \text{ and } n \ge 0\} \cup \{a^m b^n \mid m, n \ge 0\}$
satisfies the pumping lemma property on Slide 29 with $\ell = 1$.
[For any $w \in L$ of length $\ge 1$, can take $u_1 = \varepsilon$, $v$ = first letter of $w$, $u_2$ = rest of $w$.]
But $L$ is not regular. [See Exercise 5.4.2.]
Slide 33

5.3 Decidability of language equivalence

The proof of the Pumping Lemma provides us with a positive answer to question (c) on Slide 9. In other words, it provides a method that, given any two regular expressions $r_1$ and $r_2$ (over the same alphabet $\Sigma$), decides whether or not the languages they determine are equal, $L(r_1) = L(r_2)$.

First note that this problem can be reduced to deciding whether or not the set of strings accepted by any given DFA is empty. For $L(r_1) = L(r_2)$ iff $L(r_1) \subseteq L(r_2)$ and $L(r_2) \subseteq L(r_1)$. Using the results about complementation and intersection in Section 4.3, we can reduce the question of whether or not $L(r_1) \subseteq L(r_2)$ to the question of whether or not $L(r_1 \& (\sim r_2)) = \emptyset$, since
$L(r_1) \subseteq L(r_2)$ iff $L(r_1) \cap \{u \in \Sigma^* \mid u \notin L(r_2)\} = \emptyset$.
By Kleene's Theorem, given $r_1$ and $r_2$ we can first construct regular expressions $r_1 \& (\sim r_2)$ and $r_2 \& (\sim r_1)$, then construct DFAs $M_1$ and $M_2$ such that $L(M_1) = L(r_1 \& (\sim r_2))$ and $L(M_2) = L(r_2 \& (\sim r_1))$. Then $r_1$ and $r_2$ are equivalent iff the languages accepted by $M_1$ and by $M_2$ are both empty.

The fact that, given any DFA $M$, one can decide whether or not $L(M) = \emptyset$ follows from the Lemma on Slide 34. For then, to check whether or not $L(M)$ is empty, we just have to check whether or not any of the finitely many strings of length less than the number of states of $M$ is accepted by $M$.

Lemma
If a DFA $M$ accepts any string at all, it accepts one whose length is less than the number of states in $M$.
Proof. Suppose $M$ has $\ell$ states (so $\ell \ge 1$). If $L(M)$ is not empty, then we can find an element of it of shortest length, $a_1 a_2 \ldots a_n$ say (where $n \ge 0$). Thus there is a transition sequence
$s_M = q_0 \xrightarrow{a_1} q_1 \xrightarrow{a_2} q_2 \cdots \xrightarrow{a_n} q_n \in \mathrm{Accept}_M$.
If $n \ge \ell$, then not all the $n + 1$ states in this sequence can be distinct and we can shorten it as on Slide 30.
But then we would obtain a strictly shorter string in $L(M)$, contradicting the choice of $a_1 a_2 \ldots a_n$. So we must have $n < \ell$.
Slide 34

5.4 Exercises

Exercise 5.4.1. Show that the first language mentioned on Slide 28 is not regular.

Exercise 5.4.2. Show that there is no DFA $M$ for which $L(M)$ is the language on Slide 33. [Hint: argue by contradiction. If there were such an $M$, consider the DFA $M'$ with the same states as $M$, with alphabet of input symbols just consisting of $a$ and $b$, with transitions all those of $M$ which are labelled by $a$ or $b$, with start state $\delta_M(s_M, c)$ (where $s_M$ is the start state of $M$), and with the same accepting states as $M$. Show that the language accepted by $M'$ has to be $\{a^n b^n \mid n \ge 0\}$ and deduce that no such $M$ can exist.]

Exercise 5.4.3. Check the claim made on Slide 33 that the language mentioned there satisfies the pumping lemma property of Slide 29 with $\ell = 1$.

Tripos questions
2011.2.8 2006.2.8 2004.2.9 2002.2.9 2001.2.7 1999.2.7 1998.2.7 1996.2.1(j) 1996.2.8 1995.2.27

6 Grammars

We have seen that regular languages can be specified in terms of finite automata that accept or reject strings, and equivalently, in terms of patterns, or regular expressions, which strings are to match. This section briefly introduces an alternative, 'generative' way of specifying languages.

6.1 Context-free grammars

Some production rules for 'English' sentences
SENTENCE → SUBJECT VERB OBJECT
SUBJECT → ARTICLE NOUNPHRASE
OBJECT → ARTICLE NOUNPHRASE
ARTICLE → a
ARTICLE → the
NOUNPHRASE → NOUN
NOUNPHRASE → ADJECTIVE NOUN
ADJECTIVE → big
ADJECTIVE → small
NOUN → cat
NOUN → dog
VERB → eats
Slide 35

Slide 35 gives an example of a context-free grammar for generating strings over the seven-element alphabet
$\Sigma \stackrel{\mathrm{def}}{=} \{\text{a}, \text{big}, \text{cat}, \text{dog}, \text{eats}, \text{small}, \text{the}\}$.
The elements of the alphabet are called terminals for reasons that will emerge below.
The grammar uses finitely many extra symbols, called non-terminals, namely the eight symbols

ADJECTIVE, ARTICLE, NOUN, NOUNPHRASE, OBJECT, SENTENCE, SUBJECT, VERB.

One of these is designated as the start symbol. In this case it is SENTENCE (because we are interested in generating sentences). Finally, the context-free grammar contains a finite set of production rules, each of which consists of a pair, written x → u, where x is one of the non-terminals and u is a string of terminals and non-terminals. In this case there are twelve productions, as shown on the slide.

The idea is that we begin with the start symbol SENTENCE and use the productions to continually replace non-terminal symbols by strings. At successive stages in this process we have a string which may contain both terminals and non-terminals. We choose one of the non-terminals in the string and a production which has that non-terminal as its left-hand side. Replacing the non-terminal by the right-hand side of the production we obtain the next string in the sequence, or derivation as it is called. The derivation stops when we obtain a string containing only terminals. The set of strings over Σ that may be obtained in this way from the start symbol is by definition the language generated by the context-free grammar.

A derivation

SENTENCE → SUBJECT VERB OBJECT
→ ARTICLE NOUNPHRASE VERB OBJECT
→ the NOUNPHRASE VERB OBJECT
→ the NOUNPHRASE eats OBJECT
→ the ADJECTIVE NOUN eats OBJECT
→ the big NOUN eats OBJECT
→ the big cat eats OBJECT
→ the big cat eats ARTICLE NOUNPHRASE
→ the big cat eats a NOUNPHRASE
→ the big cat eats a ADJECTIVE NOUN
→ the big cat eats a small NOUN
→ the big cat eats a small dog

Slide 36

For example, the string

the big cat eats a small dog

is in this language, as witnessed by the derivation on Slide 36, in which we have indicated left-hand sides of production rules by underlining.
On the other hand, the string

the dog a    (4)

is not in the language, because there is no derivation from SENTENCE to the string. (Why?)

Remark 6.1.1. The phrase 'context-free' refers to the fact that in a derivation we are allowed to replace an occurrence of a non-terminal by the right-hand side of a production without regard to the strings that occur on either side of the occurrence (its 'context'). A more general form of grammar (a 'type 0 grammar' in the Chomsky hierarchy; see page 257 of Kozen's book, for example) has productions of the form u → v where u and v are arbitrary strings of terminals and non-terminals. For example a production of the form

a ADJECTIVE cat → dog

would allow occurrences of 'ADJECTIVE' that occur between 'a' and 'cat' to be replaced by 'dog', deleting the surrounding symbols at the same time. This kind of production is not permitted in a context-free grammar.

Example of Backus-Naur Form (BNF)

Terminals: x ′ + − ∗ ( )
Non-terminals: id op exp
Start symbol: exp
Productions:
id ::= x | id′
op ::= + | − | ∗
exp ::= id | exp op exp | (exp)

Slide 37

6.2 Backus-Naur Form

It is quite likely that the same non-terminal will appear on the left-hand side of several productions in a context-free grammar. Because of this, it is common to use a more compact notation for specifying productions, called Backus-Naur Form (BNF), in which all the productions for a given non-terminal are specified together, with the different right-hand sides being separated by the symbol '|'. BNF also tends to use the symbol '::=' rather than '→' in the notation for productions. An example of a context-free grammar in BNF is given on Slide 37. Written out in full, the context-free grammar on this slide has eight productions, namely:

id → x
id → id′
op → +
op → −
op → ∗
exp → id
exp → exp op exp
exp → (exp)

The language generated by this grammar is supposed to represent certain arithmetic expressions.
For example

x + (x′′)    (5)

is in the language, but

x + (x)′′    (6)

is not. (See Exercise 6.5.2.)

A context-free grammar for the language {a^n b^n | n ≥ 0}

Terminals: a b
Non-terminal: I
Start symbol: I
Productions: I ::= ε | aIb

Slide 38

6.3 Regular grammars

A language L over an alphabet Σ is context-free iff L is the set of strings generated by some context-free grammar (with set of terminals Σ). The context-free grammar on Slide 38 generates the language {a^n b^n | n ≥ 0}. We saw in Section 5.2 that this is not a regular language. So the class of context-free languages is not the same as the class of regular languages. Nevertheless, as Slide 39 points out, every regular language is context-free. For the grammar defined on that slide clearly has the property that derivations from the start symbol to a string in Σ∗ must be of the form of a finite number of productions of the first kind followed by a single production of the second kind, i.e.

sM → a1 q1 → a1 a2 q2 → · · · → a1 a2 ... an qn → a1 a2 ... an

where in M the following transition sequence holds

$s_M \xrightarrow{a_1} q_1 \xrightarrow{a_2} \cdots \xrightarrow{a_n} q_n \in \mathrm{Accept}_M$.

Thus a string is in the language generated by the grammar iff it is accepted by M.

Every regular language is context-free

Given a DFA M, the set L(M) of strings accepted by M can be generated by the following context-free grammar:

set of terminals = Σ_M
set of non-terminals = States_M
start symbol = start state of M
productions of two kinds:
q → aq′ whenever $q \xrightarrow{a} q'$ in M
q → ε whenever q ∈ Accept_M

Slide 39

Definition A context-free grammar is regular iff all its productions are of the form

X → uY or X → u

where u is a string of terminals and X and Y are non-terminals.

Theorem
(a) Every language generated by a regular grammar is a regular language (i.e. is the set of strings accepted by some DFA).
(b) Every regular language can be generated by a regular grammar.
Slide 40

It is possible to single out context-free grammars of a special form, called regular (or right linear), which do generate regular languages. The definition is on Slide 40. Indeed, as the theorem on that slide states, this type of grammar generates all possible regular languages.

Proof of the Theorem on Slide 40. First note that part (b) of the theorem has already been proved, because the context-free grammar generating L(M) on Slide 39 is a regular grammar (of a special kind).

To prove part (a), given a regular grammar we have to construct a DFA M whose set of accepted strings coincides with the strings generated by the grammar. By the Subset Construction (Theorem on Slide 18), it is enough to construct an NFAε with this property. This makes the task much easier. The construction is illustrated on Slide 41. We take the states of M to be the non-terminals, augmented by some extra states described below. Of course the alphabet of input symbols of M should be the set of terminal symbols of the grammar. The start state is the start symbol. Finally, the transitions and the accepting states of M are defined as follows.

(i) For each production of the form q → uq′ with length(u) ≥ 1, say u = a1 a2 ... an with n ≥ 1, we add n − 1 fresh states q1, q2, ..., qn−1 to the automaton and transitions

$q \xrightarrow{a_1} q_1 \xrightarrow{a_2} q_2 \xrightarrow{a_3} \cdots q_{n-1} \xrightarrow{a_n} q'$.

(ii) For each production of the form q → uq′ with length(u) = 0, i.e. with u = ε, we add an ε-transition

$q \xrightarrow{\varepsilon} q'$.

(iii) For each production of the form q → u with length(u) ≥ 1, say u = a1 a2 ... an with n ≥ 1, we add n fresh states q1, q2, q3, ..., qn to the automaton and transitions

$q \xrightarrow{a_1} q_1 \xrightarrow{a_2} q_2 \xrightarrow{a_3} q_3 \cdots \xrightarrow{a_n} q_n$.

Moreover we make the state qn accepting.

(iv) For each production of the form q → u with length(u) = 0, i.e. with u = ε, we do not add in any new states or transitions, but we do make q an accepting state.
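The four cases (i)-(iv) can be sketched in Python as follows. This is only an illustrative sketch, not part of the notes: the names `grammar_to_nfa`, `eps_closure` and `accepts` are hypothetical, a production X → uY is encoded as the triple `(X, u, Y)` (with `None` in place of Y for X → u), terminals are assumed to be single characters, and fresh states are tagged tuples so that they cannot clash with non-terminals.

```python
from itertools import count

EPS = ""  # label used for ε-transitions (an encoding assumption)


def grammar_to_nfa(productions):
    """Build (delta, accepting) for an NFAε from regular-grammar productions."""
    fresh = count(1)
    delta = {}        # (state, label) -> set of successor states
    accepting = set()

    def add(p, label, q):
        delta.setdefault((p, label), set()).add(q)

    for lhs, u, rhs_nt in productions:
        if u == "":
            if rhs_nt is None:
                accepting.add(lhs)        # case (iv): q -> ε
            else:
                add(lhs, EPS, rhs_nt)     # case (ii): q -> q'
            continue
        if rhs_nt is None:                # case (iii): q -> u, last fresh state accepts
            last = ("fresh", next(fresh))
            accepting.add(last)
        else:                             # case (i): q -> u q'
            last = rhs_nt
        state = lhs
        for a in u[:-1]:                  # intermediate fresh states
            nxt = ("fresh", next(fresh))
            add(state, a, nxt)
            state = nxt
        add(state, u[-1], last)
    return delta, accepting


def eps_closure(delta, states):
    """All states reachable from `states` using ε-transitions only."""
    seen, todo = set(states), list(states)
    while todo:
        p = todo.pop()
        for q in delta.get((p, EPS), ()):
            if q not in seen:
                seen.add(q)
                todo.append(q)
    return seen


def accepts(delta, accepting, start, w):
    """Simulate the NFAε on the string w, starting from the start symbol."""
    current = eps_closure(delta, {start})
    for a in w:
        moved = set()
        for p in current:
            moved |= delta.get((p, a), set())
        current = eps_closure(delta, moved)
    return bool(current & accepting)


# The regular grammar from Slide 41: S→abX, X→bbY, Y→X, X→a, Y→ε.
SLIDE_41 = [("S", "ab", "X"), ("X", "bb", "Y"), ("Y", "", "X"),
            ("X", "a", None), ("Y", "", None)]
delta, accepting = grammar_to_nfa(SLIDE_41)
print(accepts(delta, accepting, "S", "abbb"))
```

Simulating the automaton with sets of states, rather than determinising it first, keeps the sketch short; the Subset Construction would give an equivalent DFA.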
If we have a transition sequence in M of the form $s_M \stackrel{u}{\Rightarrow} q$ with q ∈ Accept_M, we can divide it up into pieces according to where non-terminals occur and then convert each piece into a use of one of the production rules, thereby forming a derivation of u in the grammar. Reversing this process, every derivation of a string of terminals can be converted into a transition sequence in the automaton from the start state to an accepting state. Thus this NFAε does indeed accept exactly the set of strings generated by the given regular grammar.

Example of the construction used in the proof of the Theorem on Slide 40

regular grammar:
S → abX
X → bbY
Y → X
X → a
Y → ε

NFAε (diagram): states S, q1, X, q2, Y, q3; start state S (the start symbol); accepting states Y and q3; transitions
$S \xrightarrow{a} q_1$, $q_1 \xrightarrow{b} X$, $X \xrightarrow{b} q_2$, $q_2 \xrightarrow{b} Y$, $Y \xrightarrow{\varepsilon} X$, $X \xrightarrow{a} q_3$.

Slide 41

Chomsky Normal Form (CNF)

Theorem Any context-free language can be generated by a grammar whose productions are of one of the following three types:

X → YZ    X → a    I → ε

where X, Y, Z are non-terminals, a is a terminal, and I is the start symbol. The last type of production occurs if and only if the language contains ε (which is why the use of CNFs is usually restricted to languages that do not contain ε).

Slide 42

6.4 Chomsky and Greibach normal forms

The types of production which are allowed in a regular grammar are very special. However, as the results on Slides 42 and 43 show, apparently mild generalizations of them serve to generate all context-free languages. For proofs of these normal form results, see Hopcroft and Ullman §4.5 and §4.6, for example.

Example 6.4.1. Here is a context-free grammar in Chomsky normal form for the language {a^n b^n | n ≥ 0}. The set of terminals is {a, b}, the set of non-terminals is {I, A, B, C}, the start symbol is I and the productions are:

I ::= ε | AB | AC
A ::= a
B ::= b
C ::= IB.
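To get a feel for the grammar in Example 6.4.1, one can enumerate the terminal strings it generates using short derivations. The following Python sketch does this by exploring leftmost derivations (always replacing the left-most non-terminal) up to a bounded number of steps. It is an illustration under stated assumptions, not a decision procedure: the function name `generate` and the dictionary encoding of productions are hypothetical, and the step bound is what guarantees termination.

```python
def generate(productions, start, max_steps):
    """Terminal strings derivable from `start` by leftmost derivations
    of at most `max_steps` steps."""
    nonterminals = set(productions)
    frontier = {(start,)}      # sentential forms reached after k steps
    results = set()
    for _ in range(max_steps + 1):
        next_frontier = set()
        for form in frontier:
            # position of the left-most non-terminal, if any
            idx = next((k for k, sym in enumerate(form)
                        if sym in nonterminals), None)
            if idx is None:
                results.add("".join(form))   # only terminals left
                continue
            for rhs in productions[form[idx]]:
                next_frontier.add(form[:idx] + tuple(rhs) + form[idx + 1:])
        frontier = next_frontier
    return results


# Example 6.4.1: I ::= ε | AB | AC, A ::= a, B ::= b, C ::= IB.
CNF_GRAMMAR = {
    "I": [(), ("A", "B"), ("A", "C")],
    "A": [("a",)],
    "B": [("b",)],
    "C": [("I", "B")],
}

print(sorted(generate(CNF_GRAMMAR, "I", 7)))
```

Every string found this way has the form a^n b^n, which is consistent with (but of course does not prove) the claim of Exercise 6.5.7.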
Greibach Normal Form (GNF)

Theorem Any context-free language can be generated by a grammar whose productions are of one of the following two types:

X → aU    I → ε

where a is a terminal, U is a (possibly empty) string of non-terminals, and I is the start symbol. The last type of production occurs if and only if the language contains ε (which is why the use of GNFs is usually restricted to languages that do not contain ε).

Slide 43

6.5 Exercises

Exercise 6.5.1. Why is the string (4) not in the language generated by the context-free grammar in Section 6.1?

Exercise 6.5.2. Give a derivation showing that (5) is in the language generated by the context-free grammar on Slide 37. Prove that (6) is not in that language. [Hint: show that if u is a string of terminals and non-terminals occurring in a derivation of this grammar and that '′' occurs in u, then it does so in a substring of the form v′, or v′′, or v′′′, etc., where v is either x or id.]

Exercise 6.5.3. Give a context-free grammar generating all the palindromes over the alphabet {a, b} (cf. Slide 28).

Exercise 6.5.4. Give a context-free grammar generating all the regular expressions over the alphabet {a, b}.

Exercise 6.5.5. Using the construction given in the proof of part (a) of the Theorem on Slide 40, convert the regular grammar with start symbol q0 and productions

q0 → ε
q0 → abq0
q0 → cq1
q1 → ab

into an NFAε whose language is that generated by the grammar.

Exercise 6.5.6. Is the language generated by the context-free grammar on Slide 35 a regular language? What about the one on Slide 37?

Exercise 6.5.7. Show that the language generated by the CFG in Example 6.4.1 is indeed {a^n b^n | n ≥ 0}.

Tripos questions 2008.2.8 2005.2.9 2002.2.1(d) 1997.2.7 1996.2.1(k)

7 Pushdown Automata

Roughly speaking, a non-deterministic pushdown automaton is a non-deterministic finite automaton augmented with a single memory stack of unlimited depth.
Such automata can be used to accept exactly the context-free languages in just the same way that NFAs can be used to accept exactly the regular languages. (However, the deterministic version of pushdown automata accepts a strictly smaller class of languages, in contrast to the situation for DFAs versus NFAs and regular languages.)

7.1 Non-deterministic pushdown automata

A non-deterministic pushdown automaton (NPDA) M = (Q, Σ, s, F, Γ, I, ∆) is specified by giving the information listed on Slide 44. The first part (Q, Σ, s, F) of M is like specifying an NFA minus its transition relation. The next part Γ of M is used to build finite stacks: these are just finite strings S ∈ Γ∗ over the alphabet Γ of stack symbols, with the left-most symbol regarded as the 'top of the stack'. When strings are regarded as stacks in this way, only certain operations on strings are allowed, as specified on Slide 45. (M contains a distinguished stack symbol I ∈ Γ in order to define its initial configuration: see Section 7.3 below.) Finally, the finite state machine and stack aspects of M are tied together by its transition relation ∆, which formally speaking is any finite subset of

(Γ × Q × Σ × Γ∗ × Q) ∪ (Γ × Q × Γ∗ × Q).

Slide 46 gives an example of a NPDA.

Non-deterministic Pushdown Automaton (NPDA) is specified by:

• Q, finite set of machine states
• Σ, alphabet of input symbols
• s ∈ Q, the start state
• F ⊆ Q, subset of accepting states
• Γ, alphabet of stack symbols
• I ∈ Γ, the initial stack symbol
• ∆, finite set of transitions, which are either input-transitions $A, q \xrightarrow{a} S, q'$, or ε-transitions $A, q \xrightarrow{\varepsilon} S, q'$ (where A ∈ Γ, q ∈ Q, a ∈ Σ, S ∈ Γ∗ and q′ ∈ Q).

Slide 44

Allowed operations on stacks S ∈ Γ∗

pop the top element A off a non-empty stack AS, producing a new stack S and returning the element A

push a finite string S′ of elements on to the top of a stack S, producing a new stack S′S

Note:
• pop is not defined on the empty stack;
• we may push an empty string onto a stack (in which case it is unchanged).

Slide 45

Example NPDA

States: i q f
Input symbols: a b
Start state: i
Accepting state: f
Stack symbols: I A
Initial stack symbol: I
Transitions:
ε-transition: $I, q \xrightarrow{\varepsilon} \varepsilon, f$
a-transitions: $I, i \xrightarrow{a} AI, i$ and $A, i \xrightarrow{a} AA, i$
b-transitions: $A, i \xrightarrow{b} \varepsilon, q$ and $A, q \xrightarrow{b} \varepsilon, q$

Slide 46

7.2 Behaviour of a NPDA

The operation of a NPDA is described in terms of possible transitions between 'instantaneous configurations' of the memory stack, the internal machine state and the queue of input symbols waiting to be processed. So such a configuration takes the form (S, q, w) where

• S ∈ Γ∗ is the current stack of memory symbols,
• q ∈ Q is the current state, and
• w ∈ Σ∗ is the string of input symbols yet to be processed (from left to right).

We can think of the machine progressing from one such instantaneous configuration to the next non-deterministically by performing transitions from ∆, in the following sense:

(i) If (AS′, q, aw) is the current configuration and $A, q \xrightarrow{a} S, q'$ is an input-transition of M, then (SS′, q′, w) is a possible next configuration. (In other words, the action taken is: 'pop A off the stack, consume input a, change to state q′ and push S onto the top of the stack'.)

(ii) If (AS′, q, w) is the current configuration and $A, q \xrightarrow{\varepsilon} S, q'$ is an ε-transition of M, then (SS′, q′, w) is a possible next configuration. (In other words, the action taken is: 'pop A off the stack, change to state q′ and push S onto the top of the stack'.)
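Rules (i) and (ii) can be sketched in Python as a function from a configuration to the set of its possible next configurations. This is an illustrative sketch only: the names `next_configs` and `reachable` are hypothetical, stacks and inputs are encoded as Python strings (left-most character = top of stack), a transition is a tuple `(A, q, a, S, q2)`, and an ε-transition is encoded by taking `a` to be the empty string.

```python
# Transitions of the example NPDA on Slide 46, as (A, q, a, S, q2);
# a == "" marks an ε-transition (an encoding assumption).
SLIDE_46 = [
    ("I", "i", "a", "AI", "i"),
    ("A", "i", "a", "AA", "i"),
    ("A", "i", "b", "", "q"),
    ("A", "q", "b", "", "q"),
    ("I", "q", "", "", "f"),
]


def next_configs(transitions, config):
    """All configurations reachable from `config` in one step."""
    stack, state, rest = config
    out = set()
    if not stack:                 # pop is not defined on the empty stack
        return out
    top, below = stack[0], stack[1:]
    for A, q, a, S, q2 in transitions:
        if A != top or q != state:
            continue
        if a == "":                                # rule (ii): ε-transition
            out.add((S + below, q2, rest))
        elif rest.startswith(a):                   # rule (i): consume input a
            out.add((S + below, q2, rest[len(a):]))
    return out


def reachable(transitions, config):
    """Configurations reachable in zero or more steps.

    May fail to terminate for machines whose ε-transitions can grow the
    stack without bound; it does terminate for the Slide 46 example.
    """
    seen, todo = {config}, [config]
    while todo:
        for c in next_configs(transitions, todo.pop()):
            if c not in seen:
                seen.add(c)
                todo.append(c)
    return seen


print(next_configs(SLIDE_46, ("I", "i", "aabb")))
```

Because the machine is non-deterministic, the step function returns a set; for the Slide 46 machine every configuration happens to have at most one successor.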
Note that in both cases the actions are only allowed to read a single memory symbol A from the top of the stack, but can replace it with a whole string S of memory symbols.

More formally, given a NPDA M = (Q, Σ, s, F, Γ, I, ∆), Slide 47 defines binary relations between configurations:

one-step transition relation (S, q, w) ⇒1 (S′, q′, w′)
many-step transition relation (S, q, w) ⇒∗ (S′, q′, w′).

Next-configuration relation

(S, q, w) ⇒1 (S′, q′, w′)

describes how a NPDA M = (Q, Σ, s, F, Γ, I, ∆) can move from one configuration (S, q, w) to another (S′, q′, w′) in one step. It is defined to hold if it matches either of the following two cases:

• (AS′, q, aw) ⇒1 (SS′, q′, w) where $A, q \xrightarrow{a} S, q'$ is an input-transition of M;
• (AS′, q, w) ⇒1 (SS′, q′, w) where $A, q \xrightarrow{\varepsilon} S, q'$ is an ε-transition of M.

We write (S, q, w) ⇒∗ (S′, q′, w′) to mean

(S, q, w) = (S1, q1, w1) ⇒1 · · · ⇒1 (Sn, qn, wn) = (S′, q′, w′)

holds for some n ≥ 1 and configurations (Si, qi, wi).

Slide 47

For example, for the NPDA on Slide 46 we have (I, i, a^3 b^3) ⇒∗ (ε, f, ε) because of the following one-step transitions:

(I, i, a^3 b^3) ⇒1 (AI, i, a^2 b^3) because $I, i \xrightarrow{a} AI, i$
⇒1 (AAI, i, a b^3) because $A, i \xrightarrow{a} AA, i$
⇒1 (AAAI, i, b^3) because $A, i \xrightarrow{a} AA, i$
⇒1 (AAI, q, b^2) because $A, i \xrightarrow{b} \varepsilon, q$
⇒1 (AI, q, b) because $A, q \xrightarrow{b} \varepsilon, q$
⇒1 (I, q, ε) because $A, q \xrightarrow{b} \varepsilon, q$
⇒1 (ε, f, ε) because $I, q \xrightarrow{\varepsilon} \varepsilon, f$.

L(M), language accepted by a NPDA M

If M = (Q, Σ, s, F, Γ, I, ∆), then

L(M) = {w ∈ Σ∗ | (I, i, w) ⇒∗ (S, q, ε) holds for some S ∈ Γ∗ and q ∈ F}.

Slide 48

7.3 Language accepted by a NPDA

Given a NPDA M = (Q, Σ, s, F, Γ, I, ∆), the language it accepts consists of all strings w ∈ Σ∗ for which there is a transition from an initial configuration for w to some accepting configuration.
By definition, the initial configuration for w is

(I, i, w);

in other words, the stack of memory symbols just contains the initial stack symbol I, the machine is in its start state i and the input string to be processed is w. By definition, a configuration (S, q, w) is accepting if q ∈ F (the set of accepting states of M) and w = ε (there are no more input symbols waiting to be processed). So we get the definition of L(M), the language accepted by M, as on Slide 48.

Example 7.3.1. If M is the NPDA given on Slide 46, then L(M) is the context-free language {a^n b^n | n ≥ 1}.

We state without proof on Slide 49 the main result connecting NPDAs with context-free languages.

Theorem A language is context-free if and only if it is accepted by some push-down automaton.

For a proof, see for example Hopcroft and Ullman Sect. 5.5.

Slide 49

Remarks

(i) Note that an NFA can be regarded as a NPDA in which the alphabet of stack symbols is just {I}, there are no ε-transitions, and input-transitions all have the pushed string equal to I (so that the stack only ever contains the single item I during operation). For such a NPDA, the definition of L(M) on Slide 48 agrees with the definition of the language accepted by an NFA (Slide 11).

(ii) In the literature, the language L(M) defined on Slide 48 is called the language accepted by M by final state. There is an alternative definition of 'accepting configuration' for NPDAs that does without the subset F of accepting states of the machine: one just takes configurations of the form (ε, q, ε) in which the machine is in an arbitrary state q ∈ Q, but the stack of memory symbols is empty and there are no more input symbols to be processed:

L′(M) = {w ∈ Σ∗ | (I, i, w) ⇒∗ (ε, q, ε) holds for some q ∈ Q}

is called the language accepted by M by empty stack.
It can be shown that the class of languages we get using this definition of accepting configuration is no different from before: it is all context-free languages.

(iii) A NPDA is deterministic if for each instantaneous configuration (with non-empty memory stack) there is exactly one transition that is applicable. In other words, for all (S, q, w) with S ≠ ε, there is a unique (S′, q′, w′) for which (S, q, w) ⇒1 (S′, q′, w′) holds. Unlike the situation for finite automata, determinism for pushdown automata makes a difference to a machine's recognising capabilities: the class of languages accepted by some deterministic pushdown automaton is strictly smaller than the collection of all context-free languages.

7.4 Toward computation theory

Why have just one memory stack? Why not have two or more? It turns out that a machine with more than two stacks can always be simulated by one with just two. However, having two stacks rather than one gives a real leap in language-recognising power. Such machines have the same power as Turing machines, and the languages they accept are the recursively enumerable ones. These concepts are developed in the Part IB Computation Theory course.

7.5 Exercises

Exercise 7.5.1. Show that if M is the NPDA from Slide 46, then L(M) is the context-free language {a^n b^n | n ≥ 1}.

Exercise 7.5.2. Give a NPDA accepting the language of palindromes over the alphabet {a, b} (cf. Slide 28).

Exercise 7.5.3. Let M be the NPDA ({i, q}, {a, b, c}, i, ∅, {I, X, Y}, I, ∆), where ∆ contains the following transitions:

$I, i \xrightarrow{a} XI, i$    $I, i \xrightarrow{b} YI, i$    $I, i \xrightarrow{c} I, q$
$X, i \xrightarrow{a} XX, i$    $X, i \xrightarrow{b} YX, i$    $X, i \xrightarrow{c} X, q$
$Y, i \xrightarrow{a} XY, i$    $Y, i \xrightarrow{b} YY, i$    $Y, i \xrightarrow{c} Y, q$
$X, q \xrightarrow{a} \varepsilon, q$    $Y, q \xrightarrow{b} \varepsilon, q$    $I, q \xrightarrow{\varepsilon} \varepsilon, q$.

What is L(M)? Consider the language L′(M) accepted by this NPDA by empty stack, in the sense of Remark (ii) above. Show that for any w ∈ {a, b}∗, the string w c w^R is in L′(M), where w^R is the reverse of w. Is every string in L′(M) of this form?
[If you get stuck, see Hopcroft and Ullman, Example 5.1.]
