Notes on regular languages

The simplest language class is the set of regular languages, so let’s begin with finite automata, the machines that recognize languages in this class.

What is a finite automaton? Pictorially, the states of the machine are circles, and they are generally labeled by subscripted q’s such as q0, q1, … . The start state is usually indicated by an unlabeled arrow or a triangle pointing to it. By convention, the start state is usually q0. The final or accepting states are indicated by concentric circles. We accept a string if the machine ends up in an accepting state after examining all elements of the string being tested. Transitions from one state to another are indicated by arrows labeled by an alphabet symbol.

Suppose we want to recognize bit strings of even parity, that is, strings containing an even number of 1’s. Since parity is either even or odd we will need two states, one corresponding to each possibility. Thus we begin like this:

[Figure: states q0 (start, accepting) and q1, with an arrow labeled 1 from q0 to q1 and an arrow labeled 1 from q1 back to q0.]

The start state on the left corresponds to strings with even parity and thus is indicated as an accepting state, while the one on the right is where odd parity strings would end up. However, our alphabet contains two symbols, 0 and 1, and thus our machine should also have transitions corresponding to an input of 0. Since a 0 doesn’t change parity, we’ll use loops on both states so strings with 0’s can be handled. This gives us the machine below:

[Figure: the same two-state machine with a loop labeled 0 added on each state.]

Another example of the use of a finite automaton is in a lexical analyzer to recognize key words such as “if”, “then”, and “while” when compiling a program.

Another way that regular languages are characterized is by grammars. A finite automaton recognizes strings in a language, i.e. we run the string through the machine and see what kind of state it stops in. Grammars generate strings in a language. That is, if you follow the rules in the grammar you will produce a string that is in the language. We’ll give the formal definition of a grammar later, but let’s construct one for the even parity language.
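Before constructing the grammar, it may help to see the even-parity machine described above as code. The following is a minimal Python sketch, not part of the original notes: the names DELTA, START, ACCEPTING and accepts are hypothetical, and the states q0 and q1 are simply encoded as strings.

```python
# Transition table for the two-state even-parity machine from the notes.
# q0 = an even number of 1's seen so far (start state, accepting); q1 = odd.
# All names here are illustrative assumptions, not notation from the notes.
DELTA = {
    ("q0", "0"): "q0",  # a 0 does not change parity: loop on q0
    ("q0", "1"): "q1",  # a 1 flips parity from even to odd
    ("q1", "0"): "q1",  # loop on q1
    ("q1", "1"): "q0",  # a 1 flips parity from odd back to even
}
START = "q0"
ACCEPTING = {"q0"}


def accepts(bits: str) -> bool:
    """Run the machine over `bits`; accept if it stops in an accepting state."""
    state = START
    for symbol in bits:
        if (state, symbol) not in DELTA:
            raise ValueError(f"symbol {symbol!r} is not in the alphabet {{0, 1}}")
        state = DELTA[(state, symbol)]
    return state in ACCEPTING


if __name__ == "__main__":
    for w in ["", "0", "1", "01010", "111"]:
        print(repr(w), accepts(w))
```

For instance, accepts("01010") is True because the string contains two 1's, while accepts("111") is False. The empty string is accepted because the machine starts in the accepting state q0.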
We will need two variables corresponding to the two states in the machine above. We’ll use S and A corresponding to q0 and q1, respectively. The grammar below will generate even parity strings. The generation stops when the only symbols remaining in the string are alphabet symbols (terminals).

S → 0S
S → 1A
S → λ
A → 0A
A → 1S

Using the S → λ rule stops the derivation because all variables have been removed (λ denotes the empty string). Here’s a sample derivation of an even parity string:

S ⇒ 0S ⇒ 01A ⇒ 010A ⇒ 0101S ⇒ 01010S ⇒ 01010

Note that a grammar is not unique and there may be other grammars that generate the same language.

Now, for the formal definition of a grammar: A grammar G is defined as a quadruple G = (V, T, S, P) where

V is a finite set of objects called variables,
T is a finite set of objects called terminal symbols,
S ∈ V is a special symbol called the start variable,
P is a finite set of productions or rules to generate strings in the language.

The rules in P have the form α → β where α and β are strings of symbols from V and T. For regular languages, the left hand side of a production is always a single variable.

A common convention is to use lower case letters near the front of the (English) alphabet and digits as the terminal symbols. Lower case letters near the end of the alphabet are usually used to denote strings in a language. For example, the string we derived above could be “named” w, i.e. w = 01010.

V and T are disjoint, and both sets are nonempty. T is really the same thing as Σ, the alphabet. We use a different notation because we are discussing them in different ways. S is always an element of V.

Other commonly used notation includes T*, each of whose elements is a sequence of (terminal) symbols from T. A string in (V ∪ T)* may contain both variables and terminal symbols. This may also be referred to as a sentential form. If string w can be obtained by starting with S and applying the productions of the grammar then we say that S derives w, denoted by S ⇒* w. (The * indicates this can be done in zero or more steps.
We may also put a number above the double arrow indicating the actual number of steps used in the derivation.)

The language generated by the grammar G is defined as L(G) = {w ∈ T* | S ⇒* w}.

For more examples of grammars see examples 1.11, 1.12 and 1.13 in the text. Let’s look briefly at example 1.12 and discuss why certain variations will not work. The language is L = {a^n b^(n+1) | n ≥ 0}. The grammar from the book is

S → Ab
A → aAb | λ

Observe that the following grammar also works:

S → aSb | b

If we wanted all strings with an equal number of a’s and b’s with all a’s preceding the b’s, we could replace the production S → b in the second grammar by S → λ.

What does the following grammar generate?

S → aSb | Sb | λ

Finally, let’s look at exercise 12 in section 1.2. What does the following grammar generate?

S → aA | λ
A → bS

(We’ll discuss the answers to these in class.)

Think about how you could construct a grammar that generates all strings with an equal number of a’s and b’s. There is no restriction on the order of the a’s and b’s.

Here’s a table that shows grammar characteristics and machines that accept the different language classes.

language class          grammar characteristics                           machine
----------------------  ------------------------------------------------  ------------------------------
regular                 A → wB | w where w ∈ T* (i.e. a string of         finite automaton
                        terminal symbols). The lhs is a single variable.
                        If there is a variable on the rhs it is at the
                        extreme right. e.g. S → aA

context-free            A → α where α ∈ (V ∪ T)*. That is, the rhs is a   pushdown automaton
                        string of variables and terminals.
                        e.g. S → aSb

context-sensitive       α → β where α, β ∈ (V ∪ T)*. That is, both the    linear-bounded automaton (LBA)
                        lhs and rhs are strings of variables and
                        terminals, and 1 ≤ |α| ≤ |β|. e.g. aA → bBS

recursively enumerable  unrestricted or phrase-structured: α → β where    Turing machine
                        α, β ∈ (V ∪ T)*. No restrictions on α or β.

In addition to finite automata and grammars, regular expressions are also used to represent regular languages. For example, for the language discussed in class, the set of odd length strings over {a, b} ending in b, one possible regular expression is [(a+b)(a+b)]*b.
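The expression above can be checked mechanically. The sketch below is an illustration in Python, not something from the notes, and it assumes that Python’s re syntax is an acceptable stand-in for the notation used here: union, written + in these notes, is written | in Python, and fullmatch is used so that the entire string must match. The helper names matches, in_language and agree_up_to are hypothetical.

```python
import re
from itertools import product

# The notes' expression [(a+b)(a+b)]*b rewritten in Python's syntax:
# '+' (union) becomes '|', and the square brackets become a group.
ODD_ENDING_IN_B = re.compile(r"(?:(?:a|b)(?:a|b))*b")


def matches(w: str) -> bool:
    """True if the whole string w is matched by the expression."""
    return ODD_ENDING_IN_B.fullmatch(w) is not None


def in_language(w: str) -> bool:
    """Direct statement of the language: odd length, over {a, b}, ending in b."""
    return set(w) <= {"a", "b"} and len(w) % 2 == 1 and w.endswith("b")


def agree_up_to(max_len: int) -> bool:
    """Compare the expression with the direct definition on every short string."""
    for n in range(max_len + 1):
        for letters in product("ab", repeat=n):
            w = "".join(letters)
            if matches(w) != in_language(w):
                return False
    return True
```

Calling agree_up_to(8) compares the two definitions on every string over {a, b} of length at most 8. Agreement on short strings is only a sanity check, not a proof that the expression is correct.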
A * means to use the expression 0 or more times. We’ll look at regular expressions in more detail later.

Some simple examples of context-free languages that are not regular are the set of strings with an equal number of a’s and b’s and the set of strings of the form a^i b^j c^(i+j).
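The grammars written out in these notes can be explored by expanding productions mechanically. The following Python sketch is one possible way to do that, not a method from the text. It assumes that every variable appears as a key of the productions dictionary, that every other character is a terminal, and that each right-hand side contains at most one variable, so that bounding the number of terminals keeps the search finite. The names generate, LAMBDA, EVEN_PARITY, EXAMPLE_1_12 and ALTERNATIVE are hypothetical.

```python
from collections import deque

LAMBDA = ""  # the empty string, written λ in the notes


def generate(productions: dict, start: str, max_len: int) -> set:
    """Collect every terminal string of length <= max_len derivable from `start`.

    Assumptions: keys of `productions` are the variables, all other characters
    are terminals, and each right-hand side has at most one variable.
    """
    found = set()
    seen = {start}
    queue = deque([start])
    while queue:
        form = queue.popleft()
        # Locate the leftmost variable in the sentential form, if any.
        pos = next((i for i, ch in enumerate(form) if ch in productions), None)
        if pos is None:
            found.add(form)  # only terminals remain, so the derivation stops
            continue
        for rhs in productions[form[pos]]:
            new_form = form[:pos] + rhs + form[pos + 1:]
            terminals = sum(1 for ch in new_form if ch not in productions)
            # Discard forms that already hold too many terminals.
            if terminals <= max_len and new_form not in seen:
                seen.add(new_form)
                queue.append(new_form)
    return found


# The even-parity grammar: S -> 0S | 1A | λ, A -> 0A | 1S
EVEN_PARITY = {"S": ["0S", "1A", LAMBDA], "A": ["0A", "1S"]}
# The grammar from example 1.12: S -> Ab, A -> aAb | λ
EXAMPLE_1_12 = {"S": ["Ab"], "A": ["aAb", LAMBDA]}
# The alternative grammar for the same language: S -> aSb | b
ALTERNATIVE = {"S": ["aSb", "b"]}

if __name__ == "__main__":
    print(sorted(generate(EVEN_PARITY, "S", 3), key=lambda w: (len(w), w)))
    print(generate(EXAMPLE_1_12, "S", 5) == generate(ALTERNATIVE, "S", 5))
```

Comparing generate(EXAMPLE_1_12, "S", 5) with generate(ALTERNATIVE, "S", 5) gives a quick check that the two grammars agree on short strings. The same function can be pointed at the other grammars in these notes when thinking about what they generate.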