Chapter 14 The finite state control structure

1 Analogy

How are you feeling right now? Maybe you are happy or sleepy, dopey or grumpy, hungry or angry. Maybe you feel several of these at once, but for simplicity, let's assume that just one word describes your current state. How did you come to be in this state? Clearly your state is affected by things that happen to you: the inputs you receive. But input alone does not determine how you feel. For example, receiving a 90% on an exam can leave you either delighted or disappointed, depending on your expectations. Hence, your current state is really a function of two factors: your state a little while ago, and the input you have received since then. In other words, your state at time t+1 is determined by your state at time t and the input you encountered between t and t+1.

The notion of state, and of transition from one state to another as determined by a combination of previous state and input, is the basis for a set of computational models called finite state machines (FSMs). The FSM models in turn lead us in a very natural way to a powerful and broadly applicable programming strategy, which we will examine in this chapter.

2 Introduction

Computational models, or models of computation, are abstractions of devices that produce outputs (answers) from inputs (data). For simplicity, we'll assume a basic model of a computational device as a black box that has a single input channel and a single output channel. One simple physical form of such a computational device would be a black box with a set of buttons on the front, exactly one of which can be pressed at any time, and a set of lights on the top, exactly one of which is lit at any time. An input value is specified by pressing one of the buttons; the output value is specified by the single light that is lit. We can easily think of this device as producing a single output (light) in response to a single input (button).
But we can also think of the device as being given a sequence of inputs (by punching one button after another) and producing a sequence of outputs (by one light after another being lit). The most recent input is often referred to as the current input, and the light currently lit is the current output. Note that what goes on inside the box can be very complicated or even random; this simple model can be modified to accommodate any computational task we like. But we are interested only in a small set of the possible behaviors of the box, and we are interested only in behaviors that are deterministic.

© 2001 Donald F. Stanat and Stephen F. Weiss

The simplest computational model is one in which the box computes a function of the input value. Such a box will produce an output value (turn on a light) solely on the basis of the current input value (the last button pressed). Because the output value is determined only by the most recent input value, we say that the computation requires no memory; this means that no information about previous inputs has to be stored to perform the computation that determines which light shall be lit. Furthermore, given a particular input, say the ith button, the output is always the same.

We are interested in a more complex computational model that uses some information about past inputs in determining its output. Conceptually, we can imagine a device that keeps track of every input it has processed, that is, the entire input history. Simpler devices might keep track of less information, such as how many times the leftmost button was pressed, but even this information is unbounded in the sense that the number that must be stored may become arbitrarily large, and may require an arbitrarily large amount of storage to represent it.
We're interested in a simpler class of machines called Finite State Machines (FSMs), which can store only a finite amount of information, and give outputs that depend on that information and the input. A Finite State Machine is a device that can be in any one of a specified finite set of states and which can change state as a result of an input. Thus, each time an FSM gets an input, we consider it to change states, although it may enter the same state it was in before.

Finite state machines are useful for programming because they provide an alternative model for controlling program execution. Recall that program control is what determines which program statement (or block of statements) is to be executed next. The most common control structure is the default sequential structure; this causes the statements of a program to be executed sequentially, one after another, just as they appear in the program. The other control structures are

1. Alternative selection. This is usually embodied as an if, if...else, or switch statement. This performs one or more tests, and based on the result of the tests, chooses one block of code from a collection of blocks and executes it. The block of code may of course be empty, as it is in the false branch of the if statement.
2. Iteration. This is any loop structure; it causes some loop body to be executed a number of times with exit from the loop based on a test.
3. Subroutine. This causes the current action to be suspended. Control then branches to the subroutine code; upon completion of the subroutine, the original action resumes where it left off. Often recursion (which occurs when a subroutine calls itself) is treated as a separate control structure, but we choose to include it under the subroutine structure.
Printed February 06, 2016 12:21 AM

To this collection we now add the finite state control structure, which chooses a statement to execute (or a block of statements or a subroutine to call) based on the state of a finite state machine and the most recent input.

The remainder of this chapter will define finite state machines and work through several simple examples to develop the concept and to give you practice in working with them. Then, we will give some examples to show how the finite state machine can be used to solve a more complex problem.

3 The Basic Finite State Machine Model

A finite state machine (also called a finite state automaton, or simply a finite automaton) is a device whose input is a sequence of symbols. The automaton is always in some identifiable state. Each time an input symbol is received, the machine enters a new state, although the new state may be the same as the state it was in before. At each point, the current state of the machine is determined completely by:

1. the state it was in prior to the last input, and
2. the value of the last input symbol.

Formally, a finite state machine consists of six components:

S: a finite set of states. An FSM is always in exactly one of its states.
s0: a particular state called the start state, which the machine is in before it has seen any input.
I: a finite set of input symbols.
O: a finite set of outputs.
δ: the next state function, which maps a current state and current input to a new state. δ: (S × I) → S
λ: the output function, which maps the current state and the current input symbol to an output. λ: (S × I) → O

The machine starts out in s0, looking at the first symbol of a sequence of input symbols. It then issues the output appropriate to this state-symbol combination and goes to the appropriate next state. It then goes on to the next input symbol and the process repeats until the input sequence is exhausted; the machine then stops.
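The components and the operating procedure just described can be sketched in Java. This is only a minimal illustration under stated assumptions, not code from the chapter: the class name FsmSketch, the coding of states, inputs, and outputs as small integers, the storage of the two functions as arrays indexed by [state][input], and the toy "one-step delay" machine are all hypothetical choices made for the example.

```java
// Hypothetical sketch: states, input symbols, and outputs are coded as small
// integers, and the two functions are stored as arrays indexed [state][input].
final class FsmSketch {
    private final int startState;
    private final int[][] nextState; // next state function: (S x I) -> S
    private final int[][] output;    // output function:     (S x I) -> O

    FsmSketch(int startState, int[][] nextState, int[][] output) {
        this.startState = startState;
        this.nextState = nextState;
        this.output = output;
    }

    // Feed the whole input sequence to the machine, one symbol at a time,
    // and collect the output issued at each step.
    int[] run(int[] inputs) {
        int state = startState;
        int[] outputs = new int[inputs.length];
        for (int k = 0; k < inputs.length; k++) {
            int x = inputs[k];
            outputs[k] = output[state][x]; // issue the output for (state, x)
            state = nextState[state][x];   // then move to the next state
        }
        return outputs;
    }

    // Illustrative machine (an assumption, not from the text): a one-step
    // delay over {0,1}. The state remembers the previous input, the output
    // is that remembered bit, and the start state behaves as "previous = 0".
    static FsmSketch oneStepDelay() {
        int[][] next = { {0, 1}, {0, 1} };
        int[][] out  = { {0, 0}, {1, 1} };
        return new FsmSketch(0, next, out);
    }

    public static void main(String[] args) {
        int[] outputs = oneStepDelay().run(new int[] {1, 0, 1, 1});
        StringBuilder sb = new StringBuilder();
        for (int o : outputs) sb.append(o);
        System.out.println(sb); // prints 0101
    }
}
```

Storing both functions as tables keeps the machine itself as data; a different machine needs only different arrays, while the run loop stays unchanged.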
Note that the machine is always in some state. It starts out in s0 and is left in some state when operation stops.

Example: Our first example of an FSM is one that takes as input a sequence of binary digits and produces as output the same sequence except that alternate ones have been changed to zeros. Thus the input 011010111 becomes 010010010. The machine has two states, s0 and s1, with s0 being the start state. Both the input and output sets are {0,1}. The next state and output functions are shown in the tables below.

Next state
state    input 0    input 1
S0       S0         S1
S1       S1         S0

Output
state    input 0    input 1
S0       0          1
S1       0          0

Figure 1

An alternate but equivalent representation for an FSM is shown in Figure 2 below and is called a state diagram. States are shown as circles; the start state is indicated by the bold incoming arrow. The next state function and output functions are shown using directed arrows from one state to another. Each arrow is labeled with one element of I and one element of O. If the machine is in some state s and the current input symbol is x, then we follow the arc labeled x/y from s to a new state and produce output y.

[Figure 2: state diagram. Start state S0. Arcs: S0 to S0 on 0/0; S0 to S1 on 1/1; S1 to S1 on 0/0; S1 to S0 on 1/0.]

The machine shown above produces a stream of output symbols that is exactly as long as the input stream. By allowing "nothing" to be an element of the output symbol set, we can specify machines whose output stream is shorter than the input. For example, the machine in Figure 3 has a two symbol input set {a,b}, and produces outputs of {a, b, "nothing"}. It reads in strings of a's and b's and collapses substrings of a's into a single a, and substrings of b's into a single b. Hence the input string "aaaabaabbbbbabbab" will produce the output "abababab", and the input "aaaaaaaaaaaaaab" will produce "ab".
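Both example machines are small enough to hard-wire directly into code. The following Java sketch is one possible rendering, offered as an illustration rather than as the chapter's implementation: the class and method names are hypothetical, and the state numbering used for the a/b machine (0 for the start state, 1 after an a, 2 after a b) is an assumption.

```java
// Hypothetical sketch of the two example machines, with the next state and
// output functions hard-wired into if statements.
final class ExampleMachinesSketch {

    // Changes alternate ones to zeros. State 0: the next 1 is kept;
    // state 1: the next 1 is changed to 0. A 0 never changes the state.
    static String alternateOnes(String bits) {
        int state = 0;
        StringBuilder out = new StringBuilder();
        for (int k = 0; k < bits.length(); k++) {
            char x = bits.charAt(k);
            if (x == '0') {
                out.append('0');
            } else if (x == '1') {
                out.append(state == 0 ? '1' : '0');
                state = 1 - state;
            } else {
                throw new IllegalArgumentException("input must be 0 or 1: " + x);
            }
        }
        return out.toString();
    }

    // Collapses each run of a's into one a and each run of b's into one b.
    // The output "nothing" is modeled by appending nothing.
    static String collapseRuns(String s) {
        int state = 0; // assumed numbering: 0 = start, 1 = after a, 2 = after b
        StringBuilder out = new StringBuilder();
        for (int k = 0; k < s.length(); k++) {
            char x = s.charAt(k);
            if (x == 'a') {
                if (state != 1) out.append('a');
                state = 1;
            } else if (x == 'b') {
                if (state != 2) out.append('b');
                state = 2;
            } else {
                throw new IllegalArgumentException("input must be a or b: " + x);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(alternateOnes("011010111"));        // prints 010010010
        System.out.println(collapseRuns("aaaabaabbbbbabbab")); // prints abababab
    }
}
```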
[Figure 3: state diagram. Start state S0. Arcs: S0 to S1 on a/a; S0 to S2 on b/b; S1 to S1 on a/nothing; S1 to S2 on b/b; S2 to S2 on b/nothing; S2 to S1 on a/a.]

3.1 Example: the stamp machine

Consider a simple vending machine that dispenses 25-cent stamps, one at a time (see footnote 1). The inputs are nickels, dimes, and quarters. If, for the sake of simplicity, we assume no coin return mechanism, there are only two outputs: one is a single stamp from the roll, the other is nothing; that is, no stamp from the roll. The machine must keep track of how much money has been put in so far, and if the amount is 25 cents or more, dispense one stamp and reduce the customer's "credit" by 25 cents. Note that the machine need not remember exactly how much money has been put in it; it must only remember the outstanding credit. Therefore, since it has no other form of memory, it must have one state for each possible amount of credit. There will be one state indicating that the credit is currently zero; one state indicating a credit of 5 cents, plus states for 10, 15 and 20 cents. There is no need for a state with a credit amount of 25 cents or more, because when the credit amount reaches or exceeds 25 cents the machine will produce a stamp and decrease the credit amount by 25 cents, all at once.

The list of states and possible actions is described in a table in Figure 4. This table merges the output and next state functions; each rectangle contains an output (top line) and a new state (bottom line).

[Figure 4: combined output and next state table for the stamp machine]

In the above table, every possibility has been accounted for. In other words, no matter what state the machine is in and which input it receives, there is exactly one output to produce and one next state. For example, suppose the machine is in the state "10 cents". If the customer puts in a nickel, it adds five cents to the amount put in so far and hence goes to state "15 cents" and produces the output "nothing". If another nickel is added, the machine goes to the "20 cents" state and again produces "nothing".
If the customer then adds a dime, the machine now has enough money to dispense a stamp and have 5 cents credit left over. Hence the machine produces the output "one stamp" and goes to the "5 cents" state. From any state in the stamp machine, adding a quarter will result in the output "one stamp" with no change in the credit balance. And so the new state is in fact the same as the old.

Footnote 1: With postal rates constantly going up, it's impossible to keep this example current. So just return with us to those thrilling days of yesteryear when a first class stamp was 25 cents.

The state diagram for the stamp machine, shown in Figure 5, provides an easy way to visualize what the stamp machine does. It starts out in the initial state (credit = 0 cents). An input can be a nickel, a dime, or a quarter. When the input is received the machine goes from the current state into a new state, following the arrow which is labeled with the type of coin that has been inserted. Some arrows are also labeled to denote that a stamp is produced as output when these paths are taken. If the arrow the machine follows does not say to output a stamp, then the output is "nothing". Notice that new state doesn't always imply different state. For any credit balance, adding a quarter produces a stamp and leaves the credit balance unchanged. Thus from any state, when the input is a quarter, the machine always enters the same state it was in before and dispenses a stamp.

[Figure 5: state diagram for the stamp machine]

As it is currently specified, the stamp machine will cheat the user out of some money if he or she runs out of coins while the machine is in some state other than 0. We could make the machine more realistic and more humane by adding another input: "I am done" and adding four new outputs corresponding to giving change of 5, 10, 15, and 20 cents.
We would then add a new arc from each state to the 0 state with the input symbol being "I am done", and the output being the appropriate amount of change. For example, the arc from the 15 cents state would dispense 15 cents in change. The arc from the 0 state back to itself would produce no output.

4 Implementing an FSM

An FSM can be implemented with a simple loop. Each time through the loop we get one input symbol, produce the appropriate output symbol, and then go to the next state. When the input stream is exhausted, the loop terminates and the machine stops.

    state = start state;
    while (there is more input)
    {
        x = next input symbol;
        output(λ(state,x));   // Generate appropriate output (λ is the output function).
        state = δ(state,x);   // Move to next state (δ is the next state function).
    }

The body of the loop contains three operations. First, we must get the next input symbol. This could be done with a read statement or perhaps by getting the next element from an array or linked list. The second and third statements contain function calls to generate the appropriate output and go to the next state, respectively. This could be done by hard wiring the output and transition information into the code of the functions or by using a more general table look-up scheme.

5 Final output machines

The machines we have seen so far all take a sequence of input symbols and produce a sequence of output symbols. The first machine took in one sequence of binary digits and produced another of the same length. The ab reducer machine took an input stream and produced a possibly shorter stream with duplicates eliminated. The stamp machine took in a sequence of coins and produced a sequence of stamps and possibly some change. While an FSM always produces a sequence of outputs in response to a sequence of inputs, we can choose to ignore all the outputs except for the last one produced, that is, the output symbol associated with the last symbol of the input sequence.
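The loop of Section 4, combined with the table look-up scheme it mentions, can be sketched in Java using the stamp machine of Section 3.1 as the example. This is a hedged illustration only: the class name, the constant names, and the integer coding of coins and states are assumptions, and the table entries are rebuilt from the prose description of the machine rather than copied from Figure 4.

```java
// Hypothetical table-driven version of the stamp machine (no change-return).
final class StampMachineSketch {
    static final int NICKEL = 0, DIME = 1, QUARTER = 2;

    // State k stands for an outstanding credit of 5*k cents (k = 0..4).
    // NEXT[state][coin] is the next state.
    static final int[][] NEXT = {
        {1, 2, 0},   // credit  0
        {2, 3, 1},   // credit  5
        {3, 4, 2},   // credit 10
        {4, 0, 3},   // credit 15
        {0, 1, 4},   // credit 20
    };

    // STAMP[state][coin] is true when the output is "one stamp",
    // false when the output is "nothing".
    static final boolean[][] STAMP = {
        {false, false, true},
        {false, false, true},
        {false, false, true},
        {false, true,  true},
        {true,  true,  true},
    };

    // Runs the machine on a sequence of coins.
    // Returns {number of stamps dispensed, final credit in cents}.
    static int[] run(int[] coins) {
        int state = 0;                       // start state: credit 0
        int stamps = 0;
        for (int x : coins) {                // get the next input symbol
            if (STAMP[state][x]) stamps++;   // generate the output
            state = NEXT[state][x];          // move to the next state
        }
        return new int[] {stamps, 5 * state};
    }

    public static void main(String[] args) {
        int[] result = run(new int[] {DIME, NICKEL, NICKEL, DIME});
        System.out.println(result[0] + " stamp(s), credit " + result[1]);
        // prints 1 stamp(s), credit 5
    }
}
```

The sample run follows the walk-through in Section 3.1: a dime and two nickels bring the credit to 20 cents, and a further dime produces one stamp and leaves 5 cents of credit.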
As an example, let's build an FSM that will take a string of a's as input and tell us whether the input contained an odd or even number of a's. The input set will consist of just the single symbol "a"; the output set will consist of E for even and O for odd. How many states should the machine have? To figure this out, remember that the states of a finite state machine correspond roughly to its memory. It is not necessary to remember the exact number of a's seen. In fact, there are only two possibilities: that we have seen an odd number of a's so far or that we have seen an even number of a's so far. If we have seen an odd number so far and then see another a, then we have now seen an even number of a's. Conversely, if we have seen an even number so far and see another a, then we have now seen an odd number. Hence we will need two states: s0 for even so far, and s1 for odd so far. The even state, s0, will be the start state since initially we have seen zero a's and zero is an even number.

The next state and output functions will be as follows. If we are in the even state and see an a, then we go to the odd state and output an O since we have now seen an odd number of a's. If we are in the odd state and see an a, then we go to the even state and output an E since we have now seen an even number of a's. The last output symbol produced by the machine, just before it stops, gives the parity of the input. The state diagram for this machine is shown below. Notice one unusual thing about this machine. Since output is associated with state transitions, there can be no output produced by the empty string even though the string of zero a's is of even length.

[Figure 6: state diagram. Start state S0 (even). Arcs: S0 to S1 (odd) on a/O; S1 to S0 on a/E.]

Figure 7 shows an FSM that determines whether a binary integer is evenly divisible by 2, by 3, by both 2 and 3, or by neither. The input is the sequence of binary digits that constitute the number (reading left to right).
The output is 2, 3, b (for both), or n (for neither). The sequence of output symbols has no real significance, although the individual symbols give the division property of that portion of the input seen so far. The last output symbol gives the property for the entire number. We can think of such machines as implementing a function mapping input strings onto a single element of the output set.

[Figure 7: state diagram. Start state S0. Arcs: S0 to S0 on 0/b; S0 to S1 on 1/n; S1 to S2 on 0/2; S1 to S3 on 1/3; S2 to S4 on 0/2; S2 to S5 on 1/n; S3 to S0 on 0/b; S3 to S1 on 1/n; S4 to S2 on 0/2; S4 to S3 on 1/3; S5 to S4 on 0/2; S5 to S5 on 1/n.]

How was the state diagram of Figure 7 designed? The divisibility of an integer n by 2 and 3 is determined by the value of n mod 6, or the remainder when n is divided by 6. Thus, if n is divisible by 6, then it is divisible by both 2 and 3, and if the remainder of n divided by 6 is 4, then n is divisible by 2 (because 4 is) and not by 3 (because 4 is not). The subscript on each state in Figure 7 represents the value of n mod 6. When the binary representation of n is extended by adding a 0, the result is the binary representation of 2n. When the binary representation of n is extended by adding a 1, the result is the binary representation of 2n + 1. For example, from S2 (n mod 6 = 2), reading a 1 gives a number whose remainder is (2·2 + 1) mod 6 = 5, so the arc leads to S5 with output n. With those facts in hand, constructing the state diagram of Figure 7 is straightforward.

Figure 8 shows an FSM whose input is an arbitrary string of letters, digits, and blanks. The final output indicates whether the string is blank (consisting solely of blanks) or numeric (digits and blanks) or alphabetic (letters and blanks) or alphanumeric (letters, digits, and blanks). Rather than labeling the arcs with all possible inputs, we use 'L' for letter, 'D' for digit, and 'B' for blank. The outputs are 'B' for blank, 'N' for numeric, 'Ab' for alphabetic, and 'An' for alphanumeric. Note that once a string is found to contain both a letter and a digit, then it is certainly alphanumeric regardless of what else is in the string. Hence state 4, which is where we go when we find a string to be alphanumeric, is what is called a sink state.
It is to finite state machines what a Roach Motel is to a roach: once you get there, you can never leave.

[Figure 8: state diagram of the string classifier, with states S0 through S4 and arcs labeled input/output; S4 is the sink state]

6 Acceptor machines

In some cases, we can eliminate the need for the separate output function altogether and instead incorporate the output into the states. Shown below is a modified version of the "even/odd a" machine from Figure 6. This machine, however, has the output symbols associated with the states rather than with the transitions. The interpretation is that if the machine is in a particular state, then it produces the output associated with that state. And the final output is the output associated with the state in which the machine stops rather than the output associated with the last transition. The machine below produces the same output as does the machine in Figure 6. As a bonus, this machine correctly produces the output E for the empty string.

[Figure 9: state diagram. Start state S0 with output even; S1 with output odd. Arcs: S0 to S1 on a; S1 to S0 on a.]

The second example is derived from the FSM in Figure 7. We can determine the division property of the input simply by observing the state that the machine is in when it stops after having read the entire input sequence. If it ends up in state 2 or 4, then the input is evenly divisible by 2 only; if it ends up in state 3, then it is divisible by 3 only. If the machine stops in state 0, then the input is divisible by both 2 and 3. And if it ends up in state 1 or 5, then the input is divisible by neither 2 nor 3. Hence we can modify this machine to incorporate the output into the states as is shown below.

[Figure 10: the machine of Figure 7 with the outputs moved into the states: S0 b, S1 n, S2 2, S3 3, S4 2, S5 n. Arcs are labeled with input digits only.]

We can do the same thing with the FSM in Figure 8. State 1 indicates blanks only; state 2 indicates alphabetic; state 3 indicates numeric; and state 4 indicates alphanumeric.
The revised state diagram is shown in Figure 11 below. Note that we have been able to add a new output, e for empty, associated with s0.

[Figure 11: the machine of Figure 8 with the outputs moved into the states: S0 e, S1 B, S2 Ab, S3 N, S4 An]

A special case of such a machine is one in which the output set contains only two elements. All input sequences are thus mapped onto one or the other output symbol. We can think of the output symbols as the binary digits 0 and 1 and associate the notion of "accept" with 1 and "reject" with 0. Then we can think of the machine as either accepting or rejecting an input depending on whether that input causes the machine to stop in an accepting or rejecting state. Such machines are called acceptor automata. The first three examples below are acceptor automata made from examples we have seen already. These figures also show one further shorthand notation. Instead of writing the output values 0 or 1 in each state, we use a double circle to indicate accepting states (output of 1), and a single circle to indicate rejecting states (output of 0).

The machine in Figure 12 accepts strings of a's that are of even length and rejects strings of odd length. Notice that this machine correctly accepts the empty string. The machine in Figure 13 accepts binary integers that are evenly divisible by 2 or by 3 or by both. The machine in Figure 14 accepts strings that are either alphabetic or numeric and rejects strings that are blank or alphanumeric.

[Figure 12: the even/odd machine of Figure 9 as an acceptor; S0 (even) is accepting, S1 (odd) is rejecting]

[Figure 13: the divisibility machine of Figure 10 as an acceptor; S0, S2, S3, and S4 are accepting]

[Figure 14: the string classifier of Figure 11 as an acceptor; S2 and S3 are accepting]

The machine in Figure 15 might be used in the lexical analysis phase of a compiler. It takes strings of characters and accepts those that are valid identifiers (for example, names of variables or procedures). To be accepted, a string must begin with a letter (indicated by the generic 'L') followed by zero or more letters and digits ('D').
We denote by 'S' any character such as '$' or '?' that is neither a letter nor digit. Note that encountering any 'S' character takes us immediately to the sink state S2, which is a rejecting state and from which there is no exit. Note also that this FSM imposes no limit on the length of the input. Pascal, for example, allows names of up to 255 characters. However, there is no easy way to impose such a limit on an FSM. This is a limitation that is inherent to the FSM and one we will consider in more detail below.

[Figure 15: state diagram. Start state S0; S1 is accepting; S2 is the rejecting sink. Arcs: S0 to S1 on L; S1 to S1 on L,D; S0 to S2 on D,S; S1 to S2 on S; S2 to S2 on L,D,S.]

The FSM in Figure 16 reads strings of characters and accepts only those strings that contain a single unsigned integer, possibly preceded and followed by blanks. For the sake of simplicity, it is customary not to show the sink rejecting state nor the arrows leading to it. Instead, you can assume that if the machine is in some state s and looking at input symbol x, and if there is no arrow labeled x leading from s, then the machine goes to the sink rejecting state and stays there. Thus, for example, if the machine is in state 0 and sees the letter 'a', then it next goes to the rejecting sink state. The five states in this machine can be thought of as being associated with five different classes of strings. In state 0 we have seen nothing yet. If the machine stops in s0 then we know the input was empty. State 1 indicates that we have seen only blanks so far; ending there indicates a string made up of blanks only. State 2 is associated with strings that have zero or more blanks followed by a contiguous substring of digits. In state 3 we know that we have seen zero or more leading blanks followed by a contiguous substring of digits followed by one or more blanks. And state 4 (not shown) is the sink state where we go if we encounter a character other than a digit or blank or if we see more than one contiguous string of digits.
States 2 and 3 are accepting; the others are rejecting.

[Figure 16: state diagram, sink state not shown. Start state S0; S2 and S3 are accepting. Arcs: S0 to S1 on b; S0 to S2 on d; S1 to S1 on b; S1 to S2 on d; S2 to S2 on d; S2 to S3 on b; S3 to S3 on b.]

We can take our integer acceptor one step further by accepting a string of characters that contains one real number optionally preceded and followed by blanks. The real number can be represented either in standard notation or in scientific notation. The construction of this machine is left as an exercise. Try the machine on the following real numbers as well as on some strings that do not contain valid reals.

    100    -100    3.1415    6.02E24    -8.8E-11

7 String Searching

Another practical use of acceptors is in string searching. String searching problems are very common, most notably in text editing. The usual statement of the problem is: "determine whether string X occurs in string Y." (For this problem string X will be referred to as the pattern and string Y as the target.) The naive method of doing this would be to write a loop that goes through the target string a character at a time and checks to see if the pattern occurs beginning at that character. After the complete pattern has been compared to the target, go to the next character in the target string and start again. This simple algorithm (which could be simplified somewhat by using the substring facility of Java) is implemented by the following code, which we will call algorithm A:

7.1 Naive String Search (Algorithm A)

    // Determine whether one string occurs in another string
    // starting at a specified point.
    // Does s1 match a substring of s2 starting in position pos?
    public boolean match(String s1, String s2, int pos)
    {
        // pre: true
        // post: Returned value is true iff s1 matches a substring
        //       of s2 starting in position pos.

        // Trivial case: not enough room in s2 for a match.
        if (s1.length() > s2.length() - pos)
            return false;

        // Advance i while the characters agree; the empty loop body is intentional.
        int i;
        for (i = 0; i < s1.length() && s1.charAt(i) == s2.charAt(i + pos); i++) {}
        return i == s1.length();
    }

    // Find and report all matches of pattern in target.
    public void findAllMatches(String pattern, String target)
    {
        // pre: true
        // post: All occurrences of pattern in target have been reported.
        for (int i = 0; i < target.length(); i++)
        {
            // inv: All occurrences of pattern in target starting in
            //      positions less than i have been reported.
            if (match(pattern, target, i))
                System.out.println("Match found starting in position " + i);
        }
    }

This simple solution can be inefficient because, in the worst case, most characters in the target are examined m times: once to see if they could be the first character in an occurrence of the pattern, once to see if they could be the second character in an occurrence of the pattern that began one symbol to the left, etc. Thus the number of character-character comparisons done in the worst case is proportional to mn, where m is the length of the pattern and n is the length of the target.

A way to avoid the multiple comparisons of each character is for the program to remember some information about the characters that have been read so far. No characters more than m positions to the left of the character currently being examined can affect whether or not this character is part of an occurrence of the pattern. This is because these earlier characters are separated from the current character by a distance which is longer than the length of the pattern. Therefore all we need to know is, at most, what the last m characters read were. Since m is a finite number, there are only a finite number of possible combinations for these m characters to have had. Therefore, this information can be stored in the states of a finite state machine. This means that a finite state machine can be used to solve the string searching problem.
What is needed is a finite state machine that will read an input string and accept the string if it contains the pattern. Since a finite state machine doesn't go back and reread any characters of the input string, the finite state machine will read each character of the target exactly once. Clearly, this is better than algorithm A (see footnote 2). The program corresponding to this finite state machine would look something like this:

7.2 Finite State String Search (Algorithm B)

    var state = initialState;
    for (int i = 0; i < target.length(); i++)
    {
        // inv: We'll discuss this shortly!
        state = stateTransition(state, target.charAt(i));
        // The machine accepts as it reads the last character of an occurrence,
        // so i is the position where the match ends.
        if (accepting(state))
            System.out.println("Match found ending in position " + i);
    }

Now, all that remains to be specified is the state transition function of the finite state machine, but note that the finite state machine is determined completely by the pattern. The following is a finite state machine that finds the first occurrence of the pattern "123" in target strings of digits.

Footnote 2: Although this algorithm is better than algorithm A, it is not the best we can do. Boyer and Moore have developed a string matching algorithm that is faster, by a linear factor, than this one.

[Figure 17: state diagram of the automaton for the pattern "123", with states s0 through s3]

A high-level description of the automaton is that as it reads characters of the target string which might be part of an occurrence of the pattern "123" it proceeds straight across the diagram from left to right. Whenever it finds a character that does not fit the pattern it must retreat some number of steps. The number of steps it retreats depends on what the previously read target characters were, which is the same thing as saying that it depends on what state the automaton is in.

For a more detailed view, trace what happens with the input string "2122123". The machine starts in the initial state s0. The first character read is "2".
Since the pattern starts with the character "1", what has been read so far can't be the beginning of an occurrence of the pattern in the target. Therefore the machine stays in state s0. Now read the character "1". This is the first character of the pattern string. This could be at the beginning of an occurrence of the pattern so the machine goes to state s1. Likewise, read the next character which is "2". At each point this still might be an occurrence of the pattern, so the machine moves to s2. Now read a "2". This is not the next character of the pattern, which is a "3". Therefore, this is not an occurrence of the pattern and the machine must go backwards two steps to state s0. Now read in the next character, which is "1". As before, the machine moves to state s1. Continuing on, the machine reads the "2" and goes into state s2, and then finally reads the "3" and goes into state s3. This is the accepting state, so upon reaching it the machine reports that it has found an occurrence of the pattern. At this point it has read the entire target string, so the task is finished. From this we can deduce the loop invariant for the corresponding program: INV. The machine is in state si (0 <= i <= 3) if and only if i is the largest value such that the last i characters of the target string that were read are equal to the first i characters of the pattern string. Thus if the machine is in state s0 (as it is when it hasn't yet read any characters of the target string), then it has matched 0 characters of the pattern. If it is in state s2, then the last 2 characters read match the first 2 characters of the pattern. And if we are in s3, then the entire pattern must have been found in the target. The simplicity of this loop invariant should by itself be enough to show that this is a good way to solve the string searching problem. 
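The invariant pins down the transition function for the pattern "123" completely, so Algorithm B can be filled in for this one pattern. The Java sketch below is an illustration under stated assumptions, not a transcription of Figure 17: the class name, the method firstMatchStart, and the convention of returning -1 when there is no match are hypothetical, and the transitions are derived from the invariant. They do agree with the trace of "2122123" given above.

```java
// Hypothetical sketch: Algorithm B specialized to the pattern "123".
final class Find123Sketch {

    // State i means: i is the largest value such that the last i characters
    // read equal the first i characters of the pattern "123".
    static int stateTransition(int state, char c) {
        if (c == '1') return 1;               // a 1 always begins a new candidate
        if (c == '2' && state == 1) return 2; // the last two characters are "12"
        if (c == '3' && state == 2) return 3; // the last three characters are "123"
        return 0;                             // no prefix of the pattern ends here
    }

    // Returns the position where the first occurrence of "123" starts,
    // or -1 if the target contains no occurrence.
    static int firstMatchStart(String target) {
        int state = 0;
        for (int i = 0; i < target.length(); i++) {
            state = stateTransition(state, target.charAt(i));
            if (state == 3) return i - 2; // match ends at i, so it starts at i - 2
        }
        return -1;
    }

    public static void main(String[] args) {
        System.out.println(firstMatchStart("2122123")); // prints 4
    }
}
```

Each character of the target is examined exactly once, and the only thing carried from one character to the next is the state.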
A problem which we haven't talked about is how to use this method if the pattern to be searched for is not known in advance. In this case, the transition function cannot be prepared ahead of time, but will have to be computed as part of the searching program. This is somewhat harder, but still easy enough to make the finite automaton method worthwhile. This algorithm, known as the Knuth-Morris-Pratt (KMP) string matching algorithm, is a well known application of finite state machines. The KMP algorithm is somewhat more complex than we've let on. For any pattern string, it constructs a finite automaton with which to process the target string.

8 What a FSM cannot do

Given a finite input set, I, we denote by I* the set of all possible finite length input strings made from elements of I. While every string in I* is of finite length, the set I* itself is infinite. Any acceptor automaton M with input set I divides I* into two subsets: those strings that are accepted by M and those that are rejected. The two subsets are disjoint and their union is I*. We refer to the accepted subset as the language accepted by the machine.

We have seen in the examples above a variety of languages that can be accepted by acceptor automata. The question arises as to whether any language can be accepted by an acceptor automaton. That is, given an arbitrary division of I* into two disjoint subsets A and B, does there exist an acceptor automaton that accepts exactly A and rejects B? The answer is most emphatically 'No'. There are many languages that cannot be accepted by an acceptor automaton.

To get a feel for the kinds of languages that cannot be accepted by acceptor automata, consider the language of algebraic expressions consisting of single letter variable names and the operators +, -, * and /. The very simple machine in Figure 18 accepts this language.
Figure 19 shows a slightly more complex machine that accepts arithmetic expressions with one level of parenthesization allowed. But no acceptor automaton can accept arithmetic expressions that contain unbounded parenthesization.

Figure 18 [state diagram: states S0, S1, S2; transitions labeled L and op]

Figure 19 [state diagram: states S0, S1, S2 at each of two levels; transitions labeled L, op, ( and )]

To see why this is true, let’s assume the contrary. That is, let’s assume that an acceptor automaton M accepts the language of valid arithmetic expressions with no limit on the level of parentheses. Since M is a FSM, it must have a finite number of states, say n states. Now, consider the expression (^n a )^n. This is a string of n left parentheses followed by the letter a and followed by n right parentheses. It is a valid arithmetic expression and should thus be accepted by M. If we trace the action of M operating on this string, we will see that M visits some state si at least twice while processing the left parentheses. We know this must be true since M undergoes n state transitions while processing the n left parentheses and hence visits n+1 states. Since there are only n states in M, one of the states, say si, must be visited at least twice.

This is an example of the pigeon hole principle. It derives its name from the pigeon holes used by post office workers to sort mail. Simply stated, if you have lots of letters to be put into only a few holes, then at least one of the holes must receive more than one letter. More formally, if you have a set of size n, and draw n+1 samples from this set (with replacement), then at least one of the elements of the set will be drawn at least twice. In the case of a FSM, processing an input of length n takes the machine through n+1 states: the start state plus the n states that are arrived at via the n state transitions. If the machine has only n states, at least one state must have been visited at least twice.
Given that M visits si twice, we can divide the string of left parentheses into three parts: x, y, and z. The first substring x contains the parentheses that take us from the start state s0 to the first occurrence of si. It is possible that x is empty if si is in fact s0. Then substring y takes us from the first occurrence of si to the second occurrence of si. This substring must contain at least one symbol. And the substring z simply contains the rest of the left parentheses. It takes the machine from si to sj, the state that M is in after seeing all of the left parentheses; z might be empty too.

Now, consider the string xz a )^n, that is, xz followed by the letter a and n right parentheses. Since y contains at least one parenthesis, the string xz has fewer left parentheses than does xyz. Hence xz a )^n contains more right parentheses than left parentheses; it is invalid and should be rejected. But x takes M from s0 to si, and z takes M from si to sj. Thus in processing either string, xyz a )^n or xz a )^n, M is in state sj with a )^n remaining to be seen. Since the action of a FSM is completely determined by its state and the input, M will do the same thing on both strings: it will either accept both or reject both. But that’s an error; M was supposed to accept one and reject the other.

And so we have arrived at a contradiction. We assumed that M existed and have shown that any M that alleges to accept the language will err by either accepting an invalid string or rejecting a valid string. Hence our only conclusion is that the assumption that M exists must be false.

Intuitively, the weakness in finite state machines is that they have only a bounded amount of memory. If a task requires more than that amount of memory, the FSM cannot handle the task. For example, matching arbitrarily nested parentheses takes an unbounded amount of memory and hence is beyond the capability of the FSM. Similarly, while a FSM can accept strings that contain a single integer, no FSM can determine the value of an arbitrary integer.
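The argument above can be tried out on any concrete machine. The Java sketch below is only an illustration under stated assumptions: the machine is given as a table delta[state][symbol] over the three symbols '(', 'a' and ')', the 5-state table in main is an invented "confused" acceptor (not one of the figures), and the names PumpingDemo, findLoop and pumpedPair are hypothetical. It finds the lengths of x and y by watching for the first repeated state, builds the valid string and the shortened invalid string, and shows that the machine finishes both in the same state.

```java
import java.util.Arrays;

public class PumpingDemo {
    // Column indices of the transition table for the three input symbols.
    static final int LP = 0, A = 1, RP = 2;

    // Run the machine from state start over input; return the final state.
    static int run(int[][] delta, int start, String input) {
        int state = start;
        for (char c : input.toCharArray()) {
            int sym = (c == '(') ? LP : (c == 'a') ? A : RP;
            state = delta[state][sym];
        }
        return state;
    }

    // Feed left parentheses until some state repeats.
    // Returns { length of x, length of y } as defined in the proof.
    static int[] findLoop(int[][] delta, int start) {
        int n = delta.length;                 // number of states
        int[] firstSeen = new int[n];         // parentheses read at first visit
        Arrays.fill(firstSeen, -1);
        int state = start;
        firstSeen[state] = 0;
        for (int read = 1; read <= n; read++) {
            state = delta[state][LP];
            if (firstSeen[state] >= 0) {
                return new int[] { firstSeen[state], read - firstSeen[state] };
            }
            firstSeen[state] = read;
        }
        // n transitions visit n+1 states, so a repeat must have occurred.
        throw new IllegalStateException("no repeated state: impossible");
    }

    static String repeat(char c, int k) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < k; i++) sb.append(c);
        return sb.toString();
    }

    // Returns { valid string xyz a )^n, invalid string xz a )^n }.
    static String[] pumpedPair(int[][] delta, int start) {
        int n = delta.length;
        int yLength = findLoop(delta, start)[1];
        String valid = repeat('(', n) + "a" + repeat(')', n);
        String invalid = repeat('(', n - yLength) + "a" + repeat(')', n);
        return new String[] { valid, invalid };
    }

    public static void main(String[] args) {
        // Invented 5-state machine; state 3 is accepting, state 4 is a sink.
        int[][] delta = {
            { 1, 3, 4 },   // s0: nothing read yet
            { 1, 2, 4 },   // s1: one or more '(' read (depth is not counted)
            { 4, 4, 3 },   // s2: 'a' read inside parentheses
            { 4, 4, 3 },   // s3: accepting; extra ')' are tolerated
            { 4, 4, 4 }    // s4: sink
        };
        String[] pair = pumpedPair(delta, 0);
        System.out.println(pair[0] + " ends in state " + run(delta, 0, pair[0]));
        System.out.println(pair[1] + " ends in state " + run(delta, 0, pair[1]));
    }
}
```

Whatever table is supplied, the two strings end in the same state, so the machine gives the same verdict on a valid expression and an invalid one. With the invented table above, both end in the accepting state s3, so this particular machine errs by accepting the invalid string.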
This weakness in the power of the FSM clearly limits what we can do with it. However, the basic notion of the FSM can be extended and used as a control mechanism in more powerful computational models.

9 Extensions to the basic model

The basic FSM model is a useful theoretical model of computation, but it is really too weak for most practical problems. Fortunately, the restrictions imposed by the basic FSM model turn out to be an artificial limit on how we use the model. We can easily extend the basic FSM model to provide a very powerful and useful control structure by allowing the machine to do arbitrary computation in each state, including giving the machine access to arbitrary data structures such as counters, arrays, lists, etc. The FSM serves as a control structure much like loops and alternative selection, and the statements inside the FSM control structure, just like the statements inside a loop or select, are unrestricted.

Consider, for example, the problem of determining whether an algebraic expression is properly parenthesized. This cannot be done in the pure FSM model, because parentheses can be nested arbitrarily deep, and when the machine has read more left parentheses than it has states, it will necessarily become 'confused.' But a simple machine with one control state and a counter can handle this problem as follows:

    s0 --[ '(' and counter ≥ 0 / +1 ]--> s0
    s0 --[ ')' and counter > 0 / -1 ]--> s0
    s0 --[ ')' and counter = 0      ]--> fail

    Parenthesis checker. Accept if, at end of string, the automaton is in the final state and the counter = 0.

The initial state of the machine is s0, with the counter initialized to 0. As each left or right parenthesis is encountered in an input string, the next state is chosen based both on the present state and the value of the counter. When in state s0, if a '(' is read, the counter is incremented; if a ')' is read while the counter is positive, the counter is decremented.
If the value of the counter is ever 0 and a ')' is read, the string is not well-formed and the machine enters the 'fail' state. And if the counter is not equal to 0 when the whole string has been processed, there were more left parentheses than right ones. Note that both acceptance and the next state function (and in general, the output function as well) use the value of the registers in determining what to do next.

A text editor on a computer provides another illustration of the finite state model. The principal input to a text editor is keystrokes, but the same keystrokes can mean entirely different things to the text editor, depending on the 'state' of the software. Most commonly, keystrokes are text to be inserted into the document, but they may represent the name of the file under which the document is to be stored, or a character sequence to be searched for within the document, or instructions on how the document is to be formatted for printing. Thus, the text editor can be considered to have many states, and its reaction to a sequence of keystrokes is dependent on which state it is in.

The user interface for the UCSD Pascal System for the Apple II computer has three states: system, editor, and filer. What the interface does in response to a user’s input depends both on the input and on the state. For example, typing the character “e” in system state causes the system to go to the editor state. Once in the editor, typing an “e” adds that character to the file currently being edited. And typing an “e” in the filer state produces an extended listing of the files on the current volume.

A program skeleton for using a general FSM control structure is given below. Here, the FSM has been implemented using the Java switch statement, and the end of input is indicated by a sentinel.
Note that the statements inside the case statement (indicated as "process input symbol in state si") are arbitrary statements, including possibly compound statements, or even another finite state machine construct. The next state function is a function not only of the state si and the input symbol, but also the other parts of the state: the values of registers, counters, arrays, etc.

    var state = startState;
    while (true) {
        get the next input symbol
        // inv: all symbols preceding the current input have been processed
        //      in the correct state, and the current state reflects the
        //      input history.

        // Stop FSM at end of input.
        if (input symbol == sentinel) break;

        switch (state) {
            case s0: process input symbol in state s0;
                     state = stateTransition(state, input symbol);
                     break;
            case s1: process input symbol in state s1;
                     state = stateTransition(state, input symbol);
                     break;
            ...
            case sn: process input symbol in state sn;
                     state = stateTransition(state, input symbol);
                     break;
        }
    }

As an example of the application of this control mechanism, consider a more realistic version of the integer acceptor machine shown previously in Figure 16. Instead of just accepting or rejecting the input, we also want to determine the value of the integer if the string is valid. In particular, we want to read an input string and first, determine if it is empty (contains no characters), blank (contains only blanks), valid (contains a single unsigned integer), or invalid (anything other than empty, blank or valid). And second, if the string contains a valid integer, determine its value. We will ignore for now the possible problem of the integer being too large.

Integer Reader 1 below shows the traditional way of approaching this problem. It is a sieve algorithm: a series of tests. As the input passes each test, it goes on to the next. If it fails a test, the algorithm stops. The first test determines if the line is empty.
If it is nonempty, the leading blanks are stripped off. If nothing remains after blank stripping, then the string must have been blank. If the string is non-blank, then trailing blanks are stripped. The remaining characters are then tested to see if they are all digits. And if they are, then the value is accumulated. To convert a character digit to its integer counterpart, we first cast the character to an integer and then subtract the integer version of '0'. Hence ((int)'3')-(int)'0' has the integer value 3. This algorithm works, but it is somewhat ad hoc and it may look at some of the characters in the string more than once.

Integer Reader 2 does the same thing, but is controlled by exactly the same FSM as was used in the integer acceptor. The computation within the states, particularly in state 2, extends the power of the machine. The state in which the machine stops indicates the result: s0 for empty; s1 for blank; s2 and s3 for valid; and s4 for invalid. If the string is valid, its value is accumulated in the integer variable value. This algorithm makes a single pass over the input and while it is longer than its predecessor, it is easy to write, understand and modify.

9.1 Integer Reader 1

    public static String stripLeadingBlanks(String s) {
        // pre: true
        // post: Returned value is s with leading blanks removed.
        while (s.length()>0 && s.charAt(0)==' ')   // Relies on short-circuit evaluation.
            s=s.substring(1);
        return s;
    }

    public static String stripTrailingBlank(String s) {
        // pre: s contains a non-blank character
        // post: Returned value is s with trailing blanks removed.
        while (s.charAt(s.length()-1)==' ')
            s=s.substring(0,s.length()-1);
        return s;
    }

    public static String reader1(String s) {
        // pre: true
        // post: Returned String gives string type and, if valid,
        //       its value.
        String res=s;

        // See if string is empty.
        if (s.length()==0)
            return "|"+res+"|"+" is empty";

        // String is nonempty; strip leading blanks
        s=stripLeadingBlanks(s);
        if (s.length()==0)
            return "|"+res+"|"+" is all blank";

        // String is not all blank; strip trailing blanks
        s=stripTrailingBlank(s);

        // Check for all digits and accumulate value
        int value=0;
        for (int i=0;i<s.length();i++) {
            if (s.charAt(i)<'0' || '9'<s.charAt(i))
                // Non-digit found.
                return "|"+res+"|"+" is invalid";
            value = value*10+((int)s.charAt(i)-(int)'0');
        }

        // Only digits found.
        return "|"+res+"|"+" is valid: "+value;
    }

9.2 Integer Reader 2

    public static String reader2(String s) {
        // pre: true
        // post: Returned String gives string type and, if valid,
        //       its value.
        String res=s;
        int state=0;    // State of FSM.
        int i=0;        // String index.
        int value=0;    // Value accumulated so far.
        char c;         // Class of the current character.

        while (i<s.length()) {
            // Translate character: ' '->'b' blank
            //                      digit->'d'
            //                      something else->'s'
            if (s.charAt(i)==' ') c='b';
            else if (s.charAt(i)>='0' && s.charAt(i)<='9') c='d';
            else c='s';

            switch (state) {
                // Nothing seen so far.
                case 0: if (c=='b') state=1;
                        else if (c=='d') {
                            state=2;
                            value=(int)(s.charAt(i))-(int)'0';
                        }
                        else state=4;
                        break;
                // Only blanks seen so far.
                case 1: if (c=='d') {
                            state=2;
                            value=(int)(s.charAt(i))-(int)'0';
                        }
                        else if (c=='s') state=4;
                        break;
                // Seeing digits.
                case 2: if (c=='d')
                            value=value*10+(int)(s.charAt(i))-(int)'0';
                        else if (c=='b') state=3;
                        else state=4;
                        break;
                // Valid plus trailing blanks.
                case 3: if (c=='d' || c=='s') state=4;
                        break;
                // Invalid. Sink state.
                case 4: break;
            }
            i++;
        }

        switch (state) {
            case 0: return "|"+res+"|"+" is empty";
            case 1: return "|"+res+"|"+" is all blank";
            case 2: return "|"+res+"|"+" is valid: "+ value;
            case 3: return "|"+res+"|"+" is valid: "+ value;
            default: return "|"+res+"|"+" is invalid";
        }
    }

The real number acceptor (exercise 7) can be similarly extended to determine the value of the real.

9.3 Comment locator

Suppose that you wish to locate the comments in a Java program. There are two cases to consider. Any text following "//" up to a carriage return <cr> is a comment. Any text between '/*' and '*/' is a comment. This text may include line returns. The appropriate states for doing so can be listed as follows:

    0  Outside a comment.
    1  Outside comment, but have just seen a '/'.
    2  Have just seen a second '/'. Now inside comment mode until <cr>.
    3  Have just seen a '*' that followed a '/'. Now inside comment mode.
    4  Just saw a '*' inside a /* ... */ comment.

    State   Input         Next state
    S0      /             S1
    S0      not /         S0
    S1      /             S2
    S1      *             S3
    S1      not / or *    S0
    S2      <cr>          S0
    S2      not <cr>      S2
    S3      *             S4
    S3      not *         S3
    S4      /             S0
    S4      *             S4
    S4      not / or *    S3

    Comment Stripper (state diagram, given here as a transition table)

Given the state diagram, the code is not difficult to write, but it's worth a moment's reflection to consider how difficult the code would be to understand without knowledge of how it arose. Documentation of code based on a FSM should always describe the FSM, either with a diagram or (when a diagram is not feasible) a state transition table, and a similar description of how the output is generated.

    // Strip comments from parameter string.
    public static String commentStrip(String s)
    // pre: true
    // post: Returned String is the inbound string stripped of comments.
    {
        int state = 0;    // Current FSM state.
        String outS="";   // Will hold outbound string.
        char c=' ';       // Current character.

        for (int i=0;i<s.length();i++) {
            c=s.charAt(i);
            switch (state) {
                // Not in comment mode.
                case 0: if (c=='/') state=1;
                        else outS=outS+c;
                        break;
                // Have seen a '/'.
                case 1: if (c=='/') state=2;
                        else if (c=='*') state=3;
                        else {
                            state=0;
                            outS=outS+'/'+c;
                        }
                        break;
                // Have seen second '/'; enter comment mode for the rest
                // of this line.
                case 2: if (c=='\n') {
                            state=0;
                            outS=outS+'\n';
                        }
                        break;
                // Have seen "/*"; enter comment mode until "*/".
                case 3: if (c=='*') state=4;
                        break;
                // In comment mode and have seen a '*'.
                // If next char is '/', leave comment mode.
                case 4: if (c=='/') state=0;
                        else if (c!='*') state=3;
                        break;
            }
        }
        return outS;
    }

9.4 Text Compression

Let’s say we wanted to send a message consisting of only the characters 'a', 'b', and 'c'. Since the message may contain long runs of the same character (for example, “aaaaabaaaaacccccccbbbbbbababc”), we want to try to compress the message for more efficient transmission. A very simple way to do this is to encode any run of three or more of the same character as <count><character> where count is an integer indicating the number of occurrences of character. For example, the message “aaaaabaaaaacccccccbbbbbbababc” would be abbreviated as “5ab5a7c6bababc”. Encoding runs of length two would produce no compression; encoding singletons would actually make the resulting string longer.

To do the compression requires that we keep track of two things: the character seen most recently and the number of consecutive occurrences of that character. The former can be done with a FSM since there are only three characters; keeping track of the count, however, is beyond what a FSM can do and is handled as an extension. The procedures to compress a string and to display the compressed string are shown below.

    public static String output(int count, char c) {
        // Compress a homogeneous string of characters.
        // if count = 1, return c.
        // if count is 2, return cc
        // if count is 3 or more, return compressed form: count+c
        // pre: count >= 1
        if (count == 1) return String.valueOf(c);
        else if (count == 2) return String.valueOf(c)+String.valueOf(c);
        else return count+String.valueOf(c);
    }

    public static String compress(String s) {
        // Compress s by encoding runs of length 3 or more.
        // pre: s is nonempty and contains only 'a', 'b' and 'c'.
        // FSM states: 's' for start.
        //             'a' seeing a's.
        //             'b' seeing b's
        //             'c' seeing c's
        char state = 's';        // FSM state.
        int count=1;             // Current run length.
        String outString = "";   // String to be returned.

        for (int i=0;i<s.length();i++) {
            switch (state) {
                case 's': // Get first character of s.
                    state = s.charAt(0);
                    break;
                case 'a': // Have seen one or more a's.
                    if (s.charAt(i)=='a') count++;
                    else {
                        outString = outString + output(count,'a');
                        count = 1;
                        state = s.charAt(i);
                    }
                    break;
                case 'b': // Have seen one or more b's.
                    if (s.charAt(i)=='b') count++;
                    else {
                        outString = outString + output(count,'b');
                        count = 1;
                        state = s.charAt(i);
                    }
                    break;
                case 'c': // Have seen one or more c's.
                    if (s.charAt(i)=='c') count++;
                    else {
                        outString = outString + output(count,'c');
                        count = 1;
                        state = s.charAt(i);
                    }
                    break;
            }
        }
        // Flush final character(s).
        outString = outString + output(count,state);
        return outString;
    }

The above code for compress, developed from a straightforward FSM model, works well, but has several sections that look very similar. A few moments of thought reveal that the redundancies are due to the fact that the actions in all the states except the start state are nearly identical. The start state can be eliminated by making the initial state the first character of the sequence (thus changing the range of the for loop), and the separate cases for the remaining states can be combined into one, eliminating the redundancies and giving the following code.
    public static String compress(String s) {
        // Compress s by encoding runs of length 3 or more.
        // pre: true
        // post: Returned value is compressed version of s.
        if (s.length()==0) return "";   // Handle empty string.

        char state = s.charAt(0);       // Initialize state.
        int count=1;                    // Current run length.
        String outString = "";          // String to be returned.

        for (int i=1;i<s.length();i++) {
            if (s.charAt(i)==state)
                // Repeat of previous character.
                count++;
            else
            // New character.
            {
                outString = outString + output(count,state);
                state = s.charAt(i);
                count = 1;
            }
        }
        // Flush final character(s).
        outString = outString + output(count,state);
        return outString;
    }

10 Summary

The notion of state and of transition between states based on input is very common, ranging from the children's board game Candyland to traffic lights to many software applications. As we have seen, the basic finite state machine is not powerful enough to handle most of these applications, but is easily extended and is the basis for a very useful and powerful programming paradigm.

11 Exercises

1. Draw the state-transition diagram for the following finite state machine and describe in English what it does.
   Set of input symbols = {a,$}
   Set of output symbols = {a,0,1,2}
   Set of states = {s0,s1,s2}
   Initial state = s0
   output function F:
   state transition function G:
   What will the machine output if the input is:
   i) aaa
   ii) aaaaaaa$
   iii) aa$aaaa$a$$

2. Construct the formal specification for the finite state machine given by the following state transition diagram and explain in English what it does. (In the diagram, the notation on the arrows gives the input symbol followed by the output symbol.)

3. Write a program which, given a pattern string, generates the state transition table to be used by a finite automaton which will find all occurrences of the pattern string in an unspecified target string.

4. Assume that you have been hired by a traffic light manufacturer to design an "intelligent" traffic light. Inputs to your system will include signals from various timers, from sensors that detect the presence of cars in the left-turn and through lanes, and from pedestrian "walk" buttons. The outputs set the lights (including left-turn and pedestrian signals), ring a bell for blind pedestrians, and reset the timers.
   a.) Write an English specification of what your intelligent traffic-light system is to do.
   b.) Construct a finite state machine to perform to these specifications.

5. Binary Coded Decimal, or BCD, is a standard way of encoding decimal numbers in computers. Each decimal digit is stored as a sequence of four bits, with "0000" standing for "0", "0001" for "1", etc. Thus the BCD number "10010011" would represent the decimal number "93". Construct a finite state machine which will translate BCD numbers into decimal digits.
   a.) What are the input symbols?
   b.) What are the output symbols?
   c.) How many states are needed?
   d.) Draw the state transition diagram.

6. Suppose that we have encoded the letters of the alphabet as decimal numbers which are represented in character form. For example, "01" is "a", "02" is "b" and so on. Write a Pascal program which translates a string of digits into a sequence of characters using a finite automaton.

7. a.) Construct a real number acceptor.
   b.) Expand the real number acceptor to calculate the value of the real number. Ignore the possibility of overflow.
   c.) Expand the real number acceptor/calculator so that it accepts numbers with commas to separate out thousands, e.g. "1,000", "3,465,712.3298798", "2,934e7". Do not accept numbers with incorrectly placed commas, e.g. "123,45", ",235.98" or "345.2,345".

8. Literal strings are strings of characters delimited by quotation marks ("). Any character may be contained within a literal string. If a quotation mark is to be represented within a literal string, it is done by using two consecutive quotation marks ("").
   a.) Design an acceptor for literal strings which do not contain any quotation marks.
   b.) Design an acceptor for literal strings which may contain quotation marks.
   c.) Design a finite automaton which reads in a literal string (enclosed in quotation marks) and writes out the literal string within the quotation marks, changing occurrences of "" to ".

9. Construct acceptors for the following sets of strings:
   a.) all strings of a's and b's in which every a is immediately followed by a b
   b.) all strings of a's and b's in which the substring "ab" occurs at least twice
   c.) all strings of a's and b's in which the third character from the end is a b
   d.) all strings of a's and b's which contain either the pattern "aab" or the pattern "baa"

10. Describe informally the strings accepted by the machines given by the following diagrams:
   a.)
   b.)

11. Construct a finite state machine which is an acceptor for valid telephone numbers. Valid telephone numbers consist of the following:
   a.) If the first digit is a "0" then a call is operator assisted. A "0" is itself a legal number, as is a "0" followed by any other legal number.
   b.) If the first digit is not a "0" and the second digit is neither "0" nor "1" then it is a local number and should be exactly seven digits long.
   c.) If the first digit is not a "0" and the second digit is a "0" or a "1" then it is a long-distance number and should be exactly ten digits long.

12. Doctor Victor Frankenstein says: "The monster has been very difficult to deal with lately. It is almost impossible to wake him up in the morning. The only thing that will rouse him is 10,000 volts applied to the bolts on his neck. This wakes him up, but unfortunately it never fails to make him enraged, and he goes out and terrorizes the village. Whenever this happens, I send Igor out to calm him down. If this is successful, then the monster becomes docile and will help me with my experiments. If not, then the villagers threaten him with sticks and pitchforks, and he gets frightened and retreats back into the castle where he falls asleep. When he is docile, his only problem is that he is too eager to help and gets in the way. When this happens, I have Igor sing to him. Under this stimulus he becomes sentimental and sits and hums to himself. If Igor keeps singing, the monster will fall asleep, but if Igor stops singing the monster becomes docile and helpful again."
   a.) Consider the monster as a finite automaton. What are the states? What are the inputs and outputs?
   b.) Draw the monster's state transition diagram.

13. Construct a finite automaton that takes as input strings of “0”s and “1”s and accepts those strings that contain the substring "011010".

Chapter 14: The finite state control structure

    1 Analogy
    2 Introduction
    3 The basic finite state machine model
      3.1 Example: the stamp machine
    4 Implementing a FSM
    5 Final output machines
    6 Acceptor machines
    7 String searching
      7.1 Naive String Search (Algorithm A)
      7.2 Finite State String Search (Algorithm B)
    8 What a FSM cannot do
    9 Extensions to the basic model
      9.1 Integer Reader 1
      9.2 Integer Reader 2
      9.3 Comment locator
      9.4 Text Compression
    10 Summary
    11 Exercises