TDDD16 Compilers and interpreters
TDDB44 Compiler Construction

Formal Languages Part 1
Including Regular Expressions

Peter Fritzson
IDA, Linköpings universitet, 2009.
TDDD16/B44, P Fritzson, IDA, LIU, 2009.

Basic Concepts for Symbols, Strings, and Languages

Alphabet: a finite set of symbols.
Example:
- Σb = { 0,1 }                    binary alphabet
- Σs = { A,B,C,...,Z,Å,Ä,Ö }      Swedish characters
- Σr = { WHILE,IF,BEGIN,... }     reserved words

String: a finite sequence of symbols from an alphabet.
Example:
- 10011             from Σb
- KALLE             from Σs
- WHILE DO BEGIN    from Σr

Properties of Strings in Formal Languages: String Length, Empty String

Length of a string: the number of symbols in the string.
Example:
- x is an arbitrary string, |x| is the length of the string x
- |10011| = 5 according to Σb
- |WHILE| = 5 according to Σs
- |WHILE| = 1 according to Σr

Empty string
- The empty string is denoted ε, |ε| = 0

Properties of Strings in Formal Languages: Concatenation, Exponentiation

Concatenation
- Two strings x and y are joined together: x•y = xy
- Example: x = AB, y = CDE produce x•y = ABCDE
- |xy| = |x| + |y|
- xy ≠ yx in general (not commutative)
- εx = xε = x

String exponentiation
- x^0 = ε
- x^1 = x
- x^2 = xx
- x^n = x•x^(n-1), n >= 1

Substrings: Prefix, Suffix

Example: x = abc
Prefix: substring at the beginning.
- Prefixes of x: abc (improper as the prefix equals x), ab, a, ε
Suffix: substring at the end.
- Suffixes of x: abc (improper as the suffix equals x), bc, c, ε

Languages

A language = a finite or infinite set of strings which can be constructed from a given alphabet.
- Alternatively: a subset of all the strings which can be constructed from an alphabet.
- ∅ = the empty language. NB! {ε} ≠ ∅.
Example: Σ = {0,1}
- L1 = {00,01,10,11}                  all strings of length 2
- L2 = {1,01,11,001,...,111, ...}     all strings which end in 1
- L3 = ∅
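The string operations and the language example above can be tried out directly. The following is a minimal Python sketch, not part of the original slides; the helper names power, prefixes, suffixes and in_L2 are made up for illustration, and Python strings stand in for strings over an alphabet whose symbols are single characters.

```python
def power(x, n):
    """String exponentiation: x^0 is the empty string, x^n = x . x^(n-1)."""
    return "" if n == 0 else x + power(x, n - 1)


def prefixes(x):
    """All prefixes of x, longest (improper) first, empty string last."""
    return [x[:i] for i in range(len(x), -1, -1)]


def suffixes(x):
    """All suffixes of x, longest (improper) first, empty string last."""
    return [x[i:] for i in range(len(x) + 1)]


def in_L2(s):
    """Membership test for L2: strings over {0,1} that end in 1."""
    return set(s) <= {"0", "1"} and s.endswith("1")


# The length of a string depends on the alphabet: over the letter alphabet
# the symbols are single letters, over the reserved-word alphabet the whole
# word WHILE is one symbol. Tuples of symbols model this (an assumption of
# this sketch, not notation from the slides).
while_as_letters = tuple("WHILE")
while_as_reserved_word = ("WHILE",)

print(len(while_as_letters), len(while_as_reserved_word))
print("AB" + "CDE")           # concatenation x.y
print(power("ab", 3))         # exponentiation
print(prefixes("abc"))
print(suffixes("abc"))
print([s for s in ["1", "01", "10", "111", ""] if in_L2(s)])
```

Representing a string as a tuple of symbols keeps the alphabet explicit, which is why the same word WHILE gets two different lengths.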
Closure

Σ* denotes the set of all strings which can be constructed from the alphabet Σ.
Closure types:
- * = closure, Kleene closure
- + = positive closure
Example: Σ = {0,1}
- Σ* = { ε, 0,1,00,01,...,111,101,...}
- Σ+ = Σ* - {ε} = {0,1,00,01,...}

Operations on Languages: Concatenation

L, M are languages.
Concatenation operation: • (or nothing) between languages
- L•M = LM = {xy | x ∈ L and y ∈ M}
- L{ε} = {ε}L = L
- L∅ = ∅L = ∅
Example:
- L = {ab,cd}, M = {uv,yz}
- gives us: LM = {abuv,abyz,cduv,cdyz}

Exponents and Union of Languages

Exponents of languages
- L^0 = {ε}
- L^1 = L
- L^2 = L•L
- L^n = L•L^(n-1), n >= 1

Union of languages
- L, M are languages.
- L ∪ M = {x | x ∈ L or x ∈ M}
- Example: L = {ab,cd}, M = {uv,yz}
- gives us: L ∪ M = {ab,cd,uv,yz}

Closure of Languages

Closure
- L* = L^0 ∪ L^1 ∪ ... ∪ L^∞
Positive closure
- L+ = L^1 ∪ L^2 ∪ ... ∪ L^∞
- L* = {ε} ∪ L+
- L+ = LL* = L* - {ε}, if ε not in L
Example: A = {a,b}
- A* = {ε,a,b,aa,ab,ba,bb,...} = all possible sequences of a and b.
- A language over A is always a subset of A*.

Regular expressions

Regular expressions are used to describe simple languages, e.g. basic symbols, tokens.
- Example: identifier = letter • (letter | digit)*
Regular expressions over an alphabet Σ denote a language (regular set).

Small Language Exercise

Rules for constructing regular expressions

Σ is an alphabet,
- the regular expression r describes the language Lr,
- the regular expression s corresponds to the language Ls, etc.
Each symbol a in the alphabet Σ is a regular expression which denotes {a}.

Regular expression r        Language Lr
ε                           {ε}
a  (a ∈ Σ)                  {a}
union: (s) | (t)            Ls ∪ Lt
concatenation: (s).(t)      Ls.Lt
repetition: (s)*            Ls*
repetition: (s)+            Ls+

- * = repetition, zero or more times.
- + = repetition, one or more times.
- . concatenation can be left out

Priorities
Highest   * +
          .
Lowest    |

Regular Expression Language Examples

Examples: Σ = {a,b}
1. r = a          Lr = {a}
2. r = a*         Lr = {ε,a,aa,aaa, ...} = {a}*
3. r = a|b        Lr = {a,b} = {a} ∪ {b}
4. r = (a|b)*     Lr = {a,b}* = {ε,a,b,aa,ab,ba,bb,aaa,aab,...}
5. r = (a*b*)*    Lr = {a,b}* = {ε,a,b,aa,ab,ba,bb,aaa,aab,...}
6. r = a|ba*      Lr = {a,b,ba,baa,baaa,...} = {a or ba^i | i ≥ 0}

NB! {a^n b^n | n >= 0} cannot be described with regular expressions.
- r = a*b* gives us Lr = {a^i b^j | i,j >= 0}: does not work.
- r = (ab)* gives us Lr = {(ab)^i | i >= 0} = {ε,ab,abab, ... }: does not work.
- Regular expressions cannot "count" (have no memory).

Finite State Automata and Diagrams

Assume:
- regular expression RU = ba+b+ = baa ... abb ... b
- L(RU) = { b a^n b^m | n, m ≥ 1 }
Recognizer: a program which takes a string x and answers yes/no depending on whether x is included in the language.
- The first step in constructing a recognizer for the language L(RU) is to draw a state diagram (transition diagram).

State Transition Diagram (Finite automaton)

[Figure: state diagram (DFA) for b a^n b^m, with start state 0, error state 9, and accepting state 3; edges are labelled a and b]

Input and State Transitions

Start in the starting node 0.
Repeat until there is no more input:
- Read input.
- Follow a suitable edge.
Then check whether we are in a final state. In this case accept the string.
There is an error in the input if there is no suitable edge to follow.
- Add one or several error nodes.
Example of input: baab. Accept when there is no more input and the current state, here state 3, is an accepting state.
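The procedure above (start in node 0, read a symbol, follow an edge, check for an accepting state at the end) can be sketched as a small table-driven recognizer for RU = ba+b+. This is an illustrative Python sketch under assumptions: the state numbers (0 start, 3 accepting, 9 error) follow the diagram, the individual edges are one reading of that diagram and should be checked against it, and treating state 9 as a trap for every missing edge is an assumption of the sketch. The names TRANSITIONS, trace and accepts are hypothetical.

```python
# Edges of a DFA for ba+b+ (assumed reading of the state diagram).
TRANSITIONS = {
    (0, "a"): 9, (0, "b"): 1,   # must start with b
    (1, "a"): 2, (1, "b"): 9,   # at least one a after the first b
    (2, "a"): 2, (2, "b"): 3,   # more a's, or the first trailing b
    (3, "a"): 9, (3, "b"): 3,   # only b's may follow
}
START, ACCEPTING, ERROR = 0, {3}, 9


def trace(x):
    """Return the list of (current state, remaining input) pairs for x."""
    state = START
    steps = [(state, x)]
    for i, symbol in enumerate(x):
        # No suitable edge (including any edge out of state 9): go to the error state.
        state = TRANSITIONS.get((state, symbol), ERROR)
        steps.append((state, x[i + 1:]))
    return steps


def accepts(x):
    """Accept when the input is used up and the final state is accepting."""
    return trace(x)[-1][0] in ACCEPTING


for step, (state, rest) in enumerate(trace("baab"), start=1):
    print(step, state, rest or "ε")
print(accepts("baab"), accepts("bb"), accepts("ab"))
```

Keeping the edges in a dictionary keyed by (state, symbol) mirrors a transition table, so changing the automaton only means changing the data, not the loop.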
Interpret a State Transition Diagram

Recognizer steps for the input baab:

Step   Current State   Input
1      0               baab
2      1               aab
3      2               ab
4      2               b
5      3               ε

When there is no more input: the recognizer is in state 3, which is an accepting state, so baab is accepted.

[Figure: state diagram with start state 0, states 1 and 2, accepting state 3, and error state 9; edges are labelled a and b]

Representation of State Diagrams by Transition Tables

The previous graph is a DFA (Deterministic Finite Automaton).
It is deterministic because at each step there is exactly one state to go to and there is no transition marked "ε".
A regular expression denotes a regular set and corresponds to an NFA (Nondeterministic Finite Automaton).

Transition Table (suitable for computer representation):

State   Accept   Found    Next state a   Next state b
0       no       ε        9              1
1       no       b        2              9
2       no       ba+      2              3
3       yes      ba+b+    9              3
9       no

NFA and Transition Tables

Example: NFA for (b|a)*ab

[Figure: state diagram for (b|a)*ab with start state 0 and states 1 and 2; edges are labelled a and b]

Transition table for (b|a)*ab:

State   a        b      Accept
0       {0,1}    {0}    no
1                {2}    no
2                       yes

It requires more calculations to simulate an NFA with a computer program, e.g. for input ab, compared to a DFA.

Transforming NFA to DFA

Theorem
- Any NFA can be transformed to a corresponding DFA.
When generating a recognizer automatically, the following is done:
- regular expression → NFA.
- NFA → DFA.
- DFA → minimal DFA.
- DFA → corresponding program code or table.

[Figure: DFA for (b|a)*ab with start state 0 and states 1 and 2; edges are labelled a and b]

Small Regular Expression and Transition Diagram/Table Exercise
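The NFA → DFA step for the (b|a)*ab example can be sketched with the usual subset construction. This is a hedged Python illustration, not the course's own implementation: the NFA edges are one reading of the transition table for (b|a)*ab (0 goes to {0,1} on a and to {0} on b, 1 goes to {2} on b, 2 is accepting), the names NFA, nfa_accepts and subset_construction are hypothetical, and ε-closure is left out because this NFA has no ε-transitions.

```python
# NFA for (b|a)*ab: (state, symbol) -> set of possible next states.
NFA = {
    (0, "a"): {0, 1},
    (0, "b"): {0},
    (1, "b"): {2},
}
NFA_START, NFA_ACCEPTING = 0, {2}


def move(states, symbol):
    """All NFA states reachable from any state in `states` on `symbol`."""
    return frozenset(t for s in states for t in NFA.get((s, symbol), set()))


def nfa_accepts(x):
    """Simulate the NFA by tracking the set of states it could be in."""
    current = frozenset({NFA_START})
    for symbol in x:
        current = move(current, symbol)
    return bool(current & NFA_ACCEPTING)


def subset_construction(alphabet="ab"):
    """Build a DFA whose states are sets of NFA states (no ε-transitions assumed)."""
    start = frozenset({NFA_START})
    table, todo = {}, [start]
    while todo:
        states = todo.pop()
        if states in table:
            continue  # already expanded
        table[states] = {symbol: move(states, symbol) for symbol in alphabet}
        todo.extend(table[states].values())
    return start, table


start, dfa = subset_construction()
for states, row in dfa.items():
    accepting = bool(states & NFA_ACCEPTING)
    print(sorted(states), {a: sorted(t) for a, t in row.items()}, accepting)
```

Simulating the NFA carries a whole set of states per input symbol, which is the extra work mentioned above; the subset construction does that set bookkeeping once, in advance, so the resulting DFA follows a single edge per symbol.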