Chapter 1  Introduction

Grammars -- the Chomsky hierarchy

  TYPE  NAME               EXAMPLE                           RECOGNIZER
  3     Regular            A -> aB or A -> a                 RE, FA
  2     Context Free       A -> ABaB                         PDA
  1     Context Sensitive  AaB -> AbB, with |lhs| <= |rhs|   LBA
  0     Unrestricted       anything                          TM

Programs related to compilers
  Interpreters, assemblers, linkers, loaders, preprocessors, editors,
  debuggers, profilers, project managers (sccs, rcs)

Major data structures
  Tokens, syntax tree, symbol table, literal table, intermediate code,
  temporary files

Passes
  1, 2, 3, or more

Error handling
  When to give up.

Porting
  Compiler for language A   ->  Existing compiler  ->  Running compiler
  written in language B         for B                  for language A

Cross compiler - generates target code for a different machine from the
one on which it runs.

Bootstrapping - write a small, simple translator that can then be used
to construct the complete translator.

Phases of a compiler
  source code -> scanner -> tokens -> parser -> syntax tree
  -> semantic analyzer -> annotated tree -> source code optimizer
  -> intermediate code -> code generator -> target code
  -> target code optimizer -> target code
  The literal table, symbol table, and error handler are shared by all
  phases.

Lexical Analysis

Pattern matching also applies to other areas, such as information
retrieval systems and query languages: programs execute actions
triggered by patterns in strings.

Tasks of the lexical analyzer:
  strip out comments
  remove white space, tabs, and ends of lines
  correlate error messages - associate line numbers with parser error
    messages
  create the listing
  preprocessing may be done

Why separate lexing and parsing?
  simpler - a single grammar can do both, but separating them is cleaner
  faster - lexing does lots of I/O, which can be isolated
  portability - different alphabets can be confined to the lexer

Tokens, lexemes, patterns

  TOKEN      LEXEMES   PATTERN
  const      const     const
  if         if        if
  relation   <, <=     < or <=
  id         pi, d2    letter followed by letters or digits
  int_const  23        digits

Typical tokens: keywords, operators, identifiers, constants, literal
strings, and punctuation symbols.

Some lexical conventions make life tougher. FORTRAN ignores blanks:
  DO 5 I = 1.25   is the assignment   DO5I = 1.25
  DO 5 I = 1,25   is a loop with keyword DO
PL/1 keywords are not reserved:
  IF THEN THEN THEN = ELSE; ELSE ELSE = THEN;

Attributes for tokens
  rel_op matches <, >=, etc., but code generation must know which one;
  int_const matches 0, 1, ...

Lexical errors
  if 'fi' is encountered, lex returns an id, even in the wrong context
  a bad character in the input cannot form a token

Error recovery
  delete characters until the next lexeme
  insert a missing character
  replace a character
  transpose two adjacent characters
  One can compute the minimum number of such transformations required
  to convert the program into a correct one. Normally this is not done;
  sometimes it is done locally.

Three ways of building a lexical analyzer
  use LEX
  write it in a high-level language and use that language's I/O
  write it in assembler and do your own I/O

Double buffering works quite well: fill one buffer while reading from
the other. This is fine as long as a token is smaller than one buffer.
(A sketch appears below.)

The job of lexical analysis is to scan the input and create tokens.
Interaction with the parser: tokens may be written to a file, or
generated as needed, each time the parser requires one. (A sketch of
the pull-style interface appears below.)

Role of lex with the symbol table: create the symbol table now or
during parsing? Personally, I like during parsing; however, the text
discusses creating it during lexical analysis.
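A minimal sketch of the pull-style interface in C, under some loud
assumptions: Token, TokenKind, and get_next_token are illustrative
names, not from the text, and only id, int_const, and the rel_ops <
and <= are recognized. It carries the token attributes discussed above.

    #include <ctype.h>
    #include <stdio.h>

    /* Illustrative token kinds and attributes. */
    typedef enum { TOK_ID, TOK_INTCONST, TOK_RELOP, TOK_EOF } TokenKind;
    typedef enum { REL_LT, REL_LE } RelKind;   /* which rel_op matched */

    typedef struct {
        TokenKind kind;
        char lexeme[64];   /* the matched text (for ids) */
        int  intval;       /* attribute for int_const */
        RelKind rel;       /* attribute for rel_op */
    } Token;

    /* The parser pulls tokens on demand: one call, one token. */
    Token get_next_token(FILE *in) {
        Token t = { TOK_EOF, "", 0, REL_LT };
        int c = fgetc(in);
        while (c == ' ' || c == '\t' || c == '\n')   /* strip white space */
            c = fgetc(in);
        if (c == EOF)
            return t;
        if (isalpha(c)) {              /* id: letter (letter | digit)* */
            int i = 0;
            do { t.lexeme[i++] = (char)c; c = fgetc(in); }
            while (c != EOF && isalnum(c) && i < 63);
            if (c != EOF) ungetc(c, in);
            t.lexeme[i] = '\0';
            t.kind = TOK_ID;           /* note: 'fi' comes back as an id */
        } else if (isdigit(c)) {       /* int_const: digits */
            int v = 0;
            do { v = 10 * v + (c - '0'); c = fgetc(in); }
            while (c != EOF && isdigit(c));
            if (c != EOF) ungetc(c, in);
            t.kind = TOK_INTCONST;
            t.intval = v;
        } else if (c == '<') {         /* rel_op: < or <= */
            int d = fgetc(in);
            t.kind = TOK_RELOP;
            if (d == '=') t.rel = REL_LE;
            else { t.rel = REL_LT; if (d != EOF) ungetc(d, in); }
        }                              /* anything else: error recovery */
        return t;
    }

The parser simply calls get_next_token whenever it needs another token;
keywords would be handled by looking each id up in a table before
returning it.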
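And a sketch of the double-buffering scheme itself, in its simplest
sequential form (refill on swap; no asynchronous read-ahead). BUF_SIZE,
fill, and next_char are assumed names, and a NUL byte marks the end of
valid data, so this suits text input only.

    #include <stdio.h>

    #define BUF_SIZE 4096                /* tokens must be smaller than this */

    static char  buf[2][BUF_SIZE + 1];   /* two halves, each NUL-terminated */
    static int   cur;                    /* half currently being read */
    static char *fwd;                    /* forward pointer into that half */
    static FILE *src;

    static void fill(int which) {        /* (re)fill one half from the file */
        size_t n = fread(buf[which], 1, BUF_SIZE, src);
        buf[which][n] = '\0';            /* sentinel: end of valid data */
    }

    void init_buffers(FILE *f) {         /* call once before scanning */
        src = f;
        cur = 0;
        fill(cur);
        fwd = buf[cur];
    }

    int next_char(void) {
        if (*fwd == '\0') {              /* exhausted this half: */
            cur = 1 - cur;               /* swap to the other half, */
            fill(cur);                   /* refill it, */
            fwd = buf[cur];              /* and continue reading there */
            if (*fwd == '\0')
                return EOF;              /* file itself is exhausted */
        }
        return (unsigned char)*fwd++;
    }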
Specification of tokens: strings and languages
  string, length of a string, prefix, suffix, substring,
  empty string, empty set

Operations on languages
  union, concatenation, closure; L^0 = {ε}, the set containing only
  the empty string

Regular expressions
  ε is a re denoting {ε}, the set containing the empty string
  a is a re for each symbol a of the alphabet
  a | b is a re
  ab is a re
  a* is a re
  (a) is a re - you can have an extra set of parens

  digit = [0-9]
  nat = digit+
  signedNat = (+|-)? nat
  number = signedNat ("." nat)? (E signedNat)?

To implement these, one needs a finite automaton. Some languages cannot
be recognized by regular expressions (balanced, nested constructs, for
example).

  id -> letter (letter | digit)*

Draw the FSM for this: NFA, then DFA, then minimize. (A hand-coded
version follows.)
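A hand-coded version of that DFA in C, as a sketch; match_id and the
state names are illustrative. Two states suffice: one before the first
letter, one inside the identifier, returning the longest match.

    #include <ctype.h>

    enum { START, IN_ID };   /* states of the DFA for letter (letter | digit)* */

    /* Length of the identifier at the front of s, or 0 if none. */
    int match_id(const char *s) {
        int state = START, i = 0;
        for (;;) {
            unsigned char c = (unsigned char)s[i];
            switch (state) {
            case START:
                if (isalpha(c)) { state = IN_ID; i++; }  /* first char: letter */
                else return 0;                           /* reject */
                break;
            case IN_ID:
                if (isalnum(c)) i++;                     /* letter or digit */
                else return i;                           /* accept: longest match */
                break;
            }
        }
    }

For example, match_id("pi2 + x") returns 3. The number pattern above
works the same way, just with more states for the sign, fraction, and
exponent; converting the regular expression to an NFA, then to a DFA,
and then minimizing mechanizes this construction.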