CS 3240 Project 1 Report Kevin Chu (kchu9@gatech.edu) Joshua Robert (jroberts61@gatech.edu) Apurv Jain (ajain93@gatech.edu) Tri Nguyen (gnguyen31@gatech.edu) Abstract Simple Tree (AST) Generator Lexical Input Format: Should be a set of Character classes in the format $<name> <class/exclude set>. Lexical input can contain several of these character classes separated by newlines. Token descriptions will be separated from the character class section by a single new line. A sample input format is listed below: $DIGIT [0-9] $NON-ZERO [^0] IN $DIGIT $CHAR [a-zA-Z] $UPPER [^a-z] IN $CHAR $LOWER [^A-Z] IN $CHAR $TEST a|(bc) $IDENTIFIER $LOWER ($LOWER$DIGIT)* Assumptions: Regex Grammar handles the following operations: +,*,concat,|, (), exclude class and character set and generates a simple AST tree based on priority. Leaf nodes should contain epsilon, null, or a character class. Input for Lexical Input is assume to contain valid character classes and regex expressions for tokens. If you want to use the * or + operator, () are required around the character class. Question marks require an escape character Non-deterministic Finite Automata (NFA) Generator The NFA Generator takes in a list of ASTs for each type of recognized token. It then generates an NFA for each tree and then joins them to form the final language NFA. The NFAs are structured much like we have seen in class, nodes that represent states which transition to other nodes on some input or “epsilon”. NFANodes keep track of their names, a hashmap of transitions from some list of accepted inputs to a list of states. If a node is accepting, it keeps track of what type of token it accepts. An NFAGraph keeps track of it start state, end states, and all the nodes the graph contains. End states are accepting. The NFA generator constructs the NFAs from the AST trees using post-order traversal. Depending on if a node is an input or an operations it treats them differently. Operations are treated much like we were shown to do in class/textbook with epsilon transitions. For example, say we have two primitive NFAs, one with Node0 to Node1 on $LETTER and another N2 to N3 on $DIGIT. If these are concatenated, the first concat the second, the end state of the first then transitions to the start state of the second on “epsilon” and the new overall graph has a start state N0 and end state N3. Operations that are handled are concatenation, alternation, 0 or more replication, 1 or more replications, and 0 or 1 replication. Assumptions: I use “epsilon” for my transitions, this is assuming that string is not going to used in a lexical specification file as an input. This is the only real assumption that was made aside from reasonable assumptions about input and structure from our other packages. Test Cases: I tested on a few cases and made sure null cases were handled in case the parser gave odd nodes. I tested on a few regular expressions which I manually built ASTs for. Later testing the same regex in a lexical spec provided the same results. I tested on these regex, drew out the graphs they provided on node based data, and compared to how I we would expect them to appear: $NAME: $UPPERCASE($LOWERCASE)+ $VAR: ($UPPERCASE|$LOWERCASE)+($DIGIT|$LOWERCASE|$UPPERCASE)+ $EMAIL: ($UPPERCASE|$LOWERCASE|$DIGIT)+"@"($UPPERCASE|$LOWERCASE|$DIGIT)+"."$LO WERCASE$LOWERCASE$LOWERCASE $NULLS: $LOWER*|$UPPER The last regex I built a fairly large tree to make sure I covered all cases where ASTNodes or ASTNode inputs were null. DFA General description: I applied Breadth First Search traversal on the NFAGraph, by continuously having DFAStateFacotry to produce appropriate DFAStateInterface. I supply DFAWalker appropriate methods in DFAInterface to walk on Alphabet, and ensure that multiple DFAWalker can walk on the DFA concurrently Assumptions: Nested tokens are not allowed Test cases: $VAR: ($UPPERCASE|$LOWERCASE)+($DIGIT|$LOWERCASE|$UPPERCASE)+ and other test cases on the creation of DFAState Table Walker Picks up a character at a time from the input file and runs it through the DFA generated earlier. It generates tokens based on the longest match algorithm, wherein if the succeeding character isn’t a part of the transitions from the current state, the match is met and the corresponding token returned. The token type and token text associations are maintained as a list containing a list of strings. Assumptions: It ignores whitespace characters like tabs, new line feed, carriage return and spaces. Code Structure /src/ folder contains the source code of the project. There are 5 main packages, 1. default: It contains Scanner.java which encapsulates the main method and acts as the controller of the project 2. ast_generator: Classes associated with generating an AST 3. nfa_generator: Classes associated with creating a DFA using the AST 4. DFA: Classes required for creating a DFA using the NFA 5. table_walker: Class required to walk through the input file using the DFA generated