CS 3240 Project 1 Report

advertisement
CS 3240 Project 1 Report
Kevin Chu (kchu9@gatech.edu)
Joshua Robert (jroberts61@gatech.edu)
Apurv Jain (ajain93@gatech.edu)
Tri Nguyen (gnguyen31@gatech.edu)
Abstract Simple Tree (AST) Generator
Lexical Input Format: Should be a set of Character classes in the format $<name>
<class/exclude set>. Lexical input can contain several of these character classes
separated by newlines. Token descriptions will be separated from the character class section by
a single new line. A sample input format is listed below:
$DIGIT [0-9]
$NON-ZERO [^0] IN $DIGIT
$CHAR [a-zA-Z]
$UPPER [^a-z] IN $CHAR
$LOWER [^A-Z] IN $CHAR
$TEST a|(bc)
$IDENTIFIER $LOWER ($LOWER$DIGIT)*
Assumptions:
Regex Grammar handles the following operations: +,*,concat,|, (), exclude class and character
set and generates a simple AST tree based on priority.
Leaf nodes should contain epsilon, null, or a character class. Input for Lexical Input is assume to
contain valid character classes and regex expressions for tokens.
If you want to use the * or + operator, () are required around the character class.
Question marks require an escape character
Non-deterministic Finite Automata (NFA) Generator
The NFA Generator takes in a list of ASTs for each type of recognized token. It then generates
an NFA for each tree and then joins them to form the final language NFA.
The NFAs are structured much like we have seen in class, nodes that represent states which
transition to other nodes on some input or “epsilon”. NFANodes keep track of their names, a
hashmap of transitions from some list of accepted inputs to a list of states. If a node is
accepting, it keeps track of what type of token it accepts. An NFAGraph keeps track of it start
state, end states, and all the nodes the graph contains. End states are accepting.
The NFA generator constructs the NFAs from the AST trees using post-order traversal.
Depending on if a node is an input or an operations it treats them differently. Operations are
treated much like we were shown to do in class/textbook with epsilon transitions. For example,
say we have two primitive NFAs, one with Node0 to Node1 on $LETTER and another N2 to N3
on $DIGIT. If these are concatenated, the first concat the second, the end state of the first then
transitions to the start state of the second on “epsilon” and the new overall graph has a start
state N0 and end state N3. Operations that are handled are concatenation, alternation, 0 or
more replication, 1 or more replications, and 0 or 1 replication.
Assumptions:
I use “epsilon” for my transitions, this is assuming that string is not going to used in a lexical
specification file as an input. This is the only real assumption that was made aside from
reasonable assumptions about input and structure from our other packages.
Test Cases:
I tested on a few cases and made sure null cases were handled in case the parser gave odd
nodes. I tested on a few regular expressions which I manually built ASTs for. Later testing the
same regex in a lexical spec provided the same results. I tested on these regex, drew out the
graphs they provided on node based data, and compared to how I we would expect them to
appear:
$NAME: $UPPERCASE($LOWERCASE)+
$VAR: ($UPPERCASE|$LOWERCASE)+($DIGIT|$LOWERCASE|$UPPERCASE)+
$EMAIL:
($UPPERCASE|$LOWERCASE|$DIGIT)+"@"($UPPERCASE|$LOWERCASE|$DIGIT)+"."$LO
WERCASE$LOWERCASE$LOWERCASE
$NULLS: $LOWER*|$UPPER
The last regex I built a fairly large tree to make sure I covered all cases where ASTNodes or
ASTNode inputs were null.
DFA
General description:
I applied Breadth First Search traversal on the NFAGraph, by continuously having
DFAStateFacotry to produce appropriate DFAStateInterface.
I supply DFAWalker appropriate methods in DFAInterface to walk on Alphabet, and ensure that
multiple DFAWalker can walk on the DFA concurrently
Assumptions:
Nested tokens are not allowed
Test cases:
$VAR: ($UPPERCASE|$LOWERCASE)+($DIGIT|$LOWERCASE|$UPPERCASE)+
and other test cases on the creation of DFAState
Table Walker
Picks up a character at a time from the input file and runs it through the DFA generated earlier.
It generates tokens based on the longest match algorithm, wherein if the succeeding character
isn’t a part of the transitions from the current state, the match is met and the corresponding
token returned. The token type and token text associations are maintained as a list containing a
list of strings.
Assumptions: It ignores whitespace characters like tabs, new line feed, carriage return and
spaces.
Code Structure
/src/ folder contains the source code of the project. There are 5 main packages,
1. default: It contains Scanner.java which encapsulates the main method and acts as the
controller of the project
2. ast_generator: Classes associated with generating an AST
3. nfa_generator: Classes associated with creating a DFA using the AST
4. DFA: Classes required for creating a DFA using the NFA
5. table_walker: Class required to walk through the input file using the DFA generated
Download