Transformational Grammars

“Colourless green ideas sleep furiously” - Noam Chomsky

We might ask “Is this novel sentence (or sequence!) grammatical?” i.e., does the language described by some grammar validly contain this sentence?

Chomsky turned this question on its head and instead asked: “Could the grammar we’re considering have possibly generated this sentence?”

He developed finite formal machines (“grammars”) that can theoretically recursively enumerate the infinitude of possible sentences of the corresponding language.

Transformational Grammars
The Chomsky hierarchy of grammars

[Figure: nested sets, from the outermost class to the innermost]
• Unrestricted
• Context-sensitive
• Context-free
• Regular

Slide after Durbin, et al., 1998

The more deeply nested the grammar, the simpler the rules. These are easiest to parse, but are also the most restricted.

Regular Grammars
Symbols and Productions (a.k.a. “rewriting rules”)

All transformational grammars are defined by their set of symbols and the production rules for manipulating strings consisting of those symbols.

Only two types of symbols:
• Terminals (generically represented as “a”)
  • these actually appear in the final observed string (so imagine nucleotide or amino acid symbols)
• Non-terminals (generically represented as “W”)
  • abstract symbols: easiest to see how they are used through example. The start state (usually shown as “S”) is a commonly used non-terminal

The non-terminals are often used as placeholders that disappear from the final string.

Regular Grammars
Symbols and Productions (a.k.a. “rewriting rules”)

Only two productions are allowed in a regular grammar!

    W → aW
    W → a

We often also use a special terminal symbol “e”, which is used to denote the null string and to end a production…

    W → e

Don’t freak out! It’s easier to demonstrate how this all works than it is to describe!
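To make the rewriting idea concrete, here is a minimal sketch of a derivation under productions of the form W → aW and W → e. The two terminals "x" and "y", the `productions` list and the `derive` helper are all hypothetical names chosen for illustration, and the null string "e" is represented by an empty Python string.

```python
# Hypothetical regular grammar with one non-terminal "W":
#   W -> xW | yW | e     (e, the null string, is written "" here)
productions = ["xW", "yW", ""]

def derive(choices, start="W"):
    """Apply the chosen right-hand sides in order and return every
    intermediate string of the derivation."""
    current = start
    steps = [current]
    for rhs in choices:
        if rhs not in productions:
            raise ValueError("not a production of this grammar: %r" % rhs)
        if "W" not in current:
            raise ValueError("no non-terminal left to rewrite")
        # Rewrite the single non-terminal with the chosen right-hand side
        current = current.replace("W", rhs, 1)
        steps.append(current)
    return steps

steps = derive(["xW", "yW", "xW", ""])
print(" => ".join(s if s else "e" for s in steps))
# prints: W => xW => xyW => xyxW => xyx
```

Note how the non-terminal W is only ever a placeholder at the right-hand end of the string, and how choosing the null production removes it and ends the derivation.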
Regular Grammars
Symbols and Productions (a.k.a. “rewriting rules”)

Here’s a trivial regular grammar that can produce all possible nucleotide sequences:

    S → AS
    S → GS
    S → CS
    S → TS
    S → e

    W = {S = "Start"}
    a = {A, G, C, T, e}

Imagine we always start with S. We can then repeatedly choose any of the valid productions, with S being replaced each time by the string on the right-hand side of the production we’ve chosen…

Regular Grammars
Symbols and Productions (a.k.a. “rewriting rules”)

The same grammar in shorthand notation, with “|” separating the alternative productions:

    S → AS | CS | GS | TS | e

Protein motifs as regular grammars
“Classic” PROSITE motifs

    RU1A_HUMAN   SRSLKMRGQAFVIFKEVSSAT
    SKLF_DROME   KLTGRPRGVAFVRYNKREEAQ
    ROC_HUMAN    VGCSVHKGFAFVQYVNERNAR
    ELAV_DROME   GNDTQTKGVGFIRFDKREEAT

Slide after Durbin, et al., 1998

RNP-1 Motif

    [RK]-G-{EDRKHPCG}-[AGSCI]-[FY]-[LIVA]-x-[FYM]

    S  → rW1 | kW1
    W1 → gW2
    W2 → [afilmnqstvwy]W3
    W3 → [agsci]W4
    W4 → fW5 | yW5
    W5 → lW6 | iW6 | vW6 | aW6
    W6 → [acdefghiklmnpqrstvwy]W7
    W7 → f | y | m

Does this remind you of anything we’ve seen before?

Automata

Formal grammars are generative. However, each Chomsky grammar can be parsed using a corresponding abstract computational machine, or automaton.

    Grammar                      Parsing automaton
    Regular grammar              Finite state automaton
    Context-free grammar         Push-down automaton
    Context-sensitive grammar    Linear bounded automaton
    Unrestricted grammar         Turing machine

The automata for the two most general grammars (context-sensitive and unrestricted) are of great theoretical interest but are of less practical significance for us because of the time and space complexity of the algorithms: their decision problems may only be computationally feasible in special cases. We will focus on the first two rows of the table only!
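Returning to the RNP-1 motif above: one way to see that a PROSITE pattern really is a regular grammar is to transcribe it as a regular expression. The sketch below is an illustration only, assuming Python's standard `re` module; the names `RNP1`, `sequences` and `find_motif` are hypothetical. Be aware that `.` and `[^EDRKHPCG]` will match any character at all, not just the 20 amino acid letters.

```python
import re

# PROSITE: [RK]-G-{EDRKHPCG}-[AGSCI]-[FY]-[LIVA]-x-[FYM]
#   {...} means "anything except these", x means "any residue"
RNP1 = re.compile(r"[RK]G[^EDRKHPCG][AGSCI][FY][LIVA].[FYM]")

# The four example sequences from the slide
sequences = {
    "RU1A_HUMAN": "SRSLKMRGQAFVIFKEVSSAT",
    "SKLF_DROME": "KLTGRPRGVAFVRYNKREEAQ",
    "ROC_HUMAN": "VGCSVHKGFAFVQYVNERNAR",
    "ELAV_DROME": "GNDTQTKGVGFIRFDKREEAT",
}

def find_motif(seq):
    """Return the first substring matching the RNP-1 pattern, or None."""
    match = RNP1.search(seq)
    return match.group(0) if match else None

hits = {name: find_motif(seq) for name, seq in sequences.items()}
for name, hit in hits.items():
    print(name, hit)
```

Each bracketed position in the pattern corresponds to one production in the grammar, which is why the two descriptions accept the same strings.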
Trinucleotide Repeat Disorders

A family of diseases resulting from a trinucleotide expansion

• Fragile X: associated with 200 to 4000 repeats of a CGG trinucleotide in the FMR-1 gene. Unaffected individuals typically have 5-40 copies, but individuals with intermediate numbers are considered to have a “premutation” with variable penetrance
• CAG repeats: at least 9 different “PolyQ” disorders have been identified so far. Most are autosomal dominant
• Huntington disease: affected individuals have >35 copies of the CAG repeat in the HD (Huntington disease) gene

Can we identify sequences with well-defined repeat characteristics?

A Finite State Automaton
The FMR triplet repeat considered as a sequence of states

[Figure: state diagram S -g-> 1 -c-> 2 -g-> 3 -c-> 4 -g-> 5 -g-> 6 -c-> 7 -t-> 8 -g-> e, with two arcs labelled a and c returning from state 6 to state 4]

The FMR triplet regular grammar:

    S  → gW1
    W1 → cW2
    W2 → gW3
    W3 → cW4
    W4 → gW5
    W5 → gW6
    W6 → cW7 | aW4 | cW4
    W7 → tW8
    W8 → g

The grammar generates, the automaton parses.

FSAs can be either deterministic or non-deterministic. Because our FMR repeat FSA offers more than one transition out of state 6 on the same symbol (a “c” can lead to either state 7 or state 4), this is a non-deterministic FSA. An automaton with only one possible sequence of states (the “state path”) for any given input is always deterministic.

Note however that there are no probabilities associated with the state transitions. This FSA is therefore NOT a probabilistic or stochastic model.

Finite State Automata
Moore vs. Mealy machines

The FSA shown above is a so-called “Mealy machine”
• Mealy machines “accept” or “emit” upon transition to a new state

Later we will see and use examples of “Moore machines”
• Moore machines instead “accept on state”

Moore and Mealy machines are always interconvertible. Think about ways to redraw this FSA as a Moore machine.

Finite State Automata
The FMR regular grammar as a Python data structure

This is just one possible embodiment!
    states = {
        "Start": [("G", "W1")],
        "W1": [("C", "W2")],
        "W2": [("G", "W3")],
        "W3": [("C", "W4")],
        "W4": [("G", "W5")],
        "W5": [("G", "W6")],
        "W6": [("C", "W7"), ("A", "W4"), ("C", "W4")],
        "W7": [("T", "W8")],
        "W8": [("G", "End")]
    }

This dict has keys that are states, and values that are lists of “acceptance conditions”. Each acceptance condition is a tuple holding the symbol that would lead to acceptance and the state that should be “transitioned to”.

Reducing an FSA to Python code
The deterministic case

This is fairly straightforward:
• initialize cur_state to “Start”
• initialize cur_position in test sequence to zero
• initialize result_str to “”
• iterate over positions in sequence:
  • is the symbol at cur_position a valid production?
    • No? Failure. Return False
    • Yes! Accept symbol
  • set cur_state to new_state
  • concatenate symbol at cur_position to result_str
  • is cur_state now “End”?
    • Yes! Success! Return result_str
• Exhausted test sequence? Failure. Return False

Reducing an FSA to Python code
The non-deterministic case is less straightforward!

• We can no longer just iterate over the test sequence!
• For each symbol in the test sequence, we might have to consider multiple valid productions (think loop, yes?)
• We therefore may need to explore “branches” corresponding to these alternatives before we find one that is “correct”

Although not necessarily the most efficient way, recursion is an easy way to explore these branches:
• If a possible production is valid, assume that it is correct by accepting the symbol and new state
• Increment the position in the test sequence
• “Success” or “Failure” can easily be propagated back up through the recursion by testing the result of the recursive call and, on success, returning the sequence it returned
• If execution gets past the recursive call test, the branch has failed: decrement the position in the test sequence, and go to the next possible production
• If there are no more productions to consider, we’ve failed: return False

Python focus – classes
Defining a class

Like functions, minimally, all we need is a statement block of Python code that we have given a name! Capital letters are OK in the name.

    class I_dont_do_much(object):
        # any code you like!!
        pass

Python classes are essentially user-defined data types…
…but a class won’t do anything interesting until we have specified some data and some methods!

Python focus – classes
The __init__ method

This method corresponds to the “constructor” in other OOP languages.

    class FSA(object):
        def __init__(self, states):
            self.states = states

• Methods are defined as functions
• The first argument is always self
• Variables prefixed with self become instance variables, and are visible throughout the namespace of a class instance
• Variables declared outside of a method have the same value in all instances of that class!

Python focus – classes
User-defined methods

These are the interface with your class.

    class FSA(object):
        def __init__(self, states):
            self.states = states
            # initialize some other stuff

        def test(self, seq, cur_state="Start"):
            some_var = 0
            # do some things
            return something

• User methods are defined as functions
• The first argument is always self
• The variable some_var is visible only within the user-defined test method!

Python focus – classes
Using classes

Instantiate a class by invoking its name, and providing the arguments the __init__ method expects:

    my_FSA = FSA(my_state_dict)
    result = my_FSA.test("AGCTGGGGTTTAATT")

Invoke class methods by using the instance identifier in conjunction with the method name, using attribute notation!

We can make as many instances of a class as we need!
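Putting the pieces together, here is a minimal sketch of how the recursive, non-deterministic test described earlier might be written as a method of the FSA class. This is one possible reading of the bullet points, not a definitive implementation: the extra `pos` parameter and the decision to return the accepted prefix of the sequence are assumptions. The state dict is the FMR grammar shown earlier.

```python
class FSA(object):
    def __init__(self, states):
        self.states = states

    def test(self, seq, cur_state="Start", pos=0):
        """Return the accepted prefix of seq, or False if every branch fails.
        pos is a hypothetical extra argument tracking the current position."""
        if cur_state == "End":
            return seq[:pos]            # success: everything consumed so far
        if pos >= len(seq):
            return False                # exhausted the test sequence
        # Loop over every production available from the current state
        for symbol, new_state in self.states.get(cur_state, []):
            if seq[pos] == symbol:
                # Assume this branch is correct and recurse one position on
                result = self.test(seq, new_state, pos + 1)
                if result is not False:
                    return result       # propagate success back up
                # otherwise the branch failed: try the next production
        return False                    # no productions left to consider

fmr_states = {
    "Start": [("G", "W1")],
    "W1": [("C", "W2")],
    "W2": [("G", "W3")],
    "W3": [("C", "W4")],
    "W4": [("G", "W5")],
    "W5": [("G", "W6")],
    "W6": [("C", "W7"), ("A", "W4"), ("C", "W4")],
    "W7": [("T", "W8")],
    "W8": [("G", "End")],
}

my_FSA = FSA(fmr_states)
print(my_FSA.test("GCGCGGCGGCTG"))     # needs the second "C" branch at W6
print(my_FSA.test("AGCTGGGGTTTAATT"))  # fails at the very first symbol
```

Because `pos` is passed as an argument rather than stored on the instance, "decrementing the position" when a branch fails happens automatically: each recursive call simply returns to a caller that still holds the old value.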