Powerpoint: Regular grammars, Trinucleotide repeats, PROSITE

Transformational Grammars
“Colourless green ideas sleep furiously”
- Noam Chomsky
We might ask “Is this novel sentence (or sequence!)
grammatical?” i.e., does the language described by
some grammar validly contain this sentence?
Chomsky turned this question on its head and instead
asked: “Could the grammar we’re considering have
possibly generated this sentence?”
He developed finite formal machines (“grammars”) that can
theoretically recursively enumerate the infinitude of possible
sentences of the corresponding language.
Transformational Grammars
The Chomsky hierarchy of grammars
Unrestricted
Context-sensitive
Context-free
Regular
Slide after Durbin, et al., 1998
The more deeply nested the grammar, the simpler its rules.
The innermost grammars are the easiest to parse, but are also the most restricted.
Regular Grammars
Symbols and Productions (A.K.A “rewriting rules”)
All transformational grammars are defined by their set of
symbols and the production rules for manipulating
strings consisting of those symbols
Only two types of symbols:
• Terminals (generically represented as “a”)
• these actually appear in the final observed string (so
imagine nucleotide or amino acid symbols)
• Non-terminals (generically represented as “W”)
• abstract symbols – easiest to see how they are used
through example. The start state (usually shown as “S”) is a
commonly used non-terminal
The non-terminals are often used as place holders that
disappear from the final string
Regular Grammars
Symbols and Productions (A.K.A “rewriting rules”)
Only two productions are allowed in a regular grammar!
W→ aW
W→ a
We often also use a special terminal symbol “e”, which is
used to denote the null string and to end a production…
W→ e
Don’t freak out! It’s easier to demonstrate how this
all works than it is to describe!
Regular Grammars
Symbols and Productions (A.K.A “rewriting rules”)
Here’s a trivial regular grammar that can produce
all possible nucleotide sequences:
S→ AS
S→ GS
S→ CS
S→ TS
S→ e
W = {S = "Start"}
a = {A,G,C,T,e}
Imagine we always start with S -- then we can repeatedly choose any of
the valid productions, with S being replaced each time by the string on
the right hand side of the production we’ve chosen…
Regular Grammars
Symbols and Productions (A.K.A “rewriting rules”)
Here’s the same grammar written compactly, with “|” separating
the alternative productions:
S→ AS|CS|GS|TS|e
W = {S = "Start "}
a = {A,G,C,T,e}
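The derivation described above can be sketched in a few lines of Python. This is only an illustration and not part of the original slides: the names PRODUCTIONS and derive are hypothetical, the empty string stands in for e, and productions are chosen uniformly at random (the grammar itself says nothing about probabilities).

```python
import random

# Right-hand sides of S -> AS | CS | GS | TS | e.
# Each entry is (terminal, next non-terminal); ("", None) stands for S -> e.
PRODUCTIONS = {
    "S": [("A", "S"), ("C", "S"), ("G", "S"), ("T", "S"), ("", None)],
}

def derive(rng, start="S"):
    """Rewrite the current non-terminal until it disappears from the string."""
    sequence = ""
    non_terminal = start
    while non_terminal is not None:
        terminal, non_terminal = rng.choice(PRODUCTIONS[non_terminal])
        sequence += terminal
    return sequence

rng = random.Random(0)  # seeded only so that repeated runs give the same strings
samples = [derive(rng) for _ in range(5)]
print(samples)
```

Every string printed consists only of A, C, G and T (possibly none at all), because S is the only non-terminal and it vanishes as soon as S → e is chosen.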
Protein motifs as regular grammars
“Classic” PROSITE motifs
RU1A_HUMAN    SRSLKMRGQAFVIFKEVSSAT
SKLF_DROME    KLTGRPRGVAFVRYNKREEAQ
ROC_HUMAN     VGCSVHKGFAFVQYVNERNAR
ELAV_DROME    GNDTQTKGVGFIRFDKREEAT
Slide after Durbin, et al., 1998
RNP-1 Motif
[RK]-G-{EDRKHPCG}-[AGSCI]-[FY]-[LIVA]-x-[FYM]
S → rW1 | kW1
W1 → gW2
W2 → [afilmnqstvwy]W3
W3 → [agsci]W4
W4 → fW5 | yW5
W5 → lW6 | iW6 | vW6 | aW6
W6 → [acdefghiklmnpqrstvwy]W7
W7 → f | y | m
Does this remind you of anything we’ve seen before?
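One way to read the question above: a PROSITE pattern translates almost symbol for symbol into a regular expression. The sketch below uses Python's standard re module; the constant name RNP1 is hypothetical, and the pairing of sequence names with sequences is assumed from the order in which they appear on the slide.

```python
import re

# [RK]-G-{EDRKHPCG}-[AGSCI]-[FY]-[LIVA]-x-[FYM]
# PROSITE {..} means "anything except", written [^..] in a regular expression.
# PROSITE x means "any residue"; "." is used here (it matches any character).
RNP1 = re.compile(r"[RK]G[^EDRKHPCG][AGSCI][FY][LIVA].[FYM]")

sequences = {
    "RU1A_HUMAN": "SRSLKMRGQAFVIFKEVSSAT",
    "SKLF_DROME": "KLTGRPRGVAFVRYNKREEAQ",
    "ROC_HUMAN": "VGCSVHKGFAFVQYVNERNAR",
    "ELAV_DROME": "GNDTQTKGVGFIRFDKREEAT",
}

hits = {}
for name, seq in sequences.items():
    match = RNP1.search(seq)
    hits[name] = match.group(0) if match else None
    print(name, hits[name])
```

Each of the four sequences contains exactly the eight-residue stretch that the grammar above would generate, for example RGQAFVIF in the first one.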
Automata
Formal grammars are generative. However, each Chomsky
grammar can be parsed using a corresponding abstract
computational machine, or automaton
Grammar                      Parsing automaton
Regular grammar              Finite state automaton
Context-free grammar         Push-down automaton
Context-sensitive grammar    Linear bounded automaton
Unrestricted grammar         Turing machine
The automata for the two most general grammars are of great theoretical
interest but are of less practical significance for us because of the time and
space complexity of the algorithms – their decision problems may only be
computationally feasible in special cases.
We will focus on the first two only!!
Trinucleotide Repeat Disorders
A family of diseases resulting from a trinucleotide expansion
Fragile X – associated with 200 to 4000 repeats of a CGG
trinucleotide in the FMR-1 gene
Unaffected individuals have typically 5-40 copies, but
individuals with intermediate numbers are considered to
have a “premutation” with variable penetrance
CAG Repeats – at least 9 different “PolyQ” disorders
have been identified so far. Most are autosomal dominant
Huntington disease – affected individuals have >35 copies
of the CAG repeat in the HD (Huntington disease) gene
Can we identify sequences with well-defined repeat characteristics?
A Finite State Automaton
The FMR triplet repeat considered as a sequence of states
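[State diagram: S -g-> 1 -c-> 2 -g-> 3 -c-> 4 -g-> 5 -g-> 6 -c-> 7 -t-> 8 -g-> e, with two further transitions out of state 6: 6 -a-> 4 and 6 -c-> 4]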
The FMR triplet regular grammar:
S → gW1
W1 → cW2
W2 → gW3
W3 → cW4
W4 → gW5
W5 → gW6
W6 → cW7 | aW4 | cW4
W7 → tW8
W8 → g
The grammar generates, the automaton parses
A Finite State Automaton
The FMR triplet repeat considered as a sequence of states
[State diagram: the FMR triplet-repeat FSA, repeated from the earlier slide]
FSAs can be either deterministic, or non-deterministic.
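Because state 6 of our FMR repeat FSA offers two transitions on the
same symbol (c leads to either state 7 or state 4), this is a
non-deterministic FSA.
An automaton in which every state has at most one transition per
symbol, so that any input allows only one possible sequence of
states (the “state path”), is deterministic.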
Note however that there are no probabilities associated
with the state transitions. This FSA is therefore NOT a
probabilistic model or stochastic model.
Finite State Automata
Moore vs. Mealy machines
[State diagram: the FMR triplet-repeat FSA, repeated from the earlier slide]
The FSA shown above is a so-called “Mealy machine”
-- Mealy machines “accept” or “emit” upon transition to a new state
Later we will see and use examples of “Moore machines”
-- Moore machines instead “accept on state”
Moore and Mealy machines are always interconvertible.
Think about ways to redraw this FSA as a Moore Machine
Finite State Automata
The FMR regular grammar as a Python data structure
This is just one possible embodiment!
states = {
    "Start": [("G", "W1")],
    "W1": [("C", "W2")],
    "W2": [("G", "W3")],
    "W3": [("C", "W4")],
    "W4": [("G", "W5")],
    "W5": [("G", "W6")],
    "W6": [("C", "W7"), ("A", "W4"), ("C", "W4")],
    "W7": [("T", "W8")],
    "W8": [("G", "End")],
}
This dict has keys that are states, and values that are lists of
“acceptance conditions”. The acceptance conditions are in
the format of a tuple with the symbol that would lead to
acceptance, and the state that should be “transitioned to”.
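With the grammar in this form, a few lines of Python can report which states make the FSA non-deterministic. This is a sketch, not the slides' own code; the function name ambiguous_transitions is hypothetical.

```python
from collections import Counter

states = {
    "Start": [("G", "W1")],
    "W1": [("C", "W2")],
    "W2": [("G", "W3")],
    "W3": [("C", "W4")],
    "W4": [("G", "W5")],
    "W5": [("G", "W6")],
    "W6": [("C", "W7"), ("A", "W4"), ("C", "W4")],
    "W7": [("T", "W8")],
    "W8": [("G", "End")],
}

def ambiguous_transitions(states):
    """Return {state: [symbols]} for symbols that lead to more than one state."""
    ambiguous = {}
    for state, transitions in states.items():
        counts = Counter(symbol for symbol, _ in transitions)
        repeated = [symbol for symbol, n in counts.items() if n > 1]
        if repeated:
            ambiguous[state] = repeated
    return ambiguous

print(ambiguous_transitions(states))  # {'W6': ['C']}
```

Only W6 is reported: it accepts C on two different transitions, which is exactly what makes the FMR automaton non-deterministic.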
Reducing an FSA to Python code
The deterministic case
This is fairly straightforward:
• Initialize cur_state to “Start”
• Initialize cur_position in the test sequence to zero
• Initialize result_str to “”
• Iterate over positions in the sequence:
    • Is the symbol at cur_position a valid production from cur_state?
        • No? Failure. Return False
        • Yes! Accept the symbol:
            • concatenate the symbol at cur_position to result_str
            • set cur_state to new_state
            • is cur_state now “End”?
                • Yes! Success! Return result_str
• Exhausted test sequence? Failure. Return False
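The steps above can be sketched as follows. This is one possible reading, not the course's reference code. Because the FMR grammar itself is non-deterministic, the example uses a hypothetical deterministic variant, det_states, in which the second C transition out of W6 has been dropped; the function name parse_deterministic is also an assumption.

```python
# Hypothetical deterministic variant of the FMR dict: the ("C", "W4") branch
# out of W6 is dropped, so each state has at most one transition per symbol.
det_states = {
    "Start": [("G", "W1")],
    "W1": [("C", "W2")],
    "W2": [("G", "W3")],
    "W3": [("C", "W4")],
    "W4": [("G", "W5")],
    "W5": [("G", "W6")],
    "W6": [("C", "W7"), ("A", "W4")],
    "W7": [("T", "W8")],
    "W8": [("G", "End")],
}

def parse_deterministic(states, seq, cur_state="Start"):
    """Return the accepted string, or False if seq is rejected."""
    result_str = ""
    for symbol in seq:
        # Look for the single transition out of cur_state on this symbol.
        new_state = None
        for accepted, target in states[cur_state]:
            if accepted == symbol:
                new_state = target
                break
        if new_state is None:
            return False          # no valid production: failure
        result_str += symbol      # accept the symbol
        cur_state = new_state
        if cur_state == "End":
            return result_str     # success
    return False                  # sequence exhausted before reaching "End"

print(parse_deterministic(det_states, "GCGCGGCTG"))
print(parse_deterministic(det_states, "GCGCGGCT"))
```

The first call reaches “End” and returns the whole sequence; the second runs out of symbols in state W8 and returns False.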
Reducing an FSA to Python code
The non-deterministic case is less straightforward!
• We can no longer just iterate over the test sequence!
• For each symbol in the test sequence, we might have to consider multiple
  valid productions (think loop, yes?)
• We therefore may need to explore “branches” corresponding to these
  alternatives before we find one that is “correct”
Although not necessarily the most efficient way, recursion is an easy
way to explore these branches:
• If a possible production is valid, assume that it is correct by accepting the
  symbol and new state
• Increment the position in the test sequence
• “Success” or “Failure” can easily be propagated back up through the
  recursion by testing the result of the recursive call and returning the
  resulting sequence
• If execution gets past the recursive call test, the branch has failed:
  decrement the position in the test sequence and go to the next possible
  production
• If there are no more productions to consider, we’ve failed: return False
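A minimal recursive sketch of this idea, using the FMR dict from the earlier slide. The function name parse and its pos parameter are assumptions made for illustration. Passing pos + 1 into the recursive call means the position never has to be decremented explicitly: when a branch fails, the caller still holds the old value of pos.

```python
states = {
    "Start": [("G", "W1")],
    "W1": [("C", "W2")],
    "W2": [("G", "W3")],
    "W3": [("C", "W4")],
    "W4": [("G", "W5")],
    "W5": [("G", "W6")],
    "W6": [("C", "W7"), ("A", "W4"), ("C", "W4")],
    "W7": [("T", "W8")],
    "W8": [("G", "End")],
}

def parse(states, seq, pos=0, cur_state="Start"):
    """Return the accepted string, or False if no branch reaches "End"."""
    if cur_state == "End":
        return ""                     # success: nothing more to accept
    if pos >= len(seq):
        return False                  # sequence exhausted before "End"
    for symbol, new_state in states[cur_state]:
        if seq[pos] != symbol:
            continue                  # this production does not apply
        # Assume this production is correct and explore the branch.
        result = parse(states, seq, pos + 1, new_state)
        if result is not False:
            return symbol + result    # propagate success back up
        # Otherwise the branch failed: try the next possible production.
    return False                      # no productions left to consider

print(parse(states, "GCGCGGCGGCTG"))
print(parse(states, "GCGCGGCGG"))
```

In the first call the C seen in state W6 is first tried as W6 → cW7, which fails on the following G; the function then backs up and succeeds with W6 → cW4. The result is compared against False with "is not False" because the empty string returned at “End” is itself falsy.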
Python focus – classes
Defining a class
Like functions, minimally, all we need is a statement
block of Python code that we have given a name!
class I_dont_do_much (object):    # capital letters OK
    # any code you like!!
    pass
Python classes are essentially user-defined
data types
…but it won’t do anything interesting though until we
have specified some data and some methods!
Python focus – classes
The __init__ method
This method corresponds to the “constructor” in other
OOP languages
class FSA (object):
    def __init__(self, states):
        self.states = states
• Methods are defined as functions
• The first argument is always self
• Variables prefixed with self. become instance variables,
  and are visible throughout the namespace of a class instance
Variables declared in the class body, outside of any method, are class
variables: they have the same value in all instances of that class!
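A small sketch of the difference between the two kinds of variable. The class variable alphabet and the two toy state dicts are hypothetical and are there only to make the contrast visible.

```python
class FSA(object):
    alphabet = "ACGT"            # class variable: one copy shared by all instances

    def __init__(self, states):
        self.states = states     # instance variable: one copy per instance

fsa_a = FSA({"Start": [("G", "End")]})
fsa_b = FSA({"Start": [("C", "End")]})

print(fsa_a.alphabet == fsa_b.alphabet)   # both read the shared class variable
print(fsa_a.states == fsa_b.states)       # each instance holds its own states
```

Assigning to fsa_a.alphabet would create a new instance variable on fsa_a alone, leaving FSA.alphabet and fsa_b untouched.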
Python focus – classes
User-defined methods
These are the interface with your class
class FSA (object):
    def __init__(self, states):
        self.states = states
        # initialize some other stuff

    def test(self, seq, cur_state="Start"):
        some_var = 0
        # do some things
        return something
• User methods are defined as functions
• The first argument is always self
• Variable some_var is visible only within the user-defined test method!
Python focus – classes
Using classes
Instantiate a class by invoking its name, and
providing the arguments the __init__
method expects
my_FSA = FSA(my_state_dict)
result = my_FSA.test("AGCTGGGGTTTAATT")
Invoke class methods just by using the instance identifier in
conjunction with the method name using attribute notation!
We can make as many instances of a class as we need!