Factorizable Languages - Professor Simon Shepherd

GENOMIC CODING MSc Level Module
(C) 2006,2009 Professor S J Shepherd.
All worldwide rights reserved.
Factorizable languages and Distinct Excluded Blocks
Each genome is a complete language in its own right.
Furthermore, genomes are, by definition, factorizable languages - a factorizable
language being one that can be completely defined by a core set of words that
do not appear in that language.
If we consider an alphabet of symbols Σ, say Σ = {a,c,g,t} for the genome, then
the (infinite) set Σ* is all the strings that can be made from the letters of the
alphabet, including the empty string ε (i.e. the string of zero length).
Any (finite) subset L ⊆ Σ* is said to be a language over the alphabet Σ.
Once the language L has been defined, its complementary set L′ = Σ* – L
contains all the words (i.e. the excluded words) that do not appear in L.
Formally, a language is said to be factorizable if any sub-string of a word x ∈ L is
also in L.
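The closure property above can be checked mechanically on a small example. The following is a minimal sketch in Python; the helper names `factors` and `is_factorizable` and the toy sequence are illustrative assumptions, not part of the module material.

```python
def factors(word):
    """Return every sub-string (factor) of word, including the empty string."""
    n = len(word)
    return {word[i:j] for i in range(n + 1) for j in range(i, n + 1)}


def is_factorizable(language):
    """True if every sub-string of every word in the language is also in it."""
    return all(factors(w) <= language for w in language)


# Toy 'genome', purely illustrative: its language is the set of all its factors.
genome = "acgtacg"
L = factors(genome)

print(is_factorizable(L))        # the factor set of a sequence is closed
print(is_factorizable({"acg"}))  # a lone word without its sub-strings is not
```

Taking the language of a genome to be the set of all its sub-strings is what makes it factorizable by construction.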
For a factorizable language, the set of excluded words L′ = Σ* – L acquires a
‘minimal’ property; that is, it contains a minimal ‘core’ of forbidden words
L″ ⊆ L′ that cannot be further cut into shorter words without producing a word
which is in L.
The members of the set L″ have been termed Distinct Excluded Blocks (DEBs) [11]
and it is these that completely define the language L.
Because of the factorizability properties of such languages, it is possible to write
L′ = Σ* L″ Σ* and so L = Σ* – Σ* L″ Σ*.
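As an illustration of how the core set defines the language, the sketch below extracts the DEBs of a toy sequence by brute force and then tests membership of L using the core set alone. It assumes the language is the set of all sub-strings of the sequence; the names `core_excluded` and `in_language` and the toy sequence are hypothetical, and the exhaustive search is only practical for very short words.

```python
from itertools import product

ALPHABET = "acgt"


def factors(word):
    """Return every sub-string of word, including the empty string."""
    n = len(word)
    return {word[i:j] for i in range(n + 1) for j in range(i, n + 1)}


def core_excluded(language, max_len):
    """Brute-force DEBs up to max_len: absent words whose two longest
    proper sub-strings (drop first letter, drop last letter) are present."""
    debs = set()
    for k in range(1, max_len + 1):
        for letters in product(ALPHABET, repeat=k):
            w = "".join(letters)
            if w not in language and w[1:] in language and w[:-1] in language:
                debs.add(w)
    return debs


def in_language(word, debs):
    """A word belongs to the language iff it contains no DEB as a sub-string."""
    return not any(d in word for d in debs)


genome = "acgtacg"                       # toy sequence, purely illustrative
L = factors(genome)
debs = core_excluded(L, len(genome) + 1)  # no DEB can be longer than this

print(sorted(debs))
print(in_language("gtac", debs), "gtac" in L)
```

For a finite sequence of length n, no DEB can be longer than n + 1, which is why the search stops there.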
Natural (human) languages are not factorizable and are therefore sensibly
defined by dictionaries of permissible words. For example, in English the number
of legitimate words is very small compared with the number of possible strings
and therefore it makes sense to publish English dictionaries that contain only
permitted words.
A factorizable language, on the other hand, is better defined by an
‘antidictionary’ of excluded words, since this list is generally much shorter
than the list of permissible words.
It is important to note, however, that the set of ‘core’ excluded words (i.e.
DEBs) eliminates far more words from the language than the core set itself
contains.
For example, if a genome excludes the 4-mer [ccgg] then all 5-mers, 6-mers,
etc. containing [ccgg] are also, by definition, excluded.
Consequently, a few core DEBs can eliminate vast numbers of longer words,
leaving a relatively small language whose vocabulary and grammar can be
elucidated.
Since these longer motifs are excluded by implication, we term them Implied
Excluded Blocks (IEBs).
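The [ccgg] example can be made concrete by exhaustive enumeration. The following sketch counts, for a given word length, the words that contain a DEB as a sub-string without being a DEB themselves; the function name `count_iebs` is an illustrative assumption and the approach is only feasible for short lengths.

```python
from itertools import product


def count_iebs(debs, length, alphabet="acgt"):
    """Count words of the given length that contain at least one DEB
    as a sub-string but are not themselves DEBs."""
    count = 0
    for letters in product(alphabet, repeat=length):
        w = "".join(letters)
        if w not in debs and any(d in w for d in debs):
            count += 1
    return count


print(count_iebs({"ccgg"}, 5))  # 8
print(count_iebs({"ccgg"}, 6))  # 48
```

A single excluded 4-mer thus removes 8 of the 1024 possible 5-mers and 48 of the 4096 possible 6-mers.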
It is possible to enumerate combinatorially the IEBs excluded due to the
presence of DEBs by using the Goulden-Jackson algorithm.
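The sketch below is not the Goulden-Jackson method itself. It is a simpler dynamic programme over word suffixes that yields the same counts of allowed and excluded words for a given set of DEBs, without listing the words. The function name `count_allowed` is hypothetical and the DEB set is assumed to be non-empty.

```python
from collections import Counter


def count_allowed(debs, length, alphabet="acgt"):
    """Count words of the given length that contain no DEB as a sub-string.

    The state is the last (m - 1) letters of the word built so far, where m
    is the longest DEB length; that is all that is needed to detect a DEB
    ending at the next letter.
    """
    m = max(len(d) for d in debs)
    states = Counter({"": 1})            # one way to build the empty word
    for _ in range(length):
        nxt = Counter()
        for suffix, n in states.items():
            for ch in alphabet:
                tail = suffix + ch
                if any(tail.endswith(d) for d in debs):
                    continue             # a DEB ends here: word is excluded
                key = tail[-(m - 1):] if m > 1 else ""
                nxt[key] += n
        states = nxt
    return sum(states.values())


# Excluded words of length n = all words minus the allowed ones.
for n in (4, 5, 6, 8):
    print(n, 4 ** n - count_allowed({"ccgg"}, n))
```

For lengths greater than that of the longest DEB, the excluded words counted this way are exactly the IEBs, and the running time grows linearly with word length rather than exponentially.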
In genomes, as the motif length increases, the DEBs and IEBs exhibit very
different behaviours [7]. As ‘word’ length increases, the number of DEBs
always rises to a peak and then falls back again.
By contrast, the number of IEBs increases exponentially with ‘word’
length. The characteristic pattern of DEBs forms a distinct ‘signature’ which is
unique for any given genome.
Therefore, by simply counting the DEBs at each motif length it is possible to
construct a curve that uniquely identifies the genome of any given species.
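The counting step can be sketched as follows. This is a minimal illustration that treats the genome as a single-stranded string over {a,c,g,t}; the function names `kmers` and `deb_signature` and the toy sequence are assumptions, and a real genome would be read from a sequence file rather than typed in.

```python
def kmers(seq, k):
    """Return the set of all length-k sub-strings present in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def deb_signature(seq, max_len, alphabet="acgt"):
    """Return {k: number of DEBs of length k} for k = 1..max_len.

    A DEB of length k is absent from seq, while the (k-1)-mers obtained by
    dropping its first or its last letter are both present.
    """
    signature = {}
    shorter = {""}                       # the only 0-mer is the empty string
    for k in range(1, max_len + 1):
        present = kmers(seq, k)
        debs = {
            u + c
            for u in shorter             # prefix of length k-1 is present
            for c in alphabet
            if u + c not in present and (u + c)[1:] in shorter
        }
        signature[k] = len(debs)
        shorter = present
    return signature


print(deb_signature("acgtacg", 5))       # toy sequence, purely illustrative
```

Only extensions of k-mers that are actually present are examined, so the work at each length is bounded by the genome size rather than by the 4^k possible words.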
For example, Figure 1 shows the DEB signatures for the Hepatitis A (NCBI
NC_001489) and Measles (NCBI NC_001498) viruses. From this, it can be seen that
although both genomes exhibit a graph of the same general form, which peaks
at a ‘word’ length of 8, the profiles of the two curves are very different. It is
therefore possible to distinguish between genomes and, in theory, to group
together genomes that exhibit similar genetic characteristics.
[Figure 1: plot of No. of DEBs (vertical axis, 0 to 14000) against Motif Length
(horizontal axis, 1 to 12), with one curve each for Hepatitis A and Measles.]
Figure 1 The DEB signatures for the Hepatitis A and Measles virus
genomes.