GENOMIC CODING MSc Level Module (C) 2006,2009 Professor S J Shepherd. All worldwide rights reserved.

Factorizable languages and Distinct Excluded Blocks

Each genome is a complete language in its own right. Furthermore, genomes are, by definition, factorizable languages: a factorizable language is one that can be completely defined by a core set of words that do not appear in that language.

Consider an alphabet of symbols Σ, say Σ = {a,c,g,t} for the genome. The (infinite) set Σ* contains all the strings that can be made from the letters of the alphabet, including the empty string (i.e. the string of zero length). Any (finite) subset L ⊂ Σ* is said to be a language over the alphabet Σ. Once the language L has been defined, its complementary set L′ = Σ* − L contains all the words (i.e. the excluded words) that do not appear in L.

Formally, a language is said to be factorizable if any sub-string of a word x ∈ L is also in L. For a factorizable language, the set of excluded words L′ acquires a ‘minimal’ property; that is, it contains a minimal ‘core’ of forbidden words L″ ⊂ L′ that cannot be cut into shorter words without producing a word which is in L (i.e. every proper sub-string of a core word belongs to L). The members of the set L″ have been termed Distinct Excluded Blocks (DEBs) [11] and it is these that completely define the language L. Because of the factorizability properties of such languages, it is possible to write L′ = Σ* L″ Σ* and so L = Σ* − Σ* L″ Σ*.

Natural (human) languages are not factorizable and therefore must be sensibly defined by dictionaries of permissible words. For example, in English the number of legitimate words is very small compared with the number of possible strings, and it therefore makes sense to publish English dictionaries that contain only permitted words. A factorizable language, on the other hand, is better defined by an ‘antidictionary’ of excluded words, since this list is generally much shorter than the list of permissible words.
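The definition of a DEB given above can be illustrated with a minimal brute-force sketch. It rests on assumptions that are not part of these notes: the genome is treated as a single linear strand, the language is taken to be the set of all sub-strings of that strand, and the function names are purely illustrative. A word of length k is reported as a DEB if it is absent from the sequence while both of its (k-1)-letter sub-strings are present; because the language is factorizable, this is enough to guarantee that every shorter sub-string is present too.

```python
from itertools import product

ALPHABET = "acgt"  # the alphabet used for the genome in the text


def substrings_of_length(seq, k):
    """Return the set of all length-k sub-strings (k-mers) present in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def distinct_excluded_blocks(seq, k):
    """Return the k-mers absent from seq whose proper sub-strings are all present.

    Illustrative helper (not from the notes). Since the set of sub-strings
    of seq is factorizable, checking the two (k-1)-mers obtained by dropping
    the last or the first letter covers every shorter sub-string.
    """
    present = substrings_of_length(seq, k)
    shorter = substrings_of_length(seq, k - 1)  # equals {""} when k == 1
    debs = set()
    for letters in product(ALPHABET, repeat=k):
        word = "".join(letters)
        if word in present:
            continue  # the word belongs to the language
        if word[:-1] in shorter and word[1:] in shorter:
            debs.add(word)  # absent, but cannot be shortened and stay absent
    return debs


if __name__ == "__main__":
    toy_genome = "acgtacgg"  # toy sequence, chosen only for illustration
    for k in range(1, 4):
        print(k, sorted(distinct_excluded_blocks(toy_genome, k)))
```

Enumerating all 4^k candidate words is only practical for short motif lengths; it is used here because it mirrors the definition directly, not because it is efficient.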
It is important to note, however, that the set of ‘core’ excluded words (i.e. DEBs) eliminates far more words from the language than the number of words in the core set. For example, if a genome excludes the 4-mer [ccgg] then all 5-mers, 6-mers, etc. containing [ccgg] are also, by definition, excluded. Consequently, a few core DEBs can eliminate vast numbers of longer words, leaving a relatively small language whose vocabulary and grammar can be elucidated. Since these longer motifs are excluded by implication, we term them Implied Excluded Blocks (IEBs). It is possible to enumerate combinatorially the IEBs excluded due to the presence of DEBs by using the Goulden-Jackson algorithm.

In genomes, as the motif length increases, the DEBs and IEBs exhibit very different behaviours [7]. As ‘word’ length increases, the number of DEBs always rises to a peak and then falls back again. By contrast, the number of IEBs increases exponentially with ‘word’ length. The characteristic pattern of DEBs forms a distinct ‘signature’ which is unique for any given genome. Therefore, by simply counting the DEBs at each motif length, it is possible to construct a curve that characterises the genome of any given species.

For example, Figure 1 shows the DEB signatures for the Hepatitis A (NCBI NC_001489) and Measles (NCBI NC_001498) viruses. Both genomes exhibit a graph of the same general form, peaking at a ‘word’ length of 8, but the profiles of the two curves are very different. It is therefore possible to distinguish between genomes and, in theory, to group together genomes which exhibit similar genetic characteristics.

[Figure 1: plot of No. of DEBs (vertical axis, 0 to 14000) against Motif Length (horizontal axis, 1 to 12), with one curve each for Hepatitis A and Measles.]

Figure 1 The DEB signatures for the Hepatitis A and Measles virus genomes.
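The counting procedure behind such a signature can be sketched as follows. This is a brute-force illustration under stated assumptions, not the method used to produce Figure 1: the genome is treated as a single linear strand, the name deb_signature is hypothetical, and IEBs are counted directly as the absent words that are not DEBs rather than with the Goulden-Jackson algorithm. The direct count is valid because every absent word is either minimal (a DEB) or contains a shorter absent word, and hence a DEB.

```python
from itertools import product


def kmers(seq, k):
    """Return the set of all length-k sub-strings present in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def deb_signature(seq, max_len, alphabet="acgt"):
    """Return {k: (number of DEBs, number of IEBs)} for k = 1..max_len.

    Illustrative helper (not from the notes). A k-mer is a DEB if it is
    absent from seq while both of its (k-1)-letter sub-strings are present;
    every other absent k-mer contains a shorter DEB and is counted as an IEB.
    """
    signature = {}
    shorter = {""}  # the only word of length 0 is the empty string
    for k in range(1, max_len + 1):
        present = kmers(seq, k)
        n_absent = len(alphabet) ** k - len(present)
        n_debs = 0
        for letters in product(alphabet, repeat=k):
            word = "".join(letters)
            if (word not in present
                    and word[:-1] in shorter
                    and word[1:] in shorter):
                n_debs += 1
        signature[k] = (n_debs, n_absent - n_debs)
        shorter = present  # reused as the (k-1)-mers of the next round
    return signature


if __name__ == "__main__":
    toy_genome = "acgtacgg"  # toy sequence, chosen only for illustration
    for k, (n_debs, n_iebs) in deb_signature(toy_genome, 4).items():
        print(f"length {k}: DEBs = {n_debs}, IEBs = {n_iebs}")
```

Even on a toy sequence the two counts behave differently: the DEB count rises and then falls, while the IEB count keeps growing with word length. For a real genome the sequence would be read from a file, and the exhaustive enumeration of 4^k words becomes slow towards the upper end of the motif lengths shown in Figure 1.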