DTC Bioinformatics module
Modelling genes
The aim of this practical is to build a probabilistic model of DNA sequence that allows you to
simulate genes similar to those found in higher eukaryotes such as fruit-flies, mice or humans.
You will first design a model, then estimate (or choose) parameters to suit humans. We will
start with a very simple model and build up the complexity.
i) Here is a graphical model representing an extremely simple model for a gene
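[Figure: state diagram. States, in order: Non-coding DNA, Start codon, Codon, TERM, Non-coding DNA. The transition between Non-coding DNA and Start codon is labelled s; the transition between Codon and TERM is labelled t.]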
Here, the two parameters s and t are probabilities that determine the average distance between
genes and the average length of a gene. Only these two parameters are needed as the
transition probabilities must sum to one.
Q1
What is the distribution of gene length (in terms of numbers of codons) arising from
the above model? What is the average length?
Q2
The above model specifies only the type of sequence ‘emitted’ (i.e. non-coding/ start/
coding/ TERM). What differences at the DNA sequence level arise from differences
in sequence type?
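One way to check an answer to Q1 empirically is to simulate the simple model directly. The sketch below assumes that the Start codon state always moves to the Codon state, that one codon is emitted on each visit to the Codon state, and that the chain then leaves for TERM with probability t. The function name and the value t = 0.01 are illustrative choices, not part of the practical.

```python
import random


def simulate_gene_length(t, rng):
    """Return the number of codons emitted before the chain moves to TERM.

    Assumption: at least one codon is emitted, and after each codon the
    chain leaves the Codon state with probability t.
    """
    n_codons = 1
    while rng.random() >= t:  # stay in the Codon state with probability 1 - t
        n_codons += 1
    return n_codons


if __name__ == "__main__":
    rng = random.Random(42)  # fixed seed so the run is repeatable
    lengths = [simulate_gene_length(0.01, rng) for _ in range(10000)]
    print("mean gene length (codons):", sum(lengths) / len(lengths))
```

Plotting a histogram of the simulated lengths for a few values of t gives a picture to compare with the distribution you derive by hand.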
ii) The first complication we want to consider is the DNA base composition associated with
each type of sequence. We will focus on humans. To answer the following questions you
will need to use various web-based resources, plus general searches.
Codon Usage Database: http://www.kazusa.or.jp/codon/
Human genome statistics:
www.ncbi.nlm.nih.gov/mapview/stats/BuildStats.cgi?taxid=9606&build=35&ver=1
Genome base composition: http://www.pasteur.fr/~tekaia/ntfreq.html
Consensus CDS database: http://www.ncbi.nlm.nih.gov/projects/CCDS/
(Note: coding sequence data are available from ftp://ftp.ncbi.nlm.nih.gov/pub/CCDS/archive/Hs35.1/ )
Q3
What frequencies of the DNA letters would you expect in non-coding DNA?
Q4
What frequencies of the DNA letters would you expect in coding DNA?
Q5
Do you find different base frequencies at the different positions in the codon?
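For the question about codon positions, a minimal sketch of the counting step is given below. It assumes you already have a coding sequence as a plain string that starts in frame; how you read sequences from the downloaded data is left open. The function name is hypothetical and the sequence in the usage example is made up.

```python
from collections import Counter


def codon_position_frequencies(cds):
    """Return a list of three dicts: base frequencies at codon positions 1, 2, 3.

    Assumption: the sequence starts in frame. A trailing partial codon and
    any characters other than A, C, G, T are ignored.
    """
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3
    counts = [Counter(), Counter(), Counter()]
    for i in range(usable):
        base = cds[i]
        if base in "ACGT":
            counts[i % 3][base] += 1
    freqs = []
    for position_counts in counts:
        total = sum(position_counts.values())
        freqs.append(
            {b: (position_counts[b] / total if total else 0.0) for b in "ACGT"}
        )
    return freqs


if __name__ == "__main__":
    toy_cds = "ATGGCCTAA"  # made-up example, not real data
    for position, f in enumerate(codon_position_frequencies(toy_cds), start=1):
        print(position, f)
```

Pooling the counts over many coding sequences, rather than one, gives more stable estimates.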
GM
16/02/2016
iii) The next feature to explore is the length of coding sequences. You can download about
14,000 hand-curated gene annotations from the ftp site above (also mirrored at
www.stats.ox.ac.uk/~mcvean/DTC/BIOINF/Practicals/CCDS.03032005.txt). The most
important complication here is that genes in humans are typically broken into exons, so our
model of genes will have to take this into account.
Q6
What is the average length of the coding sequence in human genes?
Q7a
What is the average number of exons?
Q7b
What is the empirical distribution of the length of coding sequence in humans?
Q8
What is the empirical distribution of exon number?
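Once the annotations are loaded, the summaries asked for above reduce to a few lines. The sketch below does not assume anything about the layout of the annotation file: it takes each gene as a list of (start, end) pairs for the coding part of its exons, which is a hypothetical input format you would need to build yourself. Whether the end coordinate is inclusive is also an assumption to check against the data.

```python
from collections import Counter
from statistics import mean


def gene_summary(exons, inclusive_end=True):
    """Return (number of exons, CDS length in bases) for one gene.

    exons: list of (start, end) coordinate pairs (hypothetical format).
    inclusive_end: assumption about the coordinate convention.
    """
    extra = 1 if inclusive_end else 0
    cds_length = sum(end - start + extra for start, end in exons)
    return len(exons), cds_length


def empirical_distribution(values):
    """Return {value: relative frequency}, ordered by value."""
    counts = Counter(values)
    n = len(values)
    return {v: counts[v] / n for v in sorted(counts)}


if __name__ == "__main__":
    # Made-up genes for illustration only.
    genes = [[(100, 199), (300, 349)], [(10, 99)], [(5, 34), (50, 79), (90, 119)]]
    summaries = [gene_summary(g) for g in genes]
    print("mean exon number:", mean(n for n, _ in summaries))
    print("mean CDS length:", mean(length for _, length in summaries))
    print("exon number distribution:", empirical_distribution([n for n, _ in summaries]))
```

For CDS length, binning the values before tabulating them will usually give a more readable distribution than one entry per distinct length.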
iv) We should be in a position to make a more complicated model of a gene that includes
more states (exons, introns) and has ‘emission’ probabilities associated with each state that
tell us what kind of DNA to expect. For the moment assume that all exon boundaries lie
between codons.
Q9
Build a graphical model similar to that above for a gene model that includes exons
and introns. Choose some suitable values for the parameters.
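As a starting point, one possible version of such a model is sketched below. It is only one of many reasonable designs. It assumes a gene starts with ATG, that after each codon the chain either terminates, enters an intron, or continues, that exon boundaries fall between codons, and that intron length is decided by a per-base exit probability. Every parameter value is a placeholder to be replaced by your own estimates, and the names are hypothetical.

```python
import random

STOP_CODONS = {"TAA", "TAG", "TGA"}

# Placeholder values only: replace with your own estimates.
PARAMS = {
    "p_term": 0.002,         # per-codon probability of moving to TERM
    "p_intron": 0.005,       # per-codon probability of entering an intron
    "p_intron_exit": 0.001,  # per-base probability of leaving an intron
    "coding_freqs": {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25},
    "noncoding_freqs": {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3},
}


def draw_bases(freqs, n, rng):
    """Draw n independent bases with the given frequencies."""
    bases = list(freqs)
    weights = [freqs[b] for b in bases]
    return "".join(rng.choices(bases, weights=weights, k=n))


def draw_codon(freqs, rng):
    """Draw a codon base by base, rejecting in-frame stop codons."""
    while True:
        codon = draw_bases(freqs, 3, rng)
        if codon not in STOP_CODONS:
            return codon


def simulate_gene(params, rng):
    """Return (exons, introns) as two lists of DNA strings."""
    exons, introns = [], []
    current = "ATG"  # start codon opens the first exon
    while True:
        current += draw_codon(params["coding_freqs"], rng)
        u = rng.random()
        if u < params["p_term"]:
            current += rng.choice(sorted(STOP_CODONS))
            exons.append(current)
            return exons, introns
        if u < params["p_term"] + params["p_intron"]:
            exons.append(current)  # exon ends between codons
            length = 1
            while rng.random() >= params["p_intron_exit"]:
                length += 1
            introns.append(draw_bases(params["noncoding_freqs"], length, rng))
            current = ""  # the next exon starts with a fresh codon


if __name__ == "__main__":
    rng = random.Random(1)
    exons, introns = simulate_gene(PARAMS, rng)
    print("exons:", len(exons), "CDS length:", sum(len(e) for e in exons))
```

This sketch uses a single set of coding base frequencies for all three codon positions and ignores splice-site sequences; both are simplifications you may want to remove.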
v) There are many features of genes that we would still need to include to make the model
realistic: for example, the domain structure of proteins, the different ‘phases’ of exons,
splice-site and branch sequences associated with introns, and regulatory motifs both 5’ and 3’
of the coding sequence. Choose one or more of these features to investigate and add
appropriate elements to your model.
Q10
Simulate 100 genes from your model (note this means both the gene structure and
DNA sequence). Compare the distributions of exon number and CDS length to the
empirical distributions. Now look at the empirical distributions of the length of the
first and last exon. Do you see a difference, and if so, why might that be?
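To put a number on how far a simulated distribution is from an empirical one, one simple option is the largest gap between the two empirical cumulative distribution functions (the two-sample Kolmogorov-Smirnov statistic). This is a suggestion rather than a method the practical prescribes, and the function names are hypothetical. The sketch works on any two lists of numbers, such as simulated and observed exon counts.

```python
from bisect import bisect_right


def ecdf(sample):
    """Return a function x -> fraction of the sample that is <= x."""
    ordered = sorted(sample)
    n = len(ordered)
    return lambda x: bisect_right(ordered, x) / n


def max_ecdf_gap(sample_a, sample_b):
    """Largest absolute difference between the two empirical CDFs.

    Both CDFs are step functions that only change at observed values,
    so it is enough to evaluate the gap at every observed value.
    """
    cdf_a, cdf_b = ecdf(sample_a), ecdf(sample_b)
    points = set(sample_a) | set(sample_b)
    return max(abs(cdf_a(x) - cdf_b(x)) for x in points)


if __name__ == "__main__":
    simulated = [1, 2, 2, 3, 5, 8]   # made-up exon counts
    observed = [3, 4, 6, 7, 9, 12]   # made-up exon counts
    print("max ECDF gap:", max_ecdf_gap(simulated, observed))
```

A value near 0 means the two samples have similar distributions and a value near 1 means they barely overlap. Plotting the two cumulative curves alongside the number shows where the model fits poorly.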