DTC Bioinformatics module

Modelling genes

The aim of this practical is to build a probabilistic model of DNA sequence that allows you to simulate genes similar to those found in higher eukaryotes such as fruit-flies, mice or humans. You will first design a model, then estimate (or choose) parameters to suit humans. We will start with a very simple model and build up the complexity.

i) Here is a graphical model representing an extremely simple model for a gene.

[Figure: state diagram. States, in order: Non-coding DNA, Start codon, Codon, TERM, Non-coding DNA. The transition labels s and t appear between Non-coding DNA and Start codon, and between Codon and TERM, respectively.]

Here, the two parameters s and t are probabilities that determine the average distance between genes and the average length of a gene. Only these two parameters are needed as the transition probabilities must sum to one.

Q1 What is the distribution of gene length (in terms of numbers of codons) arising from the above model? What is the average length?

Q2 The above model specifies only the type of sequence ‘emitted’ (i.e. non-coding/ start/ coding/ TERM). What differences at the DNA sequence level arise from differences in sequence type?

ii) The first complication we want to consider is the DNA base composition associated with each type of sequence. We will focus on humans. To answer the following questions you will need to use various web-based resources, plus general searches.

Codon Usage Database: http://www.kazusa.or.jp/codon/
Human genome statistics: www.ncbi.nlm.nih.gov/mapview/stats/BuildStats.cgi?taxid=9606&build=35&ver=1
Genome base composition: http://www.pasteur.fr/~tekaia/ntfreq.html
Consensus CDS database: http://www.ncbi.nlm.nih.gov/projects/CCDS/
(Note coding sequence data available from ftp://ftp.ncbi.nlm.nih.gov/pub/CCDS/archive/Hs35.1/ )

Q3 What frequencies of the DNA letters would you expect in non-coding DNA?

Q4 What frequencies of the DNA letters would you expect in coding DNA?

Q5 Do you find different base frequencies at the different positions in the codon?
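For Q4 and Q5, one way to tabulate base composition by codon position is sketched below in Python. This is only a starting point under stated assumptions: the function name codon_position_frequencies is hypothetical, each input sequence is assumed to be in frame (its first base is codon position 1), and the two toy sequences are made up for illustration and are not taken from the CCDS data.

```python
from collections import Counter


def codon_position_frequencies(cds_sequences):
    """Return a list of three dicts: base frequencies at codon positions 1, 2, 3.

    Assumes every sequence starts in frame; a trailing partial codon is ignored.
    """
    counts = [Counter(), Counter(), Counter()]
    for seq in cds_sequences:
        seq = seq.upper()
        usable = len(seq) - len(seq) % 3  # drop any incomplete final codon
        for i in range(usable):
            base = seq[i]
            if base in "ACGT":  # skip ambiguity codes such as N
                counts[i % 3][base] += 1
    freqs = []
    for c in counts:
        total = sum(c.values())
        freqs.append({b: (c[b] / total if total else 0.0) for b in "ACGT"})
    return freqs


# Toy input only (an assumption): replace with real coding sequences.
toy_cds = ["ATGGCCAAGTAA", "ATGGGCTGA"]
for pos, f in enumerate(codon_position_frequencies(toy_cds), start=1):
    print("codon position", pos, {b: round(p, 2) for b, p in f.items()})
```

Pooling the three dictionaries gives the overall coding composition asked for in Q4, while comparing them against each other addresses Q5.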
GM 16/02/2016

iii) The next feature to explore is the length of coding sequences. You can download about 14,000 hand-curated gene annotations from the ftp site above (also mirrored at www.stats.ox.ac.uk/~mcvean/DTC/BIOINF/Practicals/CCDS.03032005.txt). The most important complication here is that genes in humans are typically broken into exons, so our model of genes will have to take this into account.

Q6 What is the average length of the coding sequence in human genes?

Q7a What is the average number of exons?

Q7b What is the empirical distribution of the length of coding sequence in humans?

Q8 What is the empirical distribution of exon number?

iv) We should now be in a position to make a more complicated model of a gene that includes more states (exons, introns) and has ‘emission’ probabilities associated with each state that tell us what kind of DNA to expect. For the moment assume that all exon boundaries lie between codons.

Q9 Build a graphical model similar to that above for a gene model that includes exons and introns. Choose some suitable values for the parameters.

v) There are many features of genes that we would still need to include to make the model realistic. For example, the domain structure of proteins, different ‘phases’ of exons, splice-site and branch sequences associated with introns, and regulatory motifs both 5’ and 3’ of the coding sequence. Choose one or more of these features to investigate and add appropriate elements to your model.

Q10 Simulate 100 genes from your model (note this means both the gene structure and DNA sequence). Compare the distributions of exon number and CDS length to the empirical distributions. Now look at the empirical distributions of the length of the first and last exon. Do you see a difference, and if so, why might that be?
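As a possible starting point for Q9 and Q10, the Python sketch below simulates genes from one simple exon/intron model. It is an illustration under stated assumptions, not the intended answer: the function simulate_gene and the parameters e (per-codon probability of entering an intron) and r (per-base probability of leaving an intron) are hypothetical, t plays the same role as in part i), and all numerical values are arbitrary placeholders rather than estimates for humans. Codons and intron bases are drawn uniformly, which you would replace with the frequencies found in Q3 to Q5. Exon boundaries lie between codons, as assumed in part iv).

```python
import random
from itertools import product
from statistics import mean

STOP_CODONS = ("TAA", "TAG", "TGA")
SENSE_CODONS = ["".join(c) for c in product("ACGT", repeat=3)
                if "".join(c) not in STOP_CODONS]


def simulate_gene(t, e, r, rng):
    """Simulate one gene; return (list of exon sequences, list of intron sequences).

    t: per-codon probability of emitting a stop codon and ending the gene
    e: per-codon probability of entering an intron (hypothetical parameter)
    r: per-base probability of leaving an intron (hypothetical parameter)
    """
    exons, introns = [], []
    exon = ["ATG"]  # every gene begins with a start codon
    while True:
        u = rng.random()
        if u < t:
            # Terminate: the stop codon closes the last exon.
            exon.append(rng.choice(STOP_CODONS))
            exons.append("".join(exon))
            return exons, introns
        if u < t + e:
            # Close the current exon and emit an intron of at least one base.
            exons.append("".join(exon))
            intron = [rng.choice("ACGT")]
            while rng.random() >= r:
                intron.append(rng.choice("ACGT"))
            introns.append("".join(intron))
            exon = []
        # Emit one sense codon (uniform choice is a placeholder assumption).
        exon.append(rng.choice(SENSE_CODONS))


rng = random.Random(1)  # fixed seed so the run is repeatable
# Placeholder parameter values, chosen only for illustration.
genes = [simulate_gene(t=0.002, e=0.02, r=0.005, rng=rng) for _ in range(100)]

exon_counts = [len(exons) for exons, _ in genes]
cds_codons = [sum(len(x) for x in exons) // 3 for exons, _ in genes]
print("mean exon number:", mean(exon_counts))
print("mean CDS length (codons):", mean(cds_codons))
```

The lists exon_counts and cds_codons can then be compared with the empirical distributions from Q7b and Q8, for example as histograms.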