Microbial Genomes

Part 1: Methods for Studying Microbial Genomes
Part 2: Analysis and Interpretation of Whole Genome Sequences

Part 1

1) Why study whole microbial genomes?
- until whole genome analysis became viable, the life sciences were based on a reductionist principle: dissecting cells and systems into their fundamental components for further study
- studies on whole genomes, and whole genome sequences in particular, give us a complete genomic blueprint for an organism
- we can now begin to examine how all of these parts operate cooperatively to influence the activities and behavior of an entire organism, working towards a complete understanding of its biology
- microbes provide an excellent starting point for studies of this type as they have a relatively simple genomic structure compared to higher, multicellular organisms
- analysis of whole microbial genomes also provides insight into microbial evolution and diversity beyond single protein or gene phylogenies
- in practical terms, analysis of whole microbial genomes is also a powerful tool for identifying new applications for biotechnology and new approaches to the treatment and control of pathogenic organisms

1.1) History of Microbial Genome Sequencing
- 1977 - first complete genome to be sequenced was bacteriophage φX174 - 5386 bp
- first genome to be sequenced using random DNA fragments - bacteriophage λ - 48502 bp
- 1986 - mitochondrial (187 kb) and chloroplast (121 kb) genomes of Marchantia polymorpha sequenced
- early 1990s - cytomegalovirus (229 kb) and Vaccinia (192 kb) genomes sequenced
- 1995 - first complete genome sequence from a free living organism - Haemophilus influenzae (1.83 Mb)
- late 1990s - many additional microbial genomes sequenced, including Archaea (Methanococcus jannaschii - 1996) and Eukaryotes (Saccharomyces cerevisiae - 1996)

1.2) Advantages in Clinical Microbiology
- gain a comprehensive understanding of microbial pathogenesis and host-pathogen interactions
- identification of sensitive and specific molecular targets for identification and typing
- identification of molecular markers associated with disease risk and severity
- selection of potential candidates for rational development of therapeutic agents

2) Microbial genomes sequenced to date
- currently there are 32 complete, published microbial genomes: 25 domain Bacteria, 5 domain Archaea, 1 domain Eukarya (www.tigr.org)
- around 130 additional microbial genome and chromosome sequencing projects are underway

3) Tools for studying whole genomes
- conventional techniques for analysing DNA are designed for the analysis of small regions of whole genomes, such as individual genes or operons
- many of the techniques used to study whole genomes are conventional molecular biology techniques adapted to operate effectively with DNA in a much larger size range

3.1) PFGE
- agarose gel electrophoresis is a fundamental technique in molecular biology but is generally unable to resolve fragments greater than 20 kilobases in size; this makes it unsuitable for analysing DNA at a whole genome scale (whole microbial genomes are usually greater than 1000 kilobases in size)
- PFGE (pulsed field gel electrophoresis) is an adaptation of conventional agarose gel electrophoresis that allows extremely large DNA fragments to be resolved (up to megabase size fragments)
- PFGE is an essential technique for accurately determining the sizes of whole genomes/chromosomes prior to sequencing, and is necessary for preparing large DNA fragments for large insert DNA cloning and for analysis of the resulting clones
- PFGE is also a commonly used and extremely powerful tool for genotyping and epidemiology studies of pathogenic microorganisms

3.1.1) Principle of PFGE
- two factors influence DNA migration rates through conventional gels
- charge differences between DNA fragments
- ‘molecular sieve’ effect of agarose pores
- DNA fragments normally travel through agarose pores as spherical coils; fragments greater than 20
kb in size form extended coils and are therefore not subjected to the molecular sieve effect
- the charge effect is countered by the proportionally increased friction applied to the molecules, and therefore fragments greater than 20 kb do not resolve
- PFGE works by periodically altering the electric field orientation
- the large extended coil DNA fragments are forced to change orientation, and size dependent separation is re-established because the time taken for the DNA to reorient is size dependent
- the most important factor in PFGE resolution is switching time; longer switching times generally increase the size of DNA fragments that can be resolved
- switching times are optimised for the expected size of the DNA being run on the PFGE gel
- switch time ramping increases the region of the gel in which DNA separation is linear with respect to size
- several types of apparatus have been developed to generate this switching of the electric field; the most commonly used in modern laboratories are FIGE (Field Inversion Gel Electrophoresis) and CHEF (Contour-clamped Homogeneous Electric Field)

3.1.2) Preparation of DNA for PFGE
- ideally, a genomic DNA preparation for PFGE contains a high proportion of completely or almost completely intact genome copies
- conventional means of DNA preparation are unsuitable for PFGE, as mechanical shearing and low-level nuclease activity result in fragmented DNA with an average size much smaller than an entire microbial genome (usually less than 200 kb in size)
- the solution to this is to prepare genomic DNA from whole cells in a semisolid matrix (ie.
agarose) that eliminates mechanical shearing
- a very high concentration of EDTA is also used at all times in order to inhibit nuclease activity

Procedure
1) intact cells are mixed with molten low melting temperature (LMT) agarose and set in a mold, forming agarose ‘plugs’
2) enzymes and detergents diffuse into the plugs and lyse the cells
3) proteinase K diffuses into the plugs and digests proteins
4) if necessary, restriction digests are performed in the plugs (extensive washing or PMSF treatment is required first to remove proteinase K activity)
5) plugs are loaded directly onto the PFGE gel and run

- for restriction digests, conventional enzymes are unsuitable as they cut frequently across an entire genome sequence, producing DNA fragments that are far too small
- ‘rare cutter’ restriction endonucleases cut genomic DNA with far less frequency than conventional restriction enzymes such as HindIII, BamHI etc.
- many rare cutter REs have 8-bp (or longer) recognition sites eg. NotI GCGGCCGC
- in many cases the frequency of cutting is highly species dependent eg. BamHI will cut far less frequently on a low GC% genome than on an intermediate or high GC content genome
- suitable rare cutter enzymes therefore have to be determined experimentally for each new species being studied

3.2) Large insert cloning vectors: BACs and PACs
- DNA cloning is another technique fundamental to molecular biology that requires adaptation in order to be useful in studying DNA at a whole genome scale
- conventional plasmid derived cloning vectors are only able to reliably maintain inserts less than 20 kb in size
- there are a number of approaches to generating clones with inserts in an intermediate size range (20 – 80 kb), such as cosmids, etc.
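As a rough illustration of the rare cutter discussion in 3.1.2, the sketch below estimates how often a recognition site is expected to occur in a genome of a given GC content. It assumes a simple model that is not part of these notes: bases are independent, G and C are equally frequent, and A and T are equally frequent. The function names are hypothetical, and real genomes deviate from this model (eg. through dinucleotide bias), which is one reason suitable enzymes still have to be found experimentally.

```python
# Minimal sketch under an assumed independent-base model (not from the notes).

def site_probability(site: str, gc: float) -> float:
    """Probability that a random genome position starts an exact match to `site`."""
    if not 0.0 < gc < 1.0:
        raise ValueError("gc must be a fraction between 0 and 1")
    p_g_or_c = gc / 2.0          # probability of G (and, separately, of C)
    p_a_or_t = (1.0 - gc) / 2.0  # probability of A (and, separately, of T)
    prob = 1.0
    for base in site.upper():
        if base in "GC":
            prob *= p_g_or_c
        elif base in "AT":
            prob *= p_a_or_t
        else:
            raise ValueError(f"unsupported base in recognition site: {base}")
    return prob


def expected_fragment_kb(site: str, gc: float) -> float:
    """Expected average fragment size (kb) after complete digestion."""
    return 1.0 / site_probability(site, gc) / 1000.0


# Recognition sequences of the enzymes named in the notes.
SITES = {"HindIII": "AAGCTT", "BamHI": "GGATCC", "NotI": "GCGGCCGC"}

if __name__ == "__main__":
    for gc in (0.3, 0.5, 0.7):
        for enzyme, site in SITES.items():
            size = expected_fragment_kb(site, gc)
            print(f"GC={gc:.0%}  {enzyme:8s} {site:9s} ~{size:,.1f} kb")
```

Under this model a 6-bp site in a 50% GC genome is expected roughly every 4 kb, while the 8-bp NotI site is expected roughly every 65 kb. Lowering the GC content makes GC-rich sites such as BamHI and NotI rarer still, which matches the species dependence described in 3.1.2.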
- the most commonly used vectors for cloning extremely large DNA inserts are BACs (Bacterial Artificial Chromosomes) and PACs (P1-derived Artificial Chromosomes)
- both BAC and PAC vectors are plasmid derived vectors, distinguished from conventional vectors by extremely tightly controlled low copy numbers
- these very low copy numbers help to limit the strain on host cellular resources generated by very large DNA inserts, thus avoiding the rejection of large insert clones
- low copy numbers also help to limit recombination events with host genomic DNA
- BAC and PAC vectors both utilise E. coli as the host organism
- BAC vectors are based on the E. coli single copy F-factor plasmid; the F-factor origin of replication is very tightly controlled
- PAC vectors are based on an identical principle but instead use a single copy origin of replication derived from P1 phage

4) Approaches to whole genome sequencing
- the aim of microbial genome sequencing projects is to construct, from 500 – 800 bp sequencing reads containing about 1% mistakes, a genome sequence of several megabases with an error rate lower than 1 per 10000 nucleotides
- with improving software, decreasing computation costs and advancements in automated DNA sequencing, an entire microbial genome project can be completed in a small laboratory in 1-2 years
- there are two main approaches to sequencing microbial genomes: the ordered clone approach and direct shotgun sequencing
- both approaches require large and small insert genomic DNA libraries in order to be effective

4.1) Ordered clone approach
- essentially this technique involves constructing a map of overlapping clones covering the whole genome and then completely sequencing the minimum subset of these ordered clones
- there are a number of methods used to order clones, including restriction fingerprinting and hybridisation mapping
- in restriction fingerprinting, a series of restriction fragment fingerprints is generated by PFGE for each clone in the library
- restriction fragment patterns shared between two or more clones, which indicate sequence overlap, can then be identified
- hybridisation mapping involves probing large insert (cosmid, BAC or PAC) clones with PFGE purified DNA fragments generated by rare cutter restriction digests
- multiple clones corresponding to each large DNA fragment are then aligned using labelled riboprobes corresponding to insert ends
- once an ordered large insert clone set is identified, a whole genome sequence is determined by either shotgun or partial primer walk sequencing of each insert
- the ordered clone approach requires a large amount of characterisation prior to actual DNA sequencing and is therefore relatively time consuming; however, it may be cheaper than shotgun sequencing an entire genome as less redundant sequencing is required
- with rapid decreases in the costs of computing power and sequencing, this method is no longer considered viable for small (< 5 Mb) genomes

4.2) Random sequencing (shotgun) approach
- this is currently the most commonly used strategy for microbial whole genome sequencing
- sequences from a large number of small insert clones are generated and overlapping sequences are joined together to form a ‘contig’ of the whole genome sequence
- although this requires enormous amounts of DNA sequencing (often up to 10x genome coverage) and computational power for sequence assembly, it is a relatively rapid approach to whole genome sequencing
- the first 90 – 95% of the genome sequence is relatively easy to generate by shotgun sequencing, resulting in several hundred discrete contigs
- filling the gaps to produce a single contig is the most difficult and time consuming phase of this process

4.3) Sequence assembly
- there are three major steps in sequence assembly: conversion of data from automated sequencers, utilisation of sequences in the assembly process and assessment of the assembly
- small and large insert clones that have been end sequenced are used
as ‘linking clones’
- inverse PCR may also be used in order to fill gaps

4.4) Gapped microbial genomes
- considering the cost and difficulty of filling gaps between contigs, some interest has been generated by the analysis of gapped microbial genomes
- gaps are usually very small on average (approximately 75 bp for a 3.2x coverage library)
- the increasing bioinformatic resources available mean that these gaps have little influence on functional reconstruction
- eg. Thiobacillus ferrooxidans - all assigned amino acid biosynthesis genes (140 in total) were identified from a gapped genome of 1912 contigs
- error rates tend to be relatively high compared to genome sequences with greater coverage

Part 2

1) Annotation of genome sequences
- a microbial genome sequence alone is only raw data; it needs to be interpreted in order to be of any scientific significance
- the most efficient means of determining function for gene sequences within microbial genomes is by comparison to gene sequences of known function from other organisms
- in order for this to be effective you must first accurately predict all possible genes in your genome sequence

1.1) Identifying ORFs
- simple homology searches are not sufficient to identify all of the potential genes in a genome sequence
- most genomes will contain genes with very little or no homology to known genes of other organisms
- the most efficient means of identifying potential genes in genome sequences is a three step process
1) submit the entire sequence as a 6-frame translation for BLAST analysis in order to identify protein coding regions on the basis of high levels of homology
2) determine the sequence characteristics (GC content, codon bias etc.)
that distinguish coding and non-coding regions of the genome
3) reanalyse the genome sequence using this data (plus potential ribosome binding sequences) in order to identify all the potential genes
- using this process it has been experimentally shown that around 94% of genes can be accurately predicted

1.2) Assigning function to ORFs
- in order to assign function, all predicted ORFs are translated to amino acid sequence and analysed by homology searches against sequence databases (usually GenBank)
- for each ORF there are three possible results: i) clear sequence homology indicating function, ii) blocks of homology to defined functional motifs and iii) no significant homology, or homology to proteins of unknown function
- similarity based assignment of function
- clustering of potential genes (operons etc.)
- clusters of orthologous genes from different organisms (COGs)

1.3) Genes of unidentified function
- in most genome sequences many of the ORFs identified cannot be assigned a specific function based on homology
- although the figure varies, usually between 40 and 50% of ORFs fall into this category
- clearly this represents a significant gap in our knowledge of microbial metabolism
- these ORFs can be further divided into two categories
i) conserved hypothetical proteins: ORFs with no homology to proteins of known function but with significant homology to unidentified ORFs of other species
- these ORFs are therefore conserved across numerous species and may represent important components of central metabolism that have not yet been identified
- the more universal the distribution of these ORFs, the more likely they are to have a fundamental role in metabolism
ii) ORFs without homologues: these are ORFs that have no homology to any known sequences; they may represent genes encoding proteins related to more specific organism adaptations
- eg.
Deinococcus radiodurans is a radiation resistant organism that contains many ORFs without homologues; many of these are thought to be involved in specialised processes of DNA repair

Table: ORFs by homology class for selected genomes
Columns: Organism (total ORFs); Homologues to known proteins (%); Homologues to conserved hypothetical proteins (%); No homologues (%)
Organisms: E. coli (4277); Pyrococcus horikoshii (2064); Haemophilus influenzae (1709); B. subtilis (4099); Methanococcus jannaschii (1735)
Values (in source order): 33.3, 35, 10.3, 33.3, 58.8, 18.2, 23, 58, 38.1, 5, 40.6, 37, 21.3, 56.4, 31.7

- in order to gain a complete understanding of an organism and fully exploit the potential offered by microbial genome sequencing, it is essential that these unidentified ORFs are assigned function
- in most cases classical molecular biology tools will be necessary for this task; however, some suggestion of function for these ORFs would greatly improve the efficiency of this process
- one possibility is ‘structural genomics’ - this is the process of determining three dimensional structures of all the gene products encoded in a microbial genome (1000s of structures!!)
- function can then be inferred on the basis of 3D structure comparisons to other proteins
- this relies on the principle that although two proteins with similar amino acid sequences can be assumed to have similar structures, two proteins with similar structures don’t necessarily have similar amino acid sequences

1.5) Metabolic Reconstruction
- a powerful means of understanding the way microorganisms behave and interact with their environments is to reconstruct metabolic pathways in silico
- a number of computational tools and databases have been developed in order to allow this
- in many cases when this approach is used, key enzymes involved in metabolic pathways are not present in annotated genome sequences
- there are two possibilities: i) these enzymes are encoded by low similarity or novel genes, ii) the organism has developed alternatives to classical metabolic pathways

1.6) Microarray Hybridisation
- a completely annotated microbial genome sequence, whilst a powerful scientific tool, still doesn’t provide all of the information needed to understand the complete biology of an organism, as it is essentially a static picture of the genome
- for truly complete characterisation, the dynamic nature of gene expression within a microbial cell needs to be determined
- microarray technology allows whole organism gene expression to be investigated
- PCR products of every gene from a complete genome sequence are bound in a high density array on a glass slide
- these arrays are probed with fluorescently labelled cDNA prepared from whole RNA under specific environmental conditions
- the level of cDNA for each ORF is then quantified using high resolution image scanners
- example: a microarray containing 97% of the predicted ORFs from Mycobacterium tuberculosis was used to investigate the response to the antituberculosis drug isoniazid
- isoniazid was found to induce several genes related to outer lipid envelope biosynthesis, consistent with the drug’s physiological mode
of action
- a number of additional genes were also induced, which may provide potential drug targets in the future

2) Characteristics of sequenced genomes

2.1) Horizontal gene transfer
- before microbial genome sequences became available, most of the focus of microbial evolution was on ‘vertical’ transmission of genetic information: mutation, recombination and rearrangement within the clonal lineage of a single microbial population
- genome sequences have demonstrated that horizontal transfer of genes (between different types of organisms) is widespread and may occur between phylogenetically diverse organisms
- generally speaking, essential genes (such as 16S rRNA) are unlikely to be transferred because the potential host most likely already contains genes of this type that have coevolved with the rest of its cellular machinery and cannot be displaced
- genes encoding non-essential cellular processes of potential benefit to other organisms are far more likely to be transferred (eg. those involved in catabolic processes)
- even so, genes with essential function can also be transferred, although this is more rare eg.
Thermomonospora harbours two sets of 16S rRNA genes, one acquired by lateral transfer
- clearly, lateral transfer of genomic information has enormous potential for improving a microorganism’s ability to compete effectively
- this may explain why horizontally transferred genes appear so frequently and ubiquitously in microbial genomes
- an example of this is the horizontal transfer of genes between Archaeal and Bacterial hyperthermophiles

2.2) Whole genome phylogenetic analysis
- most of the evolutionary relationships between microorganisms are inferred by comparison of single genes, usually 16S rRNA genes
- although extremely effective, single gene phylogenetic trees only provide limited information, which can make determining broad relationships between major groups difficult
- phylogenetic relationships can be determined by whole genome comparisons of the observed presence or absence of protein encoding gene families
- in effect this is similar to using the distribution of morphological characteristics to determine phylogeny, without the problem of convergent evolution
- trees produced using this method are similar to 16S rRNA trees; however, as more genome sequences become available, more detailed conclusions can be drawn using this method

2.3) Archaeal genomics
- analysis of the 5 complete genomes available for members of the domain Archaea has provided new insights into relationships between Archaea, Bacteria and Eukaryotes
- as expected from single gene phylogeny, Archaea contain a significant fraction of ORFs that show homology only to Eukaryotic ORFs (25-30%)
- most of these encode proteins involved in transcription, translation and DNA metabolism
- large numbers of genes are also shared exclusively with Bacteria, while the remainder are exclusively Archaeal

2.4) Species and strain specific diversity
- although genome sequencing and analysis is very useful when comparing phylogenetically distant taxa, it is also of interest to examine the genomes of very closely
related microorganisms
- this allows a more quantitative approach for examining the relationships between genotype and phenotype
- complete genome sequences have been determined for two species of the genus Chlamydia (C. pneumoniae and C. trachomatis)
- although the overall genome structure was quite similar, C. pneumoniae contained an additional 214 genes, most of which have an unknown function
- two strains of the bacterium Helicobacter pylori have been completely sequenced (26695 and J99)
- overall the two strains were very similar genetically, with only 6% of genes being specific to each strain
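The strain comparison described in 2.4, and the gene family presence/absence comparison in 2.2, both come down to set operations on gene content. The sketch below is a minimal illustration, assuming each genome has already been reduced to a set of gene family identifiers. The function name and the toy identifiers are hypothetical; they are not real H. pylori data and were simply chosen so that 6% of each toy strain is strain specific.

```python
# Minimal sketch: comparing gene content of two genomes with set operations.
# Assumes orthologous gene families have already been assigned shared IDs.

def compare_gene_content(genes_a: set, genes_b: set) -> dict:
    """Summarise shared and genome-specific gene families for two genomes."""
    if not genes_a or not genes_b:
        raise ValueError("both gene sets must be non-empty")
    shared = genes_a & genes_b   # families present in both genomes
    only_a = genes_a - genes_b   # families specific to genome A
    only_b = genes_b - genes_a   # families specific to genome B
    return {
        "shared": len(shared),
        "only_a": len(only_a),
        "only_b": len(only_b),
        "pct_specific_a": 100.0 * len(only_a) / len(genes_a),
        "pct_specific_b": 100.0 * len(only_b) / len(genes_b),
        # Presence/absence distance: 0 = identical content, 1 = nothing shared.
        "jaccard_distance": 1.0 - len(shared) / len(genes_a | genes_b),
    }


# Toy data (hypothetical): two strains of 100 families each, 94 shared.
strain_x = {f"fam{i:03d}" for i in range(0, 100)}
strain_y = {f"fam{i:03d}" for i in range(6, 106)}
result = compare_gene_content(strain_x, strain_y)

if __name__ == "__main__":
    for key, value in result.items():
        print(f"{key}: {value}")
```

Computing the same distance for every pair of sequenced genomes gives a distance matrix that could be passed to a standard tree building method, which is one way to realise the gene content phylogeny outlined in 2.2.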
