Microbial Genomes

Part 1 – Methods for Studying Microbial Genomes
Part 2 – Analysis and Interpretation of Whole Genome Sequences
Part 1
1) Why study whole microbial genomes?
- until whole genome analysis became viable, the life sciences were based on a
reductionist principle – a process of dissecting cells and systems into fundamental
components for further study
- studies on whole genomes and whole genome sequences in particular give us a
complete genomic blueprint for an organism
- we can now begin to examine how all of these parts operate cooperatively to influence
the activities and behavior of an entire organism – a complete understanding of the
biology of an organism
- microbes provide an excellent starting point for studies of this type as they have a
relatively simple genomic structure compared to higher, multicellular organisms
- analysis of whole microbial genomes also provides insight into microbial evolution and
diversity beyond single protein or gene phylogenies
- in practical terms, analysis of whole microbial genomes is also a powerful tool for
identifying new applications in biotechnology and new approaches to the treatment
and control of pathogenic organisms
1.1) History of Microbial Genome Sequencing
- 1977 - first complete genome to be sequenced was bacteriophage φX174 - 5386 bp
- first genome to be sequenced using random DNA fragments - bacteriophage λ - 48502
bp
- 1986 - mitochondrial (187 kb) and chloroplast (121 kb) genomes of Marchantia
polymorpha sequenced
- early 90’s - cytomegalovirus (229 kb) and Vaccinia (192 kb) genomes sequenced
- 1995 - first complete genome sequence from a free living organism - Haemophilus
influenzae (1.83 Mb)
- late 1990’s - many additional microbial genomes sequenced including Archaea
(Methanococcus jannaschii - 1996) and Eukaryotes (Saccharomyces cerevisiae - 1996)
1.2) Advantages in Clinical Microbiology
- gain a comprehensive understanding of microbial pathogenesis and host pathogen
interactions
- identification of sensitive and specific molecular targets for identification and typing
- identification of molecular markers associated with disease risk and severity
- selection of potential candidates for rational development of therapeutic agents
2) Microbial genomes sequenced to date
- currently there are 32 complete, published microbial genomes – 25 domain Bacteria, 5
Domain Archaea, 1 domain Eukarya (www.tigr.org)
- around 130 additional microbial genome and chromosome sequencing projects
underway
3) Tools for studying whole genomes
- conventional techniques for analysing DNA are designed for the analysis of small
regions of whole genomes such as individual genes or operons
- many of the techniques used to study whole genomes are conventional molecular
biology techniques adapted to operate effectively with DNA in a much larger size range
3.1) PFGE
- agarose gel electrophoresis is a fundamental technique in molecular biology but is
generally unable to resolve fragments greater than 20 kilobases in size – this makes it
unsuitable for analysing DNA at a whole genome scale (whole microbial genomes are
usually greater than 1000 kilobases in size)
- PFGE (pulsed field gel electrophoresis) is an adaptation of conventional agarose gel
electrophoresis that allows extremely large DNA fragments to be resolved (up to
megabase size fragments)
- PFGE is an essential technique for accurately determining the sizes of whole
genomes/chromosomes prior to sequencing and is necessary for preparing large DNA
fragments for large insert DNA cloning and analysis of subsequent clones
- PFGE is also a commonly used and extremely powerful tool for genotyping and
epidemiology studies for pathogenic microorganisms
3.1.1) Principle of PFGE
- two factors influence DNA migration rates through conventional gels – charge
differences between DNA fragments and the ‘molecular sieve’ effect of the gel pores
- DNA fragments normally travel through agarose pores as spherical coils; fragments
greater than 20 kb in size form extended coils and are therefore not subjected to the
molecular sieve effect
- the charge effect is countered by the proportionally increased friction applied to the
molecules and therefore fragments greater than 20 kb do not resolve
- PFGE works by periodically altering the electric field orientation
- the large extended coil DNA fragments are forced to change orientation and size
dependent separation is re-established because the time taken for the DNA to reorient is
size dependent
- the most important factor in PFGE resolution is switching time – longer switching times
generally increase the size of the DNA fragments that can be resolved
- switching times are optimised for the expected size of the DNA being run on the PFGE
gel
- switch time ramping increases the region of the gel in which DNA separation is linear
with respect to size
- a number of different types of apparatus have been developed to generate this switching
of electric fields; the most commonly used in modern laboratories are FIGE (Field
Inversion Gel Electrophoresis) and CHEF (Contour-Clamped Homogeneous Electric
Field) electrophoresis
3.1.2) Preparation of DNA for PFGE
- ideally a genomic DNA preparation that contains a high proportion of completely or
almost completely intact genome copies would be suitable for PFGE
- conventional means of DNA preparation are unsuitable for PFGE as mechanical
shearing and low-level nuclease activity will result in fragmented DNA with an average
size much smaller than an entire microbial genome (usually less than 200 kb in size)
- the solution to this is to prepare genomic DNA from whole cells in a semisolid matrix
(i.e. agarose) that eliminates mechanical shearing
- a very high concentration of EDTA is also used at all times in order to eliminate all
nuclease activity
Procedure –
1) intact cells are mixed with molten LMT agarose and set in a mold forming agarose
‘plugs’
2) enzymes and detergents diffuse into the plugs and lyse cells
3) proteinase K diffuses into plugs and digests proteins
4) if necessary restriction digests are performed in plugs (extensive washing or PMSF
treatment is required to remove proteinase K activity)
5) plugs are loaded directly onto PFGE and run
- for restriction digests, conventional enzymes are unsuitable as they cut frequently on an
entire genome sequence producing DNA fragments that are far too small
- ‘rare cutter’ restriction endonucleases cut genomic DNA with far less frequency than
conventional restriction enzymes such as HindIII, BamHI etc.
- many rare cutter RE’s have 8-bp (or longer) recognition sites eg. NotI GCGGCCGC
- in many cases the frequency of cutting is highly species dependent eg. BamHI will cut
far less frequently on a low GC% genome when compared to an intermediate or high GC
content genome
- suitable rare cutter enzymes therefore have to be determined experimentally for each
new species being studied
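The dependence of cutting frequency on recognition site length and GC content can be illustrated with a rough calculation. The sketch below assumes that bases occur independently at frequencies set only by the genome GC content, which real genomes do not obey, so it gives a first guess and does not replace the experimental testing described above. The function names are illustrative only.

```python
def site_probability(site: str, gc_fraction: float) -> float:
    """Probability that a random position matches `site`.

    Assumes independent bases: G and C each occur with probability
    gc_fraction / 2, A and T each with (1 - gc_fraction) / 2.
    """
    p_gc = gc_fraction / 2.0
    p_at = (1.0 - gc_fraction) / 2.0
    prob = 1.0
    for base in site.upper():
        prob *= p_gc if base in "GC" else p_at
    return prob


def expected_fragment_size(site: str, gc_fraction: float) -> float:
    """Expected average spacing (bp) between sites = 1 / site probability."""
    return 1.0 / site_probability(site, gc_fraction)


if __name__ == "__main__":
    enzymes = [("HindIII", "AAGCTT"), ("BamHI", "GGATCC"), ("NotI", "GCGGCCGC")]
    for name, site in enzymes:
        for gc in (0.35, 0.50, 0.65):
            size_kb = expected_fragment_size(site, gc) / 1000
            print(f"{name:8s} GC={gc:.2f}  ~{size_kb:,.1f} kb between sites")
```

Under this model a 6-bp site is expected about every 4 kb in a 50% GC genome, while the 8-bp NotI site is expected about every 65 kb, and GC-rich sites become rarer as genome GC content falls.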
3.2) large insert cloning vectors – BAC’s and PAC’s
- DNA cloning is another technique fundamental to molecular biology that requires
adaptation in order to be useful in studying DNA at a whole genome scale
- conventional plasmid derived cloning vectors are only able to reliably maintain inserts
less than 20 kb in size
- there are a number of approaches to generating clones with inserts in an intermediate
size range (20 – 80 kb) such as cosmids, etc.
- the most commonly used vectors for cloning extremely large DNA inserts are BAC’s
(Bacterial Artificial Chromosomes) and PAC’s (P1-derived Artificial Chromosomes)
- both BAC and PAC vectors are plasmid derived vectors distinguished from
conventional vectors by extremely tightly controlled low copy numbers
- these very low copy numbers help to limit the strain on host cellular resources generated
by very large DNA inserts thus eliminating the rejection of large insert clones
- low copy numbers also help to limit recombination events with host genomic DNA
- BAC and PAC vectors both utilise E. coli as the host organism
- BAC vectors are based on the E. coli single copy F-factor plasmid – the F-factor origin
of replication is very tightly controlled
- PAC vectors are based on an identical principle but instead use a single copy origin of
replication derived from P1 phage
4) Approaches to whole genome sequencing
- aim of microbial genome sequencing projects is to construct, from 500 – 800 bp
sequencing reads containing about 1% mistakes, a genome sequence of several
megabases with an error rate lower than 1 per 10000 nucleotides
- with improving software, decreasing computation costs and advancements in automated
DNA sequencing, an entire microbial genome project can be completed in a small
laboratory in 1-2 years
- there are two main approaches to sequencing microbial genomes – the ordered clone
approach and direct shotgun sequencing
- both approaches require large and small insert genomic DNA libraries in order to be effective
4.1) ordered clone approach
- essentially this technique involves constructing a map of overlapping clones covering
the whole genome and then completely sequencing the minimum subset of these ordered
clones
- there are a number of methods used to order clones including restriction fingerprinting
and hybridisation mapping
- in restriction fingerprinting a series of restriction fragment fingerprints are generated by
PFGE for each clone in the library
- clones that share restriction fragment patterns are likely to overlap in sequence, so
overlapping clones can be identified by comparing fingerprints
- hybridisation mapping involves probing large insert (cosmid, BAC or PAC) clones with
PFGE purified DNA fragments generated by rare cutter restriction digests
- multiple clones corresponding to each large DNA fragment are then aligned using
labelled riboprobes corresponding to insert ends
- once an ordered large insert clone set is identified, a whole genome sequence is
determined by either shotgun or partial primer walk sequencing of each insert
- the ordered clone approach to DNA sequencing requires a large amount of
characterisation prior to actual DNA sequencing and is therefore a relatively time
consuming approach, however, it may be cheaper than shotgun sequencing an entire
genome as less redundant sequencing is required
- with rapid decreases in costs for computing power and sequencing this method is no
longer considered viable for small (< 5 Mb) genomes
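The restriction fingerprinting step described above amounts to comparing lists of fragment sizes between clones. A minimal sketch follows, assuming that two fragments count as the same band when their sizes agree within a few percent and that a pair of clones is a candidate overlap when it shares some minimum number of bands. Both thresholds, the function names and the clone data used in the test are hypothetical and would need tuning for a real library.

```python
from itertools import combinations


def shared_fragments(sizes_a, sizes_b, tolerance=0.03):
    """Count fragments in clone A that have a size match in clone B.

    Two fragments match when their sizes differ by no more than
    `tolerance` (a fraction of the larger fragment). Each fragment
    in B can be matched at most once.
    """
    remaining = sorted(sizes_b)
    shared = 0
    for size in sorted(sizes_a):
        for i, other in enumerate(remaining):
            if abs(size - other) <= tolerance * max(size, other):
                shared += 1
                del remaining[i]
                break
    return shared


def candidate_overlaps(fingerprints, min_shared=3):
    """Return (clone_a, clone_b, n_shared) for pairs sharing enough bands."""
    pairs = []
    for (name_a, a), (name_b, b) in combinations(sorted(fingerprints.items()), 2):
        n = shared_fragments(a, b, tolerance=0.03)
        if n >= min_shared:
            pairs.append((name_a, name_b, n))
    return pairs
```

Chaining the candidate pairs gives an approximate clone order, from which a minimum set of clones covering the genome can be chosen for sequencing.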
4.2) random sequencing (shotgun) approach
- this is currently the most commonly used strategy for microbial whole genome
sequencing
- sequences from a large number of small insert clones are generated and overlapping
sequences joined together to form a ‘contig’ of the whole genome sequence
- although this requires enormous amounts of DNA sequencing (often up to 10x genome
coverage) and computational power for sequence assembly, it is a relatively rapid
approach to whole genome sequencing
- the first 90 – 95% of the genome sequence is relatively easy to generate by shotgun
sequencing resulting in several hundred discrete contigs
- filling the gaps to produce a single contig is the most difficult and time consuming
phase of this process
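The relationship between sequencing effort and genome coverage can be estimated with the idealised Lander-Waterman model. That model is not part of these notes and is introduced here as an assumption: reads are taken to start at uniformly random positions, and repeats, cloning bias and minimum overlap lengths are ignored. With coverage c (total bases sequenced divided by genome size), the expected fraction of the genome covered is 1 - e^-c and the expected number of contigs is roughly N x e^-c for N reads.

```python
import math


def shotgun_stats(genome_bp: float, read_bp: float, coverage: float):
    """Idealised Lander-Waterman estimates for a random shotgun library.

    Returns (number of reads, fraction of genome covered,
    expected number of contigs). Ignores repeats, cloning bias
    and the minimum overlap needed to join two reads.
    """
    n_reads = coverage * genome_bp / read_bp
    fraction_covered = 1.0 - math.exp(-coverage)
    expected_contigs = n_reads * math.exp(-coverage)
    return n_reads, fraction_covered, expected_contigs


if __name__ == "__main__":
    # Genome size of Haemophilus influenzae (1.83 Mb); 500 bp reads assumed.
    for c in (1, 3, 5, 8, 10):
        n, frac, contigs = shotgun_stats(1_830_000, 500, c)
        print(f"{c:>2}x: {n:>8.0f} reads, {frac:6.1%} covered, ~{contigs:.0f} contigs")
```

For a 1.83 Mb genome and 500 bp reads the model puts about 95% of the genome in roughly 550 contigs at 3x coverage, which agrees with the picture above of an easy first phase that leaves several hundred contigs to be joined.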
4.3) sequence assembly
- three major steps in sequence assembly – conversion of data from automated
sequencers, utilisation of sequences in the assembly process and assessment of the
assembly
- small and large insert clones that have been end sequenced are used as ‘linking clones’
- inverse PCR may also be used in order to fill gaps
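The core idea of joining overlapping reads into contigs can be shown with a toy greedy assembler. This is only a sketch under strong assumptions: reads are error free, all come from the same strand and overlaps are exact. Real assembly software must also cope with sequencing errors, base quality values, both strands and repeated sequences, and the function names here are illustrative.

```python
def overlap_length(left: str, right: str, min_overlap: int) -> int:
    """Length of the longest suffix of `left` equal to a prefix of `right`."""
    for size in range(min(len(left), len(right)), min_overlap - 1, -1):
        if left.endswith(right[:size]):
            return size
    return 0


def greedy_assemble(reads, min_overlap=4):
    """Repeatedly merge the two sequences with the longest exact overlap.

    Returns the list of contigs left when no pair overlaps by at
    least `min_overlap` bases.
    """
    unique = list(dict.fromkeys(reads))  # drop duplicate reads, keep order
    # Drop reads that lie entirely inside another read.
    contigs = [r for r in unique if not any(r != o and r in o for o in unique)]
    while len(contigs) > 1:
        best_size, best_i, best_j = 0, None, None
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    size = overlap_length(a, b, min_overlap)
                    if size > best_size:
                        best_size, best_i, best_j = size, i, j
        if best_size == 0:
            break  # remaining contigs are separated by gaps
        merged = contigs[best_i] + contigs[best_j][best_size:]
        contigs = [c for k, c in enumerate(contigs) if k not in (best_i, best_j)]
        contigs.append(merged)
    return contigs
```

When a read spanning a region is missing, the function returns more than one contig. Those are the gaps that linking clones or PCR are then used to close.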
4.4) gapped microbial genomes
- considering the cost and difficulty in filling gaps between contigs some interest has
been generated by the analysis of gapped microbial genomes
- each gap is usually very small on average (approximately 75 bp for a 3.2x coverage
library)
- increasing bioinformatic resources available mean that these gaps have little influence
on functional reconstruction
- eg. Thiobacillus ferrooxidans - all assigned amino acid biosynthesis genes (140 in total)
identified from a gapped genome of 1912 contigs
- error rates tend to be relatively high compared to genome sequences with greater
coverage
Part 2
1) Annotation of genome sequences
- a microbial genome sequence alone is only raw data – it needs to be interpreted in order
to be of any scientific significance
- the most efficient means of determining function for gene sequences within microbial
genomes is by comparison to gene sequences of other organisms with known function
- in order for this to be effective you must first accurately predict all possible genes in
your genome sequence
1.1) Identifying ORF’s
- simple homology searches are not sufficient to identify all of the potential genes in a
genome sequence
- most genomes will contain genes with very little or no homology to known genes of
other organisms
- most efficient means for identifying potential genes in genome sequences is a three step
process
1) submit entire sequence as a 6-frame translation for BLAST analysis in order to identify
protein coding regions on the basis of high levels of homology
2) determine the sequence characteristics (GC content, codon bias etc.) that distinguish
coding and non-coding regions of the genome
3) reanalyse the genome sequence using this data (plus potential ribosome binding
sequences) in order to identify all the potential genes
- using this process it has been experimentally shown that around 94% of genes can be
accurately predicted
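The simplest part of gene prediction, scanning all six reading frames for open reading frames, can be sketched as follows. The sketch assumes ATG as the only start codon and the standard stop codons TAA, TAG and TGA, and it does not model codon bias or ribosome binding sites, so it corresponds only to a crude first pass and not to the full three step process above. The function names and the minimum length cut-off are illustrative.

```python
STOPS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence (ACGT only)."""
    return seq.translate(COMPLEMENT)[::-1]


def gc_content(seq: str) -> float:
    """Fraction of G and C bases, one of the statistics used in step 2."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq) if seq else 0.0


def orfs_in_strand(seq: str, min_codons: int):
    """Yield (frame, start, end) for ATG...stop stretches on one strand."""
    for frame in range(3):
        start = None
        for pos in range(frame, len(seq) - 2, 3):
            codon = seq[pos:pos + 3]
            if start is None and codon == "ATG":
                start = pos
            elif start is not None and codon in STOPS:
                if (pos - start) // 3 >= min_codons:
                    yield frame, start, pos + 3
                start = None


def find_orfs(seq: str, min_codons: int = 100):
    """Scan both strands; return (strand, frame, start, end, orf_sequence).

    Coordinates for the "-" strand refer to the reverse complement.
    """
    seq = seq.upper()
    hits = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame, start, end in orfs_in_strand(s, min_codons):
            hits.append((strand, frame, start, end, s[start:end]))
    return hits
```

ORFs found this way that also give strong database matches could serve as the training set from which GC content and codon usage of coding regions are measured.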
1.2) Assigning function to ORF’s
- in order to assign function, all predicted ORF’s are translated to amino acid sequence
and analysed by homology searches against sequence databases (usually GenBank)
- for each ORF there are three possible results i) clear sequence homology indicating
function, ii) blocks of homology to defined functional motifs and iii) no significant
homology or homology to proteins of unknown function
- similarity based assignment of function
- clustering of potential genes (operons etc.)
- clusters of orthologous genes from different organisms (COG’s)
1.3) Genes of unidentified function
- in most genome sequences many of the ORF’s identified cannot be assigned a specific
function based on homology
- although the figure varies, usually between 40 and 50% of ORF’s fall into this category
- clearly this represents a significant gap in our knowledge of microbial metabolism
- these ORF’s can be further divided into two categories –
i) conserved hypothetical proteins – ORF’s with no homology to proteins of known
function but with significant homology to unidentified ORF’s of other species
- these ORF’s are therefore functionally conserved across numerous species and may
represent important components of central metabolism that have not yet been identified
- the more universal the distribution of these ORF’s the more likely they have a
fundamental role in metabolism
ii) ORF’s without homologues – these are ORF’s that have no homology to any known
sequences – these may represent genes encoding proteins related to more specific
organism adaptations
- eg. Deinococcus radiodurans is a radiation resistant organism that contains many ORF’s
without homologues – many of these are thought to be involved in specialised processes
of DNA repair
Organism (total ORF’s)
Homologues to known proteins (%)
No homologues (%)
33.3
35
Homologues to conserved hypothetical proteins (%)
10.3
33.3
E. coli (4277)
Pyrococcus horikoshii (2064)
Haemophilus influenzae (1709)
B. subtilis (4099)
Methanococcus jannaschii (1735)
58.8
18.2
23
58
38.1
5
40.6
37
21.3
56.4
31.7
- in order to gain a complete understanding of an organism and fully exploit the potential
offered by microbial genome sequencing, it is essential that these unidentified ORF’s are
assigned function
- in most cases classical molecular biology tools will be necessary for this task, however,
some suggestion of function for these ORF’s would greatly improve the efficiency of this
process
- one possibility is ‘structural genomics’
- this is the process of determining three dimensional structures of all the gene products
encoded in a microbial genome (1000’s of structures!!)
- function can then be inferred on the basis of 3d structure comparisons to other proteins
- this relies on the principle that proteins with similar amino acid sequences can be
assumed to have similar structures, but proteins with similar structures do not
necessarily have similar amino acid sequences, so structural comparison can reveal
relationships that sequence comparison misses
1.5) Metabolic Reconstruction
- a powerful means of understanding the way microorganisms behave and interact with
their environments is to reconstruct metabolic pathways in silico
- a number of computational tools and databases have been developed in order to allow
this
- in many cases when this approach is used there are key enzymes involved in metabolic
pathways that are not present in annotated genome sequences
- there are two possibilities – i) these enzymes are encoded by low similarity or novel
genes ii) the organism has developed alternatives to classical metabolic pathways
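Checking a pathway against an annotation is essentially a set comparison. The sketch below assumes each pathway is described as a list of required enzymes and that the annotated genome has been reduced to a set of enzyme labels. The labels used in the test are hypothetical placeholders, and in practice both inputs would come from a pathway database and the genome annotation.

```python
def pathway_gaps(pathways, annotated_enzymes):
    """Report, for each pathway, which required enzymes are not annotated.

    `pathways` maps a pathway name to a list of required enzyme labels;
    `annotated_enzymes` is the set of labels found in the genome.
    """
    report = {}
    for name, required in pathways.items():
        if not required:
            continue  # nothing to check for an empty pathway definition
        missing = [e for e in required if e not in annotated_enzymes]
        report[name] = {
            "completeness": 1 - len(missing) / len(required),
            "missing": missing,
        }
    return report
```

A pathway that is nearly complete but lacks one or two steps is a good place to look for low similarity or novel genes among the unassigned ORF’s.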
1.6) Microarray Hybridisation
- a completely annotated microbial genome sequence, whilst a powerful scientific tool,
still doesn’t provide all of the information needed to understand the complete biology of
an organism as it is essentially a static picture of the genome
- for truly complete characterisation, the dynamic nature of gene expression within a
microbial cell needs to be determined
- microarray technology allows whole organism gene expression to be investigated
- PCR products of every gene from a complete genome sequence are bound in a high
density array on a glass slide
- these arrays are probed with fluorescently labelled cDNA prepared from whole RNA
under specific environmental conditions
- the level of cDNA for each ORF is then quantified using high resolution image scanners
- example – a microarray containing 97% of the predicted ORF’s from Mycobacterium
tuberculosis was used to investigate the response to the antituberculosis drug isoniazid
- isoniazid was found to induce several genes related to outer lipid envelope biosynthesis
– consistent with the drug’s physiological mode of action
- a number of additional genes were also induced which may provide potential drug
targets in the future
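Identifying induced genes from scanned array intensities usually comes down to comparing signal between a treated and a control sample. A minimal sketch is given below, assuming one intensity value per ORF per condition and a simple log2 ratio with a fixed cut-off. The pseudocount, the threshold and the function names are illustrative assumptions, and real analyses would add background subtraction, normalisation and replicate spots.

```python
import math


def log2_ratios(treated, control, pseudocount=1.0):
    """log2(treated / control) for every ORF measured in both samples.

    A small pseudocount avoids division by zero for spots with no signal.
    """
    return {
        orf: math.log2((treated[orf] + pseudocount) / (control[orf] + pseudocount))
        for orf in treated
        if orf in control
    }


def induced_genes(ratios, threshold=1.0):
    """ORFs whose log2 ratio is at or above the threshold (1.0 = two-fold)."""
    return sorted(orf for orf, ratio in ratios.items() if ratio >= threshold)
```

Applying the same cut-off with the sign reversed would list repressed genes.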
2) Characteristics of sequenced genomes
2.1) Horizontal gene transfer
- before microbial genome sequences became available most of the focus of microbial
evolution was on ‘vertical’ transmission of genetic information – mutation, recombination
and rearrangement within the clonal lineage of a single microbial population
- genome sequences have demonstrated that horizontal transfer of genes (between
different types of organisms) is widespread and may occur between phylogenetically
diverse organisms
- generally speaking, essential genes (such as 16S rRNA) are unlikely to be transferred
because the potential host most likely already contains genes of this type that have
coevolved with the rest of its cellular machinery and cannot be displaced
- genes encoding non-essential cellular processes of potential benefit to other organisms
are far more likely to be transferred (eg. those involved in catabolic processes)
- even so, genes with essential functions may also be transferred, although this is
rarer, eg. Thermomonospora harbours two sets of 16S rRNA, one acquired by lateral transfer
- clearly, lateral transfer of genomic information has enormous potential in improving a
microorganism’s ability to compete effectively - this may explain why horizontally
transferred genes appear so frequently and ubiquitously in microbial genomes
- an example of this is horizontally transferred genes between Archaeal and Bacterial
hyperthermophiles
2.2) Whole genome phylogenetic analysis
- most of the evolutionary relationships between microorganisms are inferred by
comparison of single genes – usually 16S rRNA genes
- although extremely effective, single gene phylogenetic trees only provide limited
information which can make determining broad relationships between major groups
difficult
- phylogenetic relationships can be determined by whole genome comparisons of the
observed absence or presence of protein encoding gene families
- in effect this is similar to using the distribution of morphological characteristics to
determine phylogeny – without the problem of convergent evolution
- trees produced using this method are similar to 16S rRNA trees, however, as more
genome sequences become available more detailed conclusions can be drawn using this
method
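The presence or absence comparison can be made concrete by treating each genome as a set of gene families and computing a pairwise distance. The sketch below uses the Jaccard distance, which is an assumption chosen for simplicity and not a method stated in these notes. The genome names and family labels in the test are hypothetical.

```python
from itertools import combinations


def jaccard_distance(families_a: set, families_b: set) -> float:
    """1 - (shared families / families present in either genome)."""
    union = families_a | families_b
    if not union:
        return 0.0
    return 1.0 - len(families_a & families_b) / len(union)


def distance_matrix(genomes):
    """Pairwise distances between genomes given as {name: set_of_families}."""
    names = sorted(genomes)
    dist = {(name, name): 0.0 for name in names}
    for x, y in combinations(names, 2):
        d = jaccard_distance(genomes[x], genomes[y])
        dist[(x, y)] = dist[(y, x)] = d
    return dist
```

A distance matrix of this kind could then be passed to a standard tree building method such as neighbour joining, which is not shown here.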
2.3) Archaeal genomics
- analysis of the 5 complete genomes available for members of the domain Archaea has
provided new insights into relationships between Archaea, Bacteria and Eukaryotes
- as expected from single gene phylogeny, Archaea contain a significant fraction of
ORF’s that show homology only to Eukaryotic ORF’s (25-30%)
- most of these encode proteins involved in transcription, translation and DNA
metabolism
- large numbers of genes are also shared exclusively with Bacteria while the remainder
are exclusively Archaeal
2.4) Species and strain specific diversity
- although genome sequencing and analysis is very useful when comparing
phylogenetically distant taxa, it is also of interest to examine the genomes of very
closely related microorganisms
- this allows a more quantitative approach for examining the relationships between
genotype and phenotype
- complete genome sequences have been determined for two species of the genus
Chlamydia (pneumoniae and trachomatis)
- although the overall genome structure was quite similar, C. pneumoniae contained an
additional 214 genes, most of which have an unknown function
- two strains of the bacterium Helicobacter pylori have been completely sequenced
(26695 and J99)
- overall the two strains were very similar genetically with only 6% of genes being
specific to each strain