Genomics: sequencing of microbial genomes
This lecture illustrates the strategies used in microbial genome sequencing projects, compares
genome content and organisation amongst microbes, and shows how to derive information on gene
function across a genome.
Objectives for students:
Describe the strategies involved in microbial genome sequencing and functional genomics
Provide examples of information that can be derived from genomics
Microbial Genome Sequencing




Genome Sequencing Projects
o strategy & methods
o annotation
Comparative genomics
o organisation
o gene content
Functional genomics
o transcriptome
o proteome
o genome-wide mutation
Concentrate on strategy & ideas
Genome Sequencing Projects
Genome sequencing progress (2009)


Complete:
o Archaeal: 70 (2007 = 49) (2008 = 55)
o Bacterial: 945 (2007 = 554) (2008 = 728)
o (Eukaryotic: 121) (2007 = 76) (2008 = 97)
Ongoing:
o Archaeal: 111
o Bacterial: 3498
o (Eukaryotic: 1223)
o Metagenome projects: 200
University of Leicester –Genomes–Microbial Genomics -October 2010
www.genomesonline.org
Bacterial genome projects
Many completed:











Haemophilus influenzae
Escherichia coli
Bacillus subtilis
Mycoplasma genitalium
Helicobacter pylori (x2)
Campylobacter jejuni
Treponema pallidum
Neisseria meningitidis
Neisseria gonorrhoeae
Vibrio cholerae
E. coli O157
Links:




http://www.tigr.org/
http://www.ncbi.nlm.nih.gov/
http://www.sanger.ac.uk/
http://www.genomesonline.org/
Completed microbial eukaryote projects









Yeast -Saccharomyces cerevisiae
Plasmodium falciparum
Aspergillus nidulans, A. niger, A. oryzae & A. fumigatus
Trypanosoma cruzi & T. brucei
Leishmania
Entamoeba histolytica
Giardia lamblia
Candida albicans & C. glabrata
Paramecium
Genome sequencing strategy
In the pre-genome era there were a number of considerations regarding the benefits of sequencing.
The piecemeal collection of sequenced genes was slow and costly. Issues also arose over ownership,
strain choice, approach and data release. The genome project, however, provided a rational
approach to sequencing which was efficient and rapid, and was able to address novel questions. The
post-genomic era has allowed the application of comparative and functional genomics.
Genome sequencing strategy:
 Strategy choice
o large collaborative cosmid/BAC-based projects
 now better suited for larger genomes
 slow
o small insert shotgun approach
 centralised
 rapid and efficient
 choice for bacteria
 Strain choice
o fresh isolate vs lab strain
o clinical vs environmental
o subsequent genetic analysis
E.g. Yeast genome sequence strategy


Yeast chromosomes (16) individually sequenced
Several approaches used
o Make genome library in cosmids
 order cosmid library
 need to know which cosmid overlaps with which
 link cosmid to genome map
 produce tiled set of cosmids
 only sequence minimum number
o Use chromosome-specific probe to identify chromosome-specific cosmids
o Sequence cosmid inserts by subcloning
o Solve problems by direct PCR sequencing, walking and other libraries (lambda)
o Telomeres
Whole genome/chromosome shotgun strategy (WGS)





Rapid
Generation of small insert genomic library
Library is not initially ordered
Sequence the ends of inserts
Depends on powerful computing to assemble sequence reads
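The computational assembly step can be illustrated with a toy example. The sketch below uses a greedy overlap merge, chosen here for brevity and not necessarily the method used by any production assembler. The function names, the `min_len` parameter and the example reads are all hypothetical.

```python
def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of a that equals a prefix of b (0 if < min_len)."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(reads: list[str], min_len: int = 3) -> list[str]:
    """Repeatedly merge the pair of sequences with the largest overlap."""
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while True:
        best_k, best_i, best_j = 0, None, None
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        if best_k == 0:
            return contigs  # nothing left to merge: remaining pieces are contigs
        merged = contigs[best_i] + contigs[best_j][best_k:]
        contigs = [c for n, c in enumerate(contigs) if n not in (best_i, best_j)]
        contigs.append(merged)


# Three overlapping reads from an unordered library collapse into one contig.
reads = ["ATGGCGTA", "GCGTACGT", "ACGTTAGC"]
print(greedy_assemble(reads))  # ['ATGGCGTACGTTAGC']
```

Reads that share no sufficient overlap are left as separate contigs, which corresponds to a gap in the sequence.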
Main steps in generating a complete genome sequence
Automated sequencers:
Manual chain-termination sequencing requires four reaction tubes, each containing a different
terminator base as well as a radioactive nucleotide for labelling the newly synthesised DNA
fragments. Each of the four reactions is electrophoresed in a separate lane of a gel. Demand for the
ability to read more sequence in less time led to the automation of the DNA sequencing process.
Attaching a different fluorescent dye to each of the four terminator bases meant that four
separate sequencing reactions were no longer required; the entire sequencing reaction could be
carried out in a single tube. The development of automated sequencing machines using
multiple capillaries (thin, hollow glass tubes filled with a gel polymer) removed the need for a
technician to load each sequencing reaction into an individual lane of the gel before the run.
ABI 3700
The ABI 3700s (made by Applied Biosystems) are the most widely used automated sequencers. They
have 96 capillaries, with a robot loading from 384-well plates.
MegaBACE
The MegaBACE is made by Amersham. It also has 96 capillaries and robotic loading from 384-well
plates. Each run takes two to four hours, and can read up to 800 bases.
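As a rough illustration of what these figures imply, the sketch below computes an upper bound on raw bases per instrument per day. It assumes back-to-back runs, every capillary yielding the maximum read length, and no loading time or failed reads; the function name is hypothetical.

```python
def max_bases_per_day(capillaries: int, read_length: int, run_hours: float) -> float:
    """Upper bound on raw bases per instrument per day (idealised, see assumptions)."""
    runs_per_day = 24 / run_hours
    return capillaries * read_length * runs_per_day


# MegaBACE figures from the text: 96 capillaries, up to 800 bases, 2 to 4 hour runs.
for hours in (2, 4):
    print(f"{hours} h runs: {max_bases_per_day(96, 800, hours):,.0f} bases/day at most")
```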
These advances have led to the industrialization of sequencing. Most genome sequencing projects
divide tasks (such as genome libraries, production sequencing and finishing) among different teams.
Sequencing machines are run 24 hours a day, 7 days a week, and many tasks can be performed
by robots.
454 sequencing: the future?
454 sequencing, developed by 454 Life Sciences (a Roche company), relies on a technique known as
pyrosequencing (sequencing by synthesis). It differs from Sanger sequencing in that it detects
pyrophosphate release on nucleotide incorporation, rather than relying on chain termination with
dideoxynucleotides.

Nucleotides are flowed sequentially in a fixed order across the PicoTiterPlate device during
a sequencing run.

During the nucleotide flow, hundreds of thousands of beads each carrying millions of
copies of a unique single-stranded DNA molecule are sequenced in parallel.

If a nucleotide complementary to the template strand is flowed into a well, the polymerase
extends the existing DNA strand by adding nucleotide(s).

Addition of one (or more) nucleotide(s) results in a reaction that generates a light signal
that is recorded by the CCD camera in the instrument.

The signal strength is proportional to the number of nucleotides incorporated in a single
nucleotide flow.
The GS FLX System software tracks the location of DNA-carrying beads on an XY axis. Each bead
corresponds to an XY coordinate on a series of images. The signal intensity per nucleotide flow is
recorded for each bead over time and is plotted to generate a flowgram. Each 10-hour sequencing
run on the GS FLX Titanium series will typically produce over one million flowgrams, one flowgram
per read.
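How a flowgram maps back to a base sequence can be sketched as follows. This is a simplified illustration, not the GS FLX software: it assumes ideally normalised signals (about 1.0 per incorporated base), a repeating flow order of T, A, C, G, and simple rounding in place of real base-calling. The function name and the example signals are hypothetical.

```python
def decode_flowgram(signals: list[float], flow_order: str = "TACG") -> str:
    """Convert per-flow signal intensities into a base sequence."""
    bases = []
    for i, signal in enumerate(signals):
        nucleotide = flow_order[i % len(flow_order)]
        # Signal strength is proportional to the number of bases incorporated,
        # so a signal near 2.0 means a run of two identical bases.
        count = max(0, round(signal))
        bases.append(nucleotide * count)
    return "".join(bases)


# Flows: T, A, C, G, T, A, C, G
signals = [1.02, 0.05, 0.97, 2.10, 0.10, 2.90, 0.00, 1.05]
print(decode_flowgram(signals))  # TCGGAAAG
```

Because run length is inferred from signal strength, long runs of the same base are harder to call accurately than single bases.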
The development and impact of 454 sequencing. http://www.ncbi.nlm.nih.gov/pubmed/18846085
Rothberg et al. Nature Biotechnology, Volume 26, 1117-1124 (9/10/2008)
Work involved in whole genome sequencing:
 individual sequencing reads accumulate
o each read about 500bp
o computing used to assemble reads
o contiguous sequences called contigs
 Aim for 8-10-fold read coverage of genome for accuracy
 example:
o H. influenzae
 19,687 templates
 24,304 reads assembled
 11,631,485 bp
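The coverage implied by these figures can be checked with simple arithmetic. The sketch below uses the read and base counts above together with the H. influenzae genome size of 1.830 Mbp given later in these notes; the function name is hypothetical.

```python
def fold_coverage(total_read_bp: int, genome_bp: int) -> float:
    """Average number of times each genome position is covered by a read."""
    return total_read_bp / genome_bp


total_bp = 11_631_485   # bases in assembled reads
n_reads = 24_304        # reads assembled
genome_bp = 1_830_000   # H. influenzae genome, 1.830 Mbp

print(f"mean read length: {total_bp / n_reads:.0f} bp")
print(f"coverage: {fold_coverage(total_bp, genome_bp):.1f}-fold")
```

The mean read length comes out close to the 500 bp quoted above, and the coverage is a little over 6-fold.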
Gaps in genome sequence need to be filled in:
Bridging Gaps
A contig is a set of gel readings that are related to one another by overlap of their sequences. The
gel readings in a contig can be summed to form a contiguous consensus sequence; the length of this
sequence is the length of the contig.




rise in contig number as number of reads increases
steady fall as accumulating sequence bridges gaps between contigs
levels off as new reads are more likely to fall in a known contig than in a gap
start finishing
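The rise and fall in contig number can be reproduced with the Lander-Waterman model of random shotgun sequencing. The notes do not name this model; it is brought in here as an illustration. Under the model the expected number of contigs is N·exp(-c·σ), where N is the number of reads, c = N·L/G is the coverage and σ = 1 - T/L allows for a minimum detectable overlap T. The genome size and read length below are taken from the H. influenzae example; the function name is hypothetical.

```python
import math


def expected_contigs(n_reads: int, read_len: int, genome_len: int,
                     min_overlap: int = 0) -> float:
    """Expected contig count under the Lander-Waterman model (single reads included)."""
    coverage = n_reads * read_len / genome_len
    sigma = 1 - min_overlap / read_len
    return n_reads * math.exp(-coverage * sigma)


genome_len, read_len = 1_830_000, 500
for n in (1_000, 3_660, 10_000, 20_000, 30_000):
    coverage = n * read_len / genome_len
    print(f"{n:>6} reads  {coverage:4.1f}x  ~{expected_contigs(n, read_len, genome_len):7.1f} contigs")
```

The count peaks at about 1-fold coverage and then decays exponentially, so the last few gaps are expensive to close by random sequencing alone, which is why a directed finishing phase follows.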
Finishing
 Why are gaps present?
 Gap bridging
o sequence gaps
 choose appropriate clone and walk
o physical gaps
 alternative libraries (which?)
 PCR across gap
 Mistakes/poor sequence
o areas where read coverage is below 8-10-fold
o repeated sequences, e.g. rRNA genes
 closure and completion
Genome annotation



Find ORFs
o look for ATG to stop codon (plus alternative start codons)
o over certain size
o overlaps
o computer based (“Glimmer” & “Orpheus”) and trained eye
ORF function
o Search databases with predicted translated sequences (BLASTX)
o Consider level of similarity and context
o Domain comparisons
 Pfam/Prosite
Other features
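The ORF-finding step above can be sketched as a six-frame scan for a start codon followed by an in-frame stop codon, keeping only ORFs over a minimum size. This is a naive illustration, not how Glimmer or Orpheus work (those use trained statistical models). The function names and the `min_codons` default are hypothetical, and the alternative start codons GTG and TTG are the common bacterial ones.

```python
STARTS = {"ATG", "GTG", "TTG"}   # ATG plus common bacterial alternatives
STOPS = {"TAA", "TAG", "TGA"}


def revcomp(seq: str) -> str:
    """Reverse complement of an upper-case DNA sequence."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def find_orfs(seq: str, min_codons: int = 100) -> list[tuple[str, int, int, str]]:
    """Return (strand, start, end, sequence) for each ORF of at least min_codons.

    Coordinates are 0-based, end-exclusive, and for the '-' strand refer to
    positions on the reverse-complemented sequence.
    """
    orfs = []
    for strand, s in (("+", seq), ("-", revcomp(seq))):
        for frame in range(3):
            start = None
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon in STARTS:
                    start = i                      # first start codon opens the ORF
                elif start is not None and codon in STOPS:
                    if (i - start) // 3 >= min_codons:
                        orfs.append((strand, start, i + 3, s[start:i + 3]))
                    start = None                   # look for the next ORF in this frame
    return orfs


demo = "CC" + "ATG" + "GCT" * 5 + "TAA" + "GG"
print(find_orfs(demo, min_codons=5))
```

Raising `min_codons` reduces false positives from short chance ORFs at the cost of missing genuinely small genes, which is one reason a trained eye is still needed.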
http://www.yeastgenome.org/MAP/GENOMICVIEW/GenomicView.shtml
http://mips.gsf.de/genre/proj/yeast/index.jsp
Artemis: sequence viewer and annotation tool from the Sanger Centre
(http://www.sanger.ac.uk/Software/Artemis/)
http://xbase.bham.ac.uk/
xBASE is a database for comparative genome analysis of all bacterial genome sequences
Chaudhuri RR, Pallen MJ. xBASE, a collection of online databases for bacterial comparative genomics.
Nucleic Acids Res. 2006 Jan 1;34(Database issue):D335-7.
http://nar.oxfordjournals.org/cgi/content/full/34/suppl_1/D335
Post Genome Sequence Approaches



Comparative genomics
o comparing genome organisation and content
o genome size
o genome repeats/Tn/phages
o gene content
o minimal gene content
Functional genomics: ascribing gene function across a genome
o gene function: knowns
o phenotype prediction
o gene function: unknowns
o investigating function
Bacteria-Yeast
Bacteria: Does genome size matter?
 Link genome size to adaptive capability
o biosynthetic capability
 synthesis of nutrients
o Stress resistance
 resist environmental insults
o structural complexity
 surface structures, sporogenesis
 Regulation: sensing signals and transcriptional responses
o detect change or requirement and respond appropriately
o transcriptional regulation
Although the size of a bacterial genome is important, how the genome is expressed and regulated
within its environment is also important.
Small genomes:



Mycoplasma genitalium
o 580,070 bp
o smallest genome for self-replicating organism
o free living but infects host cells
o few biosynthesis and regulatory systems
o retains replication, transcription, translation and core metabolic functions
Borrelia burgdorferi
o 910,725 bp
o Lyme disease
o few cellular biosynthetic systems
Mycoplasma pneumoniae (0.8 Mbp); Chlamydia trachomatis (1.0 Mbp)
Larger genomes:
 Haemophilus influenzae
o 1.830 Mbp
o colonises human respiratory tract
o limited environment
 Helicobacter pylori
o 1.667 Mbp
o colonises human stomach
o limited environment
 Campylobacter jejuni
o 1.641 Mbp
o colonises intestine
o limited environment
Very large:
 Escherichia coli (K-12)
o 4.639 Mbp
 Bacillus subtilis
o 4.214 Mbp
o soil/plant organism
o secondary metabolites
 Pseudomonas aeruginosa
o incomplete (5.9 Mbp)
 Yersinia pestis (4.4 Mbp)
 Clostridium spp. (4-5 Mbp)
 Mycobacterium tuberculosis
o 4.411 Mbp
o slow growing (doubles in 24 h)
o large proportion of genome devoted to lipid metabolism
 Streptomyces coelicolor (~8 Mbp)
o secondary metabolites, including antibiotics
Organisation of bacterial genomes
 Linear chromosomes
o Borrelia burgdorferi
o Streptomyces coelicolor
 Multiple chromosomes
o Vibrio cholerae
 Plasmids
o Borrelia burgdorferi
o 17 linear & circular plasmids
o 50% genome size
o plasmid replication, “decaying genes”, Antigenic variation
 Transposons, IS elements, phages
o found in most genomes
o Although Campylobacter has none
 Repeats
Replication
 Origin (oriC) and termination (terC) of replication
o oriC often near dnaA gene (replication initiation protein)
o In Borrelia burgdorferi (linear) oriC (& dnaA) in centre
 strand bias
o which strand is each gene on?
o transcription in same direction as replication is more efficient
o variation in level of strand bias
 M. tuberculosis (Mt) 55% vs B. subtilis (Bs) 75%
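Strand bias of this kind can be computed from gene coordinates once the origin and terminus are known. The sketch below assumes a circular chromosome, replication proceeding in both directions from oriC to terC, and "+" meaning a gene transcribed in the direction of increasing coordinate. The gene positions are invented for illustration and the function name is hypothetical.

```python
def codirectional_fraction(genes: list[tuple[int, str]], ori: int, ter: int,
                           genome_len: int) -> float:
    """Fraction of genes transcribed in the same direction as replication."""
    def on_clockwise_replichore(pos: int) -> bool:
        # Distance travelled clockwise from the origin, compared with the
        # clockwise distance from origin to terminus.
        return (pos - ori) % genome_len < (ter - ori) % genome_len

    same = sum(
        1 for pos, strand in genes
        # Clockwise replichore: '+' genes are co-directional.
        # Anticlockwise replichore: '-' genes are co-directional.
        if (strand == "+") == on_clockwise_replichore(pos)
    )
    return same / len(genes)


# Toy genome of 100 units, origin at 0 and terminus at 50.
genes = [(10, "+"), (20, "+"), (30, "-"), (40, "+"),
         (60, "-"), (70, "-"), (80, "+"), (90, "-")]
print(codirectional_fraction(genes, ori=0, ter=50, genome_len=100))  # 0.75
```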
Genes can be annotated according to sequence similarity (e.g. gene families such as regulators,
transport or biosynthesis genes) or domain matches (such as trans-membrane domains or DNA-binding
domains). Paralogues and orthologues can also be noted. Paralogues are members of the same
(homologous) family in the same genome, but are likely to have evolved a different exact function.
Orthologues, on the other hand, are homologous genes (same family) in different genomes that may
have identical function.
This allows the deduction of metabolic pathways in newly sequenced organisms:
e.g. Vibrio cholerae
Reprinted by permission from Macmillan Publishers Ltd: [NATURE] (Heidelberg et al., 406, 477-483),
copyright (2000)
A significant proportion of a genome consists of ORFs of unknown function. Some may be orthologues
of unknowns in other organisms, whilst others may be unique to the organism and important for its
biology. For example, H. influenzae has 42% of genes with no known function, whilst
H. pylori has 33%, E. coli has 38% and M. tuberculosis between 16% and 60%. The number of these
genes of unknown function is decreasing, however.
Comparison between genomes indicates the differing genomic arrangements within species, for
example:
Comparison of Salmonella enterica serovar Typhi CT18 and Salmonella enterica serovar Typhi Ty2
shows an inversion that spans the terminus.
Variation in genomes may occur by gain or loss. Regions shared by closely related species are
referred to as core regions. There is also an additional “flexible” gene pool containing variable
regions acquired from mobile genetic elements. These were first described as pathogenicity islands;
because they are also found in non-pathogens and have wider roles than pathogenicity, they are now
referred to as genomic islands. These islands are found in pathogens, commensals,
symbionts and environmental bacteria. The gain of a genomic island can be associated with gene
loss, e.g. gene reduction in obligate intracellular pathogens. Genome organisation as well as genome
content correlates with microbial lifestyle.
Inserted genomic islands are frequently located adjacent to tRNA genes and are known as
tRNA-associated elements, e.g. tRNAProL and tRNAArgU in S. Typhi and E. coli.
The supragenome
The distributed-genome hypothesis (DGH) states that bacteria possess a number of virulence traits
that are expressed only at the population level and are not operational at the single-cell level, i.e.
that bacteria have a (supra)genome much larger than the genome of any single bacterium.
The supragenome consists of core and non-core gene sets. For example, Hiller et al. (Journal of
Bacteriology, November 2007, Vol. 189, No. 22, p. 8186-8195;
http://jb.asm.org/cgi/content/abstract/189/22/8186)
sequenced 8 strains of Streptococcus pneumoniae and analysed a further 9 previously available
genomes. They found a core set of genes present in all strains, but 20-30% of genes were non-core
(not present in all strains), reflecting genetic recombination that generates diversity across
strains. A similar pattern was observed in Haemophilus influenzae by Hogg et al. (Genome Biology
2007, 8:R103; http://genomebiology.com/2007/8/6/R103), who found ~1400 genes in the core set and
~1300 non-core genes present in a subset of strains.
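The split into core and non-core gene sets amounts to set arithmetic over the gene content of each strain. A minimal sketch follows, with made-up strain and gene names; it is not the analysis pipeline used in either paper.

```python
def core_and_noncore(strain_genes: dict[str, set[str]]) -> tuple[set[str], set[str]]:
    """Return (core, non_core) gene sets for a collection of strains."""
    gene_sets = list(strain_genes.values())
    supragenome = set().union(*gene_sets)   # every gene seen in any strain
    core = set.intersection(*gene_sets)     # genes present in all strains
    return core, supragenome - core


# Hypothetical gene content for three strains.
strains = {
    "strain1": {"geneA", "geneB", "geneC"},
    "strain2": {"geneA", "geneB", "geneD"},
    "strain3": {"geneA", "geneB", "geneC", "geneE"},
}
core, non_core = core_and_noncore(strains)
print(sorted(core))      # ['geneA', 'geneB']
print(sorted(non_core))  # ['geneC', 'geneD', 'geneE']
```

As more strains are added the core set can only shrink or stay the same, while the supragenome can only grow.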
Yeast







16 chromosomes totalling 12.068 Mbp
6275 ORFs predicted, of which 390 are thought unlikely to be translated, leaving 5885
Few introns (~4% of genes)
Average gene size 2 kb (worm ~6 kb and human >30 kb)
GC content varies along chromosome length
o low GC at telomere & centromere
o GC-rich regions correlate with higher recombination
Tn and remnants in genome
o evidence of hotspots
50% of ORFs have known function
o For some the exact role is unclear
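Variation in GC content along a chromosome is usually examined with a sliding window. The sketch below is a minimal version; the window and step sizes are arbitrary choices and the function name is hypothetical.

```python
def gc_windows(seq: str, window: int = 1000, step: int = 500) -> list[tuple[int, float]]:
    """Return (window start, GC fraction) for each full window along seq."""
    results = []
    for start in range(0, len(seq) - window + 1, step):
        chunk = seq[start:start + window].upper()
        gc = (chunk.count("G") + chunk.count("C")) / window
        results.append((start, gc))
    return results


# Toy sequence: an AT-only stretch followed by a GC-only stretch.
toy = "AT" * 10 + "GC" * 10
print(gc_windows(toy, window=20, step=10))  # [(0, 0.0), (10, 0.5), (20, 1.0)]
```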
http://genome-www.stanford.edu/Saccharomyces
http://mips.gsf.de/projects/fungi
Functional genomics
Functional genomics involves ascribing gene function across a genome.
Micro and Chip Arrays:
 Microarrays
o Glass slides with <10,000 individual samples applied in known position
o Use of robotics
o Samples can be PCR products or oligos
o example: oligos complementary to each unique Tag
o example: oligo/PCR product complementary to each ORF
 Chip arrays
o silicon based
o >10,000 sequences
o http://www.affymetrix.com/index.html
Transcriptome
The transcriptome is the total set of RNAs (including mRNA, rRNA, tRNA, and non-coding RNA)
produced by a single cell or population of cells; measuring it provides a genome-wide expression
level for each ORF. The expression of a gene relates to its role, so the transcriptome also allows
the assessment of mutants, by comparing the expression of each ORF in different conditions. Both
genome-wide expression maps and global patterns of expression can be produced.
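Comparing expression between two conditions is commonly summarised as a log2 ratio per ORF. The sketch below shows the arithmetic only: the gene names and intensity values are invented, the pseudocount is an arbitrary choice to avoid division by zero, and real array analysis would also need normalisation and replicates.

```python
import math


def log2_fold_changes(control: dict[str, float], treated: dict[str, float],
                      pseudocount: float = 1.0) -> dict[str, float]:
    """log2(treated / control) for every ORF measured in both conditions."""
    return {
        orf: math.log2((treated[orf] + pseudocount) / (control[orf] + pseudocount))
        for orf in control
        if orf in treated
    }


# Hypothetical signal intensities in two growth conditions.
control = {"geneA": 99.0, "geneB": 7.0}
treated = {"geneA": 399.0, "geneB": 7.0}
print(log2_fold_changes(control, treated))  # {'geneA': 2.0, 'geneB': 0.0}
```

A value of 2.0 means roughly four-fold higher expression in the second condition; 0.0 means no change.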
http://www.bio.davidson.edu/courses/genomics/chip/chip.html
e.g. Expression profiling of C. jejuni in low iron
Proteome
The proteome is the entire set of proteins expressed by a genome, cell, tissue or organism
at a given time under defined conditions. This genome-wide determination of protein
expression provides information on how protein expression is linked to function. It allows
assessment of mutants, in particular regulatory mutants which affect several proteins. Bacteria are
grown under defined conditions, and their proteins are extracted and separated by 2D gel
electrophoresis. Proteins can then be identified by spot identification, mass spectrometry and
peptide size predictions from genome data.
E.g. growth of C. jejuni in iron
http://depts.washington.edu/yeastrc/pages/ms.html
Mutantome
Mass mutagenesis can be used to create a mutantome, in which every ORF in the genome has been
mutated using organism-specific technology. This allows high-throughput analysis of phenotype,
with many thousands of mutants analysed under many conditions. Signature-tagged technology
enables analysis of mutant pools, but requires array technology for genome-wide projects.
Signature tagging involves the addition of short unique DNA sequence tags. Each tag is linked to a
mutation, with each individual mutant having a unique tag.
By inserting a “molecular barcode” within a gene for a number of mutants and then subjecting this
pool of mutants to a treatment, the copies of each barcode present post-treatment can be determined.
The process identifies barcoded mutants that are missing after treatment, and hence the genes that
can be assumed to have a role in adapting to the treatment environment.
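The comparison of barcode abundance before and after treatment can be sketched as below. The tag names, counts and the `min_ratio` cut-off are all hypothetical, and a real screen would use replicates and a statistical test in place of a fixed threshold.

```python
def depleted_mutants(input_counts: dict[str, int], output_counts: dict[str, int],
                     min_ratio: float = 0.1) -> list[str]:
    """Tags whose relative abundance after treatment falls below min_ratio of input."""
    total_in = sum(input_counts.values())
    total_out = sum(output_counts.values())
    if total_in == 0 or total_out == 0:
        raise ValueError("both pools must contain at least one barcode count")

    depleted = []
    for tag, n_in in input_counts.items():
        if n_in == 0:
            continue  # mutant was never in the input pool, so nothing to compare
        freq_in = n_in / total_in
        freq_out = output_counts.get(tag, 0) / total_out
        if freq_out / freq_in < min_ratio:
            depleted.append(tag)
    return sorted(depleted)


# Hypothetical barcode counts before and after treatment.
before = {"tag1": 100, "tag2": 100, "tag3": 100, "tag4": 100}
after = {"tag1": 150, "tag2": 140, "tag3": 0, "tag4": 10}
print(depleted_mutants(before, after))                 # ['tag3']
print(depleted_mutants(before, after, min_ratio=0.2))  # ['tag3', 'tag4']
```

Each depleted tag points back to the gene carrying that barcode, which is then a candidate for a role in surviving the treatment.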
Nature Reviews Genetics 7, 929-939 (December 2006):
http://www.nature.com/nrg/journal/v7/n12/full/nrg1984.html
Interactome: yeast two-hybrid screening allows the identification of protein-protein interactions
and protein-DNA interactions by testing for physical interactions (such as binding) between two
proteins or a single protein and a DNA molecule, respectively.
The premise behind the test is the activation of downstream reporter gene(s) by the binding of a
transcription factor onto an upstream activating sequence (UAS). For two-hybrid screening, the
transcription factor is split into two separate fragments, called the binding domain (BD) and
activating domain (AD). The BD is the domain responsible for binding to the UAS and the AD is the
domain responsible for the activation of transcription.
Combining an expression library of binding-domain fusions (protein 1, the bait) with an expression
library of activation-domain fusions (protein 2, the prey) allows combinations of all open reading
frames within a genome to be tested.
http://en.wikipedia.org/wiki/File:Two_hybrid_assay.svg
Genomic indexing
Microarray techniques can be used to assess gene inventories. Genomic indexing evaluates the
distribution of genes of sequenced bacterial strains among un-sequenced strains of the same or
related species, and can be used to determine the repertoire of virulence genes found in bacterial
pathogens.
For example, an array of all known genes in a microbe is created and hybridised with labelled
chromosomal DNA. Genes 1, 2, 3 & 14 hybridise and are taken to form the minimal gene set, while
the patterns obtained from different isolates can be identified and compared.
Reprinted by permission from Macmillan Publishers Ltd: [NATURE REVIEWS GENETICS]
(Mazurkiewicz et al. 7 929-939), copyright (2006)
This marks the end of the lecture notes for Genomes on Microbial Genomes.