12-Metagenomics_Primer_P_Pyl

Metagenomics Primer
Paul Pyl
16S rRNA is used as a “fingerprint” gene in metagenomics.
This gene contains variable regions interspersed with conserved regions (the
conserved regions are those that define ribosome function). Most bacteria share
the same conserved regions, so primers designed against them can amplify the
variable regions in between. The variable regions are the fingerprint: their
sequence allows the organism to be identified.
Read counts provide an approximation of bacterial abundances.
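As a minimal sketch of how read counts approximate abundance, the counts for one sample can be turned into proportions. The taxon names and numbers below are made up for illustration and are not from the lecture.

```python
def relative_abundance(read_counts):
    """Convert raw read counts per taxon into proportions that sum to 1."""
    total = sum(read_counts.values())
    if total == 0:
        raise ValueError("sample has no reads")
    return {taxon: n / total for taxon, n in read_counts.items()}


# Hypothetical read counts for a single sample (illustrative values only).
counts = {"Bacteroides": 600, "Faecalibacterium": 300, "Escherichia": 100}
proportions = relative_abundance(counts)

for taxon, p in proportions.items():
    print(f"{taxon}: {p:.2f}")
```

The proportions only approximate true abundances, since factors such as 16S copy number and amplification bias are ignored here.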
Whole metagenome sequencing can also be used: all the DNA of everything in the
sample is sequenced.
Size selection can be applied, e.g. to concentrate on material approximately the
right size for bacteria.
Various pipelines are available for metagenomics with the same steps:
- Filter reads, e.g. remove reads from host, from food the host eats.
- Assembly – assemble reads into contigs – this provides a type of clustering.
- Mapping – find out which bacteria each contig comes from – use a
  metagenomic gene catalogue.
- Abundance:
  o number of samples x number of operational taxonomic units (OTUs)
  o number of samples x number of genes
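The abundance step ends in a samples x features table. A rough sketch of assembling one from per-sample counts is shown here; the sample and OTU identifiers are hypothetical, and features missing from a sample are filled with zero.

```python
def build_abundance_matrix(per_sample_counts):
    """Return (sample_ids, feature_ids, matrix) with one row per sample."""
    sample_ids = sorted(per_sample_counts)
    # Union of every feature (OTU or gene) seen in any sample.
    feature_ids = sorted({f for c in per_sample_counts.values() for f in c})
    matrix = [
        [per_sample_counts[s].get(f, 0) for f in feature_ids]
        for s in sample_ids
    ]
    return sample_ids, feature_ids, matrix


# Hypothetical per-sample OTU counts (illustrative values only).
per_sample = {
    "sample_A": {"OTU_1": 120, "OTU_2": 30},
    "sample_B": {"OTU_2": 75, "OTU_3": 5},
}
samples, otus, matrix = build_abundance_matrix(per_sample)
print(otus)
for name, row in zip(samples, matrix):
    print(name, row)
```

The same layout works for a samples x genes table by using gene identifiers as the features.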
Metahit.eu – human gut microbiome database
Human Microbiome Project (NIH) – focused on humans
Whole metagenome studies work in terms of functional units, e.g. all genes
involved in digesting fibre. These can be found in a gene catalogue or in Qiita
(web-based).
BIOM files are becoming the standard file format for microbial abundance.
In R, the biom package reads the (old, JSON-based) BIOM format (not the HDF5-based format).
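To show what the older JSON-based BIOM layout looks like, here is a small sketch that expands a sparse table into a dense one using only the standard library. It reads just the fields it needs (`shape`, `rows`, `columns`, `matrix_type`, `data`) and is not a validator; the table contents are invented, and the helper name `biom_json_to_dense` is hypothetical.

```python
import json


def biom_json_to_dense(text):
    """Parse a JSON BIOM table; rows are observations, columns are samples."""
    table = json.loads(text)
    n_rows, n_cols = table["shape"]
    row_ids = [r["id"] for r in table["rows"]]
    col_ids = [c["id"] for c in table["columns"]]
    if table["matrix_type"] == "sparse":
        # Sparse data is a list of [row_index, column_index, value] triples.
        dense = [[0] * n_cols for _ in range(n_rows)]
        for i, j, value in table["data"]:
            dense[i][j] = value
    else:
        dense = table["data"]
    return row_ids, col_ids, dense


# Invented two-OTU, three-sample table for illustration.
example = """{
  "id": null, "format": "Biological Observation Matrix 1.0.0",
  "type": "OTU table", "matrix_type": "sparse", "matrix_element_type": "int",
  "shape": [2, 3],
  "rows": [{"id": "OTU_1", "metadata": null}, {"id": "OTU_2", "metadata": null}],
  "columns": [{"id": "S1", "metadata": null}, {"id": "S2", "metadata": null},
              {"id": "S3", "metadata": null}],
  "data": [[0, 0, 5], [0, 2, 1], [1, 1, 7]]
}"""
otu_ids, sample_ids, dense = biom_json_to_dense(example)
print(otu_ids, sample_ids, dense)
```

Note that BIOM stores observations as rows and samples as columns, which is the transpose of a samples x OTUs abundance table.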
phyloseq can be used for basic analysis and provides an interface to DESeq2.
metagenomeSeq – differential OTU analysis.
Reads to Counts (16S data)
- Read pairs are clustered by sequence similarity.
- Each cluster has a consensus sequence, which can be used as a query against
  known 16S sequences to see where it fits best.
- The resolution varies depending on which 16S variable region is used.
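The clustering step can be sketched as a greedy pass over the reads. This is a deliberately simplified illustration: it assumes equal-length, already-aligned sequences and uses the first read of each cluster as its representative instead of a true consensus. The identity threshold is an assumption (97% is a commonly used convention, not something stated in these notes), and the toy reads are invented.

```python
def identity(a, b):
    """Fraction of matching positions between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length in this sketch")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def greedy_cluster(reads, threshold=0.97):
    """Assign each read to the first cluster whose seed is similar enough."""
    clusters = []  # list of (seed_sequence, member_reads)
    for read in reads:
        for seed, members in clusters:
            if identity(seed, read) >= threshold:
                members.append(read)
                break
        else:
            # No existing seed was close enough, so this read starts a cluster.
            clusters.append((read, [read]))
    return clusters


# Toy 10-base reads; a lower threshold is used because they are so short.
reads = ["ACGTACGTAC", "ACGTACGTAT", "TTTTGGGGCC", "ACGTACGTAC"]
clusters = greedy_cluster(reads, threshold=0.9)
cluster_sizes = [len(members) for _, members in clusters]
print(cluster_sizes)
```

The cluster sizes are the per-OTU read counts for the sample; the seed of each cluster would then be looked up against known 16S sequences.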
Reads to Counts (Whole Metagenome Sequencing)
- Map reads to known metagenomic genes.
- Estimate OTU abundances based on representative marker genes.
Sometimes sequences won’t map to anything.
Bacterial genomes can be very dynamic – the sequences can be chaotic.
Co-abundance groups are sets of genes which seem to be from the same species
and are often seen together in samples.
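One simple way to illustrate the idea of co-abundance is to correlate gene abundance profiles across samples and keep the strongly correlated pairs. This is only a sketch: pairwise Pearson correlation with a fixed cut-off is an assumption made here for illustration, not the method used to derive published co-abundance groups, and the gene names and profiles are invented.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length abundance profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        return 0.0  # a constant profile carries no co-abundance signal
    return cov / (sx * sy)


def co_abundant_pairs(profiles, min_r=0.9):
    """Return gene pairs whose profiles correlate at or above min_r."""
    genes = sorted(profiles)
    return [
        (g, h)
        for i, g in enumerate(genes)
        for h in genes[i + 1:]
        if pearson(profiles[g], profiles[h]) >= min_r
    ]


# Invented abundances of three genes across four samples.
profiles = {
    "gene_a": [10, 20, 30, 40],
    "gene_b": [11, 19, 31, 41],
    "gene_c": [40, 5, 22, 1],
}
pairs = co_abundant_pairs(profiles)
print(pairs)
```

Genes that rise and fall together across samples, like gene_a and gene_b here, are candidates for belonging to the same species.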
Metagenomic data is similar to RNA-seq data in that we are interested in counts –
usually relative abundance.
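There can be problems if large effects dominate the composition of the
metagenome: because abundances are relative, a large change in one taxon shifts
the apparent proportions of all the others.
Spike-ins can be used to calibrate against a group of bacteria added in known
numbers.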
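A spike-in calibration can be sketched as a simple scaling: the reads assigned to the spiked organism fix a cells-per-read factor, which is then applied to every other taxon. This assumes reads scale linearly with cell number and that extraction and amplification treat all taxa alike, which is an idealisation. The identifiers and numbers are hypothetical.

```python
def absolute_from_spike_in(read_counts, spike_id, spike_cells_added):
    """Scale read counts to estimated cell numbers using a spike-in."""
    spike_reads = read_counts[spike_id]
    if spike_reads == 0:
        raise ValueError("no reads recovered from the spike-in")
    cells_per_read = spike_cells_added / spike_reads
    return {
        taxon: n * cells_per_read
        for taxon, n in read_counts.items()
        if taxon != spike_id
    }


# Hypothetical sample: 1000 spike-in cells were added before extraction.
sample_counts = {"spike": 100, "taxon_A": 400, "taxon_B": 50}
estimated_cells = absolute_from_spike_in(sample_counts, "spike", 1000)
print(estimated_cells)
```

Unlike relative abundances, these estimates for one taxon do not change when another taxon blooms, which is the point of the calibration.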
Measuring Diversity
Certain types of bacteria may always make up a very high proportion of the biota e.g.
in the human gut. Normalisation will not change this.
Diversity depends on how many species there are (richness) and how evenly
distributed they are (evenness). phyloseq can calculate both.
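The two components of diversity can be made concrete with a short sketch. The Shannon index and Pielou's evenness are chosen here as standard examples; the notes do not say which indices were used in the lecture, and the count vectors are invented.

```python
import math


def richness(counts):
    """Number of species observed at least once."""
    return sum(1 for n in counts if n > 0)


def shannon(counts):
    """Shannon diversity index H = -sum(p * ln p) over observed species."""
    total = sum(counts)
    if total == 0:
        raise ValueError("sample has no counts")
    return -sum((n / total) * math.log(n / total) for n in counts if n > 0)


def pielou_evenness(counts):
    """Shannon index divided by its maximum, ln(richness); 1 means even."""
    s = richness(counts)
    return shannon(counts) / math.log(s) if s > 1 else 0.0


# Two invented communities with the same richness but different evenness.
even = [25, 25, 25, 25]
skewed = [97, 1, 1, 1]
print(richness(even), round(shannon(even), 3), round(pielou_evenness(even), 3))
print(richness(skewed), round(shannon(skewed), 3), round(pielou_evenness(skewed), 3))
```

Both communities contain four species, but the one dominated by a single species scores far lower on Shannon diversity and evenness.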
Metagenomic data is high-dimensional; a 2D projection allows outliers to be
identified and removed, which increases the resolution.
Differential Analyses
- Compare bacterial composition of samples.
- Compare metagenomic gene composition of samples.
- Metatranscriptomics – compare expression. DESeq2 and edgeR can do this.
metagenomeSeq uses a zero-inflated model. Metagenomic data has a characteristic
shape with lots of zeros: many features are found in only one dataset and
no others.
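The shape described above can be checked directly on a count table. This sketch reports the fraction of zero entries and the number of features present in exactly one sample; it only summarises sparsity and does not implement a zero-inflated model. The matrix is invented.

```python
def sparsity_summary(matrix):
    """Summarise zeros in a samples x features count matrix."""
    n_samples = len(matrix)
    n_features = len(matrix[0])
    zeros = sum(v == 0 for row in matrix for v in row)
    # Prevalence: in how many samples each feature has a non-zero count.
    prevalence = [
        sum(row[j] > 0 for row in matrix) for j in range(n_features)
    ]
    return {
        "zero_fraction": zeros / (n_samples * n_features),
        "single_sample_features": sum(1 for p in prevalence if p == 1),
    }


# Invented table: 3 samples (rows) x 4 features (columns).
table = [
    [10, 0, 0, 3],
    [8, 5, 0, 0],
    [12, 0, 7, 0],
]
summary = sparsity_summary(table)
print(summary)
```

A high zero fraction and many single-sample features are the pattern that motivates models with an explicit zero component.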