Metagenomics Primer
Paul Pyl

16S rRNA is used as a “fingerprint” gene in metagenomics. The gene has variable regions interspersed with conserved regions (the regions which define ribosome function). Most bacteria share the same conserved regions. The fingerprint is the variable region, but it can be amplified using primers based on the conserved regions, which allows the organism to be identified. Read counts provide an approximation of bacterial abundances.

Whole metagenome sequencing can also be used: sequence all the DNA of everything in the sample. Size selection can be applied, e.g. to concentrate on things approximately the right size for bacteria.

Various pipelines are available for metagenomics, all with the same steps:
- Filter reads, e.g. remove reads from the host or from food the host eats.
- Assembly: assemble reads into contigs. This provides a type of clustering.
- Mapping: find out which bacteria each contig comes from, using a metagenomic gene catalogue.
- Abundance: a count matrix of either
  o number of samples x number of operational taxonomic units (OTUs), or
  o number of samples x number of genes

Metahit.eu: human gut microbiome database
Human Microbiome Project (NIH): focused on humans

Whole metagenome studies work in the context of functional units, e.g. all genes involved in digesting fibre. These can be found in a gene catalogue or in Qiita (web-based).

BIOM files are becoming the standard file format for microbial abundance data. In R, the biom package reads the (old) BIOM format (not the HDF5 format). phyloseq can be used for basic analysis and provides an interface to DESeq2. metagenomeSeq: differential OTU analysis.

Reads to Counts (16S data)
- Read pairs are clustered by sequence similarity.
- Clusters have a consensus, which can be used as a query against known 16S sequences to see where it fits best.
- The resolution varies depending on which 16S variable region is used.

Reads to Counts (Whole Metagenome Sequencing)
- Map reads to known metagenomic genes.
- Estimate OTU abundances based on representative marker genes.
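Once reads have been mapped, the abundance step amounts to counting. The following is a minimal Python sketch of turning per-read assignments into a samples x features count table and then into relative abundances. It assumes each read has already been assigned to a gene or OTU identifier and that unmapped reads are marked with None. The function names and the toy sample and gene names are hypothetical and do not come from any particular pipeline.

```python
from collections import Counter


def count_matrix(assignments):
    """Build a samples x features count table.

    assignments: dict mapping sample name -> list of feature ids,
    one entry per read; None marks a read that mapped to nothing.
    Returns (samples, features, rows), where rows[i][j] is the number
    of reads in samples[i] assigned to features[j].
    """
    samples = sorted(assignments)
    counts = {
        s: Counter(f for f in assignments[s] if f is not None)
        for s in samples
    }
    features = sorted(set().union(*(c.keys() for c in counts.values())))
    rows = [[counts[s][f] for f in features] for s in samples]
    return samples, features, rows


def relative_abundance(row):
    """Scale one sample's counts so that they sum to 1."""
    total = sum(row)
    return [c / total if total else 0.0 for c in row]


# Toy data (hypothetical): two samples, reads assigned to three genes.
toy = {
    "sample_A": ["gene1", "gene1", "gene2", None],
    "sample_B": ["gene2", "gene3", "gene3", "gene3"],
}
samples, features, rows = count_matrix(toy)
for name, row in zip(samples, rows):
    print(name, dict(zip(features, row)), relative_abundance(row))
```

In the toy data the unmapped read in sample_A is simply dropped, so the count rows are [2, 1, 0] and [0, 1, 3] for gene1, gene2 and gene3. A real pipeline would also have to decide how to treat reads that map to several genes, which this sketch ignores.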
Sometimes sequences won’t map to anything. Bacterial genomes can be very dynamic; the sequences can be chaotic. Co-abundance groups are sets of genes which seem to be from the same species and are often seen together in samples.

Metagenomic data is similar to RNA-seq data in that we are interested in counts, usually relative abundance. There can be problems if large effects dominate the composition of the metagenome. Spike-ins can be used: calibration against a counted group of bacteria.

Measuring Diversity
Certain types of bacteria may always make up a very high proportion of the biota, e.g. in the human gut. Normalisation will not change this. Diversity depends on how many species there are and how evenly distributed they are. phyloseq can calculate these measures.

Metagenomic data is high-dimensional. A 2D projection can allow outliers to be identified and removed, increasing the resolution.

Differential Analyses
- Compare bacterial composition of samples.
- Compare metagenomic gene composition of samples.
- Metatranscriptomics: compare expression.

DESeq2 and edgeR can do this. metagenomeSeq uses a zero-inflated model. Metagenomic data has a certain shape with lots of zeros: there are lots of things only found in one dataset and no others.
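Returning to the point under Measuring Diversity that diversity depends on both the number of species and how evenly they are distributed, a small Python sketch can make the two components concrete. It computes observed richness, the Shannon index, and Pielou's evenness from one sample's counts. These are standard textbook measures chosen here for illustration; the function names are hypothetical, and this is not a reproduction of how phyloseq computes its own estimates.

```python
import math


def richness(counts):
    """Number of taxa observed at least once in the sample."""
    return sum(1 for c in counts if c > 0)


def shannon(counts):
    """Shannon index H = -sum(p * ln p) over taxa with non-zero counts."""
    total = sum(counts)
    if total == 0:
        return 0.0
    h = 0.0
    for c in counts:
        if c > 0:
            p = c / total
            h -= p * math.log(p)
    return h


def pielou_evenness(counts):
    """Pielou's evenness J = H / ln(S); defined here as 0 when S < 2."""
    s = richness(counts)
    if s < 2:
        return 0.0
    return shannon(counts) / math.log(s)


# Toy counts (hypothetical): same richness, very different evenness.
even = [25, 25, 25, 25]
dominated = [97, 1, 1, 1]
for label, counts in (("even", even), ("dominated", dominated)):
    print(label, richness(counts), shannon(counts), pielou_evenness(counts))
```

Both toy samples contain four taxa, but the sample dominated by a single taxon has a much lower Shannon index and evenness. This matches the situation described for the human gut, where a few types of bacteria make up most of the biota.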