Applications of High-Throughput Technologies in Genomics and Microbiology Research

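Application Strategies and Solutions for High-Throughput Technologies in Genomics and Microbiology Research
Liu Guiming
Beijing Institute of Genomics, Chinese Academy of Sciences
Contents:
Microbial genome assembly algorithms and strategies
The microbial pan-genome
Microbial transcriptomics
Methylation detection in microbial genomes
Metagenomics
16S sequencing
WGS sequencing
Single-cell sequencing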
Diversity of the microbial universe
Acquire genes from environment
Conjugation
Transformation
Phage infection (transduction)
Molecular evolutionary mechanisms that shape bacterial species diversity: one genome, pan-genome and metagenome.
a. Intra-species, b. inter-species and c. population dynamic mechanisms manipulate the genomic diversity of bacterial species.
d. Metagenomics embraces the community as the unit of study.
Microbiology in the post-genomic era. Nature Reviews Microbiology 6, 419-430 (June 2008)
High-throughput sequencing platforms
Vendor                Platform      Read length    Data output
Illumina              HiSeq         2X100/150      40-60G
Illumina              MiSeq         2X150/300      15G
ABI                   Ion PGM™      100-400        600M
ABI                   Ion Proton™   100-200        64G
Pacific Biosciences   PACBIO RS     3000-20000     500M
Genome assembly
1. DNA Shear & Sequence DNA
2. Construct assembly graph from overlapping reads
3. Simplify assembly graph
4. Detangle graph with long reads, mates, and other links
Assembly algorithms
Overlap-Layout-Consensus
A. Overlap
B. Layout
C. Consensus
De Bruijn Graph
SOAPdenovo
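As a rough illustration of the De Bruijn graph approach named above, the sketch below builds a graph whose nodes are (k-1)-mers and whose edges are k-mers, then follows the unambiguous path to recover a contig. The function names, the toy reads and k = 4 are assumptions made for this example. It is not how SOAPdenovo is implemented: real assemblers also handle sequencing errors, reverse complements and repeats.

```python
from collections import defaultdict


def build_de_bruijn(reads, k):
    """Return a dict mapping each (k-1)-mer to the (k-1)-mers that follow it."""
    graph = defaultdict(list)
    seen_kmers = set()
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            if kmer in seen_kmers:
                continue  # each distinct k-mer contributes one edge
            seen_kmers.add(kmer)
            graph[kmer[:-1]].append(kmer[1:])
    return dict(graph)


def walk_unique_path(graph, start):
    """Extend a contig from `start` while every node has exactly one successor."""
    contig = start
    node = start
    visited = {start}
    while node in graph and len(graph[node]) == 1:
        node = graph[node][0]
        if node in visited:
            break  # stop on a cycle instead of looping forever
        visited.add(node)
        contig += node[-1]
    return contig


# Toy, error-free reads that overlap along one short sequence (hypothetical data).
reads = ["ATGGCGT", "GGCGTGC", "GTGCAAT"]
graph = build_de_bruijn(reads, k=4)
contig = walk_unique_path(graph, "ATG")
print(contig)
```

With a repeat longer than k-1 bases, some node would gain two successors and the walk would stop there, which is the tangling that longer reads or mate pairs are used to resolve.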
Long-read assembly (PacBio)
(A) At k = 50, the graph is tangled with hundreds of contigs.
(B) k = 1,000 significantly simplifies the graph.
(C) At k = 5,000, the graph is fully resolved into a single contig.
The advantages of SMRT sequencing. Roberts et al. Genome Biology 2013, 14:405
Library construction on the Illumina platform
500 bp paired-end
2-20 kb mate-pair
40 kb fosmid library
De novo assembly
1. SOAPdenovo
input data: Illumina
2. ALLPATHS-LG
input data: 180 bp + mate pair
or Illumina + PacBio (hybrid assembly)
A. Fill fragments -> unipaths -> error correct
Reference-guided assembly
Read-mapped
assembly-mapped
Multi-reference assisted chromosome assembly
Reference-assisted chromosome assembly. Korbinian et al. PNAS.2011: 10249–10254
Gap closing based on paired-end/mate-pair reads
Toward almost closed genomes with GapFiller. Boetzer et al. Genome Biology 2012, 13:R56
Assembly strategies for microbial genomes
1. Illumina (SOAPdenovo)
Insert size: 180 bp, 500 bp, 2 kb, 5 kb and 40 kb
Read length: 2X100 bp
2. PacBio + Illumina (Hybrid assembly, WGS, http://wgs-assembler.sourceforge.net/)
Insert size (Illumina): paired-end (500 bp)
Read length: 2X100 bp (Illumina); >5 kb (PacBio)
3. PacBio only
Read length: >5 kb
Hybrid assembly
Hybrid Error Correction & Assembly
1. Trim/correct short-read (SR) sequences
2. Compute an SR layout for each long read (LR)
   1. Map SRs to LRs
   2. Trim LRs at coverage gaps
   3. Compute consensus for each LR
3. Co-assemble corrected LRs and SRs
- The WGS assembler can support 16 kb reads
Hybrid error correction and de novo assembly of single-molecule sequencing reads. Nat Biotechnol. 2012; 30(7): 693–700.
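The "trim LRs at coverage gaps" step above can be sketched as follows: once short reads are mapped onto a long read, stretches with no short-read support are dropped and the long read is split into its supported segments. The function name, the half-open interval convention and the `min_cov` threshold are assumptions for illustration, not the interface of the published pipeline.

```python
def split_at_coverage_gaps(long_read, short_read_hits, min_cov=1):
    """Split a long read wherever short-read coverage drops below min_cov.

    short_read_hits is a list of (start, end) half-open intervals giving
    where each short read aligned on the long read (hypothetical input).
    """
    coverage = [0] * len(long_read)
    for start, end in short_read_hits:
        for pos in range(max(0, start), min(end, len(long_read))):
            coverage[pos] += 1

    segments = []
    seg_start = None
    for pos, cov in enumerate(coverage):
        if cov >= min_cov and seg_start is None:
            seg_start = pos  # a supported segment begins
        elif cov < min_cov and seg_start is not None:
            segments.append(long_read[seg_start:pos])  # segment ends at the gap
            seg_start = None
    if seg_start is not None:
        segments.append(long_read[seg_start:])
    return segments


# Toy example: positions 12-14 of a 20 bp "long read" have no short-read support.
long_read = "ACGTACGTACGTACGTACGT"
hits = [(0, 8), (5, 12), (15, 20)]
segments = split_at_coverage_gaps(long_read, hits)
print(segments)
```

In the real workflow the supported segments would then go through the consensus step, where the short-read bases replace the error-prone long-read bases.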
Assemblies from different strategies
Contig sizes for various combinations of sequencing technologies
Assemblies are for E. coli C227-11 (assemblies including Illumina and PacBio CCS) and E. coli JM221 (assemblies including 454). Both genomes have similar repeat content, PacBio read length, and coverage. Assemblies of only second-generation data are comparable and average N50 ≈ 100 Kbp. By comparison, adding 25X or 50X of PBcR to these data sets increases N50 as much as 5-fold and pushes the maximum contig size greater than 1 Mbp (for the PBcR/CCS combination).
Assemblies from different strategies
De Novo SMRT Sequencing
Genome size: 124.6 Mb
GC content: 33.92%
Raw data: 11 Gb
Assembly coverage: 15.37x
Polished Contigs: 540
Max Contig Length: 12.98 Mb
N50 Contig Length: 6.19 Mb
Sum of Contig Lengths: 124.57 Mb
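The N50 contig length reported above is the length L such that contigs of length L or longer together contain at least half of the assembled bases. A minimal sketch of that calculation is given below; the function name and the toy contig lengths are made up for illustration.

```python
def n50(contig_lengths):
    """Return the N50 of a list of contig lengths (0 for an empty list)."""
    lengths = sorted(contig_lengths, reverse=True)
    half_total = sum(lengths) / 2
    running = 0
    for length in lengths:
        running += length
        if running >= half_total:
            return length
    return 0


# Toy assembly: total length 290, so N50 is reached at the second-longest contig.
example_n50 = n50([80, 70, 50, 40, 30, 20])
print(example_n50)
```

Because N50 depends only on the length distribution, it says nothing about correctness; a misjoined contig raises N50 just as a true one does.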
Genome annotation
Genome
ORF
Synteny
Pangenome
Tree
Ortholog
Ka/Ks
NR
InterPro
KEGG
COG
t/rRNA
Recombination
Prophage
Repeat
sRNA
Pan-genome
Pan-genome: the global gene repertoire pertaining to a species
Core genes: genes shared by all the strains
Dispensable genes: genes shared by some but not all the strains
Strain-specific genes: genes present in only one strain and absent from all the others
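The three categories defined above can be computed directly from a gene presence/absence table. The sketch below assumes each strain is represented as a set of gene-family identifiers, which presupposes an ortholog clustering step that is not shown; the strain names and gene ids are hypothetical.

```python
def classify_pan_genome(strain_genes):
    """Split the pan-genome into core, dispensable and strain-specific genes.

    strain_genes maps a strain name to the set of gene families it carries.
    """
    n_strains = len(strain_genes)
    counts = {}
    for genes in strain_genes.values():
        for gene in genes:
            counts[gene] = counts.get(gene, 0) + 1

    core = {g for g, c in counts.items() if c == n_strains}
    # With a single strain every gene is core, so nothing is strain-specific.
    specific = {g for g, c in counts.items() if c == 1 and n_strains > 1}
    dispensable = set(counts) - core - specific
    return core, dispensable, specific


# Hypothetical presence/absence data for three strains.
strains = {
    "strain_A": {"g1", "g2", "g3"},
    "strain_B": {"g1", "g2", "g4"},
    "strain_C": {"g1", "g5"},
}
core, dispensable, specific = classify_pan_genome(strains)
print(sorted(core), sorted(dispensable), sorted(specific))
```

The pan-genome is the union of the three sets, so its size here is 5 gene families.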
Mathematical definition of the pan-genome
Open pan-genome:
Keeps increasing in size as more genomes are added.
Examples:
E. coli
Streptococcus
Closed pan-genome:
Does not keep increasing in size.
Example:
Bacillus anthracis
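One way to see whether a pan-genome looks open or closed is to track pan-genome and core-genome size as genomes are added one at a time. The sketch below does this for a single fixed order of hypothetical strains; in practice the curve is usually averaged over many random orders, and the function and variable names here are assumptions for illustration.

```python
def accumulation_curve(strain_genes, order):
    """Return (pan_sizes, core_sizes) after adding each strain in `order`."""
    pan = set()
    core = None
    pan_sizes, core_sizes = [], []
    for strain in order:
        genes = strain_genes[strain]
        pan |= genes  # union grows with every new gene family
        core = set(genes) if core is None else core & genes  # intersection shrinks
        pan_sizes.append(len(pan))
        core_sizes.append(len(core))
    return pan_sizes, core_sizes


# Hypothetical gene-family sets for three strains.
strain_genes = {
    "strain_A": {"g1", "g2", "g3"},
    "strain_B": {"g1", "g2", "g4"},
    "strain_C": {"g1", "g5"},
}
pan_sizes, core_sizes = accumulation_curve(
    strain_genes, ["strain_A", "strain_B", "strain_C"]
)
print(pan_sizes, core_sizes)
```

A pan-genome curve that is still rising when the last genome is added suggests an open pan-genome, while a curve that flattens suggests a closed one.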
The Core and Pan-Genomes of E. coli
All pan-genome
20 completely sequenced genomes
Genomes vary in size by more than 1 Mb
Core genome: 1976 genes
Pan-genome: 17838 genes
>80% similarity
Removing IS
The Phylogenetic History of the Strains
Built from concatenated core-genome genes (1878 genes) using a maximum likelihood approach
Groups B2 and D split first; groups A, B1, S1, S3 and SS emerged more recently.
Recently acquired genes are enriched in phage and transposable elements
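Microbial transcriptomics (RNA-seq)
Transcriptomics studies overall gene transcription and its regulation at the RNA level.
Bacterial RNA-seq
1. rRNA and tRNA (MICROBExpress)
2. mRNA has no poly-A tail
mRNA enrichment
1. Conserved regions of 16S and 23S rRNA
2. Exonuclease
3. Antibody capture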
Strand-specific library
DSSS protocol workflow.
(A) Fragmentation (60-200 bp)
(B) Dephosphorylation. 5’ phosphates are removed from RNA.
(C) 3’ adapter ligation.
(D) Rephosphorylation.
(E) 5’ adapter ligation.
(F) Reverse transcription (RT) and amplification of library.
(G) Sequencing.
3’UTR
5’UTR
Antisense
Operons
FPKM
sRNA
Transcriptional start sites (TSS)
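FPKM in the list above stands for fragments per kilobase of transcript per million mapped fragments: a count normalized by both gene length and sequencing depth. A minimal sketch of the arithmetic follows; the function and parameter names and the example numbers are assumptions for illustration.

```python
def fpkm(fragments_on_gene, gene_length_bp, total_mapped_fragments):
    """Fragments per kilobase of transcript per million mapped fragments."""
    if gene_length_bp <= 0 or total_mapped_fragments <= 0:
        raise ValueError("gene length and total mapped fragments must be positive")
    # count / (length in kb) / (depth in millions) == count * 1e9 / (bp * total)
    return fragments_on_gene * 1e9 / (gene_length_bp * total_mapped_fragments)


# Toy example: 500 fragments on a 1 kb gene in a library of 10 million fragments.
example_fpkm = fpkm(500, 1000, 10_000_000)
print(example_fpkm)
```

For single-end data the same formula counts reads instead of fragments and is usually called RPKM.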
RNA-seq and ChIP-chip–based strategy to identify promoters, transcribed regions, and sRNAs
(A) The different cDNA libraries that were generated and sequenced in this study.
(B) Reads to different genome locations
(C) RNA-seq, and ChIP-chip data to identify small RNAs, TSSs, promoters, and transcribed regions throughout the chromosome of
The PacBio RS system enables direct sequencing of base modifications
N6-methyladenine, N4-methylcytosine, 5-mC and 5-hmC
Methylome of G. metallireducens GS-15.
(a) Three instances of methylated sequence regions.
(b) Coverage and kinetic score for all genomic positions.
(c) MTase specificities determined from the genomic positions detected as methylated.
(d) Summary of detected methylated positions across the genome.
E. coli methylation
MTases targeting motifs in genome (a) and plasmid (b)
Nat Biotechnol. 2012 Dec;30(12):1232-9
The RM system associated with M.EcoGIII regulates the expression of many genes and pathways.
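Metagenome: a term first proposed by Handelsman et al. in a study of soil microorganisms, meaning "the collection of all genomes in a microbial community".
Research areas
(1) Ribosomal RNA studies, mainly targeting 16S rRNA: population distribution and population abundance
(2) Studies of all genetic material in the environment: WGS of DNA
(3) Metatranscriptome studies, targeting all transcripts in the environment
(4) Single-cell-based metagenomics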
Metagenomics
16S
Dual-index sequencing strategy on MiSeq
Number of samples per lane:
number of 3’ indexes X number of 5’ indexes
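The product rule above follows from tagging each sample with one 3’ index and one 5’ index, so every distinct pair identifies one sample. The sketch below enumerates the pairs; the index names and the 12 x 8 layout are hypothetical and are not taken from the slides.

```python
from itertools import product


def dual_index_pairs(index_3p, index_5p):
    """Return every (3' index, 5' index) combination, one per sample."""
    return list(product(index_3p, index_5p))


# Hypothetical index sets: 12 indexes on the 3' side and 8 on the 5' side.
index_3p = [f"idx3_{i:02d}" for i in range(1, 13)]
index_5p = [f"idx5_{i:02d}" for i in range(1, 9)]
pairs = dual_index_pairs(index_3p, index_5p)
print(len(pairs))
```

Only 20 distinct index oligos are needed in this layout to label all of the resulting combinations, which is the practical appeal of dual indexing.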
Recommended 16S primer region: 347F/803R
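Library diversity assessment
Coverage C: assesses library coverage, C = 1 − n1/N
Richness estimator (SChao1): predicts the number of microbial species in the sample
Shannon-Wiener index: assesses species richness and evenness
Rarefaction curve: assesses library coverage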
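The indices above can be computed from a vector of per-OTU sequence counts. The sketch below assumes the usual definition of Good's coverage, in which n1 is the number of OTUs observed exactly once and N is the total number of sequences. For Chao1 it uses the bias-corrected form, which is one of several variants and may not be the one used in the slides. Function names and the example counts are made up for illustration.

```python
import math


def goods_coverage(counts):
    """C = 1 - n1/N, with n1 = singleton OTUs and N = total sequences."""
    n1 = sum(1 for c in counts if c == 1)
    total = sum(counts)
    return 1 - n1 / total


def chao1(counts):
    """Bias-corrected Chao1 richness: S_obs + n1*(n1-1) / (2*(n2+1))."""
    s_obs = sum(1 for c in counts if c > 0)
    n1 = sum(1 for c in counts if c == 1)
    n2 = sum(1 for c in counts if c == 2)
    return s_obs + n1 * (n1 - 1) / (2 * (n2 + 1))


def shannon(counts):
    """Shannon-Wiener index H = -sum(p_i * ln p_i) over observed OTUs."""
    total = sum(counts)
    return -sum((c / total) * math.log(c / total) for c in counts if c > 0)


# Hypothetical OTU table for one sample: 7 OTUs, 22 sequences, 3 singletons.
otu_counts = [10, 5, 2, 2, 1, 1, 1]
print(goods_coverage(otu_counts), chao1(otu_counts), shannon(otu_counts))
```

A coverage close to 1 and a Chao1 estimate close to the observed OTU count both indicate that deeper sequencing would add few new OTUs.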
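Whole-metagenome shotgun sequencing strategy
Workflow: sample collection -> library construction and sequencing -> data statistics and annotation -> community structure and functional analysis
Sample collection
Library construction: add barcodes
Sequencing: MiSeq
Data quality assessment and filtering: Perl, BMTagger
Sequence assembly: GS de novo assembler
Gene annotation: BLAST, MetaGeneMark
Community structure: MEGAN/MG-RAST
Community characteristics: functional genes/metabolic pathways, based on KEGG and SEED
Clinical information, sample characteristics
Assembly software
1. MetaVelvet
2. Meta-IDBA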
Single-cell-based metagenomics
Standard metagenomics and sequence ‘binning’ to produce composite microbial genomes.
Single-cell isolation techniques
Microfluidics, serial dilution, micromanipulation, fluorescence-activated cell sorting (FACS)
Single-cell amplification techniques
MDA, the most widely used, faithfully replicates whole-genome DNA and yields amplified fragments of 10-100 kb;
MALBAC yields amplified fragments of 500-2000 bp.
Sorting of Single Cells by Flow Cytometry
Candidate phylum TM6 genome recovered from a hospital sink biofilm provides
genomic insights into this uncultivated phylum
PNAS. 2013 25;110(26):E2390-9
TM6 genome recovered from a hospital sink biofilm
IMS-MDA
Immunomagnetic separation (IMS) and multiple displacement amplification (MDA)
--Chlamydia trachomatis (antibodies or aptamers)
Analysis of sequencing data from DNA extracts from clinical samples, with and without MDA
High-throughput sequencing gives microbial genomics research an efficient new platform and great opportunities for development.
Thank you!