srep05453-s1

Supplementary Materials: Non-invasive analysis of intestinal development in preterm and
term infants using RNA-Sequencing
Jason M. Knight1,3, Laurie A. Davidson2,3, Damir Herman6**, Camilia R. Martin7, Jennifer S.
Goldsby2,3, Ivan V. Ivanov5, Sharon M. Donovan8 and Robert S. Chapkin2,3,4*
1Department of Electrical Engineering, Texas A&M University, College Station, TX,
2Department of Nutrition & Food Science, Texas A&M University, College Station, TX, 3Center
for Translational Environmental Health Research, Texas A&M University, College Station, TX,
4Department of Veterinary Integrated Biosciences, Texas A&M University, College Station,
TX, 5Department of Veterinary Physiology and Pharmacology, Texas A&M University, College
Station, TX, 6Winthrop P. Rockefeller Cancer Institute, University of Arkansas for Medical
Sciences, Little Rock, AR, 7Department of Neonatology, Beth Israel Deaconess Medical
Center, Harvard Medical School, Boston, MA, 8Department of Food Science & Human
Nutrition, University of Illinois, Urbana, IL.
*Address correspondence to Dr. Robert S. Chapkin, Center for Translational & Environmental
Health Research, MS 2253, Texas A&M University, College Station, TX 77843-2253, USA;
Tel: +1-979-845-0419; Fax: +1-979-458-3704; E-mail: r-chapkin@tamu.edu
**Current address: Ayasdi, 4400 Bohannon Drive, Suite #200, Menlo Park, CA 94025
Supplementary Figure 1: Integrative Genomics Viewer (IGV) visualization of the reads mapped to
the APOA4 gene for preterm infant sample 3 (top) and term infant sample 3 (bottom). The 3' bias
is evident: the vast majority of reads map to or near the 3' UTR, on the left side of the
annotated region of the hg19 reference genome.
Supplementary Figure 2: Correlation scatter plots for RNA-Seq and qPCR for 11 differentially
expressed genes across the six individual samples. Any gene with zero mapped reads is not
displayed. The average slope is 0.869, the average Spearman correlation coefficient is 0.59 and the
average Pearson correlation coefficient is 0.57. Overall, these correlations are lower than the 0.7
Spearman and 0.8 Pearson correlation coefficients seen in the MAQC dataset [1]. However, this is not
surprising given the more diverse and challenging nature of fecal samples. For comparison,
these correlations are similar to those typically observed between RNA-Seq and microarray data
(0.62 to 0.75 Pearson) on the same MAQC dataset, and higher than RNA-Seq to protein correlations
(0.24 to 0.36) [2].
[1] Li, Bo, and Colin N. Dewey. "RSEM: accurate transcript quantification from RNA-Seq data with or
without a reference genome." BMC Bioinformatics 12.1 (2011): 323.
[2] Fu, Xing, et al. "Estimating accuracy of RNA-Seq and microarrays with proteomics." BMC Genomics
10.1 (2009): 161.
Supplementary Figure 3: Correlation scatterplots among all six individual samples and their
smoothed FPKM distributions. In addition, violin plots of FPKM show reasonable uniformity among
overall normalized expression intensities.
Supplementary Figure 4: Experimental design documenting sample isolation, sequencing and
mapping.
Supplementary Figure 5: Volcano plot showing 188 differentially expressed (DE) genes at p < 0.05.
Given the noisy nature of the data and the limited number of samples, q-values (p-values adjusted
for multiple testing) were not used in the DE selection criteria.
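To make the distinction between p-values and q-values concrete, the following is a minimal sketch of one common multiple-testing adjustment, the Benjamini-Hochberg procedure. The source does not state which adjustment method would have produced its q-values, so the choice of Benjamini-Hochberg here is an assumption, and the p-values in the example are illustrative placeholders, not data from this study.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values (q-values).

    Assumption: BH is used only as a common example of a multiple-testing
    adjustment; the source does not name the method behind its q-values.
    """
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    q_values = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        q_values[i] = running_min
    return q_values


# Illustrative p-values only (hypothetical, not taken from the study).
example_p = [0.01, 0.04, 0.03, 0.005]
example_q = benjamini_hochberg(example_p)
for p, q in zip(example_p, example_q):
    print(f"p = {p:.3f}  ->  q = {q:.3f}")
```

Because each q-value is at least as large as its p-value, a gene list selected at p < 0.05 can shrink substantially, or vanish, when the same threshold is applied to q-values in a small, noisy dataset.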
Supplementary Figure 6: A Spearman correlation heatmap, comparing individual samples with the
pooled term sample.
Supplementary Table 1: Read Statistics.
Sample       Total-reads  Human reads  ERCC-reads  Genes 1 or more reads  Genes RPKM >1  Mito     Ribosomal  Microbial  Viral   Fungi   Protozoa
Preterm 1    43983566     503552       41132697    4021                   4525           125690   104        39862644   117847  91035   81383
Preterm 2    48841110     409748       44862225    2768                   3432           112459   176        43331472   105658  80809   74945
Preterm 3    40462297     15368537     27886375    13379                  8596           670003   2410       26175696   193080  144557  184535
Term 1       54179788     7303948      42719602    11527                  7266           1084048  2074       41780304   176463  145378  190937
Term 2       63701346     710728       58601378    5036                   5187           257341   163        56749538   134459  101926  95394
Term 3       51699388     384088       48161868    3274                   4095           124769   82         46670080   117468  66361   58089
Term Pooled  32005113     1615385      0           5182                   4049           -        -          20542641   -       165409  184987
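As a reading aid for Supplementary Table 1, the short sketch below computes the fraction of each sample's reads that mapped to the human reference. It assumes that the first value listed for each sample after the total is the "Human reads" column; the variable and function names are illustrative and not part of the original analysis pipeline.

```python
# (total reads, human reads) per sample, transcribed from Supplementary Table 1.
# Assumption: the first count after Total-reads is the "Human reads" column.
READS = {
    "Preterm 1": (43983566, 503552),
    "Preterm 2": (48841110, 409748),
    "Preterm 3": (40462297, 15368537),
    "Term 1": (54179788, 7303948),
    "Term 2": (63701346, 710728),
    "Term 3": (51699388, 384088),
    "Term Pooled": (32005113, 1615385),
}


def human_fraction(total_reads, human_reads):
    """Fraction of all sequenced reads that mapped to the human reference."""
    return human_reads / total_reads


fractions = {
    sample: human_fraction(total, human)
    for sample, (total, human) in READS.items()
}

for sample, fraction in fractions.items():
    print(f"{sample:12s} {100 * fraction:6.2f}% human reads")
```

Under that column assumption, the human fraction varies widely between samples, which is consistent with the caption's remarks elsewhere about the challenging nature of fecal material.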
Supplementary Table 2: qPCR Ct values used for validation of differentially expressed genes.
         qPCR CT values                                                log2(fold change)
Gene     Preterm 1  Preterm 2  Preterm 3  Term 1  Term 2  Term 3      qPCR    RNA-Seq
ABCC5    40.00      37.39      33.56      37.38   30.75   40.00       2.16    1.20
APOA4    34.29      31.69      23.57      33.37   34.07   40.00       -3.21   -5.71
CASP1    40.00      34.90      27.64      38.17   29.72   40.00       -2.30   -1.53
DYNLL1   40.00      35.24      27.05      33.16   27.13   36.26       1.63    2.17
NFKBIA   37.66      31.11      25.95      33.74   28.95   36.79       -1.53   -1.32
PLIN2    40.00      35.21      25.71      35.88   29.26   40.00       -1.56   -1.15
PPAP2A   40.00      36.30      29.46      35.10   28.50   40.00       2.38    0.98
RPS16    40.00      40.00      34.32      38.74   34.47   40.00       -2.25   0.63
SCNN1A   40.00      36.25      31.95      34.62   28.44   36.80       3.61    3.04
SLC2A1   40.00      37.21      29.30      36.04   31.02   38.77       -1.58   0.49
TMSB4X   37.52      28.78      21.92      28.86   22.56   31.70       1.24    1.96
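One way to summarize the agreement between the two log2(fold change) columns of Supplementary Table 2 is to compute Pearson and Spearman correlation coefficients across the 11 genes and to count how many genes change in the same direction. The sketch below does this with the standard library only. It assumes the listed fold-change values alternate qPCR, RNA-Seq in gene order. Note that this differs from Supplementary Figure 2, which reports per-gene correlations across the six samples; the helper names are illustrative.

```python
import math

GENES = ["ABCC5", "APOA4", "CASP1", "DYNLL1", "NFKBIA", "PLIN2",
         "PPAP2A", "RPS16", "SCNN1A", "SLC2A1", "TMSB4X"]
# log2(fold change) values from Supplementary Table 2.
# Assumption: values are paired (qPCR, RNA-Seq) in the gene order above.
QPCR = [2.16, -3.21, -2.30, 1.63, -1.53, -1.56, 2.38, -2.25, 3.61, -1.58, 1.24]
RNASEQ = [1.20, -5.71, -1.53, 2.17, -1.32, -1.15, 0.98, 0.63, 3.04, 0.49, 1.96]


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def average_ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Genes whose qPCR and RNA-Seq fold changes point in the same direction.
same_sign = [g for g, q, r in zip(GENES, QPCR, RNASEQ) if (q > 0) == (r > 0)]

print(f"Pearson:  {pearson(QPCR, RNASEQ):.2f}")
print(f"Spearman: {spearman(QPCR, RNASEQ):.2f}")
print(f"Same direction in {len(same_sign)} of {len(GENES)} genes")
```

Spearman is included alongside Pearson because it depends only on rank order and is therefore less sensitive to a single extreme fold change.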
Supplementary Table 3: Preterm metadata.
Sample Name  Gestational Age (weeks)  Date of birth
Preterm 1    32.6                     4/6/2010
Preterm 2    30.2                     4/15/2010
Preterm 3    27.5                     5/3/2010
Supplementary Table 4: Term metadata. Pooled samples were aggregated with individual samples to obtain the pooled sample.
Name    Gestational Age (weeks)  Date of birth  Diet     Ethnicity         Gender
Term 1  39.714                   7/14/2006      Breast   Caucasian         Male
Term 2  40                       4/25/2006      Breast   Caucasian         Female
Term 3  40                       7/7/2008       Formula  Caucasian         Male
Pooled  41                       5/30/2006      Breast   Caucasian         Male
Pooled  39.857                   6/4/2006       Breast   Caucasian         Male
Pooled  41.429                   6/30/2006      Breast   Caucasian         Male
Pooled  39                       5/18/2006      Breast   Asian/Caucasian   Female
Pooled  39.571                   2/11/2007      Breast   Caucasian         Male
Pooled  40                       4/6/2007       Breast   Caucasian         Female
Pooled  39.286                   5/16/2007      Breast   Caucasian         Female
Pooled  38.714                   10/18/2007     Breast   Caucasian         Male
Pooled  39.714                   12/27/2006     Formula  Caucasian         Male
Pooled  40.714                   8/30/2007      Formula  Caucasian         Male
Pooled  39.857                   10/9/2007      Formula  Caucasian         Female
Pooled  39.857                   10/18/2007     Formula  Caucasian         Female
Pooled  39.714                   7/23/2008      Formula  Caucasian         Male
Pooled  40                       9/26/2008      Formula  African-American  Male
Supplemental Methods Appendix – Reference Genomes
RefSeq
RefSeq version 59 [ftp://ftp.ncbi.nlm.nih.gov/refseq/release/] was used to obtain the full DNA
sequences (not the raw WGS shotgun repositories) for each organism group below. Statistics for each
of these data repositories can be seen at
ftp://ftp.ncbi.nlm.nih.gov/refseq/release/release-catalog/archive/RefSeq-release57.catalog.gz.
The following datasets were acquired:
- Mitochondria
- Fungi
- Viral
- Protozoa
Reference indices were generated with:
SNAP: snap index ../mitochondrion.1.1.genomic.fna . -s 17 -t5 -O200
STAR: STAR --runMode genomeGenerate --genomeDir $(pwd) --genomeFastaFiles ../mitochondrion.1.1.genomic.fna --runThreadN 16
Bowtie2: bowtie2-build ../mitochondrion.1.1.genomic.fna genome
Microbial:
PatricBRC and RefSeq were used to assemble the microbial genomes. PatricBRC can be obtained with:
wget -r -c -A "*PATRIC.ffn" ftp://ftp.patricbrc.org/patric2/genomes/
and RefSeq can be obtained with:
wget -r -c -A "microbial.*genomic.fna.gz" ftp://ftp.ncbi.nlm.nih.gov/refseq/release/microbial/
Together this is roughly 16000 taxids and 30Gb of genomic nucleotides.
The size of the microbial dataset necessitated the use of BWA to build the reference and align
against it:
bwa index -a bwtsw ../micro-meta.fasta
Fungi:
RefSeq was used to acquire the fungi database with:
wget -r -A 'fungi.*.genomic.fna.gz' ftp://ftp.ncbi.nlm.nih.gov/refseq/release/fungi/
which yields approximately 2.3Gb of data. Reference indices were generated using:
SNAP: snap index ../meta-microbe-5-25.fasta . -t15 -O100
STAR: STAR --runMode genomeGenerate --genomeDir $(pwd) --genomeFastaFiles ../meta-microbe-5-25.fasta --runThreadN 16 --genomeChrBinNbits 10
Bowtie2: bowtie2-build ../meta-microbe-5-25.fasta genome
Ribosomal:
From the Silva database, release 111, we obtained the small subunit (SSU) and large subunit (LSU)
fasta files in their pre-truncated form, using the NR (no redundancy) version where listed in the
file names below. We chose the reference files rather than the complete files to keep the reference
genome size low enough to use with typical aligners.
The GreenGenes database was not used because its website lacked sufficient metadata and
documentation, so Silva was used instead.
Reference Indexes:
SNAP: snap index ../silva-111.fasta . -s 17 -t5 -O200
STAR: STAR --runMode genomeGenerate --genomeDir $(pwd) --genomeFastaFiles ../silva-111.fasta --runThreadN 16 --genomeChrBinNbits 10
Bowtie2: bowtie2-build ../SSURef_111_NR_tax_silva_trunc.fasta,../LSURef_111_tax_silva_trunc.fasta genome
ERCC
Using the sequence information available from Ambion at
http://tools.invitrogen.com/downloads/ERCC92.fa, we generated a STAR reference and aligned reads
against it without spliced mapping.
iGenomes
Human references, sequence, annotations, and bowtie2 indices are available from the Illumina
iGenomes project. These are linked from the tophat website. The hg19 iGenome has been placed in
the /data/mnt/igenomes folder on sequencer.tamu.edu and was used for the analysis.
In addition, upon publication our automated analysis pipeline code will be made available at
http://github.com/chapkinlab.