Supplementary Materials: Non-invasive analysis of intestinal development in preterm and term infants using RNA-Sequencing

Jason M. Knight1,3, Laurie A. Davidson2,3, Damir Herman6**, Camilia R. Martin7, Jennifer S. Goldsby2,3, Ivan V. Ivanov5, Sharon M. Donovan8 and Robert S. Chapkin2,3,4*

1Department of Electrical Engineering, Texas A&M University, College Station, TX,
2Department of Nutrition & Food Science, Texas A&M University, College Station, TX,
3Center for Translational Environmental Health Research, Texas A&M University, College Station, TX,
4Department of Veterinary Integrated Biosciences, Texas A&M University, College Station, TX,
5Department of Veterinary Physiology and Pharmacology, Texas A&M University, College Station, TX,
6Winthrop P. Rockefeller Cancer Institute, University of Arkansas for Medical Sciences, Little Rock, AR,
7Department of Neonatology, Beth Israel Deaconess Medical Center, Harvard Medical School, Boston, MA,
8Department of Food Science & Human Nutrition, University of Illinois, Urbana, IL.

*Address correspondence to Dr. Robert S. Chapkin, Center for Translational & Environmental Health Research, MS 2253, Texas A&M University, College Station, TX 77843-2253, USA; Tel: +1-979-845-0419; Fax: +1-979-458-3704; E-mail: r-chapkin@tamu.edu

**Current address: Ayasdi, 4400 Bohannon Drive, Suite #200, Menlo Park, CA 94025

Supplementary Figure 1: Integrative Genomics Viewer (IGV) was used to visualize the mapped reads on the APOA4 gene for preterm infant sample 3 and term infant sample 3 at the top and the bottom of the figure, respectively. The 3' bias is visible in the vast majority of reads belonging to or mapping near the 3' UTR on the left side of the annotated region of the reference hg19 genome.

Supplementary Figure 2: Correlation scatter plots for RNA-Seq and qPCR for 11 differentially expressed genes across the six individual samples. Any gene with zero mapped reads is not displayed.
The average slope is 0.869, the average Spearman correlation coefficient is 0.59 and the average Pearson correlation coefficient is 0.57. Overall, these correlations are lower than the 0.7 Spearman and 0.8 Pearson correlation coefficients seen in the MAQC dataset [1]. This is not surprising given the more diverse and challenging nature of fecal samples. For comparison, these correlations are similar to those typically observed between RNA-Seq and microarray data (0.62 – 0.75 Pearson) on the same MAQC dataset, and higher than RNA-Seq – protein correlations (0.24 – 0.36) [2].

[1] Li, Bo, and Colin N. Dewey. "RSEM: accurate transcript quantification from RNA-Seq data with or without a reference genome." BMC Bioinformatics 12.1 (2011): 323.
[2] Fu, Xing, et al. "Estimating accuracy of RNA-Seq and microarrays with proteomics." BMC Genomics 10.1 (2009): 161.

Supplementary Figure 3: Correlation scatterplots among all six individual samples and their smoothed FPKM distributions. In addition, violin plots of FPKM show reasonable uniformity among overall normalized expression intensities.

Supplementary Figure 4: Experimental design documenting sample isolation, sequencing and mapping.

Supplementary Figure 5: Volcano plot showing 188 differentially expressed genes at p-values < 0.05. Given the noisy nature of the data and the limited number of samples, q-values (p-values adjusted for multiple testing) were not used in the DE selection criteria.

Supplementary Figure 6: A Spearman correlation heatmap, comparing individual samples with the pooled term sample.

Supplementary Table 1: Read Statistics.
Sample                  Preterm 1  Preterm 2  Preterm 3  Term 1     Term 2     Term 3     Term Pooled
Total-reads             43983566   48841110   40462297   54179788   63701346   51699388   32005113
Human reads             503552     409748     15368537   7303948    710728     384088     1615385
ERCC-reads              41132697   44862225   27886375   42719602   58601378   48161868   0
Genes 1 or more reads   4021       2768       13379      11527      5036       3274       5182
Genes RPKM >1           4525       3432       8596       7266       5187       4095       4049
Mito                    125690     112459     670003     1084048    257341     124769
Ribosomal               104        176        2410       2074       163        82
Microbial               39862644   43331472   26175696   41780304   56749538   46670080   20542641
Viral                   117847     105658     193080     176463     134459     117468
Fungi                   91035      80809      144557     145378     101926     66361      165409
Protozoa                81383      74945      184535     190937     95394      58089      184987

Supplementary Table 2: qPCR Ct values used for validation of differentially expressed genes.

         qPCR CT values                                           log2(fold change)
Gene     Preterm 1  Preterm 2  Preterm 3  Term 1  Term 2  Term 3  qPCR    RNA-Seq
ABCC5    40.00      37.39      33.56      37.38   30.75   40.00   2.16    1.20
APOA4    34.29      31.69      23.57      33.37   34.07   40.00   -3.21   -5.71
CASP1    40.00      34.90      27.64      38.17   29.72   40.00   -2.30   -1.53
DYNLL1   40.00      35.24      27.05      33.16   27.13   36.26   1.63    2.17
NFKBIA   37.66      31.11      25.95      33.74   28.95   36.79   -1.53   -1.32
PLIN2    40.00      35.21      25.71      35.88   29.26   40.00   -1.56   -1.15
PPAP2A   40.00      36.30      29.46      35.10   28.50   40.00   2.38    0.98
RPS16    40.00      40.00      34.32      38.74   34.47   40.00   -2.25   0.63
SCNN1A   40.00      36.25      31.95      34.62   28.44   36.80   3.61    3.04
SLC2A1   40.00      37.21      29.30      36.04   31.02   38.77   -1.58   0.49
TMSB4X   37.52      28.78      21.92      28.86   22.56   31.70   1.24    1.96

Supplementary Table 3: Preterm metadata.

Sample Name   Gestational Age (weeks)   Date of birth
Preterm 1     32.6                      4/6/2010
Preterm 2     30.2                      4/15/2010
Preterm 3     27.5                      5/3/2010

Supplementary Table 4: Term metadata. Pooled samples were aggregated with individual samples to obtain the pooled sample.
Name     Gestational Age (weeks)   Date of birth   Diet      Ethnicity          Gender
Term 1   39.714                    7/14/2006       Breast    Caucasian          Male
Term 2   40                        4/25/2006       Breast    Caucasian          Female
Term 3   40                        7/7/2008        Formula   Caucasian          Male
Pooled   41                        5/30/2006       Breast    Caucasian          Male
Pooled   39.857                    6/4/2006        Breast    Caucasian          Male
Pooled   41.429                    6/30/2006       Breast    Caucasian          Male
Pooled   39                        5/18/2006       Breast    Asian/Caucasian    Female
Pooled   39.571                    2/11/2007       Breast    Caucasian          Male
Pooled   40                        4/6/2007        Breast    Caucasian          Female
Pooled   39.286                    5/16/2007       Breast    Caucasian          Female
Pooled   38.714                    10/18/2007      Breast    Caucasian          Male
Pooled   39.714                    12/27/2006      Formula   Caucasian          Male
Pooled   40.714                    8/30/2007       Formula   Caucasian          Male
Pooled   39.857                    10/9/2007       Formula   Caucasian          Female
Pooled   39.857                    10/18/2007      Formula   Caucasian          Female
Pooled   39.714                    7/23/2008       Formula   Caucasian          Male
Pooled   40                        9/26/2008       Formula   African-American   Male

Supplemental Methods Appendix – Reference Genomes

RefSeq

RefSeq version 59 [ftp://ftp.ncbi.nlm.nih.gov/refseq/release/] was used to obtain the full DNA sequences (not the raw WGS shotgun repositories) for each organism group below. Statistics for each of these data repositories can be seen at ftp://ftp.ncbi.nlm.nih.gov/refseq/release/releasecatalog/archive/RefSeq-release57.catalog.gz. The following datasets were acquired:

Mitochondria
Fungi
Viral
Protozoa

Reference indices were generated with:

SNAP:
    snap index ../mitochondrion.1.1.genomic.fna . -s 17 -t5 -O200
STAR:
    STAR --runMode genomeGenerate --genomeDir $(pwd) --genomeFastaFiles ../mitochondrion.1.1.genomic.fna --runThreadN 16
Bowtie2:
    bowtie2-build ../mitochondrion.1.1.genomic.fna genome

Microbial:

PatricBRC and RefSeq were used to assemble the microbial genomes. PatricBRC can be obtained with:

    wget -r -c -A "*PATRIC.ffn" ftp://ftp.patricbrc.org/patric2/genomes/

and RefSeq can be obtained with:

    wget -r -c -A "microbial.*genomic.fna.gz" ftp://ftp.ncbi.nlm.nih.gov/refseq/release/microbial/

Together these comprise roughly 16,000 taxids and 30 Gb of genomic nucleotides.
The size of the microbial dataset necessitated the use of BWA to build the reference and align against it:

    bwa index -a bwtsw ../micro-meta.fasta

Fungi:

RefSeq was used to acquire the fungi database with:

    wget -r -A 'fungi.*.genomic.fna.gz' ftp://ftp.ncbi.nlm.nih.gov/refseq/release/fungi/

which contains approximately 2.3 Gb of data. Reference indices were generated using:

SNAP:
    snap index ../meta-microbe-5-25.fasta . -t15 -O100
STAR:
    STAR --runMode genomeGenerate --genomeDir $(pwd) --genomeFastaFiles ../meta-microbe-5-25.fasta --runThreadN 16 --genomeChrBinNbits 10
Bowtie2:
    bowtie2-build ../meta-microbe-5-25.fasta genome

Ribosomal:

From the Silva database, release 111, we obtained the small subunit (SSU) and large subunit (LSU) fasta files that were pre-truncated, using the NR (non-redundant) version. We chose the reference dataset rather than the complete dataset to keep the reference genome size low enough for use with typical aligners. The GreenGenes database was not used because its website lacked metadata and other information, so Silva was used instead. Reference indices were generated using:

SNAP:
    snap index ../silva-111.fasta . -s 17 -t5 -O200
STAR:
    STAR --runMode genomeGenerate --genomeDir $(pwd) --genomeFastaFiles ../silva-111.fasta --runThreadN 16 --genomeChrBinNbits 10
Bowtie2:
    bowtie2-build ../SSURef_111_NR_tax_silva_trunc.fasta,../LSURef_111_tax_silva_trunc.fasta genome

ERCC

Using the sequence information available from Ambion at http://tools.invitrogen.com/downloads/ERCC92.fa, we generated a STAR reference and aligned against it without spliced mapping.

iGenomes

Human references, sequences, annotations, and Bowtie2 indices are available from the Illumina iGenomes project. These are linked from the TopHat website. The hg19 iGenome has been placed in the /data/mnt/igenomes folder on sequencer.tamu.edu and was used for the analysis. In addition, upon publication our automated analysis pipeline code will be made available at http://github.com/chapkinlab.
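
The SNAP, STAR and Bowtie2 index-building commands quoted in this appendix follow the same pattern for each reference set. The following is a minimal dry-run sketch of that pattern, not the authors' pipeline: the function name print_index_commands and its argument order are hypothetical, and the script only prints the commands rather than running any aligner. The aligner invocations and flag values are copied from the text above.

```shell
#!/bin/sh
# Dry-run sketch (assumption: NOT the authors' pipeline code).
# Prints, without executing, the three index-building commands for one
# reference FASTA. The function name and argument order are hypothetical.
set -eu

print_index_commands() {
    # $1: path to the reference FASTA
    # $2: SNAP options, as quoted in the text (e.g. "-s 17 -t5 -O200")
    # $3: optional extra STAR flags (e.g. "--genomeChrBinNbits 10")
    echo "snap index $1 . $2"
    echo "STAR --runMode genomeGenerate --genomeDir \$(pwd) --genomeFastaFiles $1 --runThreadN 16${3:+ $3}"
    echo "bowtie2-build $1 genome"
}

# Mitochondrial reference, with the options quoted in the RefSeq section.
print_index_commands ../mitochondrion.1.1.genomic.fna "-s 17 -t5 -O200"

# Larger reference, with the extra STAR flag quoted in the Fungi section.
print_index_commands ../meta-microbe-5-25.fasta "-t15 -O100" "--genomeChrBinNbits 10"
```

Printing the commands first makes it possible to review them, or redirect them to a file, before committing to long index builds on a large reference. Each printed line would need to be run in its own index directory, since both STAR (via $(pwd)) and SNAP (via ".") write to the current directory.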