IETS 2016

Fertility and genomics: comparison of gene expression in contrasting reproductive tissues of female cattle

PA McGettigan1, JA Browne1, SD Carrington2, MA Crowe2, T Fair1, N Forde1,6, BJ Loftus3, A Lohan3, P Lonergan1, K Pluta2, S Mamo1, A Murphy4, J Roche3, SW Walsh1,7, CJ Creevey5,8, B Earley5, S Keady5, DA Kenny5, D Matthews5, M McCabe5, D Morris5, A O’Loughlin5, S Waters5, MG Diskin5, ACO Evans1,4,*

1 School of Agriculture and Food Science, 2 School of Veterinary Medicine, 3 School of Medicine and Medical Sciences, 4 Conway Institute, University College Dublin, Belfield, Dublin 4, Ireland
5 Animal and Grassland Research and Innovation Centre, Teagasc, Athenry, Co Galway, Ireland
6 Present Address: Division of Reproduction and Early Development, Leeds Institute of Cardiovascular and Metabolic Medicine, University of Leeds, Leeds, United Kingdom
7 Present Address: Department of Chemical and Life Sciences, Waterford Institute of Technology, Waterford, Ireland
8 Present Address: Institute of Biological, Environmental and Rural Sciences, Aberystwyth University, Aberystwyth, Ceredigion, United Kingdom
* Correspondence: A Evans, alex.evans@ucd.ie.

Abstract

To compare gene expression among bovine tissues we used large bovine RNAseq datasets comprising 280 samples from 10 different bovine tissues (uterine endometrium, granulosa cells, theca cells, cervix, embryos, leukocytes, liver, hypothalamus, pituitary, muscle) generating 260 Gbases of data. We used twin approaches of an information-theoretic analysis of the existing annotated transcriptome to identify the most tissue-specific genes, as well as a de-novo transcriptome annotation to evaluate general features of the transcription landscape. We detected expression of 97% of the Ensembl transcriptome with at least one read in one sample and between 28% and 66% at a level of 10 Tags Per Million (TPM) or greater in individual tissues.
Over 95% of genes exhibited some level of tissue-specific gene expression. This was mostly due to different levels of expression in different tissues rather than exclusive expression in a single tissue. Less than 1% of annotated genes exhibited a highly restricted tissue-specific expression profile and approximately 2% exhibited classic housekeeping profiles. We conclude that it is the combined effects of the variable expression of large numbers of genes (73 to 93% of the genome) with the specific expression of a small number of genes (less than 1% of the transcriptome) that contribute to determining the outcome of the function of individual tissues.

Keywords: RNAseq, transcription, bovine, reproduction, uterus, endometrium, embryo, cervix, ovary, follicle, theca, granulosa, hypothalamus, pituitary, leukocyte, muscle, liver

Short title: Gene expression in bovine reproductive tissues

1. Introduction

Recently, there has been a flood of new ‘omic’ technologies such as proteomics, transcriptomics and metabolomics that are enabling the generation of vast amounts of novel data characterizing different aspects of cellular biology at a global level. These technologies have been used in an attempt to better understand the development of various tissues, with transcriptomics studies producing the most data. A major challenge of this new era is to determine the biological importance of these data in the context of cell and tissue function. In this paper, we focus on the large volumes of data that we have produced in our studies of bovine tissues in recent years, with a focus on reproductive tissues.

2. Development of reproductive tissues

While most tissues in the body are continually engaged in turnover, many reproductive tissues undergo vigorous periods of tissue (cellular) proliferation that are often followed by periods of dramatic and wholesale tissue degradation and regression (this is especially true in female reproductive tissues). Cumulatively, the coordination of the proliferation and regression of tissues determines the success of reproduction and fertility. It is for this reason that many investigators have chosen to focus their attention on the development of specific tissues, usually during narrow timeframes in the reproductive process, to understand their contribution to the outcome of reproduction.

In taking a step back from the mass of data available on individual tissues, it is interesting to consider some of the similarities and differences among tissues during their developmental processes. For example, some reproductive tissues are relatively static in mass (although not always in function), with the hypothalamus and pituitary being clear examples. Other tissues develop over long periods of time; oocytes and follicles are the best examples, developing over months (or years, depending on your point of view) with the most dramatic changes occurring in the final days before ovulation. Uterine tissues undergo changes during the reproductive cycle in preparation for pregnancy and then change again over the months of gestation. Other tissues develop much more rapidly, with the early embryo and the corpus luteum undergoing changes in size and morphology that alter them almost unrecognizably over a period of just a few days.

To better understand these changes, and the factors controlling them, many studies have focused on measuring the expression of genes and significant amounts of data have been generated during the last few years.
One of the key technologies enabling this has been RNA Sequencing (RNA-seq). Since its initial development in 2006 (Bainbridge et al. 2006), RNA-seq has rapidly displaced gene expression microarrays for large-scale transcriptional profiling and is now the technology of choice in many laboratories. Some of the advantages of RNA-seq over microarrays include global profiling of transcripts including currently unannotated transcripts, identification of novel transcript isoforms, and more accurate quantification of transcript levels. In practice, many researchers, including ourselves, have used the technology as a direct replacement for microarrays and have tended to restrict their initial analyses to the known annotated transcriptome of the organism of interest (Mamo et al. 2011; Foley et al. 2012; Forde et al. 2012; O'Loughlin et al. 2012; Pluta et al. 2012; Walsh et al. 2012; Keady, in preparation; Matthews, in preparation).

The availability of a complete bovine genome sequence has enabled the application of the RNA-seq protocol to bovine samples (Elsik et al. 2009). We have previously published several RNA-seq-derived transcriptomic studies focusing on aspects of bovine reproduction, fertility and productivity traits under various experimental conditions (see Table 1). However, the completeness of the current bovine transcriptome annotation and the characteristic gene expression differences between tissue types remain unaddressed. There are also other gaps in our knowledge of the transcriptome; for example, the extent to which genes are universally or uniquely expressed in individual tissue types. There are also long-standing questions about the existence or otherwise of so-called “housekeeping genes” with consistent expression levels in all tissue types at all times (which could be used as reference genes for calibration of global expression studies).
In order to address these questions, the aim of this study was to conduct a de-novo annotation of the bovine transcriptome using data from 280 bovine samples taken from 10 distinct tissue types and to compare it to the Ensembl bovine annotation (Ensembl 65) to identify novel bovine transcripts. In addition, the patterns of transcription between different tissue types were compared to identify genes with highly tissue-specific expression patterns and housekeeping genes.

3. Materials and Methods

Animal Handling
All animal procedures performed for the generation of tissue samples in this study were conducted under experimental license from the Irish Department of Health and Children in accordance with the Cruelty to Animals Act 1876 and the European Communities (Amendment of Cruelty to Animals Act 1876) regulation 2002 and 2005 with approval from individual institutional ethics committees.

Tissue sources
The sequencing data were generated from 10 different tissue types as part of 8 separate experiments, some of which have been published. The details of tissue type, cattle breed and number of samples are listed in Table 1. In brief, the samples consisted of 20 uterine endometrium samples from mixed breed beef heifers collected on Day 13 and Day 16 after estrus, of which 5 samples at each time point were confirmed pregnant and 5 were non-pregnant (Forde et al. 2012). The follicular granulosa and theca cell samples were paired samples taken from dominant pre-ovulatory follicles of Holstein-Friesian cows and heifers (37 animals in total) at 3 stages of follicular development: selection, differentiation and luteinization (Walsh et al. 2012). The cervical tissues were taken from 30 mixed breed beef heifers at 6 time points during the peri-estrus period (Pluta et al. 2012).
The embryo samples consisted of 28 pooled samples from mixed breed beef cattle taken at 5 different days post-fertilization (Mamo et al. 2011). The leukocytes were taken from 16 Simmental male beef calves at 4 different time points post-weaning, resulting in 55 samples in total (not all animals yielded 4 samples each) (O'Loughlin et al. 2012). The liver samples were taken from 12 early post-partum Holstein-Friesian dairy cows in either mild or severe negative energy balance (McCabe et al. 2012). The hypothalamus and pituitary samples were taken from 23 mixed breed beef animals (Matthews, in preparation). The muscle samples were taken from the M. longissimus dorsi of 27 Aberdeen Angus steers undergoing nutritional restriction and compensatory growth (Keady, in preparation). In some cases multiple samples were not collected from all individual animals, giving rise to the actual number of samples contributing to the study as shown in Table 1.

RNA-seq library preparation
All samples were sequenced on a GAIIx sequencer (Illumina) by the Conway Institute Transcriptomics laboratory at University College Dublin, Ireland. All libraries were non-strand-specific and were processed as single-read libraries (with the exception of the muscle samples, which were processed as paired-end libraries). The library type, read length, total numbers of reads generated for each library type and Gene Expression Omnibus (GEO) ID (Edgar et al. 2002) (where available) are listed in Table 1.

Alignment and preprocessing
FASTQ files (Cock et al. 2010) from each library were converted to the Sanger FASTQ format and were then aligned individually to the UMD3.1/BosTau6 bovine genome assembly using TopHat version 1.4.0 (Trapnell et al. 2009). Individual alignments, in BAM format, from each library were merged together using the samtools merge command (Li et al. 2009).
Finally, all combined tissue BAM files were merged together into a single file. De novo transcriptome annotation was carried out on each individual tissue and on the combined dataset using Cufflinks v1.1.0 (Trapnell et al. 2010). The Ensembl v65 annotation of the bovine genome was taken as the reference transcriptome (Flicek et al. 2012). Coordinates of repetitive regions were downloaded for the UMD3.1 assembly from the UCSC genome browser (Kent et al. 2002). Introns were identified from the alignments with Python code utilizing the pysam library. Visualization of the alignments was carried out using the Integrative Genomics Viewer (IGV) v2.0 (Robinson et al. 2011). Eval version 2.2.8 (Keibler and Brent 2003) was used to generate summary statistics for the de-novo transcriptomes. The BEDTools (Quinlan and Hall 2010) intersectBed program as well as the GenomicRanges (Aboyoun 2015) and rtracklayer (Lawrence et al. 2009) packages from R/Bioconductor were used to identify overlap of exons with genomic features such as repetitive regions or annotated Ensembl/RefSeq exons.

Quality Control
Density plots of the log-transformed Reads Per Kilobase per Million (RPKM) (Mortazavi et al. 2008) levels for each gene in each sample were generated. Samples with a read depth of less than 2 million reads in total, or with >80% non-unique mapping reads, were excluded from further analysis. None of the 280 samples was excluded from further analysis using these QC criteria.

Initial quality control checks identified a bias in the data, exhibited by the paired-end muscle data, which had a much-reduced percentage of non-unique reads compared to the single-read libraries. In order to ensure the libraries were comparable with each other for the purposes of identifying differential expression, all FASTQ files were trimmed to a common length of 36 bp following the approach of Anders et al. (2012).
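The quality-control rules described above (the sample exclusion thresholds and the trimming of reads to a common length) can be illustrated with a minimal Python sketch. This is not the code used in the study: the function names, the tuple representation of a FASTQ record and the way the non-unique fraction is passed in are assumptions made purely for illustration. Only the thresholds (2 million reads, 80% non-unique reads, 36 bp) come from the text.

```python
from typing import Iterable, Iterator, Tuple

# Hypothetical record layout: (header, sequence, separator, quality string)
FastqRecord = Tuple[str, str, str, str]


def passes_qc(total_reads: int, non_unique_fraction: float,
              min_reads: int = 2_000_000,
              max_non_unique: float = 0.80) -> bool:
    """Return True if a sample survives the two exclusion criteria.

    A sample is excluded when it has fewer than `min_reads` reads in total
    or when more than `max_non_unique` of its reads map non-uniquely.
    """
    return total_reads >= min_reads and non_unique_fraction <= max_non_unique


def trim_records(records: Iterable[FastqRecord],
                 length: int = 36) -> Iterator[FastqRecord]:
    """Trim every read (and its quality string) to a common length."""
    for header, seq, sep, qual in records:
        yield header, seq[:length], sep, qual[:length]


if __name__ == "__main__":
    example = [("@read1", "ACGT" * 20, "+", "I" * 80)]
    for rec in trim_records(example):
        print(rec[0], len(rec[1]), len(rec[3]))
    print(passes_qc(total_reads=5_000_000, non_unique_fraction=0.35))
```

Trimming all libraries to the shortest common read length trades some mappability for comparability between libraries, which is the stated purpose of this step.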
In addition, only the first read in the paired-end muscle data was used for differential gene expression analysis.

Dimensional reduction and clustering
Hierarchical clustering of the individual samples was carried out using the ColorDendrogram function from the sparcl R package. The Eisen distance metric was used as implemented in the MADE4 Bioconductor package (Culhane et al. 2005). Principal Component Analysis plots were generated using the function prcomp in R.

Quantifying the diversity/dispersion of gene expression in a tissue
The Gini coefficient, a measure of the unevenness of the distribution of reads, was calculated using the Gini function in the R library reldist (Handcock et al. 1999). It is most commonly used in the social sciences as a measure of income inequality across different segments of a population. It is defined as twice the area between the 45 degree line and the Lorenz curve where, in this case, the Lorenz curve is a graph describing the cumulative share of total reads assigned to the bottom x% of the gene universe. A tissue with an exactly equal distribution of reads among all genes would have a Gini coefficient of zero and a tissue where all reads come from a single gene would have a Gini coefficient of 1. The total count of reads for each gene from all samples of each tissue was obtained and the Gini index for each tissue was calculated separately. The Gini index has been used by others to measure skewness in other aspects of transcriptomics such as PolyA length (Morozov et al. 2012).

Gene expression measures of Ensembl 65 annotated transcripts
HTSeq (Anders and Huber 2010) was used to generate raw read counts for each gene in each library using the Ensembl 65 bovine annotation as the reference. These raw counts were converted into TPM (Tags Per Million) (Li et al. 2010) and RPKM metrics.

Categorical tissue specificity
Following the method of Schug et al. (2005), tissue specificity of each gene was measured using the categorical tissue specificity metric (Qgt). Qgt weights genes according to the degree to which the expression of a gene is skewed towards a single tissue; it is based on the Shannon entropy of the gene, Hg (Shannon and Weaver 1949).

The following calculation was used: given expression levels of a gene in N tissues, the relative expression of a gene g in a tissue t was defined as:

$$p_{t|g} = \frac{w_{g,t}}{\sum_{1 \le t' \le N} w_{g,t'}}$$

where $w_{g,t}$ is the expression level of the gene in the tissue. In this case, either TPM or RPKM can be used as the expression level because, for the purposes of this calculation, they are equivalent (the gene-length factor cancels in the ratio). To avoid division by zero in later calculations, a count of 1 is added to the raw counts for each gene in each sample before calculation of TPM and RPKM. The relative expression is then the RPKM of the gene in the tissue divided by the sum of RPKMs for that gene across all tissues.

The Shannon entropy of a gene's expression was calculated as:

$$H_g = -\sum_{1 \le t \le N} p_{t|g} \log_2(p_{t|g})$$

Hg has units of bits and ranges from zero for genes expressed in a single tissue to a maximum of log2(N) for genes expressed uniformly in all tissues. In this case, the maximum entropy of a gene was log2(10) = 3.32 bits, which would represent a ubiquitously and uniformly expressed gene (i.e. an ideal housekeeping gene). This relative expression derived entropy calculation does not discriminate between absolute expression levels, so in order to give higher weight to genes with greater absolute expression levels, the categorical tissue specificity was defined as:

$$Q_{g|t} = H_g - \log_2(p_{t|g})$$

The expression of a particular gene becomes more specific to a single tissue as the value of Qg|t approaches zero. By contrast, ideal housekeeping genes would have a Qg|t of 2log2(10) = 6.64 bits in all tissues.
Tissue-specific gene lists and pathway analysis
Permutation testing of a balanced set of tissue samples was used to estimate the null distribution of the entropy values. Ten samples were taken from each tissue type (pituitary, which consisted of 3 pooled samples, was excluded from this step because of the low number of samples). The labels were randomly permuted 100 times and the minimum entropy score across all genes (0.404 bits) was used as the cut-off. Genes having an entropy value below this score in the original set were considered to be expressed in a highly tissue-specific manner.

The DAVID pathway annotation tool (Huang da et al. 2009) was used to determine over-represented KEGG pathways among the tissue-specific gene list generated for each tissue.

Correlation of tissue specific expression ranking between species
The probesets from the matching tissues from the Schug dataset were sorted according to the Qgt(rma) field. This ordering generated the ranks that were compared with the ranked list of genes from our RNA-seq generated Qgt values. Human and mouse Affymetrix IDs were matched to bovine Ensembl gene IDs using cross-reference tables downloaded from the Ensembl Biomart tool. The Spearman correlation (Spearman's rho) was calculated using the cor.test function in R.

ANOVA analysis
Analysis of variance was carried out to determine the overall amount of tissue-specific expression in the dataset. The TPM matrix was used in conjunction with the limma (Smyth 2004) and puma (Pearson et al. 2009) packages in R/Bioconductor to calculate the tissue effects. False discovery rate thresholds of 0.05 and 0.01 were used.

4. Results

Of the 24,616 annotated genes, 23,818 were expressed at a level of at least one read in at least one of the 280 samples (22,963 genes with ≥ 5 reads). Of these, 17,893 were expressed at a level of ≥ 10 tags per million (TPM).
Embryos showed the greatest number of expressed genes and muscle had the fewest (Table 2). There were 1,838 genes in the Ensembl 65 annotation that were not detected in any of the samples. A full list of genes and their expression levels in tissues is shown in Supplementary Table S1.

Clustering of samples
Initial clustering of the samples both by Hierarchical Clustering (Figure 1) and Principal Components Analysis (Figure 2) showed almost perfect grouping of samples by tissue. The exceptions were some of the cervical samples that initially grouped with the theca samples. Further analysis revealed that this was due to very high expression of several collagen and basement membrane genes (see Supplementary Table S1) in a number of cervical and thecal tissue samples. We hypothesized that this could have resulted from inclusion of some connective tissue in these samples. These transcripts were removed and TPMs were recalculated based on the new sample totals (as the denominator of the TPM calculation is affected by these highly expressed genes). Following this filtering, perfect grouping of samples by tissue was achieved. It is notable that the clustering by tissue also held for the paired theca and granulosa samples (which came from the same animals).

Overall tissue-specific expression
The results of the Analysis of Variance (ANOVA)/Differentially Expressed Genes (DEG) analysis of gene expression among the 10 tissue types indicate that 95.8% of genes were differentially expressed between tissues at a False Discovery Rate (FDR) of 0.05 and 93.6% at an FDR of 0.01. The principal components analysis showed clustering together of samples from the same tissue and very specific separation of samples along consecutive principal components. This was reflected by the separation of these samples in PC1 and PC2.
The first 2 principal components accounted for 33% of total variance and the first 10 principal components accounted for 89%.

Gini Index
The Gini index for each tissue is shown in Table 3. Gene expression was skewed in all tissues. The most extreme skew was seen in the muscle tissue (Gini coefficient 0.96). The top 10 most highly expressed genes in muscle made up 34% of all gene-aligned reads, while the top gene (ENSBTAG00000046332, actin, alpha skeletal muscle) on its own accounted for 11% of all gene-aligned reads. The tissue with the lowest Gini coefficient was the endometrium, where the top 10 genes constituted just 3% of total gene-aligned reads and the most highly expressed gene (ENSBTAG00000021466, collagen alpha-1(III) chain) accounted for 0.5% of total reads.

Tissue-specific gene expression
The full table of Qgt for each gene in each tissue as well as the overall entropy (Hg) for each gene is available in Supplementary Table S1.

Using the permutation analysis, 452 transcripts exhibited highly significant tissue-specific expression (< 0.404 bits entropy score) (Table 4). The top tissue-specific gene for each tissue (as determined by Qgt score) is shown in Figure 3. In many cases there was a large variance associated with the top gene and it would not have been ranked top using other types of analysis such as ANOVA.

Pathway analysis of tissue-specific genes
The most tissue-specific genes for each tissue (after permutation testing) were analyzed to identify over-represented pathways (see Table 4). In some cases, the number of genes was insufficient to detect pathways; however, in each case, inspection of the individual list confirmed the presence of genes, most of which have a previously reported biological function in the tissue in question. The tissue with the most highly specific genes was the liver with 196 genes.
Tissues with 10 or fewer highly specific genes were the pituitary, granulosa, endometrium and theca (Table 4).

Pathway analysis of housekeeping genes
A total of 411 genes were identified as candidate housekeeping genes (highly expressed and not differentially expressed among tissues) based on entropy scores of ≥ 3.25 (Table 4). The over-represented pathways included mitochondrial pathways (and diseases related to mitochondrial dysfunction such as Huntington’s, Parkinson’s and Alzheimer’s) and basic cellular processes such as DNA repair, protein translation, protein degradation and various protein modifications (ubiquitination, methylation, neddylation).

Comparison of tissue specific genes between species
Of the 10 bovine tissues, 5 had matching samples in the human and mouse dataset generated by Schug et al. (2005). The matched tissues for human were liver, uterus (endometrium) and pituitary, while for mouse the matches were hypothalamus, liver, muscle and uterus. A total of 8,737 bovine genes were matched to the human probes and 8,914 to the mouse microarray probes. The most significant expression correlations were with the pituitary profiles (0.61) followed by muscle (0.42) and then liver (0.23). The expression correlation between the mouse endometrium and the bovine endometrium was marginally significant. The expression correlations between the bovine and human endometrium and hypothalamic tissues were non-significant.

5. Discussion

Comparison with other mammalian gene atlases
The current study presents the first RNA-seq-derived multi-tissue comparison of gene expression in bovine tissues. Our comparison is based on more biological replicates, greater sequencing depth and higher gene coverage than other comparisons and includes a focus on reproductive tissues.
A bovine gene atlas (BGA) based on Illumina Digital Gene Expression (DGE) tags generated using the DpnII restriction enzyme has also been published (Harhay et al. 2010); it used 300 million tags (20 bp sequences) from 92 tissues and 3 cell lines. Similar to the current study, the tissue samples were from different animals, breeds and sexes. The DGE approach has several limitations compared with RNA-seq in that only a 16-17 bp tag (+4 bp recognition sequence) is generated per transcript, so information about the full transcript extent is missing. The shorter DGE tags are frequently non-unique and so information about some transcripts is lost. Five of the ten tissues included in this study were also profiled in the BGA (i.e. muscle, liver, pituitary, hypothalamus and leukocytes).

The archetype for this project is the original mouse Gene Atlas (Su et al. 2004) that profiled 61 tissues in mouse and 79 tissues in human using Affymetrix microarrays. It remains the most comprehensive profiling of mammalian tissues to date and is a popular resource for researchers attempting to determine the tissue expression profiles of specific genes. Other notable mammalian expression atlases include the Allen Brain Atlas (Lein et al. 2007), which profiles the expression of genes in the mouse brain via in situ hybridization, and the mouse brain atlas generated by Siddiqui et al. (2005) using the LongSAGE protocol on 72 individual tissues from mice. More recently the Human BodyMap project has conducted RNA-seq on 16 different human tissue types (GEO ID: GSE30611, http://www.ncbi.nlm.nih.gov/projects/geo/query/acc.cgi?acc=GSE30611) and a pig expression atlas (using microarrays) was generated based on 62 different tissues and cell types with a particular focus on the gastrointestinal tract (Freeman et al. 2012).
Another group profiled 900 different regions of the human brain using expression microarrays (Hawrylycz et al. 2012). An equine RNA-seq atlas has also been published based on 8 different tissues (Coleman et al. 2010).

This study is set apart from those listed above by the higher overall number of samples per tissue and the range of experimental conditions under which these tissues were recovered. The perturbation of the tissue by developmental changes or experimental challenges provides a more realistic picture of the diversity of transcription in an individual tissue. Our use of information theory (entropy) to measure tissue specificity enables the identification of tissue-specific expression even in cases where the most tissue-specifically expressed genes are not expressed in all samples from the tissue in question. For example, aromatase (CYP19A1) is not expressed at all stages of granulosa tissue development but it is the gene most characteristic of this tissue (Stocco 2008; Walsh et al. 2012) and was confirmed as such in this study.

The nature of gene expression in different tissues
Transcription in different mammalian tissues is very variable and this is reflected in our bovine tissue data as evidenced by the Gini coefficient. In most tissues a small number of genes contribute disproportionately to the overall mRNA pool. The most extreme example of this in our dataset was the gene expression in the muscle tissue (Keady, in preparation). This probably reflects the specialization of muscle tissue for a specific task (with few cell types), compared with the more diverse and complex functions required of a tissue such as the endometrium (with many cell types).
This may also explain why muscle was one of the earliest and most successful cases of identifying transcriptional regulatory control regions determining tissue-specific transcription (Wasserman and Fickett 1998).

Differential gene analysis reveals that almost all genes have an element of tissue-specific expression; however, it is usually a matter of different levels of expression in different tissues (i.e. along a continuum of expression rather than a digital pattern of presence or absence of gene expression). This is reflected in the fact that over 95% of genes have some measurable component of tissue type contribution to expression level as determined by the ANOVA analysis. However, a small number of genes (452, see Table 4) were identified as having highly specific expression restricted to either 1 or 2 tissues. These low entropy/high information genes have very specific functions characteristic of their tissue of expression, such as prolactin in the pituitary (Egli et al. 2010). The liver had many more of these types of genes than the other tissue types. There was also a similarly small number (411) of potential housekeeping genes exhibiting high entropy and little evidence of tissue-specific expression patterns. This was similar to the number of genes proposed as housekeeping genes by other groups (Warrington et al. 2000; Hsiao et al. 2001; Eisenberg and Levanon 2003). However, there was relatively little overlap between the housekeeping lists that we generated and these earlier human lists (24 and 22 housekeeping genes overlap with the data from Warrington et al., 2000 and Hsiao et al., 2001, respectively).
While it is tempting to consider these as potential reference genes for normalization processes, either for quantitative polymerase chain reaction (qPCR) or RNA-seq experiments, the addition of further tissues or similar tissues under differing conditions would reduce this number. The recent discovery of the impact of transcriptional amplification by c-Myc on global transcription would suggest that external standards (spike-ins) calibrated to cell number may be the only viable universal approach (Loven et al. 2012).

The Qgt values generated in this study indicate much higher tissue specificity than that indicated by Schug and colleagues (Schug et al. 2005). This can be explained by several factors. Schug et al. (2005) relied on the Gene Atlas dataset that profiled many more tissues using microarrays (79 human tissues of which 43 were used) but with an n=2 per tissue. RNA-seq avoids the coverage issues present with microarrays and the digital nature of the data generated is well suited to entropy calculations. It is likely that some of the genes that are considered highly tissue-specific in the current study will become less so as additional tissue types are profiled. The tissue specificity results for some of our bovine tissues were compared with equivalent tissues in human and mouse, for which Qgt values were generated by Schug et al. (2005), with mixed results. The pituitary and muscle tissues exhibited high correlation among the most specifically expressed genes, liver showed an intermediate level of concordance and the hypothalamus and endometrium displayed limited species similarities. Brawand et al. (2011) compared the expression of 6 tissues across 10 different mammalian species. They detected differing rates of gene expression changes in different tissues including the liver. This may reflect differing evolutionary pressures on different tissues/organs.
It suggests that the genes that are highly specific to pituitary and muscle function are not changing as much as those involved in endometrium, liver and hypothalamus.

Overall completeness of the bovine transcriptome
It is remarkable that so much of the bovine genome is transcribed. Transcription from 97% of the annotated genes of the assembled bovine genome was detected in at least one sample. However, such gross percentages can be misleading. Most of the transcription is at a very low level and the read density in exonic regions is orders of magnitude higher than in intergenic and intronic regions.
Most of the novel transcription units that were identified were single exon transcripts. Of the 12,041 novel multi-exon transcripts, 4,494 have some overlap with genes mapped to the bovine genome from other species such as human and mouse (as determined by coordinates available from the UCSC xenoRef table) (Kent et al. 2002). However, the contribution of these novel reads to the overall RNA-seq dataset is relatively low (less than 1%). This is in stark contrast to the amount of RNA sequence derived from the known exonic regions. Of the reads, 28% fall within known Ensembl exons and 68% fall within the entire gene span (including exons and introns). While 30% of the reads fall outside these categories, they are distributed over a much larger portion of the genome and the level of transcription is several orders of magnitude lower. The fact of almost ubiquitous transcription of the genome is not new. Binding of RNA polymerase II (Pol2, the key enzyme involved in transcription) to DNA is relatively non-specific, with as much as 90% of the binding and resultant transcription suggested to be noisy and non-functional (Struhl 2007).
The Encyclopedia of DNA elements (ENCODE) results from both the prototype (Birney et al. 2007) and scale-up phase (Bernstein et al.
2012) have confirmed the ubiquity of transcription but, controversially, they have suggested that this transcription is functional. These claims have been challenged by others (van Bakel et al. 2010; Eddy 2012). Despite the ENCODE claims, we recommend a cautious and conservative interpretation of the low abundance intergenic transcription.

6. Conclusions
In this paper we have profiled 280 different bovine samples from 10 different tissues using RNA-seq. It is remarkable that so much of the bovine genome is transcribed, with 23,818 out of 24,616 (97%) genes of the assembled bovine genome being detected in at least one sample. However, most of the transcription is at a very low level, giving rise to individual tissues with highly characteristic patterns of gene expression, even among the majority of genes that are expressed in all tissues (95.8% of genes were differentially expressed). We have shown that a small number of genes disproportionately contribute to the majority of the mRNA pool in a given tissue; that tissues have a limited number of uniquely expressed genes (ranging from 2 to 196 genes in this study); and that 411 housekeeping genes were expressed at high levels in all of the 10 tissue types examined. We conclude that it is the combined effects of the variable expression of large numbers of genes (73 to 93% of the genome) with the specific expression of a small number of genes (less than 1% of the genome) that contribute to determining the outcome of the function of individual tissues.

7. Acknowledgements
The authors are grateful to Science Foundation Ireland (07/SRC/B1156) for funding a large portion of this research.

Competing interests
None

References
Aboyoun, P., Pages, H., and Lawrence, M. (2015) Representation and manipulation of genomic intervals. R package. In 'Genomic Ranges.' (1.9.65 edn.)
Anders, S., and Huber, W. (2010) Differential expression analysis for sequence count data. Genome Biology 11(10), R106
Anders, S., Reyes, A., and Huber, W. (2012) Detecting differential usage of exons from RNA-seq data. Genome Research 22(10), 2008-2017
Bainbridge, M.N., Warren, R.L., Hirst, M., Romanuik, T., Zeng, T., Go, A., Delaney, A., Griffith, M., Hickenbotham, M., Magrini, V., Mardis, E.R., Sadar, M.D., Siddiqui, A.S., Marra, M.A., and Jones, S.J. (2006) Analysis of the prostate cancer cell line LNCaP transcriptome using a sequencing-by-synthesis approach. BMC Genomics 7, 246
Bernstein, B.E., Birney, E., Dunham, I., Green, E.D., Gunter, C., and Snyder, M. (2012) An integrated encyclopedia of DNA elements in the human genome. Nature 489(7414), 57-74
Birney, E., Stamatoyannopoulos, J.A., Dutta, A., Guigo, R., Gingeras, T.R., Margulies, E.H., Weng, Z., Snyder, M., Dermitzakis, E.T., Thurman, R.E., Kuehn, M.S., Taylor, C.M., Neph, S., Koch, C.M., Asthana, S., Malhotra, A., Adzhubei, I., Greenbaum, J.A., Andrews, R.M., Flicek, P., Boyle, P.J., Cao, H., Carter, N.P., Clelland, G.K., Davis, S., Day, N., Dhami, P., Dillon, S.C., Dorschner, M.O., Fiegler, H., Giresi, P.G., Goldy, J., Hawrylycz, M., Haydock, A., Humbert, R., James, K.D., Johnson, B.E., Johnson, E.M., Frum, T.T., Rosenzweig, E.R., Karnani, N., Lee, K., Lefebvre, G.C., Navas, P.A., Neri, F., Parker, S.C., Sabo, P.J., Sandstrom, R., Shafer, A., Vetrie, D., Weaver, M., Wilcox, S., Yu, M., Collins, F.S., Dekker, J., Lieb, J.D., Tullius, T.D., Crawford, G.E., Sunyaev, S., Noble, W.S., Dunham, I., Denoeud, F., Reymond, A., Kapranov, P., Rozowsky, J., Zheng, D., Castelo, R., Frankish, A., Harrow, J., Ghosh, S., Sandelin, A., Hofacker, I.L., Baertsch, R., Keefe, D., Dike, S.,
Cheng, J., Hirsch, H.A., Sekinger, E.A., Lagarde, J., Abril, J.F., Shahab, A., Flamm, C., Fried, C., Hackermuller, J., Hertel, J., Lindemeyer, M., Missal, K., Tanzer, A., Washietl, S., Korbel, J., Emanuelsson, O., Pedersen, J.S., Holroyd, N., Taylor, R., Swarbreck, D., Matthews, N., Dickson, M.C., Thomas, D.J., Weirauch, M.T., Gilbert, J., Drenkow, J., Bell, I., Zhao, X., Srinivasan, K.G., Sung, W.K., Ooi, H.S., Chiu, K.P., Foissac, S., Alioto, T., Brent, M., Pachter, L., Tress, M.L., Valencia, A., Choo, S.W., Choo, C.Y., Ucla, C., Manzano, C., Wyss, C., Cheung, E., Clark, T.G., Brown, J.B., Ganesh, M., Patel, S., Tammana, H., Chrast, J., Henrichsen, C.N., Kai, C., Kawai, J., Nagalakshmi, U., Wu, J., Lian, Z., Lian, J., Newburger, P., Zhang, X., Bickel, P., Mattick, J.S., Carninci, P., Hayashizaki, Y., Weissman, S., Hubbard, T., Myers, R.M., Rogers, J., Stadler, P.F., Lowe, T.M., Wei, C.L., Ruan, Y., Struhl, K., Gerstein, M., Antonarakis, S.E., Fu, Y., Green, E.D., Karaoz, U., Siepel, A., Taylor, J., Liefer, L.A., Wetterstrand, K.A., Good, P.J., Feingold, E.A., Guyer, M.S., Cooper, G.M., Asimenos, G., Dewey, C.N., Hou, M., Nikolaev, S., Montoya-Burgos, J.I., Loytynoja, A., Whelan, S., Pardi, F., Massingham, T., Huang, H., Zhang, N.R., Holmes, I., Mullikin, J.C., Ureta-Vidal, A., Paten, B., Seringhaus, M., Church, D., Rosenbloom, K., Kent, W.J., Stone, E.A., Batzoglou, S., Goldman, N., Hardison, R.C., Haussler, D., Miller, W., Sidow, A., Trinklein, N.D., Zhang, Z.D., Barrera, L., Stuart, R., King, D.C., Ameur, A., Enroth, S., Bieda, M.C., Kim, J., Bhinge, A.A., Jiang, N., Liu, J., Yao, F., Vega, V.B., Lee, C.W., Ng, P., Yang, A., Moqtaderi, Z., Zhu, Z., Xu, X., Squazzo, S., Oberley, M.J., Inman, D., Singer, M.A., Richmond, T.A., Munn, K.J., Rada-Iglesias, A., Wallerman, O., Komorowski, J., Fowler, J.C., Couttet, P., Bruce, A.W., Dovey, O.M., Ellis, P.D., Langford, C.F., Nix, D.A., Euskirchen, G., Hartman, S., Urban, A.E., Kraus, P., Van Calcar, S., Heintzman, N., 
Kim, T.H., Wang, K., Qu, C., Hon, G., Luna, R., Glass, C.K., Rosenfeld, M.G., Aldred, S.F., Cooper, S.J., Halees, A., Lin, J.M., Shulha, H.P., Xu, M., Haidar, J.N., Yu, Y., Iyer, V.R., Green, R.D., Wadelius, C., Farnham, P.J., Ren, B., Harte, R.A., Hinrichs, A.S., Trumbower, H., Clawson, H., Hillman-Jackson, J., Zweig, A.S., Smith, K., Thakkapallayil, A., Barber, G., Kuhn, R.M., Karolchik, D., Armengol, L., Bird, C.P., de Bakker, P.I., Kern, A.D., Lopez-Bigas, N., Martin, J.D., Stranger, B.E., Woodroffe, A., Davydov, E., Dimas, A., Eyras, E., Hallgrimsdottir, I.B., Huppert, J., Zody, M.C., Abecasis, G.R., Estivill, X., Bouffard, G.G., Guan, X., Hansen, N.F., Idol, J.R., Maduro, V.V., Maskeri, B., McDowell, J.C., Park, M., Thomas, P.J., Young, A.C., Blakesley, R.W., Muzny, D.M., Sodergren, E., Wheeler, D.A., Worley, K.C., Jiang, H., Weinstock, G.M., Gibbs, R.A., Graves, T., Fulton, R., Mardis, E.R., Wilson, R.K., Clamp, M., Cuff, J., Gnerre, S., Jaffe, D.B., Chang, J.L., Lindblad-Toh, K., Lander, E.S., Koriabine, M., Nefedov, M., Osoegawa, K., Yoshinaga, Y., Zhu, B., and de Jong, P.J. (2007) Identification and analysis of functional elements in 1% of the human genome by the ENCODE pilot project. Nature 447(7146), 799-816
Brawand, D., Soumillon, M., Necsulea, A., Julien, P., Csardi, G., Harrigan, P., Weier, M., Liechti, A., Aximu-Petri, A., Kircher, M., Albert, F.W., Zeller, U., Khaitovich, P., Grutzner, F., Bergmann, S., Nielsen, R., Paabo, S., and Kaessmann, H. (2011) The evolution of gene expression levels in mammalian organs. Nature 478(7369), 343-8
Cock, P.J., Fields, C.J., Goto, N., Heuer, M.L., and Rice, P.M. (2010) The Sanger FASTQ file format for sequences with quality scores, and the Solexa/Illumina FASTQ variants. Nucleic Acids Research 38(6), 1767-71
Coleman, S.J., Zeng, Z., Wang, K., Luo, S., Khrebtukova, I., Mienaltowski, M.J., Schroth, G.P., Liu, J., and MacLeod, J.N.
(2010) Structural annotation of equine protein-coding genes determined by mRNA sequencing. Animal Genetics 41 Suppl 2, 121-30
Culhane, A.C., Thioulouse, J., Perriere, G., and Higgins, D.G. (2005) MADE4: an R package for multivariate analysis of gene expression data. Bioinformatics 21(11), 2789-90
Eddy, S.R. (2012) The C-value Paradox, junk DNA and ENCODE. Current Biology 22(21), R898-R899
Edgar, R., Domrachev, M., and Lash, A.E. (2002) Gene Expression Omnibus: NCBI gene expression and hybridization array data repository. Nucleic Acids Research 30(1), 207-10
Egli, M., Leeners, B., and Kruger, T.H. (2010) Prolactin secretion patterns: basic mechanisms and clinical implications for reproduction. Reproduction 140(5), 643-54
Eisenberg, E., and Levanon, E.Y. (2003) Human housekeeping genes are compact. Trends in Genetics 19(7), 362-5
Elsik, C.G., Tellam, R.L., Worley, K.C., Gibbs, R.A., Muzny, D.M., Weinstock, G.M., Adelson, D.L., Eichler, E.E., Elnitski, L., Guigo, R., Hamernik, D.L., Kappes, S.M., Lewin, H.A., Lynn, D.J., Nicholas, F.W., Reymond, A., Rijnkels, M., Skow, L.C., Zdobnov, E.M., Schook, L., Womack, J., Alioto, T., Antonarakis, S.E., Astashyn, A., Chapple, C.E., Chen, H.C., Chrast, J., Camara, F., Ermolaeva, O., Henrichsen, C.N., Hlavina, W., Kapustin, Y., Kiryutin, B., Kitts, P., Kokocinski, F., Landrum, M., Maglott, D., Pruitt, K., Sapojnikov, V., Searle, S.M., Solovyev, V., Souvorov, A., Ucla, C., Wyss, C., Anzola, J.M., Gerlach, D., Elhaik, E., Graur, D., Reese, J.T., Edgar, R.C., McEwan, J.C., Payne, G.M., Raison, J.M., Junier, T., Kriventseva, E.V., Eyras, E., Plass, M., Donthu, R., Larkin, D.M., Reecy, J., Yang, M.Q., Chen, L., Cheng, Z., Chitko-McKown, C.G., Liu, G.E., Matukumalli, L.K., Song, J., Zhu, B., Bradley, D.G., Brinkman, F.S., Lau, L.P., Whiteside,
M.D., Walker, A., Wheeler, T.T., Casey, T., German, J.B., Lemay, D.G., Maqbool, N.J., Molenaar, A.J., Seo, S., Stothard, P., Baldwin, C.L., Baxter, R., Brinkmeyer-Langford, C.L., Brown, W.C., Childers, C.P., Connelley, T., Ellis, S.A., Fritz, K., Glass, E.J., Herzig, C.T., Iivanainen, A., Lahmers, K.K., Bennett, A.K., Dickens, C.M., Gilbert, J.G., Hagen, D.E., Salih, H., Aerts, J., Caetano, A.R., Dalrymple, B., Garcia, J.F., Gill, C.A., Hiendleder, S.G., Memili, E., Spurlock, D., Williams, J.L., Alexander, L., Brownstein, M.J., Guan, L., Holt, R.A., Jones, S.J., Marra, M.A., Moore, R., Moore, S.S., Roberts, A., Taniguchi, M., Waterman, R.C., Chacko, J., Chandrabose, M.M., Cree, A., Dao, M.D., Dinh, H.H., Gabisi, R.A., Hines, S., Hume, J., Jhangiani, S.N., Joshi, V., Kovar, C.L., Lewis, L.R., Liu, Y.S., Lopez, J., Morgan, M.B., Nguyen, N.B., Okwuonu, G.O., Ruiz, S.J., Santibanez, J., Wright, R.A., Buhay, C., Ding, Y., Dugan-Rocha, S., Herdandez, J., Holder, M., Sabo, A., Egan, A., Goodell, J., Wilczek-Boney, K., Fowler, G.R., Hitchens, M.E., Lozado, R.J., Moen, C., Steffen, D., Warren, J.T., Zhang, J., Chiu, R., Schein, J.E., Durbin, K.J., Havlak, P., Jiang, H., Liu, Y., Qin, X., Ren, Y., Shen, Y., Song, H., Bell, S.N., Davis, C., Johnson, A.J., Lee, S., Nazareth, L.V., Patel, B.M., Pu, L.L., Vattathil, S., Williams, R.L., Jr., Curry, S., Hamilton, C., Sodergren, E., Wheeler, D.A., Barris, W., Bennett, G.L., Eggen, A., Green, R.D., Harhay, G.P., Hobbs, M., Jann, O., Keele, J.W., Kent, M.P., Lien, S., McKay, S.D., McWilliam, S., Ratnakumar, A., Schnabel, R.D., Smith, T., Snelling, W.M., Sonstegard, T.S., Stone, R.T., Sugimoto, Y., Takasuga, A., Taylor, J.F., Van Tassell, C.P., Macneil, M.D., Abatepaulo, A.R., Abbey, C.A., Ahola, V., Almeida, I.G., Amadio, A.F., Anatriello, E., Bahadue, S.M., Biase, F.H., Boldt, C.R., Carroll, J.A., Carvalho, W.A., Cervelatti, E.P., Chacko, E., Chapin, J.E., Cheng, Y., Choi, J., Colley, A.J., de Campos, T.A., De Donato, M., Santos, 
I.K., de Oliveira, C.J., Deobald, H., Devinoy, E., Donohue, K.E., Dovc, P., Eberlein, A., Fitzsimmons, C.J., Franzin, A.M., Garcia, G.R., Genini, S., Gladney, C.J., Grant, J.R., Greaser, M.L., Green, J.A., Hadsell, D.L., Hakimov, H.A., Halgren, R., Harrow, J.L., Hart, E.A., Hastings, N., Hernandez, M., Hu, Z.L., Ingham, A., Iso-Touru, T., Jamis, C., Jensen, K., Kapetis, D., Kerr, T., Khalil, S.S., Khatib, H., Kolbehdari, D., Kumar, C.G., Kumar, D., Leach, R., Lee, J.C., Li, C., Logan, K.M., Malinverni, R., Marques, E., Martin, W.F., Martins, N.F., Maruyama, S.R., Mazza, R., McLean, K.L., Medrano, J.F., Moreno, B.T., More, D.D., Muntean, C.T., Nandakumar, H.P., Nogueira, M.F., Olsaker, I., Pant, S.D., Panzitta, F., Pastor, R.C., Poli, M.A., Poslusny, N., Rachagani, S., Ranganathan, S., Razpet, A., Riggs, P.K., Rincon, G., Rodriguez-Osorio, N., Rodriguez-Zas, S.L., Romero, N.E., Rosenwald, A., Sando, L., Schmutz, S.M., Shen, L., Sherman, L., Southey, B.R., Lutzow, Y.S., Sweedler, J.V., Tammen, I., Telugu, B.P., Urbanski, J.M., Utsunomiya, Y.T., Verschoor, C.P., Waardenberg, A.J., Wang, Z., Ward, R., Weikard, R., Welsh, T.H., Jr., White, S.N., Wilming, L.G., Wunderlich, K.R., Yang, J., and Zhao, F.Q. (2009) The genome sequence of taurine cattle: a window to ruminant biology and evolution.
Science 324(5926), 522-8
Flicek, P., Amode, M.R., Barrell, D., Beal, K., Brent, S., Carvalho-Silva, D., Clapham, P., Coates, G., Fairley, S., Fitzgerald, S., Gil, L., Gordon, L., Hendrix, M., Hourlier, T., Johnson, N., Kahari, A.K., Keefe, D., Keenan, S., Kinsella, R., Komorowska, M., Koscielny, G., Kulesha, E., Larsson, P., Longden, I., McLaren, W., Muffato, M., Overduin, B., Pignatelli, M., Pritchard, B., Riat, H.S., Ritchie, G.R., Ruffier, M., Schuster, M., Sobral, D., Tang, Y.A., Taylor, K., Trevanion, S., Vandrovcova, J., White, S., Wilson, M., Wilder, S.P., Aken, B.L., Birney, E., Cunningham, F., Dunham, I., Durbin, R., Fernandez-Suarez, X.M., Harrow, J., Herrero, J., Hubbard, T.J., Parker, A., Proctor, G., Spudich, G., Vogel, J., Yates, A., Zadissa, A., and Searle, S.M. (2012) Ensembl 2012. Nucleic Acids Research 40, D84-D90
Foley, C., Chapwanya, A., Creevey, C., Narciandi, F., Morris, D., Kenny, E., Cormican, P., Callanan, J.J., O'Farrelly, C., and Meade, K.G. (2012) Global endometrial transcriptomic profiling: transient immune activation precedes tissue proliferation and repair in healthy beef cows. BMC Genomics 13(1), 489
Forde, N., Duffy, G.B., McGettigan, P.A., Browne, J.A., Prakash Mehta, J., Kelly, A.K., Mansouri-Attia, N., Sandra, O., Loftus, B.J., Crowe, M.A., Fair, T., Roche, J.F., Lonergan, P., and Evans, A.C. (2012) Evidence for an early endometrial response to pregnancy in cattle: both dependent upon and independent of interferon tau. Physiological Genomics 44(16), 799-810
Freeman, T.C., Ivens, A., Baillie, J.K., Beraldi, D., Barnett, M.W., Dorward, D., Downing, A., Fairbairn, L., Kapetanovic, R., Raza, S., Tomoiu, A., Alberio, R., Wu, C., Su, A.I., Summers, K.M., Tuggle, C.K., Archibald, A.L., and Hume, D.A. (2012) A gene expression atlas of the domestic pig. BMC Biology 10, 90
Handcock, M.S., Morris, M., and ebrary Inc.
(1999) 'Relative distribution methods in the social sciences.' In Statistics for social science and public policy (Springer: New York) Available at http://site.ebrary.com/lib/princeton/Doc?id=5008065
Harhay, G.P., Smith, T.P., Alexander, L.J., Haudenschild, C.D., Keele, J.W., Matukumalli, L.K., Schroeder, S.G., Van Tassell, C.P., Gresham, C.R., Bridges, S.M., Burgess, S.C., and Sonstegard, T.S. (2010) An atlas of bovine gene expression reveals novel distinctive tissue characteristics and evidence for improving genome annotation. Genome Biology 11(10), R102
Hawrylycz, M.J., Lein, E.S., Guillozet-Bongaarts, A.L., Shen, E.H., Ng, L., Miller, J.A., van de Lagemaat, L.N., Smith, K.A., Ebbert, A., Riley, Z.L., Abajian, C., Beckmann, C.F., Bernard, A., Bertagnolli, D., Boe, A.F., Cartagena, P.M., Chakravarty, M.M., Chapin, M., Chong, J., Dalley, R.A., Daly, B.D., Dang, C., Datta, S., Dee, N., Dolbeare, T.A., Faber, V., Feng, D., Fowler, D.R., Goldy, J., Gregor, B.W., Haradon, Z., Haynor, D.R., Hohmann, J.G., Horvath, S., Howard, R.E., Jeromin, A., Jochim, J.M., Kinnunen, M., Lau, C., Lazarz, E.T., Lee, C., Lemon, T.A., Li, L., Li, Y., Morris, J.A., Overly, C.C., Parker, P.D., Parry, S.E., Reding, M., Royall, J.J., Schulkin, J., Sequeira, P.A., Slaughterbeck, C.R., Smith, S.C., Sodt, A.J., Sunkin, S.M., Swanson, B.E., Vawter, M.P., Williams, D., Wohnoutka, P., Zielke, H.R., Geschwind, D.H., Hof, P.R., Smith, S.M., Koch, C., Grant, S.G., and Jones, A.R. (2012) An anatomically comprehensive atlas of the adult human brain transcriptome. Nature 489(7416), 391-9
Hsiao, L.L., Dangond, F., Yoshida, T., Hong, R., Jensen, R.V., Misra, J., Dillon, W., Lee, K.F., Clark, K.E., Haverty, P., Weng, Z., Mutter, G.L., Frosch, M.P., MacDonald, M.E., Milford, E.L., Crum, C.P., Bueno, R., Pratt, R.E., Mahadevappa, M., Warrington, J.A., Stephanopoulos, G., and Gullans, S.R.
(2001) A compendium of gene expression in normal human tissues. Physiological Genomics 7(2), 97-104
Huang da, W., Sherman, B.T., and Lempicki, R.A. (2009) Bioinformatics enrichment tools: paths toward the comprehensive functional analysis of large gene lists. Nucleic Acids Research 37(1), 1-13
Keady, S.M., Creevey, C., Kenny, D.A., and Waters, S.M. (In preparation) Transcriptional regulation in M. longissimus dorsi during nutritional restriction and compensatory growth in Aberdeen Angus steers using RNAseq technology.
Keibler, E., and Brent, M.R. (2003) Eval: a software package for analysis of genome annotations. BMC Bioinformatics 4, 50
Kent, W.J., Sugnet, C.W., Furey, T.S., Roskin, K.M., Pringle, T.H., Zahler, A.M., and Haussler, D. (2002) The human genome browser at UCSC. Genome Research 12(6), 996-1006
Lawrence, M., Gentleman, R., and Carey, V. (2009) rtracklayer: an R package for interfacing with genome browsers. Bioinformatics 25(14), 1841-2
Lein, E.S., Hawrylycz, M.J., Ao, N., Ayres, M., Bensinger, A., Bernard, A., Boe, A.F., Boguski, M.S., Brockway, K.S., Byrnes, E.J., Chen, L., Chen, T.M., Chin, M.C., Chong, J., Crook, B.E., Czaplinska, A., Dang, C.N., Datta, S., Dee, N.R., Desaki, A.L., Desta, T., Diep, E., Dolbeare, T.A., Donelan, M.J., Dong, H.W., Dougherty, J.G., Duncan, B.J., Ebbert, A.J., Eichele, G., Estin, L.K., Faber, C., Facer, B.A., Fields, R., Fischer, S.R., Fliss, T.P., Frensley, C., Gates, S.N., Glattfelder, K.J., Halverson, K.R., Hart, M.R., Hohmann, J.G., Howell, M.P., Jeung, D.P., Johnson, R.A., Karr, P.T., Kawal, R., Kidney, J.M., Knapik, R.H., Kuan, C.L., Lake, J.H., Laramee, A.R., Larsen, K.D., Lau, C., Lemon, T.A., Liang, A.J., Liu, Y., Luong, L.T., Michaels, J., Morgan, J.J., Morgan, R.J., Mortrud, M.T., Mosqueda, N.F., Ng, L.L., Ng, R., Orta, G.J., Overly, C.C., Pak, T.H., Parry, S.E., Pathak, S.D., Pearson,
O.C., Puchalski, R.B., Riley, Z.L., Rockett, H.R., Rowland, S.A., Royall, J.J., Ruiz, M.J., Sarno, N.R., Schaffnit, K., Shapovalova, N.V., Sivisay, T., Slaughterbeck, C.R., Smith, S.C., Smith, K.A., Smith, B.I., Sodt, A.J., Stewart, N.N., Stumpf, K.R., Sunkin, S.M., Sutram, M., Tam, A., Teemer, C.D., Thaller, C., Thompson, C.L., Varnam, L.R., Visel, A., Whitlock, R.M., Wohnoutka, P.E., Wolkey, C.K., Wong, V.Y., Wood, M., Yaylaoglu, M.B., Young, R.C., Youngstrom, B.L., Yuan, X.F., Zhang, B., Zwingman, T.A., and Jones, A.R. (2007) Genome-wide atlas of gene expression in the adult mouse brain. Nature 445(7124), 168-76
Li, B., Ruotti, V., Stewart, R.M., Thomson, J.A., and Dewey, C.N. (2010) RNA-Seq gene expression estimation with read mapping uncertainty. Bioinformatics 26(4), 493-500
Li, H., Handsaker, B., Wysoker, A., Fennell, T., Ruan, J., Homer, N., Marth, G., Abecasis, G., and Durbin, R. (2009) The Sequence Alignment/Map format and SAMtools. Bioinformatics 25(16), 2078-9
Loven, J., Orlando, D.A., Sigova, A.A., Lin, C.Y., Rahl, P.B., Burge, C.B., Levens, D.L., Lee, T.I., and Young, R.A. (2012) Revisiting global gene expression analysis. Cell 151(3), 476-82
Mamo, S., Mehta, J.P., McGettigan, P., Fair, T., Spencer, T.E., Bazer, F.W., and Lonergan, P. (2011) RNA sequencing reveals novel gene clusters in bovine conceptuses associated with maternal recognition of pregnancy and implantation. Biology of Reproduction 85(6), 1143-51
Matthews, D., Waters, S.M., Creevey, C., Morris, D.G., Kenny, D.A., and Diskin, M.G. (In preparation) The effect of severe short term dietary restriction on gene expression in the bovine hypothalamus using next generation RNA sequencing technology.
McCabe, M.S., Waters, S.M., Morris, D.G., Kenny, D.A., Lynn, D.J., and Creevey, C.J. (2012) RNA-seq analysis of differential gene expression in liver from lactating dairy cows divergent in negative energy balance.
BMC Genomics 13(1), 193
Morozov, I.Y., Jones, M.G., Gould, P.D., Crome, V., Wilson, J.B., Hall, A.J., Rigden, D.J., and Caddick, M.X. (2012) mRNA 3' tagging is induced by nonsense-mediated decay and promotes ribosome dissociation. Molecular and Cellular Biology 32(13), 2585-95
Mortazavi, A., Williams, B.A., McCue, K., Schaeffer, L., and Wold, B. (2008) Mapping and quantifying mammalian transcriptomes by RNA-Seq. Nature Methods 5(7), 621-8
O'Loughlin, A., Lynn, D.J., McGee, M., Doyle, S., McCabe, M., and Earley, B. (2012) Transcriptomic analysis of the stress response to weaning at housing in bovine leukocytes using RNA-seq technology. BMC Genomics 13(1), 250
Pearson, R.D., Liu, X., Sanguinetti, G., Milo, M., Lawrence, N.D., and Rattray, M. (2009) puma: a Bioconductor package for propagating uncertainty in microarray analysis. BMC Bioinformatics 10, 211
Pluta, K., McGettigan, P.A., Reid, C.J., Browne, J.A., Irwin, J.A., Tharmalingam, T., Corfield, A., Baird, A.W., Loftus, B.J., Evans, A.C., and Carrington, S.D. (2012) Molecular aspects of mucin biosynthesis and mucus formation in the bovine cervix during the periestrous period. Physiological Genomics 44(24), 1165-1178
Quinlan, A.R., and Hall, I.M. (2010) BEDTools: a flexible suite of utilities for comparing genomic features. Bioinformatics 26(6), 841-2
Robinson, J.T., Thorvaldsdottir, H., Winckler, W., Guttman, M., Lander, E.S., Getz, G., and Mesirov, J.P. (2011) Integrative genomics viewer. Nature Biotechnology 29(1), 24-6
Schug, J., Schuller, W.P., Kappen, C., Salbaum, J.M., Bucan, M., and Stoeckert, C.J., Jr. (2005) Promoter features related to tissue specificity as measured by Shannon entropy. Genome Biology 6(4), R33
Shannon, C.E., and Weaver, W. (1949) 'The mathematical theory of communication.' (University of Illinois Press: Urbana) v (i.e. vii), 117 p.
Siddiqui, A.S., Khattra, J., Delaney, A.D., Zhao, Y., Astell, C., Asano, J., Babakaiff, R., Barber, S., Beland, J., Bohacec, S., Brown-John, M., Chand, S., Charest, D., Charters, A.M., Cullum, R., Dhalla, N., Featherstone, R., Gerhard, D.S., Hoffman, B., Holt, R.A., Hou, J., Kuo, B.Y., Lee, L.L., Lee, S., Leung, D., Ma, K., Matsuo, C., Mayo, M., McDonald, H., Prabhu, A.L., Pandoh, P., Riggins, G.J., de Algara, T.R., Rupert, J.L., Smailus, D., Stott, J., Tsai, M., Varhol, R., Vrljicak, P., Wong, D., Wu, M.K., Xie, Y.Y., Yang, G., Zhang, I., Hirst, M., Jones, S.J., Helgason, C.D., Simpson, E.M., Hoodless, P.A., and Marra, M.A. (2005) A mouse atlas of gene expression: large-scale digital gene-expression profiles from precisely defined developing C57BL/6J mouse tissues and cells. Proceedings of the National Academy of Sciences of the United States of America 102(51), 18485-90
Smyth, G.K. (2004) Linear models and empirical bayes methods for assessing differential expression in microarray experiments. Statistical Applications in Genetics and Molecular Biology 3, Article3
Stocco, C. (2008) Aromatase expression in the ovary: hormonal and molecular regulation. Steroids 73(5), 473-87
Struhl, K. (2007) Transcriptional noise and the fidelity of initiation by RNA polymerase II. Nature Structural and Molecular Biology 14(2), 103-5
Su, A.I., Wiltshire, T., Batalov, S., Lapp, H., Ching, K.A., Block, D., Zhang, J., Soden, R., Hayakawa, M., Kreiman, G., Cooke, M.P., Walker, J.R., and Hogenesch, J.B. (2004) A gene atlas of the mouse and human protein-encoding transcriptomes. Proceedings of the National Academy of Sciences of the United States of America 101(16), 6062-7
Trapnell, C., Pachter, L., and Salzberg, S.L. (2009) TopHat: discovering splice junctions with RNA-Seq.
Bioinformatics 25(9), 1105-11
Trapnell, C., Williams, B.A., Pertea, G., Mortazavi, A., Kwan, G., van Baren, M.J., Salzberg, S.L., Wold, B.J., and Pachter, L. (2010) Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and isoform switching during cell differentiation. Nature Biotechnology 28(5), 511-5
van Bakel, H., Nislow, C., Blencowe, B.J., and Hughes, T.R. (2010) Most "dark matter" transcripts are associated with known genes. PLoS Biology 8(5), e1000371
Walsh, S.W., Mehta, J.P., McGettigan, P.A., Browne, J.A., Forde, N., Alibrahim, R.M., Mulligan, F.J., Loftus, B., Crowe, M.A., Matthews, D., Diskin, M., Mihm, M., and Evans, A.C. (2012) Effect of the metabolic environment at key stages of follicle development in cattle: focus on steroid biosynthesis. Physiological Genomics 44(9), 504-17
Warrington, J.A., Nair, A., Mahadevappa, M., and Tsyganskaya, M. (2000) Comparison of human adult and fetal expression and identification of 535 housekeeping/maintenance genes. Physiological Genomics 2(3), 143-7
Wasserman, W.W., and Fickett, J.W. (1998) Identification of regulatory regions which confer muscle-specific gene expression. Journal of Molecular Biology 278(1), 167-81

Additional Files:
Supplementary Table S1.

Table 1: Sequencing metrics for the 10 bovine tissue samples under investigation

| Tissue type | Breed of animals | Number of animals | Sex | Number of samples | Library type | Read length (base pairs) | Number of reads | Total tissue library size (Gbases) | GEO ID | Original study | Pubmed ID |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Endometrium | Mixed breed beef | 20 | F | 20 | Single read | 36 | 119,950,155 | 4.3 | GSE56392 | Forde et al 2012 | 22759920 |
| Granulosa | Holstein-Friesian | 36 | F | 36 | Single read | 36 | 263,863,637 | 9.5 | GSE34317 | Walsh et al 2012 | 22414914 |
| Theca | Holstein-Friesian | 37 | F | 37 | Single read | 36 | 356,596,059 | 12.8 | GSE34317 | Walsh et al 2012 | 22414914 |
| Cervix | Mixed breed beef | 30 | F | 30 | Single read | 42 | 475,102,282 | 20.0 | GSE38225 | Pluta et al 2012 | 23092952 |
| Embryo (Day 5 to 19) | Mixed breed beef | 28 | Mixed | 28 | Pooled, single read | 84 | 664,986,589 | 55.9 | GSE56513 | Mamo et al 2011 | 21795669 |
| Leukocytes | Simmental | 16 | M | 55 | Single read | 36 | 1,229,589,572 | 44.3 | GSE37447 | O'Loughlin et al 2012 | 22708644 |
| Liver | Holstein-Friesian | 12 | F | 21 | Single read | 36 | 289,137,975 | 10.4 | GSE37544 | McCabe et al 2012 | 22607119 |
| Hypothalamus | Mixed breed beef | 23 | F | 23 | Single read | 42 | 594,937,671 | 24.9 | GSE49540 | Matthews et al (In prep.) | In preparation |
| Pituitary | Mixed breed beef | 3 pools | F | 3 | Pooled, single read | 42 | 106,119,419 | 4.5 | | Matthews et al (In prep.) | In preparation |
| Muscle | Aberdeen Angus | 27 | M | 27 | Paired end | 2x40 | 936,006,782 | 74.9 | | Keady et al (In prep.) | In preparation |
| Total | | | | 280 | | | 5,034,250,870 | 261.4 | | | |

GEO ID GSE48481 is also listed in the table; the extracted layout does not show whether it belongs to the pituitary or the muscle row.

Table 2: Numbers of Ensembl annotated genes detected in each tissue type with ≥1, ≥5 reads (transcripts) and ≥10 tags per million reads (TPM). Note there are 24,616 genes in total in the Ensembl bovine annotation.

| Tissue type | Number of transcripts detected (reads ≥ 1) | Number of transcripts detected (reads ≥ 5) | Number of transcripts detected (TPM ≥ 10) |
|---|---|---|---|
| Endometrium | 19,360 | 17,078 | 12,104 |
| Granulosa | 19,377 | 16,973 | 11,205 |
| Theca | 19,840 | 17,571 | 11,133 |
| Cervix | 20,287 | 18,186 | 12,596 |
| Embryo (Day 5 to 19) | 22,874 | 21,444 | 16,141 |
| Leukocytes | 19,772 | 17,632 | 10,205 |
| Liver | 19,141 | 16,740 | 9,782 |
| Hypothalamus | 21,259 | 19,006 | 11,996 |
| Pituitary | 18,465 | 15,861 | 10,080 |
| Muscle | 18,031 | 15,913 | 7,020 |
| All above tissues | 23,818 | 22,963 | 17,893 |

Table 3: Gini coefficient for each tissue. First column shows Gini coefficient for all 24,616 genes in Ensembl. Second column shows Gini coefficient for all non-zero genes. The Gini coefficient is a measure of the unevenness in the distribution of the reads (transcripts) among the genes. Samples with a Gini coefficient of 0.00 would have an equal distribution of reads among all genes. Samples with a Gini coefficient of 1.00 would have all reads from a single gene. Tissues are listed in order of decreasing number of genes contributing to total read count.

| Tissue | Gini (all genes) | Gini (all non-zero genes) |
|---|---|---|
| (equal distribution) | 0.00 | 0.00 |
| Endometrium | 0.79 | 0.79 |
| Hypothalamus | 0.81 | 0.80 |
| Embryo | 0.83 | 0.82 |
| Cervix | 0.84 | 0.84 |
| Pituitary | 0.85 | 0.84 |
| Granulosa | 0.85 | 0.84 |
| Theca | 0.85 | 0.85 |
| Leukocytes | 0.87 | 0.87 |
| Liver | 0.89 | 0.89 |
| Muscle | 0.96 | 0.96 |
| (all reads from a single gene) | 1.00 | 1.00 |

Table 4: List of over-represented KEGG pathways and DAVID functional clusters for the significantly tissue expressed genes. For those tissues with no over-represented pathways or clusters the complete list of genes is shown. Final row of table shows over-represented pathways among genes exhibiting a housekeeping profile.

| Tissue | Total sig genes | KEGG top pathways | DAVID top clusters | Genes (symbol) |
|---|---|---|---|---|
| Cervix | 35 | | Extracellular region; Secretory granule | |
| Embryo | 32 | | Protease; Proteinase inhibitor I2, Kunitz metazoa; Signal peptide; Regulation of transcription, DNA-dependent | |
| Endometrium | 3 | | | Proenkephalin (PENK); Protease serine 16 (thymus) (PRSS16); Similar to thioesterase superfamily member 5 (THEM5/Bt.33457) |
| Granulosa | 4 | | | Inhibin beta A (INHBA); Inhibin alpha (INHBB); Aromatase (Cytochrome P450 XIX) (CYP19A1); Follicle stimulating hormone receptor (FSHR) |
| Hypothalamus | 36 | Taurine and hypotaurine metabolism; Beta-alanine metabolism; Alanine, aspartate and glutamate metabolism; Butanoate metabolism; Type I diabetes mellitus | Neuron projection; Immunoglobulin V-set; Integral to membrane; Cell surface receptor linked signal transduction; Membrane fraction | |
| Leukocytes | 35 | Chemokine signaling; Cytokine-cytokine receptor interaction; Cell adhesion molecules (CAMs); Primary immunodeficiency | Chemokine receptor activity; Plasma membrane; Integral to membrane | |
| Liver | 196 | Complement and coagulation cascades; Retinol metabolism; Drug metabolism; Steroid hormone biosynthesis; Metabolism of xenobiotics by cytochrome P450 | Secreted; Enzyme inhibitor activity; Transition metal ion binding; Ion binding; Complement and coagulation cascades | |
| Muscle | 99 | Tight junction; Viral myocarditis; Focal adhesion; Arrhythmogenic right ventricular cardiomyopathy; Hypertrophic cardiomyopathy | Myofibril; Muscle organ development; Muscle protein; Striated muscle development; Sarcoplasmic reticulum | |
| Pituitary | 10 | Neuroactive ligand-receptor interaction | | Follicle stimulating hormone beta (FSHB); Glycoprotein hormones alpha polypeptide (CGA); Gonadotrophin releasing hormone receptor (GNRHR); Growth hormone (GH1); Prolactin (PRL); Thyroid stimulating hormone beta (TSHB); Similar to peptidyl-pro cis trans isomerase (LOC782178); Pituitary specific transcription factor (POU class 1 homeobox 1) (POU1F1); Growth hormone releasing hormone receptor (GHRHR); Immunoglobulin superfamily member 1 (IGSF1) |
| Theca | 2 | | | CYPXVII, cytochrome P450 17A1 (CYP17A1); Insulin-like 3 (Leydig cell) (INSL3) |
| Housekeeping | 411 | Huntington's disease; Parkinson's disease; Oxidative phosphorylation; Alzheimer's disease; Proteasome; Aminoacyl t-RNA synthesis; Nucleotide excision repair; Purine metabolism; RNA polymerase; Valine, leucine and isoleucine biosynthesis; Ubiquitin mediated proteolysis | Mitochondrion; Proteasome; Proteolysis; Translation/ribosome; DNA repair; Ubiquitin; Protein neddylation; tRNA aminoacylation; Methyltransferase activity | |

Figure 1: Hierarchical clustering of individual samples in ten bovine tissues.

Figure 2: Principal components plot showing first 2 principal components of 10 bovine tissues.

Figure 3: Boxplot of tags per million (TPM) showing the gene in each tissue that has the strongest tissue-specific signature, i.e. lowest Qgt value. The last 2 plots show the 2 genes with the strongest housekeeping profile, i.e. ubiquitously expressed at the same high level in all tissues (highest entropy score Hg).
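
The Gini coefficient reported in Table 3 can be illustrated with a minimal sketch. It assumes the standard rank-based formula on per-gene read counts sorted in ascending order, G = 2 * sum(i * x_i) / (n * sum(x_i)) - (n + 1) / n; the implementation used in the study is not described in this section, and the function name and example counts are hypothetical. With a finite number of genes n, the maximum value is (n - 1) / n, which approaches 1.00 only for large n such as the 24,616 annotated genes.

```python
def gini(counts):
    """Gini coefficient of per-gene read counts (illustrative sketch).

    counts: iterable of non-negative read counts, one value per gene.
    Returns 0.0 when reads are spread equally across genes and
    (n - 1) / n when all reads come from a single gene.
    """
    xs = sorted(counts)
    n = len(xs)
    total = sum(xs)
    if n == 0 or total == 0:
        raise ValueError("need at least one gene with a non-zero count")
    # Rank-weighted sum over counts in ascending order (ranks start at 1)
    weighted = sum(rank * x for rank, x in enumerate(xs, start=1))
    return 2 * weighted / (n * total) - (n + 1) / n


# Hypothetical counts for one sample: a few genes dominate the read pool
sample = [0, 0, 3, 5, 12, 40, 900]
gini_all_genes = gini(sample)
# Second column of Table 3: genes with zero reads are dropped first
gini_non_zero = gini([x for x in sample if x > 0])
print(round(gini_all_genes, 2), round(gini_non_zero, 2))
```

Dropping the zero-count genes before the calculation corresponds to the "all non-zero genes" column, which can only lower or leave unchanged the coefficient relative to the "all genes" column.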