IETS 2016
Fertility and genomics: comparison of gene expression in contrasting reproductive tissues of female cattle
PA McGettigan1, JA Browne1, SD Carrington2, MA Crowe2, T Fair1, N Forde1,6, BJ Loftus3, A Lohan3, P Lonergan1, K Pluta2, S Mamo1, A Murphy4, J Roche3, SW Walsh1,7, CJ Creevey5,8, B Earley5, S Keady5, DA Kenny5, D Matthews5, M McCabe5, D Morris5, A O’Loughlin5, S Waters5, MG Diskin5, ACO Evans1,4,*
1School of Agriculture and Food Science, 2School of Veterinary Medicine, 3School of Medicine and Medical Sciences, 4Conway Institute, University College Dublin, Belfield, Dublin 4, Ireland
5Animal and Grassland Research and Innovation Centre, Teagasc, Athenry, Co Galway, Ireland.
6Present Address: Division of Reproduction and Early Development, Leeds Institute of Cardiovascular and Metabolic Medicine, University of Leeds, Leeds, United Kingdom
7Present Address: Department of Chemical and Life Sciences, Waterford Institute of Technology, Waterford, Ireland
8Present Address: Institute of Biological, Environmental and Rural Sciences, Aberystwyth University, Aberystwyth, Ceredigion, United Kingdom
*Correspondence: A Evans, alex.evans@ucd.ie.
Abstract
To compare gene expression among bovine tissues we used large bovine RNAseq datasets comprising 280 samples from 10 different bovine tissues (uterine endometrium, granulosa cells, theca cells, cervix, embryos, leukocytes, liver, hypothalamus, pituitary, muscle) generating 260 Gbases of data. We used twin approaches of an information-theoretic analysis of the existing annotated transcriptome to identify the most tissue-specific genes, as well as a de-novo transcriptome annotation to evaluate general features of the transcription landscape. We detected expression of 97% of the Ensembl transcriptome with at least one read in one sample and between 28% and 66% at a level of 10 Tags Per Million (TPM) or greater in individual tissues. Over 95% of genes exhibited some level of tissue-specific gene expression. This was mostly due to different levels of expression in different tissues rather than exclusive expression in a single tissue. Less than 1% of annotated genes exhibited a highly restricted tissue-specific expression profile and approximately 2% exhibited classic housekeeping profiles. We conclude that it is the combined effects of the variable expression of large numbers of genes (73 to 93% of the genome) with the specific expression of a small number of genes (less than 1% of the transcriptome) that contribute to determining the outcome of the function of individual tissues.
Keywords: RNAseq, transcription, bovine, reproduction, uterus, endometrium, embryo, cervix, ovary, follicle, theca, granulosa, hypothalamus, pituitary, leukocyte, muscle, liver

Short title: Gene expression in bovine reproductive tissues
1. Introduction
Recently, there has been a flood of development of new ‘omic’ technologies such as proteomics, transcriptomics and metabolomics that are enabling the generation of vast amounts of novel data characterizing different aspects of cellular biology at a global level. These technologies have been used in an attempt to better understand the development of various tissues, with transcriptomics studies producing the most data. A major challenge of this new era is to determine the biological importance of these data in the context of cell and tissue function. In this paper, we focus on the large volumes of data that we have produced in our studies of bovine tissues in recent years, with a focus on reproductive tissues.
2. Development of reproductive tissues
While most tissues in the body are continually engaged in turnover, many reproductive tissues are actively engaged in vigorous periods of tissue (cellular) proliferation that are often followed by periods of dramatic and wholesale tissue degradation and regression (this is especially true in female reproductive tissues). Cumulatively, the coordination of the proliferation and regression of tissues determines the success of reproduction and fertility. It is for this reason that many investigators have chosen to focus their attention on the development of specific tissues, usually within narrow timeframes during the reproductive process, to understand their contribution to the outcome of reproduction.
In taking a step back from the mass of data available on individual tissues, it is interesting to consider some of the similarities and differences among tissues during their developmental processes. For example, some reproductive tissues are relatively static in mass (although not always in function), with the hypothalamus and pituitary being clear examples. Other tissues develop over long periods of time; oocytes and follicles are the best examples, developing over months (or years depending on your point of view) with the most dramatic changes occurring in the final days before ovulation. Uterine tissues undergo changes during the reproductive cycle in preparation for pregnancy and then change again over the months of gestation. Other tissues develop much more rapidly, with the early embryo and the corpus luteum undergoing changes in size and morphology that change them almost unrecognizably over a period of just a few days.
To better understand these changes, and the factors controlling them, many studies have focused on measuring the expression of genes, and significant amounts of data have been generated during the last few years. One of the key technologies enabling this has been RNA Sequencing (RNA-seq). Since its initial development in 2006 (Bainbridge et al. 2006), RNA-seq has rapidly displaced gene expression microarrays for large-scale transcriptional profiling and is now the technology of choice in many laboratories. Some of the advantages of RNA-seq over microarrays include global profiling of transcripts including currently unannotated transcripts, identification of novel transcript isoforms as well as more accurate quantification of transcript levels. In practice, many researchers, including ourselves, have used the technology as a direct replacement for microarrays and tended to restrict their initial analyses to the known annotated transcriptome of the organism of interest (Mamo et al. 2011; Foley et al. 2012; Forde et al. 2012; O'Loughlin et al. 2012; Pluta et al. 2012; Walsh et al. 2012; Keady In preparation; Matthews In preparation).
The availability of a complete bovine genome sequence has enabled the application of the RNA-seq protocol to bovine samples (Elsik et al. 2009). We have previously published several RNA-seq-derived transcriptomic studies focusing on aspects of bovine reproduction, fertility and productivity traits under various experimental conditions (see Table 1). However, questions about the completeness of the current bovine transcriptome annotation, and about characteristic gene expression differences between tissue types, remain unaddressed. There are also other gaps in our knowledge of the transcriptome; for example, to what extent are genes universally or uniquely expressed in individual tissue types? There are also long-standing questions about the existence or otherwise of so-called “housekeeping genes” with consistent expression levels in all tissue types at all times (which could be used as reference genes for calibration of global expression studies).
In order to address these questions, the aim of this study was to conduct a de-novo annotation of the bovine transcriptome using data from 280 bovine samples taken from 10 distinct tissue types and to compare it to the Ensembl bovine annotation (Ensembl 65) to identify novel bovine transcripts. In addition, the patterns of transcription between different tissue types were compared to identify genes with highly tissue-specific expression patterns and housekeeping genes.
3. Materials and Methods
Animal Handling
All animal procedures performed for the generation of tissue samples in this study were conducted under experimental license from the Irish Department of Health and Children in accordance with the Cruelty to Animals Act 1876 and the European Communities (Amendment of Cruelty to Animals Act 1876) regulation 2002 and 2005 with approval from individual institutional ethics committees.
Tissue sources
The sequencing data were generated from 10 different tissue types as part of 8 separate experiments, some of which have been published. The details of tissue type, cattle breed and number of samples are listed in Table 1. In brief, the samples consisted of 20 uterine endometrium samples from mixed breed beef heifers collected on Day 13 and Day 16 after estrus, of which 5 samples at each time point were confirmed pregnant and 5 were non-pregnant (Forde et al. 2012). The follicular granulosa and theca cell samples were paired samples taken from dominant pre-ovulatory follicles of Holstein-Friesian cows and heifers (37 animals in total) at 3 stages of follicular development: selection, differentiation and luteinization (Walsh et al. 2012). The cervical tissues were taken from 30 mixed breed beef heifers at 6 time points during the peri-estrus period (Pluta et al. 2012). The embryo samples consisted of 28 pooled samples from mixed breed beef cattle taken at 5 different days post-fertilization (Mamo et al. 2011). The leukocytes were taken from 16 Simmental male beef calves at 4 different time points post-weaning, resulting in 55 samples in total (not all animals yielded 4 samples each) (O'Loughlin et al. 2012). The liver samples were taken from 12 early post-partum Holstein-Friesian dairy cows in either mild or severe negative energy balance (McCabe et al. 2012). The hypothalamus and pituitary samples were taken from 23 mixed breed beef animals (Matthews In preparation). The muscle samples were taken from the M. longissimus dorsi of 27 Aberdeen Angus steers undergoing nutritional restriction and compensatory growth (Keady In preparation). In some cases multiple samples were not collected from all individual animals, giving rise to the actual number of samples contributing to the study as shown in Table 1.
RNA-seq library preparation
All samples were sequenced on a GAIIx sequencer (Illumina) by the Conway Institute Transcriptomics laboratory at University College Dublin, Ireland. All libraries were non-strand-specific and were processed as single read libraries (with the exception of the muscle samples that were processed as paired-end libraries). The library type, read length, total numbers of reads generated for each library type and Gene Expression Omnibus (GEO) ID (Edgar et al. 2002) (where available) are listed in Table 1.
Alignment and preprocessing
FASTQ files (Cock et al. 2010) from each library were converted to the Sanger FASTQ format and were then aligned individually to the UMD3.1/BosTau6 bovine genome assembly using the software TopHat version 1.4.0 (Trapnell et al. 2009). Individual alignments, in BAM format, from each library were merged together using the samtools merge command (Li et al. 2009). Finally, all combined tissue BAM files were merged together into a single file. De novo transcriptome annotation was carried out on each individual tissue and on the combined dataset using Cufflinks v1.1.0 (Trapnell et al. 2010). The Ensembl v65 annotation of the bovine genome was taken as the reference transcriptome (Flicek et al. 2012). Coordinates of repetitive regions were downloaded for the UMD3.1 assembly from the UCSC genome browser (Kent et al. 2002). Introns were identified from the alignments with Python code utilizing the pysam library. Visualization of the alignments was carried out using the Integrative Genomics Viewer (IGV) v2.0 (Robinson et al. 2011). Eval version 2.2.8 (Keibler and Brent 2003) was used to generate summary statistics for the de-novo transcriptomes. The BEDTools (Quinlan and Hall 2010) intersectBed program as well as the GenomicRanges (Aboyoun 2015) and rtracklayer (Lawrence et al. 2009) packages from R/Bioconductor were used to identify overlap of exons with genomic features such as repetitive regions or annotated Ensembl/RefSeq exons.
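The intron-identification step can be illustrated with a minimal sketch. The original code is not reproduced here; the version below is a stand-in that assumes spliced alignments are available as (operation, length) CIGAR tuples using the numeric codes of the SAM/BAM specification (the same encoding pysam exposes as `cigartuples`), so that it runs without a BAM file. The function name `introns_from_cigar` is hypothetical.

```python
# CIGAR operation codes from the SAM/BAM specification
MATCH, INS, DEL, REF_SKIP, SOFT_CLIP, SEQ_MATCH, SEQ_MISMATCH = 0, 1, 2, 3, 4, 7, 8

# Operations that advance the position on the reference genome
CONSUMES_REFERENCE = {MATCH, DEL, REF_SKIP, SEQ_MATCH, SEQ_MISMATCH}


def introns_from_cigar(ref_start, cigartuples):
    """Return (start, end) reference intervals (0-based, end-exclusive)
    for every skipped region (CIGAR 'N') in one spliced alignment."""
    introns = []
    pos = ref_start
    for op, length in cigartuples:
        if op == REF_SKIP:
            introns.append((pos, pos + length))
        if op in CONSUMES_REFERENCE:
            pos += length
    return introns


# A 36 bp read aligned at position 1000 with CIGAR 20M500N16M spans one intron
example = introns_from_cigar(1000, [(MATCH, 20), (REF_SKIP, 500), (MATCH, 16)])
print(example)
```

In practice the intervals from all reads would be pooled and counted so that each candidate intron is supported by a number of independent spliced reads.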
Quality Control
Density plots of the logged Reads Per Kilobase per Million (RPKM) (Mortazavi et al. 2008) levels for each gene in each sample were generated. Samples with a read depth of less than 2 million reads in total, or with more than 80% non-uniquely mapping reads, were to be excluded from further analysis. None of the 280 samples were excluded from further analysis using these QC criteria.
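The two exclusion criteria can be written as a simple filter. This is a sketch only: the sample names and read counts are invented for illustration, and it assumes the non-unique percentage is computed relative to the total reads of the library, which the text does not state explicitly.

```python
def passes_qc(total_reads, non_unique_reads,
              min_reads=2_000_000, max_non_unique_fraction=0.80):
    """Return True when a library meets both QC criteria described in the text."""
    if total_reads < min_reads:
        return False
    # Assumption: the fraction is taken over all reads in the library
    return non_unique_reads / total_reads <= max_non_unique_fraction


# Invented (total reads, non-uniquely mapping reads) pairs, for illustration only
samples = {
    "sample_A": (12_500_000, 3_100_000),
    "sample_B": (1_400_000, 200_000),     # fewer than 2 million reads
    "sample_C": (9_000_000, 7_650_000),   # 85% non-unique reads
}

kept = [name for name, (total, multi) in samples.items() if passes_qc(total, multi)]
print(kept)
```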
Initial quality control checks identified a bias in the data, exhibited by the paired-end muscle data, which had a much-reduced percentage of non-unique reads compared to the single read libraries. In order to ensure the libraries were comparable with each other for the purposes of identifying differential expression, all FASTQ files were trimmed to a common length of 36 bp following the approach of Anders et al. (2012). In addition, only the first read in the paired-end muscle data was used for differential gene expression analysis.
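Trimming reads to a common length is straightforward to sketch. The generator below is an illustration, not the tool used in the study; it assumes plain four-line FASTQ records (no wrapped sequence lines), and the name `trim_fastq_records` is hypothetical.

```python
def trim_fastq_records(lines, length=36):
    """Yield FASTQ lines with sequence and quality strings cut to `length`.

    Assumes each record occupies exactly four lines:
    header, sequence, separator ('+'), quality.
    """
    record = []
    for line in lines:
        record.append(line.rstrip("\n"))
        if len(record) == 4:
            header, seq, plus, qual = record
            yield from (header, seq[:length], plus, qual[:length])
            record = []


# One invented 40 bp read, trimmed to 36 bp
example = ["@read1", "ACGT" * 10, "+", "I" * 40]
trimmed = list(trim_fastq_records(example))
print(len(trimmed[1]), len(trimmed[3]))
```

Reads that are already shorter than the target length pass through unchanged, so a real pipeline would need to decide separately whether to discard them.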
Dimensional reduction and clustering
Hierarchical clustering of the individual samples was carried out using the ColorDendrogram function from the sparcl R package. The Eisen distance metric was used as implemented in the MADE4 Bioconductor package (Culhane et al. 2005). Principal Component Analysis plots were generated using the function prcomp in R.
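As a rough illustration of a correlation-based sample distance, the sketch below computes one minus the Pearson correlation between two expression profiles. That this matches the Eisen metric as implemented in MADE4 is an assumption on our part and should be checked against the package source; the function names are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation between two equal-length expression profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def correlation_distance(x, y):
    """Assumed form of the distance: 1 - r, ranging from 0 (identical shape) to 2."""
    return 1.0 - pearson(x, y)


# Invented profiles: b is a scaled copy of a, c is a reversed in order
a = [1.0, 2.0, 3.0, 4.0]
b = [2.0, 4.0, 6.0, 8.0]
c = [4.0, 3.0, 2.0, 1.0]
print(correlation_distance(a, b), correlation_distance(a, c))
```

Because the measure depends only on the shape of the profile, two samples that differ by a constant scaling factor are treated as identical, which is useful when sequencing depth varies between libraries.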
Quantifying the diversity/dispersion of gene expression in a tissue
The Gini coefficient, a measure of the unevenness of the distribution of reads, was calculated using the gini function in the R library reldist (Handcock et al. 1999). It is most commonly used in the social sciences as a measure of income inequality across different segments of a population. It is defined as twice the area between the 45 degree line and the Lorenz curve where, in this case, the Lorenz curve is a graph describing the cumulative share of total reads assigned to the bottom x% of the gene universe. A tissue with an exactly equal distribution of reads among all genes would have a Gini coefficient of zero and a tissue where all reads come from a single gene would have a Gini coefficient of 1. The total count of reads for each gene from all samples of each tissue was obtained and the Gini index for each tissue was calculated separately. The Gini index has been used by others to measure skewness in other aspects of transcriptomics such as PolyA length (Morozov et al. 2012).
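The definition above can be made concrete with a short sketch. It uses the standard rank-weighted formula for unweighted data, which is equivalent to twice the area between the line of equality and the Lorenz curve; it is not the reldist implementation, and the read counts are invented.

```python
def gini(counts):
    """Gini coefficient of non-negative per-gene read counts."""
    values = sorted(counts)
    n = len(values)
    total = sum(values)
    if n == 0 or total == 0:
        raise ValueError("need at least one non-zero count")
    # Rank-weighted sum over the ascending counts (ranks start at 1)
    weighted = sum(rank * value for rank, value in enumerate(values, start=1))
    return 2.0 * weighted / (n * total) - (n + 1) / n


# Invented per-gene totals for two hypothetical tissues
even_tissue = [5, 5, 5, 5]
skewed_tissue = [0, 0, 0, 10]
print(gini(even_tissue), gini(skewed_tissue))
```

With a finite number of genes n, the maximum of this formula is (n - 1)/n rather than exactly 1, so the value of 1 quoted above is the limit for a large gene universe.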
Gene expression measures of Ensembl 65 annotated transcripts
HTSeq (Anders and Huber 2010) was used to generate raw read counts for each gene in each library using the Ensembl 65 bovine annotation as the reference. These raw counts were converted into TPM (Tags Per Million) (Li et al. 2010) and RPKM metrics.
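The two conversions can be sketched as follows. This is an illustration under stated assumptions: TPM is taken literally as tags (reads) per million, the library total is taken to be the sum of gene-assigned reads, and the gene identifiers, counts and lengths are invented.

```python
def tags_per_million(counts):
    """Scale raw per-gene read counts to a library size of one million."""
    total = sum(counts.values())
    return {gene: count * 1e6 / total for gene, count in counts.items()}


def rpkm(counts, lengths_bp):
    """Reads per kilobase of gene length per million gene-assigned reads."""
    total = sum(counts.values())
    return {gene: count * 1e9 / (total * lengths_bp[gene])
            for gene, count in counts.items()}


# Invented library with two genes of different length
raw_counts = {"geneA": 500, "geneB": 1500}
gene_lengths = {"geneA": 1000, "geneB": 3000}

tpm_values = tags_per_million(raw_counts)
rpkm_values = rpkm(raw_counts, gene_lengths)
print(tpm_values, rpkm_values)
```

In this toy library geneB receives three times as many reads as geneA but is also three times longer, so the two genes have different TPM values and the same RPKM.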
Categorical tissue specificity
Following the method of Schug et al. (2005), tissue specificity of each gene was measured using the categorical tissue specificity metric (Qgt). Qgt weights genes according to the degree to which the expression of a gene is skewed towards a single tissue; it is based on the Shannon entropy of the gene, Hg (Shannon and Weaver 1949).
The following calculation was used: given expression levels of a gene in N tissues, the relative expression of a gene g in a tissue t was defined as:
$$p_{t|g} = \frac{w_{g,t}}{\sum_{1 \le t \le N} w_{g,t}}$$
where wg,t is the expression level of the gene in the tissue. In this case, either TPM or RPKM can be used as the expression level, as for the purposes of this calculation they are equivalent. To avoid division by zero in later calculations, a count of 1 is added to the raw counts for each gene in each sample before calculation of TPM and RPKM. The relative expression is then the RPKM of the gene in the tissue divided by the sum of RPKMs for that gene across all tissues.
The Shannon entropy of a gene's expression was calculated as:
$$H_g = -\sum_{1 \le t \le N} p_{t|g} \log_2(p_{t|g})$$
Hg has units of bits and ranges from zero for genes expressed in a single tissue to a maximum of log2(N) for genes expressed uniformly in all tissues. In this case, the maximum entropy of a gene was log2(10)=3.32 bits, which would represent a ubiquitously and uniformly expressed gene (i.e. an ideal housekeeping gene). This relative expression derived entropy calculation does not discriminate between absolute expression levels, so in order to give higher weight to genes with greater absolute expression levels, the categorical tissue specificity was defined as
$$Q_{g|t} = H_g - \log_2(p_{t|g})$$
The expression of a particular gene becomes more specific to a single tissue as the value of Qg|t approaches zero. By contrast, ideal housekeeping genes would have a Qg|t of 2log2(10)=6.64 bits in all tissues.
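The three quantities defined in this section (relative expression, Shannon entropy and categorical tissue specificity) can be computed in a few lines. The sketch below is illustrative rather than the code used in the study; it assumes that every expression level is strictly positive, as is the case after the pseudocount described above, and the expression values are invented.

```python
import math


def relative_expression(levels):
    """p_{t|g}: share of a gene's total expression found in each tissue."""
    total = sum(levels)
    return [w / total for w in levels]


def shannon_entropy(levels):
    """H_g in bits: 0 for a single-tissue gene, log2(N) for uniform expression."""
    return -sum(p * math.log2(p) for p in relative_expression(levels) if p > 0)


def categorical_specificity(levels):
    """Q_{g|t} = H_g - log2(p_{t|g}) for every tissue (levels must be > 0)."""
    h = shannon_entropy(levels)
    return [h - math.log2(p) for p in relative_expression(levels)]


# Invented expression levels across 10 tissues
housekeeping_like = [5.0] * 10
tissue_specific = [1000.0] + [1e-6] * 9

print(shannon_entropy(housekeeping_like), categorical_specificity(housekeeping_like)[0])
print(shannon_entropy(tissue_specific), categorical_specificity(tissue_specific)[0])
```

For the uniform profile the entropy equals log2(10) and every Qg|t equals 2log2(10), matching the housekeeping limits quoted in the text; for the skewed profile both the entropy and the Qg|t of the dominant tissue are close to zero.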
Tissue-specific gene lists and pathway analysis
Permutation testing of a balanced set of tissue samples was used to estimate the null distribution of the entropy values. Ten samples were taken from each tissue type (pituitary, which consisted of 3 pooled samples, was excluded from this step because of the low number of samples). The labels were randomly permuted 100 times and the minimum entropy score across all genes (0.404 bits) was used as the cut-off. Genes having an entropy value below this score in the original set were considered to be expressed in a highly tissue-specific manner.
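The permutation procedure can be sketched as follows. Several details are assumptions rather than facts from the text: the per-tissue expression level is taken as the mean over the samples carrying each (permuted) label, all expression values are assumed positive, and the toy data and names are invented.

```python
import math
import random


def entropy(levels):
    """Shannon entropy in bits of a gene's expression across tissues."""
    total = sum(levels)
    return -sum((w / total) * math.log2(w / total) for w in levels if w > 0)


def tissue_means(values, labels):
    """Mean expression per tissue label (assumed summary; see lead-in)."""
    sums, counts = {}, {}
    for value, label in zip(values, labels):
        sums[label] = sums.get(label, 0.0) + value
        counts[label] = counts.get(label, 0) + 1
    return [sums[label] / counts[label] for label in sorted(sums)]


def permutation_cutoff(expression, labels, n_permutations=100, seed=0):
    """Minimum entropy over all genes and all random relabellings of samples."""
    rng = random.Random(seed)
    shuffled = list(labels)
    minimum = math.inf
    for _ in range(n_permutations):
        rng.shuffle(shuffled)
        for values in expression.values():
            minimum = min(minimum, entropy(tissue_means(values, shuffled)))
    return minimum


# Toy balanced design: 3 tissues x 4 samples, 2 invented genes
labels = ["liver"] * 4 + ["muscle"] * 4 + ["cervix"] * 4
expression = {
    "gene1": [100, 110, 90, 105, 1, 1, 2, 1, 1, 2, 1, 1],
    "gene2": [10] * 12,
}
cutoff = permutation_cutoff(expression, labels)
print(cutoff)
```

Genes whose entropy under the true labels falls below the returned value would then be called highly tissue-specific, mirroring the 0.404-bit threshold reported above.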
The DAVID pathway annotation tool (Huang da et al. 2009) was used to determine over-represented KEGG pathways among the tissue-specific gene list generated for each tissue.
Correlation of tissue specific expression ranking between species
The probesets from the matching tissues from the Schug dataset were sorted according to the Qgt(rma) field. This ordering generated the ranks which were used to compare to the ranked list of genes from our RNAseq generated Qgt values. Human and mouse Affymetrix IDs were matched to bovine Ensembl gene IDs using cross-reference tables downloaded from the Ensembl Biomart tool. The Spearman correlation (Spearman's rho) was calculated using the cor.test function in R.
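Spearman's rho is the Pearson correlation of the ranks of two variables. The sketch below is a plain-Python illustration of that definition with tied values given their average rank; it does not compute the significance test that cor.test also reports, and the Qgt values are invented.

```python
import math


def average_ranks(values):
    """1-based ranks in ascending order; tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Pearson correlation computed on the ranks of x and y."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    return cov / math.sqrt(var_x * var_y)


# Invented Qgt values for four orthologous genes in two species
bovine_qgt = [0.2, 1.5, 3.1, 6.0]
other_qgt = [0.4, 1.1, 5.2, 2.9]
print(round(spearman_rho(bovine_qgt, other_qgt), 3))
```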
ANOVA analysis
Analysis of variance was carried out to determine the overall amount of tissue-specific expression in the dataset. The TPM matrix was used in conjunction with the limma (Smyth 2004) and puma (Pearson et al. 2009) packages in R/Bioconductor to calculate the tissue effects. False discovery rate thresholds of 0.05 and 0.01 were used.
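The text does not state which false discovery rate procedure was applied. As an illustration only, the sketch below implements the Benjamini-Hochberg adjustment, a widely used choice, on a handful of invented p-values; it does not reproduce the limma or puma model fits themselves.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Invented per-gene p-values for a tissue effect
p_values = [0.001, 0.008, 0.039, 0.041, 0.6]
adjusted = benjamini_hochberg(p_values)
significant = sum(q <= 0.05 for q in adjusted)
print(adjusted, significant)
```

In this toy example the third and fourth genes have raw p-values below 0.05 but adjusted values just above it, so only two of the five genes would be reported at a threshold of 0.05.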
4. Results
Of the 24,616 annotated genes, 23,818 were expressed at a level of one read in at least one of the 280 samples (22,963 genes with ≥ 5 reads). Of these, 17,893 were expressed at a level of ≥ 10 tags per million (TPM). Embryos showed the greatest number of expressed genes and muscle had the fewest (Table 2). There were 1,838 genes in the Ensembl 65 annotation that were not detected in any of the samples. A full list of genes and their expression levels in tissues is shown in Supplementary Table S1.
Clustering of samples
Initial clustering of the samples both by hierarchical clustering (Figure 1) and Principal Components Analysis (Figure 2) showed almost perfect grouping of samples by tissue. The exceptions were some of the cervical samples that initially grouped with the theca samples. Further analysis revealed that this was due to very high expression of several collagen and basement membrane genes (see Supplementary Table S1) in a number of cervical and thecal tissue samples. We hypothesized that this could have been from inclusion of some connective tissue in these samples. These transcripts were removed and TPMs were recalculated based on the new sample totals (as the denominator of the TPM calculation is affected by these highly expressed genes). Following this filtering, perfect grouping of samples by tissue was achieved. It is notable that the clustering by tissue also held for the paired theca and granulosa samples (which came from the same animals).
Overall tissue-specific expression
The results of the Analysis of Variance (ANOVA)/Differentially Expressed Genes (DEG) analysis of gene expression among the 10 tissue types indicate that 95.8% of genes were differentially expressed between tissues at a False Discovery Rate (FDR) of 0.05 and 93.6% at an FDR of 0.01. The principal components analysis showed clustering together of samples from the same tissue and very specific separation of samples along consecutive principal components. This was reflected by the separation of these samples in PC1 and PC2. The first 2 principal components accounted for 33% of total variance; the first 10 principal components accounted for 89%.
Gini Index
The Gini index for each tissue is shown in Table 3. The gene expression was skewed in all tissues. The most extreme skew was seen in the muscle tissue (Gini coefficient 0.96). The top 10 most highly expressed genes in muscle made up 34% of all gene-aligned reads, while the top gene (ENSBTAG00000046332, actin, alpha skeletal muscle) on its own accounted for 11% of all gene-aligned reads. The tissue with the lowest Gini coefficient was the endometrium, where the top 10 genes constituted just 3% of total gene-aligned reads and the most highly expressed gene (ENSBTAG00000021466, collagen alpha-1(III) chain) accounted for 0.5% of total reads.
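The share of reads taken by the most highly expressed genes, quoted above for muscle and endometrium, is a simple companion statistic to the Gini coefficient. A minimal sketch follows, using invented counts and a hypothetical function name.

```python
def top_k_share(counts, k=10):
    """Fraction of all gene-aligned reads assigned to the k most expressed genes."""
    ordered = sorted(counts, reverse=True)
    return sum(ordered[:k]) / sum(ordered)


# Invented per-gene read counts for a small hypothetical tissue
toy_counts = [50, 30, 10, 5, 5]
print(top_k_share(toy_counts, k=2))
```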
Tissue-specific gene expression
The full table of Qgt for each gene in each tissue as well as the overall entropy (Hg) for each gene is available in Supplementary Table S1.
Using the permutation analysis, 452 transcripts exhibited highly significant tissue-specific expression (< 0.404 bits entropy score) (Table 4). The top tissue-specific gene for each tissue (as determined by Qgt score) is shown in Figure 3. In many cases there was a large variance associated with the top gene and it would not have been ranked top using other types of analysis such as ANOVA.
Pathway analysis of tissue-specific genes
The most tissue-specific genes for each tissue (after permutation testing) were analyzed to identify over-represented pathways (see Table 4). In some cases, the number of genes was insufficient to detect pathways; however, in each case, inspection of the individual list confirmed the presence of genes, most of which have a previously reported biological function in the tissue in question. The tissue with the most highly specific genes was the liver with 196 genes. Tissues with 10 or fewer highly specific genes were the pituitary, granulosa, endometrium and theca (Table 4).
Pathway analysis of housekeeping genes
A total of 411 genes were identified as candidate housekeeping genes (high and not differentially expressed among tissues) based on entropy scores of ≥ 3.25 (Table 4). The pathways over-represented included mitochondrial pathways (and diseases related to mitochondrial dysfunction such as Huntington’s, Parkinson’s and Alzheimer’s) and basic cellular processes such as DNA repair, protein translation, protein degradation and various protein modifications (ubiquitination, methylation, neddylation).
Comparison of tissue specific genes between species
Of the 10 bovine tissues, 5 had matching samples in the human and mouse dataset generated by Schug et al. (2005). The matched tissues for human were liver, uterus (endometrium) and pituitary, while for mouse the matches were hypothalamus, liver, muscle and uterus. A total of 8,737 bovine genes were matched to the human probes and 8,914 to the mouse microarray probes. The most significant expression correlations were with the pituitary profiles (0.61) followed by muscle (0.42) and then liver (0.23). The expression correlation between the mouse endometrium and the bovine endometrium was marginally significant. The expression correlations between the bovine and human endometrium and hypothalamic tissues were non-significant.
5. Discussion
Comparison with other mammalian gene atlases
The current study presents the first RNA-seq-derived multi-tissue comparison of gene expression in bovine tissues. Our comparison is based on more biological replicates, greater sequencing depth and higher gene coverage than other comparisons and includes a focus on reproductive tissues. A bovine gene atlas (BGA) based on Illumina Digital Gene Expression (DGE) tags generated using the DpnII restriction enzyme has also been published (Harhay et al. 2010); it used 300 million tags (20 bp sequences) from 92 tissues and 3 cell lines. Similar to the current study, the tissue samples were from different animals, breeds and sexes. The DGE approach has several limitations compared with RNA-seq in that only a 16-17 bp tag (+4 bp recognition sequence) is generated per transcript, so information about the full transcript extent is missing. The shorter DGE tags are frequently non-unique and so information about some transcripts is lost. Five of the ten tissues included in this study were also profiled in the BGA (i.e. muscle, liver, pituitary, hypothalamus and leukocytes).
The archetype for this project is the original mouse Gene Atlas (Su et al. 2004) that profiled 61 tissues in mouse and 79 tissues in human using Affymetrix microarrays. It remains the most comprehensive profiling of mammalian tissues to date and is a popular resource for researchers attempting to determine the tissue expression profiles of specific genes. Other notable mammalian expression atlases include the Allen Brain Atlas (Lein et al. 2007), which profiles the expression of genes in the mouse brain via in situ hybridization, and the mouse brain atlas generated by Siddiqui et al. (2005) using the LongSAGE protocol on 72 individual tissues from mice. More recently the Human BodyMap project has conducted RNA-seq on 16 different human tissue types (GEO ID: GSE30611, http://www.ncbi.nlm.nih.gov/projects/geo/query/acc.cgi?acc=GSE30611) and a pig expression atlas (using microarrays) was generated based on 62 different tissues and cell types with a particular focus on the gastrointestinal tract (Freeman et al. 2012). Another group profiled 900 different regions of the human brain using expression microarrays (Hawrylycz et al. 2012). An equine RNA-seq atlas has also been published based on 8 different tissues (Coleman et al. 2010).
This study is set apart from those listed above by the higher overall number of samples per tissue and the range of experimental conditions under which these tissues were recovered. The perturbation of the tissue by developmental changes or experimental challenges provides a more realistic picture of the diversity of transcription in an individual tissue. Our use of information theory (entropy) to measure tissue specificity enables the identification of tissue-specific expression even in cases where the most tissue-specifically expressed genes are not expressed in all samples from the tissue in question. For example, aromatase (CYP19A1) is not expressed at all stages of granulosa tissue development but it is the gene most characteristic of this tissue (Stocco 2008; Walsh et al. 2012) and was confirmed as such in this study.
The nature of gene expression in different tissues
Transcription in different mammalian tissues is very variable and this is reflected in our bovine tissue data as evidenced by the Gini coefficient. In most tissues a small number of genes disproportionately contribute to the overall mRNA pool. The most extreme example of this in our dataset was the gene expression in the muscle tissue (Keady In preparation). This probably reflects the specialization of muscle tissue for a specific task (with few cell types), compared with more diverse and complex functions required by a tissue such as the endometrium (with many cell types). This may also explain why muscle was one of the earliest and most successful cases of identifying transcriptional regulatory control regions determining tissue-specific transcription (Wasserman and Fickett 1998).
Differential gene analysis reveals that almost all genes have an element of tissue-specific expression; however, it is usually a matter of different levels of expression in different tissues (i.e. along a continuum of expression rather than a digital pattern of presence or absence of gene expression). This is reflected in the fact that almost 95% of genes have some measurable component of tissue type contribution to expression level as determined by the ANOVA analysis. However, a small number of genes (452, see Table 4) were identified as having highly specific expression restricted to either 1 or 2 tissues. These low entropy/high information genes have very specific functions characteristic of their tissue of expression such as prolactin in the pituitary (Egli et al. 2010). The liver had many more of these types of genes than the other tissue types. There was also a similarly small number (411) of potential housekeeping genes exhibiting high entropy and little evidence of tissue-specific expression patterns. This was similar to the number of genes proposed as housekeeping genes by other groups (Warrington et al. 2000; Hsiao et al. 2001; Eisenberg and Levanon 2003). However, there was relatively little overlap in the housekeeping lists that we generated compared with these earlier human lists (24 and 22 housekeeping genes overlap with the data from Warrington et al., 2000 and Hsiao et al., 2001, respectively).
While it is tempting to consider these as potential reference genes for normalization processes, either for quantitative polymerase chain reaction (qPCR) or RNA-seq experiments, the addition of further tissues or similar tissues under differing conditions would reduce this number. The recent discovery of the impact of transcriptional amplification by c-Myc on global transcription would suggest that external standards (spike-ins) calibrated to cell number may be the only viable universal approach (Loven et al. 2012).
The Qgt values generated in this study indicate much higher tissue specificity than that indicated by
366
Schug and colleagues (Schug et al. 2005). This can be explained by several factors. Schug et al.
367
(2005) relied on the Gene Atlas dataset that profiled many more tissues using microarrays (79
368
human tissues of which 43 were used) but with an n=2 per tissue. RNA-seq avoids the coverage
369
issues present with microarrays and the digital nature of the data generated is well suited to entropy
370
calculations. It is likely that some of the genes that are considered highly tissue-specific in the
371
current study will become less so as additional tissue types are profiled. The tissue specificity
372
results for some of our bovine tissues were compared with equivalent tissues in human and mouse
373
where Qgt values were generated by Schug et al. (2005) with mixed results. The pituitary and
374
muscle tissues exhibited high correlation among the most specifically expressed genes, liver
375
showed an intermediate level of concordance and the hypothalamus and endometrium displayed
376
limited species similarities. Brawand et al. (2011) compared the expression of 6 tissues across 10
377
different mammalian species (Brawand et al. 2011). They detected differing rates of gene
378
expression changes in different tissues including the liver. This may reflect differing evolutionary
12
379
pressures on different tissues/organs. It suggests that the genes that are highly specific to pituitary
380
and muscle function are not changing as much as those involved in endometrium, liver and
381
hypothalamus.
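The entropy-based specificity measure discussed above can be illustrated with a short sketch. It assumes the definitions given by Schug et al. (2005): the relative expression of gene g in tissue t is p(t|g) = w(g,t) / Σ w(g,t), the gene entropy is H_g = −Σ p(t|g) log2 p(t|g), and the categorical tissue specificity is Q(g|t) = H_g − log2 p(t|g). The function names and toy expression values are hypothetical and are not taken from this study's pipeline.

```python
import math


def relative_expression(levels):
    """Convert per-tissue expression levels of one gene into proportions p(t|g)."""
    total = sum(levels.values())
    if total <= 0:
        raise ValueError("gene has no expression in any tissue")
    return {tissue: value / total for tissue, value in levels.items()}


def shannon_entropy(levels):
    """H_g in bits: high for evenly expressed genes, low for tissue-specific genes."""
    p = relative_expression(levels)
    return -sum(x * math.log2(x) for x in p.values() if x > 0)


def categorical_specificity(levels, tissue):
    """Q(g|t) = H_g - log2 p(t|g): values near zero indicate high specificity for `tissue`."""
    p = relative_expression(levels)
    if p[tissue] == 0:
        return math.inf  # gene not expressed in this tissue at all
    return shannon_entropy(levels) - math.log2(p[tissue])


# Toy example (hypothetical values): one evenly expressed gene, one liver-restricted gene.
even_gene = {"liver": 5.0, "muscle": 5.0, "pituitary": 5.0, "cervix": 5.0}
liver_gene = {"liver": 100.0, "muscle": 0.0, "pituitary": 0.0, "cervix": 0.0}
print(shannon_entropy(even_gene), categorical_specificity(even_gene, "liver"))
print(shannon_entropy(liver_gene), categorical_specificity(liver_gene, "liver"))
```

Under these definitions a housekeeping-like gene has entropy close to log2 of the number of tissues, whereas a gene restricted to one tissue has both entropy and Q close to zero for that tissue. This is also why adding further tissues can only erode apparent specificity.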

Overall completeness of the bovine transcriptome

It is remarkable that so much of the bovine genome is transcribed. Transcription from 97% of the
assembled bovine genome was detected in at least one sample. However, such gross percentages
can be misleading. Most of the transcription is at a very low level, and the read density in exonic
regions is orders of magnitude higher than in intergenic and intronic regions.
Most of the novel transcription units that were identified were single-exon transcripts. Of the
12,041 novel multi-exon transcripts, 4,494 have some overlap with genes mapped to the bovine
genome from other species such as human and mouse (as determined by coordinates available
from the UCSC xenoRef table) (Kent et al. 2002). However, the contribution of these novel reads
to the overall RNA-seq dataset is relatively low (less than 1%). This is in stark contrast to the
amount of RNA sequence derived from the known exonic regions: 28% of the reads fall within
known Ensembl exons and 68% fall within the entire gene span (including exons and introns).
While 30% of the reads fall outside these categories, they are distributed over a much larger portion
of the genome and the level of transcription is several orders of magnitude lower. Near-ubiquitous
transcription of the genome is not a new observation. Binding of RNA polymerase II (Pol II, the key
enzyme involved in transcription) to DNA is relatively non-specific, with as much as 90% of the
binding and resultant transcription suggested to be noisy and non-functional (Struhl 2007).
The Encyclopedia of DNA Elements (ENCODE) results from both the prototype phase (Birney et al.
2007) and the scale-up phase (Bernstein et al. 2012) have confirmed the ubiquity of transcription
but, controversially, have suggested that this transcription is functional. These claims have been
challenged by others (van Bakel et al. 2010; Eddy 2012). Despite the ENCODE claims, we
recommend a cautious and conservative interpretation of the low-abundance intergenic
transcription.

6. Conclusions

In this paper we have profiled 280 bovine samples from 10 different tissues using
RNA-seq. It is remarkable that so much of the bovine genome is transcribed, with 23,818 out of
24,616 (97%) genes of the assembled bovine genome being detected in at least one sample.
However, most of the transcription is at a very low level, giving rise to individual tissues with
highly characteristic patterns of gene expression, even among the majority of genes that are
expressed in all tissues (95.8% of genes were differentially expressed). We have shown that a
small number of genes contribute disproportionately to the mRNA pool in a given tissue; that
tissues have a limited number of uniquely expressed genes (ranging from 2 to 196 genes in this
study); and that 411 housekeeping genes were expressed at high levels in all 10 tissue types
examined. We conclude that it is the combined effects of the variable expression of large numbers
of genes (73 to 93% of the genome) and the specific expression of a small number of genes (less
than 1% of the genome) that contribute to determining the function of individual tissues.

7. Acknowledgements

The authors are grateful to Science Foundation Ireland (07/SRC/B1156) for funding a large portion
of this research.

Competing interests

None

References

Aboyoun, P., Pages, H., and Lawrence, M. (2015) GenomicRanges: Representation and manipulation of
genomic intervals. R package version 1.9.65.
Anders, S., and Huber, W. (2010) Differential expression analysis for sequence count data.
Genome Biology 11(10), R106
Anders, S., Reyes, A., and Huber, W. (2012) Detecting differential usage of exons from RNA-seq
data. Genome Research 22(10), 2008-2017
Bainbridge, M.N., Warren, R.L., Hirst, M., Romanuik, T., Zeng, T., Go, A., Delaney, A., Griffith,
M., Hickenbotham, M., Magrini, V., Mardis, E.R., Sadar, M.D., Siddiqui, A.S., Marra, M.A., and
Jones, S.J. (2006) Analysis of the prostate cancer cell line LNCaP transcriptome using a
sequencing-by-synthesis approach. BMC Genomics 7, 246
Bernstein, B.E., Birney, E., Dunham, I., Green, E.D., Gunter, C., and Snyder, M. (2012) An
integrated encyclopedia of DNA elements in the human genome. Nature 489(7414), 57-74
Birney, E., Stamatoyannopoulos, J.A., Dutta, A., Guigo, R., Gingeras, T.R., Margulies, E.H.,
Weng, Z., Snyder, M., Dermitzakis, E.T., Thurman, R.E., Kuehn, M.S., Taylor, C.M., Neph, S.,
Koch, C.M., Asthana, S., Malhotra, A., Adzhubei, I., Greenbaum, J.A., Andrews, R.M., Flicek, P.,
Boyle, P.J., Cao, H., Carter, N.P., Clelland, G.K., Davis, S., Day, N., Dhami, P., Dillon, S.C.,
Dorschner, M.O., Fiegler, H., Giresi, P.G., Goldy, J., Hawrylycz, M., Haydock, A., Humbert, R.,
James, K.D., Johnson, B.E., Johnson, E.M., Frum, T.T., Rosenzweig, E.R., Karnani, N., Lee, K.,
Lefebvre, G.C., Navas, P.A., Neri, F., Parker, S.C., Sabo, P.J., Sandstrom, R., Shafer, A., Vetrie,
D., Weaver, M., Wilcox, S., Yu, M., Collins, F.S., Dekker, J., Lieb, J.D., Tullius, T.D., Crawford,
G.E., Sunyaev, S., Noble, W.S., Dunham, I., Denoeud, F., Reymond, A., Kapranov, P., Rozowsky,
J., Zheng, D., Castelo, R., Frankish, A., Harrow, J., Ghosh, S., Sandelin, A., Hofacker, I.L.,
Baertsch, R., Keefe, D., Dike, S., Cheng, J., Hirsch, H.A., Sekinger, E.A., Lagarde, J., Abril, J.F.,
Shahab, A., Flamm, C., Fried, C., Hackermuller, J., Hertel, J., Lindemeyer, M., Missal, K., Tanzer,
A., Washietl, S., Korbel, J., Emanuelsson, O., Pedersen, J.S., Holroyd, N., Taylor, R., Swarbreck,
D., Matthews, N., Dickson, M.C., Thomas, D.J., Weirauch, M.T., Gilbert, J., Drenkow, J., Bell, I.,
Zhao, X., Srinivasan, K.G., Sung, W.K., Ooi, H.S., Chiu, K.P., Foissac, S., Alioto, T., Brent, M.,
Pachter, L., Tress, M.L., Valencia, A., Choo, S.W., Choo, C.Y., Ucla, C., Manzano, C., Wyss, C.,
Cheung, E., Clark, T.G., Brown, J.B., Ganesh, M., Patel, S., Tammana, H., Chrast, J., Henrichsen,
C.N., Kai, C., Kawai, J., Nagalakshmi, U., Wu, J., Lian, Z., Lian, J., Newburger, P., Zhang, X.,
Bickel, P., Mattick, J.S., Carninci, P., Hayashizaki, Y., Weissman, S., Hubbard, T., Myers, R.M.,
Rogers, J., Stadler, P.F., Lowe, T.M., Wei, C.L., Ruan, Y., Struhl, K., Gerstein, M., Antonarakis,
S.E., Fu, Y., Green, E.D., Karaoz, U., Siepel, A., Taylor, J., Liefer, L.A., Wetterstrand, K.A.,
Good, P.J., Feingold, E.A., Guyer, M.S., Cooper, G.M., Asimenos, G., Dewey, C.N., Hou, M.,
Nikolaev, S., Montoya-Burgos, J.I., Loytynoja, A., Whelan, S., Pardi, F., Massingham, T., Huang,
H., Zhang, N.R., Holmes, I., Mullikin, J.C., Ureta-Vidal, A., Paten, B., Seringhaus, M., Church, D.,
Rosenbloom, K., Kent, W.J., Stone, E.A., Batzoglou, S., Goldman, N., Hardison, R.C., Haussler,
D., Miller, W., Sidow, A., Trinklein, N.D., Zhang, Z.D., Barrera, L., Stuart, R., King, D.C., Ameur,
A., Enroth, S., Bieda, M.C., Kim, J., Bhinge, A.A., Jiang, N., Liu, J., Yao, F., Vega, V.B., Lee,
C.W., Ng, P., Yang, A., Moqtaderi, Z., Zhu, Z., Xu, X., Squazzo, S., Oberley, M.J., Inman, D.,
Singer, M.A., Richmond, T.A., Munn, K.J., Rada-Iglesias, A., Wallerman, O., Komorowski, J.,
Fowler, J.C., Couttet, P., Bruce, A.W., Dovey, O.M., Ellis, P.D., Langford, C.F., Nix, D.A.,
Euskirchen, G., Hartman, S., Urban, A.E., Kraus, P., Van Calcar, S., Heintzman, N., Kim, T.H.,
Wang, K., Qu, C., Hon, G., Luna, R., Glass, C.K., Rosenfeld, M.G., Aldred, S.F., Cooper, S.J.,
Halees, A., Lin, J.M., Shulha, H.P., Xu, M., Haidar, J.N., Yu, Y., Iyer, V.R., Green, R.D.,
Wadelius, C., Farnham, P.J., Ren, B., Harte, R.A., Hinrichs, A.S., Trumbower, H., Clawson, H.,
Hillman-Jackson, J., Zweig, A.S., Smith, K., Thakkapallayil, A., Barber, G., Kuhn, R.M.,
Karolchik, D., Armengol, L., Bird, C.P., de Bakker, P.I., Kern, A.D., Lopez-Bigas, N., Martin,
J.D., Stranger, B.E., Woodroffe, A., Davydov, E., Dimas, A., Eyras, E., Hallgrimsdottir, I.B.,
Huppert, J., Zody, M.C., Abecasis, G.R., Estivill, X., Bouffard, G.G., Guan, X., Hansen, N.F., Idol,
J.R., Maduro, V.V., Maskeri, B., McDowell, J.C., Park, M., Thomas, P.J., Young, A.C., Blakesley,
R.W., Muzny, D.M., Sodergren, E., Wheeler, D.A., Worley, K.C., Jiang, H., Weinstock, G.M.,
Gibbs, R.A., Graves, T., Fulton, R., Mardis, E.R., Wilson, R.K., Clamp, M., Cuff, J., Gnerre, S.,
Jaffe, D.B., Chang, J.L., Lindblad-Toh, K., Lander, E.S., Koriabine, M., Nefedov, M., Osoegawa,
K., Yoshinaga, Y., Zhu, B., and de Jong, P.J. (2007) Identification and analysis of functional
elements in 1% of the human genome by the ENCODE pilot project. Nature 447(7146), 799-816
Brawand, D., Soumillon, M., Necsulea, A., Julien, P., Csardi, G., Harrigan, P., Weier, M., Liechti,
A., Aximu-Petri, A., Kircher, M., Albert, F.W., Zeller, U., Khaitovich, P., Grutzner, F., Bergmann,
S., Nielsen, R., Paabo, S., and Kaessmann, H. (2011) The evolution of gene expression levels in
mammalian organs. Nature 478(7369), 343-8
Cock, P.J., Fields, C.J., Goto, N., Heuer, M.L., and Rice, P.M. (2010) The Sanger FASTQ file
format for sequences with quality scores, and the Solexa/Illumina FASTQ variants. Nucleic Acids
Research 38(6), 1767-71
Coleman, S.J., Zeng, Z., Wang, K., Luo, S., Khrebtukova, I., Mienaltowski, M.J., Schroth, G.P.,
Liu, J., and MacLeod, J.N. (2010) Structural annotation of equine protein-coding genes determined
by mRNA sequencing. Animal Genetics 41 Suppl 2, 121-30
Culhane, A.C., Thioulouse, J., Perriere, G., and Higgins, D.G. (2005) MADE4: an R package for
multivariate analysis of gene expression data. Bioinformatics 21(11), 2789-90
Eddy, S.R. (2012) The C-value Paradox, junk DNA and ENCODE. Current Biology 22(21), R898-R899
Edgar, R., Domrachev, M., and Lash, A.E. (2002) Gene Expression Omnibus: NCBI gene
expression and hybridization array data repository. Nucleic Acids Research 30(1), 207-10
Egli, M., Leeners, B., and Kruger, T.H. (2010) Prolactin secretion patterns: basic mechanisms and
clinical implications for reproduction. Reproduction 140(5), 643-54
Eisenberg, E., and Levanon, E.Y. (2003) Human housekeeping genes are compact. Trends in
Genetics : TIG 19(7), 362-5
Elsik, C.G., Tellam, R.L., Worley, K.C., Gibbs, R.A., Muzny, D.M., Weinstock, G.M., Adelson,
D.L., Eichler, E.E., Elnitski, L., Guigo, R., Hamernik, D.L., Kappes, S.M., Lewin, H.A., Lynn,
D.J., Nicholas, F.W., Reymond, A., Rijnkels, M., Skow, L.C., Zdobnov, E.M., Schook, L.,
Womack, J., Alioto, T., Antonarakis, S.E., Astashyn, A., Chapple, C.E., Chen, H.C., Chrast, J.,
Camara, F., Ermolaeva, O., Henrichsen, C.N., Hlavina, W., Kapustin, Y., Kiryutin, B., Kitts, P.,
Kokocinski, F., Landrum, M., Maglott, D., Pruitt, K., Sapojnikov, V., Searle, S.M., Solovyev, V.,
Souvorov, A., Ucla, C., Wyss, C., Anzola, J.M., Gerlach, D., Elhaik, E., Graur, D., Reese, J.T.,
Edgar, R.C., McEwan, J.C., Payne, G.M., Raison, J.M., Junier, T., Kriventseva, E.V., Eyras, E.,
Plass, M., Donthu, R., Larkin, D.M., Reecy, J., Yang, M.Q., Chen, L., Cheng, Z., Chitko-McKown,
C.G., Liu, G.E., Matukumalli, L.K., Song, J., Zhu, B., Bradley, D.G., Brinkman, F.S., Lau, L.P.,
Whiteside, M.D., Walker, A., Wheeler, T.T., Casey, T., German, J.B., Lemay, D.G., Maqbool,
N.J., Molenaar, A.J., Seo, S., Stothard, P., Baldwin, C.L., Baxter, R., Brinkmeyer-Langford, C.L.,
Brown, W.C., Childers, C.P., Connelley, T., Ellis, S.A., Fritz, K., Glass, E.J., Herzig, C.T.,
Iivanainen, A., Lahmers, K.K., Bennett, A.K., Dickens, C.M., Gilbert, J.G., Hagen, D.E., Salih, H.,
Aerts, J., Caetano, A.R., Dalrymple, B., Garcia, J.F., Gill, C.A., Hiendleder, S.G., Memili, E.,
Spurlock, D., Williams, J.L., Alexander, L., Brownstein, M.J., Guan, L., Holt, R.A., Jones, S.J.,
Marra, M.A., Moore, R., Moore, S.S., Roberts, A., Taniguchi, M., Waterman, R.C., Chacko, J.,
Chandrabose, M.M., Cree, A., Dao, M.D., Dinh, H.H., Gabisi, R.A., Hines, S., Hume, J., Jhangiani,
S.N., Joshi, V., Kovar, C.L., Lewis, L.R., Liu, Y.S., Lopez, J., Morgan, M.B., Nguyen, N.B.,
Okwuonu, G.O., Ruiz, S.J., Santibanez, J., Wright, R.A., Buhay, C., Ding, Y., Dugan-Rocha, S.,
Herdandez, J., Holder, M., Sabo, A., Egan, A., Goodell, J., Wilczek-Boney, K., Fowler, G.R.,
Hitchens, M.E., Lozado, R.J., Moen, C., Steffen, D., Warren, J.T., Zhang, J., Chiu, R., Schein, J.E.,
Durbin, K.J., Havlak, P., Jiang, H., Liu, Y., Qin, X., Ren, Y., Shen, Y., Song, H., Bell, S.N., Davis,
C., Johnson, A.J., Lee, S., Nazareth, L.V., Patel, B.M., Pu, L.L., Vattathil, S., Williams, R.L., Jr.,
Curry, S., Hamilton, C., Sodergren, E., Wheeler, D.A., Barris, W., Bennett, G.L., Eggen, A.,
Green, R.D., Harhay, G.P., Hobbs, M., Jann, O., Keele, J.W., Kent, M.P., Lien, S., McKay, S.D.,
McWilliam, S., Ratnakumar, A., Schnabel, R.D., Smith, T., Snelling, W.M., Sonstegard, T.S.,
Stone, R.T., Sugimoto, Y., Takasuga, A., Taylor, J.F., Van Tassell, C.P., Macneil, M.D.,
Abatepaulo, A.R., Abbey, C.A., Ahola, V., Almeida, I.G., Amadio, A.F., Anatriello, E., Bahadue,
S.M., Biase, F.H., Boldt, C.R., Carroll, J.A., Carvalho, W.A., Cervelatti, E.P., Chacko, E., Chapin,
J.E., Cheng, Y., Choi, J., Colley, A.J., de Campos, T.A., De Donato, M., Santos, I.K., de Oliveira,
C.J., Deobald, H., Devinoy, E., Donohue, K.E., Dovc, P., Eberlein, A., Fitzsimmons, C.J., Franzin,
A.M., Garcia, G.R., Genini, S., Gladney, C.J., Grant, J.R., Greaser, M.L., Green, J.A., Hadsell,
D.L., Hakimov, H.A., Halgren, R., Harrow, J.L., Hart, E.A., Hastings, N., Hernandez, M., Hu,
Z.L., Ingham, A., Iso-Touru, T., Jamis, C., Jensen, K., Kapetis, D., Kerr, T., Khalil, S.S., Khatib,
H., Kolbehdari, D., Kumar, C.G., Kumar, D., Leach, R., Lee, J.C., Li, C., Logan, K.M.,
Malinverni, R., Marques, E., Martin, W.F., Martins, N.F., Maruyama, S.R., Mazza, R., McLean,
K.L., Medrano, J.F., Moreno, B.T., More, D.D., Muntean, C.T., Nandakumar, H.P., Nogueira,
M.F., Olsaker, I., Pant, S.D., Panzitta, F., Pastor, R.C., Poli, M.A., Poslusny, N., Rachagani, S.,
Ranganathan, S., Razpet, A., Riggs, P.K., Rincon, G., Rodriguez-Osorio, N., Rodriguez-Zas, S.L.,
Romero, N.E., Rosenwald, A., Sando, L., Schmutz, S.M., Shen, L., Sherman, L., Southey, B.R.,
Lutzow, Y.S., Sweedler, J.V., Tammen, I., Telugu, B.P., Urbanski, J.M., Utsunomiya, Y.T.,
Verschoor, C.P., Waardenberg, A.J., Wang, Z., Ward, R., Weikard, R., Welsh, T.H., Jr., White,
S.N., Wilming, L.G., Wunderlich, K.R., Yang, J., and Zhao, F.Q. (2009) The genome sequence of
taurine cattle: a window to ruminant biology and evolution. Science 324(5926), 522-8
Flicek, P., Amode, M.R., Barrell, D., Beal, K., Brent, S., Carvalho-Silva, D., Clapham, P., Coates,
G., Fairley, S., Fitzgerald, S., Gil, L., Gordon, L., Hendrix, M., Hourlier, T., Johnson, N., Kahari,
A.K., Keefe, D., Keenan, S., Kinsella, R., Komorowska, M., Koscielny, G., Kulesha, E., Larsson,
P., Longden, I., McLaren, W., Muffato, M., Overduin, B., Pignatelli, M., Pritchard, B., Riat, H.S.,
Ritchie, G.R., Ruffier, M., Schuster, M., Sobral, D., Tang, Y.A., Taylor, K., Trevanion, S.,
Vandrovcova, J., White, S., Wilson, M., Wilder, S.P., Aken, B.L., Birney, E., Cunningham, F.,
Dunham, I., Durbin, R., Fernandez-Suarez, X.M., Harrow, J., Herrero, J., Hubbard, T.J., Parker, A.,
Proctor, G., Spudich, G., Vogel, J., Yates, A., Zadissa, A., and Searle, S.M. (2012) Ensembl 2012.
Nucleic Acids Research 40, D84-D90
Foley, C., Chapwanya, A., Creevey, C., Narciandi, F., Morris, D., Kenny, E., Cormican, P.,
Callanan, J.J., O'Farrelly, C., and Meade, K.G. (2012) Global endometrial transcriptomic profiling:
transient immune activation precedes tissue proliferation and repair in healthy beef cows. BMC
Genomics 13(1), 489
Forde, N., Duffy, G.B., McGettigan, P.A., Browne, J.A., Prakash Mehta, J., Kelly, A.K., Mansouri-Attia, N., Sandra, O., Loftus, B.J., Crowe, M.A., Fair, T., Roche, J.F., Lonergan, P., and Evans,
A.C. (2012) Evidence for an early endometrial response to pregnancy in cattle: both dependent
upon and independent of interferon tau. Physiological Genomics 44(16), 799-810
Freeman, T.C., Ivens, A., Baillie, J.K., Beraldi, D., Barnett, M.W., Dorward, D., Downing, A.,
Fairbairn, L., Kapetanovic, R., Raza, S., Tomoiu, A., Alberio, R., Wu, C., Su, A.I., Summers,
K.M., Tuggle, C.K., Archibald, A.L., and Hume, D.A. (2012) A gene expression atlas of the
domestic pig. BMC Biology 10, 90
Handcock, M.S., Morris, M., and ebrary Inc. (1999) 'Relative distribution methods in the social
sciences.' In Statistics for social science and public policy (Springer: New York) Available at
http://site.ebrary.com/lib/princeton/Doc?id=5008065
Harhay, G.P., Smith, T.P., Alexander, L.J., Haudenschild, C.D., Keele, J.W., Matukumalli, L.K.,
Schroeder, S.G., Van Tassell, C.P., Gresham, C.R., Bridges, S.M., Burgess, S.C., and Sonstegard,
T.S. (2010) An atlas of bovine gene expression reveals novel distinctive tissue characteristics and
evidence for improving genome annotation. Genome Biology 11(10), R102
Hawrylycz, M.J., Lein, E.S., Guillozet-Bongaarts, A.L., Shen, E.H., Ng, L., Miller, J.A., van de
Lagemaat, L.N., Smith, K.A., Ebbert, A., Riley, Z.L., Abajian, C., Beckmann, C.F., Bernard, A.,
Bertagnolli, D., Boe, A.F., Cartagena, P.M., Chakravarty, M.M., Chapin, M., Chong, J., Dalley,
R.A., Daly, B.D., Dang, C., Datta, S., Dee, N., Dolbeare, T.A., Faber, V., Feng, D., Fowler, D.R.,
Goldy, J., Gregor, B.W., Haradon, Z., Haynor, D.R., Hohmann, J.G., Horvath, S., Howard, R.E.,
Jeromin, A., Jochim, J.M., Kinnunen, M., Lau, C., Lazarz, E.T., Lee, C., Lemon, T.A., Li, L., Li,
Y., Morris, J.A., Overly, C.C., Parker, P.D., Parry, S.E., Reding, M., Royall, J.J., Schulkin, J.,
Sequeira, P.A., Slaughterbeck, C.R., Smith, S.C., Sodt, A.J., Sunkin, S.M., Swanson, B.E., Vawter,
M.P., Williams, D., Wohnoutka, P., Zielke, H.R., Geschwind, D.H., Hof, P.R., Smith, S.M., Koch,
C., Grant, S.G., and Jones, A.R. (2012) An anatomically comprehensive atlas of the adult human
brain transcriptome. Nature 489(7416), 391-9
Hsiao, L.L., Dangond, F., Yoshida, T., Hong, R., Jensen, R.V., Misra, J., Dillon, W., Lee, K.F.,
Clark, K.E., Haverty, P., Weng, Z., Mutter, G.L., Frosch, M.P., MacDonald, M.E., Milford, E.L.,
Crum, C.P., Bueno, R., Pratt, R.E., Mahadevappa, M., Warrington, J.A., Stephanopoulos, G., and
Gullans, S.R. (2001) A compendium of gene expression in normal human tissues. Physiological
Genomics 7(2), 97-104
Huang da, W., Sherman, B.T., and Lempicki, R.A. (2009) Bioinformatics enrichment tools: paths
toward the comprehensive functional analysis of large gene lists. Nucleic Acids Research 37(1), 1-13
Keady, S.M., Creevey, C., Kenny, D.A., and Waters, S.M. (In preparation) Transcriptional regulation
in M. longissimus dorsi during nutritional restriction and compensatory growth in Aberdeen Angus
steers using RNAseq technology.
Keibler, E., and Brent, M.R. (2003) Eval: a software package for analysis of genome annotations.
BMC Bioinformatics 4, 50
Kent, W.J., Sugnet, C.W., Furey, T.S., Roskin, K.M., Pringle, T.H., Zahler, A.M., and Haussler, D.
(2002) The human genome browser at UCSC. Genome Research 12(6), 996-1006
Lawrence, M., Gentleman, R., and Carey, V. (2009) rtracklayer: an R package for interfacing with
genome browsers. Bioinformatics 25(14), 1841-2
Lein, E.S., Hawrylycz, M.J., Ao, N., Ayres, M., Bensinger, A., Bernard, A., Boe, A.F., Boguski,
M.S., Brockway, K.S., Byrnes, E.J., Chen, L., Chen, T.M., Chin, M.C., Chong, J., Crook, B.E.,
Czaplinska, A., Dang, C.N., Datta, S., Dee, N.R., Desaki, A.L., Desta, T., Diep, E., Dolbeare, T.A.,
Donelan, M.J., Dong, H.W., Dougherty, J.G., Duncan, B.J., Ebbert, A.J., Eichele, G., Estin, L.K.,
Faber, C., Facer, B.A., Fields, R., Fischer, S.R., Fliss, T.P., Frensley, C., Gates, S.N., Glattfelder,
K.J., Halverson, K.R., Hart, M.R., Hohmann, J.G., Howell, M.P., Jeung, D.P., Johnson, R.A., Karr,
P.T., Kawal, R., Kidney, J.M., Knapik, R.H., Kuan, C.L., Lake, J.H., Laramee, A.R., Larsen, K.D.,
Lau, C., Lemon, T.A., Liang, A.J., Liu, Y., Luong, L.T., Michaels, J., Morgan, J.J., Morgan, R.J.,
Mortrud, M.T., Mosqueda, N.F., Ng, L.L., Ng, R., Orta, G.J., Overly, C.C., Pak, T.H., Parry, S.E.,
Pathak, S.D., Pearson, O.C., Puchalski, R.B., Riley, Z.L., Rockett, H.R., Rowland, S.A., Royall,
J.J., Ruiz, M.J., Sarno, N.R., Schaffnit, K., Shapovalova, N.V., Sivisay, T., Slaughterbeck, C.R.,
Smith, S.C., Smith, K.A., Smith, B.I., Sodt, A.J., Stewart, N.N., Stumpf, K.R., Sunkin, S.M.,
Sutram, M., Tam, A., Teemer, C.D., Thaller, C., Thompson, C.L., Varnam, L.R., Visel, A.,
Whitlock, R.M., Wohnoutka, P.E., Wolkey, C.K., Wong, V.Y., Wood, M., Yaylaoglu, M.B.,
Young, R.C., Youngstrom, B.L., Yuan, X.F., Zhang, B., Zwingman, T.A., and Jones, A.R. (2007)
Genome-wide atlas of gene expression in the adult mouse brain. Nature 445(7124), 168-76
Li, B., Ruotti, V., Stewart, R.M., Thomson, J.A., and Dewey, C.N. (2010) RNA-Seq gene
expression estimation with read mapping uncertainty. Bioinformatics 26(4), 493-500
Li, H., Handsaker, B., Wysoker, A., Fennell, T., Ruan, J., Homer, N., Marth, G., Abecasis, G., and
Durbin, R. (2009) The Sequence Alignment/Map format and SAMtools. Bioinformatics 25(16),
2078-9
Loven, J., Orlando, D.A., Sigova, A.A., Lin, C.Y., Rahl, P.B., Burge, C.B., Levens, D.L., Lee, T.I.,
and Young, R.A. (2012) Revisiting global gene expression analysis. Cell 151(3), 476-82
Mamo, S., Mehta, J.P., McGettigan, P., Fair, T., Spencer, T.E., Bazer, F.W., and Lonergan, P.
(2011) RNA sequencing reveals novel gene clusters in bovine conceptuses associated with
maternal recognition of pregnancy and implantation. Biology of Reproduction 85(6), 1143-51
Matthews, D., Waters, S.M., Creevey, C., Morris, D.G., Kenny, D.A., and Diskin, M.G. (In preparation) The
effect of severe short term dietary restriction on gene expression in the bovine hypothalamus using
next generation RNA sequencing technology.
McCabe, M.S., Waters, S.M., Morris, D.G., Kenny, D.A., Lynn, D.J., and Creevey, C.J. (2012)
RNA-seq analysis of differential gene expression in liver from lactating dairy cows divergent in
negative energy balance. BMC Genomics 13(1), 193
Morozov, I.Y., Jones, M.G., Gould, P.D., Crome, V., Wilson, J.B., Hall, A.J., Rigden, D.J., and
Caddick, M.X. (2012) mRNA 3' tagging is induced by nonsense-mediated decay and promotes
ribosome dissociation. Molecular and Cellular Biology 32(13), 2585-95
Mortazavi, A., Williams, B.A., McCue, K., Schaeffer, L., and Wold, B. (2008) Mapping and
quantifying mammalian transcriptomes by RNA-Seq. Nature Methods 5(7), 621-8
O'Loughlin, A., Lynn, D.J., McGee, M., Doyle, S., McCabe, M., and Earley, B. (2012)
Transcriptomic analysis of the stress response to weaning at housing in bovine leukocytes using
RNA-seq technology. BMC Genomics 13(1), 250
Pearson, R.D., Liu, X., Sanguinetti, G., Milo, M., Lawrence, N.D., and Rattray, M. (2009) puma: a
Bioconductor package for propagating uncertainty in microarray analysis. BMC Bioinformatics 10,
211
Pluta, K., McGettigan, P.A., Reid, C.J., Browne, J.A., Irwin, J.A., Tharmalingam, T., Corfield, A.,
Baird, A.W., Loftus, B.J., Evans, A.C., and Carrington, S.D. (2012) Molecular aspects of mucin
biosynthesis and mucus formation in the bovine cervix during the periestrous period. Physiological
Genomics 44(24), 1165-1178
Quinlan, A.R., and Hall, I.M. (2010) BEDTools: a flexible suite of utilities for comparing genomic
features. Bioinformatics 26(6), 841-2
Robinson, J.T., Thorvaldsdottir, H., Winckler, W., Guttman, M., Lander, E.S., Getz, G., and
Mesirov, J.P. (2011) Integrative genomics viewer. Nature Biotechnology 29(1), 24-6
Schug, J., Schuller, W.P., Kappen, C., Salbaum, J.M., Bucan, M., and Stoeckert, C.J., Jr. (2005)
Promoter features related to tissue specificity as measured by Shannon entropy. Genome Biology
6(4), R33
Shannon, C.E., and Weaver, W. (1949) 'The mathematical theory of communication.' (University
of Illinois Press: Urbana,) v (i.e. vii), 117 p.
Siddiqui, A.S., Khattra, J., Delaney, A.D., Zhao, Y., Astell, C., Asano, J., Babakaiff, R., Barber, S.,
Beland, J., Bohacec, S., Brown-John, M., Chand, S., Charest, D., Charters, A.M., Cullum, R.,
Dhalla, N., Featherstone, R., Gerhard, D.S., Hoffman, B., Holt, R.A., Hou, J., Kuo, B.Y., Lee, L.L.,
Lee, S., Leung, D., Ma, K., Matsuo, C., Mayo, M., McDonald, H., Prabhu, A.L., Pandoh, P.,
Riggins, G.J., de Algara, T.R., Rupert, J.L., Smailus, D., Stott, J., Tsai, M., Varhol, R., Vrljicak, P.,
Wong, D., Wu, M.K., Xie, Y.Y., Yang, G., Zhang, I., Hirst, M., Jones, S.J., Helgason, C.D.,
Simpson, E.M., Hoodless, P.A., and Marra, M.A. (2005) A mouse atlas of gene expression: large-scale digital gene-expression profiles from precisely defined developing C57BL/6J mouse tissues
and cells. Proceedings of the National Academy of Sciences of the United States of America
102(51), 18485-90
Smyth, G.K. (2004) Linear models and empirical bayes methods for assessing differential
expression in microarray experiments. Statistical Applications in Genetics and Molecular Biology
3, Article3
Stocco, C. (2008) Aromatase expression in the ovary: hormonal and molecular regulation. Steroids
73(5), 473-87
Struhl, K. (2007) Transcriptional noise and the fidelity of initiation by RNA polymerase II. Nature
Structural and Molecular Biology 14(2), 103-5
Su, A.I., Wiltshire, T., Batalov, S., Lapp, H., Ching, K.A., Block, D., Zhang, J., Soden, R.,
Hayakawa, M., Kreiman, G., Cooke, M.P., Walker, J.R., and Hogenesch, J.B. (2004) A gene atlas
of the mouse and human protein-encoding transcriptomes. Proceedings of the National Academy of
Sciences of the United States of America 101(16), 6062-7
Trapnell, C., Pachter, L., and Salzberg, S.L. (2009) TopHat: discovering splice junctions with
RNA-Seq. Bioinformatics 25(9), 1105-11
Trapnell, C., Williams, B.A., Pertea, G., Mortazavi, A., Kwan, G., van Baren, M.J., Salzberg, S.L.,
Wold, B.J., and Pachter, L. (2010) Transcript assembly and quantification by RNA-Seq reveals
unannotated transcripts and isoform switching during cell differentiation. Nature Biotechnology
28(5), 511-5
van Bakel, H., Nislow, C., Blencowe, B.J., and Hughes, T.R. (2010) Most "dark matter" transcripts
are associated with known genes. PLoS Biology 8(5), e1000371
Walsh, S.W., Mehta, J.P., McGettigan, P.A., Browne, J.A., Forde, N., Alibrahim, R.M., Mulligan,
F.J., Loftus, B., Crowe, M.A., Matthews, D., Diskin, M., Mihm, M., and Evans, A.C. (2012) Effect
of the metabolic environment at key stages of follicle development in cattle: focus on steroid
biosynthesis. Physiological Genomics 44(9), 504-17
Warrington, J.A., Nair, A., Mahadevappa, M., and Tsyganskaya, M. (2000) Comparison of human
adult and fetal expression and identification of 535 housekeeping/maintenance genes.
Physiological Genomics 2(3), 143-7
Wasserman, W.W., and Fickett, J.W. (1998) Identification of regulatory regions which confer
muscle-specific gene expression. Journal of Molecular Biology 278(1), 167-81
Additional Files:
Supplementary Table S1.
Table 1: Sequencing metrics for the 10 bovine tissue samples under investigation

| Tissue type | Breed of animals | Number of animals | Sex | Number of samples | Library type | Read length (base pairs) | Number of reads | Total tissue library size (Gbases) | GEO ID | Study | Original study PubMed ID |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Endometrium | Mixed breed beef | 20 | F | 20 | Single read | 36 | 119,950,155 | 4.3 | GSE56392 | Forde et al 2012 | 22759920 |
| Granulosa | Holstein-Friesian | 36 | F | 36 | Single read | 36 | 263,863,637 | 9.5 | GSE34317 | Walsh et al 2012 | 22414914 |
| Theca | Holstein-Friesian | 37 | F | 37 | Single read | 36 | 356,596,059 | 12.8 | GSE34317 | Walsh et al 2012 | 22414914 |
| Cervix | Mixed breed beef | 30 | F | 30 | Single read | 42 | 475,102,282 | 20.0 | GSE38225 | Pluta et al 2012 | 23092952 |
| Embryo (Day 5 to 19) | Mixed breed beef | 28 | Mixed | 28 | Pooled, single read | 84 | 664,986,589 | 55.9 | GSE56513 | Mamo et al 2011 | 21795669 |
| Leukocytes | Simmental | 16 | M | 55 | Single read | 36 | 1,229,589,572 | 44.3 | GSE37447 | O’Loughlin et al 2012 | 22708644 |
| Liver | Holstein-Friesian | 12 | F | 21 | Single read | 36 | 289,137,975 | 10.4 | GSE37544 | McCabe et al 2012 | 22607119 |
| Hypothalamus | Mixed breed beef | 23 | F | 23 | Single read | 42 | 594,937,671 | 24.9 | GSE49540 | Matthews et al (in prep.) | In preparation |
| Pituitary | Mixed breed beef | 3 pools | F | 3 | Pooled, single read | 42 | 106,119,419 | 4.5 | | Matthews et al (in prep.) | In preparation |
| Muscle | Aberdeen Angus | 27 | M | 27 | Paired end | 2x40 | 936,006,782 | 74.9 | | Keady et al (in prep.) | In preparation |
| Total | | | | 280 | | | 5,034,250,870 | 261.4 | | | |

GEO ID GSE48481 is also listed in the GEO ID column; its row (pituitary or muscle) cannot be recovered from this copy.

Table 2: Numbers of Ensembl annotated genes detected in each tissue type with ≥1 read, ≥5 reads (transcripts) and ≥10 tags per million reads (TPM). Note that there are 24,616 genes in total in the Ensembl bovine annotation.

| Tissue type | Number of transcripts detected (reads ≥ 1) | Number of transcripts detected (reads ≥ 5) | Number of transcripts detected (TPM ≥ 10) |
| --- | --- | --- | --- |
| Endometrium | 19,360 | 17,078 | 12,104 |
| Granulosa | 19,377 | 16,973 | 11,205 |
| Theca | 19,840 | 17,571 | 11,133 |
| Cervix | 20,287 | 18,186 | 12,596 |
| Embryo (Day 5 to 19) | 22,874 | 21,444 | 16,141 |
| Leukocytes | 19,772 | 17,632 | 10,205 |
| Liver | 19,141 | 16,740 | 9,782 |
| Hypothalamus | 21,259 | 19,006 | 11,996 |
| Pituitary | 18,465 | 15,861 | 10,080 |
| Muscle | 18,031 | 15,913 | 7,020 |
| All above tissues | 23,818 | 22,963 | 17,893 |

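The three detection thresholds used in Table 2 can be sketched as follows. This is a minimal illustration that assumes TPM is a simple scaling of each gene's read count by the total read count of the sample, multiplied by one million; the exact normalization used in the study is not reproduced here. The function names and the toy counts are hypothetical.

```python
def tags_per_million(counts):
    """Scale raw read counts per gene to tags per million (assumed: count / total * 1e6)."""
    total = sum(counts.values())
    if total == 0:
        return {gene: 0.0 for gene in counts}
    return {gene: count * 1_000_000 / total for gene, count in counts.items()}


def count_detected(counts, min_reads=None, min_tpm=None):
    """Count genes passing a raw-read threshold and/or a TPM threshold."""
    tpm = tags_per_million(counts)
    detected = 0
    for gene, count in counts.items():
        if min_reads is not None and count < min_reads:
            continue
        if min_tpm is not None and tpm[gene] < min_tpm:
            continue
        detected += 1
    return detected


# Toy sample (hypothetical counts) with one dominant gene and several weakly expressed genes.
sample = {"geneA": 999_980, "geneB": 15, "geneC": 4, "geneD": 1, "geneE": 0}
print(count_detected(sample, min_reads=1))   # genes with at least 1 read
print(count_detected(sample, min_reads=5))   # genes with at least 5 reads
print(count_detected(sample, min_tpm=10))    # genes at 10 TPM or more
```

Dividing each column of Table 2 by the 24,616 annotated genes gives the percentages quoted in the text, for example 18,031 / 24,616 ≈ 73% for muscle at the ≥1 read threshold.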
Table 3: Gini coefficient for each tissue. The first column shows the Gini coefficient for all 24,616 genes in Ensembl. The second column shows the Gini coefficient for all non-zero genes. The Gini coefficient is a measure of the unevenness in the distribution of the reads (transcripts) among the genes. Samples with a Gini coefficient of 0.00 would have an equal distribution of reads among all genes. Samples with a Gini coefficient of 1.00 would have all reads from a single gene. Tissues are listed in order of decreasing number of genes contributing to the total read count.

| Tissue | Gini (all genes) | Gini (all non-zero genes) |
| --- | --- | --- |
| Endometrium | 0.79 | 0.79 |
| Hypothalamus | 0.81 | 0.80 |
| Embryo | 0.83 | 0.82 |
| Cervix | 0.84 | 0.84 |
| Pituitary | 0.85 | 0.84 |
| Granulosa | 0.85 | 0.84 |
| Theca | 0.85 | 0.85 |
| Leukocytes | 0.87 | 0.87 |
| Liver | 0.89 | 0.89 |
| Muscle | 0.96 | 0.96 |

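The Gini coefficient described in the Table 3 caption can be computed as in the sketch below. It uses a standard rank-based estimator on sorted counts, G = 2 Σ i·x_i / (n Σ x_i) − (n + 1)/n; whether the study used this exact estimator is an assumption. The function name and the toy counts are hypothetical.

```python
def gini(values):
    """Gini coefficient of non-negative values (0 = perfectly even, near 1 = one value dominates)."""
    xs = sorted(values)
    n = len(xs)
    total = sum(xs)
    if n == 0 or total == 0:
        return 0.0
    # Rank-weighted sum over values sorted in ascending order, ranks starting at 1.
    weighted = sum(rank * x for rank, x in enumerate(xs, start=1))
    return 2 * weighted / (n * total) - (n + 1) / n


# Toy read counts per gene (hypothetical values).
even_counts = [5, 5, 5, 5]
skewed_counts = [0, 0, 0, 10]
print(gini(even_counts))                          # equal distribution of reads
print(gini(skewed_counts))                        # all reads from a single gene
print(gini([x for x in skewed_counts if x > 0]))  # restricted to non-zero genes
```

With this estimator a sample in which one gene out of n holds every read scores (n − 1)/n, which approaches 1.00 for a gene set as large as the 24,616 Ensembl genes. Filtering out zero-count genes before the call corresponds to the second column of Table 3.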
Table 4: List of over-represented KEGG pathways and DAVID functional clusters for the significantly tissue-specific genes. For those tissues with no over-represented pathways or clusters, the complete list of genes is shown. The final row of the table shows over-represented pathways among genes exhibiting a housekeeping profile.
Tissue
Total
sig
genes
Cervix
35
Extracellular region
Secretory Granule
Embryo
32
Protease
Proteinase inhibitor 12, Kunitz metazoa
Signal peptide
Regulation of transcription, DNA-dependent
Endometrium
3
Proenkephalin (PENK); Protease serine 16
(thymus) (PRSS16); Similar to thioesterase
superfamily member 5 (THEM5/Bt.33457)
Granulosa
4
Inhibin beta A (INHBA); Inhibin alpha (INHBB);
Aromatase (Cytochrome P450 XIX) (CYP19A1);
Follicle stimulating hormone receptor (FSHR)
Hypothalamus
36
Taurine and hypotaurine metabolism
Beta-alanine metabolism
Alanine, aspartate and glutamate metabolism
Butanoate metabolism
Type I diabetes mellitus
Neuron projection
Immunoglobulin V-set
Integral to membrane
Cell surface receptor linked signal
transduction
Membrane fraction
Leukocytes
35
Chemokine signaling
Cytokine-cytokine receptor interaction
Cell adhesion molecules (CAMs)
Primary immunodeficiency
Chemokine receptor activity
Plasma membrane
Integral to membrane
KEGG top pathways
DAVID top clusters
25
Genes (symbol)
Liver
196
Complement and coagulation cascades
Retinol metabolism
Drug metabolism
Steroid hormone biosynthesis
Metabolism of xenobiotics by cytochrome P450
Secreted
Enzyme inhibitor activity
Transition metal ion binding
Ion binding
Complement and coagulation cascades
Muscle
99
Tight junction
Viral myocarditis
Focal adhesion
Arrhythmogenic right ventricular cardiomyopathy
Hypertrophic cardiomyopathy
Myofibiril
Muscle organ development
Muscle protein
Striated muscle development
Sarcoplasmic reticulum
Pituitary
10
Theca
2
Housekeeping
411
Neuroactive ligand-receptor interaction
Follicle stimulating hormone beta (FSHB);
Glycoprotein hormones alpha polypeptide (CGA);
Gonadotrophin releasing hormone receptor
(GNRHR); Growth hormone (GH1)
Prolactin (PRL); Thyroid stimulating hormone
beta (TSHB); Similar to peptidyl-pro cis trans
isomerase (LOC782178); Pituitary specific
transcription factor (POU class 1 homeobox 1)
(POU1F1); Growth hormone releasing hormone
receptor (GHRHR); Immunoglobulin superfamily
member 1 (IGSF1)
CYPXVII, cythochrome P450 17A1 (CYP17A1);
Insulin-like 3 (leydig cell) (INSL3)
Huntington’s disease
Parkinson’s disease
Oxidative phosphorylation
Alzherimer’s disease
Proteasome
Amioacyl t-RNA synthesis
Nucleotide excision repair
Purine metabolism
RNA polymerase
Valine, leucine and isoleucine biosynthesis
Ubiquitin mediated proteolysis
Mitochondrion
Proteasome
Proteolysis
Translation/ribosome
DNA repair
Ubiquitin
Protein neddylation
tRNA aminoacylation
Methyltransferase activity
Figure 1: Hierarchical clustering of individual samples in ten bovine tissues.
Figure 2: Principal components plot showing the first two principal components of 10 bovine tissues.
Figure 3: Boxplots of transcripts per million (TPM) showing, for each tissue, the gene with the strongest tissue-specific signature, i.e. the lowest Qgt value. The last two plots show the two genes with the strongest housekeeping profile, i.e. ubiquitously expressed at the same high level in all tissues (highest entropy score Hg).
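The two scores named in the Figure 3 caption can be sketched in Python. The definitions of Hg and Qgt are not given in this section, so the code assumes the Shannon-entropy formulation of Schug et al. (2005): the expression of gene g is converted to relative expression p(t|g) across tissues, Hg is the entropy of that distribution, and Qgt = Hg - log2 p(t|g). The function name and the example expression profiles are hypothetical.

```python
import math


def tissue_specificity(expr):
    """Return (Hg, {tissue: Qgt}) for one gene.

    expr maps tissue name -> expression level (e.g. mean TPM).
    Assumed definitions: p(t|g) = expr[t] / sum(expr),
    Hg = -sum p log2 p, and Qgt = Hg - log2 p(t|g).
    """
    total = sum(expr.values())
    if total <= 0:
        raise ValueError("gene must be expressed in at least one tissue")
    p = {t: v / total for t, v in expr.items()}
    h_g = sum(-pt * math.log2(pt) for pt in p.values() if pt > 0)
    # A tissue with zero expression gets an infinite (least specific) score.
    q = {t: (h_g - math.log2(pt)) if pt > 0 else math.inf for t, pt in p.items()}
    return h_g, q


tissues = ["endometrium", "granulosa", "theca", "cervix", "embryo",
           "leukocytes", "liver", "hypothalamus", "pituitary", "muscle"]

# Hypothetical housekeeping-like gene: identical level in all ten tissues.
h_flat, q_flat = tissue_specificity({t: 50.0 for t in tissues})

# Hypothetical tissue-specific gene: expressed in granulosa only.
h_spec, q_spec = tissue_specificity(
    {t: (200.0 if t == "granulosa" else 0.0) for t in tissues}
)
print(round(h_flat, 3), round(h_spec, 3), round(q_spec["granulosa"], 3))
```

Under these assumed definitions, Hg reaches its maximum of log2(10) for a gene expressed equally in all ten tissues, and Qgt approaches zero for a gene expressed in a single tissue, matching the caption's use of the highest Hg for housekeeping genes and the lowest Qgt for tissue-specific genes.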