Materials and Methods

Tiling microarray and MPSS datasets
We obtained tiling microarray datasets for Arabidopsis thaliana (Arabidopsis)
and Oryza sativa ssp. japonica cv. Nipponbare (rice) from the National Center for
Biotechnology Information’s (NCBI) Gene Expression Omnibus (GEO) database [1,
2]; the series are named GSE2247 and GSE6996, respectively. Two tiling array
datasets exist for Arabidopsis: the first results, from Yamada et al. in 2003 [3],
and a second set of arrays described in 2005 by Stolc et al. [4].
We used the latter because it is similar in design to the rice tiling array.
Details of the Maskless Array Synthesizer (MAS) technology microarray
design and hybridization for the Arabidopsis genome tiling array experiment are
described in [4]. Briefly, 13 high-density arrays were constructed to represent the
five nuclear chromosomes and the genomes of the mitochondria and chloroplast. A
total of ~5 million 36-mer oligonucleotide probes, positioned every 46 nt on
average, were selected uniformly from the two strands of the non-repetitive region
of the chromosomal sequences (January 2002 release). The probes were synthesized
at a feature density of about 400,000 probes per array. The arrays were hybridized
with fluorescence-labelled cDNA reverse-transcribed from poly-adenylated mRNA
extracted from the T87 cultured cell line. All the probes were remapped to the
current release (The Arabidopsis Information Resource (TAIR) version 7) of the
Arabidopsis genome. The corresponding annotation file (TAIR7 gff3) was used for
downstream analysis. Fewer than 0.1% of the total ~5 million probes were lost or
mapped to multiple locations in the genome. The chromosome sequence files and the
annotation file were downloaded from TAIR’s ftp site [5].
The array design for rice is similar to the one described above for Arabidopsis
[6]. To summarize, 32 high-density arrays were constructed to represent a total of
~12.2 million probes chosen from the 12 chromosomes of the rice genome (TIGR
Rice Genome Annotation Release 1, September 2003 [7]). The rice array was
hybridized with pooled mRNA extracted from seedling root, seedling shoot, panicle
and suspension-cultured cells. The chromosome sequences and the annotation file
were obtained from The Institute for Genomic Research’s (TIGR) ftp site [8]. We
were able to map 95% of the 12.2 million total probes to the current version of the
rice genome (version 5); a small fraction of the probes mapped to more than five
locations in the genome and were not considered for the work described
here. Although the Arabidopsis and rice tiling microarray datasets come from
different laboratories, both arrays were constructed using the NimbleGen™ MAS
technology.
The MPSS (Massively Parallel Signature Sequencing) datasets for Arabidopsis
and rice were obtained from the Plant MPSS databases at the Delaware Biotechnology
Institute [9, 10]. The rice MPSS collection represents 249,990 distinct
transcripts in the form of short 17 bp tags obtained by sequencing 22
polyadenylated mRNA libraries. These include libraries from 12 different untreated
tissues and six abiotic stress treatments [11]. A set of 121,581 reliable tags from this
collection was mapped to the current version of the rice genome. Of the 80,978 tags
that mapped to the genome, 68,413 mapped to a unique location and the remaining
12,565 mapped to two or more locations; 11,633 of these 12,565 tags mapped to
two to ten locations. In total, counting every location of tags that map to
multiple locations (up to 10), we considered 100,274 tags or tag locations for
downstream analysis. The other 40,603 reliable tags were lost because they could
not be mapped to any location in the genome.
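The tag bookkeeping above can be checked with a short sketch. The function name, the input layout (a dictionary from tag to number of genomic hits) and the category labels are illustrative assumptions, not the pipeline used in this work; only the rice totals are taken from the text.

```python
# Rice totals quoted in the text.
RICE_RELIABLE = 121_581
RICE_UNIQUE = 68_413
RICE_MULTI = 12_565
RICE_MAPPED = RICE_UNIQUE + RICE_MULTI       # tags with at least one hit
RICE_UNMAPPED = RICE_RELIABLE - RICE_MAPPED  # tags with no hit


def classify_tags(hit_counts, max_locations=10):
    """Tally tags by number of genomic hits (hypothetical helper).

    hit_counts: dict mapping a tag sequence to its number of genomic hits.
    Tags with more than max_locations hits are dropped; every location of
    a retained tag counts towards 'locations_kept'.
    """
    summary = {"unmapped": 0, "unique": 0, "multi_kept": 0,
               "multi_dropped": 0, "locations_kept": 0}
    for n_hits in hit_counts.values():
        if n_hits == 0:
            summary["unmapped"] += 1
        elif n_hits == 1:
            summary["unique"] += 1
            summary["locations_kept"] += 1
        elif n_hits <= max_locations:
            summary["multi_kept"] += 1
            summary["locations_kept"] += n_hits
        else:
            summary["multi_dropped"] += 1
    return summary


if __name__ == "__main__":
    print(RICE_MAPPED, RICE_UNMAPPED)
    toy = {"tagA": 1, "tagB": 0, "tagC": 3, "tagD": 11, "tagE": 1}
    print(classify_tags(toy))
```

The two derived constants reproduce the 80,978 mapped and 40,603 unmapped rice tags reported above.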
To analyse MPSS transcripts from Arabidopsis, we started with a set of
118,801 reliable tags obtained from the Arabidopsis MPSS database. These transcripts
are derived from 17 libraries constructed using mRNA from diverse tissues, mutants
and treatments [12]. As for rice, these tags were mapped to the current version
of the A. thaliana genome build (TAIR version 7). Of the 118,801 tags, 107,070
mapped to a unique genomic location. A further 9,128 tags mapped to more than one
location; 8,956 of these mapped to between two and ten genomic locations and were
retained. In total, we considered 128,337 tags or tag locations for downstream
analysis. The remaining 2,603 tags were lost because they could not be mapped to
any location in the genome.
Scoring computationally predicted gene structures on the tiling microarray
We used probe intensities from the tiling array experiments for two goals. The
first was to determine how many of the computationally predicted exons and gene
structures we can identify as transcribed. We used the binomial distribution, in
the form of a sign test, to detect transcription in annotated gene models [13]. To
accomplish this, we first identified probes that lie within the exons of a given
gene model. For each probe, we checked whether its intensity was greater than the
median of the array (slide) it came from. For each gene model, we then determined
whether the number of probes above the median was more than expected by chance
alone. The probability, p, of obtaining h probes with intensities above the median
out of N probes is given by the
equation
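A sketch of this equation in its standard form, under two assumptions that are not stated in the text: that under the null hypothesis each probe independently exceeds its array median with probability 1/2, and that "obtaining h probes" is read as the one-sided tail "at least h probes":

```latex
p \;=\; \sum_{i=h}^{N} \binom{N}{i} \left(\frac{1}{2}\right)^{N}
```

Here N is the number of probes lying within the exons of the gene model and h is the number of those probes whose intensity exceeds the median of their array.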
We chose a false-positive rate of 5%. Thus, gene models with a p-value less than 0.05
were scored as transcribed. We followed this procedure and scored computationally
predicted gene structures in Arabidopsis and rice.
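The scoring step can be illustrated as follows. This is a minimal sketch that assumes a one-sided sign test with a null probability of 1/2 per probe; the function names and the input layout (pairs of probe intensity and the median of the array the probe came from) are hypothetical and are not taken from the original analysis code.

```python
from math import comb


def sign_test_pvalue(h, n):
    """One-sided tail P(X >= h) for X ~ Binomial(n, 1/2)."""
    if not 0 <= h <= n:
        raise ValueError("h must satisfy 0 <= h <= n")
    return sum(comb(n, i) for i in range(h, n + 1)) / 2 ** n


def score_gene_model(probes, alpha=0.05):
    """Score one gene model from its exonic probes.

    probes: list of (intensity, array_median) pairs, one per probe, so each
    probe is compared with the median of its own array (slide).
    Returns (h, n, p_value, is_transcribed).
    """
    n = len(probes)
    h = sum(1 for intensity, median in probes if intensity > median)
    p_value = sign_test_pvalue(h, n)
    return h, n, p_value, p_value < alpha


if __name__ == "__main__":
    # Toy gene model: 9 of 10 probes lie above their array median.
    toy = [(1200.0, 800.0)] * 9 + [(500.0, 800.0)]
    print(score_gene_model(toy))
```

With ten probes, nine above the median gives p = 11/1024 and the model is scored as transcribed, whereas eight above the median gives p = 56/1024, which is just over the 0.05 cutoff.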
Identifying transcriptionally active regions on the tiling microarray
Given the probe intensities and gene structure annotation, our second goal was
to identify regions of transcription outside exons. These regions are also known as
transcriptionally active regions (TARs) or transfrags [13, 14]. Since a probe or tile is
too short to constitute a transcribed region, we employed the scoring scheme
described in Cawley et al. [15] and Emanuelsson et al. [16] to identify TARs. In this
procedure, qualified tiles that are close to each other along the genomic sequence are
combined into features based on two parameters: (i) maxgap or maximum genomic
distance below which two adjacent qualified tiles can be joined and (ii) minrun, the
minimum length of a genomic region for it to be qualified as a feature. To identify
TARs in Arabidopsis and rice, the corresponding tile path was scanned for regions
of the genome that satisfy a maxgap of 20 nt and a minrun of 174 nt (at least four
consecutive probes). Only probes whose intensities lie at or above the 85th
percentile were considered when scanning the tile path.
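The maxgap/minrun scan can be sketched as below. This is an illustrative reimplementation, not the code of Cawley et al. or Emanuelsson et al.; the function name, the half-open coordinate convention, the use of intensity percentiles as input and the choice to join tiles whose gap is less than or equal to maxgap are all assumptions. With 36 nt probes placed every 46 nt, four consecutive probes span 3 x 46 + 36 = 174 nt, which matches the minrun used here.

```python
def find_tars(probes, threshold=85.0, maxgap=20, minrun=174):
    """Join qualified tiles into candidate TARs on one strand of one sequence.

    probes: iterable of (start, end, intensity_percentile) with half-open
    coordinates [start, end). A tile qualifies if its percentile is at or
    above the threshold. Qualified tiles separated by at most maxgap nt are
    merged, and merged runs shorter than minrun nt are discarded.
    Returns a list of (start, end) intervals.
    """
    qualified = sorted((s, e) for s, e, pct in probes if pct >= threshold)
    tars = []
    run_start = run_end = None
    for start, end in qualified:
        if run_start is not None and start - run_end <= maxgap:
            run_end = max(run_end, end)  # extend the current run
        else:
            if run_start is not None and run_end - run_start >= minrun:
                tars.append((run_start, run_end))
            run_start, run_end = start, end  # open a new run
    if run_start is not None and run_end - run_start >= minrun:
        tars.append((run_start, run_end))
    return tars


if __name__ == "__main__":
    # Toy tile path: 36 nt probes every 46 nt; the fifth probe is below threshold.
    percentiles = [90, 92, 88, 95, 10, 91, 93, 90]
    tiles = [(i * 46, i * 46 + 36, p) for i, p in enumerate(percentiles)]
    print(find_tars(tiles))
```

In the toy tile path the first four probes form a 174 nt run that is reported, while the last three qualified probes span only 128 nt and are discarded.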
Regression calculation of MPSS tag abundance vs tiling array intensities
We used the mean intensity percentile for each of the MPSS tags as described above
to see how well this measure of hybridization correlates with the abundance
measurement of the respective tags obtained from the MPSS experiments. There are
17 libraries in Arabidopsis and 22 libraries in rice with an abundance measurement
for all the reliable MPSS tags. We used normalized expression levels in "TPM"
(transcripts per million) and not the raw tag abundance for the 17-base MPSS
signatures. Hence, these expression measurements can be used directly for
comparison without any further processing. These normalized expression levels were
obtained from the MPSS databases for rice and Arabidopsis maintained at
http://mpss.udel.edu by Blake Meyers’ group. A detailed description of how these
normalized transcript abundance levels are derived is provided in
reference [12].
To be conservative, we considered only tags that map to a unique
genomic location. We performed regression of the log2-transformed abundance values
against the intensity percentile for unique MPSS tags in both rice (Additional File 3)
and Arabidopsis (Additional File 4) and found that they are not well correlated. This
observation was consistent when we calculated the regression for individual libraries,
for pooled libraries and for MPSS tags that overlap with tiling array TARs from the
two species. The correlation coefficient for the pooled 17 libraries in Arabidopsis was
0.27 (range, 0.15-0.42) and 0.13 for MPSS tags that overlap with TARs. For rice, the
correlation coefficient for the pooled 22 libraries was 0.25 (range, 0.13-0.26) and 0.13
for tags that overlap TARs. This suggests that the intensity value, as a measure of
hybridization deduced from tiling array experiments, cannot be confidently
translated into a measure of expression level for a given transcript. Further, the
tag position bias observed for the Classic and Signature MPSS methods compounds
this lack of correlation.
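The comparison of tag abundance with array intensity can be sketched as a Pearson correlation between log2(TPM) and intensity percentile. The function names and the input layout are hypothetical, and skipping tags with zero TPM (whose log2 is undefined) is an assumption; the text does not say how such tags were handled.

```python
from math import log2, sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return sxy / sqrt(sxx * syy)


def abundance_vs_intensity(tags):
    """Correlate log2(TPM) with mean intensity percentile.

    tags: iterable of (tpm, intensity_percentile) for uniquely mapped tags.
    Tags with zero TPM are skipped (assumption, see lead-in).
    """
    pairs = [(log2(tpm), pct) for tpm, pct in tags if tpm > 0]
    log_tpm = [a for a, _ in pairs]
    percentile = [b for _, b in pairs]
    return pearson_r(percentile, log_tpm)


if __name__ == "__main__":
    toy = [(1, 35.0), (2, 80.0), (4, 20.0), (8, 60.0), (0, 50.0)]
    print(round(abundance_vs_intensity(toy), 3))
```

The same function can be applied per library, to pooled libraries or to the subset of tags that overlap TARs by changing the list of tags passed in.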
References
1.
Edgar R, Domrachev M, Lash AE: Gene Expression Omnibus: NCBI gene
expression and hybridization array data repository. Nucleic Acids Res
2002, 30(1):207-210.
2.
Barrett T, Troup DB, Wilhite SE, Ledoux P, Rudnev D, Evangelista C, Kim
IF, Soboleva A, Tomashevsky M, Edgar R: NCBI GEO: mining tens of
millions of expression profiles--database and tools update. Nucleic Acids
Res 2007, 35(Database issue):D760-765.
3.
Yamada K, Lim J, Dale JM, Chen H, Shinn P, Palm CJ, Southwick AM, Wu
HC, Kim C, Nguyen M et al: Empirical analysis of transcriptional activity
in the Arabidopsis genome. Science 2003, 302(5646):842-846.
4.
Stolc V, Samanta MP, Tongprasit W, Sethi H, Liang S, Nelson DC, Hegeman
A, Nelson C, Rancour D, Bednarek S et al: Identification of transcribed
sequences in Arabidopsis thaliana by using high-resolution genome tiling
arrays. Proc Natl Acad Sci U S A 2005, 102(12):4453-4458.
5.
TAIR Genome Release 7 [http://www.arabidopsis.org/download/index.jsp]
6.
Li L, Wang X, Sasidharan R, Stolc V, Deng W, He H, Korbel J, Chen X,
Tongprasit W, Ronald P et al: Global identification and characterization of
transcriptionally active regions in the rice genome. PLoS ONE 2007,
2(3):e294.
7.
Ouyang S, Zhu W, Hamilton J, Lin H, Campbell M, Childs K, Thibaud-Nissen
F, Malek RL, Lee Y, Zheng L et al: The TIGR Rice Genome Annotation
Resource: improvements and new features. Nucleic Acids Res 2007,
35(Database issue):D883-887.
8.
TIGR Rice Genome Annotation Version 5
[ftp://ftp.tigr.org/pub/data/Eukaryotic_Projects/o_sativa/annotation_dbs/pseudomolecules/version_5.0/all.chrs]
9.
Nakano M, Nobuta K, Vemaraju K, Tej SS, Skogen JW, Meyers BC: Plant
MPSS databases: signature-based transcriptional resources for analyses
of mRNA and small RNA. Nucleic Acids Res 2006, 34(Database
issue):D731-735.
10.
Meyers BC, Lee DK, Vu TH, Tej SS, Edberg SB, Matvienko M, Tindell LD:
Arabidopsis MPSS. An online resource for quantitative expression
analysis. Plant Physiol 2004, 135(2):801-813.
11.
Nobuta K, Venu RC, Lu C, Belo A, Vemaraju K, Kulkarni K, Wang W, Pillay
M, Green PJ, Wang GL et al: An expression atlas of rice mRNAs and small
RNAs. Nat Biotechnol 2007, 25(4):473-477.
12.
Meyers BC, Tej SS, Vu TH, Haudenschild CD, Agrawal V, Edberg SB,
Ghazal H, Decola S: The use of MPSS for whole-genome transcriptional
analysis in Arabidopsis. Genome Res 2004, 14(8):1641-1653.
13.
Royce TE, Rozowsky JS, Bertone P, Samanta M, Stolc V, Weissman S,
Snyder M, Gerstein M: Issues in the analysis of oligonucleotide tiling
microarrays for transcript mapping. Trends Genet 2005, 21(8):466-475.
14.
Bertone P, Stolc V, Royce TE, Rozowsky JS, Urban AE, Zhu X, Rinn JL,
Tongprasit W, Samanta M, Weissman S et al: Global identification of
human transcribed sequences with genome tiling arrays. Science 2004,
306(5705):2242-2246.
15.
Cawley S, Bekiranov S, Ng HH, Kapranov P, Sekinger EA, Kampa D,
Piccolboni A, Sementchenko V, Cheng J, Williams AJ et al: Unbiased
mapping of transcription factor binding sites along human chromosomes
21 and 22 points to widespread regulation of noncoding RNAs. Cell 2004,
116(4):499-509.
16.
Emanuelsson O, Nagalakshmi U, Zheng D, Rozowsky JS, Urban AE, Du J,
Lian Z, Stolc V, Weissman S, Snyder M et al: Assessing the performance of
different high-density tiling microarray strategies for mapping
transcribed regions of the human genome. Genome Res 2007, 17(6):886-897.