SACBI-140
In Silico Analysis of Pseudogenes in the Genome of Arabidopsis thaliana
Chromosomes I, II and III: Implications for Microarray Data Analysis
Abstract:
We surveyed three of the five chromosomes of the model plant Arabidopsis thaliana for the
presence of homologues of pseudogenes recently reported in yeast. We introduce our application, Bison-Blast,
describe its capabilities as a sequence analysis tool, present our findings and discuss how
pseudogenes affect the analysis of microarray data. We found 381 “potential pseudogenes”
on chromosome I, 463 on chromosome II and 588 on chromosome III. Our results
suggest that the abundance of pseudogenes on chromosomes II and III is proportional to their size.
However, the low number of “potential pseudogenes” on chromosome I of A. thaliana suggests
that some chromosomes are more prone to the accumulation of pseudogenic sequences than
others. Using the Gene Ontology annotation system, which contains 4696 molecular function entries,
and after removing redundant sequences, we found only 19, 29 and 39 entries corresponding
to chromosomes I, II and III, respectively. For most of these entries, the gene functions were
putative or known, followed by unknown and hypothetical. Although we report only a preliminary
genome-wide scan of A. thaliana for homologues of yeast pseudogenes,
our findings have relevant implications for microarray data, since many
pseudogenes may be expressing a large number of non-functional proteins and thus adding a
considerable source of noise to microarray experiments.
1. Introduction:
Over the last century, genetic studies on a small number of organisms have played an important
role in understanding numerous biological processes. In recent years, however, genetic
research has shifted from how visible traits are transmitted to the study of genome structure at
the molecular level. Advances in robotics, miniaturization and parallelization of molecular
biology tools have led to the development of more sensitive, ultra-high-throughput analytical
devices that allow biological systems to be explored on a global scale. Genome
sequencing projects have proven to be a powerful and efficient approach for accessing the
complete gene structure of different organisms. Using this technology, an international
consortium released the genomic sequence of all 16 chromosomes constituting the nuclear
genome of yeast (Saccharomyces cerevisiae) laboratory strain S288C (Goffeau et al. 1996). This
information initiated a quest for more complex comparative sequence tools
that use the sequence, motifs and structure of known proteins and translated expressed sequence tags
(ESTs), with which the user queries a database and retrieves related sequences with user-specified
scores. Using data from functional or comparative genomic studies over the last five years,
previously non-annotated genes have been discovered in yeast (Velculescu et al. 1997; Cliften et al.
2001). However, as new sequencing techniques develop and more efficient computational tools
become available, new insights into the genomic structure of yeast have been published. Recently,
Kumar et al. (2002) and Harrison et al. (2002) reported a total of 137 new non-annotated genes
that represented 2% of the yeast genome. Of this gene set, 104 genes were <100 codons in
length. The same research group reported the existence of genomic DNA sequences that are similar
to normal genes but have been released from selective pressure (Kumar et al. 2002; Harrison and
Gerstein, 2002; Zhang et al. 2002). These disabled sequences (known as pseudogenes) have lost
gene function at the transcription or translation level (or both), since the sequence no longer
results in the production of a functional protein. A gene can be disabled
in many ways, e.g. creation of premature stop codons, disruptive frameshift mutations,
disablement of regulatory regions, and alterations in splice sites (Harrison and Gerstein, 2002).
From homology matches it has been reported that there may be up to a further 183 un-annotated
disabled pseudogenes in S. cerevisiae strain S288C (Harrison et al. 2002; Harrison and
Gerstein, 2002). These pseudogenes are characterized by the lack of introns, the presence of small
flanking direct repeats and a polyadenine tail near the 3’ end (Harrison and Gerstein, 2002). Even
more recent analyses suggest that the human genome could contain up to ~20,000 such sequences, with
more than half of them transcribed (Zhang et al. 2002).
The genome of the flowering plant Arabidopsis thaliana has five chromosomes with low
repetitive DNA content, representing a total of 120 Mbp. Chromosome I contains about
6,850 open reading frames (ORFs) covering about 300 protein families, 236 transfer RNAs
(tRNAs) and 12 small nuclear RNAs (Theologis et al. 2000). Chromosome III encodes
approximately 5,220 predicted genes. According to sequence comparison tools, over 60% of the
predicted ORFs in Arabidopsis match a paralogue somewhere else in the genome, and these
duplicated genes are organized into large syntenic blocks that might be several hundred ORFs
long (Stein, 2001). Indeed, one of the big surprises that emerged during the sequencing of
the mustard weed genome was evidence for several distinct large-scale duplication events in the
organism's past. However, one of the main limitations of comparative analysis tools is that most
of them are web based. Working with web browsers is limiting for two main reasons:
1) queries are restricted to the scope of the browser, and querying this information manually is
time consuming and tedious; 2) these applications limit the efficiency of the user when questions
of biological significance require querying data sets held at different locations.
Given the growing recognition of both the importance of genetic variation and the usefulness of model
organisms, it is important to attempt to derive from them principles about gene product interactions
that appear to be similar. Following a preliminary comparative genomic analysis, this paper examines
the implications of disabled pseudogenes of yeast and their homologues in Arabidopsis thaliana.
We introduce our application Bison-Blast, describe its capabilities as a genomic data analysis
tool and discuss how our results represent a new paradigm for the analysis of microarray data, as
these transcribed pseudogenes are considered an additional source of noise.
2. Methodology:
2.1. Sequence analysis:
The sequences of chromosomes I, II and III were downloaded from the Arabidopsis Sequence
Initiative (ftp://tairpub:tairpub@ftp.arabidopsis.org/). In addition, the sequences of 183 disabled
yeast pseudogenes were retrieved from the GeneCensus database
(http://bioinfo.mbb.yale.edu/genome/) and used as input for our BISON-BLAST application.
BISON-BLAST is a tool implemented in Java. It integrates, analyses and
visualizes DNA and amino acid sequences in GenBank, Swiss-Prot, EMBL or ASCII formats.
BISON-BLAST can be used in both Linux and Windows environments and is designed for the
analysis of medium to large numbers of sequences. Currently our tool uses the NCBI BLAST
algorithm, versions 2.0 and 2.2.2. In addition, BISON-BLAST runs our parallelized version of BLAST
2.0 (D-blast). Using a friendly graphical interface, the user can perform comparative sequence
analyses including blastn, blastp, blastx, tblastn and blastpgp; filter and parse the results; and
present them as a table that can be saved as an ASCII file and fed into an SQL pipeline. Details of the
BISON-BLAST GUI are presented in Figure 1a and Figure 1b.
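The filter-and-parse step described above can be illustrated with a minimal Python sketch. This is not BISON-BLAST code (the tool itself is written in Java). It assumes that the BLAST hits were exported in the standard 12-column tab-separated layout; whether BISON-BLAST uses this layout internally is not stated. The function name, the sample identifiers and the 0.001 cut-off are hypothetical.

```python
import csv
import io

# Column order of the 12-column tab-separated BLAST layout (an assumption
# about how the hits were exported; not taken from the paper).
FIELDS = ["query", "subject", "pct_identity", "length", "mismatches",
          "gap_opens", "q_start", "q_end", "s_start", "s_end",
          "evalue", "bit_score"]


def filter_hits(handle, max_evalue=1e-3):
    """Return hits with E-value <= max_evalue as a list of dicts."""
    hits = []
    for row in csv.reader(handle, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment lines
        hit = dict(zip(FIELDS, row))
        hit["evalue"] = float(hit["evalue"])
        hit["bit_score"] = float(hit["bit_score"])
        if hit["evalue"] <= max_evalue:
            hits.append(hit)
    return hits


# Hypothetical rows: yeast pseudogene queries against A. thaliana ORFs.
sample = (
    "YPSEUDO1\tAt1g13150\t41.2\t250\t140\t3\t1\t248\t10\t259\t2e-30\t130\n"
    "YPSEUDO2\tAt1g13150\t35.0\t120\t75\t2\t5\t124\t40\t160\t0.5\t32.1\n"
)
kept = filter_hits(io.StringIO(sample))
print([(h["query"], h["subject"], h["evalue"]) for h in kept])
```

Rows kept this way form the kind of table that could then be saved as ASCII or loaded into a database.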
2.2 Filtering and Molecular Function Assignment of Potential Pseudogenes:
The recent A. thaliana ontology release file containing 4696 molecular function entries was
retrieved from the Gene Ontology Consortium
(http://www.geneontology.org/cgi-bin/GO/downloadGOGA.pl/gene_association.tigr_ath). We used these entries
to determine whether there is any preferential type of “pseudogenic molecular function” among A. thaliana
sequences that matched multiple yeast pseudogenes. Duplicate entries sharing both
ATH_locus and molecular function were eliminated.
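The elimination of duplicate entries can be sketched as follows. This is only an illustration under assumptions: the paper does not say how entries were compared, so the case-insensitive matching used here, the function name and the sample function labels are all hypothetical.

```python
def unique_annotations(records):
    """Keep the first occurrence of each (locus, molecular function) pair."""
    seen = set()
    unique = []
    for locus, function in records:
        # Assumed normalization: ignore case and surrounding whitespace.
        key = (locus.strip().upper(), function.strip().lower())
        if key not in seen:
            seen.add(key)
            unique.append((locus, function))
    return unique


# Hypothetical annotation records (locus, molecular function).
records = [
    ("At1g13150", "cytochrome P450"),
    ("At1g13150", "Cytochrome P450"),  # same locus and function, different case
    ("At1g35730", "RNA binding"),
]
print(unique_annotations(records))
```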
3. Results and Discussion:
Comparison among genomes can be used for two purposes: inferring the phylogenetic
relationships of species, and estimating the number and type of genomic rearrangements that
have occurred since the genomes last shared a common ancestor. Based on sequence analysis of
disabled pseudogenes of yeast with homologues in the A. thaliana genome, we argue that around
9% of sequences annotated as genes are actually “potential pseudogenes”. We use the term
“potential pseudogenes” because their status can only be confirmed by laboratory experiments.
The number of matches of yeast pseudogenes on chromosomes I, II and III of Arabidopsis thaliana
was proportional to their size (Figure 2a). We found a total of 1432 “potential disabled
pseudogenes” on chromosomes I, II and III. This number included duplicate and triplicate
entries resulting from the same A. thaliana ORF matching different yeast pseudogenes. We found
381 “potential pseudogenes” on chromosome I (5074 ORFs), 463 on chromosome II
(4030 ORFs) and 588 on chromosome III (5987 ORFs) (Figure 2b). However, the low number of
“potential pseudogenes” on chromosome I suggests that in Arabidopsis some chromosomes
are more prone to the accumulation of pseudogenic sequences than others. Our assumption takes
into consideration that chromosome I has a 50% gene density, while chromosome III has
a 43% gene density (Theologis et al. 2000; The Arabidopsis Genome Initiative 2000). An
interesting aspect of our analysis is that most A. thaliana “potential pseudogenes” are
derived from in silico gene prediction and have not been experimentally confirmed.
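The proportions behind the "around 9%" estimate can be checked with a short calculation. The sketch below uses only the counts reported in the preceding paragraph; the variable names are illustrative.

```python
# Counts reported in the text: (potential pseudogenes, annotated ORFs).
counts = {
    "I": (381, 5074),
    "II": (463, 4030),
    "III": (588, 5987),
}

for chrom, (pseudo, orfs) in counts.items():
    print(f"Chromosome {chrom}: {pseudo}/{orfs} = {100 * pseudo / orfs:.1f}%")

total_pseudo = sum(p for p, _ in counts.values())
total_orfs = sum(o for _, o in counts.values())
print(f"Total: {total_pseudo}/{total_orfs} = {100 * total_pseudo / total_orfs:.1f}%")
```

This gives roughly 7.5%, 11.5% and 9.8% for chromosomes I, II and III, and 1432/15091, or about 9.5%, overall. Because the 1432 matches include duplicate and triplicate entries, these fractions are upper bounds on the share of distinct ORFs.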
Figure 2a. Number of yeast pseudogenes matching A. thaliana chromosomes I, II and III. (Series: Pseudogenes Chr I, Chr II and Chr III; vertical axis 0 to 180; horizontal axis 0.001 to 1.)
Figure 2b. Number of A. thaliana ORFs matching yeast pseudogenes. (Series: Potential pseudogenes Chr I, Chr II and Chr III; vertical axis 0 to 600; horizontal axis 0.001 to 1.)
Figure 3a. Distribution of gene function of the “potential pseudogenes” in A. thaliana
chromosome I. (Categories: Hypothetical, Putative, Unknown, Not clear; slice values as extracted: 23%, 40%, 17%, 20%.)
Figure 3b. Distribution of gene function of the “potential pseudogenes” in A. thaliana
chromosome II. (Categories: Hypothetical, Putative, Unknown, Not clear; slice values as extracted: 11%, 23%, 18%, 48%.)
Figure 3c. Distribution of gene function of the “potential pseudogenes” in A. thaliana
chromosome III. (Categories: Hypothetical, Putative, Unknown, Not clear; slice values as extracted: 13%, 37%, 43%, 7%.)
Using the Gene Ontology annotation system, we found only 19, 29 and 39 entries corresponding to
chromosomes I, II and III, respectively (Tables 1, 2 and 3). The redundancy of molecular functions
across the chromosomes of A. thaliana can be attributed to a large number of segmental
chromosomal duplications arising from four distinct large-scale duplication events. In addition, most
annotations were putative or known, followed by unknown and hypothetical (Table 4). Our
analysis also agrees with previous reports about pseudogenes. For example, the P450s
form one of the largest families of proteins in higher plants. It has previously been established that the
A. thaliana genome contains 272 sequences with different P450 signature motifs, of which 26
appear to be pseudogenes that lack a complete open reading frame or contain frameshifts or in-frame stop codons (Werck-Reichhart et al. 2002). Our analysis of chromosomes I, II and III of
this model plant identified 16 “potential pseudogenes” within this gene family.
4. Implications of Pseudogenes in the analysis of Microarray data:
Biological entities are the result of the complex interplay of genetic make-up with the
environment (Kiberstis and Roberts, 2002). Since the information needed to make a protein is
contained in mRNA, one of the objectives of post-genomic techniques is to identify gene
function by quantifying mRNA abundance. Several techniques can be used, including northern
blots, RT-PCR, differential display PCR (DD-PCR), serial analysis of gene expression (SAGE),
massively parallel signature sequencing (MPSS), cDNA amplified fragment length polymorphism
(cDNA-AFLP), rapid analysis of gene expression (RAGE), macroarrays and microarrays. These
techniques are having a considerable impact in areas ranging from basic research to clinical
diagnostics. Among them, DNA microarrays are becoming one of the most widely used approaches. Owing
to their small size, high density and compatibility with fluorescence labeling, microarrays
are becoming ideal for comparative analysis of changes in gene expression
levels. In just a few years since their conception, DNA microarrays have produced a
paradigm shift that is transforming the understanding of gene expression changes. However, as
more laboratories use microarrays to study gene expression changes, the size and complexity of
public microarray data are growing exponentially.
Various computational and statistical methods have been proposed and developed by both public
and private initiatives for microarray data analysis. These tools range from simple criteria that
define gene expression changes by a fold-change cut-off to complex analyses using machine
learning techniques. Nevertheless, none has yet gained widespread acceptance. While clustering is
an unsupervised method widely used for microarray data analysis, choosing a clustering
algorithm can be a daunting task. There is no accurate approach for finding the true cluster
structure and therefore for objectively evaluating the “best” clustering method. This is due to the large
number of possible combinations of distance metrics and clustering algorithms.
Supervised methods represent an alternative to unsupervised microarray data analysis because they
take a different approach, one that uses previous knowledge about which genes are related to one
another. By having explicit knowledge of the classes to which the different objects belong, these
algorithms can perform effective feature selection. However, the more variables one models, the
more difficult the modeling task becomes. This is because the space that must be searched to find the
models increases exponentially with the number of model parameters and with the number of
variables they contain. For some datasets, these methods may not achieve a proper separation
because the kernel function is improperly defined or there are problems in the training set. Also, it is
often difficult to choose the kernel function, parameters and penalties.
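The simplest criterion mentioned above, a fold-change cut-off, can be sketched in a few lines. The two-fold threshold, the function name and the intensity values are assumptions for illustration; the loci are borrowed from Tables 1 to 3 only as identifiers. The sketch assumes positive, background-corrected intensities.

```python
import math


def fold_change_calls(expression, cutoff=2.0):
    """Label each gene by comparing log2(treated/control) with a symmetric cut-off."""
    threshold = math.log2(cutoff)
    calls = {}
    for gene, (control, treated) in expression.items():
        log_ratio = math.log2(treated / control)
        if log_ratio >= threshold:
            calls[gene] = "up"
        elif log_ratio <= -threshold:
            calls[gene] = "down"
        else:
            calls[gene] = "unchanged"
    return calls


# Hypothetical (control, treated) intensities.
expression = {
    "At1g59750": (100.0, 400.0),
    "At2g40220": (300.0, 100.0),
    "At3g51260": (200.0, 250.0),
}
print(fold_change_calls(expression))
```

Working on the log2 scale treats a two-fold increase and a two-fold decrease symmetrically.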
The difficulty of mining microarray data is exacerbated by the fact that there is no complete
understanding of gene interactions and that post-translational and folding changes
occurring after mRNA synthesis may alter protein-protein interactions. Multiple proteins can
arise from a single gene when its mRNA is subjected to alternative splicing or its product to post-translational
modification. The most relevant aspect of the information presented in this paper, which has
not been considered in previous reports studying pseudogenes, is their implications for microarray
data. If a well-characterized protein is known to be involved in the initiation of a biological
process, then it is likely that a protein predicted from the genomic sequence that is similar to the
known protein will have the same function. Our preliminary evaluations suggest that at least 10%
of the sequences defined as genes may code for non-functional proteins. Although these
signals can be detected using microarrays, they do not code for functional proteins, and we argue that they
should not be considered in the microarray data analysis process. Using a yeast two-hybrid
experiment, we eliminated disabled ORFs and noticed that the results of our predictions were
significantly improved (data not shown). Our results suggest that computational approaches for
microarray data need to take relevant biological information into consideration.
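Excluding flagged loci before downstream analysis could look like the sketch below. It is a hedged illustration, not the procedure used in this study: the function name and intensity values are hypothetical, and At1g01010 stands in for any locus that was not flagged.

```python
def drop_flagged_loci(expression, flagged_loci):
    """Return the expression rows whose locus is not in the flagged set."""
    flagged = {locus.upper() for locus in flagged_loci}
    return {locus: values for locus, values in expression.items()
            if locus.upper() not in flagged}


# Cytochrome P450 loci listed in Tables 1 to 3; intensities are made up.
potential_pseudogenes = ["At1g13150", "At2g02580", "At3g14620"]
expression = {
    "At1g13150": [120.0, 135.0],
    "At1g01010": [980.0, 1010.0],  # hypothetical locus that is not flagged
    "At3g14620": [45.0, 60.0],
}
print(sorted(drop_flagged_loci(expression, potential_pseudogenes)))
```

Applying the filter before clustering or classification keeps signals from potential pseudogenes out of the distance calculations altogether.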
Table 1. Gene ontology classification for potential pseudogenes on Chromosome I.
ATH Locus  Chromosome I Protein Family, Annotation
At1g59750  Auxin response transcription factor (ARF1)
At1g12860  bHLH protein, unknown protein
At1g13150  cytochrome P450, Putative
At1g20920  DEAD/DEAH box RNA helicase, Putative
At1g50180  Disease resistance protein (CC-NBS-LRR class), Putative
At1g58400  Disease resistance protein (CC-NBS-LRR class), Putative
At1g56540  Disease resistance protein (TIR-NBS-LRR class), Putative
At1g14670  Endomembrane protein 70, Putative
At1g51370  F-box protein family, unknown protein
At1g59670  Glutathione transferase, Putative
At1g56530  Hypothetical protein
At1g33680  KH domain protein, Putative
At1g18750  MADS-box protein, Putative
At1g31640  MADS-box protein, Putative
At1g58220  Myb family protein, Putative
At1g18710  myb-related transcription factor mixta, Putative
At1g13730  NTF2-containing RNA-binding protein, Putative
At1g18170  Peptidylprolyl isomerase, Putative
At1g35730  Pumilio-family RNA-binding protein, Putative
At1g35750  Pumilio-family RNA-binding protein, Putative
At1g10430  Serine/threonine protein phosphatase, PP2A
At1g32270  Syntaxin, Putative
At1g21160  Translation initiation factor eIF-2, Putative
At1g04860  Ubiquitin-specific protease 2 (UBP2)
Table 2. Gene ontology classification for potential pseudogenes on Chromosome II.
ATH Locus  Chromosome II Protein Family
At2g40220  Abscisic acid-insensitive 4 (ABI4)
At2g28350  Auxin response transcription factor
At2g35940  BEL1-like homeobox 1 protein (BLH1), Putative
At2g31220  bHLH protein, unknown protein
At2g31370  bZIP transcription factor (POSF21)
At2g22950  Calcium-transporting ATPase 7
At2g28800  Chloroplast membrane protein (ALBINO3)(OXA1p)
At2g32950  COP1 regulatory protein
At2g02580  Cytochrome P450 family
At2g34490  Cytochrome P450 family
At2g04660  E3 ubiquitin ligase APC2, Putative
At2g25490  F-box protein family, AtFBL6
At2g45050  GATA zinc finger protein
At2g23800  Geranylgeranyl pyrophosphate synthase (GGPS2/GGPS5)
At2g26490  G-protein beta family
At2g32370  Homeodomain protein
At2g17950  Homeodomain transcription factor, WUSCHEL
At2g28240  Hydroxyproline-rich glycoprotein-related
At2g28620  Kinesin-related protein
At2g36890  Light-regulated myb protein, Putative
At2g26320  MADS-box protein
At2g33210  Mitochondrial chaperonin (HSP60)
At2g30790  Photosystem II oxygen-evolving complex 23 (OEC23), Putative
At2g18510  Pre-mRNA splicing factor SF3b, Putative
At2g29140  Pumilio-family RNA-binding protein, Putative
At2g29190  Pumilio-family RNA-binding protein, Putative
At2g29200  Pumilio-family RNA-binding protein, Putative
At2g46780  RRM-containing protein, Putative
At2g19380  RRM-containing RNA-binding protein, Putative
At2g18600  RUB1-conjugating enzyme, Putative
At2g27700  translation initiation factor eIF-2, Putative
At2g42270  U5 small nuclear ribonucleoprotein helicase, Putative
At2g03340  WRKY family transcription factor, Putative
At2g30590  WRKY family transcription factor, unknown protein
At2g47260  WRKY family transcription factor, Putative
Table 3. Gene ontology classification for potential pseudogenes on Chromosome III.
ATH Locus  Molecular Function, Annotation
At3g51260  20S proteasome alpha subunit D (PAD1)
At3g24650  Abscisic acid-insensitive protein 3 (ABI3)
At3g55460  Arginine/serine-rich protein SCL30
At3g54870  Armadillo-repeat-containing kinesin-related protein
At3g07340  bHLH protein
At3g60210  Chloroplast chaperonin 10, Putative
At3g14620  Cytochrome P450, Putative
At3g03300  DEAD/DEAH box helicase carpel factory-related
At3g09720  DEAD/DEAH box RNA helicase protein, Putative
At3g16840  DEAD/DEAH box RNA helicase, Putative
At3g25510  Disease resistance protein (TIR-NBS-LRR class), Putative
At3g51560  Disease resistance protein (TIR-NBS-LRR class), Putative
At3g47500  Dof zinc finger protein
At3g22980  Elongation factor Tu family protein
At3g19760  eukaryotic translation initiation factor 4A (eIF-4A), Putative
At3g12340  FKBP-type peptidyl-prolyl cis-trans isomerase, Putative
At3g06740  GATA zinc finger protein
At3g50870  GATA zinc finger protein
At3g24170  Glutathione reductase, Putative
At3g49180  G-protein beta family
At3g11650  Hairpin-induced protein, Putative
At3g61150  Homeodomain protein, GLABRA2 like 1 (HD-GL2-1)
At3g63480  Kinesin heavy chain, Putative
At3g10180  Kinesin-related protein
At3g12020  Kinesin-related protein, Putative
At3g16630  Kinesin-related protein TBK5, Putative
At3g27690  Light harvesting chlorophyll A/B binding protein, Putative
At3g54890  Light-harvesting chlorophyll a/b binding protein
At3g09940  Monodehydroascorbate reductase, Putative
At3g47680  Myb family protein
At3g06490  Myb family transcription factor (MYB108)
At3g48830  Poly(A) polymerase, Putative
At3g19980  Protein phosphatase
At3g16800  Protein phosphatase 2C (PP2C)
At3g05640  Protein phosphatase 2C (PP2C), Putative
At3g26120  RNA-binding terminal ear1 protein, Putative
At3g04500  RRM-containing protein
At3g52150  RRM-containing protein
At3g52400  Syntaxin SYP122
At3g26790  Transcriptional regulator (FUSCA3)
At3g07660  Unknown protein
At3g51290  Unknown protein
At3g04870  Zeta-carotene desaturase (ZDS), Putative
Table 4. Molecular function using the Gene Ontology annotation system
                Putative  Known  Unknown  Hypothetical
Chromosome I          12      4        2             1
Chromosome II         11     16        2             0
Chromosome III        21     18        0             0
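As a consistency check, the rows of Table 4 can be summed and compared with the entry totals of 19, 29 and 39 given in the text. The sketch below only restates the published counts; the variable names are illustrative.

```python
# Counts copied from Table 4, in column order:
# putative, known, unknown, hypothetical.
table4 = {
    "Chromosome I": (12, 4, 2, 1),
    "Chromosome II": (11, 16, 2, 0),
    "Chromosome III": (21, 18, 0, 0),
}
# Entry totals stated in the text for chromosomes I, II and III.
reported_totals = {"Chromosome I": 19, "Chromosome II": 29, "Chromosome III": 39}

totals = {chrom: sum(row) for chrom, row in table4.items()}
for chrom, row in table4.items():
    share = 100 * (row[0] + row[1]) / totals[chrom]
    print(f"{chrom}: total {totals[chrom]} (reported {reported_totals[chrom]}), "
          f"putative or known {share:.0f}%")
```

The row sums match the reported totals, and putative plus known annotations make up the large majority of entries on each chromosome.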
5. Conclusions:
Given the growing recognition of both the importance of genetic variation and the usefulness of
model organisms, it is important to attempt to derive principles about gene product interactions
that appear to be similar. We have shown that pseudogenes can be an important source of noise in
microarray experiments.
6. References:
Goffeau, A. et al. 1996. Life with 6000 genes. Science 274, 546, 563-567.
Velculescu, V.E. et al. 1997. Characterization of the yeast transcriptome. Cell 88, 243-251.
Cliften, P.F. et al. 2001. Surveying Saccharomyces genomes to identify functional elements by
comparative DNA sequence analysis. Genome Res. 11, 1175-1186.
Stein, L. 2001. Nature Reviews Genetics 2, 493-503.
Kiberstis, P.; Roberts, L. 2002. It's Not Just the Genes. Science 296, 685.