SACBI-140

In Silico Analysis of Pseudogenes in the Genome of Arabidopsis thaliana Chromosomes I, II and III: Implications for Microarray Data Analysis

Abstract: We surveyed three of the five chromosomes of the model plant Arabidopsis thaliana for homologues of the pseudogenes recently reported in yeast. We introduce our application, Bison-Blast, describe its capabilities as a sequence analysis tool, present our findings and discuss how pseudogenes affect the analysis of microarray data. We found 381 “potential pseudogenes” in chromosome I, 463 in chromosome II and 588 in chromosome III. Our results suggest that the abundance of pseudogenes in chromosomes II and III is proportional to their size. However, the low number of “potential pseudogenes” in chromosome I of A. thaliana suggests that some chromosomes are more prone to the accumulation of pseudogenic sequences than others. Using the Gene Ontology annotation system containing 4696 molecular function entries, and after the removal of redundant sequences, we found only 19, 29 and 39 entries corresponding to chromosomes I, II and III respectively. For most of these entries, the gene functions were putative and known, followed by unknown and hypothetical. Although we report only a preliminary genome-wide scan of A. thaliana for homologues of yeast pseudogenes, our report has relevant implications for microarray data, since many pseudogenes may be expressing a large number of non-functional proteins and adding a considerable source of noise to microarray experiments.

1. Introduction:

Over the last century, genetic studies on a small number of organisms have played an important role in understanding numerous biological processes. In recent years, however, genetic research has shifted from how visible traits are transmitted to the study of genome structure at the molecular level.
Advances in robotics, miniaturization and parallelization of molecular biology tools have led to the development of more sensitive, ultra-high-throughput analytical devices that allow biological systems to be explored on a global scale. Genome sequencing projects have proven to be a powerful and efficient approach for accessing the complete gene structure of different organisms. Using this technology, an international consortium released the genomic sequence of all 16 chromosomes constituting the nuclear genome of yeast (Saccharomyces cerevisiae) lab strain S288C (Goffeau et al. 1996). This information initiated a quest for more complex comparative sequence tools that use the sequence, motifs and structure of known proteins and translated expressed sequence tags (ESTs), in which the user queries a database and retrieves related sequences with user-specified scores. Using data from functional or comparative genomic studies over the last five years, previously non-annotated genes have been discovered in yeast (Velculescu et al. 1997; Cliften et al. 2001). As new sequencing techniques develop and more efficient computational tools become available, new insights about the genomic structure of yeast have been published. Recently, Kumar et al. (2002) and Harrison et al. (2002) reported a total of 137 new non-annotated genes that represented 2 % of the yeast genome. Of this gene set, 104 genes were <100 codons in length. The same research group reported the existence of genomic DNA sequences released from selective pressure that show similarity to normal genes (Kumar et al. 2002; Harrison and Gerstein, 2002; Zhang et al. 2002). These disabled sequences (known as pseudogenes) have lost gene function at the transcription or translation level (or both), since the sequence no longer results in the production of a functional protein. Pseudogenes result from disablement of a gene in many ways, e.g.
creation of premature stop codons, disruptive frameshift mutations, disablement of regulatory regions, and alterations in splice sites (Harrison and Gerstein, 2002). From homology matches it has been reported that there may be up to a further 183 un-annotated disabled pseudogenes in S. cerevisiae strain S288C (Harrison et al. 2002; Harrison and Gerstein, 2002). These pseudogenes are characterized by the lack of introns, the presence of small flanking direct repeats and a polyadenine tail near the 3’ end (Harrison and Gerstein, 2002). An even more recent analysis suggests that the human genome may contain up to ~20,000 pseudogenes, with more than half of them transcribed (Zhang et al. 2002).

The genome of the flowering plant Arabidopsis thaliana has five chromosomes with low repetitive DNA content, representing a total of 120 Mbp. Chromosome I contains about 6,850 open reading frames (ORFs) covering about 300 protein families, 236 transfer RNAs (tRNAs) and 12 small nuclear RNAs (Theologis et al. 2000). Chromosome III encodes approximately 5,220 predicted genes. Using sequence comparison tools, over 60% of the predicted ORFs in Arabidopsis were found to match a paralogue elsewhere in the genome, and these duplicated genes are organized into large syntenic blocks that may be several hundred ORFs long (Stein, 2001). Indeed, one of the big surprises that emerged during the sequencing of the mustard weed genome was evidence for several distinct large-scale duplication events in the organism's past.

However, one of the main limitations of comparative analysis tools is that most of them are web based. Working through a web browser is limiting for two main reasons: 1) queries are restricted to the scope of the browser, and querying this information manually is time consuming and tedious; 2) these applications limit the efficiency of the user when questions of biological significance require querying data sets held at different locations.
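To illustrate the kind of scripted, non-browser processing that these limitations motivate, the sketch below filters tabular BLAST hits offline and counts distinct subject ORFs at several E-value cutoffs. It is a minimal illustration, not the Bison-Blast implementation. It assumes the standard 12-column tabular layout written by NCBI BLAST+ with `-outfmt 6` (subject ID in column 2, E-value in column 11); the hit lines, the identifiers `pseudo_1` and `ORF_A`, the cutoff values and the function names are all invented for the example.

```python
# Hypothetical hits in 12-column tabular BLAST layout:
# query, subject, %identity, length, mismatches, gap opens,
# q.start, q.end, s.start, s.end, E-value, bit score.
SAMPLE_HITS = """\
pseudo_1\tORF_A\t41.2\t180\t95\t4\t1\t178\t10\t189\t2e-05\t52.0
pseudo_2\tORF_A\t38.0\t150\t88\t3\t5\t152\t20\t170\t0.004\t40.1
pseudo_1\tORF_B\t33.5\t120\t76\t2\t1\t119\t30\t150\t0.05\t35.0
pseudo_3\tORF_C\t30.1\t90\t60\t2\t1\t89\t1\t90\t0.6\t29.3
"""


def best_evalue_per_orf(tabular_text):
    """Map each subject ORF to the smallest E-value among its hits."""
    best = {}
    for line in tabular_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        orf, evalue = fields[1], float(fields[10])
        if orf not in best or evalue < best[orf]:
            best[orf] = evalue
    return best


def count_orfs_at_cutoffs(best, cutoffs=(0.001, 0.01, 0.1, 1.0)):
    """Count distinct ORFs whose best E-value is at or below each cutoff."""
    return {c: sum(1 for e in best.values() if e <= c) for c in cutoffs}


best = best_evalue_per_orf(SAMPLE_HITS)
counts = count_orfs_at_cutoffs(best)
print(counts)
```

Keeping only the best E-value per subject means that an ORF matching several query pseudogenes is counted once rather than once per match.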
Given the growing recognition of both the importance of genetic variation and the usefulness of model organisms, it is important to attempt to derive from them principles about gene product interactions that appear to be similar. After a preliminary comparative genomic analysis, this paper analyzes the implications of disabled yeast pseudogenes and their homologues in Arabidopsis thaliana. We introduce our application Bison-Blast, describe its capabilities as a genomic data analysis tool and discuss how our results represent a new paradigm for the analysis of microarray data, as these transcribed pseudogenes are considered an additional source of noise.

2. Methodology:

2.1. Sequence analysis:

The sequences of chromosomes I, II and III were downloaded from the Arabidopsis Sequence Initiative (ftp://tairpub:tairpub@ftp.arabidopsis.org/). In addition, the sequences of 183 disabled yeast pseudogenes were retrieved from the GeneCensus database (http://bioinfo.mbb.yale.edu/genome/) and used as input for our BISON-BLAST application. BISON-BLAST is a tool implemented in Java. It integrates, analyzes and visualizes DNA and amino acid sequences in GenBank, Swiss-Prot, EMBL or ASCII formats. BISON-BLAST runs in both Linux and Windows environments and is designed for the analysis of medium to large numbers of sequences. Currently our tool uses the NCBI BLAST algorithm versions 2.0 and 2.2.2. In addition, BISON-BLAST runs our parallelized version of BLAST 2.0 (D-blast). Through a graphical interface the user can perform comparative sequence analyses including blastn, blastp, blastx, tblastn and blastpgp; filter and parse the results; and present them as a table that can be saved as an ASCII file and incorporated into an SQL pipeline. Details of the BISON-BLAST GUI are presented in Figure 1a and Figure 1b.

2.2 Filtering and Molecular Function Assignment of Potential Pseudogenes:

The recent A.
thaliana ontology release file containing 4696 molecular function entries was retrieved from the Gene Ontology Consortium (http://www.geneontology.org/cgi-bin/GO/downloadGOGA.pl/gene_association.tigr_ath). We used these entries to determine whether any type of “pseudogenic molecular function” is preferentially represented among A. thaliana sequences that matched multiple yeast pseudogenes. Entries that were redundant for both ATH_locus and molecular function were eliminated.

3. Results and Discussion:

Comparison among genomes can be used for two purposes: inferring the phylogenetic relationships of species, and estimating the number and type of genomic rearrangements that have occurred since genomes last shared a common ancestor. Based on sequence analysis of disabled yeast pseudogenes with homologues in the A. thaliana genome, we argue that around 9 % of sequences annotated as genes are actually “potential pseudogenes”. We use the term “potential pseudogenes” since the only way to confirm their status is by experiments in the laboratory.

The number of matches of yeast pseudogenes in chromosomes I, II and III of Arabidopsis thaliana was proportional to their size (Figure 2a). We found a total of 1432 “potential disabled pseudogenes” in chromosomes I, II and III. This number included duplicate and triplicate entries resulting from the same A. thaliana ORF matching different yeast pseudogenes. We found 381 “potential pseudogenes” in chromosome I (5074 ORFs), 463 in chromosome II (4030 ORFs) and 588 in chromosome III (5987 ORFs) (Figure 2b). However, the low number of “potential pseudogenes” in chromosome I suggests that in Arabidopsis some chromosomes are more prone to the accumulation of pseudogenic sequences than others. Our assumption takes into consideration that chromosome I has a 50 % gene density, while chromosome III has a 43 % gene density (Theologis et al. 2000; The Arabidopsis Genome Initiative 2000).
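The proportions behind these figures can be checked with a few lines of arithmetic. The sketch below uses only the counts reported above; the variable and function names are illustrative. Because the 1432 matches include duplicate and triplicate entries, the percentages are upper bounds on the fraction of distinct ORFs.

```python
# (potential pseudogenes, ORFs) per chromosome, as reported in the text.
reported = {
    "I": (381, 5074),
    "II": (463, 4030),
    "III": (588, 5987),
}


def percent(hits, orfs):
    """Return hits as a percentage of ORFs."""
    return 100.0 * hits / orfs


per_chromosome = {name: percent(h, o) for name, (h, o) in reported.items()}
total_hits = sum(h for h, _ in reported.values())
total_orfs = sum(o for _, o in reported.values())
overall = percent(total_hits, total_orfs)

for name, pct in per_chromosome.items():
    print(f"Chromosome {name}: {pct:.1f}%")
print(f"Overall: {total_hits}/{total_orfs} = {overall:.1f}%")
```

This gives roughly 7.5 % for chromosome I, 11.5 % for chromosome II and 9.8 % for chromosome III, and 1432/15091, or about 9.5 %, overall, which is consistent with the "around 9 %" estimate and with chromosome I having the lowest share.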
An interesting aspect of our analysis is that most A. thaliana “potential pseudogenes” derive from in silico gene prediction and have not been experimentally determined.

Figure 1. Number of yeast pseudogenes matching A. thaliana chromosomes I, II and III. (Plot not reproduced. Series: Pseudogenes Chr I, Chr II, Chr III. Vertical axis: 0 to 180. Horizontal axis: 0.001, 0.01, 0.1, 1.)

Figure 1. Number of A. thaliana ORFs matching yeast pseudogenes. (Plot not reproduced. Series: Potential pseudogenes Chr I, Chr II, Chr III. Vertical axis: 0 to 600. Horizontal axis: 0.001, 0.01, 0.1, 1.)

Figure 3a. Distribution of gene function of the “potential pseudogenes” in the A. thaliana chromosome I. (Chart not reproduced. Legend: Hypothetical, Putative, Unknown, Not clear. Values as extracted: 23%, 40%, 17%, 20%.)

Figure 3a. Distribution of gene function of the “potential pseudogenes” in the A. thaliana chromosome II. (Chart not reproduced. Legend: Hypothetical, Putative, Unknown, Not clear. Values as extracted: 11%, 23%, 18%, 48%.)

Figure 3a. Distribution of gene function of the “potential pseudogenes” in the A. thaliana chromosome III. (Chart not reproduced. Legend: Hypothetical, Putative, Unknown, Not clear. Values as extracted: 13%, 37%, 43%, 7%.)

Using the Gene Ontology annotation system we found only 19, 29 and 39 entries corresponding to chromosomes I, II and III respectively (Tables 1, 2 and 3). The redundancy of molecular functions across the chromosomes of A. thaliana can be attributed to a large number of segmental chromosomal duplications arising from four distinct large-scale duplication events. The annotations were dominated by putative and known functions, followed by unknown and hypothetical (Table 4).

Our analysis also agrees with previous reports about pseudogenes. For example, the P450s form one of the largest families of proteins in higher plants. It has previously been established that the A. thaliana genome contains 272 sequences with different P450 signature motifs, of which 26 appear to be pseudogenes that lack a complete open reading frame or contain frameshifts or in-frame stop codons (Werck-Reichhart et al. 2002).
Our analysis of chromosomes I, II and III of this model plant identified 16 “potential pseudogenes” within this gene family.

4. Implications of Pseudogenes in the Analysis of Microarray Data:

Biological entities are the result of the complex interplay of the genetic make-up with the environment (Kiberstis and Roberts, 2002). Since the information needed to make a protein is contained in mRNA, one of the objectives of post-genomic techniques is to identify gene function by quantifying mRNA abundance. Several techniques can be used, including northern blots, RT-PCR, differential display PCR (DD-PCR), serial analysis of gene expression (SAGE), massively parallel signature sequencing (MPSS), cDNA amplified fragment length polymorphism (cDNA-AFLP), rapid analysis of gene expression (RAGE), macroarrays and microarrays. These techniques are having a considerable impact in several areas, from basic research to clinical diagnostics. Among them, DNA microarrays are becoming one of the most used approaches. Because of their small size, high density and compatibility with fluorescence labeling, microarrays are well suited to comparative analysis of changes in gene expression level. In just a few years since their conception, DNA microarrays have produced a paradigm shift that is transforming the understanding of gene expression changes.

However, as more laboratories use microarrays to study gene expression changes, the size and complexity of public microarray data are growing exponentially. Various computational and statistical methods for microarray data analysis have been proposed and developed by both public and private initiatives. These tools range from simple criteria that define gene expression changes by a fold-change cut-off to complex analyses using machine learning techniques. Nevertheless, none has yet gained widespread acceptance.
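The simplest of these criteria, a fold-change cut-off, can be sketched as follows. This is an illustrative example rather than a method used in this study: the two-fold default, the intensity values, the gene names and the function name are all assumptions. The optional `exclude` argument reflects the premise of this paper that loci flagged as potential pseudogenes could be removed before analysis.

```python
import math


def changed_genes(intensities, cutoff=2.0, exclude=frozenset()):
    """Return {gene: log2 ratio} for genes passing a fold-change cut-off.

    intensities maps gene -> (control, treated) signal, both > 0.
    Genes listed in exclude are skipped before the cut-off is applied.
    """
    threshold = math.log2(cutoff)
    selected = {}
    for gene, (control, treated) in intensities.items():
        if gene in exclude:
            continue
        log_ratio = math.log2(treated / control)
        # Keep genes changed by at least `cutoff`-fold in either direction.
        if abs(log_ratio) >= threshold:
            selected[gene] = log_ratio
    return selected


# Hypothetical intensities for three genes.
example = {
    "gene_a": (100.0, 400.0),  # four-fold increase
    "gene_b": (200.0, 210.0),  # essentially unchanged
    "gene_c": (800.0, 200.0),  # four-fold decrease
}

all_changed = changed_genes(example)
filtered = changed_genes(example, exclude={"gene_c"})
print(all_changed)
print(filtered)
```

Working on the log2 scale treats increases and decreases symmetrically, so a four-fold drop and a four-fold rise have the same magnitude.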
While clustering is an unsupervised method widely used for microarray data analysis, choosing a clustering algorithm can be a daunting task. There is no accurate way to find the true cluster structure, and therefore no objective way to evaluate the “best” clustering method. This is due to the large number of possible combinations of distance metrics and clustering algorithms. Supervised methods represent an alternative to unsupervised microarray data analysis because they take a different approach, one that uses prior knowledge about which genes are related to one another. With explicit knowledge of the classes to which the different objects belong, these algorithms can perform effective feature selection. However, the more variables one models, the more difficult the modeling task becomes. This is because the space that must be searched to find a model grows exponentially with the number of model parameters and with the number of variables it contains. For some datasets, these methods may not achieve a proper separation because the kernel function is improperly defined or there are problems in the training set. It is also often difficult to choose the kernel function, parameters and penalties.

The difficulty of mining microarray data is exacerbated by the fact that gene interactions are not completely understood, and that post-translational and folding changes occurring after mRNA synthesis may alter protein-protein interactions. Multiple proteins can arise from a single gene when the mRNA is subjected to alternative splicing or the protein to post-translational modification. The most relevant aspect of the information presented in this paper, which has not been considered in previous reports studying pseudogenes, is its implications for microarray data.
If a well-characterized protein is known to be involved in the initiation of a biological process, then it is likely that a protein predicted from the genomic sequence that is similar to the known protein will have the same function. Our preliminary evaluations suggest that at least 10% of the sequences defined as genes may be coding for non-functional proteins. Although signals from these sequences can be detected using microarrays, they do not code for functional proteins, and we argue that they should not be considered in the microarray data analysis process. Using a yeast two-hybrid experiment we eliminated disabled ORFs and noticed that the results of our predictions were significantly improved (data not shown). Our results suggest that computational approaches for microarray data need to take relevant biological information into consideration.

Table 1. Gene ontology classification for potential pseudogenes on Chromosome I.

ATH Locus   Protein Family, Annotation
At1g59750   Auxin response transcription factor (ARF1)
At1g12860   bHLH protein, unknown protein
At1g13150   cytochrome P450, Putative
At1g20920   DEAD/DEAH box RNA helicase, Putative
At1g50180   Disease resistance protein (CC-NBS-LRR class), Putative
At1g58400   Disease resistance protein (CC-NBS-LRR class), Putative
At1g56540   Disease resistance protein (TIR-NBS-LRR class), Putative
At1g14670   Endomembrane protein 70, Putative
At1g51370   F-box protein family, unknown protein
At1g59670   Glutathione transferase, Putative
At1g56530   Hypothetical protein
At1g33680   KH domain protein, Putative
At1g18750   MADS-box protein, Putative
At1g31640   MADS-box protein, Putative
At1g58220   Myb family protein, Putative
At1g18710   myb-related transcription factor mixta, Putative
At1g13730   NTF2-containing RNA-binding protein, Putative
At1g18170   Peptidylprolyl isomerase, Putative
At1g35730   Pumilio-family RNA-binding protein, Putative
At1g35750   Pumilio-family RNA-binding protein, Putative
At1g10430   Serine/threonine protein phosphatase, PP2A
At1g32270   Syntaxin, Putative
At1g21160   Translation initiation factor eIF-2, Putative
At1g04860   Ubiquitin-specific protease 2 (UBP2)

Table 2. Gene ontology classification for potential pseudogenes on Chromosome II.

ATH Locus   Protein Family
At2g40220   Abscisic acid-insensitive 4 (ABI4)
At2g28350   Auxin response transcription factor
At2g35940   BEL1-like homeobox 1 protein (BLH1), Putative
At2g31220   bHLH protein, unknown protein
At2g31370   bZIP transcription factor (POSF21)
At2g22950   Calcium-transporting ATPase 7
At2g28800   Chloroplast membrane protein (ALBINO3)(OXA1p)
At2g32950   COP1 regulatory protein
At2g02580   Cytochrome p450 family
At2g34490   Cytochrome p450 family
At2g04660   E3 ubiquitin ligase APC2, Putative
At2g25490   F-box protein family, AtFBL6
At2g45050   GATA zinc finger protein
At2g23800   Geranylgeranyl pyrophosphate synthase (GGPS2/GGPS5)
At2g26490   G-protein beta family
At2g32370   Homeodomain protein
At2g17950   Homeodomain transcription factor, WUSCHEL
At2g28240   Hydroxyproline-rich glycoprotein-related
At2g28620   Kinesin-related protein
At2g36890   Light-regulated myb protein, Putative
At2g26320   MADS-box protein
At2g33210   Mitochondrial chaperonin (HSP60)
At2g30790   Photosystem II oxygen-evolving complex 23 (OEC23), Putative
At2g18510   Pre-mRNA splicing factor SF3b, Putative
At2g29140   Pumilio-family RNA-binding protein, Putative
At2g29190   Pumilio-family RNA-binding protein, Putative
At2g29200   Pumilio-family RNA-binding protein, Putative
At2g46780   RRM-containing protein, Putative
At2g19380   RRM-containing RNA-binding protein, Putative
At2g18600   RUB1-conjugating enzyme, Putative
At2g27700   translation initiation factor eIF-2, Putative
At2g42270   U5 small nuclear ribonucleoprotein helicase, Putative
At2g03340   WRKY family transcription factor, Putative
At2g30590   WRKY family transcription factor, unknown protein
At2g47260   WRKY family transcription factor, Putative

Table 3. Gene ontology classification for potential pseudogenes on Chromosome III.
ATH Locus   Molecular Function, Annotation
At3g51260   20S proteasome alpha subunit D (PAD1)
At3g24650   Abscisic acid-insensitive protein 3 (ABI3)
At3g55460   Arginine/serine-rich protein SCL30
At3g54870   Armadillo-repeat-containing kinesin-related protein
At3g07340   bHLH protein
At3g60210   Chloroplast chaperonin 10, Putative
At3g14620   Cytochrome P450, Putative
At3g03300   DEAD/DEAH box helicase carpel factory-related
At3g09720   DEAD/DEAH box RNA helicase protein, Putative
At3g16840   DEAD/DEAH box RNA helicase, Putative
At3g25510   Disease resistance protein (TIR-NBS-LRR class), Putative
At3g51560   Disease resistance protein (TIR-NBS-LRR class), Putative
At3g47500   Dof zinc finger protein
At3g22980   Elongation factor Tu family protein
At3g19760   eukaryotic translation initiation factor 4A (eIF-4A), Putative
At3g12340   FKBP-type peptidyl-prolyl cis-trans isomerase, Putative
At3g06740   GATA zinc finger protein
At3g50870   GATA zinc finger protein
At3g24170   Glutathione reductase, Putative
At3g49180   G-protein beta family
At3g11650   Hairpin-induced protein, Putative
At3g61150   Homeodomain protein, GLABRA2 like 1 (HD-GL2-1)
At3g63480   Kinesin heavy chain, Putative
At3g10180   Kinesin-related protein
At3g12020   Kinesin-related protein, Putative
At3g16630   Kinesin-related protein TBK5, Putative
At3g27690   Light harvesting chlorophyll A/B binding protein, Putative
At3g54890   Light-harvesting chlorophyll a/b binding protein
At3g09940   Monodehydroascorbate reductase, Putative
At3g47680   Myb family protein
At3g06490   Myb family transcription factor (MYB108)
At3g48830   Poly(A) polymerase, Putative
At3g19980   Protein phosphatase
At3g16800   Protein phosphatase 2C (PP2C)
At3g05640   Protein phosphatase 2C (PP2C), Putative
At3g26120   RNA-binding terminal ear1 protein, Putative
At3g04500   RRM-containing protein
At3g52150   RRM-containing protein
At3g52400   Syntaxin SYP122
At3g26790   Transcriptional regulator (FUSCA3)
At3g07660   Unknown protein
At3g51290   Unknown protein
At3g04870   Zeta-carotene desaturase (ZDS), Putative

Table 4. Molecular function using the Gene Ontology annotation system.

                 Putative   Known   Unknown   Hypothetical
Chromosome I        12         4        2          1
Chromosome II       11        16        2          0
Chromosome III      21        18        0          0

5. Conclusions:

Given the growing recognition of both the importance of genetic variation and the usefulness of model organisms, it is important to attempt to derive principles about gene product interactions that appear to be similar. We have argued that pseudogenes can be an important source of noise in microarray experiments.

6. References:

Goffeau, A. et al. 1996. Life with 6000 genes. Science 274, 546, 563-567.
Velculescu, V.E. et al. 1997. Characterization of the yeast transcriptome. Cell 88, 243-251.
Cliften, P.F. et al. 2001. Surveying Saccharomyces genomes to identify functional elements by comparative DNA sequence analysis. Genome Res. 11, 1175-1186.
Stein, L. 2001. Nature Reviews Genetics 2, 493-503.
Kiberstis, P.; Roberts, L. 2002. It's Not Just the Genes. Science 296, 685.