Methods - Shiu Lab - Michigan State University

advertisement

Survey of Misannotations and

Pseudogenes in the Arabidopsis

Genome

By Tanmay Prakash

Research performed under the supervision of Dr. Shin-

Han Shiu and Dr. Kosuke Hanada

Department of Plant Biology,

Michigan State University

Acknowledgments:

I would like to thank Dr. Shin-Han Shiu guiding me along the research and giving me the opportunity.

I would like to thank Dr. Kosuke Hanada for also guiding me and helping me with my programming problems.

I would like to thank Dr. Melissa Lehti-Shiu for always being there to help with any questions I had

 I would like to thank Dr. Gail Richmond for giving me the opportunity to participate in the High School Honors

Science Programs at Michigan State University

Abstract

Misannotations are genes that have regions mislabeled as introns, exons, or untranslated regions. There are occasions where there are misannotations that are due to the existence of pseudogenes. This makes it difficult to conduct accurate research with this data. Pseudogenes lack purifying selective pressure and are useful for studying things such as neutral evolution. The purpose of this project is to find possible misannotations in the Arabidopsis thaliana genome and which of the possible misannotations are possible pseudogenes. This is done through the examination of sequence similarity in the ~25,000 genes of the

Arabidopsis to the 8,296 protein domain families to find possible misannotations.

Possible pseudogenes were found by checking the introns of the possible misannotations for stop codons. These analyses were done on the Calculon computer system. 299 possible misannotations were found and 51 genes of these were identified as possible pseudogenes.

Introduction

Pseudogenes are DNA sequences that no longer function but resemble the functional genes they once were (Torrents et al., 2003) .

There are two types of pseudogenes, processed and non-processed. Processed pseudogenes are formed by retrotransposition and comprise most of the pseudogenes in mammals. Non-processed pseudogenes are products of duplication of the entirety of portion of a segment of genes followed by mutations. Because polyploidiszation (the process of having more one sets of chromosomes) is common in plants, the majority of pseudogenes in plants are non-processed

(Blanc and Wolfe, 2004).

Pseudogenes are mainly identified by the existence of premature stop codons (Zhang et al., 2004). They can also be identified by their lack of purifying selective pressure. Since most of functional genes are subject to high purifying selection, synonymous substitutions (mutation in a codon that produces the same amino acid) tend to occur much more frequently than nonsynonymous substitutions (mutations result in different amino acids) in functional genes

(Torrents et al., 2003). In pseudogenes, however, there is low selective pressure so substitutions can be synonymous or nonsynonymous (mutations result in different amino acids). Based on these properties, the rates of nonsynonymous

(Ka) and synonymous (Ks) substitutions can be used as a measure of purifying selection pressure. Functional genes normally have Ka/Ks values that are significantly less than one. Genes whose Ka/Ks values are closer to 1 have a

higher likelihood to be pseudogenes. The Ka/Ks value is used as evidence of the existence of a pseudogene, though it is not a definite measure.

The annotation of a gene is the process of assigning its introns, exons, and un-translated regions. Gene prediction programs are generally used to annotate the genes. Whenever the programs come upon stop codons, they assume that the region is an intron. Pseudogenes contain premature stop often cause gene prediction programs to misannotate and not label pseudogenes. The proposed studies focus on the pseudogenes that are misannotated introns

(Figure 1). If part of a protein domain (folds of a protein that play a certain role

Figure 1 Possible Sequence Similarities

Exon

Intron

Exon Exon

Intron

Exon

Domain

Case 1

Exon

Intron

Exon

Domain

Case 2

Exon

Intron

Exon

Domain Domain

Case 3 and can appear in many

Case 4 different proteins) is found in the exons but not in the intron between the exons then for the intron is likely correctly annotated (Figure1, Case 1). However, if an

intron does contains part of a protein domain and the flanking exons contain parts of the same domain (Figure 1, Case 2), this intron is likely misannotated. If any stop codons are found in the introns, the misannotation is a potential pseudogene.

The criterion of sequence similarity in an intron and its flanking exons to the same domain was used to identify possible misannotations in the Arabidopsis genome. A search for premature stop codons in the introns of misannotated genes and an examination of signature of purifying selection were used to find pseudogenes in the genes of the Arabidopsis thaliana. Finding misannotations was important because misannotations can hinder. Moreover, it is expexted that pseudogenes data will be useful to to understand neutral evolution of

Arabidopsis thaliana (Li WH et al).

Methods

Misannotations were identified by examining the sequence similarity of the genes of Arabidopsis thaliana to all the protein domains. First, a computer program, written in UNIX shell script, used a protein query vs. translated database BLAST search (tblastn) to search for sequence similarity to each of the

8,296 protein domain families individually in the introns of the ~25,000 genes of

Arabidopsis. This job was split over 16 nodes on the Calculon to lessen the run time of the program. The 8,296 protein domain families were parsed out of the

Pfam-A database, downloaded from pfam.wustl.edu, and put into separated

FASTA format files and used as the queries for the searches. The introns of the

Arabidopsis genome, which were downloaded from www.arabidopsis.org, were

formatted and used as a nucleotide database for the search. The cut off expected value for the BLAST search was set to 1 to ensure that no misannotations could be missed. The results from the sequence similarity search were parsed out into separate files based on the domain to which the genes had sequence similarity. A HMMER search wasn’t used for the search for sequence similarity in the introns of the Arabidopsis thaliana because the program would take far too long to run. Then HMMER was used to search for sequence similarity to each of the 8,296 protein domain families in the coding sequence of the ~25,000 genes of Arabidopsis. The job was split over about 150

nodes of the Calculon and the resulting sequence similarity data was parsed out into one file.

A computer program then extracted genes that had sequence similarity to the same domain in the coding sequence and the introns. It did this by listing all the genes that had sequence similarity in the coding sequence and domains that they matched, and then checked if any sequence similarities in the introns had the same gene and domain pair. The introns with sequence similarities and all the exons of the genes with sequence similarity to the same domain in the coding sequence and the introns were mapped, with information about the location and matching domains, if any, of the introns and exons. These genes were further refined by another computer program, by examining the maps and extracting the genes that had sequence similarity to the same domain in an intron and its flanking exons. Genes that had sequence similarity to the same domain in an intron and its flanking exons were considered possible misannotations.

The search for pseudogenes used the possibly misannotated genes as the initial candidates. The region of sequence similarity in the introns of the misannotated genes, translated into an amino acid sequence and adjusted for best match by tblastn, were extracted from the BLAST output. These sequences were examined for premature stop codons, which were represented by asterisks and also evidence of a frameshift mutation. Genes that contained sequences with premature stop codons will be evaluated for their Ka/Ks values. Those genes whose Ka/Ks values are significantly close to 1 are identified as

pseudogenes; however, the Ka/Ks values still have to be evaluated, so the pseudogenes in this study are only identified by stop codons.

Results

The BLAST search for sequence similarities to any of the 8,296 protein domains in the introns of the Arabidopsis genes found 20,923 genes, because the expected value was set to 1 to prevent the culling out possible pseudogenes.

The HMMER search for sequence similarities to any of the 8,296 protein domains in the coding sequences of the Arabidopsis genes found 18,403 genes.

24,326 genes had a sequence similarity in either the introns or coding sequence.

Gene Domain

AT2G14130.1

PF02902.8

AT4G36000.1

PF00314.7

AT3G28958.1

PF02298.7

AT1G30590.1

PF05327.1

AT5G46000.1

PF01419.6

AT1G25210.1

PF07734.2

AT1G24880.1

PF07734.2

AT3G26590.1

PF01554.8

AT4G14770.1

PF03638.4

AT3G24320.1

PF01541.13

AT1G62305.2

PF03267.3

AT4G01265.1

PF05691.2

AT3G14120.1

PF04121.3

AT4G32910.1

PF07575.2

AT3G42580.1

PF02902.8

AT1G25054.1

PF07734.2

AT5G25990.1

PF03267.3

AT3G44130.1

PF07734.2

AT1G05820.1

PF04258.3

AT3G23670.2

PF06548.1

AT3G12980.1

PF02135.5

AT4G07530.1

PF07794.1

AT5G46070.1

PF02263.8

AT5G21030.1

PF02171.7

AT4G03830.1

PF04642.2

AT3G09100.2

PF03919.5

Table 1 Possible Pseudogenes and Their Protein domains

Gene Domain

AT1G44880.1

PF02902.8

AT1G14110.1

PF03254.3

AT5G11980.1

PF04124.2

AT4G36210.2

PF05277.2

AT3G17330.1

PF04146.5

AT5G42325.1

PF07500.3

AT4G09280.1

PF02902.8

AT1G57570.1

PF01419.6

AT4G23370.1

PF03080.4

AT3G44500.1

PF02902.8

AT1G67460.1

PF03193.6

AT4G23080.1

PF03080.4

AT3G30640.1

PF02902.8

AT4G37950.1

PF06045.1

AT4G01380.1

PF02298.7

AT5G09920.1

PF03874.5

AT4G02390.1

PF05406.4

AT2G18810.1

PF03754.3

AT3G14120.2

PF04121.3

AT1G25141.1

PF07734.2

AT4G05280.1

PF02902.8

AT1G24793.1

PF07734.2

AT5G44316.1

PF01458.7

AT2G30350.2

PF01541.13

AT2G06860.1

PF02902.8

343 genes were found to have sequence similarity to the same domain in both the coding sequence and the introns. 299 genes were found to have sequence similarities to the same domain in an intron and its flanking exon, meaning they are most likely misannotated genes. This means that ~1.2% of the Arabidopsis genome is misannotated. There are a substantial number of full-length cDNA and ESTs for Arabidopsis thaliana and these sequences have been used to improve its annotation significantly (Yamada et al., 2003 Science), which is likely the reason that so few possible misannotations were found.

51 of these misannotated genes were found to have stop codons in at least 1 intron. This is evidence of frameshift occurring in the genes, and thus evidence that the possibly misannotated genes are possible pseudogenes. This means that of the 299 misannotations found, 17% of them are due to pseudogenes. The Ka/Ks value must still be evaluated to verify that these genes are indeed pseudogenes.

The genes of the Arabidopsis thaliana have an average of 5.1 exons. The possibly misannotated regions have an average of 8.0 exons. The possible misannotated genes have an average of 10.6 exons. A possible reason for the higher number of exons possible pseudogenes is that gene prediction programs generally misannotate pseudogenes by annotating an intron whenever the come upon stop codons, and consequently increasing the number of exons.

Misannotation Frequency

0.6

0.5

0.4

0.3

0.2

0.1

0

0 2000 4000 6000 8000

Number of Genes Matching Domain

10000

The graph above represents the number of genes that matched a domain the percentage of the matching genes that are possible misannotations. As it shows, misannotations occur in higher percentage with domains that have fewer genes matched to it.

Future Research

The possible pseudogenes that have been found through this research must be further verified through analysis of the Ka/Ks value. The pseudogenes and misannotated genes can be analyzed more to determine some of the properties that make genes prone to misannotation. The same process will also be applied to the rice genome. The rice genome is an interesting genome because it is relatively new and thus prone to misannotations and undiscovered pseudogenes. The entire process will also be fully automated to allow it to be applied to any genome and be easily used by other researchers.

Reference

Altschul, S.F., Gish, W., Miller, W., Myers, E.W. & Lipman, D.J. (1990) "Basic local alignment search tool."

J. Mol. Biol. 215:403-410

Arabidopsis Genome Initiative (2000). Analysis of the genome sequence of the flowering plant Arabidopsis thaliana.

Nature 408, 796–815.

David Torrents, Mikita Suyama, Evgeny Zdobnov and Peer Bork. A Genome-Wide

Survey of Human Pseudogenes.

Genome Research 13:2559-2567, 2003

Guillaume Blanc and Kenneth H. Wolfe. Functional Divergence of Duplicated Genes

Formed by Polyploidy during Arabidopsis Evolution.

Plant Cell. 2004 July; 16(7): 1679–1691.

Robert D. Finn, Jaina Mistry, Benjamin Schuster-Böckler, Sam Griffiths-Jones, Volker

Hollich, Timo Lassmann, Simon Moxon, Mhairi Marshall, Ajay Khanna, Richard

Durbin, Sean R. Eddy, Erik L. L. Sonnhammer and Alex Bateman

Nucleic Acids Research (2006) Database Issue 34:D247-D251

Wistrand M, Sonnhammer EL. Improved profile HMM performance by assessment of critical algorithmic features in SAM and HMMER.

BMC Bioinformatics. 2005 Apr 15;6-99.

Yamada K, Lim J, Dale JM. Empirical analysis of transcriptional activity in the

Arabidopsis genome.

Science. 2003 Oct 31;302(5646):842-6.

Zhaolei Zhang, Nick Carriero and Mark Gerstein. Comparative analysis of processed pseudogenes in the mouse and human genomes.

Trends in Genetics 2004 Feb;62-67

Li WH, Gojobori T, Nei M. Pseudogenes as a paradigm of neutral evolution.

Nature. 1981 Jul 16;292(5820):237-9

Download