The GO Reference Genome Annotation Project

The Reference Genome Group of the Gene Ontology Consortium

*Corresponding author: Pascale Gaudet, Northwestern University, Chicago, Illinois, USA, pgaudet@northwestern.edu

Abstract

The goal of the Reference Genome project is to provide a set of high-quality gene annotations from a number of biologically diverse model organisms, selected because of their considerable bodies of experimental literature and their expert bio-curation teams. Towards this end we are engaged in a focused effort to significantly improve the depth and breadth of Gene Ontology (GO) annotations for twelve genomes through coordinated annotation. Concurrent annotation by multiple curators is improving annotation consistency across genome databases, and is providing important improvements to GO's logical structure and biological content. We expect this effort to increase confidence in transitive annotation, that is, the annotation of newly sequenced genomes by inference, since a comprehensive body of experimentally based annotations will be available for these twelve well-studied organisms. Our purpose in this paper is to introduce this project and discuss preliminary results.

Background and Motivation

The functional analysis of gene products (both proteins and RNAs) is a major endeavor that requires a judicious mix of experimental and computational tools. The Gene Ontology (GO) Consortium is a collaborative effort committed to providing structured vocabularies for the annotation of the functional role of gene products in a controlled, systematic way and in a species-neutral manner (Ashburner et al. 2000, Gene Ontology Consortium 2008). Experimentally determined functions from the biomedical literature are manually curated by database curators using GO terms to create descriptive annotations of the gene products. This annotation task is carried out by curators, a word whose root comes from the Latin curare: to look after and preserve.
A curator in the sense used here is a Ph.D.-trained professional life scientist whose task is to preserve published, and in some cases unpublished, biological data by abstracting and integrating them into a database. Future biological research is dependent on the availability of well-structured representations of biological data with detailed, accurate descriptions (Bourne and McEntyre 2006, Howe et al. 2008) provided by the curators of the data repositories. The annotations created by the curators provide a solid, dependable substrate for downstream computational analyses that automatically infer the functions of gene products that are as yet uncharacterized experimentally.

GO is an essential semantic resource for curators and one of the most widely used tools for functional annotation. Each uniquely defined term in the hierarchically structured GO can be used across organisms and research areas, thus supporting powerful computational and comparative analyses of high-throughput genomic datasets. High-quality manual annotation by experts is an absolute prerequisite for seeding this system and, other than the major model organism database (MOD) projects, very few research communities have the resources or expertise to perform this labor-intensive task. Therefore, the functional annotation of other genomes typically relies on automated methods that provide the transitive inheritance of annotations from related genes for which reliable annotations are available. To address the important and much-discussed issue of errors arising from transitive annotation (see, for example, Smith 1996; Smith and Zhang 1997; Wheelan and Boguski 1998; Iliopoulos et al. 2003; Artamonova et al. 2005), the GO Consortium was determined to provide the larger research community with a comprehensive set of complete annotations for key reference organisms.
The nine organisms selected to provide this gold-standard reference set have the following characteristics: each represents a major clade from the phylogenetic spectrum; there exists a significant body of scientific literature on the organism; a reasonably sized community of researchers studies the organism; and the organism is an important experimental model for the study of human disease, or for economically important activities such as agriculture. Thus the GO Consortium is committed to providing complete annotation of the human genome as well as those of eight important model organism genomes, the GO ‘Reference Genomes’: Arabidopsis thaliana, Caenorhabditis elegans, Danio rerio, Dictyostelium discoideum, Drosophila melanogaster, Escherichia coli, Mus musculus, and Saccharomyces cerevisiae. In addition, the MOD curators for Gallus gallus, Rattus norvegicus, and Schizosaccharomyces pombe have joined the Reference Genome initiative and are fully participating in these efforts, bringing the total to twelve genomes. Each model organism has its own advantages for studying different aspects of gene function, ranging from basic metabolic reactions to cellular processes, development, physiology, behavior, and disease. All of these organisms are supported by database resources with GO curators who have the expertise to annotate gene products in these genomes according to rigorous standards set by the groups participating in the Reference Genome project (see below). We expect these reference annotations to have two important applications: they will provide improved annotations for researchers who use GO for analysis of both large- and small-scale datasets, and they will facilitate the annotation of new genomes, for which extensive experimental data on gene function are often unavailable.

Methods

The aim of the Reference Genome project is to annotate the gene products of the twelve participating genomes as completely as possible.
There are two complementary aspects to achieving annotation completeness: “depth” and “breadth”. For maximal depth, annotations should be as precise and specific as possible; in other words, all experimentally determined information (primarily from the biomedical literature) about the gene products from each of these organisms should be curated. For maximal breadth, the annotations should cover every gene product in the genome, which means we should ensure that computational inferences are included in the annotation sets. In practice, because the inferred annotations are dependent upon the experimentally derived annotations, curation is carried out in two passes: first, literature-based annotation of experimental data; second, curator-approved computational inference. Together, the two passes maximize both depth and breadth of annotation across all of the Reference Genomes.

Annotation collaboration

Although the development of the GO has been a collaborative effort since its inception, each participating group initially worked independently in assigning GO annotations, with only limited coordination between the curators of the different databases. Thus, prior to this project, specific protocols for annotation varied between the database groups. Much of the challenge in this project arises from bringing twelve disparate annotation groups together to jointly determine common protocols for annotation and quality assurance. Agreement is critical if the Reference Genome resource is to satisfy our goal of providing semantically consistent functional annotations. The curators for the different organisms must therefore curate in a consistent and reliable manner; otherwise the resulting annotations will not be systematically comparable and downstream analyses based upon these data will be skewed. Overall, the elements involved in managing and tracking this project are familiar to each of the database resource projects involved.
Curators are responsible for capturing information from two sources: the literature, and the output of computational analyses. In the first case, a curator reads a research article and captures several key pieces of information: the organism being studied; the gene product to be annotated; the type of experiment performed; the GO term(s) that best describe the gene product function, process, or location that has been determined; and an identifier (typically a PubMed ID) for the citation. Curators similarly review the output of computational analyses and approve or reject these data, once again capturing the organism; the gene product; the type of computational analysis; the GO term(s) describing what can be inferred from that computational analysis; an identifier referencing the computational algorithm; and a reference to the original entity, for example a protein domain or related gene product, upon which the inference was based. Variation in annotation occurs because curators differ in their decisions as to what is appropriate to annotate and in choosing which GO terms to employ (Camon et al. 2005; Dolan et al. 2005).

Following extensive discussions among the curation groups, two key decisions were made. First, the groups would simultaneously curate a number of homologous genes, to provide an opportunity for checking the consistency of the annotations provided by different groups. This strategy, coincidentally, also improves the ontology, since several curators working simultaneously with particular nodes of the GO structure can collaboratively identify any omissions, ambiguities or logical inconsistencies in the GO and work towards their resolution with the ontology developers. Second, the homologous sets of genes would be prioritized based on a number of criteria, to be described below.

Annotation approach

The organisms represented in the GO Reference Genome project span over 500 million years of evolutionary divergence.
The premise that underpins comparative genomics is that homologous genes, being descended from a common ancestor, will have related functions. This is not, of course, to deny that genes will diverge in function over time, but it is a sound null hypothesis. For our purposes this means that a critical first step is to establish a standard approach to determining sets of homologous genes. Ideally, the evolutionary history of each gene in all organisms would be analyzed and stored in a single resource that could be used as the definitive reference for gene family relationships and homologous gene sets. However, generating this resource is a non-trivial problem, both theoretically and practically. At present, no single resource provides a fully satisfactory solution (Dolinski and Botstein 2007, Alexeyenko et al. 2006).

One central confounding problem has been the lack of a ‘gold standard’ protein set: a shared resource used by all databases and homology prediction tools for analyzing the proteomes of the organisms used in this project. Because the different homology prediction tools do not use common, shared protein sets as inputs, their results cannot be compared. Moreover, the protein sets that are being annotated by the GO Consortium members may, and often do, differ from those used by the different homology prediction programs. By and large it is the MODs that have the most authoritative protein datasets for the species they annotate. While at present the total number of gene products is imprecisely known (largely because the full extent of post-translational modifications and alternative splicing remains uncertain), there are reasonable estimates available from the MODs for the numbers of genes encoding protein products in each genome, ranging from 4,389 in E. coli (data from EcoCyc Version 12.1, http://ecocyc.org) to 27,029 in Arabidopsis (Rhee et al. 2008).
The GO Consortium is now providing lists of protein sequence accessions for each organism to those who compute homology sets. This initiative began in collaboration with the InParanoid group, and the lists are now used as input to InParanoid, OrthoMCL, the P-POD database (Heinicke et al. 2007), and the PANTHER database (Mi et al. 2007).

Having agreed to use standardized protein sequence datasets as inputs, we next considered which of the existing algorithmic approaches to the determination of homology would best meet our objectives. We came to the conclusion that accurately resolving the relationships between genes in highly duplicated gene families requires an accurate, though computationally expensive, model based on phylogenetic trees. Phylogenetic tree-based approaches to orthology prediction have both theoretical and practical advantages for the Reference Genome project: theoretical, because they are based on an explicit evolutionary model and can be computationally evaluated; practical, because they are amenable to attractive graphical output that facilitates the rapid identification of homology sets by multiple curators. To this end we are collaborating with the PANTHER database (http://www.pantherdb.org/) to produce a curated set of gene trees derived from the standardized protein-coding gene sets for each of the twelve organisms.

The curators select a number of homologous sets to annotate in each round of concurrent annotation. Each member of the set is first annotated to the highest level of detail possible based on available experimental results. Then, related proteins are annotated based on inference from the functions of the characterized members of the family, in order to provide a predicted function for each protein whenever possible. Those inferences are made in a very conservative manner, as we are very aware that even orthologous genes may have evolved new physiological functions layered on top of more ancient functions.
For example, in single-celled eukaryotes, proteins of the syntaxin family are required for vesicle transport, and that function is conserved up the evolutionary tree. Yet in the higher eukaryotes these proteins are also required for the novel function of neurotransmitter release. Large gene families and lineage-specific duplications also pose a problem, in that some members of a set may have diverged and assumed functions distinct from those of their ancestors. It is for these reasons that curators are conservative in drawing inferences concerning gene function from sequence similarity.

Annotation priorities

While the goal of the Reference Genome project is to cover all gene families, and subsequently maintain these annotations as a part of normal curation, this work will take time. Even when concentrating initially on protein gene products alone[1], this still presents a large and formidable target annotation list, especially given the limited resources of the contributing database groups. The magnitude of the curation task is somewhat offset by the fact that for many of these proteins there is little or no experimental data available. For example, only ~30% of mouse and human genes (out of ~25,000 total) have experimentally determined GO annotations, while the proportion of S. cerevisiae genes (out of ~6000 total) that are experimentally annotated is over 70%. Nevertheless, it is clear that coordination of the Reference Genome project demands a coherent prioritization of targets for curation. Accordingly, Reference Genome curators are selecting targets using the following principles:

1. Genes known to be implicated in human disease, and their orthologs in other taxa, e.g. the gene MSH6 (a DNA mismatch repair protein), which is known to be involved in a hereditary form of colorectal cancer in humans.
2. Genes whose products are involved in known biochemical and signaling pathways, e.g. the gene PYGB (a phosphorylase) that participates in glycogen degradation.
3. Genes whose products are very highly conserved during evolution, e.g. GLYA (a serine hydroxymethyltransferase) is a human gene that is conserved in E. coli.
4. Genes identified from recently published literature as having an important or new scientific impact, e.g. POU5F1 (POU class 5 homeobox 1 gene) that is important for stem cell function.

Annotation Consistency

An important aspect of the Reference Genome project is to provide consistently generated annotations across the different participating groups. This requires that GO terms are applied with the same intended meaning; that evidence codes are used uniformly; and that ‘inferred from sequence similarity’ (ISS) annotations are based on the same methods. For the Reference Genome project, curators review existing annotations as well as add new annotations based on more recent information. We have implemented appropriate procedures, employing a mix of automated and manual methods, to ensure the highest possible quality of annotations for the Reference Genome project, whether based on experimental data found in the literature or inferred from these experimental annotations. The review includes the following: (1) a periodic assessment of consistency using a peer review system in which a curator evaluates the experimentally determined annotations provided by other curators for a gene family; (2) a computational verification that all ISS associations explicitly reference a gene for which experimental evidence is available; and (3) the replacement of older annotations that are not based on direct experimental evidence or reliable inferences with annotations that adhere to these new standards.

Literature curation consistency

The quality control process for consistency and completeness of annotations begins when curators indicate that the targeted gene(s) in their organism have been comprehensively annotated based on the information available in the biomedical literature.
The time and effort required for literature-based curation is a function of the amount of information available. If there is no literature, then the genes are immediately considered completely annotated. For genes with little literature, the curator reviews all available papers, but for genes for which hundreds of papers are available this is impractical. In these cases, curators assess the comprehensiveness of curation based upon recent reviews, and curate key primary publications accordingly.

[1] Although some GO curation teams are recording isoform details (when available), these protein annotations are not generally presented at this level; however, we expect to incorporate and reflect isoform-level annotations for all twelve organisms in the future.

The curation consistency review process also identifies problems with the interpretation of particular GO terms. These terms are then flagged within the GO with a comment that a curator must be careful when using them. For example, "cell growth" is sometimes misused because it is confused with "cell proliferation", which contributes to the growth of organs. Certain concepts, such as "development", "differentiation" and "morphogenesis", are used with various, overlapping meanings in the literature. In GO they are specifically defined, and we verify that all annotations use terms as defined by the GO. The review of curation also identified several GO terms that proved to be ambiguously defined, or that were not logically consistent within the GO. These terms have been made obsolete by the GO and replaced by more appropriate terms. Examples of such terms include "electron transport" (replaced by two terms: "electron transport chain" and "oxidation reduction"), and "secretory pathway" (replaced by two terms: "exocytosis" and "vesicle-mediated transport").
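The retirement of ambiguous terms lends itself to a simple automated check. The Python sketch below is only an illustration, not the Consortium's actual tooling: it flags annotations that still use one of the two obsoleted terms named above and lists the candidate replacements. Because each obsolete term was replaced by two terms, the choice between them is left to a curator. The names `REPLACEMENTS` and `flag_obsolete`, and the gene identifiers, are hypothetical.

```python
# Hypothetical sketch: flag annotations that still use obsoleted GO term names.
# The term names are the two examples given in the text; gene IDs are made up.
REPLACEMENTS = {
    "electron transport": ["electron transport chain", "oxidation reduction"],
    "secretory pathway": ["exocytosis", "vesicle-mediated transport"],
}


def flag_obsolete(annotations):
    """Return (gene_id, obsolete_term, candidates) for annotations needing review.

    `annotations` is an iterable of (gene_id, go_term_name) pairs.
    """
    flagged = []
    for gene_id, term in annotations:
        candidates = REPLACEMENTS.get(term)
        if candidates is not None:
            flagged.append((gene_id, term, candidates))
    return flagged


example = [
    ("geneA", "electron transport"),
    ("geneB", "exocytosis"),
    ("geneC", "secretory pathway"),
]

for gene_id, term, candidates in flag_obsolete(example):
    print(f"{gene_id}: '{term}' is obsolete; a curator should choose among {candidates}")
```

The one-to-many mapping is deliberate: an automatic substitution would have to guess which of the two replacement terms the original experiment supports.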
At the beginning of the GO Consortium project, a number of annotations were made, usually from textbooks or review articles, which were supported solely by author statements and often represented what could be thought of as common knowledge. For example, ‘DNA polymerase [mouse Poln] (UniProt:Q7TQ07)’ performs 'DNA replication (GO:0006260)' in the 'nucleus (GO:0005634)'. Although this information is valuable and in most cases accurate, we soon realized that often such annotations either could not be traced to an original experiment, or the original experiment was actually carried out on a homologous gene product from another organism. Therefore, the Reference Genome project has made it standard practice to avoid making annotations based solely on generic author statements that lack direct experimental verification for the specific gene being annotated, and we are systematically replacing such annotations, as they are encountered, with new annotations supported by experimental or sequence similarity evidence.

Consistent transitive curation of proteins in a homolog set

To ensure maximal breadth of genome annotation, we transfer GO annotations from experimentally studied proteins to homologous proteins for which no experimental data are available. These inferences take place after all the experimental data have been captured. When all curators have completed their literature-based annotations for a given target gene set, a second iteration of annotation is done, this time based on inference. This was initially an idiosyncratic process unique to each database group. However, the use of tree-based methods to examine homologous groups allows for a more rigorous and systematic transfer of this information.
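As a rough illustration of this two-pass procedure, the Python sketch below collects the experimentally supported annotations in a homolog set, proposes ISS candidates for the members that lack them (recording the source gene in a `with_gene` field), and then checks that every ISS annotation points to a gene with experimental evidence for the same term. The `Annotation` record, the function names, the placeholder references, and the `speciesX`/`speciesY` genes are assumptions made for this example; the experimental evidence codes are those listed in the Figure 1 legend. The proposals are candidates for curator review only.

```python
from dataclasses import dataclass
from typing import Optional

# Experimental evidence codes as listed in the Figure 1 legend.
EXPERIMENTAL = {"IDA", "IMP", "IGI", "IPI", "IEP"}


@dataclass(frozen=True)
class Annotation:
    gene: str                        # gene product identifier (hypothetical IDs below)
    go_id: str                       # GO term identifier
    evidence: str                    # evidence code, e.g. "IDA" or "ISS"
    reference: str                   # citation identifier (placeholders below)
    with_gene: Optional[str] = None  # for ISS: the gene the inference is based on


def propose_iss(family, annotations):
    """Second pass: propose ISS candidates for family members lacking a term."""
    experimental = [a for a in annotations
                    if a.gene in family and a.evidence in EXPERIMENTAL]
    existing = {(a.gene, a.go_id) for a in annotations}
    proposals = []
    for source in experimental:
        for member in family:
            key = (member, source.go_id)
            if member != source.gene and key not in existing:
                existing.add(key)  # avoid duplicate proposals from other sources
                proposals.append(Annotation(member, source.go_id, "ISS",
                                            "REF:placeholder", source.gene))
    return proposals


def unsupported_iss(annotations):
    """Return ISS annotations whose source gene lacks experimental evidence."""
    supported = {(a.gene, a.go_id)
                 for a in annotations if a.evidence in EXPERIMENTAL}
    return [a for a in annotations
            if a.evidence == "ISS" and (a.with_gene, a.go_id) not in supported]


# Hypothetical homolog set; GO ID as cited in the text for the MSH2 example.
family = ["human:MSH2", "speciesX:msh2", "speciesY:msh2"]
curated = [Annotation("human:MSH2", "GO:0000710", "IDA", "PMID:placeholder")]

proposed = propose_iss(family, curated)
for p in proposed:
    print(f"candidate for review: {p.gene} {p.go_id} ISS with {p.with_gene}")

print("unsupported ISS annotations:", unsupported_iss(curated + proposed))
```

Keeping the source gene on each inferred record is what makes the final check a simple set lookup.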
This is not an automatic process; rather, a curator reviews each inferred annotation with care, since the function of a gene can diverge during evolution, particularly after gene duplication events that may free one of the duplicated copies from selection constraints and allow the evolution of new functionality. One of the gold standards of the Reference Genome project is that inferential annotations are made only in cases where there is experimental data from the organism from which the annotation is inferred. For example, orthologs of the mismatch repair gene MSH2 are present in all eukaryotes. However, the role in mismatch repair (GO:0000710) has only been experimentally demonstrated in a subset of species: e.g. human, mouse, Arabidopsis, the yeasts, and C. elegans. Therefore, inferential annotations in other organisms must refer to a corresponding ortholog from one of these species in which experimental work has been done (this need not necessarily be a species from the Reference Genome project). Hence, after making inferences, queries are run to verify that this is the case.

Progress

As of October 2008, we have annotated nearly 400 homology sets, comprising approximately 2500 genes. In the upcoming months, we plan to scale up the effort by having all the proteins from the twelve organisms organized into the PANTHER family database and annotation tool. The annotations may be viewed using AmiGO, the GOC browser (http://amigo.geneontology.org/, Carbon et al. 2008). In the upcoming release of AmiGO (end of 2008) a number of new displays will be available that are specifically designed for public browsing of data from the Reference Genome project. The additions include a table listing all gene families that have complete Reference Genome annotations. For each entry in the table there will be a link to a “Comparison Graph”, as shown in Figure 1, to highlight the common functions for each gene family as well as those particular to certain organisms or groups of organisms.
In addition to the graphical representation of the annotation data, individual pages for each gene family will also be available which list all annotations in that family, as shown in Figure 2.

Improvements to annotations

Gene products selected for concurrent annotation in the course of the Reference Genome project have improved breadth and depth of annotation coverage, as shown in Figure 3. Those genes have a higher percentage of annotations derived from published experimental research. Moreover, the annotations for those genes carry significantly more information than they did prior to this effort. The number of genes annotated by inference through homology has also increased, further contributing to increased breadth and depth of genome coverage in the annotations. In all cases, the percentage of genes annotated only with computed annotations that have not been reviewed by a curator is reduced, demonstrating that this effort does contribute to improving the breadth and quality of annotation for each genome.

Improvements to GO

The collaborative annotation of a group of similar gene products has also proven to be useful for the development of the GO itself. For example, as a direct consequence of the Reference Genome project, 223 ontology change requests were made. These requests, which made up slightly more than 10% of all ontology change requests during this period, were incorporated into the ontologies. Examples of requested new terms include regulation of NAD(P)H oxidase activity, DNA 5'-adenosine monophosphate hydrolase activity, neurofilament bundle assembly, and quinolinate metabolic process. We have also enhanced the ontology by adding synonyms, improving definitions, and correcting inconsistencies.

Discussion

The aim of the Reference Genome project is to provide a source of reliable GO annotations for twelve key genomes based upon rigorous standards.
This endeavor faces many difficult challenges, such as: the determination and provision of reference protein sets for each genome; the establishment of gene families for curation; the application of consistent best practices for annotation; and the development of methodologies for evaluating progress towards our goal. Although this is a laborious effort, steady progress is being made in developing this resource for the research community. This initiative has propelled the GOC into the provision of standardized protein sets for these genomes, which we expect to be of broad utility beyond the Reference Genome project. By engaging curators from across the MODs in joint discussions we are observing improvements in curation consistency and refinement of the GOC best practices guidelines (see http://geneontology.org/GO.annotation.conventions.shtml). The genes that have been targeted by the Reference Genome project have significantly improved annotation specificity compared to their previous annotation, and the number of genes annotated by inference through homology has also increased. This increased breadth and depth of genome coverage in the annotations is one of the major goals of the project. An additional beneficial side effect has been the improvements to the GO itself, which will consequently improve the accuracy of inferences based on these annotations. Genomes that are fully and reliably functionally annotated empower scientific research in the community, as they are essential for use in automated inferential annotation of other genomes, and this motivates the Reference Genome project's work. We encourage users to communicate with the GO Consortium (send e-mail to gohelp@genome.stanford.edu) with questions or suggestions for improvements to better achieve this aim.

Data availability

Access to all GOC software and data is free and without constraints of any kind.
An overview of the project as well as links to all resources described below can be found at http://www.geneontology.org/GO.refgenome.shtml. Annotations made by the databases participating in the Reference Genome project are available from the GOC website in gene_association file format (http://www.geneontology.org/GO.current.annotations.shtml). The protein sequence datasets are available for the community as a standardized resource from http://www.geneontology.org/gp2protein/. The exact queries used to gather statistics for the annotation improvement reports can be found at: http://www.geneontology.org/GO.database.schema-with-views.shtml.

Figure 1. [UPDATE] The Comparison Graph shows all annotations, both experimental (evidence codes: IDA, IMP, IGI, IPI, IEP) as well as those inferred from sequence similarity to an experimentally characterized gene (ISS) and by curators (IC). Colored wedges are used to indicate where in the GO graph there are direct annotations to specific terms for the different species. The species key (panel B) is presented to the user in the upper left corner of the browser window and allows the user to select which species to show. Panel A presents the zoomed-out window for an overview of the complete annotation picture. In this particular screen capture “[e] experimental” has been selected, which is presented as a bulls-eye placed over those terms with experimentally based annotations. Panel C presents the zoomed-in view with the detailed annotation on each term. At either level of zoom, mousing over a term provides a pop-up window with these same annotation details.

Figure 2. [UPDATE] The Categories List highlights common or related annotations.

Figure 3. [UPDATE AND PROVIDE LEGEND XXX]

Acknowledgements

The Reference Genome effort is overseen by Pascale Gaudet [2] (dictyBase) and includes these representatives from the curational staff: Tanya Berardini (TAIR), Emily Dimmer (GOA), Stacia R. Engel (SGD), Petra Fey [2], David P. Hill (MGI), Doug Howe (ZFIN), Jim Hu (E.coliDB), Rachael Huntley (GOA), Varsha K. Khodiyar (UCL), Ranjana Kishore (WormBase), Donghui Li [3], Ruth C. Lovering (GOA), Fiona McCarthy (AgBase), Li Ni (MGI), Victoria Petri (RGD), Susan Tweedie (FlyBase), Kimberly Van Auken (WormBase), and Valerie Wood (GeneDB); the following computational staff representatives: Siddhartha Basu [3] (dictyBase), Seth Carbon (BBOP), Mary Dolan (MGI), and Christopher J. Mungall (BBOP); those establishing the protein families to be annotated: Kara Dolinski (PPOD) and Paul Thomas (PANTHER); and, of course, the four PIs of the GO Consortium: Michael Ashburner (FlyBase), Judith A. Blake (MGI), J. Michael Cherry (SGD), and Suzanna Lewis (BBOP).

Affiliations: dictyBase: Northwestern University, Chicago, IL, USA; TAIR: Carnegie Institution, Department of Plant Biology, Stanford, CA, USA; GOA: UniProt, EBI, Hinxton, UK; MGI: The Jackson Laboratory, Bar Harbor, ME, USA; SGD: Department of Genetics, Stanford University, Stanford, CA, USA; ZFIN: University of Oregon, Eugene, OR, USA; E.coliDB: Texas A&M University, College Station, TX, USA; UCL: Dept of Medicine, University College London, London, UK; WormBase: California Institute of Technology, Pasadena, CA, USA; AgBase: Mississippi State University, Starkville, MS, USA; RGD: Medical College of Wisconsin, Milwaukee, WI, USA; FlyBase: Department of Genetics, University of Cambridge, Cambridge, UK; GeneDB: Wellcome Trust Sanger Institute, Hinxton, UK; BBOP: Lawrence Berkeley National Laboratory, Berkeley, CA, USA; PPOD: Princeton University, Princeton, NJ, USA; PANTHER: SRI, Palo Alto, CA, USA; GOEO: EBI, Hinxton, UK.

The authors particularly wish to thank and recognize the invaluable contributions of their curator colleagues in the GO Consortium whose work ensures that the objectives of the Reference Genome project are fully realized: Rama Balakrishnan (SGD), Gail Binkley (SGD), Karen R. Christie (SGD), Maria C. Costanzo (SGD), Jennifer Deegan (GOEO), Alexander D. Diehl (MGI), Qing Dong (SGD), Harold Drabkin (MGI), Dianna G. Fisk (SGD), Midori Harris (GOEO), Jodi E. Hirschman (SGD), Benjamin C. Hitz (SGD), Eurie L. Hong (SGD), Amelia Ireland (GOEO), Cynthia J. Krieger (SGD), Jane Lomax (GOEO), Stuart R. Miyasato (SGD), Robert S. Nash (SGD), Julie Park (SGD), Debby Siegele (E.coliDB), Dmitry Sitnikov (MGI), Marek S. Skrzypek (SGD), Shuai Weng (SGD), Edith D. Wong (SGD), and Kathy K. Zhu (SGD). We also thank these Principal Investigators for their enthusiastic support of this effort that is taking place within their research groups: Rolf Apweiler (GOA), Carol Bult (MGI), Rex Chisholm (dictyBase), Janan Eppig (MGI), Howard Jacob (RGD), Julian Parkhill (GeneDB), Seung Rhee (TAIR), Martin Ringwald (MGI), Paul Sternberg (WormBase), and Monte Westerfield (ZFIN).

The Gene Ontology Consortium is supported by NIH-NHGRI P41 grant HG002273. Curation at the model organism databases is supported as follows: FlyBase, Medical Research Council grant G0500293; dictyBase, National Institutes of Health (NIH) grants GM64426 and HG00022; MGI, NIH-NHGRI P41 grant HG000330 and NIH grant HD XXXXX; WormBase, US NIH-NHGRI P41 grant HG02223; Human Cardiovascular GO team, British Medical Research Council and British Heart Foundation grant SP/07/007/23671; ZFIN, NIH-NCRR P41 grant HG002659-06; GOA, core EMBL funding; E. coli XXX; SGD XXX; chicken XXX; RGD XXX; Pombe XXX; TAIR XXX; and chicken XXX. We are very grateful to the following for stimulating discussions: Richard Durbin and Erik Sonnhammer.

References

Alexeyenko A, Lindberg J, Perez-Bercoff A, Sonnhammer ELL. 2006. Overview and comparison of ortholog databases. Drug Discovery Today: Technologies 3: 137-143.
Artamonova II, Frishman G, Gelfand MS, Frishman D. 2005. Mining sequence annotation databanks for association patterns. Bioinformatics 21 Suppl. 3: iii49-iii57.
Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry JM, Davis AP, Dolinski K, Dwight SS, Eppig JT, Harris MA, Hill DP, Issel-Tarver L, Kasarskis A, Lewis S, Matese JC, Richardson JE, Ringwald M, Rubin GM, Sherlock G. 2000. Gene Ontology: Tool for the unification of biology.
Nature Genetics 25: 25-29. Berglund AC, Sjölund E, Ostlund G and Sonnhammer EL. 2008. InParanoid 6: eukaryotic ortholog clusters with inparalogs. Nucleic Acids Res. 36: D263-D266. Bourne PE, McEntyre J (2006) Biocurators: Contributors to the World of Science. PLoS Comput Biol 2(10): e142. Camon EB, Barrell DG, Dimmer EC, Lee F, Magrane M, Maslen J, Binns D, and Apweiler R, 2005. An evaluation of GO annotation retrieval for BioCreAtIvE and GOA. BMC Bioinformatics 6: S17. Dolan, M.E., Ni, L., Camon, E., and Blake, J.A. 2005. A procedure for assessing GO annotation consistency. Bioinformatics Suppl 1:i136-i143. Dolinski K and Botstein D, 2007. Orthology and functional conservation in eukaryotes. A. Review Genetics 41: 465507. Fitch WM, 1970. Distinguishing homologous from analogous proteins. Systematic Biol. 19: 99-113. Heinicke S, Livstone MS, Lu C, Oughtred R, Kang F, Angiuoli SV, White O, Botstein D, Dolinski K. 2007. The Princeton Protein Orthology Database (P-POD): a comparative genomics analysis tool for biologists. PloS ONE 2: e766. Howe, Costanzo, Fey, Gojobori, Hannick, Hide, Hill, Kania, Schaeffer, St Pierre, Twigger, White & Rhee. 2008 Big data: The future of biocuration. Nature 455: 47. Page 11 Iliopoulos I, Tsoka S, Andrade MA, Enright AJ, Carroll M, Poullet P, Promponas V, Liakopoulos T, Palaios G, Pasquier C, Hamodrakas S, Tamames J, Yagnik AT, Tramontano A, Devos D, Blaschke C, Valencia A, Brett D, Martin D, Leroy C, Rigoutsos I, Sander C, Ouzounis CA. 2003. Evaluation of annotation strategies using an entire genome sequence. Bioinformatics 19: 717-726. Li L, Stoeckert CJ Jr, Roos DS. 2003. OrthoMCL: identification of ortholog groups for eukaryotic genomes. Genome Res. 13: 178-189. Mi H, Guo N, Kejariwal A, Thomas PD. PANTHER version 6: protein sequence and function evolution data with expanded representation of biological pathways. Nucleic Acids Res. 2007 Jan;35:D247-52. Nature Publishing Group (2007) The database revolution. Nature 445, 229-230. 
Pena-Castillo L, Tasan M, Myers CL, Lee H, Joshi T, Zhang C, Guan Y, Leone M, Pagnani A, Kim WK, Krumpelman C, Tian W, Obozinski G, Qi Y, Mostafavi S, Lin GN, Berriz GF, Gibbons FD, Lanckriet G, Qiu J, Grant C, Barutcuoglu Z, Hill DP, Warde-Farley D, Grouios C, Ray D, Blake JA, Deng M, Jordan MI, Noble WS, Morris Q, Klein-Seetharaman J, Bar-Joseph Z, Chen T, Sun F, Troyanskaya OG, Marcotte EM, Xu D, Hughes TR, Roth FP. A critical assessment of Mus musculus gene function prediction using integrated genomic evidence. Genome Biol 2008; 9:S2 http://genomebiology.com/2008/9/S1/S2 Penkett CJ, Morris JA, Wood V and Bahler J. 2006. YOGY: A web-based, integrated database to retrieve protein orthologs and associated Gene Ontology terms. Nucleic Acids Res. 34: W330-W334. Rhee SY, Wood V, Dolinski K, Draghici S. 2008. Use and misuse of the gene ontology annotations. Nat. Rev. Genet. 9: 509-515. Ruan J, Li H, Chen Z, Coghlan A, Coin LJ, Guo Y, Hériché JK, Hu Y, Kristiansen K, Li R, Liu T, Moses A, Qin J, Vang S, Vilella AJ, Ureta-Vidal A, Bolund L, Wang J, Durbin R. TreeFam: 2008 Update. Nucleic Acids Res. 2008 Jan;36(Database issue):D735-40. Smith, RF. 1996. Perspectives: sequence database searching in the era of large-scale genomic sequencing. Genome Res. 6: 653-660. Smith, TF and Zhang, X. 1997. The challenges of genome sequence annotation or “The devil is in the details”. Nat. Biotechnology 15:1222-1223 Tatusov RL, Fedorova ND, Jackson JD, Jacobs AR, Kiryutin B, Koonin EV, Krylov DM, Mazumder R, Mekhedov SL, Nikolskaya AN, Rao BS, Smirnov S, Sverdlov AV, Vasudevan S, Wolf YI, Yin JJ, Natale DA. The COG database: an updated version includes eukaryotes. BMC Bioinformatics 2003 4: 41. The Gene Ontology Consortium. 2008. The Gene Ontology Project in 2008. Nucleic Acids Res. 36: D440-4. Wheelan, SJ and Boguski MS. 1998. Late-Night Thoughts on the Sequence Annotation Problem. Genome Res. 8: 168-169. 
Wheeler DL, Barrett T, Benson DA, Bryant SH, Canese K, Chetvernin V, Church DM, Dicuccio M, Edgar R, Federhen S, Feolo M, Geer LY, Helmberg W, Kapustin Y, Khovayko O, Landsman D, Lipman DJ, Madden TL, Maglott DR, Miller V, Ostell J, Pruitt KD, Schuler GD, Shumway M, Sequeira E, Sherry ST, Sirotkin K, Souvorov A, Starchenko G, Tatusov RL, Tatusova TA, Wagner L, Yaschenko E. 2008. Database resources of the National Center for Biotechnology Information. Nucleic Acids Res. 36: D13-D21. Page 12
