Lecture 10. Genetic and Genomic Databases

Bi190 Advanced Genetics 2011, Lecture 10/ho9: Annotating Genome (Sternberg 2011)

Biological information resources that organize information around genomes and genes have become essential tools of life science research. The information comes from multiple sources, is organized so as to be computable, and is displayed for use. It is important to know where the information comes from, how it is organized using ontologies (standardized terms and their relationships), and how one can search it. Everyone also needs a sense of how complete the information is. We also include a discussion of the importance of data standards.

The extent of genome-scale data surpasses a human’s ability to comprehend it all at one time, so we have to rely on databases of genomic information. Because genomic databases are interposed between the primary data and the geneticist, it is crucial to understand how information gets into such databases, how one assesses the quality of these data, and the concepts underlying their storage, query and integration.

There are many ways to organize knowledge. While humans are amazingly good at dealing with a hodge-podge of information, computers are notoriously bad at this task: computers need structured information! In the genome era, the sheer amount of information requires us to use computers to store and manipulate it, because computers are very fast and accurate at repetitive tasks. Information can be compiled in a standard form and assembled into defined structures. One way to organize biological information is to attach it to the linear structure of a genome. Another is by the anatomy of the organism. Yet another is by human disease or some other specific property.

There are now thousands of biological databases. These databases vary in size, complexity, purpose, and whether they serve humans, computers or both.

A. How data gets into databases

A genome database organizes information around a genome sequence

A genome database typically is organized around a genomic sequence. For example, the database might simply store the genome sequence and some description of features of that sequence. Such descriptions are called annotations. This is where it can get rather complex, with many thousands of specific types and sub-types of features.

…tctctctatatgatctgcagcaggtcatctctgcggcttatgcgttagcgcg…

What types of information do we want? We might want to know which regions of the genome are repetitive, or the extent of each gene.

Genomic databases have a variety of content

DNA sequence (Chapter 8) is submitted by the producer of the data to one of the large public DNA sequence databases (GenBank or ..). The sequence typically includes some annotations.

RNA or cDNA sequence is obtained by submission of the data upon publication.

Primary annotations to the genomic sequence
  Extent of clones
  Genes
  Other elements

Gene expression data relate a gene or specific DNA sequence to a level, time, place or condition of gene expression. They might be derived from direct assay of the mRNA or protein, or from a reporter gene construct. For large-scale gene expression data (see Chapter 11) such as microarray or RNA-seq, the data are submitted to standard databases. For other assays there is no requirement for submission of data upon publication, and most such data are curated by hand.

Genetic variants relative to the reference sequence.

Sequence conservation. A quantitative measure of the extent of conservation is calculated from aligning multiple sequences.

Genetic mapping data

Gene-gene interactions

Protein-protein and protein-RNA interactions

Association of genetic variants to phenotypes

Figure. Zoomed in view showing amino acids.

2. Stable, unique identifiers help maintain data integrity

Imagine a gene with 12 names. For example, the yeast SIN3 gene is also known as YOL004W, CPE1, GAM2, RPD1, SDI1, SDS16, and UME4. Databases store synonyms and make it easy to keep these straight. However, imagine two names that look the same but mean different things, such as cdc25 in S. pombe and CDC25 in another organism. Imagine genes that change names because they merge or split. These nightmares happen frequently and can lead to great confusion. An excellent solution to these situations is to assign each gene a stable name that uniquely identifies it.

The largest database of biomedical abstracts (PubMed) does not store first names of authors of papers. Many individual researchers have the same surname and initials, and thus these names are ambiguous in the database. It would be an enormous task to disambiguate these names.

As humans we would prefer to use whatever symbol we like, and computer programs can often help us translate our symbols into hidden unique identifiers. When there is ambiguity, we might get to choose. For example, if one searches PubMed for “elegans”, the results include C. elegans the nematode and S. elegans the turtle. Searching for “C. elegans”, one might get C. elegans the flowering plant (Camellia elegans). NLM-NCBI uses unique identifiers for each taxon, so with discipline the user can find specifically the species of interest.

Results can be tied to reagents to keep track of an inference chain

Imagine an RNAi experiment is done with a particular sequence that is uniquely mappable to the genome, but the result is assigned to a gene, T. We associate T with Phenotype W using this RNAi reagent. Now, new sequencing of cDNA reveals that T, which had been predicted from the genomic sequence and had only partial cDNA support, is split into two genes, T and U.
If the RNAi sequence is now in U, then to keep the database straight someone has to realize this and change the association of the Phenotype to U. However, if the RNAi experiment is associated with a sequence that is continually remapped to the genome, then the Phenotype is associated correctly with U.

Many data are curated by humans

In many cases, data are entered into a genomic database after examination and processing by a professional curator. For example, a biologist reads a research paper and extracts information. Data in a table in a paper are in a reasonably standard format.

Researchers enter data

Data is extracted automatically from papers

B. Ontologies and their use

Ontologies organize information

Ontologies formalize some types of information. An ontology is a description of the relationships among defined terms. Both the terms and their relationships are defined, and this structured information allows computers to utilize the information effectively.

Ontologies cover many types of information
  Anatomy ontologies, e.g. mouse
  Phenotype, e.g. C. elegans

The Gene Ontologies capture some basic information about genes and their products. Gene Ontologies (GO) include evidence for the associations.

MOD

organism or group                      database URL
sea urchin                             http://www.spbase.org/SpBase/
cellular slime mold                    http://dictybase.org/
Drosophila melanogaster (fruitfly)     http://flybase.org/
C. elegans and other nematodes         wormbase.org
yeast, budding
yeast, fission
Gramene                                http://www.gramene.org/
mouse
rat
zebrafish
frog

The protein content of a genome can help define Pathways

Figure. Genetic modules based on association in the genome.

Conceptual view of a biological database

database table

organism                    URL
human                       http://genome.ucsc.edu/cgi-bin/hgGateway
mouse                       http://www.informatics.jax.org/
C. elegans                  http://www.wormbase.org/
Drosophila                  http://flybase.org/
S. cerevisiae               http://www.yeastgenome.org/
S. pombe                    http://www.pombase.org/ (not yet live)
                            http://old.genedb.org/genedb/pombe/
                            genome size 12.5 Mb (~14.1 Mb
rat                         http://rgd.mcw.edu/
zebrafish                   http://zfin.org/cgi-bin/webdriver?MIval=aaZDB_home.apg
Arabidopsis                 http://www.arabidopsis.org/
gramene                     http://www.gramene.org/
Gene Ontology Consortium    http://www.geneontology.org/

Variation
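The synonym problem described under "Stable, unique identifiers help maintain data integrity" can be illustrated with a minimal sketch. The synonym list is the SIN3 example from the text; using the systematic name YOL004W as the stable key, and the function names build_index and resolve, are assumptions made for illustration and do not describe how any particular database is implemented.

```python
# Synonyms from the SIN3 example in the text. Treating YOL004W as the
# stable key is an assumption made for this illustration.
SYNONYMS = {
    "YOL004W": ["SIN3", "CPE1", "GAM2", "RPD1", "SDI1", "SDS16", "UME4"],
}


def build_index(synonyms):
    """Map every known name (upper-cased) to the set of stable IDs using it."""
    index = {}
    for stable_id, names in synonyms.items():
        for name in [stable_id, *names]:
            index.setdefault(name.upper(), set()).add(stable_id)
    return index


def resolve(name, index):
    """Return the sorted stable IDs that a user-supplied name could mean."""
    return sorted(index.get(name.upper(), ()))


INDEX = build_index(SYNONYMS)
sin3_hits = resolve("ume4", INDEX)      # any synonym leads to the same ID
unknown_hits = resolve("XYZ1", INDEX)   # unknown names resolve to nothing
```

Because resolve returns a list, a name shared by two genes would come back with more than one identifier, which is the point at which the user "might get to choose".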
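The RNAi scenario under "Results can be tied to reagents to keep track of an inference chain" can also be sketched in code. Here the phenotype is stored against the reagent, the reagent is stored as genome coordinates, and the affected gene is recomputed from whatever gene models are current. All coordinates, identifiers and class names below are hypothetical; only the gene names T and U and Phenotype W come from the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Gene:
    stable_id: str  # hypothetical stable identifier
    symbol: str     # human-friendly name, may change over time
    start: int      # genomic extent, inclusive (hypothetical coordinates)
    end: int


@dataclass(frozen=True)
class Reagent:
    name: str
    start: int
    end: int


def genes_hit(reagent, gene_models):
    """Return symbols of genes whose extent overlaps the reagent's sequence."""
    return [g.symbol for g in gene_models
            if g.start <= reagent.end and reagent.start <= g.end]


# The phenotype is attached to the reagent, not to a gene name.
rnai = Reagent("rnai-1", 1500, 1700)
phenotype_by_reagent = {"rnai-1": "W"}

# Gene models before and after new cDNA evidence splits T into T and U.
old_models = [Gene("GENE:0001", "T", 1000, 2000)]
new_models = [Gene("GENE:0001", "T", 1000, 1400),
              Gene("GENE:0002", "U", 1450, 2000)]

before = genes_hit(rnai, old_models)
after = genes_hit(rnai, new_models)
```

With the old models the reagent maps to T; with the new models the same stored coordinates map to U, so Phenotype W follows the reagent to U without anyone editing the association by hand.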
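To make the definition of an ontology in section B concrete, the sketch below stores a few terms with "is_a" relationships and uses them to answer a query that a flat list of terms could not. The terms, gene names and annotations are invented for illustration and are not taken from any real anatomy ontology or from the Gene Ontology.

```python
# Hypothetical mini-ontology: each term lists its direct "is_a" parents.
IS_A = {
    "cell": [],
    "neuron": ["cell"],
    "motor neuron": ["neuron"],
    "sensory neuron": ["neuron"],
    "muscle cell": ["cell"],
}


def ancestors(term, is_a):
    """Return every term reachable from `term` by following is_a links."""
    seen = set()
    stack = list(is_a[term])
    while stack:
        parent = stack.pop()
        if parent not in seen:
            seen.add(parent)
            stack.extend(is_a[parent])
    return seen


# Hypothetical annotations: gene -> most specific term where it is expressed.
ANNOTATIONS = {
    "geneA": "motor neuron",
    "geneB": "sensory neuron",
    "geneC": "muscle cell",
}


def genes_annotated_to(term, annotations, is_a):
    """Genes annotated to `term` directly or to any more specific term."""
    return sorted(gene for gene, t in annotations.items()
                  if t == term or term in ancestors(t, is_a))


neuron_genes = genes_annotated_to("neuron", ANNOTATIONS, IS_A)
cell_genes = genes_annotated_to("cell", ANNOTATIONS, IS_A)
```

No gene here is annotated to "neuron" itself, yet the query finds geneA and geneB because the relationships between terms are part of the stored information.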
