Lecture 10. Genetic and Genomic Databases

Bi190 Advanced Genetics
2011 Lecture 10/ho9
Annotating Genome
Biological information resources that organize information around genomes and genes have
become essential tools of life science research. The information comes from multiple sources, is
organized so as to be computable, and is displayed for use. It is important to know where the
information comes from, how it is organized using ontologies (standardized terms and their
relationships), and how one can search for information. Also, everyone needs a sense of how
complete the information is. We also include a discussion of the importance of data standards.
The extent of genome-scale data surpasses a human’s ability to comprehend it at one time. We
thus have to rely on databases of genomic information. Because genomic databases are
interposed between primary data and the geneticist, it is crucial to understand how the
information gets into such databases, how one assesses the quality of these data, and the
concepts underlying their storage, query and integration.
There are many ways to organize knowledge. While humans are amazingly good at dealing with
a hodge-podge of information, computers are notoriously bad at this task: computers need
structured information! In the genome era, the large amounts of information led to the need for
us to use computers to store and manipulate information because computers are very fast and
accurate at doing repetitive tasks. Information can be compiled in a standard form and
assembled into defined structures. One way to organize biological information is
to attach it to the linear structure of a genome. Another way is by the anatomy of the organism.
Yet another way is by human disease or some other specific property. There are now thousands of
biological databases. These databases range in size, complexity, purpose, and whether they
serve humans, computers or both.
A. How data gets into databases
A genome database organizes information around a genome sequence.
A genome database typically is organized around a genomic sequence. For example, the
database might simply store the genome sequence and some description of features of that
sequence. Such descriptions are called annotations. This is where it can get rather complex
with many thousands of specific types and sub-types of features.
…tctctctatatgatctgcagcaggtcatctctgcggcttatgcgttagcgcg…
What types of information do we want? We might want to know what regions of the
genome are repetitive, or the extent of each gene.
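The idea of annotations as typed features attached to coordinates on the sequence can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Feature` record, the coordinate convention (0-based, end-exclusive), the simple dinucleotide-repeat scan, and the gene coordinates are all hypothetical and are not taken from any real database.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Feature:
    """One annotation: a typed interval on the sequence (0-based, end-exclusive)."""
    kind: str
    start: int
    end: int
    name: str = ""


def features_overlapping(features, start, end):
    """Return the features that overlap the half-open interval [start, end)."""
    return [f for f in features if f.start < end and start < f.end]


def find_dinucleotide_repeats(seq, min_units=3):
    """Annotate runs of a repeated 2-base unit (e.g. 'tctctc') as 'repeat' features.

    Units made of a single repeated base (e.g. 'aa') are ignored.
    """
    found = []
    i = 0
    while i + 2 <= len(seq):
        unit = seq[i:i + 2]
        j = i + 2
        while seq[j:j + 2] == unit:
            j += 2
        units = (j - i) // 2
        if unit[0] != unit[1] and units >= min_units:
            found.append(Feature("repeat", i, j, f"({unit})x{units}"))
            i = j  # continue scanning after the repeat
        else:
            i += 1
    return found


# The sequence fragment shown in the text above.
sequence = "tctctctatatgatctgcagcaggtcatctctgcggcttatgcgttagcgcg"

annotations = find_dinucleotide_repeats(sequence)
# Hypothetical gene extent, made up purely for illustration.
annotations.append(Feature("gene", 9, 30, "hypothetical-gene-1"))

for feature in features_overlapping(annotations, 5, 10):
    print(feature.kind, feature.start, feature.end, feature.name)
```

Storing every annotation as an interval makes questions such as "what lies in this region?" a simple overlap test, whatever the feature type.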
Genomic databases have a variety of content
DNA sequence (Chapter 8) is submitted by the producer of the data to one of the large public
DNA sequence databases (GenBank or ..). The sequence typically includes some annotations.
RNA or cDNA sequence is obtained by submission of the data upon publication.
Primary annotations to the genomic sequence
Extent of clones
Genes
Other elements
Sternberg 2011
Gene expression data relates a gene or specific DNA sequence to a level, time, place or
condition of gene expression. It might be derived from direct assay of the mRNA, protein, or
from a reporter gene construct. For large-scale gene expression data (see Chapter 11), such as
microarray or RNA-seq, the data are submitted to standard databases. For other assays there
is no requirement for submission of data upon publication, and most such data are curated by
hand.
Genetic variants relative to the reference sequence.
Sequence conservation. A quantitative measure of the extent of conservation is calculated from
aligning multiple sequences.
Genetic mapping data
Gene-gene interactions
Protein-protein and protein-RNA interactions
Association of genetic variants to phenotypes
Figure. Zoomed in view showing amino acids.
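The sequence conservation entry in the list above mentions a quantitative measure calculated from aligned sequences. One simple measure, chosen here as an assumption and not necessarily the one used by any particular genome browser, is the fraction of sequences that share the most common residue in each alignment column. The toy alignment below is made up.

```python
from collections import Counter


def column_conservation(aligned_seqs):
    """Per-column conservation: fraction of sequences sharing the most common residue.

    Gap characters ('-') are never counted as the consensus residue,
    but they still count in the denominator.
    """
    if len({len(s) for s in aligned_seqs}) != 1:
        raise ValueError("aligned sequences must all have the same length")
    scores = []
    for column in zip(*aligned_seqs):
        counts = Counter(base for base in column if base != "-")
        top = counts.most_common(1)[0][1] if counts else 0
        scores.append(top / len(column))
    return scores


# Made-up toy alignment of four sequences.
alignment = ["ACGT-A", "ACGTTA", "ACCTTA", "TCGT-A"]
print(column_conservation(alignment))
```

A score of 1.0 marks a column where every sequence agrees; lower scores mark variable or gapped positions.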
2. Stable, unique identifiers help maintain data integrity
Imagine a gene with 12 names. For example, the yeast SIN3
gene is also known as YOL004W, CPE1, GAM2, RPD1, SDI1, SDS16, and UME4. Databases
store synonyms and make it easy to keep these straight. However, imagine two names that
look alike but mean different things: cdc25 in S. pombe and CDC25 in S. cerevisiae name different, unrelated genes.
Imagine genes that change names because they merge or split. These nightmares happen
frequently and can lead to great confusion. An excellent solution to these situations is to assign
each gene a stable, unique identifier.
For papers published before 2002, the largest database of biomedical abstracts (PubMed) stores
only the initials, not the first names, of authors. Many individual researchers have the same
initials and thus these names are ambiguous in the database. It would be an enormous task to
disambiguate these names.
As humans we would prefer to use whatever symbol we like, and computer programs can often
help us translate our symbols into hidden unique identifiers. When there is ambiguity, we might
get to choose.
For example, if one searches PubMed for “elegans”, the results include C. elegans the
nematode and S. elegans the turtle. A search for “C. elegans” might return C. elegans the
flowering plant (Camellia elegans). NLM-NCBI uses unique identifiers for each taxon, so with
discipline the user can find specifically the species of interest.
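The translation from human-friendly symbols to hidden stable identifiers described in this section can be sketched as follows. The `GeneRegistry` class and the identifiers `GENE:0001` to `GENE:0003` are hypothetical, invented for illustration; only the SIN3 synonyms and the cdc25/CDC25 example come from the text. Matching is case-insensitive here on purpose, to show how an ambiguous symbol is detected.

```python
from collections import defaultdict


class GeneRegistry:
    """Maps symbols and synonyms to stable identifiers (hypothetical sketch)."""

    def __init__(self):
        self._records = {}                   # stable id -> (species, primary symbol)
        self._by_symbol = defaultdict(set)   # lower-cased symbol -> stable ids

    def add(self, stable_id, species, symbol, synonyms=()):
        self._records[stable_id] = (species, symbol)
        for name in (symbol, *synonyms):
            self._by_symbol[name.lower()].add(stable_id)

    def resolve(self, name, species=None):
        """Return the sorted stable ids matching a name, optionally within one species."""
        hits = self._by_symbol.get(name.lower(), set())
        return sorted(
            i for i in hits
            if species is None or self._records[i][0] == species
        )


registry = GeneRegistry()
registry.add("GENE:0001", "S. cerevisiae", "SIN3",
             ["YOL004W", "CPE1", "GAM2", "RPD1", "SDI1", "SDS16", "UME4"])
registry.add("GENE:0002", "S. cerevisiae", "CDC25")
registry.add("GENE:0003", "S. pombe", "cdc25")

print(registry.resolve("UME4"))                      # one gene, reached through a synonym
print(registry.resolve("cdc25"))                     # ambiguous: two genes share the symbol
print(registry.resolve("cdc25", species="S. pombe"))  # the user chooses the species
```

When `resolve` returns more than one identifier, an interface can present the candidates and let the user choose, which is the "we might get to choose" case above.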
Results can be tied to reagents to keep track of an inference chain
Imagine an RNAi experiment done with a particular sequence that maps uniquely to
the genome, and the result is assigned to a gene, T. We associate T with Phenotype W using this RNAi
reagent. Now, new sequencing of cDNA reveals that T, which had been predicted from the
genomic sequence and had only partial cDNA support, is split into two genes, T and U. If the
RNAi sequence now lies in U, then to keep the database straight someone has to notice this and
change the association of the Phenotype to U. However, if the RNAi experiment is associated
with a sequence that is continually remapped to the genome, then the Phenotype is automatically
associated with U.
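A minimal sketch of this inference chain: the phenotype is stored against the reagent's genomic coordinates, and the gene association is recomputed from whatever gene models are current. All coordinates and the `GeneModel` record are hypothetical, chosen only to reproduce the T/U scenario described above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GeneModel:
    """A gene as an interval on the genome (0-based, end-exclusive)."""
    name: str
    start: int
    end: int


def genes_hit(reagent_start, reagent_end, gene_models):
    """Names of gene models that overlap the reagent's mapped position."""
    return [g.name for g in gene_models
            if g.start < reagent_end and reagent_start < g.end]


def phenotype_associations(reagent, gene_models):
    """Derive gene -> phenotype links from the reagent's position, not from a stored gene name."""
    hits = genes_hit(reagent["start"], reagent["end"], gene_models)
    return {gene: reagent["phenotype"] for gene in hits}


# The RNAi reagent is recorded by where its sequence maps (hypothetical coordinates).
rnai_reagent = {"start": 1500, "end": 1800, "phenotype": "W"}

old_models = [GeneModel("T", 0, 2000)]                            # original prediction
new_models = [GeneModel("T", 0, 1000), GeneModel("U", 1200, 2000)]  # after T is split

print(phenotype_associations(rnai_reagent, old_models))
print(phenotype_associations(rnai_reagent, new_models))
```

Because the association is derived and not stored, nobody has to notice the gene split and edit the record by hand.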
Many data are curated by humans
In many cases, data are entered into a genomic database after examination and processing by a
professional curator.
For example, a biologist reads a research paper and extracts information.
Data in a table in a paper are in a reasonably standard format.
Researchers enter data
Data is extracted automatically from papers
B. Ontologies and their use
Ontologies organize information
Ontologies formalize some types of information. An ontology is a description of the relationships
among defined terms. Both the terms and their relationships are defined, and this structured
information allows computers to utilize the information effectively.
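To make "defined terms and defined relationships" concrete, here is a small sketch of an ontology stored as a directed graph, with a query that walks up the relationships. The terms and the two relationship types (`is_a`, `part_of`) form a made-up toy example and are not copied from any published ontology.

```python
# Each term lists its (relationship, parent term) pairs. Toy data for illustration only.
ONTOLOGY = {
    "organism": [],
    "cell": [],
    "nervous system": [("part_of", "organism")],
    "neuron": [("is_a", "cell")],
    "motor neuron": [("is_a", "neuron"), ("part_of", "nervous system")],
}


def ancestors(term, ontology, relations=("is_a", "part_of")):
    """All terms reachable from `term` by following the given relationship types."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        for relation, parent in ontology[current]:
            if relation in relations and parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


print(sorted(ancestors("motor neuron", ONTOLOGY)))
print(sorted(ancestors("motor neuron", ONTOLOGY, relations=("is_a",))))
```

With this structure a computer can answer a query for "cell" with data annotated to "motor neuron", because the relationship between the two terms is explicit.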
Ontologies cover many types of information
Anatomy Ontologies e.g. mouse
Phenotype e.g. C. elegans
The Gene Ontologies capture some basic information about genes and their products.
Gene Ontologies (GO) include evidence for the associations.
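Because GO associations carry evidence, a user can filter annotations by how they were obtained. The sketch below assumes a simplified annotation record; the gene names, term names, and references are hypothetical. The evidence codes are GO's experimental codes (such as IDA, inferred from direct assay, and IMP, inferred from mutant phenotype), while IEA marks an electronic annotation made without curator review.

```python
from dataclasses import dataclass

# GO evidence codes that indicate experimental support.
EXPERIMENTAL_CODES = {"EXP", "IDA", "IPI", "IMP", "IGI", "IEP"}


@dataclass(frozen=True)
class Annotation:
    """A simplified gene-to-term association with its evidence and source."""
    gene: str
    term: str
    evidence: str
    reference: str


# Hypothetical annotations, for illustration only.
annotations = [
    Annotation("geneA", "nucleus", "IDA", "hypothetical-paper-1"),
    Annotation("geneA", "transcription", "IEA", "hypothetical-pipeline"),
    Annotation("geneB", "nucleus", "IMP", "hypothetical-paper-2"),
]


def experimentally_supported(records):
    """Keep only annotations backed by an experimental evidence code."""
    return [a for a in records if a.evidence in EXPERIMENTAL_CODES]


for a in experimentally_supported(annotations):
    print(a.gene, a.term, a.evidence, a.reference)
```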
MOD (model organism databases)
organism or group                     database URL
sea urchin                            http://www.spbase.org/SpBase/
cellular slime mold                   http://dictybase.org/
Drosophila melanogaster (fruit fly)   http://flybase.org/
C. elegans and other nematodes        wormbase.org
budding yeast
fission yeast
Gramene                               http://www.gramene.org/
mouse
rat
zebrafish
frog
The protein content of a genome can help define pathways
Figure. Genetic modules based on association in the genome.
Conceptual view of a biological database
database table
organism                   URL
human                      http://genome.ucsc.edu/cgi-bin/hgGateway
mouse                      http://www.informatics.jax.org/
C. elegans                 http://www.wormbase.org/
Drosophila                 http://flybase.org/
S. cerevisiae              http://www.yeastgenome.org/
S. pombe                   http://www.pombase.org/ (not yet live)
                           http://old.genedb.org/genedb/pombe/
                           genome size 12.5 Mb (~14.1 Mb
rat                        http://rgd.mcw.edu/
zebrafish                  http://zfin.org/cgi-bin/webdriver?MIval=aaZDB_home.apg
Arabidopsis                http://www.arabidopsis.org/
Gramene                    http://www.gramene.org/
Gene Ontology Consortium   http://www.geneontology.org/
Variation