TR2005-1979 - eCommons@Cornell

I. Introduction
BioDataBase (a tentative name :>) is a database intended to hold every available
kind of data about every organism that exists, or has ever existed, in the universe.
The concept of BioDataBase may evolve into a prototype, or the starting point, for
realizing the Library of Life Project. The examples in this document may have no
practical value (e.g. the barking sounds of a Dalmatian), but they make for easy
illustration.
II. Hierarchical structure of BioDataBase
The taxonomy of organisms has the following hierarchy:
1. universal organism (the root of the taxonomy tree)
2. karyote (eukaryote, prokaryote, virus)
3. kingdom (e.g. plant kingdom, animal kingdom)
4. phylum (e.g. chordate (backboned animals))
5. class (e.g. mammal)
6. order (e.g. carnivore, herbivore)
7. family (e.g. felines, canines)
8. genus (e.g. wolves, domestic dogs, foxes)
9. species (e.g. Dalmatian, Chihuahua, arctic fox)
10. individual_organism (e.g. a dog named “Cutie”)
Kingdom, phylum, and class are examples of levels of the taxonomy tree; eukaryote, the
plant kingdom, and the mammal class are examples of nodes. Each node holds a set of
attributes, and each node inherits the attribute sets of its ancestors.
For example, the “universal organism” node is the root, and it holds attributes that are
common to all organisms, such as “size” or “picture of the organism”. The “Mammal” node
holds attributes that are common to all mammals, such as “milk-feeding period.” The
“Canine” node sits at the “family” level and holds attributes such as “barking sounds”
or “snout size” (the snout is the long, horizontal part of a canine’s face that contains the
nose and mouth), which are peculiar to canines (assume that foxes and wolves also
bark). Here is a picture of part of the taxonomy tree:
Order Carnivore: (prey, hunting method)
    Family Feline: (meow, whisker)
    Family Canine: (bark, snout)
        Genus Domestic dog: (bad habit, country of origin)
            Species Dalmatian
            Species Chihuahua
        Genus Fox: (rabbit hunt, hole size)
            Species Chihuahua
        Genus Wolf
            Species Chihuahua
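The inheritance rule described above can be sketched in a few lines of Python. This is only an illustration, not part of the BioDataBase design: the dictionaries PARENT and OWN_ATTRIBUTES and the function inherited_attributes are hypothetical names, and the attribute names are copied from the picture.

```python
# Hypothetical in-memory model of part of the taxonomy tree.
# PARENT maps each node to its parent (None for the top node shown here).
PARENT = {
    "Order Carnivore": None,
    "Family Feline": "Order Carnivore",
    "Family Canine": "Order Carnivore",
    "Genus Domestic dog": "Family Canine",
    "Genus Fox": "Family Canine",
    "Genus Wolf": "Family Canine",
    "Species Dalmatian": "Genus Domestic dog",
}

# Attributes stored at the node itself (not inherited ones).
OWN_ATTRIBUTES = {
    "Order Carnivore": ["prey", "hunting method"],
    "Family Feline": ["meow", "whisker"],
    "Family Canine": ["bark", "snout"],
    "Genus Domestic dog": ["bad habit", "country of origin"],
    "Genus Fox": ["rabbit hunt", "hole size"],
}


def inherited_attributes(node):
    """Return the node's own attributes plus those of every ancestor,
    ordered from the most general ancestor down to the node itself."""
    chain = []
    while node is not None:
        chain.append(node)
        node = PARENT[node]
    attributes = []
    for ancestor in reversed(chain):
        attributes.extend(OWN_ATTRIBUTES.get(ancestor, []))
    return attributes
```

For instance, inherited_attributes("Species Dalmatian") collects the attributes of Order Carnivore, Family Canine, and Genus Domestic dog, in that order.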
III. Relation of each node
Consider the node “Domestic dog” at the genus level of the taxonomy tree. We
assign a relation to the node. The relation is named “Genus_Domestic_Dog”,
and it contains one tuple for each species that belongs to the “domestic dog” genus.
Similarly, the “fox” node has a “Genus_Fox” relation that contains all the fox species.
Assume that all foxes hunt rabbits and dig holes in the ground to live in.
CREATE TABLE Genus_Domestic_Dog
( speciesID, bad_habits, country_of_origin,
  PRIMARY KEY (speciesID) )
Genus_Domestic_Dog
SpeciesID    Bad_habits                 Country_of_origin
Chihuahua    They bite any strangers    Mexico
Dalmatian    They bark too often        France
Genus_Fox
SpeciesID    Hunt frequency for rabbit    Hole size in diameter
Arctic fox   3 times a day                0.3 m
Red fox      2 times a day                0.5 m
Note that a node at the genus level contains species tuples. This makes hierarchical
sense, but it can be confusing at first.
Now consider the node “Canine” at the family level. We name its relation
“Family_Canine”. The node covers all the genera that belong to this family, such as
wolf, fox, and domestic dog. We want to store the attributes that are peculiar (they
distinguish the canine family from other families) and common (they are shared by all
canines). Assuming domestic dogs, wolves, and foxes all bark, “barking sound” is
common to all canines and peculiar relative to other families (cats and tigers in the
feline family do not bark). But we want to relate “barking sound” to each species,
not to each genus. So the Family_Canine relation must be keyed by speciesID:
CREATE TABLE Family_Canine
( speciesID, barking_sound, snout_size,
  PRIMARY KEY (speciesID) )
SpeciesID    Barking sound    Snout size
Chihuahua    Bow-bow          1 cm
Dalmatian    wow-wow          3 cm
Red fox      bow-bow          4 cm
Arctic fox   Wow-wow          4 cm
IV. Joining of nodes in the tree
I write “join of nodes” for “join of the relations of nodes”, since each node has a relation.
Now we want to process queries that need to join a node's relation with relations at a
higher level of the tree (higher means closer to the root). E.g. consider the query
“Find the barking sound of the dog whose country of origin is Mexico.” Recall
that “country of origin” is an attribute of Genus_Domestic_Dog and “barking sound”
is an attribute of the “Family_Canine” node. So we need to join the relation
Genus_Domestic_Dog with Family_Canine:
“Find the barking sound of the dog whose country of origin is Mexico.”
SELECT Canine.barking_sound
FROM Family_Canine Canine, Genus_Domestic_Dog Dog
WHERE Dog.country_of_origin = 'Mexico' AND
Dog.speciesID = Canine.speciesID
We call this kind of join a “vertical join”: it joins “node A” with its ancestor
“node B”. Above, Genus_Domestic_Dog is joined with its ancestor Family_Canine.
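As a runnable sketch of the vertical join, the query can be tried with Python's built-in sqlite3 module. The rows are the sample tuples from the tables above; the column types (TEXT, REAL) and the column name snout_size_cm are assumptions made only so that SQLite accepts the schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Genus_Domestic_Dog (
    speciesID TEXT PRIMARY KEY,
    bad_habits TEXT,
    country_of_origin TEXT
);
CREATE TABLE Family_Canine (
    speciesID TEXT PRIMARY KEY,
    barking_sound TEXT,
    snout_size_cm REAL          -- assumed unit: centimetres
);
INSERT INTO Genus_Domestic_Dog VALUES
    ('Chihuahua', 'They bite any strangers', 'Mexico'),
    ('Dalmatian', 'They bark too often', 'France');
INSERT INTO Family_Canine VALUES
    ('Chihuahua', 'Bow-bow', 1),
    ('Dalmatian', 'wow-wow', 3),
    ('Red fox', 'bow-bow', 4),
    ('Arctic fox', 'Wow-wow', 4);
""")

# Vertical join: the genus-level relation is joined with its ancestor,
# the family-level relation, on the shared speciesID key.
rows = conn.execute("""
    SELECT Canine.barking_sound
    FROM Family_Canine AS Canine, Genus_Domestic_Dog AS Dog
    WHERE Dog.country_of_origin = ?
      AND Dog.speciesID = Canine.speciesID
""", ("Mexico",)).fetchall()
```

With the sample rows, only the Chihuahua has Mexico as its country of origin, so rows holds that single barking sound.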
In contrast, a “horizontal join” joins “node A” with another “node B” that is neither
its ancestor nor its descendant. The two nodes may or may not be at the same level. In
this case, consider the two nodes’ lowest common ancestor, “node C”. Being a common
ancestor, “node C” holds attributes common to “node A” and “node B”. A and B
also share the attributes of their higher common ancestors, but the lowest common
ancestor is expected to be useful in many applications.
As an example of a horizontal join of two nodes at different levels (see the
picture after the query), consider the query:
“Find (species A, species B) pairs such that species A’s whisker size is 5 cm,
species B’s hole size is 0.8 m, and both species have the same hunting method.”
SELECT Feline.speciesID, Fox.speciesID
FROM Family_Feline Feline, Genus_Fox Fox,
     Order_Carnivore Carnivore1, Order_Carnivore Carnivore2
WHERE Feline.whisker_size = 5 AND
      Fox.hole_size = 0.8 AND
      Feline.speciesID = Carnivore1.speciesID AND
      Fox.speciesID = Carnivore2.speciesID AND
      Carnivore1.hunting_method = Carnivore2.hunting_method
Order Carnivore: (prey, hunting method)
    Family Canine: (bark, snout)
        Genus Domestic dog: (bad habit, country of origin)
            Species Dalmatian
            Species Chihuahua
        Genus Fox: (rabbit hunt, hole size)
            Species Chihuahua
        Genus Wolf
            Species Chihuahua
    Family Feline: (meow, whisker)
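The horizontal join can be tried the same way. In this sketch every row is made up for illustration (the species names cat_A, cat_B, fox_A, fox_B and the hunting methods are hypothetical), and the column types are assumptions. Order_Carnivore, the lowest common ancestor of Family_Feline and Genus_Fox, is used twice, once per side of the join.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Family_Feline (speciesID TEXT PRIMARY KEY, whisker_size REAL);
CREATE TABLE Genus_Fox (speciesID TEXT PRIMARY KEY, hole_size REAL);
CREATE TABLE Order_Carnivore (speciesID TEXT PRIMARY KEY, hunting_method TEXT);

-- All rows below are made up for illustration.
INSERT INTO Family_Feline VALUES ('cat_A', 5), ('cat_B', 7);
INSERT INTO Genus_Fox VALUES ('fox_A', 0.8), ('fox_B', 0.3);
INSERT INTO Order_Carnivore VALUES
    ('cat_A', 'ambush'), ('cat_B', 'chase'),
    ('fox_A', 'ambush'), ('fox_B', 'ambush');
""")

# Horizontal join: neither node is an ancestor of the other, so both are
# joined to their lowest common ancestor and compared on its attribute.
pairs = conn.execute("""
    SELECT Feline.speciesID, Fox.speciesID
    FROM Family_Feline AS Feline, Genus_Fox AS Fox,
         Order_Carnivore AS Carnivore1, Order_Carnivore AS Carnivore2
    WHERE Feline.whisker_size = 5
      AND Fox.hole_size = 0.8
      AND Feline.speciesID = Carnivore1.speciesID
      AND Fox.speciesID = Carnivore2.speciesID
      AND Carnivore1.hunting_method = Carnivore2.hunting_method
""").fetchall()
```

With these rows, cat_A and fox_A satisfy both size conditions and share a hunting method, so they form the only pair.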
V. Genomic Data in BioDataBase
Genes and species are in a many-to-many relationship. That is, one species has many
genes, and one gene exists in many species. Hence, a relation for this relationship is:
CREATE TABLE Gene_Species (
  geneID INTEGER,
  speciesID STRING,
  PRIMARY KEY (geneID, speciesID) )

GeneID    Species
Gene1     monkey
Gene1     human
Gene2     human
Gene3     rat
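A small sketch shows how the composite key lets the relation be read in both directions. The rows are the sample tuples above; storing the labels Gene1, Gene2, Gene3 as TEXT (rather than the INTEGER geneID of the schema) is an assumption made to keep the sample data as written, and the helper names genes_of and species_with are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Gene_Species (
    geneID TEXT,
    speciesID TEXT,
    PRIMARY KEY (geneID, speciesID)   -- one row per (gene, species) pair
);
INSERT INTO Gene_Species VALUES
    ('Gene1', 'monkey'), ('Gene1', 'human'),
    ('Gene2', 'human'), ('Gene3', 'rat');
""")


def genes_of(species):
    """All genes recorded for one species."""
    rows = conn.execute(
        "SELECT geneID FROM Gene_Species WHERE speciesID = ? ORDER BY geneID",
        (species,))
    return [gene for (gene,) in rows]


def species_with(gene):
    """All species recorded as carrying one gene."""
    rows = conn.execute(
        "SELECT speciesID FROM Gene_Species WHERE geneID = ? ORDER BY speciesID",
        (gene,))
    return [species for (species,) in rows]
```

Because the primary key is the pair (geneID, speciesID), the same gene may appear with several species and the same species with several genes, but an identical pair cannot be inserted twice.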
We want to store properties for each gene:
CREATE TABLE Gene_property (
  geneID INTEGER,
  Size INTEGER,
  Protein_this_gene_encodes STRING,
  ….(other properties)…
  PRIMARY KEY (geneID) )

GeneID    Size      Protein          …
Gene1     35 kb     Myoglobin        …
Gene2     40 kb     Hemoglobin
Gene3     100 kb    Insulin
Gene4     199 kb    leghemoglobin
Also, we want to store some ‘bulky’ information separately, because it takes too much
space:
CREATE TABLE Gene_sequence (
  geneID INTEGER,
  Sequence STRING,
  PRIMARY KEY (geneID) )

GeneID    Sequence
Gene1     ATGCCCT…
Gene2     AAGG…
Gene3     TACC…
Gene4     GGTAAA..
VI. Taxonomy tree and Genomic data
Consider again the taxonomy tree from the earlier sections. With each node in the tree we
can associate a relation of the genes that are common to all the descendants of the node.
E.g. the Family_Canine node has a relation of the genes that all canine species (like
Dalmatian, red fox, black wolf) have in common. This is also a many-to-many relation.
More importantly, each node inherits the set of “common genes” from its ancestors:
Order Carnivore: (Gene1, gene6)
Family Canine: (gene4, gene5)
Family Feline: (gene2, gene3)
Genus Domestic dog: (gene2, gene19)
Species Dalmatian (gene9)
Genus Wolf
Genus Fox: (gene7, gene3)
Species Chihuahua (gene10)
Species Chihuahua (gene11)
Species Chihuahua (gene3)
Note that we assume individuals of the same species share all their genes. One
key point is that we do not want any gene to appear redundantly in the tree. So
(Gene1, Gene6) is the maximal set of genes that all felines and all canines share.
To find all the genes that “node A” and “node B” (they may be at the same
level) have in common:
0. Initialize a temporary set X to empty.
1. Find the lowest common ancestor of node A and node B. Call this node C.
2. Add all the genes of node C to X.
3. For every ancestor of C (walk up the tree, parent by parent), add its genes to X.
4. Consider two paths: one from node A (including A) up to node C (excluding C),
and one from node B (including B) up to node C (excluding C).
For every pair of nodes, one taken from each path, add to X any gene that
appears in both nodes.
To find all the genes of a node, start with the node's own genes and walk up the tree,
adding the genes of each ancestor. Other useful operations can be built in the same way.
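The common-gene steps and the ancestor walk described above can be sketched in Python as follows. This is a minimal sketch under stated assumptions: the tree is held in two hypothetical dictionaries (PARENT and GENES), only the nodes whose position in the picture is unambiguous are included, and the function names are illustrative. Step 4 is computed as the intersection of the genes collected along the two paths, which yields the same set as checking every pair of nodes.

```python
# Hypothetical in-memory copy of the upper part of the gene-annotated tree.
PARENT = {
    "Order Carnivore": None,
    "Family Canine": "Order Carnivore",
    "Family Feline": "Order Carnivore",
    "Genus Domestic dog": "Family Canine",
    "Genus Fox": "Family Canine",
    "Genus Wolf": "Family Canine",
    "Species Dalmatian": "Genus Domestic dog",
}
GENES = {
    "Order Carnivore": {"gene1", "gene6"},
    "Family Canine": {"gene4", "gene5"},
    "Family Feline": {"gene2", "gene3"},
    "Genus Domestic dog": {"gene2", "gene19"},
    "Genus Fox": {"gene7", "gene3"},
    "Species Dalmatian": {"gene9"},
}


def path_to_root(node):
    """The node followed by its ancestors, parent by parent."""
    path = []
    while node is not None:
        path.append(node)
        node = PARENT[node]
    return path


def all_genes(node):
    """Own genes of the node plus the genes of every ancestor."""
    genes = set()
    for n in path_to_root(node):
        genes |= GENES.get(n, set())
    return genes


def common_genes(node_a, node_b):
    path_a = path_to_root(node_a)
    path_b = path_to_root(node_b)
    on_path_b = set(path_b)
    # Step 1: the first node on A's path that also lies on B's path.
    lca = next(n for n in path_a if n in on_path_b)
    # Steps 2 and 3: genes of C and of all ancestors of C.
    x = all_genes(lca)
    # Step 4: genes found on both paths strictly below C.
    below_a = path_a[:path_a.index(lca)]
    below_b = path_b[:path_b.index(lca)]
    genes_a = set().union(*(GENES.get(n, set()) for n in below_a))
    genes_b = set().union(*(GENES.get(n, set()) for n in below_b))
    return x | (genes_a & genes_b)
```

For example, Genus Domestic dog and Family Feline meet at Order Carnivore, so the result contains the order's genes plus gene2, which appears on both paths below it.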
In general, the nodes and genes form a many-to-many relationship (although the gene
sets of the nodes along the path from a node to the root are disjoint from each other).
In SQL, the relation will be:
CREATE TABLE Gene_Common_In_Node (
  nodeID STRING,
  geneID INTEGER,
  PRIMARY KEY (nodeID, geneID) )
nodeID               geneID
Family_Canine        Gene2
Family_Canine        Gene3
Species_Dalmatian    Gene9
…                    ...
It may take centuries to fill in this relation for all the nodes of the tree, since doing so
requires genome sequencing and gene finding for every species. But we can build a
partial tree and, using it, we may even be able to predict which genes must be in a node
whose set of common genes is not yet known.
VII. Genomics, Ecology, Environment, and Zoology
BioDataBase contains all the data for all the organisms in the universe. Since the data
as a whole is too bulky, it must be partitioned into “area-specific databases.” Consider
three areas: genomics, ecology, and environment. E.g. for the genomics area we have
genomics_database, which contains relations like:
Gene_Species ( geneID INTEGER,
               speciesID STRING,
               expression_level INTEGER ) // how often the protein that the gene
                                          // encodes is produced
For the area of ecology, we have ecology_database, which contains:
Species_distribution( speciesID STRING,
                      location GPS_FORMAT, // the global positioning system
                      size_population INTEGER,
                      PRIMARY KEY (speciesID, location) )
For the area of environment:
Location_climate( location GPS_FORMAT_INTERVAL,
                  Humidity INTEGER,
                  Rainfall INTEGER ) // unit is “cm per year”
Joining relations from different areas yields very useful information, such as “which
genes are highly expressed in areas where it rains a lot?”:
“Find the genes such that
1. they are common to all the species that live in areas where rainfall is more
than 100 cm/year, and
2. they are expressed at more than 70 units of expression_level.”
( SELECT Gene.GeneID, Gene.SpeciesID
  FROM Gene_Species Gene,
       Species_Distribution Animal,
       Location_Climate Climate
  WHERE Climate.Rainfall > 100 AND
        Climate.Location = Animal.Location AND
        Animal.SpeciesID = Gene.SpeciesID AND
        Gene.Expression_Level > 70 )
DIVIDE // if this operation is not supported by SQL, translate it using the “NOT EXISTS”
       // and “EXCEPT” operators, as described on page 150 of the “cow book”
( SELECT Animal2.SpeciesID
  FROM Species_Distribution Animal2,
       Location_Climate Climate2
  WHERE Climate2.Rainfall > 100 AND
        Climate2.Location = Animal2.Location )
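Since DIVIDE is not an SQL operator, here is one possible translation of the rainfall query using nested NOT EXISTS, run through Python's sqlite3 module. It is a sketch under assumptions, not the translation from the textbook: locations are plain text labels instead of GPS values, gene IDs are text labels, and every row is made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Gene_Species (
    geneID TEXT, speciesID TEXT, expression_level INTEGER,
    PRIMARY KEY (geneID, speciesID));
CREATE TABLE Species_Distribution (
    speciesID TEXT, location TEXT, size_population INTEGER,
    PRIMARY KEY (speciesID, location));
CREATE TABLE Location_Climate (
    location TEXT PRIMARY KEY, humidity INTEGER, rainfall INTEGER);

-- Made-up rows; rainfall is in cm per year.
INSERT INTO Location_Climate VALUES
    ('rainforest', 90, 250), ('coast', 70, 120), ('desert', 10, 5);
INSERT INTO Species_Distribution VALUES
    ('frog', 'rainforest', 1000), ('otter', 'coast', 200),
    ('lizard', 'desert', 500);
INSERT INTO Gene_Species VALUES
    ('Gene1', 'frog', 80), ('Gene1', 'otter', 90), ('Gene1', 'lizard', 10),
    ('Gene2', 'frog', 95), ('Gene2', 'otter', 40),
    ('Gene3', 'frog', 75);
""")

# Division as a double negation: keep a gene if there is NO rainy-area
# species for which a highly expressed copy of that gene does NOT exist.
rainy_genes = [gene for (gene,) in conn.execute("""
    SELECT DISTINCT G.geneID
    FROM Gene_Species AS G
    WHERE NOT EXISTS (
        SELECT 1
        FROM Species_Distribution AS D, Location_Climate AS C
        WHERE C.location = D.location
          AND C.rainfall > 100
          AND NOT EXISTS (
              SELECT 1
              FROM Gene_Species AS G2
              WHERE G2.geneID = G.geneID
                AND G2.speciesID = D.speciesID
                AND G2.expression_level > 70))
    ORDER BY G.geneID
""")]
```

In the made-up data, frog and otter live where rainfall exceeds 100 cm/year. Gene1 is highly expressed in both, Gene2 falls short in the otter, and Gene3 is missing from the otter, so only Gene1 qualifies.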
Similarly, we can curate zoological data (data about the appearance of species as seen in
a zoo), like:
Species_zoo_info( speciesID, picture, sound, average_size)
From this you can obtain a set of genes that correlate with the size of a species:
“Find genes that are common to all species whose size is more than 2 m, and that are
expressed at more than 70 units of expression_level.”
( SELECT Gene.GeneID, Gene.SpeciesID
  FROM Gene_Species Gene,
       Species_zoo_info Animal
  WHERE Animal.average_size > 2 AND
        Animal.SpeciesID = Gene.SpeciesID AND
        Gene.Expression_Level > 70 )
DIVIDE
( SELECT Animal2.SpeciesID
  FROM Species_zoo_info Animal2
  WHERE Animal2.average_size > 2 )
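The size query can also be expressed without DIVIDE by counting: a gene qualifies when the number of large species that express it highly equals the total number of large species. The sketch below uses this counting formulation, which is a substitute technique and not the NOT EXISTS/EXCEPT translation mentioned earlier. The rows, the text gene labels, and the column types are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Gene_Species (
    geneID TEXT, speciesID TEXT, expression_level INTEGER,
    PRIMARY KEY (geneID, speciesID));
CREATE TABLE Species_zoo_info (
    speciesID TEXT PRIMARY KEY, picture BLOB, sound BLOB, average_size REAL);

-- Made-up rows; average_size is in metres.
INSERT INTO Species_zoo_info VALUES
    ('giraffe', NULL, NULL, 5.0), ('horse', NULL, NULL, 2.4),
    ('mouse', NULL, NULL, 0.1);
INSERT INTO Gene_Species VALUES
    ('Gene1', 'giraffe', 80), ('Gene1', 'horse', 75),
    ('Gene2', 'giraffe', 90), ('Gene2', 'mouse', 99),
    ('Gene3', 'giraffe', 60), ('Gene3', 'horse', 85);
""")

# Division by counting: compare, per gene, the number of large species that
# express it above the threshold with the number of large species overall.
large_species_genes = [gene for (gene,) in conn.execute("""
    SELECT G.geneID
    FROM Gene_Species AS G, Species_zoo_info AS Z
    WHERE Z.speciesID = G.speciesID
      AND Z.average_size > 2
      AND G.expression_level > 70
    GROUP BY G.geneID
    HAVING COUNT(DISTINCT G.speciesID) =
           (SELECT COUNT(*) FROM Species_zoo_info WHERE average_size > 2)
    ORDER BY G.geneID
""")]
```

One difference from the NOT EXISTS form is worth noting: if no species is larger than 2 m, the counting query returns no genes at all, whereas the double-negation form would return every gene.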
VIII. Conclusion
The examples above may not be correct or biologically interesting in
themselves; they are meant to illustrate the concepts of BioDataBase. The point is that
BioDataBase lets us answer genuinely interesting questions. As of now, the
Library of Life Project is at an early stage, so the direction of the project can
still change. I welcome your comments on this document!