I. Introduction

BioDataBase (a name I have tentatively given it :>) is a database meant to hold every possible kind of data about every organism in the universe, both those that exist now and those that have ever existed. The concept of BioDataBase may evolve into the prototype, or the starting point, for realizing the Library of Life Project. The examples below may have no practical value (e.g. the barking sounds of a Dalmatian), but they make for easy illustration.

II. Hierarchical structure of BioDataBase

The taxonomy of organisms has the following hierarchy:

1. universal organism (the root of the taxonomy tree)
2. karyote (eukaryote, prokaryote, virus)
3. kingdom (plant kingdom, animal kingdom, ...)
4. phylum (e.g. chordates, the backboned animals)
5. class (e.g. mammal)
6. order (e.g. carnivore, herbivore)
7. family (e.g. felines, canines)
8. genus (e.g. wolves, domestic dogs, foxes)
9. species (e.g. Dalmatian, Chihuahua, arctic fox)
10. individual_organism (e.g. a dog named “Cutie”)

Things like kingdom, phylum, and class are the levels of the taxonomy tree. Things like eukaryote, plant kingdom, and mammal class are the nodes of the taxonomy tree. Each node contains an attribute set, and each node inherits the attribute sets of its ancestors.

For example, the “universal organism” node is the root node, and it contains attributes that are common to all organisms, like “size” or “picture of the organism”. The “Mammal” node contains attributes that are common to mammals, like “milk-feeding period.” The “Canine” node is at the “family” level of the tree, and it contains attributes like “barking sounds” or “snout size,” which are properties peculiar to canines (assume foxes and wolves also bark). The snout is the long, horizontal portion of a canine’s face that contains the nose and mouth.
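The inheritance rule above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the dictionaries `PARENT` and `OWN_ATTRIBUTES` and the function `inherited_attributes` are hypothetical names, not part of any BioDataBase design, and the levels between the root and “mammal”, and between “mammal” and “canine”, are skipped to keep the example short.

```python
# Hypothetical child -> parent map. Intermediate levels (karyote, kingdom,
# phylum, order) are omitted for brevity; this is an assumption of the sketch.
PARENT = {
    "universal organism": None,
    "mammal": "universal organism",
    "canine": "mammal",
}

# Attributes stored directly at each node, taken from the examples in the text.
OWN_ATTRIBUTES = {
    "universal organism": ["size", "picture"],
    "mammal": ["milk-feeding period"],
    "canine": ["barking sound", "snout size"],
}


def inherited_attributes(node):
    """Return the node's own attributes plus those of all ancestors, root first."""
    chain = []
    while node is not None:          # walk up the tree, parent by parent
        chain.append(node)
        node = PARENT[node]
    attributes = []
    for ancestor in reversed(chain):  # start from the root
        attributes.extend(OWN_ATTRIBUTES.get(ancestor, []))
    return attributes


print(inherited_attributes("canine"))
# ['size', 'picture', 'milk-feeding period', 'barking sound', 'snout size']
```

Each attribute is stored once, at the highest node where it applies, and a node's full attribute set is derived by walking up to the root.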
Here is a picture of part of the taxonomy tree:

    Order Carnivore: (prey, hunting method)
        Family Feline: (meow, whisker)
        Family Canine: (bark, snout)
            Genus Domestic dog: (bad habit, country of origin)
                Species Dalmatian
                Species Chihuahua
            Genus Fox: (rabbit hunt, hole size)
                Species Chihuahua
            Genus Wolf
                Species Chihuahua

III. Relation of each node

Consider the node “Domestic dog” at the genus level of the taxonomy tree. We assign a relation to the node. The name of the relation is “Genus_Domestic_Dog”, and it contains all the species that belong to the “domestic dog” genus. Similarly, the “fox” node has a “Genus_Fox” relation that contains all the fox species. Assume all foxes hunt rabbits and dig holes in the ground to live in.

    CREATE TABLE Genus_Domestic_Dog (
        speciesID,
        bad_habits,
        country_of_origin,
        PRIMARY KEY (speciesID)
    )

    Genus_Domestic_Dog
    SpeciesID    Bad_habits                 Country_of_origin
    Chihuahua    They bite any strangers    Mexico
    Dalmatian    They bark too often        France

    Genus_Fox
    SpeciesID     Hunt frequency for rabbit    Hole size in diameter
    Arctic fox    3 times a day                0.3 m
    Red fox       2 times a day                0.5 m

Note that a node at the genus level contains species tuples. This makes hierarchical sense, but it can be confusing at first.

Now consider the node “Canine” at the family level. We name its relation “Family_Canine”. It covers all the genera that belong to this family, like wolf, fox, and domestic dog. We want to store the attributes that are peculiar (they distinguish the canine family from other families) and common (they are shared by all canines). Assuming domestic dogs, wolves, and foxes all bark, “barking sound” is an attribute common to all canines and peculiar relative to other families (cats and tigers in the feline family do not bark). But we want to relate the “barking sound” to each species, not to each genus.
So the Family_Canine relation must contain speciesID:

    CREATE TABLE Family_Canine (
        speciesID,
        barking_sound,
        snout_size,
        PRIMARY KEY (speciesID)
    )

    SpeciesID     Barking sound    Snout size
    Chihuahua     Bow-bow          1 cm
    Dalmatian     Wow-wow          3 cm
    Red fox       Bow-bow          4 cm
    Arctic fox    Wow-wow          4 cm

IV. Joining of nodes in the tree

Since each node has a relation, I write “join of nodes” for “join of the relations of nodes.” Now we want to process queries that need to join a node's relation with relations at a higher level of the tree (higher means closer to the root). E.g. consider the query “Find the barking sound of the dog whose country of origin is Mexico.” Recall that “country of origin” is an attribute of Genus_Domestic_Dog and “barking sound” is an attribute that belongs to the Family_Canine node. So we need to join the relation Genus_Domestic_Dog with Family_Canine:

    SELECT Canine.barking_sound
    FROM Family_Canine Canine, Genus_Domestic_Dog Dog
    WHERE Dog.country_of_origin = 'Mexico'
      AND Dog.speciesID = Canine.speciesID

We call this kind of join a “vertical join”: it joins “node A” with its ancestor “node B”. In the query above, Genus_Domestic_Dog is joined with its ancestor Family_Canine.

In contrast, a “horizontal join” joins “node A” with another “node B” that is not its ancestor. The two nodes may or may not be at the same level. In this case, consider the two nodes’ lowest common ancestor, “node C”. Being a common ancestor, “node C” contains attributes common to “node A” and “node B”. A and B also share the attributes of their higher common ancestors, but the concept of the lowest common ancestor is expected to be useful in many applications.
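As a rough sketch, the lowest common ancestor can be computed from a child-to-parent map, and the same function tells a vertical join from a horizontal one. Everything here is an assumption made for illustration: the map `PARENT`, the functions `path_to_root`, `lowest_common_ancestor`, and `join_kind`, the relation-style node names Order_Carnivore, Family_Feline, and Genus_Wolf, and the choice to treat Order_Carnivore as the root of this partial tree.

```python
# Hypothetical child -> parent map for the part of the tree pictured in
# section II. Order_Carnivore is treated as the root only for this sketch.
PARENT = {
    "Order_Carnivore": None,
    "Family_Feline": "Order_Carnivore",
    "Family_Canine": "Order_Carnivore",
    "Genus_Domestic_Dog": "Family_Canine",
    "Genus_Fox": "Family_Canine",
    "Genus_Wolf": "Family_Canine",
}


def path_to_root(node):
    """Return [node, parent, grandparent, ..., root]."""
    path = []
    while node is not None:
        path.append(node)
        node = PARENT[node]
    return path


def lowest_common_ancestor(a, b):
    """Return the deepest node lying on both root paths.

    A node counts as its own ancestor here, so if a is an ancestor of b
    the result is a itself.
    """
    on_path_a = set(path_to_root(a))
    for node in path_to_root(b):   # first hit, walking up from b, is the lowest
        if node in on_path_a:
            return node
    return None


def join_kind(a, b):
    """Classify a join of two nodes as 'vertical' or 'horizontal'."""
    return "vertical" if lowest_common_ancestor(a, b) in (a, b) else "horizontal"


print(lowest_common_ancestor("Family_Feline", "Genus_Fox"))  # Order_Carnivore
print(join_kind("Genus_Domestic_Dog", "Family_Canine"))      # vertical
print(join_kind("Genus_Fox", "Genus_Wolf"))                  # horizontal
```

In a horizontal join, the relation of the node returned by `lowest_common_ancestor` is the natural place to look for attributes that both sides can be compared on.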
As an example of a horizontal join of two nodes at different levels (see the picture below the query), consider the query: “Find (species A, species B) pairs such that species A’s whisker size is 5 cm, species B’s hole size is 0.8 m, and both species’ hunting methods are the same.”

    SELECT Feline.speciesID, Fox.speciesID
    FROM Family_Feline Feline, Genus_Fox Fox,
         Order_Carnivore Carnivore1, Order_Carnivore Carnivore2
    WHERE Feline.whisker_size = 5
      AND Fox.hole_size = 0.8
      AND Feline.speciesID = Carnivore1.speciesID
      AND Fox.speciesID = Carnivore2.speciesID
      AND Carnivore1.hunting_method = Carnivore2.hunting_method

    Order Carnivore: (prey, hunting method)
        Family Canine: (bark, snout)
            Genus Domestic dog: (bad habit, country of origin)
                Species Dalmatian
                Species Chihuahua
            Genus Fox: (rabbit hunt, hole size)
                Species Chihuahua
            Genus Wolf
                Species Chihuahua
        Family Feline: (meow, whisker)

V. Genomic Data in BioDataBase

Genes and species are in a many-to-many relationship. That is, one species has many genes, and one gene exists in many species. Hence, a relation for this relationship is:

    CREATE TABLE Gene_Species (
        geneID INTEGER,
        speciesID STRING,
        PRIMARY KEY (geneID, speciesID)
    )

    GeneID    Species
    Gene1     monkey
    Gene1     human
    Gene2     human
    Gene3     rat

We want to store properties for each gene:

    CREATE TABLE Gene_property (
        geneID INTEGER,
        Size INTEGER,
        Protein_this_gene_encodes STRING,
        ….(other properties)…
        PRIMARY KEY (geneID)
    )

    GeneID    Size      Protein          …
    Gene1     35 kb     Myoglobin
    Gene2     40 kb     Hemoglobin
    Gene3     100 kb    Insulin
    Gene4     199 kb    leghemoglobin
    …

Also, we want to store some ‘bulky’ information separately, because it takes too much space:

    CREATE TABLE Gene_sequence (
        geneID INTEGER,
        Sequence STRING,
        PRIMARY KEY (geneID)
    )

    GeneID    Sequence
    Gene1     ATGCCCT…
    Gene2     AAGG…
    Gene3     TACC…
    Gene4     GGTAAA..

VI. Taxonomy tree and Genomic data

Consider again the taxonomy tree in the previous sections. With each node in the tree, we can associate a relation of the genes that are common to all the descendants of the node. For example,
the Family_Canine node has a relation of the genes that all the canine species (like Dalmatian, red fox, black wolf) have in common. This is also a many-to-many relation. More importantly, each node inherits the set of “common genes” from its ancestors:

    Order Carnivore: (gene1, gene6)
        Family Feline: (gene2, gene3)
        Family Canine: (gene4, gene5)
            Genus Domestic dog: (gene2, gene19)
                Species Dalmatian (gene9)
                Species Chihuahua (gene10)
            Genus Wolf
                Species Chihuahua (gene11)
            Genus Fox: (gene7, gene3)
                Species Chihuahua (gene3)

Note that we assume individuals of the same species share all the genes they have. One key point is that we do not want any redundancy of genes in the tree. So (gene1, gene6) is the maximal set of genes that all felines and all canines share.

To find all the genes that “node A” and “node B” (which may be at the same level) have in common:

0. Initialize a temporary set X.
1. Find the lowest common ancestor of node A and node B. Let’s call this node C.
2. Add all the genes of node C to set X.
3. For all the ancestors of C (walk up the tree, parent by parent), add their genes to X.
4. Consider two paths: from node A (including A) to node C (excluding C), and from node B (including B) to node C (excluding C). Roughly, consider all the node pairs across the two paths, examine each pair for common genes, and add those genes to X.

To find all the genes a node has, simply walk up the tree, adding the genes of its ancestors. Other useful operations can be built in a similar way.

The nodes and genes form a many-to-many relationship in general (although the gene sets of the nodes along the path from a node to the root are disjoint from each other). In SQL, the relation will be:

    CREATE TABLE Gene_Common_In_Node (
        nodeID STRING,
        geneID STRING,
        PRIMARY KEY (nodeID, geneID)
    )

    nodeID               geneID
    Family_Canine        Gene2
    Family_Canine        Gene3
    Species_Dalmatian    Gene9
    …                    ...

It may take centuries to get this relation for all the nodes of the tree, since it requires genome sequencing and gene finding for every species.
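Steps 0 to 4 of the common-gene procedure in this section can be sketched in Python as follows. This is a minimal illustration, not a definitive implementation: `PARENT`, `NODE_GENES`, `path_to_root`, and `common_genes` are hypothetical names, the gene sets are the order, family, and genus entries from the picture in this section, and Order_Carnivore is treated as the root only because the picture stops there.

```python
# Hypothetical child -> parent map for the partial tree in this section.
PARENT = {
    "Order_Carnivore": None,
    "Family_Feline": "Order_Carnivore",
    "Family_Canine": "Order_Carnivore",
    "Genus_Domestic_Dog": "Family_Canine",
    "Genus_Fox": "Family_Canine",
}

# Genes stored directly at each node (not inherited), as in the picture.
NODE_GENES = {
    "Order_Carnivore": {"gene1", "gene6"},
    "Family_Feline": {"gene2", "gene3"},
    "Family_Canine": {"gene4", "gene5"},
    "Genus_Domestic_Dog": {"gene2", "gene19"},
    "Genus_Fox": {"gene7", "gene3"},
}


def path_to_root(node):
    """Return [node, parent, grandparent, ..., root]."""
    path = []
    while node is not None:
        path.append(node)
        node = PARENT[node]
    return path


def common_genes(a, b):
    """Return the set of genes that nodes a and b have in common."""
    path_a, path_b = path_to_root(a), path_to_root(b)
    on_path_a = set(path_a)
    # Step 1: the lowest common ancestor C is the first node on b's
    # root path that also lies on a's root path.
    c = next(node for node in path_b if node in on_path_a)
    # Step 0 and steps 2-3: X starts empty, then receives the genes of C
    # and of every ancestor of C.
    x = set()
    for node in path_to_root(c):
        x |= NODE_GENES.get(node, set())
    # Step 4: compare every node pair across the two paths below C.
    for node_a in path_a[:path_a.index(c)]:
        for node_b in path_b[:path_b.index(c)]:
            x |= NODE_GENES.get(node_a, set()) & NODE_GENES.get(node_b, set())
    return x


print(sorted(common_genes("Genus_Domestic_Dog", "Family_Feline")))
# ['gene1', 'gene2', 'gene6']
print(sorted(common_genes("Genus_Domestic_Dog", "Genus_Fox")))
# ['gene1', 'gene4', 'gene5', 'gene6']
```

In the first call, gene1 and gene6 come from the lowest common ancestor (steps 2 and 3), while gene2 is found only in step 4, because it is stored separately at Genus Domestic dog and at Family Feline.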
In the meantime, we can build a partial tree, and with it we may even be able to predict which genes must be in a node whose set of common genes is not yet known.

VII. Genomics, Ecology, Environment, and Zoology

BioDataBase contains all the data for all the organisms in the universe. Since all of that data together is too bulky, it is necessary to partition it into “area-specific databases.” Consider three areas: genomics, ecology, and environment.

E.g. for the genomics area, we have genomics_database, which contains relations like:

    Gene_Species (
        geneID INTEGER,
        SpeciesID STRING,
        expression_level INTEGER   // how often the protein that the gene
                                   // encodes is produced
    )

For the area of ecology, we have ecology_database, which contains:

    Species_distribution(
        speciesID STRING,
        location GPS_FORMAT,       // the global positioning system
        size_population INTEGER,
        PRIMARY KEY (speciesID, location)
    )

For the area of environment:

    Location_climate(
        location GPS_FORMAT_INTERVAL,
        Humidity INTEGER,
        Rainfall INTEGER           // unit is “cm per year”
    )

Joining relations from different areas yields very useful information, like “what genes are expressed a lot in the areas where it rains a lot?”:

“Find the genes such that
1. they are common to all the species that live in areas where rainfall is more than 100 cm/year.
2.
they are expressed at more than 70 units of expression_level”

    ( SELECT Gene.GeneID, Gene.SpeciesID
      FROM Gene_Species Gene, Species_Distribution Animal, Location_Climate Climate
      WHERE Climate.Rainfall > 100
        AND Climate.Location = Animal.Location
        AND Animal.SpeciesID = Gene.SpeciesID
        AND Gene.Expression_Level > 70 )
    DIVIDE
    /// if this operation is not supported by SQL, translate this using “NOT EXISTS”
    /// and “EXCEPT” operators, as described in page 150 of the “cow book”
    ( SELECT Animal2.SpeciesID
      FROM Species_Distribution Animal2, Location_Climate Climate2
      WHERE Climate2.Rainfall > 100
        AND Climate2.Location = Animal2.Location )

Similarly, we can curate zoological data (data about the appearance of species as seen in a zoo) like:

    Species_zoo_info( speciesID, picture, sound, average_size )

You may then obtain a set of genes that correlate with the size of a species: “Find genes that are common to the species whose size is more than 2 m, and that are expressed at more than 70 units of expression_level.”

    ( SELECT Gene.GeneID, Gene.SpeciesID
      FROM Gene_Species Gene, Species_zoo_info Animal
      WHERE Animal.average_size > 2
        AND Animal.SpeciesID = Gene.SpeciesID
        AND Gene.Expression_Level > 70 )
    DIVIDE
    ( SELECT Animal2.SpeciesID
      FROM Species_zoo_info Animal2
      WHERE Animal2.average_size > 2 )

VIII. Conclusion

The examples above may not be correct or biologically interesting in themselves: they are made to illustrate the concepts of BioDataBase. The point is that we can answer genuinely interesting questions by means of BioDataBase. As of now, the Library of Life Project is at an early stage, so we are flexible about changes in the project's direction. I welcome your comments about this document!