Registering a Data Source with the BIRN Data Integration Environment: Requirements

One of the goals of the BIRN project is to provide a means for data repositories produced across the test-beds to be seamlessly analyzed, searched or browsed as a whole. The BIRN data integration environment is the critical enabling technology that makes this possible. Before the data integration environment can add another BIRN site’s data source to this mix, the source must first be registered.

Ontologies play a critical role in bridging information from the new source to sources that are already part of the integrated data environment. The data integration environment will use concepts specified according to formal ontological knowledge resources to give users of the “pooled” BIRN data repository a means to determine which data are conceptually related across the disparate repositories, the specifics of those semantic relationships, and how to use them to group conceptually related data elements for statistical analysis. To perform this task, each source needs to provide a semantic description of the relevant knowledge domain(s) in terms of known ontological concepts, and to map these ontological concepts to their specific location in the source database schema (table, field or value).

All concepts in use by the BIRN databases should be referenced to core concepts contained in the BONFIRE, a collection of cross-mapped terminologies and ontologies maintained by the BIRN. The BONFIRE currently contains the UMLS, the latest version of Neuronames, and concepts related to these two core terminologies provided by individual users. Additional knowledge sources will be added upon request and review. BONFIRE allows users to propose a new concept and to provide the relationship between the proposed concept and existing concepts.

Mapping the database:
Each source is expected to provide a mapping between the values in the database and the BONFIRE.
Users are also encouraged to map table names and field (column) names to provide a semantic markup of the database schema. The purpose of this exercise is to:

1) Provide semantic information on the content and structure of the database that is both human- and machine-readable
2) Link concepts contained in source databases to larger conceptual networks for the purposes of data integration across scales, species and disciplines

At the time of source registration, each source should create an XML file or an Excel spreadsheet with the following information:

Ontological source  Ontological ID  Table Name  Column Name   Value  Data type
umls                C0080105        subject
umls                C0206334        subject     strain               varchar2
umls                C0025914        subject     popular_name  mouse  varchar2

The first column defines the ontological source. The second column provides the ontological concept ID, a unique identifier provided by the source ontology/terminology. “Table Name” and “Column Name” refer to the source database table and column (field) names respectively, while “Value” refers to the concept entered into the database. “Data type” refers to the type of data defined for that field in the database. In the example above, the table name “subject” has been mapped in the 1st row, the field name “strain” in the 2nd row, and the value “mouse” in the 3rd row.

Adding concepts to BONFIRE:
The BIRN is defining a set of core ontologies/terminologies for mapping. Copies of these will be maintained in the BONFIRE database. If the source ontology does not contain the required concept, it may be added using the BONFIRE tool. Choose “Add New Concept” from the pull-down menu. If the new term is a synonym of an existing term, provide the existing term ID in the designated box. For each new concept added, the user is asked to provide the concept, a definition of the concept and the Semantic Type.
A list of Semantic Types currently in use by UMLS can be found at: http://www.nlm.nih.gov/research/umls/META3_current_semantic_types.html. An additional comment field is also available for any notes that may be required for curation. Users should also enter their name under “Your ID”.

Once a concept is added, it is assigned a unique identifier. To distinguish user-defined terms from those in the existing ontologies, the prefix “BF” is added to the concept ID, e.g., BF_T0000022. New terms are designated with a “U” for “Uncurated”. The OTF will set policies for curation of BIRN concepts.

Users may also supply one or more relationships to link proposed concepts to existing concepts. These relationships form the basis for multiscale and cross-discipline data integration in BIRN because they allow concepts to be linked into semantic networks. These networks will contain not only distinct specifications of separate knowledge domains (e.g., neuroanatomy, structural components of neurons, gene expression, behavioral assessment, image data provenance, etc.), but also a gradually enriched map of how concepts across these domains relate to one another (e.g., brain regions active during a specific behavioral task, neuroanatomical spatial maps of gene expression, etc.).

While adding a concept to the BONFIRE is relatively straightforward, determining the relationship or relationships that link this concept to existing concepts can be tricky. We will be using the UMLS relationship labels. The UMLS documentation contains the following information about relationships:

2.3.2 Relationship Labels
All relationships (outside the basic concept structure) in the Metathesaurus carry a general label (REL), describing their basic nature, such as Broader, Narrower, Child of, Qualifier of, etc., and are identified by their source. Most of these relationships are either directly asserted in a source vocabulary or are implied by the structure of the source vocabulary.
About a quarter of the relationships in the Metathesaurus also carry an additional label (RELA), obtained from a source vocabulary, that explains the nature of the relationship more exactly, such as is_a, branch_of, component_of. The Digital Anatomist vocabulary and RxNorm are examples of source vocabularies that include such relationship labels. A complete list of the additional relationship labels appears in MRDOC.RRF and in Appendix B.3 in this documentation.
http://www.nlm.nih.gov/research/umls/meta2.html

BONFIRE provides both a list of general relationships and the additional labels used by the UMLS source vocabularies. We may eventually add the ability to relate concepts based on the new relations ontology provided as a part of the Open Biomedical Ontology project (http://obo.sourceforge.net/relationship/), now managed by the National Center for Biomedical Ontology (http://bioontology.org/).

The simplest relationship to add is the “is a” or “parent-child” relationship. The “parent-child” relationship is used to link members of a class to a parent class. Thus, if the statement “X is a Y” is true, e.g., “Purkinje cell is a neuron”, then X is a child of Y and Y is the parent of X. In this case, the superclass is “neuron” and the subclass is “Purkinje cell”. If the relationship linking two concepts is not “is a” but some form of “has a”, e.g., “Purkinje cell has a dendrite”, then the relationship is not “parent-child”. Because neither “Purkinje cell is a dendrite” nor “dendrite is a Purkinje cell” is true, this relationship is not parent-child. When using the “parent-child” relationship, the two concepts must be of the same semantic type.

Part II. Registering a Database’s Foreign and Primary Keys with the BIRN Data Integration Environment

Another requirement will be to specify the data integrity constraints used in a source relational database (RDBMS).
For example, some MySQL sources use a MySQL table format that has no built-in support for primary key/foreign key constraints. The source curators must then track referential integrity information manually by adding referential columns to their database schema to store cross-table link information. Curators of source data repositories whose RDBMS does include a built-in referential integrity mechanism, but who have chosen not to use it in their schema and do not create primary key/foreign key constraints, must do the same. The data integration environment must ultimately have access to all available referential integrity information kept by the source data repository in order to be effective and to provide sufficient, automated data integrity quality assurance. This information can also be provided in the form of an XML file or Excel table that will be incorporated into the mediator registry.

Example 2: A clinical database on a MySQL server that does not support foreign keys. This database has multiple tables. Table “subject” has a column “subjectid” that is the primary key, and table “assessment” has a column “subjectid” that is a foreign key reference to table “subject”, column “subjectid”. This information is reflected in the table below, where Record ID identifies each unique entry in the table, Key ID identifies each unique key (e.g., composite keys are represented as multiple rows with the same Key ID), and Order denotes the position of a column within a key (this matters for composite keys). Table Name and Column Name identify the columns in the key. Key Type denotes whether the key is a Primary (P) or Unique (U) key. FK refers to any foreign key reference from a column to the target primary key column. Reference denotes any intra-table references for a specific column. Here is the template to represent referential integrity information.
Record ID  Key ID  Order  Table Name   Column Name  Key Type
1          1       1      subject      subjectid    P
2          2       1      assessment   subjectid    N
3          3       1      assessment   uniqueID     P
4          4       1      assessment   tableID      U
5          4       2      assessment   uniqueID     U
6          5       1      tupleAccess  tableID      N
7          6       2      tupleAccess  tupleID      N
8          7       1      tables       tableID      P

FK and Reference column entries (row alignment not preserved in this copy): 1, 8, 8, 4, 5

This information will be warehoused during registration and used by the BIRN data integration environment when dynamically mapping the ontological query to the available, registered data sources. Here is an example of how a conceptually-based query can be translated into SQL by using these resulting source domain graphs and mapping information.
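
To make the idea concrete, the following is a minimal Python sketch of such a translation. It is an illustration under stated assumptions, not the mediator's actual implementation. It uses only the three mapping rows from the ontology mapping example and the single foreign key described in Example 2 (assessment.subjectid referencing subject.subjectid). The class names, the function `concept_to_sql`, and the `%s` placeholder style are hypothetical choices made for this sketch.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class ConceptMapping:
    """One row of the ontology mapping table (hypothetical structure)."""
    source: str
    concept_id: str
    table: str
    column: Optional[str] = None
    value: Optional[str] = None


@dataclass(frozen=True)
class ForeignKey:
    """A foreign key column and the primary key column it references."""
    table: str
    column: str
    ref_table: str
    ref_column: str


# Mapping rows copied from the ontology mapping example.
MAPPINGS = [
    ConceptMapping("umls", "C0080105", "subject"),
    ConceptMapping("umls", "C0206334", "subject", "strain"),
    ConceptMapping("umls", "C0025914", "subject", "popular_name", "mouse"),
]

# The one foreign key stated in Example 2.
FOREIGN_KEYS = [ForeignKey("assessment", "subjectid", "subject", "subjectid")]


def concept_to_sql(concept_id: str, target_table: str) -> Tuple[str, List[str]]:
    """Build a parameterized SELECT on target_table restricted to a concept.

    Only value-level mappings are handled. Returns (sql, parameters).
    """
    mapping = next(
        (m for m in MAPPINGS
         if m.concept_id == concept_id and m.value is not None),
        None,
    )
    if mapping is None:
        raise LookupError(f"no value-level mapping for concept {concept_id}")

    sql = f"SELECT {target_table}.* FROM {target_table}"
    if target_table != mapping.table:
        # Use the registered key information to find the join path.
        fk = next(
            (k for k in FOREIGN_KEYS
             if k.table == target_table and k.ref_table == mapping.table),
            None,
        )
        if fk is None:
            raise LookupError(
                f"no registered key links {target_table} to {mapping.table}"
            )
        sql += (
            f" JOIN {fk.ref_table}"
            f" ON {fk.table}.{fk.column} = {fk.ref_table}.{fk.ref_column}"
        )
    # Assumption: %s placeholders, as used by common Python MySQL drivers.
    sql += f" WHERE {mapping.table}.{mapping.column} = %s"
    return sql, [mapping.value]


sql, params = concept_to_sql("C0025914", "assessment")
print(sql)
# SELECT assessment.* FROM assessment JOIN subject ON assessment.subjectid = subject.subjectid WHERE subject.popular_name = %s
print(params)
# ['mouse']
```

The sketch shows why both registration artifacts are needed: the mapping row locates the concept in a table, column and value, while the key record supplies the join that the database itself does not declare.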