Registering a Data Source with the BIRN Data Integration
Environment: Requirements
One of the goals of the BIRN project is to provide a means for data repositories produced
across the test-beds to be seamlessly analyzed, searched or browsed as a whole. The
BIRN data integration environment is the critical enabling technology that makes this
possible. In order for the data integration environment to add another BIRN site’s data
source to this mix, a source must first register itself. Ontologies play a critical role in
bridging information from the new source to sources that are already a part of the
integrated data environment. The data integration environment will use concepts
specified according to formal ontological knowledge resources to provide users of the
“pooled” BIRN data repository a means to determine conceptually related data across the
disparate repositories, the specifics of those semantic relationships and how to use them
to group conceptually related data elements for statistical analysis.
In order to perform this task, each source needs to provide a semantic description of the
relevant knowledge domain(s) in terms of known ontological concepts and map these
ontological concepts to their specific location in the data source database’s schema (table,
field or value).
All concepts in use by the BIRN databases should be referenced to core concepts
contained in the BONFIRE, a collection of cross-mapped terminologies and ontologies
maintained by the BIRN. The BONFIRE currently contains the UMLS, the latest version
of Neuronames, and concepts related to these two core terminologies provided by
individual users. Additional knowledge sources will be added upon request and review.
BONFIRE allows users to propose a new concept and to provide the relationship between
the proposed concept and existing concepts.
Mapping the database: Each source is expected to provide a mapping between the
values in the database and the BONFIRE. Users are also encouraged to map table names
and field (column) names to provide a semantic mark up of the database schema. The
purpose of this exercise is to:
1) Provide semantic information on the content and structure of the database that
is both human and machine-readable
2) Link concepts contained in source databases to larger conceptual networks for
the purposes of data integration across scales, species and disciplines
At the time of source registration, each source should create an XML file or an Excel
spreadsheet with the following information:

Ontological source   Ontological ID   Table Name   Column Name    Value   Data type
umls                 C0080105         subject
umls                 C0206334         subject      strain                 varchar2
umls                 C0025914         subject      popular_name   mouse   varchar2

The first column defines the ontological source. The second column provides the
ontological concept id. This ID is a unique identifier provided by the source
ontology/terminology. Table and column names refer to the source database Table and
column (field) names respectively, while the “value” refers to the concept entered into the
database. The “Data type” refers to the type of data defined for that field in the database.
In the example above, the table name “subject” has been mapped in the 1st row; the field
name “strain” has been mapped in the 2nd row and the value “mouse” has been mapped
in the 3rd row.
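
As a minimal sketch of how such a mapping file could be produced, the Python below encodes the three example rows and serializes them to XML. The document does not define an XML schema, so the element names (`mappings`, `mapping`), the `level` attribute, and the field names are assumptions made for illustration only.

```python
import xml.etree.ElementTree as ET

# Field names mirror the six columns described above.
# The XML element and attribute names are hypothetical, not a BIRN schema.
FIELDS = ["ontological_source", "ontological_id", "table_name",
          "column_name", "value", "data_type"]

# The three example rows; None marks a cell left empty in the table.
MAPPINGS = [
    ("umls", "C0080105", "subject", None, None, None),
    ("umls", "C0206334", "subject", "strain", None, "varchar2"),
    ("umls", "C0025914", "subject", "popular_name", "mouse", "varchar2"),
]


def mapping_level(row):
    """Return which schema level a row annotates: table, column or value."""
    record = dict(zip(FIELDS, row))
    if record["value"] is not None:
        return "value"
    if record["column_name"] is not None:
        return "column"
    return "table"


def mappings_to_xml(rows):
    """Serialize mapping rows to an XML string, skipping empty cells."""
    root = ET.Element("mappings")
    for row in rows:
        entry = ET.SubElement(root, "mapping", level=mapping_level(row))
        for name, cell in zip(FIELDS, row):
            if cell is not None:
                ET.SubElement(entry, name).text = cell
    return ET.tostring(root, encoding="unicode")


xml_text = mappings_to_xml(MAPPINGS)
```

Deriving the level (table, column or value) from which cells are filled in keeps the file format identical to the spreadsheet layout while still making the three kinds of mapping easy to tell apart.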
Adding concepts to BONFIRE: The BIRN is defining a set of core
ontologies/terminologies for mapping. Copies of these will be maintained in the
BONFIRE database. If the source ontology does not contain the required concept, it may
be added using the BONFIRE tool. Choose “Add New Concept” under the pull down
menu. If the new term is a synonym of an existing term, provide the existing term ID in
the designated box. For each new concept added, the user is asked to provide the
concept, a definition of the concept and the Semantic Type. A list of Semantic Types
currently in use by UMLS can be found at:
http://www.nlm.nih.gov/research/umls/META3_current_semantic_types.html.
An additional comment field is also available for any notes that may be required for curation.
Users should also enter their name under “Your ID”. Once a concept is added, it is
assigned a unique identifier. To distinguish user-defined terms from those in the existing
ontologies, the prefix “BF_” is prepended to the concept ID, e.g., BF_T0000022. New terms
are designated with a “U” for “Uncurated”. The OTF will set policies for curation of
BIRN concepts.
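
The fields requested by the “Add New Concept” form can be pictured as a small record. The sketch below is only an illustration of that structure: the class, its attribute names, and the example values are hypothetical, and the prefix test assumes that user-defined IDs always begin with `BF_`, as in the single example ID given above.

```python
from dataclasses import dataclass


@dataclass
class ProposedConcept:
    """Hypothetical record mirroring the fields of the submission form."""
    concept: str
    definition: str
    semantic_type: str
    submitter_id: str = ""   # the "Your ID" field
    comment: str = ""        # optional notes for curation
    synonym_of: str = ""     # ID of an existing term, if this is a synonym
    status: str = "U"        # new terms start as "U" (Uncurated)

    def missing_fields(self):
        """List the fields the user is asked to provide that are still empty."""
        required = {
            "concept": self.concept,
            "definition": self.definition,
            "semantic_type": self.semantic_type,
        }
        return [name for name, text in required.items() if not text.strip()]


def is_user_defined(concept_id):
    """True for IDs carrying the BF_ prefix (assumed from BF_T0000022)."""
    return concept_id.startswith("BF_")


# Illustrative values only.
draft = ProposedConcept(concept="example cell type", definition="",
                        semantic_type="Cell")
```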
Users may also supply one or more relationships to link proposed concepts to
existing concepts. These relationships form the basis for the multiscale and
cross-discipline data integration for BIRN because they allow concepts to be linked into
semantic networks. These networks will not only contain distinct specifications of
separate knowledge domains (e.g., neuroanatomy, structural components of neurons,
gene expression, behavioral assessment, image data provenance, etc.), but also a gradually
enriched map of how concepts across these domains relate to one another (e.g., brain
regions active during a specific behavioral task, neuroanatomical spatial maps of gene
expression, etc.).
While adding a concept to the BONFIRE is relatively straightforward, determining the
relationship or relationships that link this concept to existing concepts can be tricky.
We will be using the UMLS relationship labels. The UMLS documentation
contains the following information about relationships:
2.3.2 Relationship Labels
All relationships (outside the basic concept structure) in the Metathesaurus carry a
general label (REL), describing their basic nature, such as Broader, Narrower, Child of,
Qualifier of, etc and are identified by their source. Most of these relationships are either
directly asserted in a source vocabulary or are implied by the structure of the source
vocabulary.
About a quarter of the relationships in the Metathesaurus also carry an additional label
(RELA), obtained from a source vocabulary, that explains the nature of the relationship
more exactly, such as is_a, branch_of, component_of. The Digital Anatomist vocabulary
and RxNorm are examples of source vocabularies that include such relationship labels. A
complete list of the additional relationship labels appears in MRDOC.RRF and in
Appendix B.3 in this documentation. http://www.nlm.nih.gov/research/umls/meta2.html
BONFIRE provides both a list of general relationships and also additional labels utilized
by the UMLS source vocabularies.
We may eventually add the ability to relate concepts based on the new relations ontology
provided as a part of the Open Biomedical Ontology project
(http://obo.sourceforge.net/relationship/), now managed by the National Center for
Biomedical Ontology (http://bioontology.org/).
The simplest relationship to add is the “is a” or “parent-child” relationship. The
“parent-child” relationship is used to link members of a class to a parent class. Thus, if the
statement “X is a Y” is true, e.g., “Purkinje cell is a neuron”, then X is a child of Y and
Y is the parent of X. In this case, the superclass is “neuron” and the subclass is “Purkinje
cell”. If the relationship linking two concepts is not “is a” but some form of “has a”, e.g.,
“Purkinje cell has a dendrite”, then the relationship is not “parent-child”. Because neither
“Purkinje cell is a dendrite” nor “dendrite is a Purkinje cell” is true, this
relationship is not parent-child. When using the “parent-child” relationship, the two
concepts must be of the same semantic type.
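
The rule that parent and child must share a semantic type can be sketched as a simple check. This is an illustration, not BONFIRE's implementation: the semantic type assigned to each concept is an illustrative label rather than a UMLS lookup, and each concept is limited to a single parent to keep the example short.

```python
# Illustrative semantic types; not looked up from UMLS.
SEMANTIC_TYPE = {
    "neuron": "Cell",
    "Purkinje cell": "Cell",
    "dendrite": "Cell Component",
}

IS_A = {}  # child concept -> parent concept (single parent, for simplicity)


def add_parent_child(child, parent):
    """Record 'child is a parent' if both share a semantic type."""
    if SEMANTIC_TYPE[child] != SEMANTIC_TYPE[parent]:
        raise ValueError(
            f"{child!r} and {parent!r} have different semantic types; "
            "use a 'has a' style relationship instead"
        )
    IS_A[child] = parent


def ancestors(concept):
    """Walk the is-a links upward and return every superclass in order."""
    chain = []
    while concept in IS_A:
        concept = IS_A[concept]
        chain.append(concept)
    return chain


add_parent_child("Purkinje cell", "neuron")   # accepted: both are cells

try:
    add_parent_child("dendrite", "Purkinje cell")  # rejected: a part, not a kind
    rejected = False
except ValueError:
    rejected = True
```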
Part II. Registering a Database’s Foreign and Primary Keys with the BIRN Data Integration
Environment
Another requirement is to specify the data integrity constraints used in a source
relational database (RDBMS). For example, some MySQL sources use a MySQL table
format that does not natively support foreign key – primary key constraints. The source
curators must then track referential integrity information manually by adding referential
columns to their database schema to store cross-table link information. Curators of source
data repositories whose RDBMS does include a built-in referential integrity
mechanism, but who have chosen not to implement this feature in their schema and do not
create primary key – foreign key constraints, must do the same. The data integration
environment must ultimately have access to all available referential integrity information
kept by the source data repository in order to be effective and provide sufficient,
automated data integrity quality assurance.
This information can also be provided in the form of an XML file or Excel table that will be
incorporated into the mediator registry.
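
To illustrate why the declared key relationships matter for automated quality assurance, the sketch below looks for rows whose foreign key value has no matching primary key when the database itself enforces nothing. It is an assumption-laden example: SQLite (from the Python standard library) stands in for MySQL, the two tables and their contents are made up, and the helper `orphaned_rows` is hypothetical.

```python
import sqlite3

# In-memory database with no declared key constraints, standing in for a
# source whose table format does not enforce referential integrity.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE subject (subjectid INTEGER);
    CREATE TABLE assessment (uniqueID INTEGER, subjectid INTEGER);
    INSERT INTO subject VALUES (1), (2);
    INSERT INTO assessment VALUES (10, 1), (11, 2), (12, 99);
""")


def orphaned_rows(conn, child, child_col, parent, parent_col):
    """Return child key values with no matching row in the parent table.

    Table and column names are interpolated into the SQL text, so they
    must come from a trusted registry, never from user input.
    """
    sql = (
        f"SELECT c.{child_col} FROM {child} AS c "
        f"LEFT JOIN {parent} AS p ON c.{child_col} = p.{parent_col} "
        f"WHERE p.{parent_col} IS NULL AND c.{child_col} IS NOT NULL"
    )
    return [row[0] for row in conn.execute(sql)]


# assessment.subjectid is declared (outside the database) as a foreign key
# to subject.subjectid; the row with subjectid 99 has no matching subject.
orphans = orphaned_rows(conn, "assessment", "subjectid", "subject", "subjectid")
```

Without the registered key information, a check like this has no way of knowing which column pairs to compare.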
Example 2: A clinical database on a MySQL server that does not support foreign keys.
This database has multiple tables.
Table “subject” has column “subjectid” that is the primary key, and table “assessment”
has column “subjectid” that is a foreign key reference to table “subject” column
“subjectid”. This information is reflected in the table, where Record ID identifies each
unique entry in the table, Key ID identifies each unique key (composite keys are
represented as multiple rows with the same Key ID), and Order denotes the position of a
column in a key (important for composite keys). Table Name and Column Name
identify the columns in the key. Key Type denotes whether the key is a
Primary (P) or Unique (U) key. FK refers to any foreign key references from a column
to the target primary key column. Reference denotes any intra-table references for a
specific column.
Here is the template to represent referential integrity information.

Record ID   Key ID   Order   Table Name    Column Name   Key Type
1           1        1       subject       subjectid     P
2           2        1       assessment    subjectid     N
3           3        1       assessment    uniqueID      P
4           4        1       assessment    tableID       U
5           4        2       assessment    uniqueID      U
6           5        1       tupleAccess   tableID       N
7           6        2       tupleAccess   tupleID       N
8           7        1       tables        tableID       P

FK / Reference values as listed (row alignment not recoverable): 1, 8, 8, 4, 5

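
One way to read the template is to group its rows by Key ID, so that a composite key comes out as an ordered list of columns. The sketch below does this for the eight rows of the example. The tuple layout and the function name are assumptions, and the FK and Reference values are left out because their row assignment is not clear from the table as given.

```python
from collections import defaultdict

# (record_id, key_id, order, table_name, column_name, key_type)
KEY_RECORDS = [
    (1, 1, 1, "subject", "subjectid", "P"),
    (2, 2, 1, "assessment", "subjectid", "N"),
    (3, 3, 1, "assessment", "uniqueID", "P"),
    (4, 4, 1, "assessment", "tableID", "U"),
    (5, 4, 2, "assessment", "uniqueID", "U"),
    (6, 5, 1, "tupleAccess", "tableID", "N"),
    (7, 6, 2, "tupleAccess", "tupleID", "N"),
    (8, 7, 1, "tables", "tableID", "P"),
]


def group_keys(records):
    """Collect rows sharing a Key ID into one key with ordered columns."""
    parts_by_key = defaultdict(list)
    for _record_id, key_id, order, table, column, key_type in records:
        parts_by_key[key_id].append((order, table, column, key_type))

    keys = {}
    for key_id, parts in parts_by_key.items():
        parts.sort(key=lambda part: part[0])  # honor the Order column
        keys[key_id] = {
            "table": parts[0][1],
            "type": parts[0][3],
            "columns": [column for _order, _table, column, _type in parts],
        }
    return keys


keys = group_keys(KEY_RECORDS)
composite = keys[4]  # the two-column unique key on "assessment"
```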
This information will be warehoused during registration and used by the BIRN data
integration environment when dynamically mapping the ontological query to the
available, registered data sources.
Here is an example of how a conceptually-based query can be translated into SQL by
using these resulting source domain graphs and mapping information.
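
As a much-simplified sketch of the idea (not the mediator's actual algorithm), a concept ID can be looked up in the registered mappings and turned into a query against the table, column or value it annotates. The function name and dictionary layout are hypothetical, the mappings reuse the three rows from the mapping example, and joins across tables via the registered keys are not attempted.

```python
# Registered mappings, reusing the three rows of the mapping example.
MAPPINGS = [
    {"concept_id": "C0080105", "table": "subject",
     "column": None, "value": None},
    {"concept_id": "C0206334", "table": "subject",
     "column": "strain", "value": None},
    {"concept_id": "C0025914", "table": "subject",
     "column": "popular_name", "value": "mouse"},
]


def concept_to_sql(concept_id, mappings):
    """Return (sql, params) for the first mapping of a concept.

    Table and column names come from the trusted registry; the data
    value is passed as a bound parameter rather than pasted into the SQL.
    """
    for m in mappings:
        if m["concept_id"] != concept_id:
            continue
        if m["value"] is not None:
            # Concept mapped to a value: select the rows holding it.
            return (f'SELECT * FROM {m["table"]} WHERE {m["column"]} = ?',
                    (m["value"],))
        if m["column"] is not None:
            # Concept mapped to a column: select that column.
            return (f'SELECT {m["column"]} FROM {m["table"]}', ())
        # Concept mapped to a whole table.
        return (f'SELECT * FROM {m["table"]}', ())
    raise KeyError(f"no registered mapping for concept {concept_id}")


sql, params = concept_to_sql("C0025914", MAPPINGS)
```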