Requirements of a Taxonomy Database: Tcl-DB, a Prototype

Outline
1. Requirements
• Hierarchy
• Alternative Search Terms: Synonyms and Vernaculars
• Alternative Spellings
• Alternative Classifications
2. Tcl-DB Prototype System
• Tcl-DB Structure
• 2NF
3. Extensible: Adding a new data source, e.g. NCBI
4. Tcl-DB: UID Tracking
5. Tcl-DB: Stats
6. Utility and Further Work
1. Hierarchy
2. Alternative Search Terms: Synonyms and Vernaculars
3. Alternative Spellings: Caenorabditis elegans, C elegans and Caenorhabditis elegans
4. Alternative Classifications:
Tcl-DB Prototype System. Proposed Architecture
Tcl-DB: Logical Structure
Tcl-DB Physical Database Structure
Assertion:
Resolving the M:M with an association entity
Node:
Hierarchical Queries
Nested Set, Path and Connect by
>select count(name_id) from node
start with name_id = '100891'
connect by prior name_id = parent_name_id;
>select count(name_id) from node
where path like '/%';
>select count(name_id) from node
where left_id between 1 and 9290;
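The three queries above count the same kind of subtree in three ways: by following parent links, by matching a stored path, and by a nested-set range. The sketch below reproduces that comparison on a tiny invented tree using Python's sqlite3 module. SQLite has no CONNECT BY, so a recursive CTE stands in for it. The column names name_id, parent_name_id, path and left_id come from the queries; the right_id column and the five sample rows are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE node (
        name_id        TEXT PRIMARY KEY,
        parent_name_id TEXT,
        path           TEXT,     -- materialised path from the root
        left_id        INTEGER,  -- nested-set left bound
        right_id       INTEGER   -- nested-set right bound (assumed column)
    )
""")
# Hypothetical tree: 1 -> (2 -> (3, 4), 5)
conn.executemany(
    "INSERT INTO node VALUES (?, ?, ?, ?, ?)",
    [
        ("1", None, "/1", 1, 10),
        ("2", "1", "/1/2", 2, 7),
        ("3", "2", "/1/2/3", 3, 4),
        ("4", "2", "/1/2/4", 5, 6),
        ("5", "1", "/1/5", 8, 9),
    ],
)


def count_by_parent_link(root_id):
    """Recursive CTE, standing in for Oracle's START WITH / CONNECT BY."""
    sql = """
        WITH RECURSIVE subtree(name_id) AS (
            SELECT name_id FROM node WHERE name_id = ?
            UNION ALL
            SELECT n.name_id
            FROM node n JOIN subtree s ON n.parent_name_id = s.name_id
        )
        SELECT count(name_id) FROM subtree
    """
    return conn.execute(sql, (root_id,)).fetchone()[0]


def count_by_path(root_path):
    """Path model: the root itself plus every path that extends it."""
    sql = "SELECT count(name_id) FROM node WHERE path = ? OR path LIKE ? || '/%'"
    return conn.execute(sql, (root_path, root_path)).fetchone()[0]


def count_by_nested_set(left, right):
    """Nested-set model: descendants have left_id inside the root's bounds."""
    sql = "SELECT count(name_id) FROM node WHERE left_id BETWEEN ? AND ?"
    return conn.execute(sql, (left, right)).fetchone()[0]


print(count_by_parent_link("2"), count_by_path("/1/2"), count_by_nested_set(2, 7))
```

All three calls should agree on the size of the subtree rooted at node 2. The path and nested-set versions avoid recursion at query time, at the cost of maintaining extra columns on every load.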
synonym_name and vernacular:
subtypes, multi-valued attributes or weak entities
Tcl-DB: 2NF
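[Diagram: the ASSERTION table shown in two versions, with Rank and Kingdom lookup tables]
ASSERTION (RANK, KINGDOM columns): ASSERTION_ID (PK), NAME_ID, SOURCE_ID, DBSOURCE_ID, AID, NID, RANK, KINGDOM
ASSERTION (RANK_ID, KINGDOM_ID columns): ASSERTION_ID (PK), NAME_ID, SOURCE_ID, DBSOURCE_ID, AID, NID, RANK_ID, KINGDOM_ID
Rank: RANK_ID, RANK_NAME, SOURCE_ID
Kingdom: KINGDOM_ID
Also shown: a column group NAME_ID, NAME_TEXT, SOURCE_ID whose table label did not survive, and index markers I1, I2 against some ASSERTION columns.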
Tcl-DB: Procedures, Packages and Functions:
Adding a new data source, e.g. NCBI
Step 1: Build views showing which names are already in the database
Step 2: Move names from the views to the tcl schema
Step 3: Fill the nodes table in the tcl schema
Step 4: Fill the synonym_name table in the tcl schema
Step 5: Fill the vernacular table in the tcl schema
Tcl-DB: UID Tracking
After name data load:
1. Run two joins on name and nids_mv
• NIDs – name_id when the name_text exists
• Null – name_id when the name_text does not exist
2. Update name and give all new names a NID
3. Update name and give all previously known names their original NID
4. Refresh the NID_view
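A minimal sketch of this NID bookkeeping, assuming nids_mv maps name_text to nid and that name carries a nid column. Both table layouts, the track_nids helper and the sample rows are assumptions. An ordinary table is used in place of a materialised view, because SQLite has none, and the refresh is done by rebuilding it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE name (name_id INTEGER PRIMARY KEY, name_text TEXT, nid INTEGER);
    -- Stand-in for the materialised view of previously issued NIDs.
    CREATE TABLE nids_mv (name_text TEXT PRIMARY KEY, nid INTEGER);

    INSERT INTO nids_mv VALUES ('Homo sapiens', 1), ('Caenorhabditis elegans', 2);
    -- A fresh load: name_id values are new, NIDs are not yet assigned.
    INSERT INTO name (name_text) VALUES
        ('Homo sapiens'), ('Drosophila melanogaster'), ('Caenorhabditis elegans');
""")


def track_nids(conn):
    # Step 1: join name to nids_mv; nid is None where the name_text is new.
    joined = conn.execute("""
        SELECT n.name_id, m.nid
        FROM name n LEFT JOIN nids_mv m ON m.name_text = n.name_text
        ORDER BY n.name_id
    """).fetchall()
    next_nid = conn.execute(
        "SELECT coalesce(max(nid), 0) + 1 FROM nids_mv"
    ).fetchone()[0]

    # Step 2: every new name receives a fresh NID.
    for name_id, nid in joined:
        if nid is None:
            conn.execute(
                "UPDATE name SET nid = ? WHERE name_id = ?", (next_nid, name_id)
            )
            next_nid += 1

    # Step 3: every previously known name gets its original NID back.
    for name_id, nid in joined:
        if nid is not None:
            conn.execute("UPDATE name SET nid = ? WHERE name_id = ?", (nid, name_id))

    # Step 4: refresh the NID lookup so the next load sees the new NIDs.
    conn.execute("DELETE FROM nids_mv")
    conn.execute("INSERT INTO nids_mv SELECT name_text, nid FROM name")
    conn.commit()


track_nids(conn)
print(conn.execute("SELECT name_text, nid FROM name ORDER BY nid").fetchall())
```

The point of the design is that name_id may change between loads while the NID attached to a given name_text stays stable.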
Tcl-DB: Utility and Further Work
Computing Interesting Stats:
• How much overlap is there between ITIS and NCBI?
• How many names are unique to NCBI?
• How many of these are binomials vs. 'environmental sample 256'?
• How many of these names can be matched allowing for 1–3 letter mismatches?
• NCBI taxonomy: data quality, integrity and usability?
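The first four questions above could be approached along these lines. This is only a sketch on invented in-memory sets: the two name sets, the binomial regular expression and the use of Levenshtein edit distance to model "1–3 letter mismatches" are all assumptions, and an all-pairs comparison like this would need indexing or blocking to scale to the full name lists.

```python
import re

# Hypothetical samples standing in for the ITIS and NCBI name lists.
itis = {"Caenorhabditis elegans", "Homo sapiens"}
ncbi = {
    "Homo sapiens",
    "Caenorabditis elegans",      # misspelling used earlier in the slides
    "environmental sample 256",
    "Mus musculus",
}

# Assumed shape of a binomial: capitalised genus, lower-case epithet.
BINOMIAL = re.compile(r"^[A-Z][a-z]+ [a-z]+$")


def edit_distance(a, b):
    """Levenshtein distance computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,               # deletion
                cur[j - 1] + 1,            # insertion
                prev[j - 1] + (ca != cb),  # substitution or match
            ))
        prev = cur
    return prev[-1]


overlap = itis & ncbi
ncbi_only = ncbi - itis
binomials = {n for n in ncbi_only if BINOMIAL.match(n)}
near_matches = {
    n: m
    for n in ncbi_only
    for m in itis
    if 1 <= edit_distance(n, m) <= 3
}

print(len(overlap), len(ncbi_only), len(binomials), near_matches)
```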
Transitively closing the Synonyms Table and Vernacular Table
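One possible reading of transitive closure here: if A is recorded as a synonym of B and B of C, then A and C should also be linked. The sketch below groups names with a union-find structure and then expands each group into all ordered pairs. It assumes synonymy is treated as symmetric, which the slides do not state, and the names A to E are placeholders rather than real taxa.

```python
from collections import defaultdict


def synonym_groups(pairs):
    """Group names connected by any chain of synonym pairs (union-find)."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    groups = defaultdict(set)
    for name in parent:
        groups[find(name)].add(name)
    return list(groups.values())


def closed_pairs(pairs):
    """All ordered synonym pairs implied by the input pairs."""
    return {
        (a, b)
        for group in synonym_groups(pairs)
        for a in group
        for b in group
        if a != b
    }


closure = closed_pairs([("A", "B"), ("B", "C"), ("D", "E")])
print(sorted(closure))
```

The same grouping could be applied to the vernacular table, though whether common names should be closed transitively is a modelling decision the slides leave open.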
Building an interface.
• Spell checkers
Lots of Questions?
How do we use this to build taxonomically aware databases?
How about updates to the data?
Database links, web services, simple DB cross-references?
Use the GenBank model?
Open to Suggestions/Ideas!
Do we need to think about:
PhyloCode?
Type Specimens?