Why Bother with VIVO (the ontology)?

The Pragmatics of Ontology and
Heterogeneous Data Sources
The Ins and Outs of CTSAsearch
David Eichmann
School of Library and Information Science
University of Iowa
Research Networking
• Programmatic support for discovery and
use of research and scholarly
information regarding people and
resources.
• Research networking (RN) systems are
essentially special-purpose institutional
knowledge management systems.
Representative RN Systems
• Profiles (Harvard)
• VIVO (VIVO Consortium)
• Loki (Iowa)
• SciVal Experts (aka Pure – Elsevier)
• A number of others
Why Bother with VIVO
(the ontology)?
• Words in a profile are just sequences of
characters carrying no meaning
– Try asking Google Scholar what grant
funded a given hit…
• With structure and relationship comes
meaning, aka semantics
– Enter the Semantic Web!
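
To make the contrast concrete, here is a minimal sketch in Python. Every identifier in it (ex:paper1, ex:grant42, ex:fundedBy, the grant label) is invented for illustration and is not an actual VIVO term. The only point is that a funding question becomes a simple pattern match once the relationship is stated explicitly, whereas the flat profile string offers nothing to match against.

```python
# Hypothetical data: every identifier below is made up for illustration.
profile_text = "Smith J. Gene expression in yeast. Supported by grant R01-0000."

# The same facts expressed as subject-predicate-object triples.
triples = [
    ("ex:paper1", "ex:title", "Gene expression in yeast"),
    ("ex:paper1", "ex:author", "ex:smith"),
    ("ex:paper1", "ex:fundedBy", "ex:grant42"),
    ("ex:grant42", "ex:label", "R01-0000"),
]


def objects(subject, predicate):
    """Return all objects of triples matching (subject, predicate, ?)."""
    return [o for s, p, o in triples if s == subject and p == predicate]


def funding_labels(paper):
    """Follow paper -> fundedBy -> label, a two-step graph traversal."""
    return [
        label
        for grant in objects(paper, "ex:fundedBy")
        for label in objects(grant, "ex:label")
    ]


print(funding_labels("ex:paper1"))
```

With the string alone, answering the same question would require parsing free text; with the triples, it is two lookups.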
Connecting the Dots
• The real challenge here is translation of
information already in existence in
scattered sources
– Research networking tools
– Citation databases (e.g., PubMed)
– Award databases (e.g., NIH RePORTER)
– Curated archives (e.g., GenBank)
– Locked up in text (the research literature)
CTSAsearch – version 1
• 10 SPARQL endpoints
• 19 institutions
• 124,945 individuals
• Proved challenging for some sites to
handle the queries
CTSAsearch – version 1
      subclass      |  count
--------------------+---------
 NonFacultyAcademic | 2592383
 FacultyMember      |   26826
 NonAcademic        |   15268
 EmeritusFaculty    |    2134
 EmeritusProfessor  |    2070
 Postdoc            |    1226
 Librarian          |     232
 Student            |      89
 GraduateStudent    |      71
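
A per-subclass tally like the one above amounts to grouping rdf:type assertions by class. The slides do not show the query that was actually run, so the following is only a sketch of the idea in Python over in-memory triples. The person URIs are invented, and the `vivo:` prefixed class names simply mirror the table rows.

```python
from collections import Counter

RDF_TYPE = "rdf:type"

# Toy triples: the person URIs are invented, the class names mirror the table.
triples = [
    ("ex:p1", RDF_TYPE, "vivo:FacultyMember"),
    ("ex:p2", RDF_TYPE, "vivo:FacultyMember"),
    ("ex:p3", RDF_TYPE, "vivo:Postdoc"),
    ("ex:p3", "rdfs:label", "A. Person"),
    ("ex:p1", RDF_TYPE, "vivo:FacultyMember"),  # duplicate assertion
]


def subclass_counts(triples):
    """Count distinct subjects per rdf:type object, most frequent first."""
    # A set removes repeated (subject, class) assertions before counting.
    seen = {(s, o) for s, p, o in triples if p == RDF_TYPE}
    return Counter(cls for _, cls in seen).most_common()


for cls, n in subclass_counts(triples):
    print(f"{cls:<22}| {n}")
```

Deduplicating on (subject, class) matters when the same assertion is harvested from more than one source, which is plausible once several sites describe the same person.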
CTSAsearch – version 2
• 10 SPARQL endpoints (19 institutions)
• 15 VIVO sites
– Harvested with customized crawler
• 14 Profile sites
– Harvested with customized crawler
CTSAsearch – version 2
      subclass      |  count
--------------------+---------
 NonFacultyAcademic | 2592885
 FacultyMember      |   55499
 NonAcademic        |   15430
 Student            |   11074
 GraduateStudent    |   10951
 EmeritusFaculty    |    3096
 EmeritusProfessor  |    2072
 Postdoc            |    1410
 Librarian          |     264
CTSAsearch – architecture
• 1 VIVO-based SPARQL harvester
• 2(!) VIVO-based crawlers
• 1 Profiles-based crawler
• 2 Platform-specific HTML crawlers
• 1 CSV-based loader
CTSAsearch – architecture
[Figure: architecture diagram. Box labels: five "External" sources, each
paired with a "Staging" area and fed by one of VIVO SPARQL, VIVO Crawl,
Profiles Crawl, HTML Crawl, or CSV Load; Unified Internal; D2RQ RDF
Mapping; VIVO Ontology; SPARQL Endpoint; CTSAsearch; Analytics; MEDLINE
(NLM); Scopus (Elsevier); ORCID; PMID2DOI (OCLC)]
CTSAsearch – current
• 45,456,417 VIVO-derived triples
• 48,569,115 Profiles-derived triples
Recent Work
• Cross-linkage across sites
– Resolving ‘stubs’
– Formation of a single ecosystem
• Macro concerns
– Institution-scale analytics
– Pondering reflection
Current “profile”
CTSAsearch/Polyglot –
version x
• Temporary SPARQL endpoint:
– http://marengo.info-science.uiowa.edu:2020
• Shared visualization widgets
– Intended for embedding in institutional sites
• Community-wide sameAs assertions
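
One way to picture what community-wide sameAs assertions buy: pairwise statements that two URIs denote the same person can be merged into identity clusters that span sites. The sketch below uses a union-find structure in Python. It is an illustration under stated assumptions, not the CTSAsearch implementation, and the example URIs are hypothetical.

```python
def same_as_clusters(pairs):
    """Group URIs into equivalence classes from (a, b) sameAs pairs."""
    parent = {}

    def find(x):
        # Register unseen URIs as their own root, then walk to the root.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps trees shallow
            x = parent[x]
        return x

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b  # union the two classes

    clusters = {}
    for uri in list(parent):
        clusters.setdefault(find(uri), set()).add(uri)
    # Sort for a deterministic result: members within a cluster, then clusters.
    return sorted((sorted(c) for c in clusters.values()), key=lambda c: c[0])


# Hypothetical URIs for the same people at three different sites.
pairs = [
    ("http://a.example/person/1", "http://b.example/profile/77"),
    ("http://b.example/profile/77", "http://c.example/id/x"),
    ("http://a.example/person/2", "http://c.example/id/y"),
]
for cluster in same_as_clusters(pairs):
    print(cluster)
```

Because sameAs is transitive, the first and third URIs end up in one cluster even though no assertion links them directly.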
Pattuelli’s Spectrum of
Relationships (2012)
http://www.oclc.org/content/dam/research/grants/reports/2012/pattuelli2012.pdf
[Figure: the spectrum annotated first with “RN Tools”, then with
“LinkedIn” and “RN Tools”]
Pattuelli’s Spectrum of
Relationships (2012)
• Ontologies used
– foaf (Friend of a Friend)
– rel (Relationship)
– mo (Music)
• Echoes of Trigg’s link taxonomy
– Trigg, R. 1983. A Network-Based Approach to Text Handling
for the Online Scientific Community. Ph.D. dissertation,
Department of Computer Science, University of Maryland,
technical report TR-1346.
Connecting the Dots – Take 2
Figure courtesy of
Melissa Haendel, OHSU
PubMed Central Open Access
• 886,172 papers (as of 1/1/15)
• 423,764 with acknowledgements
• 994,931 sentences
• 4,329,972 parses
The Simple Cases
• PMCID: 3008610
• SeqNum: 2
• SentNum: 6
• Sentence: EK analysed the data.
• POS: [EK/NNP, analysed/VBD, the/DT, data/NNS, ./.]
• Parse: [S
    [NP EK/NNP ]
    [VP analysed/VBD
      [NP the/DT data/NNS ]
    ] ./. ]
And the Not So Simple…
• PMCID: 4159542
• Sentence: We thank Sheila Harvey, Clinical Trials Unit Manager at
ICNARC, and Ruth Canter, Trials Administrator at ICNARC, for their
assistance in chasing completed surveys; Dr Kevin Gunning for early
advice and project development; Drs Neill K. J. Adhikari and Gordon D.
Rubenfeld for feedback and discussion of analysis plan; Dr Chris AKY
Chong for his valuable comments on the initial draft of this manuscript;
and our Responders: Addenbrooke’s Hospital ( Dr Kevin Gunning ),
Airedale General Hospital ( Dr John Scriven ), Alexandra Hospital ( Dr
Tracey Leach ), Arrowe Park Hospital ( Dr Lawrence Wilson ), Barnet
Hospital ( Dr AH Wolff ), …
• Sentence length: 8,245 characters
Extract Entities/Relationships
with Syntactic Queries
• [S [NP:Author NN:Author ] [VP NN [NP:Person ] [PP ] , [PP ] ] ]
• S <1NP:Author <2[VP <1/thank/ <2(NP) <3(PP) ]
– For the sentence having this pattern, match the object noun
phrase and the next prepositional phrase
• NP <#2 <1(NNP) <2(NNP)
– For the noun phrase, extract two proper nouns
• PP <#2 <1DT <2(NP)
– For the prepositional phrase, match the noun phrase
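
The queries above can be read as structural tests on a parse tree. The sketch below re-expresses that idea in plain Python. It does not implement the query syntax shown on the slide, and the nested-tuple tree encoding, the function names, and the example sentence are all assumptions made for illustration.

```python
import re

# A node is (label, child, ...); a leaf is (pos_tag, word).
# Invented example: "We thank John Smith for helpful discussions."
tree = (
    "S",
    ("NP", ("PRP", "We")),
    ("VP",
     ("VBP", "thank"),
     ("NP", ("NNP", "John"), ("NNP", "Smith")),
     ("PP", ("IN", "for"), ("NP", ("JJ", "helpful"), ("NNS", "discussions")))),
    (".", "."),
)


def is_leaf(node):
    return len(node) == 2 and isinstance(node[1], str)


def words(node):
    """Flatten a subtree back into its words, left to right."""
    if is_leaf(node):
        return [node[1]]
    return [w for child in node[1:] for w in words(child)]


def match_thanks(tree):
    """Return (object NP, PP) if tree is S -> NP VP with VP -> thank NP PP."""
    if tree[0] != "S" or len(tree) < 3:
        return None
    subject, vp = tree[1], tree[2]
    if subject[0] != "NP" or vp[0] != "VP" or is_leaf(vp) or len(vp) < 4:
        return None
    verb, obj, pp = vp[1], vp[2], vp[3]
    if not (is_leaf(verb) and re.search("thank", verb[1], re.IGNORECASE)):
        return None
    if obj[0] != "NP" or pp[0] != "PP":
        return None
    return obj, pp


def proper_nouns(np):
    """Collect the NNP leaves that are direct children of a noun phrase."""
    return [c[1] for c in np[1:] if is_leaf(c) and c[0] == "NNP"]


obj, pp = match_thanks(tree)
print(proper_nouns(obj), "|", " ".join(words(pp)[1:]))
```

Positional child tests like these are brittle by design: they only fire on the simple cases, which is why sentences like the 8,245-character one need many more patterns.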
Person Results Snippet
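 ID | Title | First Name | Middle Name | Last Name
----+-------+------------+-------------+-----------
 76 |       | Hans       |             | Matrin
 77 |       | Jeff       |             | Vieira
 78 |       | P.         |             | ZAMORE
 79 |       | Eric       |             | Schon
 80 |       | Carlos     |             | Lois
 81 |       | Andrea     |             | Möll
 82 |       | Elena      |             | Govorkova
 83 | Prof. | K.         | Michael     | Pollard
 84 | Dr.   | M.         |             | Berton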
Relationships for Person 77
 PMCID   | Category      | PP
---------+---------------+--------------------------------------------------------
 4006053 | Support       | the kind gift of rKSHV.219
 4006053 | Support       | the kind gift of rKSHV.219 and for helpful discussions
 4006053 | Collaboration | helpful discussions
Relationships for Person 79
 PMCID   | Category      | PP
---------+---------------+------------------------------------
 2801706 | Resource      | the rabbit polyclonal antibody
 2801706 | Resource      | the ECFP and EYFP plasmids
 4013013 | Collaboration | his helpful advice and discussions
Category Frequencies
Category
Count
Collaboration
47,052
46,327
Technique
33,598
Resource
8,894
Support
6,836
Event
3,744
Project
854
Place Name
229
Publication
Component
210
Place
186
Organization
93
Next Steps
• Continue slogging through extraction pattern
definition
• Define patterns for
– funding declarations
– chairs, fellowships, etc.
• Merge data into CTSAsearch visualizations
• Align current category scheme with Melissa
Haendel’s current draft ontology for CASRAI
taxonomy and then merge with VIVO-ISF
In the Next Year
• Joint work with Melissa Haendel (OHSU) on
administrative supplement to OHSU’s CTSA
bridging RNs and NIH’s SciENcv
– Map SciENcv data model to VIVO-ISF
– Enable bi-directional data exchange
– Integrate clinical/trial data sources
– Integrate SciENcv, ORCID data into CTSAsearch
– Multi-granularity search and visualization
Questions?
• Email: [email protected]