The Pragmatics of Ontology and Heterogeneous Data Sources
The Ins and Outs of CTSAsearch
David Eichmann
School of Library and Information Science
University of Iowa

Research Networking
• Programmatic support for discovery and use of research and scholarly information regarding people and resources.
• Research networking (RN) systems are essentially special-purpose institutional knowledge management systems.

Representative RN Systems
• Profiles (Harvard)
• VIVO (VIVO Consortium)
• Loki (Iowa)
• SciVal Experts (aka Pure – Elsevier)
• A number of others

Why Bother with VIVO (the ontology)?
• Words in a profile are just sequences of characters carrying no meaning
  – Try asking Google Scholar what grant funded a given hit…
• With structure and relationship comes meaning, aka semantics
  – Enter the Semantic Web!

Connecting the Dots
• The real challenge here is translation of information already in existence in scattered sources
  – Research networking tools
  – Citation databases (e.g., PubMed)
  – Award databases (e.g., NIH RePORTER)
  – Curated archives (e.g., GenBank)
  – Locked up in text (the research literature)

CTSAsearch – version 1
• 10 SPARQL endpoints
• 19 institutions
• 124,945 individuals
• Handling the queries proved challenging for some sites

      subclass      |  count
--------------------+---------
 NonFacultyAcademic | 2592383
 FacultyMember      |   26826
 NonAcademic        |   15268
 EmeritusFaculty    |    2134
 EmeritusProfessor  |    2070
 Postdoc            |    1226
 Librarian          |     232
 Student            |      89
 GraduateStudent    |      71

CTSAsearch – version 2
• 10 SPARQL endpoints (19 institutions)
• 15 VIVO sites
  – Harvested with customized crawler
• 14 Profiles sites
  – Harvested with customized crawler

      subclass      |  count
--------------------+---------
 NonFacultyAcademic | 2592885
 FacultyMember      |   55499
 NonAcademic        |   15430
 Student            |   11074
 GraduateStudent    |   10951
 EmeritusFaculty    |    3096
 EmeritusProfessor  |    2072
 Postdoc            |    1410
 Librarian          |     264

CTSAsearch – architecture
• 1 VIVO-based SPARQL harvester
• 2(!) VIVO-based crawlers
• 1 Profiles-based crawler
• 2 Platform-specific HTML crawlers
• 1 CSV-based loader

[Figure: architecture diagram. Component labels: SPARQL Endpoint; VIVO Ontology; CTSAsearch; D2RQ RDF Mapping; Analytics; Unified Internal; Staging (x5); External (x5): VIVO SPARQL, VIVO Crawl, Profiles Crawl, HTML Crawl, CSV Load. Linked sources: MEDLINE (NLM), Scopus (Elsevier), ORCID, PMID2DOI (OCLC).]

CTSAsearch – current
• 45,456,417 VIVO-derived triples
• 48,569,115 Profiles-derived triples

Recent Work
• Cross-linkage across sites
  – Resolving ‘stubs’
  – Formation of a single ecosystem
• Macro concerns
  – Institution-scale analytics
  – Pondering reflection

Current “profile”

CTSAsearch/Polyglot – version x
• Temporary SPARQL endpoint:
  – http://marengo.info-science.uiowa.edu:2020
• Shared visualization widgets
  – Intended for embedding in institutional sites
• Community-wide sameAs assertions

Pattuelli’s Spectrum of Relationships (2012)
[Figure: spectrum of relationships, annotated with LinkedIn and RN Tools]
http://www.oclc.org/content/dam/research/grants/reports/2012/pattuelli2012.pdf

Pattuelli’s Spectrum of Relationships (2012)
• Ontologies used
  – foaf (Friend of a Friend)
  – rel (Relationship)
  – mo (Music)
• Echoes of Trigg’s link taxonomy
  – Trigg, R. 1983. A Network-Based Approach to Text Handling for the Online Scientific Community. Ph.D. dissertation, Department of Computer Science, University of Maryland, technical report TR-1346

Connecting the Dots – Take 2
[Figure courtesy of Melissa Haendel, OHSU]

PubMed Central Open Access
• 886,172 papers (as of 1/1/15)
• 423,764 with acknowledgements
• 994,931 sentences
• 4,329,972 parses

The Simple Cases
• PMCID: 3008610
• SeqNum: 2
• SentNum: 6
• Sentence: EK analysed the data.
• POS: [EK/NNP, analysed/VBD, the/DT, data/NNS, ./.]
• Parse: [S [NP EK/NNP ] [VP analysed/VBD [NP the/DT data/NNS ] ] ./. ]

And the Not So Simple…
• PMCID: 4159542
• Sentence: We thank Sheila Harvey, Clinical Trials Unit Manager at ICNARC, and Ruth Canter, Trials Administrator at ICNARC, for their assistance in chasing completed surveys; Dr Kevin Gunning for early advice and project development; Drs Neill K. J. Adhikari and Gordon D. Rubenfeld for feedback and discussion of analysis plan; Dr Chris AKY Chong for his valuable comments on the initial draft of this manuscript; and our Responders: Addenbrooke’s Hospital ( Dr Kevin Gunning ), Airedale General Hospital ( Dr John Scriven ), Alexandra Hospital ( Dr Tracey Leach ), Arrowe Park Hospital ( Dr Lawrence Wilson ), Barnet Hospital ( Dr AH Wolff ), …
• 8,245-character-long sentence

Extract Entities/Relationships with Syntactic Queries
• [S [NP:Author NN:Author ] [VP NN [NP:Person ] [PP ] , [PP ] ] ]
• S <1NP:Author <2[VP <1/thank/ <2(NP) <3(PP) ]
  – For the sentence having this pattern, match the object noun phrase and the next prepositional phrase
• NP <#2 <1(NNP) <2(NNP)
  – For the noun phrase, extract two proper nouns
• PP <#2 <1DT <2(NP)
  – For the prepositional phrase, match the noun phrase

Person Results Snippet

 ID |   Title   | First Name | Middle Name | Last Name
----+-----------+------------+-------------+-----------
 76 |           | Hans       |             | Matrin
 77 |           | Jeff       |             | Vieira
 78 |           | P.         |             | ZAMORE
 79 |           | Eric       |             | Schon
 80 |           | Carlos     |             | Lois
 81 |           | Andrea     |             | Möll
 82 |           | Elena      |             | Govorkova
 83 |           | K.         |             | Pollard
 84 | Prof. Dr. | Michael    | M.          | Berton

Relationships for Person 77

  PMCID  |   Category    | PP
---------+---------------+--------------------------------------------------------
 4006053 | Support       | the kind gift of rKSHV.219
 4006053 | Support       | the kind gift of rKSHV.219 and for helpful discussions
 4006053 | Collaboration | helpful discussions

Relationships for Person 79

  PMCID  |   Category    | PP
---------+---------------+------------------------------------
 2801706 | Resource      | the rabbit polyclonal antibody
 2801706 | Resource      | the ECFP and EYFP plasmids
 4013013 | Collaboration | his helpful advice and discussions

Category Frequencies

       Category        | Count
-----------------------+--------
 Collaboration         | 47,052
                       | 46,327
 Technique             | 33,598
 Resource              |  8,894
 Support               |  6,836
 Event                 |  3,744
 Project               |    854
 Place Name            |    229
 Publication Component |    210
 Place                 |    186
 Organization          |     93

Next Steps
• Continue slogging through extraction pattern definition
• Define patterns for
  – funding declarations
  – chairs, fellowships, etc.
• Merge data into CTSAsearch visualizations
• Align current category scheme with Melissa Haendel’s current draft ontology for CASRAI taxonomy and then merge with VIVO-ISF

In the Next Year
• Joint work with Melissa Haendel (OHSU) on administrative supplement to OHSU’s CTSA bridging RNs and NIH’s SciENcv
  – Map SciENcv data model to VIVO-ISF
  – Enable bi-directional data exchange
  – Integrate clinical/trial data sources
  – Integrate SciENcv, ORCID data into CTSAsearch
  – Multi-granularity search and visualization

Questions?
• Email: david-eichmann@uiowa.edu
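
Backup: a sketch of the simple case

As an illustration of the idea behind "The Simple Cases" and the syntactic query slides, the following is a minimal Python sketch that reads a bracketed parse and pulls out the subject, verb, and object of a contribution statement. It is not the query engine used in this work and does not implement the query syntax shown above; the names Node, parse_brackets, words, and match_contribution are hypothetical, and the matcher handles only the S -> NP VP shape of the example sentence.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """A phrase node; children are Node objects or (word, tag) tuples."""
    label: str
    children: list = field(default_factory=list)


def parse_brackets(text):
    """Parse a string like '[S [NP EK/NNP ] ... ]' into a Node tree.

    Assumes whitespace-separated tokens, as on the slide.
    """
    stack = [Node("ROOT")]
    for tok in text.split():
        if tok.startswith("["):          # open a new phrase, e.g. "[NP"
            node = Node(tok[1:])
            stack[-1].children.append(node)
            stack.append(node)
        elif tok == "]":                 # close the current phrase
            stack.pop()
        else:                            # a word/TAG leaf, e.g. "EK/NNP"
            word, _, tag = tok.rpartition("/")
            stack[-1].children.append((word, tag))
    return stack[0].children[0]


def words(node):
    """Collect the words under a node, left to right."""
    out = []
    for child in node.children:
        if isinstance(child, Node):
            out.extend(words(child))
        else:
            out.append(child[0])
    return out


def match_contribution(tree):
    """Match S -> NP VP, where the VP starts with a verb and has an NP object."""
    if tree.label != "S":
        return None
    phrases = [c for c in tree.children if isinstance(c, Node)]
    if len(phrases) < 2 or phrases[0].label != "NP" or phrases[1].label != "VP":
        return None
    subject, vp = phrases[0], phrases[1]
    if not vp.children or isinstance(vp.children[0], Node):
        return None
    verb, tag = vp.children[0]
    if not tag.startswith("VB"):
        return None
    obj = next((c for c in vp.children
                if isinstance(c, Node) and c.label == "NP"), None)
    if obj is None:
        return None
    return {
        "agent": " ".join(words(subject)),
        "verb": verb,
        "object": " ".join(words(obj)),
    }


# The parse from "The Simple Cases" (PMCID 3008610)
simple_parse = "[S [NP EK/NNP ] [VP analysed/VBD [NP the/DT data/NNS ] ] ./. ]"
result = match_contribution(parse_brackets(simple_parse))
print(result)
```

A hand-written matcher like this needs one function per sentence shape, which is why a declarative pattern language is the more practical route for the long acknowledgement sentences.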