Using Semantic Web Technology to Integrate Scientific Data
Alasdair J G Gray
University of Manchester
23 June 2009 A.J.G. Gray - Bolzano-Bozen Seminar

Outline
• Motivation: Astronomy & The Virtual Observatory
• Data Integration Challenges
  1. Locating the relevant data
  2. Requesting the required data
  3. Understanding the returned data
Using Semantic Web tools and technology to overcome these challenges

Context: Astronomy
• Data collected across electromagnetic spectrum
• Analysed within one wavelength
• Data collection is
  – expensive
  – time consuming
• Existing data
  – large quantities
  – freely available
Image: Wikipedia

Virtual Observatory
“facilitate the international coordination and collaboration necessary for the development and deployment of the tools, systems and organizational structures necessary to enable the international utilization of astronomical archives as an integrated and interoperating virtual observatory.”

Searching for Brown Dwarfs
• Data sets:
  – Near Infrared, 2MASS/UK Infrared Deep Sky Survey
  – Optical, APMCAT/Sloan Digital Sky Survey
• Complex colour/motion selection criteria
• Similar problems
  – Halo White Dwarfs
Image: AstroGrid

Deep Field Surveys
• Observations in multiple wavelengths
  – Radio to X-Ray
• Searching for new objects
  – Galaxies, stars, etc
• Requires correlations across many catalogues
  – ISO
  – Hubble
  – SCUBA
  – etc

Virtual Observatory: The Problems
Locate, retrieve, and interpret relevant data
• Heterogeneous publishers
  – Archive centres
  – Research labs
• Heterogeneous data
  – Relational
  – XML
  – Image Files

Virtual Observatory: The Problems
Locate, retrieve, and interpret relevant data
1. Which data sources contain relevant data?
2. How do I query the relevant data sources?
3. How can I interpret/combine/analyse the data?

Finding relevant data sources
1. Which data sources contain relevant data?

Which data sources do I use?
• VO registry
  – 65,000+ entries
  – Many mirrored services
• VOExplorer
  – Registry search tool
• Resources tagged with keywords
  – 6df
  – survey
  – galaxy
  – galaxies
  – redshift
  – redshifts
  – 2mass

Analysis of Registry Keywords
Problems:
  – Plural/singular
  – Case
  – Abbreviations
  – Different tags
  – Specificity of tags
Thanks to Sébastien Derriere for this data.

Count  Keyword
75     Star
52     Galaxy
37     Stars
36     Galaxies
16     AGN
12     Cluster of Galaxies
12     Nebulae
11     Planets
10     GRB
10     Globular Clusters
8      Star Cluster
7      Nebula
6      Variable stars
5      Hot stars
5      Pulsar
4      supernova
3      Clusters of Galaxies
3      Infrared:stars
3      Quasars: general
3      Supernova
3      White dwarfs
3      galaxies
2      Comets
2      Cool stars
2      Extragalactic
2      Extragalactic objects
2      Infrared: stars
2      Interstellar medium
2      QSO
2      QSOs
2      SNR
2      Variable Star
2      White Dwarf
2      clusters of galaxies
2      stars
1      Asteroids
1      BL Lac Source
1      Be/X-ray binary stars
1      Binary stars
...
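
The plural/singular and case variants in the keyword counts above are the kind that simple normalisation can merge. The following is a minimal sketch only: it uses a handful of counts copied from the table, and the `normalise` function is a deliberately crude, hypothetical stand-in for a real stemmer.

```python
from collections import Counter

# A few (count, tag) pairs copied from the registry keyword table.
raw_counts = [
    (75, "Star"), (37, "Stars"), (2, "stars"),
    (52, "Galaxy"), (36, "Galaxies"), (3, "galaxies"),
    (12, "Nebulae"), (7, "Nebula"),
]


def normalise(tag: str) -> str:
    """Lowercase the tag and strip a few plural endings (a crude stemmer)."""
    word = tag.lower()
    for suffix, replacement in (("ies", "y"), ("ae", "a"), ("s", "")):
        if word.endswith(suffix):
            return word[: -len(suffix)] + replacement
    return word


# Sum the counts of all tags that share a normalised form.
merged = Counter()
for count, tag in raw_counts:
    merged[normalise(tag)] += count

print(merged.most_common())
```

Normalisation of this kind does nothing for abbreviations (QSO versus Quasars) or for tags of differing specificity, which need knowledge of what the terms mean.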

Analysis of Registry Keywords
Problems:
  – Plural/singular
  – Case
Solution: (standard IR techniques)
  – Stemming
    • Star & Stars become Star
  – Case normalisation
    • lowercase

Analysis of Registry Keywords
Problems:
  – Abbreviations
  – Different tags
  – Specificity of tags
Solution: Need to understand semantics!

Semantic Options
• Folksonomies
  – Keyword tags, freely chosen
• Vocabulary
  – Controlled list of words with definitions
• Taxonomy
  – Relationships: Broader/Narrower/Related
• Thesaurus
  – Synonyms, antonyms, see also
• Ontology
  – Formal specification of a shared conceptualisation
  – OWL
“Vocabulary” used to cover vocabularies, taxonomies, and thesauri.
Image: Leonard Cohen Search

What is a Vocabulary?

What is a Controlled Vocabulary?
• A set of terms with:
  – Label
  – Synonyms
  – Definition
  – Relationships to other terms:
    • Broader term
    • Narrower term
    • Related term
• Example:
  – “Spiral galaxy”
  – “Spiral nebula”
  – “A galaxy having a spiral structure”
  – Relationships carrying semantic information:
    • BT: “Galaxy”
    • NT: “Barred spiral galaxy”
    • RT: “Spiral arm”

Existing Vocabularies in Astronomy
• Journal Keywords
  – Developed for tagging papers
  – 311 terms
  – Actively used
• IAU Thesaurus
  – Developed for libraries in 1993
  – 2,551 terms
  – Never really used
• Astronomy Visualization Metadata (AVM)
  – Tagging images
  – 217 terms
  – Actively used
• Unified Content Descriptor (UCD)
  – Tagging resource data
  – 473 terms
  – Actively used

Common Vocabulary Format
Requirements:
  – Provide term identifiers
    • Unambiguous tagging
  – Capture semantic relationships
    • Poly-hierarchy structure
  – Machine processable
    • Allows inter-operability
    • “Machine intelligence”
Benefits:
  – Avoids problems of:
    • Spelling
    • Case
    • Plurality problems
    • Tags
  – Automated reasoning:
    • Interested in all “Supernova”
    • Items tagged as “1a Supernova” also returned

SKOS
  – W3C standard for sharing vocabularies
  – Based on RDF
    • Semantic model for describing resources
  – Provides URI for each term
  – Captures properties of terms
  – Encodes relationships between terms
    • Enables automated reasoning
    • Standard serialisations
    • “Looser” semantics than OWL
  – Adopted by IVOA as a standard for vocabularies
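
The automated reasoning that term relationships enable (a search for “Supernova” also returning items tagged “1a Supernova”) can be sketched as a walk along broader-term links. This is an illustration only: the mini-vocabulary, the resource names, and the functions `is_within` and `search` are all hypothetical, and a plain dictionary stands in for an RDF store. Note that SKOS itself keeps `skos:broader` non-transitive and provides `skos:broaderTransitive` for this kind of closure.

```python
# Hypothetical mini-vocabulary: each term maps to its broader terms.
# A list is used because a poly-hierarchy allows several broader terms.
broader = {
    "1a Supernova": ["Supernova"],
    "Barred spiral galaxy": ["Spiral galaxy"],
    "Spiral galaxy": ["Galaxy"],
}

# Hypothetical registry resources, each tagged with one vocabulary term.
tagged = {
    "catalogue-A": "1a Supernova",
    "catalogue-B": "Supernova",
    "catalogue-C": "Spiral galaxy",
}


def is_within(term: str, target: str) -> bool:
    """True if `term` equals `target` or reaches it by following broader links."""
    seen, stack = set(), [term]
    while stack:
        current = stack.pop()
        if current == target:
            return True
        if current not in seen:
            seen.add(current)
            stack.extend(broader.get(current, []))
    return False


def search(target: str) -> list:
    """Names of all resources tagged with `target` or one of its narrower terms."""
    return sorted(name for name, term in tagged.items() if is_within(term, target))


print(search("Supernova"))
print(search("Galaxy"))
```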

Example SKOS Vocabulary Term
Example:
  – “Spiral galaxy”
  – “Spiral nebula”
  – “A galaxy having a spiral structure”
  – Relationships: BT, NT, RT
In Turtle notation:
  @prefix skos: <http://www.w3.org/2004/02/skos/core#> .

  <#spiralGalaxy> a skos:Concept ;
      skos:prefLabel "Spiral galaxy"@en ;
      skos:altLabel "Spiral nebula"@en ;
      skos:definition "A galaxy having a spiral structure"@en ;
      skos:broader <#galaxy> ;                 # BT: "Galaxy"
      skos:narrower <#barredSpiralGalaxy> ;    # NT: "Barred spiral galaxy"
      skos:related <#spiralArm> .              # RT: "Spiral arm"

Inter-operable Vocabularies
Which vocabulary should I use?
• One that you know!
• Closest match to your needs
• Vocabulary terms related using mappings
  – Part of the SKOS standard
  – One mapping file per pair of vocabularies
Inter-vocabulary mappings
• Broad match:
  – more general term
• Narrow match:
  – more specific term
• Related match:
  – associated term
• Exact match:
  – equivalent term
• Close match:
  – similar but not equivalent term

Mapping Editor

Putting it all together
• Use vocabulary concepts for
  – Tagging (using URI)
    • Resources in the registry
    • VOEvent packets
  – Searching by vocabulary concept
    • User keyword search converted to vocabulary URI
• Provides semantic advantages
  – Reasoning about terms
    • Relationships (Intra-vocabulary)
    • Mappings (Inter-vocabulary)
• Requires a mechanism to convert a string to a concept

Vocabulary Explorer
• Search and browse vocabularies
  – Configure
    • Vocabularies
    • Mappings
• Based on Information Retrieval techniques
  – Matching mechanisms
  – Ranking results
http://explicator.dcs.gla.ac.uk/WebVocabularyExplorer/
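
Converting a user's keyword string to a vocabulary concept amounts to matching the string against term labels and ranking the candidates. The sketch below is one simple way this could be done, not the Vocabulary Explorer's actual method: the concept table is hypothetical, and Jaccard token overlap is a stand-in for a proper Information Retrieval weighting model.

```python
import re

# Hypothetical concepts: URI -> labels (preferred label first, then alternatives).
concepts = {
    "#spiralGalaxy": ["Spiral galaxy", "Spiral nebula"],
    "#barredSpiralGalaxy": ["Barred spiral galaxy"],
    "#galaxy": ["Galaxy"],
    "#spiralArm": ["Spiral arm"],
}


def tokens(text: str) -> set:
    """Lowercased alphanumeric tokens of a string."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def rank(query: str) -> list:
    """Return (score, uri) pairs, best first, scored by Jaccard token overlap."""
    q = tokens(query)
    results = []
    for uri, labels in concepts.items():
        # A concept scores as well as its best-matching label.
        best = max(len(q & tokens(label)) / len(q | tokens(label)) for label in labels)
        if best > 0:
            results.append((best, uri))
    return sorted(results, key=lambda pair: (-pair[0], pair[1]))


for score, uri in rank("spiral galaxy"):
    print(f"{score:.2f} {uri}")
```

Because alternative labels are matched as well, a query for “Spiral nebula” resolves to the same concept URI as “Spiral galaxy”.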

Search Results
• Evaluation over 59 queries
• nDCG evaluation model (distinguishes highly relevant/relevant/not relevant)

Run               BB2   BM25  DFRBM25  IFB2  InexpB2  InexpC2  InL2  PL2   TF-IDF
Initial           0.93  0.95  0.95     0.95  0.95     0.95     0.95  0.95  0.95
Query Expansion   0.93  0.94  0.94     0.95  0.95     0.94     0.95  0.94  0.94
Term weighting 1  0.93  0.95  0.95     0.95  0.95     0.95     0.95  0.96  0.96
Term weighting 2  0.93  0.95  0.95     0.96  0.96     0.95     0.96  0.96  0.96
Combined          0.91  0.94  0.94     0.94  0.94     0.93     0.94  0.94  0.94

Vocabulary Explorer Screenshot
[Screenshots of the Vocabulary Explorer]

Finding the Right Term: Conclusions
• Vocabularies improve search
  – Remove ambiguity
  – Increase precision and recall
  – Enable
    • Reasoning about relevance
    • Faceted browsing
• Provided tools for working with vocabularies
  – Reliable search from keyword string to vocabulary term
  – Exploration of vocabularies
  – Mapping terms across vocabularies

Extracting relevant data
2. How do I query the relevant data sources?

A Data Integration Approach
• Heterogeneous sources
  – Autonomous
  – Local schemas
• Homogeneous view
  – Mediated global schema
• Mapping
  – LAV: local-as-view
  – GAV: global-as-view
Relies on agreement of a common global schema
[Diagram: Query1 … Queryn posed against the Global Schema; Mappings connect it to Wrapper1, Wrapperi, Wrapperk over DB1, DBi, DBk]
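
To make the global-as-view (GAV) option concrete: each relation of the global schema is defined as a query over the local sources, so a query on the global schema is answered by unfolding that definition. The sketch below is a toy illustration under assumed names; the two sources, their column names, and the global relation `Object(name, ra, dec)` are all hypothetical.

```python
# Two hypothetical autonomous sources with different local schemas.
source_a = [{"obj": "A1", "ra_deg": 10.0, "dec_deg": -5.0}]
source_b = [{"id": "B7", "ra": 200.5, "decl": 12.25}]


def global_objects():
    """GAV mapping: the global relation Object(name, ra, dec) as a view over the sources."""
    for row in source_a:
        yield {"name": row["obj"], "ra": row["ra_deg"], "dec": row["dec_deg"]}
    for row in source_b:
        yield {"name": row["id"], "ra": row["ra"], "dec": row["decl"]}


def northern_objects():
    """A query written only against the global schema (declination above zero)."""
    return [obj["name"] for obj in global_objects() if obj["dec"] > 0]


print(northern_objects())
```

Under local-as-view (LAV) the direction is reversed: each source is described as a view over the global schema, which makes adding a source easier but query answering harder.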

P2P Data Integration Approach
• Heterogeneous sources
  – Autonomous
  – Local schemas
• Heterogeneous views
  – Multiple schemas
• Mappings
  – From sources to common schema
  – Between pairs of schemas
• Require common integration data model
Can RDF do this?
[Diagram: Query1 … Queryn posed against Schema1 … Schemaj; Mappings connect the schemas to Wrapper1, Wrapperi, Wrapperk over DB1, DBi, DBk]

Resource Description Framework
• W3C standard
• Designed as a metadata data model
• Contains semantic details
• Ideal for linking distributed data
• Queried through SPARQL
[Diagram: RDF graph with the triples
  #Sun rdf:type IAU:Star
  #Sun #name "The Sun"
  #Sun #foundIn #MilkyWay
  #MilkyWay rdf:type IAU:BarredSpiral
  #MilkyWay #name "The Galaxy"
  #MilkyWay #name "Milky Way"]

SPARQL
• Declarative query language
  – Select returned data (projection)
    • Graph or tuples
    • Attributes to return
  – Describe structure of desired results
  – Filter data (selection)
• W3C standard
• Syntactically similar to SQL
  – Should be easy for astronomers to learn!

Integrating Using RDF
• Data resources
  – Expose schema and data as RDF
  – Need a SPARQL endpoint
• Allows multiple
  – Access models
  – Storage models
• Easy to relate data from multiple sources
We will focus on exposing relational data sources
[Diagram: SPARQL query over Mappings and a Common Model (RDF); RDF / Relational Conversion over a Relational DB and RDF / XML Conversion over an XML DB]

RDB2RDF: Two Approaches
Extract-Transform-Load
• Data replicated as RDF
  – Data can become stale
• Native SPARQL query support
  – Limited optimisation mechanisms
Existing RDF stores
• Jena
• Sesame
Query-driven Conversion
• Data stored as relations
• Native SQL query support
  – Highly optimised access methods
• SPARQL queries must be translated
Existing translation systems
• D2RQ
• SquirrelRDF
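
The Extract-Transform-Load approach replicates each relational row as a set of RDF triples. A minimal sketch of that transformation follows, assuming a hypothetical base URI, table name, and two-row excerpt; it follows the common convention of one subject per row and one predicate per column, and is not the scheme of any particular tool.

```python
# Hypothetical base URI and a two-row excerpt of a relational table.
BASE = "http://example.org/archive/"

rows = [
    {"objID": 1, "ra": 10.5, "dec": -5.25},
    {"objID": 2, "ra": 200.0, "dec": 12.0},
]


def row_to_triples(table: str, key: str, row: dict) -> list:
    """One subject URI per row (built from the primary key), one triple per other column."""
    subject = f"<{BASE}{table}/{row[key]}>"
    return [
        (subject, f"<{BASE}{table}#{column}>", f'"{value}"')
        for column, value in row.items()
        if column != key
    ]


# Transform every row; the result is a snapshot of the table at load time.
triples = [t for row in rows for t in row_to_triples("Source", "objID", row)]

for s, p, o in triples:
    print(s, p, o, ".")
```

The triples are a copy made at load time, so later updates to `rows` are not reflected until the load is repeated; this is the staleness noted above.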

System Test Hypothesis
Is it viable to perform query-driven conversions to facilitate data access from a data model that an astronomer is familiar with?
Can RDB2RDF tools feasibly expose large science archives for data integration?
[Diagram: SPARQL queries over Mappings and a Common Model (RDF); RDB2RDF over a Relational DB and RDF / XML Conversion over an XML DB]

Astronomical Test Data Set
• SuperCOSMOS Science Archive (SSA)
  – Data extracted from scans of Schmidt plates
  – Stored in a relational database
  – About 4TB of data, detailing 6.4 billion objects
  – Fairly typical of astronomical data archives
• Schema designed using 20 real queries
• Personal version contains
  – Data for a specific region of the sky
  – About 0.1% of the data
  – About 500MB
Image: SuperCOSMOS Science Archive

Analysis of Test Data
• Using personal version
  – About 500MB in size (similar size to related work)
• Organised in 14 Relations
  – Number of attributes: 2 – 152
    • 4 relations with more than 20 attributes
  – Number of rows: 3 – 585,560
  – Two views
    • Complex selection criteria in views
Makes this different from business cases and previous work!

Is SPARQL expressive enough?
Can the 20 sample queries be expressed in SPARQL?

Real Science Queries
Query 5: Find the positions and (B,R,I) magnitudes of all star-like objects within delta mag of 0.2 of the colours of a quasar of redshift 2.5 < z < 3.5
SQL:
  SELECT ra, dec, sCorMagB, sCorMagR2, sCorMagI
  FROM ReliableStars
  WHERE (sCorMagB - sCorMagR2 BETWEEN 0.05 AND 0.80)
    AND (sCorMagR2 - sCorMagI BETWEEN -0.17 AND 0.64)
SPARQL:
  SELECT ?ra ?decl ?sCorMagB ?sCorMagR2 ?sCorMagI
  WHERE {
    …<bindings>…
    FILTER (?sCorMagB - ?sCorMagR2 >= 0.05 && ?sCorMagB - ?sCorMagR2 <= 0.80)
    FILTER (?sCorMagR2 - ?sCorMagI >= -0.17 && ?sCorMagR2 - ?sCorMagI <= 0.64)
  }

Analysis of Test Queries

Query Feature                               Query Numbers
Arithmetic in body                          1-5, 7, 9, 12, 13, 15-20
Arithmetic in head                          7-9, 12, 13
Ordering                                    1-8, 10-17, 19, 20
Joins (including self-joins)                12-17, 19
Range functions (e.g. Between, ABS)         2, 3, 5, 8, 12, 13, 15, 17-20
Aggregate functions (including Group By)    7-9, 18
Math functions (e.g. power, log, root)      4, 9, 16
Trigonometry functions                      8, 12
Negated sub-query                           18, 20
Type casting (e.g. radians to degrees)      7, 8, 12
Server defined functions                    10, 11

Expressivity of SPARQL
Features
• Select-project-join
• Arithmetic in body
• Conjunction and disjunction
• Ordering
• String matching
• External function calls (extension mechanism)
Limitations
• Range shorthands
• Arithmetic in head
• Math functions
• Trigonometry functions
• Sub queries
• Aggregate functions
• Casting

Analysis of Test Queries
Expressible queries: 1, 2, 3, 5, 6, 14, 15, 17, 19

Can RDB2RDF tools feasibly expose large science archives for data integration?

Experiment
• Time query evaluation
  – 5 out of 20 queries used
  – No joins
• Systems compared:
  – Relational DB (baseline)
    • MySQL v5.1.25
  – RDB2RDF tools
    • D2RQ v0.5.2
    • SquirrelRDF v0.1
  – RDF Triple stores
    • Jena v2.5.6 (SDB)
    • Sesame v2.1.3 (Native)
[Diagram: SQL query to Relational DB; SPARQL query through RDB2RDF to Relational DB; SPARQL query to Triple store]

Experimental Configuration
• 8 identical machines
  – 64 bit Intel Quad Core Xeon 2.4GHz
  – 4GB RAM
  – 100 GB Hard drive
  – Java 1.6
  – Linux
• 10 repetitions

Performance Results
[Figure: bar chart of query evaluation time (ms, axis from 0 to 1000) for MySQL, D2RQ, SqRDF, Jena and Sesame on Queries 1, 2, 3, 5 and 6. Bars exceeding the axis carry value labels: 7,468; 19,984; 372,561; 17,793; 4,090; 1,307; 7,229; 2,733; 5,339; 21,492; 485,932; 3,450.]

The Show Stopper: Query Translation
• Each bound variable resulted in a self-join
  – RDBMS cannot optimize for this
  – RDBMS perform badly with self-joins
• Each row retrieved with a separate query
  – 1 query becomes n queries, where n is cardinality of relation
• Predicate selection in RDB2RDF tool
  – No optimization possible

Extracting Relevant Data: Conclusions
• SPARQL not expressive enough for real (astronomy) queries
• RDBMS benefits from 30+ years research
  – Query optimisation
  – Indexes
• RDF stores are improving
  – Require existing data to be replicated
• RDB2RDF tools show promise
  – Need to exploit relational database
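
The self-join behaviour noted under query translation can be illustrated with a naive translator that turns each SPARQL triple pattern into one more copy of a triple view. This is a sketch under stated assumptions: the `triples(s, p, o)` table and the `translate` function are hypothetical, and this is not the algorithm of D2RQ or SquirrelRDF.

```python
def translate(patterns: list) -> str:
    """Translate triple patterns (s, p, o) into SQL over a table triples(s, p, o).

    Terms starting with '?' are variables; everything else is a constant.
    """
    select, where, first_use = [], [], {}
    for i, pattern in enumerate(patterns):
        for column, term in zip("spo", pattern):
            ref = f"t{i}.{column}"
            if not term.startswith("?"):
                where.append(f"{ref} = '{term}'")           # constant: selection
            elif term in first_use:
                where.append(f"{ref} = {first_use[term]}")  # repeated variable: join
            else:
                first_use[term] = ref                       # new variable: projection
                select.append(f"{ref} AS {term[1:]}")
    tables = ", ".join(f"triples t{i}" for i in range(len(patterns)))
    return f"SELECT {', '.join(select)} FROM {tables} WHERE {' AND '.join(where)}"


# Three attributes of one object, as in the bindings of a query such as Query 5.
sql = translate([
    ("?star", "ra", "?ra"),
    ("?star", "dec", "?decl"),
    ("?star", "sCorMagB", "?sCorMagB"),
])
print(sql)
```

Reading three columns of a single row, which the SQL original does with one table scan, becomes a three-way self-join here; each further bound variable adds another copy of the table.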

Can RDB2RDF Tools Feasibly Expose Large Science Archives for Data Integration?
Not currently!
More work needed on query translation…

Conclusions & Future Work
Traditional Integration Challenges and their Semantic Web Solution
1. Locating data
  • SKOS Vocabularies
    – Removes ambiguity
    – Enables limited machine understanding
2. Extracting relevant data
  • RDB2RDF Tools
    – Requires improved query translation
3. Understanding data
  • Semantic model mappings
    – Follow “chains” of mappings
    – Relies on RDB2RDF work