Using Semantic Web Technology to
Integrate Scientific Data
Alasdair J G Gray
University of Manchester
Outline
• Motivation: Astronomy & The Virtual Observatory
• Data Integration Challenges
1. Locating the relevant data
2. Requesting the required data
3. Understanding the returned data
23 June 2009
A.J.G. Gray - Bolzano-Bozen Seminar
Using Semantic Web tools and technology to overcome these challenges
Context: Astronomy
• Data collected across
electromagnetic spectrum
• Analysed within one
wavelength
Image: Wikipedia
• Data collection is
– expensive
– time consuming
• Existing data
– large quantities
– freely available
Virtual Observatory
“facilitate the international
coordination and
collaboration necessary for
the development and
deployment of the tools,
systems and organizational
structures necessary to
enable the international
utilization of astronomical
archives as an integrated
and interoperating virtual
observatory.”
Searching for Brown Dwarfs
• Data sets:
– Near Infrared, 2MASS/UK Infrared Deep Sky Survey
– Optical, APMCAT/Sloan Digital Sky Survey
• Complex colour/motion selection criteria
• Similar problems
– Halo White Dwarfs
Image: AstroGrid
Deep Field Surveys
• Observations in multiple wavelengths
– Radio to X-Ray
• Searching for new objects
– Galaxies, stars, etc
• Requires correlations across many catalogues
– ISO
– Hubble
– SCUBA
– etc
Virtual Observatory: The Problems
Locate, retrieve, and interpret
relevant data
• Heterogeneous publishers
– Archive centres
– Research labs
• Heterogeneous data
– Relational
– XML
– Image Files
Virtual Observatory: The Problems
Locate, retrieve, and interpret
relevant data
1. Which data sources contain
relevant data?
2. How do I query the relevant
data sources?
3. How can I
interpret/combine/analyse
the data?
Finding relevant data sources
1. Which data sources contain relevant data?
Which data sources do I use?
• VO registry
– 65,000+ entries
– Many mirrored services
• VOExplorer
– Registry search tool
• Resources tagged with
keywords
- 6df
- survey
- galaxy
- galaxies
- redshift
- redshifts
- 2mass
Analysis of Registry Keywords
Problems:
– Plural/singular
– Case
– Abbreviations
– Different tags
– Specificity of tags
Thanks to Sébastien Derriere
for this data.
75 Star
52 Galaxy
37 Stars
36 Galaxies
16 AGN
12 Cluster of Galaxies
12 Nebulae
11 Planets
10 GRB
10 Globular Clusters
8 Star Cluster
7 Nebula
6 Variable stars
5 Hot stars
5 Pulsar
4 supernova
3 Clusters of Galaxies
3 Infrared:stars
3 Quasars: general
3 Supernova
3 White dwarfs
3 galaxies
2 Comets
2 Cool stars
2 Extragalactic Source
2 Extragalactic objects
2 Infrared: stars
2 Interstellar medium
2 QSO
2 QSOs
2 SNR
2 Variable Star
2 White Dwarf
2 clusters of galaxies
2 stars
1 Asteroids
1 BL Lac
1 Be/X-ray binary stars
1 Binary stars
...
Analysis of Registry Keywords
Problems:
– Plural/singular
– Case
Solution:
(standard IR techniques)
– Stemming
• Star & Stars become
Star
– Case normalisation
• lowercase
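The two normalisation steps can be sketched in a few lines of Python. This is a minimal illustration, not the registry's implementation: the counts are copied from the keyword analysis, and the `normalise` function is a crude plural-stripping stand-in for a real stemmer such as Porter's.

```python
from collections import Counter

# A few tag counts copied from the registry keyword analysis.
RAW_TAGS = {
    "Star": 75, "Stars": 37, "stars": 2,
    "Galaxy": 52, "Galaxies": 36, "galaxies": 3,
}


def normalise(tag: str) -> str:
    """Lowercase a tag and strip a plural suffix (assumed, very crude stemmer)."""
    word = tag.strip().lower()
    if word.endswith("ies"):
        return word[:-3] + "y"
    if word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word


def merge_counts(raw):
    """Add up the counts of all tags that normalise to the same form."""
    merged = Counter()
    for tag, count in raw.items():
        merged[normalise(tag)] += count
    return merged


merged = merge_counts(RAW_TAGS)
print(merged)  # Counter({'star': 114, 'galaxy': 91})
```

Six distinct tags collapse to two, but abbreviations such as AGN and tags of differing specificity are untouched by this kind of normalisation.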
Analysis of Registry Keywords
Problems:
– Abbreviations
– Different tags
– Specificity of tags
Solution:
Need to understand
semantics!
Semantic Options
• Folksonomies
– Keyword tags, freely chosen
• Vocabulary
– Controlled list of words with
definitions
• Taxonomy
– Relationships:
Broader/Narrower/Related
Image: Leonard Cohen Search
• Thesaurus
– Synonyms, antonyms, see also
• Ontology
– Formal specification of a shared
conceptualisation – OWL
In this talk, “Vocabulary” is used to cover vocabularies, taxonomies, and thesauri.
What is a Vocabulary?
What is a Controlled Vocabulary?
• A set of terms with:
– Label
– Synonyms
– Definition
– Relationships to other terms:
• Broader term
• Narrower term
• Related term
• Example:
– “Spiral galaxy”
– “Spiral nebula”
– “A galaxy having a spiral structure”
– Relationships carrying semantic information:
• BT: “Galaxy”
• NT: “Barred spiral galaxy”
• RT: “Spiral arm”
Existing Vocabularies in Astronomy
• Journal Keywords
– Developed for tagging papers
– 311 terms
– Actively used
• IAU Thesaurus
– Developed for libraries in 1993
– 2,551 terms
– Never really used
• Astronomy Visualization Metadata (AVM)
– Tagging images
– 217 terms
– Actively used
• Unified Content Descriptor (UCD)
– Tagging resource data
– 473 terms
– Actively used
Common Vocabulary Format
Requirements:
– Provide term identifiers
• Unambiguous tagging
– Capture semantic relationships
• Poly-hierarchy structure
– Machine processable
• Allows inter-operability
• “Machine intelligence”
– Avoids problems of:
• Spelling
• Case
• Plurality problems
• Tags
– Automated reasoning:
• Interested in all “Supernova”
• Items tagged as “1a Supernova” also returned
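The “Supernova” example of automated reasoning can be sketched as a search that follows narrower-term links. The vocabulary fragment and the tagged resources below are hypothetical, chosen only to show the idea.

```python
# Hypothetical vocabulary fragment: term -> its narrower terms.
NARROWER = {
    "Supernova": ["1a Supernova", "Type II Supernova"],
}

# Hypothetical resources, each tagged with one vocabulary term.
RESOURCES = {"res1": "Supernova", "res2": "1a Supernova", "res3": "Galaxy"}


def expand(term, narrower):
    """Return the term together with all of its transitive narrower terms."""
    seen, stack = set(), [term]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(narrower.get(current, []))
    return seen


def search(term, resources, narrower):
    """Return resources tagged with the term or any narrower term."""
    wanted = expand(term, narrower)
    return sorted(name for name, tag in resources.items() if tag in wanted)


hits = search("Supernova", RESOURCES, NARROWER)
print(hits)  # ['res1', 'res2']
```

A plain keyword match on “Supernova” would also need string tricks to find “1a Supernova”; here the match follows from the recorded relationship, so it would work equally well for narrower terms that share no words with the query.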
SKOS
– W3C standard for sharing vocabularies
– Based on RDF
• Semantic model for describing resources
– Provides URI for each term
– Captures properties of terms
– Encodes relationships between terms
• Enables automated reasoning
• Standard serialisations
• “Looser” semantics than OWL
– Adopted by IVOA as a standard for vocabularies
Example SKOS Vocabulary Term
Example
– “Spiral galaxy”
– “Spiral nebula”
– “A galaxy having a spiral structure”
– Relationships:
• BT: “Galaxy”
• NT: “Barred spiral galaxy”
• RT: “Spiral arm”
In Turtle notation
@prefix skos: <http://www.w3.org/2004/02/skos/core#> .
<#spiralGalaxy> a skos:Concept ;
    skos:prefLabel "Spiral galaxy"@en ;
    skos:altLabel "Spiral nebula"@en ;
    skos:definition "A galaxy having a spiral structure"@en ;
    skos:broader <#galaxy> ;
    skos:narrower <#barredSpiralGalaxy> ;
    skos:related <#spiralArm> .
Inter-operable Vocabularies
Which vocabulary should I use?
• One that you know!
• Closest match to your needs
• Vocabulary terms related using mappings
– Part of the SKOS standard
– One mapping file per pair of vocabularies
Inter-vocabulary mappings
• Broad match:
– more general term
• Narrow match:
– more specific term
• Related match:
– associated term
• Exact match:
– equivalent term
• Close match:
– similar but not equivalent term
Mapping Editor
Putting it all together
• Use vocabulary concepts for
– Tagging (using URI)
• Resources in the registry
• VOEvent packets
– Searching by vocabulary concept
• User keyword search converted to vocabulary URI
• Provides semantic advantages
– Reasoning about terms
• Relationships (Intra-vocabulary)
• Mappings (Inter-vocabulary)
• Requires a mechanism to convert a string to a concept
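As a rough sketch of such a mechanism, the simplest option is an exact lookup over preferred and alternative labels after case and whitespace normalisation. The namespace `http://example.org/vocab#` and the two concepts are assumptions for illustration; a real tool would rank partial matches with information retrieval techniques instead of requiring an exact label.

```python
BASE = "http://example.org/vocab#"  # hypothetical namespace

# Hypothetical concepts with their preferred and alternative labels.
CONCEPTS = {
    BASE + "spiralGalaxy": {"prefLabel": "Spiral galaxy", "altLabels": ["Spiral nebula"]},
    BASE + "galaxy": {"prefLabel": "Galaxy", "altLabels": []},
}


def build_index(concepts):
    """Map every normalised label (preferred or alternative) to its concept URI."""
    index = {}
    for uri, concept in concepts.items():
        for label in [concept["prefLabel"], *concept["altLabels"]]:
            index[label.casefold()] = uri
    return index


def to_concept(keyword, index):
    """Return the concept URI for a user keyword, or None if no label matches."""
    normalised = " ".join(keyword.split()).casefold()
    return index.get(normalised)


INDEX = build_index(CONCEPTS)
print(to_concept("spiral  NEBULA", INDEX))  # http://example.org/vocab#spiralGalaxy
```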
Vocabulary Explorer
• Search and browse
vocabularies
– Configure
• Vocabularies
• Mappings
• Based on Information
Retrieval techniques
• Matching mechanisms
• Ranking results
http://explicator.dcs.gla.ac.uk/WebVocabularyExplorer/
Search Results
• Evaluation over 59 queries
• nDCG evaluation model (distinguishes highly relevant/relevant/not relevant)
Run       Initial   Query Expansion   Term weighting 1   Term weighting 2   Combined
BB2       0.93      0.93              0.93               0.93               0.91
BM25      0.95      0.94              0.95               0.95               0.94
DFRBM25   0.95      0.94              0.95               0.95               0.94
IFB2      0.95      0.95              0.95               0.96               0.94
InexpB2   0.95      0.95              0.95               0.96               0.94
InexpC2   0.95      0.94              0.95               0.95               0.93
InL2      0.95      0.95              0.95               0.96               0.94
PL2       0.95      0.94              0.96               0.96               0.94
TF-IDF    0.95      0.94              0.96               0.96               0.94
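For readers unfamiliar with nDCG, the sketch below shows one common formulation with graded gains (2 = highly relevant, 1 = relevant, 0 = not relevant) and a log2 rank discount. The slides do not say which variant or gain values were used in the evaluation, so both are assumptions here.

```python
import math


def dcg(gains):
    """Discounted cumulative gain: each gain is divided by log2(rank + 1)."""
    return sum(g / math.log2(rank + 1) for rank, g in enumerate(gains, start=1))


def ndcg(gains):
    """DCG of the ranking divided by the DCG of the ideal (sorted) ranking."""
    ideal = dcg(sorted(gains, reverse=True))
    return dcg(gains) / ideal if ideal > 0 else 0.0


perfect = ndcg([2, 1, 0])   # results already in the best possible order
reversed_ = ndcg([0, 1, 2])  # best result ranked last
print(perfect, reversed_)
```

A score of 1.0 means the results were returned in ideal order; placing a highly relevant result lower in the list reduces the score.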
Vocabulary Explorer Screenshot
Finding the Right Term: Conclusions
• Vocabularies improve search
– Remove ambiguity
– Increase precision and recall
– Enable
• Reasoning about relevance
• Faceted browsing
• Provided tools for working with vocabularies
– Reliable search from keyword string to vocabulary term
– Exploration of vocabularies
– Mapping terms across vocabularies
Extracting relevant data
2. How do I query the relevant data sources?
Virtual Observatory: The Problems
Locate, retrieve, and interpret
relevant data
• Heterogeneous publishers
– Archive centres
– Research labs
• Heterogeneous data
– Relational
– XML
– Image Files
A Data Integration Approach
• Heterogeneous sources
– Autonomous
– Local schemas
• Homogeneous view
– Mediated global schema
• Mapping
– LAV: local-as-view
– GAV: global-as-view
Relies on agreement of a common global schema
[Diagram: Query1 ... Queryn posed over the Global Schema; Mappings link it to Wrapper1, Wrapperi, Wrapperk over DB1, DBi, DBk]
P2P Data Integration Approach
• Heterogeneous sources
– Autonomous
– Local schemas
• Heterogeneous views
– Multiple schemas
• Mappings
– From sources to common schema
– Between pairs of schemas
• Require common integration data model
Can RDF do this?
[Diagram: Query1 ... Queryn posed over Schema1 ... Schemaj; Mappings link them to Wrapper1, Wrapperi, Wrapperk over DB1, DBi, DBk]
Resource Description Framework
• W3C standard
• Designed as a metadata data model
• Contains semantic details
• Ideal for linking distributed data
• Queried through SPARQL
Example graph, written as triples:
#Sun rdf:type IAU:Star
#Sun #name "The Sun"
#Sun #foundIn #MilkyWay
#MilkyWay rdf:type IAU:BarredSpiral
#MilkyWay #name "Milky Way"
#MilkyWay #name "The Galaxy"
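To make the data model concrete, here is a minimal sketch that stores the Sun and Milky Way example as (subject, predicate, object) tuples and matches a single triple pattern against them. The assignment of labels to nodes is one reading of the slide's graph, and the `match` helper is illustrative only, not a SPARQL engine.

```python
# One reading of the example graph, as (subject, predicate, object) triples.
TRIPLES = [
    ("#Sun", "rdf:type", "IAU:Star"),
    ("#Sun", "#name", "The Sun"),
    ("#Sun", "#foundIn", "#MilkyWay"),
    ("#MilkyWay", "rdf:type", "IAU:BarredSpiral"),
    ("#MilkyWay", "#name", "Milky Way"),
    ("#MilkyWay", "#name", "The Galaxy"),
]


def match(pattern, triples):
    """Yield variable bindings for one triple pattern; terms starting with '?' are variables."""
    for triple in triples:
        binding = {}
        for want, got in zip(pattern, triple):
            if want.startswith("?"):
                # A repeated variable must bind to the same value each time.
                if binding.setdefault(want, got) != got:
                    break
            elif want != got:
                break
        else:
            yield binding


where = list(match(("#Sun", "#foundIn", "?galaxy"), TRIPLES))
names = sorted(b["?n"] for b in match(("#MilkyWay", "#name", "?n"), TRIPLES))
print(where, names)
```

Because every statement has the same three-part shape, triples from different sources can be pooled into one list and queried together, which is what makes the model attractive for linking distributed data.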
SPARQL
• Declarative query language
– Select returned data (projection)
• Graph or tuples
• Attributes to return
– Describe structure of desired results
– Filter data (selection)
• W3C standard
• Syntactically similar to SQL
– Should be easy for astronomers to learn!
Integrating Using RDF
• Data resources
– Expose schema and data as RDF
– Need a SPARQL endpoint
• Allows multiple
– Access models
– Storage models
• Easy to relate data from multiple sources
We will focus on exposing relational data sources
[Diagram: SPARQL query posed over Mappings and a Common Model (RDF); RDF / Relational Conversion over a Relational DB; RDF / XML Conversion over an XML DB]
RDB2RDF: Two Approaches
Extract-Transform-Load
• Data replicated as RDF
– Data can become stale
• Native SPARQL query support
– Limited optimisation mechanisms
Existing RDF stores
• Jena
• Sesame
Query-driven Conversion
• Data stored as relations
• Native SQL query support
– Highly optimised access methods
• SPARQL queries must be translated
Existing translation systems
• D2RQ
• SquirrelRDF
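The Extract-Transform-Load route can be pictured as turning each relational row into a set of triples. The sketch below assumes a simple convention (subject URI built from the table name and primary key, predicate from the column name) and a hypothetical `objID` key and `http://example.org/` base; actual tools let the publisher configure these mappings.

```python
def row_to_triples(table, key_column, row, base="http://example.org/"):
    """Convert one relational row (a dict) into (subject, predicate, object) triples.

    The URI scheme is an assumed convention for illustration only.
    NULL (None) values produce no triple.
    """
    subject = f"{base}{table}/{row[key_column]}"
    return [
        (subject, f"{base}{table}#{column}", value)
        for column, value in row.items()
        if column != key_column and value is not None
    ]


# Hypothetical row; objID is an assumed primary key.
row = {"objID": 1, "ra": 10.5, "dec": -30.2, "sCorMagB": None}
triples = row_to_triples("ReliableStars", "objID", row)
for triple in triples:
    print(triple)
```

A row with k non-null attributes becomes k triples, so a replicated archive grows to one triple per populated cell and must be regenerated whenever the source data changes.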
System Test Hypothesis
Is it viable to perform query-driven conversions to facilitate data access from a data model that an astronomer is familiar with?
Can RDB2RDF tools feasibly expose large science archives for data integration?
[Diagram: SPARQL queries posed over Mappings and a Common Model (RDF); RDB2RDF over a Relational DB; RDF / XML Conversion over an XML DB]
Astronomical Test Data Set
• SuperCOSMOS Science Archive (SSA)
– Data extracted from scans of Schmidt plates
– Stored in a relational database
– About 4TB of data, detailing 6.4 billion objects
– Fairly typical of astronomical data archives
Image: SuperCOSMOS Science Archive
• Schema designed using 20 real queries
• Personal version contains
– Data for a specific region of the sky
– About 0.1% of the data
– About 500MB
Analysis of Test Data
• Using personal version
– About 500MB in size (similar size to related work)
• Organised in 14 Relations
– Number of attributes: 2 – 152
• 4 relations with more than 20 attributes
– Number of rows: 3 – 585,560
– Two views
• Complex selection criteria in views
Makes this different from business cases and previous work!
Is SPARQL expressive enough?
Can the 20 sample queries be expressed in SPARQL?
Real Science Queries
Query 5: Find the positions and (B,R,I) magnitudes of all star-like
objects within delta mag of 0.2 of the colours of a quasar of redshift
2.5 < z < 3.5
SQL:
SELECT ra, dec, sCorMagB, sCorMagR2, sCorMagI
FROM ReliableStars
WHERE (sCorMagB - sCorMagR2 BETWEEN 0.05 AND 0.80)
  AND (sCorMagR2 - sCorMagI BETWEEN -0.17 AND 0.64)
SPARQL:
SELECT ?ra ?decl ?sCorMagB ?sCorMagR2 ?sCorMagI
WHERE {
  …<bindings>…
  FILTER (?sCorMagB - ?sCorMagR2 >= 0.05 && ?sCorMagB - ?sCorMagR2 <= 0.80)
  FILTER (?sCorMagR2 - ?sCorMagI >= -0.17 && ?sCorMagR2 - ?sCorMagI <= 0.64)
}
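The SPARQL version of Query 5 has to spell out each BETWEEN as a pair of comparisons. The sketch below checks, on a tiny in-memory SQLite table with made-up rows, that the shorthand and the expanded form select the same objects. The table contents are hypothetical; only the table and column names come from the query.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    'CREATE TABLE ReliableStars '
    '(ra REAL, "dec" REAL, sCorMagB REAL, sCorMagR2 REAL, sCorMagI REAL)'
)
# Made-up rows: only the first satisfies both colour cuts.
conn.executemany(
    "INSERT INTO ReliableStars VALUES (?, ?, ?, ?, ?)",
    [
        (10.0, -30.0, 18.5, 18.0, 17.8),  # B-R2 = 0.5, R2-I = 0.2
        (11.0, -31.0, 19.5, 18.0, 17.8),  # B-R2 = 1.5 (too red)
        (12.0, -32.0, 18.3, 18.0, 17.0),  # R2-I = 1.0 (outside range)
    ],
)

BETWEEN_SQL = """
    SELECT ra FROM ReliableStars
    WHERE (sCorMagB - sCorMagR2 BETWEEN 0.05 AND 0.80)
      AND (sCorMagR2 - sCorMagI BETWEEN -0.17 AND 0.64)
    ORDER BY ra
"""
# The same condition written the way the SPARQL FILTERs express it.
EXPANDED_SQL = """
    SELECT ra FROM ReliableStars
    WHERE sCorMagB - sCorMagR2 >= 0.05 AND sCorMagB - sCorMagR2 <= 0.80
      AND sCorMagR2 - sCorMagI >= -0.17 AND sCorMagR2 - sCorMagI <= 0.64
    ORDER BY ra
"""

with_between = conn.execute(BETWEEN_SQL).fetchall()
expanded = conn.execute(EXPANDED_SQL).fetchall()
print(with_between, expanded)  # [(10.0,)] [(10.0,)]
```

The missing range shorthand therefore costs verbosity, not expressive power, which is why queries that use only BETWEEN-style ranges can still be written in SPARQL.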
Analysis of Test Queries
Query Feature                               Query Numbers
Arithmetic in body                          1-5, 7, 9, 12, 13, 15-20
Arithmetic in head                          7-9, 12, 13
Ordering                                    1-8, 10-17, 19, 20
Joins (including self-joins)                12-17, 19
Range functions (e.g. Between, ABS)         2, 3, 5, 8, 12, 13, 15, 17-20
Aggregate functions (including Group By)    7-9, 18
Math functions (e.g. power, log, root)      4, 9, 16
Trigonometry functions                      8, 12
Negated sub-query                           18, 20
Type casting (e.g. radians to degrees)      7, 8, 12
Server defined functions                    10, 11
Expressivity of SPARQL
Features
• Select-project-join
• Arithmetic in body
• Conjunction and disjunction
• Ordering
• String matching
• External function calls
(extension mechanism)
Limitations
• Range shorthands
• Arithmetic in head
• Math functions
• Trigonometry functions
• Sub queries
• Aggregate functions
• Casting
Analysis of Test Queries
Expressible queries: 1, 2, 3, 5, 6, 14, 15, 17, 19
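The list of expressible queries can be reproduced from the feature table by removing every query that needs a feature SPARQL lacks. The sketch below assumes that range functions can be rewritten as plain comparisons (so they do not block a query) and that server defined functions do block one; under those assumptions the result matches the list above.

```python
def expand(spec):
    """Expand a string such as '7-9, 12, 13' into a set of query numbers."""
    numbers = set()
    for part in spec.split(","):
        low, _, high = part.strip().partition("-")
        numbers.update(range(int(low), int(high or low) + 1))
    return numbers


# Features assumed to make a query inexpressible (numbers from the table).
BLOCKING = {
    "Arithmetic in head": "7-9, 12, 13",
    "Aggregate functions": "7-9, 18",
    "Math functions": "4, 9, 16",
    "Trigonometry functions": "8, 12",
    "Negated sub-query": "18, 20",
    "Type casting": "7, 8, 12",
    "Server defined functions": "10, 11",
}

blocked = set().union(*(expand(spec) for spec in BLOCKING.values()))
expressible = sorted(set(range(1, 21)) - blocked)
print(expressible)  # [1, 2, 3, 5, 6, 14, 15, 17, 19]
```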
Can RDB2RDF tools feasibly expose large
science archives for data integration?
Experiment
• Time query evaluation
– 5 out of 20 queries used
– No joins
• Systems compared:
– Relational DB (Base line)
• MySQL v5.1.25
– RDB2RDF tools
• D2RQ v0.5.2
• SquirrelRDF v0.1
– RDF Triple stores
• Jena v2.5.6 (SDB)
• Sesame v2.1.3 (Native)
[Diagram: SQL query over a Relational DB; SPARQL query via RDB2RDF over a Relational DB; SPARQL query over a Triple store]
Experimental Configuration
• 8 identical machines
– 64 bit Intel Quad Core Xeon 2.4GHz
– 4GB RAM
– 100 GB Hard drive
– Java 1.6
– Linux
• 10 repetitions
Performance Results
[Bar chart: query evaluation time in ms (axis 0 to 1000) for Queries 1, 2, 3, 5 and 6 on MySQL, D2RQ, SquirrelRDF (SqRDF), Jena and Sesame. Values printed on the chart, not attributable to individual bars in this text: 7,468; 19,984; 372,561; 17,793; 4,090; 1,307; 7,229; 2,733; 5,339; 21,492; 485,932; 3,450.]
The Show Stopper: Query Translation
• Each bound variable resulted in a self-join
– RDBMS cannot optimize for this
– RDBMS perform badly with self-joins
• Each row retrieved with a separate query
– 1 query becomes n queries,
where n is cardinality of
relation
• Predicate selection in RDB2RDF tool
– No optimization possible
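The self-join problem can be illustrated by comparing a direct SQL query with a naive translation that reads every requested attribute through its own copy of the table. This is only a sketch of the general shape: it is not the SQL actually emitted by D2RQ or SquirrelRDF, and the `objID` key and the rows are assumptions.

```python
import sqlite3


def direct_sql(table, columns):
    """The query a relational user would write: one scan of the table."""
    return f"SELECT {', '.join(columns)} FROM {table}"


def naive_sql(table, key, columns):
    """One table alias per requested column, joined back on the key."""
    aliases = [f"t{i}" for i in range(len(columns))]
    select = ", ".join(f"{a}.{c}" for a, c in zip(aliases, columns))
    joins = "".join(
        f" JOIN {table} {a} ON {a}.{key} = t0.{key}" for a in aliases[1:]
    )
    return f"SELECT {select} FROM {table} t0{joins}"


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE ReliableStars "
    "(objID INTEGER PRIMARY KEY, ra REAL, sCorMagB REAL, sCorMagI REAL)"
)
conn.executemany(
    "INSERT INTO ReliableStars VALUES (?, ?, ?, ?)",
    [(1, 10.0, 18.5, 17.8), (2, 11.0, 19.5, 17.9)],
)

columns = ["ra", "sCorMagB", "sCorMagI"]
naive = naive_sql("ReliableStars", "objID", columns)
same = sorted(conn.execute(naive)) == sorted(
    conn.execute(direct_sql("ReliableStars", columns))
)
print(naive)
print(same)
```

Both queries return the same rows, but the naive form adds one join per extra variable, which becomes costly on relations with up to 152 attributes and hundreds of thousands of rows.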
Extracting Relevant Data: Conclusions
• SPARQL not expressive enough for real
(astronomy) queries
• RDBMS benefits from 30+ years research
– Query optimisation
– Indexes
• RDF stores are improving
– Require existing data to be replicated
• RDB2RDF tools show promise
– Need to exploit relational database
Can RDB2RDF Tools Feasibly Expose
Large Science Archives for Data
Integration?
Not currently!
More work needed on query translation…
Conclusions & Future Work
Traditional Integration Challenges and Semantic Web Solutions
1. Locating data
• SKOS Vocabularies
– Removes ambiguity
– Enables limited machine
understanding
2. Extracting relevant data
• RDB2RDF Tools
– Requires improved query
translation
3. Understanding data
• Semantic model mappings
– Follow “chains” of mappings
– Relies on RDB2RDF work