Integrated support for data integration and science portals

Amarnath Gupta
University of California San Diego
Overview
• We will first
– Discuss what “cyberinfrastructure” for science means
– Situate the business of “data integration” within the
cyberinfrastructure setting
• Then we will briefly describe a few cyberinfrastructure
projects in different science disciplines
– Biomedical sciences, geo-sciences, environmental sciences, marine
biology, physical oceanography …
• We will examine some dimensions of the data integration
problem
– Discuss how they are approached in different projects from a
CS/Data Management perspective
• Discuss common and complementary themes across these
approaches
ISSGC06 – Ischia, Italy
Cyberinfrastructure
• Cyberinfrastructure is the organized
aggregate of technologies enabling
access and coordination of information
technology resources to facilitate science,
engineering, and societal goals.
– Data access from distributed systems
– Data inter-operability and assimilation
– Computation: grid based and workflows
– Visualization
– Tools
– Information Integration: highlighted today
National Science Foundation’s
Cyberinfrastructure
NSF Blue Ribbon Panel
(Atkins) Report
provided a compelling
and comprehensive
vision of an integrated
Cyberinfrastructure
Modified from Berman,
SDSC, 2005
CYBERINFRASTRUCTURE FOR THE GEOSCIENCES
A.K.Sinha, Virginia Tech, 2005
Source: Mark Ellisman
We are here:
(a) Making more general-purpose data
integration infrastructure over distributed
resources
(b) Extending to accommodate various
scientific applications with stored and
streaming data
Source: Mark Ellisman
GEONgrid Software Layers
Portal (login, myGEON)
Registration
Registration
Services
GEONworkbench
GEONsearch
Data
Integration
Services
Indexing
Services
Modeling
Environment
Workflow
Services
Visualization
& Mapping
Services
Core Grid Services
GT3, OGSA-DAI, GSI, CAS, gridFTP, SRB, PostGIS, mySQL, DB2
Physical Grid
RedHat Linux, ROCKS, Internet, I2, OptIPuter (planned)
GEON Space
BIRN: Major System Components
Collaborating Groups of Biomedical Researchers
Data Integration
Mechanisms
Distributed Data Collections Mgmnt
Distributed Data File Management
Computation/Analysis Facilities
Identity/Login
Management
Authorization and
Role Definition
Overall Operations
Command/Batch Access
Application Portal
Domain
Application Tools
Integrated SW Distribution
Complete Workflows
Registered BIRN Data
BIRN: Specific Implementations
Mouse, Function, Morphometry (+ New Areas and Users )
BIRN Data
Integration Suite
Storage Resource Broker (SRB)
AFS (file system)
Condor, Globus: Local clusters + Teragrid
GSI-Based.
GAMA + MyProxy
SRB for Access
Control to Data
BIRN-CC
Command/Batch Access
BIRN Portal
e.g., AFNI, Air,
3DSlicer, LONI, ..
Semi-Annual BIRN SW
Distribution
Pegasus, Kepler, Loni Pipeline, etc.
Registered BIRN Data
Haystack
Web
portals
Utopia
Java applications
KAVE
provenance
capture
Pedro semantic
publication
myGrid
ontology
Legacy
applications
mIR myGrid
information
repository
Freefluo
workflow
engine
Notification
service
Web Service (Grid Service)
communication fabric
Soaplab
Executable codes with
an IDL
Metadata Management
Service & workflow
discovery
Pedro semantic
publication
GRIMOIRES
federated
UDDI+ registry
KAVE metadata
store
e-Science events
Feta
semantic
discovery
Data
Management
e-Science coordination
information
model
External
Services
e-Science
process
patterns
e-Science
mediator
LSID support
myGrid
Core
Services
Applications
enactment
LSID
Launchpad
Taverna
e-Science
workbench
Workflow
Thirdparty
tools
The OntoGrid View
Gowlab
Web Sites
OGSA-DAI DQP
service
Web Services
Courtesy: Carole Goble
AMBIT
text extraction
service
OGSA-DAI
databases
A Word about Data in Science
Excerpts from a Report by NSF’s Office of the Cyberinfrastructure
• Data. … data are any and all complex data entities from
observations, experiments, simulations, models, and higher order
assemblies, along with the associated documentation needed to
describe and interpret the data.
• Metadata. Metadata are a subset of data, and are data about
data. Metadata summarize data content, context, structure,
inter-relationships, and provenance (information on history and
origins). They add relevance and purpose to data, and enable the
identification of similar data in different data collections.
• Ontology. An ontology is the systematic description of a given
phenomenon, often includes a controlled vocabulary and
relationships, captures nuances in meaning and enables
knowledge sharing and reuse.
What is data integration?
• For applications where there are a number of data
sources (recall previous slide)
– Geographically distributed
– Having data on different platforms
– (may be) on systems with different query capabilities (e.g., different
DBMSs, files, spreadsheets)
• Perhaps even having different data models
– Having different schema
– BUT about one common, general theme
• One may want to construct
– A general-purpose information system such that
• All these data sources can be co-accessed as if they belong to a
single data source
• It can produce “combined information objects” on-demand for ad hoc
queries to facilitate problem-specific analyses performed through
other software products (workflows, atlases, statistical packages …)
• Data integration refers to a body of techniques to
produce such an information system
Data Integration vis-à-vis Data Grid
• A different aspect of data management
Semantic data Organization (with behavior)
myActiveNeuroCollection
patientRecordsCollection
Virtual Data Transparency
image.cgi image.wsdl
image.sql
Data Replica Transparency
image_0.jpg…image_100.jpg
Interorganizational
Information
Storage
Management
Data Identifier Transparency
E:\srbVault\image.jpg /users/srbVault/image.jpg Select … from srb.mdas.td where...
Storage Location Transparency
Storage Resource Transparency
Courtesy: Reagan Moore and Arun Jagatheesan
Data Integration in Science Starts with Science
Questions
• GeoScience (GEON)
– What is the geologic and geophysical record of Super-Continent assembly
and dispersal?
– What are the architectures of terrain boundaries at depth?
– How do composition, temperature and strain fabrics vary within the
lithosphere and asthenosphere? Are lithospheric and asthenospheric strain
coupled?
• Neuroscience (BIRN)
– Find volumetric data/metadata from MRIs of humans with specific diagnosis(es)
• Which structures are decreased/increased in size relative to normal controls
• Which structures show structural differences across a variety of diagnoses
– Given a structure which shows structural differences
• Which other structures are associated with it
• Do any of these associated structures show structural differences
• Do these other changed structures have commonalities (i.e. cell types,
neurotransmitters, other afferent/efferent connections)
• Environmental Science (PAKT, CAMERA)
– Explain biodiversity by correlating distribution of a taxonomic group with
spatial (temporal) distribution of temperature, dissolved oxygen, salinity.
– What accounts for large-scale genetic variation in microbial genomes that
share a very recent common ancestry among coral reef habitats?
DATA NEEDED TO ADDRESS THESE QUESTIONS ARE DISTRIBUTED ACROSS THE WORLD
A Science Question can be Complex
Q1. What is the geologic and geophysical record of Super-Continent assembly and dispersal?
Needs complex integration of geophysical data with those
associated with sub-crustal lithosphere ages, its composition
and physical properties (seismic, thermal etc), surface
geology and associated events chronology
Adapted from D.Seber, SDSC
A.K.Sinha, Virginia Tech, 2005
Converting Questions to Queries
CYBERINFRASTRUCTURE FOR THE GEOSCIENCES
A.K.Sinha, Virginia Tech, 2005
(Some) Dimensions of Information Integration
in Cyberinfrastructure Projects
• Source Information Model
• Integration Engine’s Information Model
– Specification of semantic correspondences across
sources
– The 3-party power play among “global schema”, “local
schema”, “ontology”
• Query paradigms over integrated data
• The mechanics of
– query planning
– query execution
About Semantic Correspondences
• The general problem
– For any data integration across multiple sources there
needs to be a way to
• Specify how two objects from different data sources may
correspond
• Specify how the “joining” of these two objects would create a
composite data object
• What’s the big deal?
– Identical object versus equivalent objects
– Complete objects versus partial objects
– Multi-scale representations of the same object
– Handling definitional differences
– Taking into account natural variability
– Contextual correspondence
Are these always specifiable through ontological standards like OWL?
Do we need to have “correspondence checking” services?
Listen to Oscar and Carol’s session tomorrow for a different angle
About the 3-party Power Play
• While we want to create a single (cyber-) infrastructure
with a data integration component, different applications
have different integration scenarios
– Is there a single global schema?
– Do new applications (and hence global schema) get added all the
time over existing sources and ontologies?
– Are the sources fixed? Do new sources get added all the time?
Do sources come and go?
• Are sources added dynamically as “data sets” that users want to
integrate “on the fly”?
– Do local schemata come with their own ontologies? Is there a
global ontology that all local ontologies must map to?
– How does the global schema (if one exists) relate to the global
and local ontologies?
– Do new (or modified) ontologies get added all the time?
– Do the local schemata evolve all the time?
Is there a general way to manage this?
Do we need to architect any cyberinfrastructure components differently?
Source Information Models
• BIRN
– Data Sources
• Relational DBMS
– Standard data types
– Semantic data types (attribute-domain references to ontologies)
• Some data and computation sources expose a set of functions
• Key constraints
– Ontology Sources
• Simplifying assumptions
– Ontologies can be approximated by edge-labeled directed graphs
stored in relational systems
– Graph traversal functions can be mimicked as database functions
• BONFIRE
– Glue ontology for simple inter-ontology mappings and extensions
– Image and Spatial Data Sources
• Discussed later
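The simplifying assumption above (an ontology approximated by an edge-labeled directed graph held in a relational system, with graph traversal exposed as a database function) can be sketched as follows. This is only an illustration using SQLite; the table name `edge`, the `part_of` label, and the anatomy terms are hypothetical and are not taken from the BIRN ontology sources.

```python
import sqlite3

# Hypothetical ontology fragment stored as an edge table (source, label, target).
EDGES = [
    ("hippocampus", "part_of", "limbic_system"),
    ("limbic_system", "part_of", "cerebrum"),
    ("cerebrum", "part_of", "brain"),
    ("purkinje_cell", "located_in", "cerebellum"),
]

def ancestors(conn, term, label):
    """Follow edges carrying the given label transitively, starting from term."""
    rows = conn.execute(
        """
        WITH RECURSIVE reach(node) AS (
            SELECT target FROM edge WHERE source = ? AND label = ?
            UNION
            SELECT e.target FROM edge e JOIN reach r ON e.source = r.node
            WHERE e.label = ?
        )
        SELECT node FROM reach
        """,
        (term, label, label),
    ).fetchall()
    return {row[0] for row in rows}

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE edge (source TEXT, label TEXT, target TEXT)")
conn.executemany("INSERT INTO edge VALUES (?, ?, ?)", EDGES)

# Graph traversal "mimicked as a database function": one recursive query.
result = ancestors(conn, "hippocampus", "part_of")
```

Keeping the graph in an ordinary table means a mediator could treat the ontology like any other relational source and call the traversal as an opaque function.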
Source Information Models
• GEON
– Data Sources
• Assumption: all data are in GEONSpace
• Items and Item details
• Any relational JDBC data source (e.g., Excel files) is admitted
• Standard relational data types, shapefiles for spatial data
• Semantic data types by connecting to ontology
– Ontology Sources
• Any OWL-specified ontology
– Registration in GEON
• Level 1: Federation Based Integration
» Users should know the component database schemata
• Level 2: View Based Integration
» Same as in BIRN
• Level 3: Ontology Based Integration
» Preferred Method
Source Information Models
• PAKT (marine biogeography)
– Data Sources
• Relational
• Spatial (vectors) supported by GIS and Spatial DBMS
• Spatial (raster – continuously partitionable arrays)
– ArcGIS (map algebra),
– Nested, non-aligned, multiple resolution
• Spatially-indexed time series
• Function-exposing sources (WSDL)
– Parameter and result data types are interpretable or
BLOBS
– Ontology Sources
• Any ontology specified in a subset of OWL
• Any DAG-structured data source
Source Information Models
• CAMERA
– PAKT ++
– Data sources that export annotated sequences as a
base data type
– Phylogenetic trees
– XML repositories with XPath/XQuery Processor
– RDBMS with XML processing capabilities
– Graphs such as molecular interaction networks (e.g.,
biological pathways), chemical reaction networks …
Integration Engine’s Information Model
• BIRN
– Sources from the mediator’s view
• Base relations may have binding patterns
• Distinction between data and metadata is not strictly
observed
– SRB metadata catalog is treated as a relational source
with some special functions
• Files are accessed by reference to data-grid URIs (SRB ids)
– Integration Model
• Essentially Global-as-view (GAV) mediation
• “semantic” aspect of the mediation executed through opaque
functions over ontology sources
• Key constraints not used during standard query processing but
are used for keyword queries
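Global-as-view mediation means each global relation is defined as a query over source relations, so a user query is answered by unfolding those definitions. The sketch below shows that unfolding step only; the relation names (`subject_scan`, `src1_subject`, `src2_scan`) and the view definition are hypothetical, and this is not the BIRN mediator's actual code.

```python
# Atoms are (relation_name, (arg1, arg2, ...)).
# Hypothetical GAV definition: a global relation defined as a join of two source relations.
VIEWS = {
    "subject_scan": (
        ("S", "D", "F"),                      # head variables of the view
        [("src1_subject", ("S", "D")),        # body atoms over the sources
         ("src2_scan", ("S", "F"))],
    ),
}

def unfold(query_body):
    """Replace each global atom by its view body, mapping head variables to the query's arguments."""
    unfolded = []
    for i, (rel, args) in enumerate(query_body):
        if rel not in VIEWS:
            unfolded.append((rel, args))      # already a source relation
            continue
        head, body = VIEWS[rel]
        mapping = dict(zip(head, args))
        for body_rel, body_args in body:
            # Variables that appear only inside the view get a fresh suffix to avoid clashes.
            unfolded.append(
                (body_rel, tuple(mapping.get(a, f"{a}_{i}") for a in body_args))
            )
    return unfolded

query = [("subject_scan", ("X", "'schizophrenia'", "File"))]
plan = unfold(query)
```

After unfolding, the plan mentions only source relations, which is what the planner can reorder and ship to wrappers.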
Integration Engine’s Information Model
• BIRN (contd.)
– The 3-party power-play
• Many integrated views used by several global schemata on a
relatively fixed set of sources
• Ontologies are used in two ways
– A global view may be defined using ontology functions
– Keyword queries use simple ontological relationships
• Some terms in the global schema mapped to ontologies
through semantic typing
– Otherwise the global schema and integrated views are
independent from the ontology
• Some data are warped to a common atlas coordinate system
to enable atlas queries
– Atlas mapping ≡ spatial annotation
Integration Engine’s Information Model
• BIRN Integration architecture
– Gateway
• Has XML API for source registration, source schema update
• Has XML API for queries
• Can be accessed as web service
– Registry
• API-based access to schema elements and view definitions
• Implemented over MySQL for portability
• Spatial registry for image data
– Planner and Executor
• Described later
– Wrappers
• Local and remote
– OTIS
• Inverted index for ontological terms
[Architecture diagram: Atlas Client, Query Client, Onto Client; Atlas Query Processor, Ontological Query Processor; Spatial Registry, Mediator, OTIS; Data Grid Access, Wrapper Access]
BIRN Tool: Source Registration
Integration Engine’s Information Model
• GEON
– Sources from the Integration Engine’s Viewpoint
• Metadata (Item-level information) maintained in a GEON
standard called ADN (Alexandria-DLESE-NASA)
• Item-detail level information is either any relationalizable
data or shapefiles
• Any WMS, WFS service is a valid source for map information
management
• Does not permit an external ontology source, all ontologies
have to be defined in the GEON framework
– Integration Model
• Every source schema is registered to an ontology
Integration Engine’s Information Model
• 3-party power play
– Several global schemata can be defined
– A global schema IS the OWL-DL compliant ontology
– A couple of consequences
• All transitive closure information is pre-computed after
registration
• If a concept class has key constraints, subsumption is NEXPTIME-hard, and undecidable if the key constraint has a
complex domain
– Does not matter much in practice because subsumption is
rarely computed
• Pragmatics
– As new sources join, or new applications are attempted,
the ontology needs to evolve
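Pre-computing transitive closure after registration means that, once the class hierarchy is known, every implied ancestor relationship is stored so queries need no reasoning at run time. A minimal sketch of that pre-computation over direct subclass edges follows; the class names are hypothetical and the naive fixpoint loop is chosen for readability, not because GEON is known to use it.

```python
def transitive_closure(subclass_of):
    """Return all (descendant, ancestor) pairs implied by direct subclass edges."""
    closure = set(subclass_of)
    changed = True
    while changed:                     # repeat until no new pair can be derived
        changed = False
        for a, b in list(closure):
            for c, d in list(closure):
                if b == c and (a, d) not in closure:
                    closure.add((a, d))
                    changed = True
    return closure

# Hypothetical fragment of a class hierarchy.
DIRECT = {
    ("Granite", "IgneousRock"),
    ("IgneousRock", "Rock"),
    ("Rock", "PlanetaryMaterial"),
}
CLOSURE = transitive_closure(DIRECT)

def is_subclass(child, parent):
    """Query-time check becomes a simple set lookup."""
    return (child, parent) in CLOSURE
```

The trade-off is the usual one: registration becomes more expensive, and the closure must be refreshed whenever the ontology evolves.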
GEON Data Registration
Click on Submission
to register a dataset
Input a data set name
Select a zipped
shapefile
Choose an ontology class
CYBERINFRASTRUCTURE FOR THE GEOSCIENCES
A.K.Sinha, Virginia Tech, 2005
Registration of Item Detail
SiO2 is an instance of class
AnalyticalOxideConcentration and has all
information about the element Si
Planetary Material Ontology
CYBERINFRASTRUCTURE FOR THE GEOSCIENCES
A.K.Sinha, Virginia Tech, 2005
ODAL (Ontological Database Annotation Language)
• Create a partial model of ontologies from a database
• Independent of any GUI
• Independent of any concrete implementation
• Reusable
[Diagram: the GUI generates the ODAL below, which is passed to the ODAL processor]
<odal:NamedIndividuals odal:id="RockSample"
odal:database="VTDatabase">
<odal:Class odal:resource="http://geon.vt.edu#RockSample" />
<odal:Table>Samples</odal:Table>
<odal:Table>RockTexture</odal:Table>
<odal:Table>RockGeoChemistry</odal:Table>
<odal:Table>ModalData</odal:Table>
<odal:Table>MineralChemistry</odal:Table>
<odal:Table>Images</odal:Table>
<odal:Column>ssID</odal:Column>
</odal:NamedIndividuals>
The values in the column ssID of the tables Samples, RockTexture, RockGeoChemistry,
ModalData, MineralChemistry and Images represent instances of RockSample
ODAL: Import Ontologies
The Ontologies used for annotating a database can be
imported as follows:
<?xml version="1.0"?>
<odal:ODAL xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
xmlns:owl="http://www.w3.org/2002/07/owl#"
xmlns:odal="http://www.sdsc.edu/odal#">
<odal:Ontology>
<odal:Imports rdf:resource="http://www.library.org/Book.owl"/>
<odal:Imports rdf:resource="http://www.writer.org/Writer.owl"/>
</odal:Ontology>
……
</odal:ODAL>
ODAL: Database Connection Declaration
The target database for making annotation is declared as
follows:
<?xml version="1.0"?>
<odal:ODAL xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
xmlns:owl="http://www.w3.org/2002/07/owl#"
xmlns:odal="http://www.sdsc.edu/odal#">
……
<odal:Database odal:id="PublicationDatabase">
<odal:DatabaseProductName>Oracle</odal:DatabaseProductName>
<odal:DatabaseProductVersion>9.1.21</odal:DatabaseProductVersion>
<odal:Host>oracle.sdsc.edu</odal:Host>
<odal:Port>3456</odal:Port>
<odal:DatabaseName>Publications</odal:DatabaseName>
</odal:Database>
……
</odal:ODAL>
ODAL: Simple Named Individuals
Suppose the book ontology contains a class Book and the schema
Collections contains a table book-price with a column ISBN.
<odal:NamedIndividuals odal:id="BookInTableBookPrice"
odal:database="PublicationDatabase" >
<odal:Class odal:resource="http://www.amazon.com/Book.owl#Book"/>
<odal:Schema>Collections</odal:Schema>
<odal:Table>book-price</odal:Table>
<odal:Column>ISBN</odal:Column>
</odal:NamedIndividuals>
The statement says that each value in the column ISBN represents a book
individual.
odal:id gives a name to the declaration, and represents the set of the
individuals generated by the statement.
ODAL: The Names of Individuals
<odal:NamedIndividuals odal:id="BookInTableBookPrice"
odal:database="PublicationDatabase" >
<odal:Class odal:resource="http://www.amazon.com/Book.owl#Book"/>
<odal:Schema>Collections</odal:Schema>
<odal:Table>book-price</odal:Table>
<odal:Column>ISBN</odal:Column>
</odal:NamedIndividuals>
ISBN value: 0817313478
Individual name: (BookInTableBookPrice, PublicationDatabase.Collections.book-price.ISBN:0817313478)
ODAL: Named Individuals from Multiple Columns
Suppose an ontology contains a class Location and a database table
Rock-Sample with two columns Latitude and Longitude.
<odal:NamedIndividuals odal:id="LocationInTableRockSample" >
<odal:Class odal:resource="http://www.usgs.org/Space.owl#Location"/>
<odal:Schema>California</odal:Schema>
<odal:Table>Rock-Sample</odal:Table>
<odal:Column>Latitude</odal:Column>
<odal:Column>Longitude</odal:Column>
</odal:NamedIndividuals>
The statement says that a pair of latitude and longitude gives a location
ODAL: Named Individuals with Conditions
<odal:NamedIndividuals odal:id="MaleEmployeeInTableEmployee" >
<odal:Class odal:resource="http://www.abc.com/Employee.owl#MaleEmployee"/>
<odal:Table>employee</odal:Table>
<odal:Column>EmployeeId</odal:Column>
<odal:Condition><![CDATA[ Gender='M' ]]></odal:Condition>
</odal:NamedIndividuals>
<odal:NamedIndividuals odal:id="FemaleEmployeeInTableEmployee" >
<odal:Class odal:resource="http://www.abc.com/Employee.owl#FemaleEmployee"/>
<odal:Table>employee</odal:Table>
<odal:Column>EmployeeId</odal:Column>
<odal:Condition><![CDATA[ Gender='F' ]]></odal:Condition>
</odal:NamedIndividuals>
A condition in an odal:Condition element should be a Boolean expression that is
valid in the WHERE clause of an SQL query
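One way to read a NamedIndividuals declaration with a condition is as a recipe for an SQL query that enumerates the individuals. The sketch below parses one declaration and builds such a query. It is an assumption about how a processor might interpret the declaration, not the actual ODAL processor; the function name `to_sql` is hypothetical, and the namespace URI is the one used in the ODAL examples.

```python
import xml.etree.ElementTree as ET

ODAL_NS = "http://www.sdsc.edu/odal#"

SNIPPET = f"""
<odal:NamedIndividuals xmlns:odal="{ODAL_NS}" odal:id="MaleEmployeeInTableEmployee">
  <odal:Class odal:resource="http://www.abc.com/Employee.owl#MaleEmployee"/>
  <odal:Table>employee</odal:Table>
  <odal:Column>EmployeeId</odal:Column>
  <odal:Condition><![CDATA[ Gender='M' ]]></odal:Condition>
</odal:NamedIndividuals>
"""

def to_sql(xml_text):
    """Turn one NamedIndividuals declaration into a SELECT over its identifying columns."""
    root = ET.fromstring(xml_text)
    tag = lambda name: f"{{{ODAL_NS}}}{name}"   # ElementTree's {namespace}local form
    table = root.find(tag("Table")).text.strip()
    columns = [c.text.strip() for c in root.findall(tag("Column"))]
    sql = f"SELECT DISTINCT {', '.join(columns)} FROM {table}"
    condition = root.find(tag("Condition"))
    if condition is not None and condition.text and condition.text.strip():
        # The condition is pasted into the WHERE clause, hence the validity requirement.
        sql += f" WHERE {condition.text.strip()}"
    return sql

sql = to_sql(SNIPPET)
```

Because the condition text is inserted verbatim, it has to be a Boolean expression the target DBMS accepts, which is the requirement stated on the slide.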
ODAL: Data Type Property Declaration
Table Person: columns …, SSN, …, age, …; example row …, 123-56-7890, …, 8, …
Ontology: Person –hasAge→ posInt
<odal:NamedIndividuals odal:id="PersonInTablePerson" >
<odal:Class odal:resource="http://www.foo.org/Person.owl#Person"/>
<odal:Table>Person</odal:Table>
<odal:Column>ssn</odal:Column>
</odal:NamedIndividuals>
<odal:OntologyProperty>
<odal:DatatypeProperty odal:resource="http://www.foo.org/Person.owl#hasAge"/>
<odal:Table>person</odal:Table>
<odal:Domain odal:resource="PersonInTablePerson" />
<odal:Range odal:resource="age" />
</odal:OntologyProperty>
Conditions for Joining Individuals from Different
Resources
• Usually we do not join individuals across different resources
Example (class Rock): one resource has RockSampleID 10001, another has RockID 10001
We don’t know whether 10001 represents the same rock in the two
resources. By default, we assume it does not.
• A set of datatype properties can be declared as a key for a class in
the ontology. We join across multiple resources based on keys.
e.g. {hasLatitude, hasLongitude} can be declared as a key of Location
Two locations from different resources are the same if they have the same
latitude and longitude
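The key-based rule above can be sketched in a few lines: two individuals from different resources are treated as the same only when the class has a declared key and all key properties agree. The record layout, the coordinate values, and the function name `same_individual` are hypothetical.

```python
def same_individual(a, b, key):
    """Match two individuals only if a key is declared and every key property is present and equal."""
    if not key:
        return False                   # no declared key: assume different individuals
    return all(p in a and p in b and a[p] == b[p] for p in key)

LOCATION_KEY = ("hasLatitude", "hasLongitude")   # declared key of class Location
ROCK_KEY = ()                                    # no key declared for Rock

loc_a = {"resource": "db1", "hasLatitude": 37.23, "hasLongitude": -80.41}
loc_b = {"resource": "db2", "hasLatitude": 37.23, "hasLongitude": -80.41}
rock_a = {"resource": "db1", "RockSampleID": 10001}
rock_b = {"resource": "db2", "RockID": 10001}

locations_match = same_individual(loc_a, loc_b, LOCATION_KEY)
rocks_match = same_individual(rock_a, rock_b, ROCK_KEY)
```

The two rock records share the identifier 10001, yet they are not joined because nothing in the ontology says that identifier is a key across resources.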
The Architecture of GEON Semantic Mediator
Oracle
DB2
SQL
Server
MySQL
PostgreSQL
PostGIS
Query Execution
Query
Optimization
Query
Planning
Internal Database
SQL Parser
Spatial SQL against federated schemas
Mediator JDBC Driver
SOQL
GUI
Semantic Query Rewriter
SOQL
Parser
Ontology
Reasoner
ODAL Processor
OWL
Portal or Application
ODAL
SOQL Processor
The Map Integration Architecture
Map Integration
Integration Engine’s Information Model
• PAKT (briefly)
– Type extensibility of the mediator
• Nested relational query language extended by tree and a
restricted set of graph pattern operations
• Construction operations important
• Passive extensibility
– Source more powerful than the mediator
– Source exports a set of type-based optimization rules to
the mediator
• Active extensibility
– Mediator extends its set of interpreted types
– Ontology management
• Ontological queries processed by a separate co-processor that
interoperates with mediator
• Query planner partitions the query into ontological and
mediated query processors
Query Paradigms
• What are the different kinds of queries scientists
and applications pose to an integrated system?
– Ontologically supported mediated queries
– Metadata-based file access
• “Find most recent FMRI data of all patients with low scores in
working memory tasks having volumetric changes of hippocampus
over 10% in 2 years”
– Keyword queries
• FMRI “working memory task” hippocampus
– Ontologically supported keyword queries
– Associative searches
BIRN Data Grid Usage
[Chart: total number of files (in thousands) and total size of storage (in gigabytes), Oct-02 to Jun-06; callouts: 16 million files, 16+ Terabytes]
• 21,038 raw image files per subject
• 2.4 GB of raw image data per subject
• 25 GB to 40 GB of processed image data per subject
• 10 million slices of functional imaging data in Phase II
• 7 Terabytes of image data for all of the Phase II analyses
(conservative estimate of 25 GB/subject)
GEON: SOQL (Simple Ontology Query Language)
Query single or integrated resources
• via ontologies (i.e., high level logical views)
• independent of any physical representation (i.e. schemas)
[Ontology diagram: RockSample –location→ Location (lat, long); RockSample –hasSiO2→ ValueWithUnit (value, unit); leaf types: float, string]
[Flow: the GUI generates the SOQL query below, which is passed to the SOQL processor]
SELECT X.location.*
FROM RockSample X
WHERE X.location.lat > 60
AND X.location.long > 100
AND X.hasSiO2.value < 30
AND X.hasSiO2.unit = 'weightPercentage'
Question: Find all seismic stations within 1 mile of railroads
GEON SOQL GUI:
SELECT X.code, X.location.*
FROM SeismicStation X, Railroad Y
WHERE distance(X.location, Y.geometry) < 1
SOQL Processor:
SELECT X2.stationcode, X2.lat, X2.lon
FROM railroads_of_the_united_states X1,
stationdatatable X2
WHERE distance(X1.the_geom, MakePoint(X2.lat, X2.lon)) < 1
Schema Mediator (evaluates the join condition):
distance(X1.the_geom, MakePoint(X2.lat, X2.lon)) < 1
Railroad shapefile:
SELECT X1.the_geom
FROM railroads X1
Seismic Stations:
SELECT X2.stationcode, X2.lat, X2.lon
FROM stationdatatable X2
WHERE bounding box condition
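The decomposition above sends a cheap bounding box condition to the station source and leaves the exact distance test to the mediator. A small sketch of that two-step evaluation follows. It assumes planar coordinates already expressed in miles, a single polyline for the railroad, and made-up station records; real geometries would need a spatial library and proper geodesic distances.

```python
import math

def point_segment_distance(p, a, b):
    """Euclidean distance from point p to segment ab (planar coordinates)."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    if dx == 0 and dy == 0:
        return math.hypot(px - ax, py - ay)
    # Project p onto the segment and clamp to its end points.
    t = max(0.0, min(1.0, ((px - ax) * dx + (py - ay) * dy) / (dx * dx + dy * dy)))
    return math.hypot(px - (ax + t * dx), py - (ay + t * dy))

def stations_near(stations, railroad, limit):
    xs = [x for x, _ in railroad]
    ys = [y for _, y in railroad]
    box = (min(xs) - limit, min(ys) - limit, max(xs) + limit, max(ys) + limit)
    # Step 1: bounding box filter, the part that can be pushed to the station source.
    candidates = [s for s in stations
                  if box[0] <= s[1] <= box[2] and box[1] <= s[2] <= box[3]]
    # Step 2: exact distance test, evaluated by the mediator on the survivors.
    segments = list(zip(railroad, railroad[1:]))
    return [s[0] for s in candidates
            if any(point_segment_distance((s[1], s[2]), a, b) < limit
                   for a, b in segments)]

RAILROAD = [(0.0, 0.0), (10.0, 0.0)]                      # hypothetical polyline
STATIONS = [("ST1", 5.0, 0.5), ("ST2", 5.0, 3.0), ("ST3", 50.0, 0.2)]
near = stations_near(STATIONS, RAILROAD, 1.0)
```

The box filter never discards a true match because it is the railroad's extent grown by the distance limit; it only reduces how many rows reach the expensive test.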
BIRN: A Functional View of the Mediation Process
Planner
Query Expression
Execution Engine
Pre-Executable Plan
(UCQ+ + Nesting + Grouping & Aggregate)
Flattening of Nested Queries
View Unfolding
Normalization to DNF
Executable Plan
Execution Control
Post-processing
+ aggregate
Result Building
Predicate Reordering
(binding patterns + maximal chunk)
Result Reporting
Maximal Feasible Plan
Algebraic Plan
Cost/Selectivity-based Optimization
Pre-Executable Plan
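Predicate reordering under binding patterns can be illustrated with a greedy sketch: a relation with required inputs can only be called once those variables are bound by earlier atoms. The atom format `(name, inputs, outputs)` and the relation names are hypothetical, and the greedy strategy is a stand-in, not necessarily the planner's algorithm.

```python
def feasible_order(atoms, bound=()):
    """Greedy ordering: repeatedly pick an atom whose required inputs are already bound."""
    bound, remaining, plan = set(bound), list(atoms), []
    while remaining:
        for atom in remaining:
            name, inputs, outputs = atom
            if set(inputs) <= bound:
                plan.append(name)
                bound |= set(inputs) | set(outputs)
                remaining.remove(atom)
                break
        else:
            return None                # no callable atom left: the query is infeasible
    return plan

ATOMS = [
    ("image_lookup", ("SubjectId",), ("File",)),    # function-like source: needs SubjectId
    ("subjects", (), ("SubjectId", "Diagnosis")),   # plain relation: can be scanned freely
]
plan = feasible_order(ATOMS)
infeasible = feasible_order([("image_lookup", ("SubjectId",), ("File",))])
```

A cost-based optimizer would then choose among the feasible orders; this sketch only establishes that at least one exists.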
View Definition and Query Language
• Union of conjunctive queries
• May contain function terms
• Expressed in XML Datalog with aggregate functions
• Query
q(X,F(Y)) :- r1(X,Z), r2(Z,Y)
where F(Y) is an aggregate function applied to the set of Y values, and X are the group-by variables.
• Planner and Executor translate this to:
– q’(X,Y) :- r1(X,Z), r2(Z,Y)
– q(X,W) :- F(gb(q’(X,Y)))
– where the group-by function “gb” with aggregate function F is
pushed to the data source whenever possible, or otherwise evaluated at the
Mediator.
• Query language allows nested queries: inner queries
are assigned to intermediate variables that are used by the
main query
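The two-step translation (first the plain conjunctive query q’, then group-by and aggregation over its result) can be sketched directly. The relation contents are made up, and `sum` stands in for the aggregate function F; this shows the mediator-side evaluation, that is, the case where the aggregate is not pushed to a source.

```python
from collections import defaultdict

R1 = [("s1", "a"), ("s2", "b"), ("s1", "c")]        # r1(X, Z), hypothetical tuples
R2 = [("a", 10), ("b", 20), ("c", 30), ("a", 5)]    # r2(Z, Y), hypothetical tuples

def q_prime(r1, r2):
    """q'(X,Y) :- r1(X,Z), r2(Z,Y): join on Z, keep X and Y."""
    return [(x, y) for x, z in r1 for z2, y in r2 if z == z2]

def group_by_aggregate(rows, agg):
    """q(X,W) :- F(gb(q'(X,Y))): group the Y values by X, then apply F to each group."""
    groups = defaultdict(list)
    for x, y in rows:
        groups[x].append(y)
    return {x: agg(ys) for x, ys in groups.items()}

result = group_by_aggregate(q_prime(R1, R2), sum)
```

When both relations live in one source that supports aggregation, the same computation could instead be shipped as a single GROUP BY query, which is the push-down case mentioned above.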
BIRN: Mapping Relations
• Ontology Mapping: maps data values from a
source to a term of a known ontology
(UMLS)
• Joinable: a relation that pairs attributes from different
relations
• Value-Map: maps a mediator-supported data value
to the source-supported value (for example, gender coded as 0/1 at
some source is male/female for the mediator)
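A value map can be pictured as a small per-source lookup table that is applied in one direction when a query is sent to a source and in the other when results come back. The sketch below assumes that 0 means male and 1 means female at a hypothetical source called `siteA`; the slide gives only the 0/1 versus male/female pairing, so the exact assignment is an assumption.

```python
# Hypothetical value map: mediator vocabulary -> source encoding, per source and attribute.
VALUE_MAP = {
    "siteA": {"gender": {"male": 0, "female": 1}},
}

def to_source(source, attr, mediator_value):
    """Rewrite a constant in a query before it is shipped to the source."""
    return VALUE_MAP.get(source, {}).get(attr, {}).get(mediator_value, mediator_value)

def to_mediator(source, attr, source_value):
    """Rewrite a value in a source result into the mediator's vocabulary."""
    inverse = {v: k for k, v in VALUE_MAP.get(source, {}).get(attr, {}).items()}
    return inverse.get(source_value, source_value)

shipped = to_source("siteA", "gender", "female")
returned = to_mediator("siteA", "gender", 0)
unmapped = to_source("siteB", "gender", "male")   # no map registered: value passes through
```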
Processing Ontological Queries
Courtesy: Vadim Astakhov
PAKT: Spatial and Taxonomic Queries
Example Queries
[Diagram: data sources OBIS and WOA over the Biological, Geo-Spatial and Physiochemical domains; extended with Benth_Hab and the Habitat domain]
Italics: input; Underline: output
Q1: where is species X found?
OBIS(scientific_name,lat,long)
Q2: for a given polygon, what species are found?
OBIS(scientific_name,m_lat,m_long,m_lat,m_long)
Q3: where is species X found given certain physical parameter?
OBIS(scientific_name,lat,long)
WOA(physio,lat,long)
Q4: what are the aggregated physical properties of species X?
OBIS(scientific_name,lat,long)
WOA(physio,lat,long)
Q5: where is habitat X found?
CMECS(habitat,physio)
BH(habitat_grp,shape)
Q6: for a given polygon A, what habitats are found?
CMECS(habitat,physio)
BH(habitat_grp,shape) PolygonA
Q7: where is habitat X found given certain physical parameter?
CMECS(habitat,physio)
BH(habitat_grp,shape)
WOA(physio,lat,long)
Q8: what are the aggregated physical properties of habitat X?
CMECS(habitat,physio)
BH(habitat_grp,shape)
WOA(physio,lat,long)
Q9: what species can be found at habitat X?
CMECS(habitat,physio)
BH(habitat_grp,shape)
OBIS(scientific_name,lat,long)
Q10: in what habitats is species X found?
CMECS(habitat,physio)
BH(habitat_grp,shape)
OBIS(scientific_name,lat,long)
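Q3 is essentially a join of OBIS and WOA on location, with a selection on the species name and on the physical parameter. A toy sketch follows. All rows are invented for illustration, the extra `value` column on WOA is an assumption (the slide lists only physio, lat, long), and exact equality on coordinates stands in for whatever spatial matching the mediator really performs.

```python
# Invented rows, for illustration only.
OBIS = [  # (scientific_name, lat, long)
    ("Thunnus albacares", 10.0, -150.0),
    ("Thunnus albacares", 30.0, -140.0),
    ("Gadus morhua", 60.0, -30.0),
]
WOA = [  # (physio, lat, long, value); the value column is an assumption
    ("temperature", 10.0, -150.0, 27.5),
    ("temperature", 30.0, -140.0, 18.0),
    ("temperature", 60.0, -30.0, 8.0),
]

def where_found(species, parameter, low, high):
    """Q3: locations of a species restricted to cells where a parameter lies in [low, high]."""
    cells = {(lat, lon) for name, lat, lon, value in WOA
             if name == parameter and low <= value <= high}
    return [(lat, lon) for name, lat, lon in OBIS
            if name == species and (lat, lon) in cells]

hits = where_found("Thunnus albacares", "temperature", 25.0, 30.0)
```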
Frequent Query Patterns
• Example queries are joins of
– Left query patterns: habitat-spatial, and
– Right query patterns: spatial-environmental/species
distribution
CMECS(habitat,physio)
BH(habitat_grp,shape)
BH(..,shape)
WOA(physio,lat,long)
PolygonA )
BH(..,shape)
OBIS(scientific_name,lat,long)
CMECS(habitat,physio)
BH(habitat_grp,shape)
BH(..,shape)
WOA(physio,lat,long)
CMECS(habitat,physio)
BH(habitat_grp,shape)
BH(..,shape)
OBIS(scientific_name,lat,long)
(
Onto-module’s queries
API
Mediator’s queries
The Resource Management Aspect of Query
Evaluation
• Primarily done by the Manchester group (Watson et al)
• Polar*
– Based on OQL (internally monoid comprehension)
– Multi-node planning
• Plan partitioning
• Exchange operator
– Attribute sensitivity
– Data & index repartitioning
• Plan scheduling
– Query execution
[Diagram, from Amy Krause: a distributed query plan over DQP nodes. Nodes 1 and 2 run scan (A) and scan (B) against DBMS data exposed through OGSA-DAI; nodes 3 and 4 run join (A1,B1) and join (A2,B2); node 5 runs reduce]
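The plan in the diagram (two scans, two partitioned joins, one reduce) relies on an exchange step that routes rows to join nodes by their join attribute. The single-process sketch below imitates that data flow; the relations, the two-way partitioning, and the function names are hypothetical, and no actual OGSA-DAI or Polar* API is used.

```python
def exchange(rows, key_index, n_partitions):
    """Route each row to a partition by hashing its join attribute."""
    parts = [[] for _ in range(n_partitions)]
    for row in rows:
        parts[hash(row[key_index]) % n_partitions].append(row)
    return parts

def hash_join(left, right):
    """Join two row lists on their first column."""
    index = {}
    for row in left:
        index.setdefault(row[0], []).append(row)
    return [l + r[1:] for r in right for l in index.get(r[0], [])]

A = [(1, "a1"), (2, "a2"), (3, "a3"), (4, "a4")]      # output of scan (A)
B = [(1, "b1"), (3, "b3"), (3, "b3x"), (5, "b5")]     # output of scan (B)

a_parts, b_parts = exchange(A, 0, 2), exchange(B, 0, 2)
# Each pair of partitions could be joined on a different node: join (A1,B1), join (A2,B2).
partial = [hash_join(a, b) for a, b in zip(a_parts, b_parts)]
# The reduce step merges the partial results.
result = sorted(row for part in partial for row in part)
```

Because both inputs are partitioned with the same function on the join attribute, matching rows always meet in the same partition, so the merged result equals the unpartitioned join.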
The Adaptivity Issue in DQP on a Grid
• Monitoring-Assessment-Response framework of adaptive
query processing in a grid (by Gounaris)
– Monitoring:
• a separate module that keeps track of information like
– Has a resource (e.g., memory availability) changed more than
10%?
– Has the data volume changed recently?
• Occurs between operators or within an operator’s execution process
• Other modules subscribe to this notification
– Assessment
• Diagnosis is carried out for suboptimal execution, resource shortage,
resource idleness, unmet performance requirements, unmet user
needs
– Response
• Operator replacement or rescheduling, machine rescheduling, plan
re-optimization…
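The Monitoring-Assessment-Response loop can be sketched as three small functions chained together. The 10% threshold comes from the slide; the metric names, the diagnosis rules, and the chosen responses are hypothetical placeholders rather than the framework's actual policies.

```python
def monitor(previous, current, threshold=0.10):
    """Emit a notification for each metric whose relative change exceeds the threshold."""
    events = []
    for metric, old in previous.items():
        new = current.get(metric, old)
        if old and abs(new - old) / abs(old) > threshold:
            events.append((metric, old, new))
    return events

def assess(events):
    """Diagnose issues from notifications (hypothetical rules)."""
    issues = []
    for metric, old, new in events:
        if metric == "free_memory_mb" and new < old:
            issues.append("resource shortage")
        elif metric == "rows_per_second" and new < old:
            issues.append("suboptimal execution")
    return issues

def respond(issues):
    """Pick a response for each diagnosed issue (hypothetical policy)."""
    actions = {
        "resource shortage": "reschedule operator on another machine",
        "suboptimal execution": "re-optimize plan",
    }
    return [actions[issue] for issue in issues]

before = {"free_memory_mb": 2048, "rows_per_second": 1000}
after = {"free_memory_mb": 1500, "rows_per_second": 950}
actions = respond(assess(monitor(before, after)))
```

Here memory drops by about 27%, which crosses the threshold, while throughput drops by only 5% and is ignored, so a single response is triggered.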
Commonalities and Complementarities
• Common themes
– Overall architectural similarity of cyberinfrastructure projects
• Service orientation
– The data integration task is part of a larger scientific computing,
exploration and analysis process
• Has impact on integration setting, design decisions and performance
expectations
– Mediation with semantic mapping and reasoning seems to be
winning
• Complementary approaches
– Details of the architecture
• Relationship with workflows
– Styles of mediation
– Extensibility of mediator
– Adaptivity of query planning and evaluation
Thank you!
Questions? Comments? Integrated
Queries?