Integrated support for data integration and science portals Amarnath Gupta University of California San Diego Overview • We will first – Discuss what “cyberinfrastructure” for science means – Situate the business of “data integration” within the cyberinfrastructure setting • Then we will briefly describe a few cyberinfrastructure projects in different science disciplines – Biomedical sciences, geo-sciences, environmental sciences, marine biology, physical oceanography … • We will examine some dimensions of the data integration problem – Discuss how they are approached in different projects from a CS /Data Management perspective • Discuss common and complementary themes across these approaches ISSGC06 – Ischia, Italy 2 Cyberinfrastructure • Cyberinfrastructure is the organized aggregate of technologies enabling access and coordination of information technology resources to facilitate science, engineering, and societal goals. – Data access from distributed systems – – – – Data inter-operability and assimilation Computation: grid based and workflows Visualization Tools – Information Integration: highlighted today National Science Foundation’s Cyberinfrastructure NSF Blue Ribbon Panel (Atkins) Report provided a compelling and comprehensive vision of an integrated Cyberinfrastructure Modified from Berman, SDSC, 2005 ISSGC06 – Ischia, Italy CYBERINFRASTRUCTURE FOR THE GEOSCIENCES A.K.Sinha, Virginia Tech, 2005 3 ISSGC06 – Ischia, Italy Source: Mark Ellisman 4 Source: Mark Ellisman ISSGC06 – Ischia, Italy 5 We are here: (a) Making more general-purpose data integration infrastructure over distributed resources (b) Extending to accommodate various scientific applications with stored and streaming data ISSGC06 – Ischia, Italy Source: Mark Ellisman 6 GEONgrid Software Layers Portal (login, myGEON) Registration Registration Services GEONworkbench GEONsearch Data Integration Services Indexing Services Modeling Environment Workflow Services Visualization & Mapping Services Core 
Grid Services GT3, OGSA-DAI, GSI, CAS, gridFTP, SRB, PostGIS, mySQL, DB2 Physical Grid RedHat Linux, ROCKS, Internet, I2, OptIPuter (planned) GEON Space ISSGC06 – Ischia, Italy 7 BIRN: Major System Components Collaborating Groups of Biomedical Researchers Data Integration Mechanisms Distributed Data Collections Mgmnt Distributed Data File Management Computation/Analysis Facilities Identity/Login Management Authorization and Role Definition Overall Operations Command/Batch Access Application Portal Domain Application Tools Integrated SW Distribution Complete Workflows Registered BIRN Data ISSGC06 – Ischia, Italy 8 BIRN: Specific Implementations Mouse, Function, Morphometry (+ New Areas and Users ) BIRN Data Integration Suite Storage Resource Broker (SRB) AFS (file system) Condor, Globus: Local clusters + Teragrid GSI-Based. GAMA + MyProxy SRB for Access Control to Data BIRN-CC Command/Batch Access BIRN Portal e.g., AFNI, Air, 3DSlicer, LONI, .. Semi-Annual BIRN SW Distribution Pegasus, Kepler, Loni Pipeline, etc. 
Registered BIRN Data ISSGC06 – Ischia, Italy 9 Haystack Web portals Utopia Java applications KAVE provenance capture Pedro semantic publication myGrid ontology Legacy applications mIR myGrid information repository Freefluo workflow engine Notification service Web Service (Grid Service) communication fabric Soaplab Executable codes with an IDL ISSGC06 – Ischia, Italy Metadata Management Service & workflow discovery Pedro semantic publication GRIMOIRES federated UDDI+ registry KAVE metadata store e-Science events Feta semantic discovery Data Management e-Science coordination information model External Services e-Science process patterns e-Science mediator LSID support myGrid Core Services Applications enactment LSID Launchpad Taverna e-Science workbench Workflow Thirdparty tools The OntoGrid View Gowlab Web Sites OGSA-DAI DQP service Web Services Courtesy: Carole Goble AMBIT text extraction service OGSA-DAI databases 10 A Word about Data in Science Excerpts from a Report by NSF’s Office of the Cyberinfrastructure • Data. … data are any and all complex data entities from observations, experiments, simulations, models, and higher order assemblies, along with the associated documentation needed to describe and interpret the data. • Metadata. Metadata are a subset of data, and are data about data. Metadata summarize data content, context, structure, inter-relationships, and provenance (information on history and origins). They add relevance and purpose to data, and enable the identification of similar data in different data collections. • Ontology. An ontology is the systematic description of a given phenomenon, often includes a controlled vocabulary and relationships, captures nuances in meaning and enables knowledge sharing and reuse. ISSGC06 – Ischia, Italy 11 What is data integration? 
• For applications where there are a number of data sources (recall previous slide) – Geographically distributed – Having data on different platforms – (may be) on systems with different query capabilities (e.g., different DBMSs, files, spreadsheets) • Perhaps even having different data models – Having different schema – BUT about one common, general theme • One may want to construct – A general-purpose information system such that • All these data sources can be co-accessed as if they belong to a single data source • It can produce “combined information objects” on-demand for ad hoc queries to facilitate problem-specific analyses performed through other software products (workflows, atlases, statistical packages …) • Data integration refers to a body of techniques to produce such an information system ISSGC06 – Ischia, Italy 12 Data Integration vis-à-vis Data Grid • A different aspect of data management Semantic data Organization (with behavior) myActiveNeuroCollection patientRecordsCollection Virtual Data Transparency image.cgi image.wsdl image.sql Data Replica Transparency image_0.jpg…image_100.jpg Interorganizational Information Storage Management Data Identifier Transparency E:\srbVault\image.jpg /users/srbVault/image.jpg Select … from srb.mdas.td where... Storage Location Transparency Storage Resource Transparency ISSGC06 – Ischia, Italy Courtesy: Reagan Moore and Arun Jagatheesan 13 Data Integration in Science Starts with Science Questions • GeoScience (GEON) – What is the geologic and geophysical record of Super-Continent assembly and dispersal? – What are the architectures of terrain boundaries at depth? – How do composition, temperature and strain fabrics vary within the lithosphere and asthenosphere? Are lithospheric and asthenospheric strain coupled? 
• Neuroscience (BIRN)
  – Find volumetric MRIs of humans with specific data/metadata from diagnosis(es)
    • Which structures are decreased/increased in size relative to normal controls?
    • Which structures show structural differences across a variety of diagnoses?
  – Given a structure which shows structural differences
    • Which other structures are associated with it?
    • Do any of these associated structures show structural differences?
    • Do these other changed structures have commonalities (e.g., cell types, neurotransmitters, other afferent/efferent connections)?
• Environmental Science (PAKT, CAMERA)
  – Explain biodiversity by correlating the distribution of a taxonomic group with the spatial (and temporal) distribution of temperature, dissolved oxygen, and salinity.
  – What accounts for large-scale genetic variation in microbial genomes that share a very recent common ancestry among coral reef habitats?
The data needed to address these questions are distributed across the world.

A Science Question can be Complex
Q1. What is the geologic and geophysical record of Super-Continent assembly and dispersal?
Needs complex integration of geophysical data with data on sub-crustal lithosphere ages, composition and physical properties (seismic, thermal, etc.), surface geology, and the chronology of associated events.
Adapted from D. Seber, SDSC, and A.K. Sinha, Virginia Tech, 2005

Converting Questions to Queries
Source: CYBERINFRASTRUCTURE FOR THE GEOSCIENCES, A.K. Sinha, Virginia Tech, 2005

(Some) Dimensions of Information Integration in Cyberinfrastructure Projects
• Source Information Model
• Integration Engine’s Information Model
  – Specification of semantic correspondences across sources
  – The 3-party power play among “global schema”, “local schema”, “ontology”
• Query paradigms over integrated data
• The mechanics of
  – query planning
  – query execution

About Semantic Correspondences
• The general problem
  – For any data integration across multiple sources there needs to be a way to
    • Specify how two objects from different data sources may correspond
    • Specify how the “joining” of these two objects would create a composite data object
• What’s the big deal?
  – Identical objects versus equivalent objects
  – Complete objects versus partial objects
  – Multi-scale representations of the same object
  – Handling definitional differences
  – Taking into account natural variability
  – Contextual correspondence
Are these always specifiable through ontological standards like OWL? Do we need “correspondence checking” services?
Listen to Oscar and Carol’s session tomorrow for a different angle.

About the 3-party Power Play
• While we want to create a single (cyber-)infrastructure with a data integration component, different applications have different integration scenarios
  – Is there a single global schema?
  – Do new applications (and hence global schemata) get added all the time over existing sources and ontologies?
  – Are the sources fixed? Do new sources get added all the time?
Do sources come and go?
    • Are sources added dynamically as “data sets” that users want to integrate “on the fly”?
  – Do local schemata come with their own ontologies? Is there a global ontology that all local ontologies must map to?
  – How does the global schema (if one exists) relate to the global and local ontologies?
  – Do new (or modified) ontologies get added all the time?
  – Do the local schemata evolve all the time?
Is there a general way to manage this? Do we need to architect any cyberinfrastructure components differently?

Source Information Models
• BIRN
  – Data Sources
    • Relational DBMS
      – Standard data types
      – Semantic data types (attribute-domain references to ontologies)
    • Some data and computation sources expose a set of functions
    • Key constraints
  – Ontology Sources
    • Simplifying assumptions
      – Ontologies can be approximated by edge-labeled directed graphs stored in relational systems
      – Graph traversal functions can be mimicked as database functions
    • BONFIRE
      – Glue ontology for simple inter-ontology mappings and extensions
  – Image and Spatial Data Sources
    • Discussed later

Source Information Models
• GEON
  – Data Sources
    • Assumption: all data are in GEONSpace
    • Items and Item details
    • Any relational JDBC data source (e.g., Excel files) is admitted
    • Standard relational data types, shapefiles for spatial data
    • Semantic data types by connecting to an ontology
  – Ontology Sources
    • Any OWL-specified ontology
  – Registration in GEON
    • Level 1: Federation-Based Integration
      » Users should know the component database schemata
    • Level 2: View-Based Integration
      » Same as in BIRN
    • Level 3: Ontology-Based Integration
      » Preferred method

Source Information Models
• PAKT (marine biogeography)
  – Data Sources
    • Relational
    • Spatial (vectors) supported by GIS and Spatial DBMS
    • Spatial (raster – continuously partitionable arrays)
      – ArcGIS (map algebra)
      – Nested, non-aligned, multiple resolution
• Spatially-indexed time series • Function-exposing sources (WSDL) – Parameter and result data types are interpretable or BLOBS – Ontology Sources • Any ontology specified in a subset of OWL • Any DAG-structured data source ISSGC06 – Ischia, Italy 22 Source Information Models • CAMERA – PAKT ++ – Data sources that export annotated sequences as a base data type – Phylogenetic trees – XML repositories with XPath/XQuery Processor – RDBMS with XML processing capabilities – Graphs such as molecular interaction networks (e.g., biological pathways), chemical reaction networks … ISSGC06 – Ischia, Italy 23 Integration Engine’s Information Model • BIRN – Sources from the mediator’s view • Base relations may have binding patterns • Distinction between data and metadata is not strictly observed – SRB metadata catalog is treated as a relational source with some special functions • Files are accessed by reference to data-grid URIs (SRB ids) – Integration Model • Essentially Global-as-view (GAV) mediation • “semantic” aspect of the mediation executed through opaque functions over ontology sources • Key constraints not used during standard query processing but are used for keyword queries ISSGC06 – Ischia, Italy 24 Integration Engine’s Information Model • BIRN (contd.) 
– The 3-party power-play • Many integrated views used by several global schemata on a relatively fixed set of sources • Ontologies are used in two ways – A global view may be defined using ontology functions – Keyword queries use simple ontological relationships • Some terms in the global schema mapped to ontologies through semantic typing – Otherwise the global schema and integrated views are independent from the ontology • Some data are warped to a common atlas coordinate systems to enable atlas queries – Atlas mapping ≡ spatial annotation ISSGC06 – Ischia, Italy 25 Integration Engine’s Information Model • BIRN Integration architecture – Gateway • has XML API for source registration, source schema update Atlas Client Query Client Onto Client • Has XML API for queries Atlas Query Ontological Query • Can be accessed as web service Processor Processor Spatial Registry Mediator OTIS Data Grid Access Wrapper Access – Registry • API-based access to schema elements and view definitions • Implemented over MySQL for portability • Spatial registry for image data – Planner and Executor • Described later – Wrappers • Local and remote – OTIS • Inverted index for ontological terms ISSGC06 – Ischia, Italy 26 BIRN Tool: Source Registration ISSGC06 – Ischia, Italy 27 Information Engine’s Information Model • GEON – Sources from the Integration Engine’s Viewpoint • Metadata (Item-level information) maintained in a GEON standard called ADN (Alexandria-Delese-NASA) • Item-detail level information is either any relationalizable data or shapefiles • Any WMS, WFS service is a valid source for map information management • Does not permit an external ontology source, all ontologies have to be defined in the GEON framework – Integration Model • Every source schema is registered to an ontology ISSGC06 – Ischia, Italy 28 Integration Engine’s Information Model • 3-party power play – Several global schemata can be defined – A global schema IS the OWL-DL compliant ontology – A couple of 
consequences
    • All transitive closure information is pre-computed after registration
    • If a concept class has key constraints, subsumption is NEXPTIME-hard, and undecidable if the key constraint has a complex domain
      – Does not matter much in practice because subsumption is hardly ever computed
• Pragmatics
  – As new sources join, or new applications are attempted, the ontology needs to evolve

GEON Data Registration
• Click on Submission to register a dataset
• Input a data set name
• Select a zipped shapefile
• Choose an ontology class
Source: CYBERINFRASTRUCTURE FOR THE GEOSCIENCES, A.K. Sinha, Virginia Tech, 2005

Registration of Item Detail
SiO2 is an instance of class AnalyticalOxideConcentration and has all information about the element Si.
Planetary Material Ontology
Source: CYBERINFRASTRUCTURE FOR THE GEOSCIENCES, A.K. Sinha, Virginia Tech, 2005

ODAL (Ontological Database Annotation Language)
• Creates a partial model of ontologies from a database
• Independent of any GUI
• Independent of any concrete implementation
• Reusable
[Diagram: a GUI generates the ODAL declaration below, which is passed to the ODAL processor]
<odal:NamedIndividuals odal:id="RockSample" odal:database="VTDatabase">
  <odal:Class odal:resource="http://geon.vt.edu#RockSample" />
  <odal:Table>Samples</odal:Table>
  <odal:Table>RockTexture</odal:Table>
  <odal:Table>RockGeoChemistry</odal:Table>
  <odal:Table>ModalData</odal:Table>
  <odal:Table>MineralChemistry</odal:Table>
  <odal:Table>Images</odal:Table>
  <odal:Column>ssID</odal:Column>
</odal:NamedIndividuals>
The values in the column ssID of the tables Samples, RockTexture, RockGeoChemistry, ModalData, MineralChemistry and Images represent instances of RockSample.

ODAL: Import Ontologies
The ontologies used for annotating a database can be imported as follows:
<?xml version="1.0"?>
<odal:ODAL
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:owl="http://www.w3.org/2002/07/owl#"
    xmlns:odal="http://www.sdsc.edu/odal#">
  <odal:Ontology>
    <odal:Imports rdf:resource="http://www.library.org/Book.owl"/>
    <odal:Imports rdf:resource="http://www.writer.org/Writer.owl"/>
  </odal:Ontology>
  ……
</odal:ODAL>

ODAL: Database Connection Declaration
The target database for the annotation is declared as follows:
<?xml version="1.0"?>
<odal:ODAL
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:owl="http://www.w3.org/2002/07/owl#"
    xmlns:odal="http://www.sdsc.edu/odal#">
  ……
  <odal:Database odal:id="PublicationDatabase">
    <odal:DatabaseProductName>Oracle</odal:DatabaseProductName>
    <odal:DatabaseProductVersion>9.1.21</odal:DatabaseProductVersion>
    <odal:Host>oracle.sdsc.edu</odal:Host>
    <odal:Port>3456</odal:Port>
    <odal:DatabaseName>Publications</odal:DatabaseName>
  </odal:Database>
  ……
</odal:ODAL>

ODAL: Simple Named Individuals
Suppose the book ontology contains a class Book and the schema Collections contains a table book-price with a column ISBN.
<odal:NamedIndividuals odal:id="BookInTableBookPrice" odal:database="PublicationDatabase">
  <odal:Class odal:resource="http://www.amazon.com/Book.owl#Book"/>
  <odal:Schema>Collections</odal:Schema>
  <odal:Table>book-price</odal:Table>
  <odal:Column>ISBN</odal:Column>
</odal:NamedIndividuals>
The statement says that each value in the column ISBN represents a Book individual. odal:id gives a name to the declaration and represents the set of individuals generated by the statement.
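The effect of such a declaration can be pictured with a small Python sketch. This is only an illustration, not the ODAL processor: the helper names `parse_declaration` and `individual_names` and the in-memory row list are hypothetical, the naming pattern `(declaration id, database.schema.table.column:value)` is taken from the slide "ODAL: The Names of Individuals", and joining several key columns with a comma is an assumption made here.

```python
import xml.etree.ElementTree as ET

ODAL_NS = "http://www.sdsc.edu/odal#"  # namespace URI as given on the slides

# The declaration from the slide, with the namespace declared locally so that
# the fragment parses on its own.
DECLARATION = """
<odal:NamedIndividuals xmlns:odal="http://www.sdsc.edu/odal#"
    odal:id="BookInTableBookPrice" odal:database="PublicationDatabase">
  <odal:Class odal:resource="http://www.amazon.com/Book.owl#Book"/>
  <odal:Schema>Collections</odal:Schema>
  <odal:Table>book-price</odal:Table>
  <odal:Column>ISBN</odal:Column>
</odal:NamedIndividuals>
"""


def q(tag):
    """Qualify a tag or attribute name with the ODAL namespace."""
    return f"{{{ODAL_NS}}}{tag}"


def parse_declaration(xml_text):
    """Hypothetical helper: pull the pieces of one NamedIndividuals element."""
    root = ET.fromstring(xml_text.strip())
    return {
        "id": root.get(q("id")),
        "database": root.get(q("database")),
        "class": root.find(q("Class")).get(q("resource")),
        "schema": root.findtext(q("Schema")),
        "table": root.findtext(q("Table")),
        "columns": [c.text for c in root.findall(q("Column"))],
    }


def individual_names(decl, rows):
    """Hypothetical helper: one named individual per row of column values."""
    prefix = ".".join([decl["database"], decl["schema"], decl["table"]])
    names = []
    for row in rows:
        # Assumption: several columns are joined with a comma.
        key = ",".join(f"{col}:{row[col]}" for col in decl["columns"])
        names.append((decl["id"], f"{prefix}.{key}"))
    return names


decl = parse_declaration(DECLARATION)
names = individual_names(decl, [{"ISBN": "0817313478"}])
```

For the single ISBN value used here, `names` holds one pair whose second component is `PublicationDatabase.Collections.book-price.ISBN:0817313478`.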

ODAL: The Names of Individuals
<odal:NamedIndividuals odal:id="BookInTableBookPrice" odal:database="PublicationDatabase">
  <odal:Class odal:resource="http://www.amazon.com/Book.owl#Book"/>
  <odal:Schema>Collections</odal:Schema>
  <odal:Table>book-price</odal:Table>
  <odal:Column>ISBN</odal:Column>
</odal:NamedIndividuals>
ISBN          Individual Name
0817313478    (BookInTableBookPrice, PublicationDatabase.Collections.book-price.ISBN:0817313478)
…

ODAL: Named Individuals from Multiple Columns
Suppose an ontology contains a class Location and a database table Rock-Sample with two columns, Latitude and Longitude.
<odal:NamedIndividuals odal:id="LocationInTableRockSample">
  <odal:Class odal:resource="http://www.usgs.org/Space.owl#Location"/>
  <odal:Schema>California</odal:Schema>
  <odal:Table>Rock-Sample</odal:Table>
  <odal:Column>Latitude</odal:Column>
  <odal:Column>Longitude</odal:Column>
</odal:NamedIndividuals>
The statement says that a pair of latitude and longitude values gives a location.

ODAL: Named Individuals with Conditions
<odal:NamedIndividuals odal:id="MaleEmployeeInTableEmployee">
  <odal:Class odal:resource="http://www.abc.com/Employee.owl#MaleEmployee"/>
  <odal:Table>employee</odal:Table>
  <odal:Column>EmployeeId</odal:Column>
  <odal:Condition><![CDATA[ Gender='M' ]]></odal:Condition>
</odal:NamedIndividuals>
<odal:NamedIndividuals odal:id="FemaleEmployeeInTableEmployee">
  <odal:Class odal:resource="http://www.abc.com/Employee.owl#FemaleEmployee"/>
  <odal:Table>employee</odal:Table>
  <odal:Column>EmployeeId</odal:Column>
  <odal:Condition><![CDATA[ Gender='F' ]]></odal:Condition>
</odal:NamedIndividuals>
The condition in an odal:Condition element should be a Boolean expression that is valid in the WHERE clause of an SQL query.

ODAL: Data Type Property Declaration
[Table Person: columns … | SSN | … | age | …; example row … | 123-56-7890 | … | 8 | …]
[Ontology: Person hasAge posInt]
<odal:NamedIndividuals odal:id="PersonInTablePerson">
  <odal:Class odal:resource="http://www.foo.org/Person.owl#Person"/>
  <odal:Table>Person</odal:Table>
  <odal:Column>ssn</odal:Column>
</odal:NamedIndividuals>
<odal:OntologyProperty>
  <odal:DatatypeProperty odal:resource="http://www.foo.org/Person.owl#hasAge"/>
  <odal:Table>Person</odal:Table>
  <odal:Domain odal:resource="PersonInTablePerson" />
  <odal:Range odal:resource="age" />
</odal:OntologyProperty>

Conditions for Joining Individuals from Different Resources
• Usually we do not join individuals across different resources
  [Example: class Rock; one resource has RockSampleID = 10001, another has RockID = 10001]
  – We do not know whether 10001 represents the same rock in the two resources. By default, we assume it does not.
• A set of datatype properties can be declared as a key for a class in the ontology. We do join across multiple resources based on keys.
  – E.g., {hasLatitude, hasLongitude} can be declared as a key of Location
  – Two locations from different resources are the same if they have the same latitude and longitude
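The key rule above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the GEON mediator's implementation: the function `join_on_key`, the dictionary representation of individuals, and the sample coordinates are all hypothetical, and key values are compared with exact equality.

```python
from collections import defaultdict


def join_on_key(resource_a, resource_b, key_properties):
    """Pair individuals from two resources that agree on every key property.

    With no declared key, nothing is joined: individuals from different
    resources are assumed to be distinct by default.
    """
    if not key_properties:
        return []
    # Index the second resource by its tuple of key values.
    index = defaultdict(list)
    for ind in resource_b:
        if all(p in ind for p in key_properties):
            index[tuple(ind[p] for p in key_properties)].append(ind)
    matches = []
    for ind in resource_a:
        if not all(p in ind for p in key_properties):
            continue  # an individual without all key values cannot be matched
        for other in index.get(tuple(ind[p] for p in key_properties), []):
            matches.append((ind, other))
    return matches


# Location has the declared key {hasLatitude, hasLongitude}.
LOCATION_KEY = ("hasLatitude", "hasLongitude")
locations_a = [{"name": "A.loc1", "hasLatitude": 34.05, "hasLongitude": -118.25}]
locations_b = [
    {"name": "B.loc7", "hasLatitude": 34.05, "hasLongitude": -118.25},
    {"name": "B.loc8", "hasLatitude": 36.77, "hasLongitude": -119.42},
]
same_locations = join_on_key(locations_a, locations_b, LOCATION_KEY)

# Rock has no declared key, so equal identifiers alone do not imply identity.
rocks_a = [{"RockSampleID": 10001}]
rocks_b = [{"RockID": 10001}]
same_rocks = join_on_key(rocks_a, rocks_b, key_properties=())
```

Here `same_locations` pairs A.loc1 with B.loc7 only, and `same_rocks` is empty. Real coordinate data would likely need a tolerance rather than exact equality, which this sketch does not attempt.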
ISSGC06 – Ischia, Italy 40 The Architecture of GEON Semantic Mediator Oracle DB2 SQL Server MySQL PostgreSQL PostGIS Query Execution Query Optimization Query Planning Internal Database SQL Parser Spatial SQL against federal schemas Mediator JDBC Driver SOQL GUI Semantic Query Rewriter SOQL Parser Ontology Reasoner ODAL Processor OWL Portal or Application ODAL SOQL Processor ISSGC06 – Ischia, Italy 41 The Map Integration Architecture ISSGC06 – Ischia, Italy 42 Map Integration ISSGC06 – Ischia, Italy 43 Integration Engine’s Information Model • PAKT (briefly) – Type extensibility of the mediator • Nested relational query language extended by tree and a restricted set of graph pattern operations • Construction operations important • Passive extensibility – Source more powerful than the mediator – Source exports a set of type-based optimization rules to the mediator • Active extensibility – Mediator extends its set of interpreted types – Ontology management • Ontological queries processed by a separate co-processor that interoperates with mediator • Query planner partitions the query into ontological and mediated query processors ISSGC06 – Ischia, Italy 44 Query Paradigms • What are the different kinds of queries scientists and applications pose to an integrated system? 
  – Metadata-based file access
  – Ontologically supported mediated queries
    • “Find most recent FMRI data of all patients with low scores in working memory tasks having volumetric changes of hippocampus over 10% in 2 years”
  – Keyword queries
    • FMRI “working memory task” hippocampus
  – Ontologically supported keyword queries
  – Associative searches

BIRN Data Grid Usage
[Chart: Total Number of Files (in thousands) and Total Size of Storage (in Gigabytes), Oct-02 through Feb-06, reaching 16 million files and 16+ Terabytes]
• 21,038 raw image files per subject
• 2.4 GB of raw image data per subject
• 25 GB to 40 GB of processed image data per subject
• 10 million slices of functional imaging data in Phase II
• 7 Terabytes of image data for all of the Phase II analyses (conservative estimate of 25 GB/subject)

GEON: SOQL (Simple Ontology Query Language)
Query single or integrated resources
• via ontologies (i.e., high-level logical views)
• independent of any physical representation (i.e., schemas)
[Diagram: class RockSample with properties location (Location: lat, long) and hasSiO2 (ValueWithUnit: value, unit); datatypes float and string. A GUI generates the query below, which is passed to the SOQL processor.]
SELECT X.location.*
FROM RockSample X
WHERE X.location.lat > 60 AND X.location.long > 100
  AND X.hasSiO2.value < 30 AND X.hasSiO2.unit = 'weightPercentage'

Question: Finding all seismic stations within 1 mile from railroads
GEON SOQL GUI:
  SELECT X.code, X.location.*
  FROM SeismicStation X, Railroad Y
  WHERE distance(X.location, Y.geometry) < 1
SOQL Processor:
  SELECT X2.stationcode, X2.lat, X2.lon
  FROM railroads_of_the_united_states X1, stationdatatable X2
  WHERE distance(X1.the_geom, MakePoint(X2.lat, X2.lon)) < 1
Schema Mediator:
  Join condition: distance(X1.the_geom, MakePoint(X2.lat, X2.lon)) < 1
  Railroad shapefile: SELECT X1.the_geom FROM railroads X1
  Seismic Stations: SELECT X2.stationcode, X2.lat, X2.lon FROM stationdatatable X2 WHERE bounding box condition

BIRN: A Functional View of the Mediation Process
Planner:
• Query Expression (UCQ+ + Nesting + Grouping & Aggregate)
• Flattening of Nested Queries
• View Unfolding
• Normalization to DNF
• Predicate Reordering (binding patterns + maximal chunk)
• Maximal Feasible Plan
• Algebraic Plan
• Cost/Selectivity-based Optimization
• Pre-Executable Plan
Execution Engine:
• Pre-Executable Plan
• Executable Plan
• Execution Control
• Post-processing + aggregate
• Result Building
• Result Reporting

View Definition and Query Language
• Union of conjunctive queries
• May contain function terms
• Expressed in XML Datalog with aggregate functions
• Query q(X,F(Y)) :- r1(X,Z), r2(Z,Y)
  – where F(Y) is an aggregate function applied to the set of Y values and X is the group-by variable
• Planner and Executor translate this to:
  – q’(X,Y) :- r1(X,Z), r2(Z,Y)
  – q(X,W) :- F(gb(q’(X,Y)))
  – where the group-by function “gb” with aggregate function F is pushed to the data source whenever possible, or otherwise evaluated at the mediator
• Query language allows nested queries – inner queries are assigned to intermediate variables that are used by the main query

BIRN: Mapping Relations
• Ontology Mapping – maps data values from a source to an ontology term of a known ontology (UMLS)
• Joinable – a relation that pairs attributes from different relations
• Value-Map – maps a mediator-supported data value to a source-supported one (for example, gender coded 0/1 at some source is male/female for the mediator)

Processing Ontological Queries
Courtesy: Vadim Astakhov

PAKT: Spatial and Taxonomic Queries

Example Queries
[Diagram labels: sources OBIS, WOA, Benth_Hab; dimensions Biological, Geo-Spatial, Physiochemical, Habitat (extended)]
Italics: input. Underline: output.
Q1: Where is species X found?
  OBIS(scientific_name,lat,long)
Q2: For a given polygon, what species are found?
  OBIS(scientific_name,m_lat,m_long,m_lat,m_long)
Q3: Where is species X found given a certain physical parameter?
  OBIS(scientific_name,lat,long) WOA(physio,lat,long)
Q4: What are the aggregated physical properties of species X?
  OBIS(scientific_name,lat,long) WOA(physio,lat,long)
Q5: Where is habitat X found?
  CMECS(habitat,physio) BH(habitat_grp,shape)
Q6: For a given polygon A, what habitats are found?
  CMECS(habitat,physio) BH(habitat_grp,shape) PolygonA
Q7: Where is habitat X found given a certain physical parameter?
  CMECS(habitat,physio) BH(habitat_grp,shape) WOA(physio,lat,long)
Q8: What are the aggregated physical properties of habitat X?
  CMECS(habitat,physio) BH(habitat_grp,shape) WOA(physio,lat,long)
Q9: What species can be found at habitat X?
  CMECS(habitat,physio) BH(habitat_grp,shape) OBIS(scientific_name,lat,long)
Q10: At what habitats is species X found?
CMECS(habitat,physio) BH(habitat_grp,shape) OBIS(scientific_name,lat,long) 53 Frequent Query Patterns • Example queries are joins of – Left query patterns: habitat-spatial, and – Right query patterns: spatial-environmental/species distribution CMECS(habitat,physio) BH(habitat_grp,shape) BH(..,shape) WOA(physio,lat,long) PolygonA ) BH(..,shape) OBIS(scientific_name,lat,long) CMECS(habitat,physio) BH(habitat_grp,shape) BH(..,shape) WOA(physio,lat,long) CMECS(habitat,physio) BH(habitat_grp,shape) BH(..,shape) OBIS(scientific_name,lat,long) ( Onto-module’s queries ISSGC06 – Ischia, Italy API Mediator’s queries 54 The Resource Management Aspect of Query Evaluation DQP node 5 reduce node 4 node 3 DQP DQP join (A1,B1) join (A2,B2) node 1 node 2 DQP scan (A) DQP scan (B) OGSA-DAI OGSA-DAI DBMS DBMS data data • Primarily done by the Manchester group (Watson et al) • Polar* – Based on OQL (internally monoid comprehension) – Multi-node planning • Plan partitioning • Exchange operator – Attribute sensitivity – Data & index repartitioning • Plan scheduling – Query execution From Amy Krause ISSGC06 – Ischia, Italy 55 The Adaptivity Issue in DQP on a Grid • Monitoring-Assessment-Response framework of adaptive query processing in a grid (by Gounaris) – Monitoring: • a separate module that keeps track of information like – Has a resource (e.g., memory availability) changed more than 10%? – Has the data volume changed recently? 
    • Occurs between operators or within an operator’s execution process
    • Other modules subscribe to this notification
  – Assessment
    • Diagnosis is carried out for suboptimal execution, resource shortage, resource idleness, unmet performance requirements, unmet user needs
  – Response
    • Operator replacement or rescheduling, machine rescheduling, plan re-optimization…

Commonalities and Complementarities
• Common themes
  – Overall architectural similarity of cyberinfrastructure projects
    • Service orientation
  – The data integration task is part of a larger scientific computing, exploration and analysis process
    • This affects the integration setting, design decisions and performance expectations
  – Mediation with semantic mapping and reasoning seems to be winning
• Complementary approaches
  – Details of the architecture
    • Relationship with workflows
  – Styles of mediation
  – Extensibility of the mediator
  – Adaptivity of query planning and evaluation

Thank you! Questions? Comments? Integrated Queries?