Spatiotemporal Databases e-Science Institute, Edinburgh 01-Nov-2005 NERC DataGrid data model and its application Andrew Woolf1 (A.Woolf@rl.ac.uk), Ray Cramer2, Marta Gutierrez3, Kerstin Kleese van Dam1, Siva Kondapalli2, Susan Latham3, Bryan Lawrence3, Roy Lowry2, Kevin O’Neill1, Ag Stephens3 1 CCLRC e-Science Centre 2 British Oceanographic Data Centre 3 British Atmospheric Data Centre 1 Outline Spatiotemporal Databases e-Science Institute, Edinburgh 01-Nov-2005 NERC DataGrid – data integration problem Semantics as integration key CSML Wrapper/mediator architecture Use and future 2 NERC DataGrid Spatiotemporal Databases e-Science Institute, Edinburgh 01-Nov-2005 British Atmospheric Data Centre Simulations British Oceanographic Data Centre Assimilation 3 NDG data integration Spatiotemporal Databases e-Science Institute, Edinburgh 01-Nov-2005 Most (but not all) NDG data is file-based… On the Grid, no-one should know if you’re a file or relational table… (one service to bind them all) The file problem • multiple formats • focus usually on container, not content Scientific file format examples (earth sciences): • • • • • • netCDF HDF4 HDF5 GRIB NASA Ames ... 4 NDG data integration Spatiotemporal Databases e-Science Institute, Edinburgh 01-Nov-2005 5 NDG data integration Spatiotemporal Databases e-Science Institute, Edinburgh 01-Nov-2005 Typically, API is fundamental point of reference • • • • binary format details not always exposed (or guaranteed) public API often the only supported access mechanism API typically implemented as optimised native library why reinvent a well-known working interface? Data Format Description Language (DFDL) • XML ‘facade’ to file formats • earth science files often giga-scale XML query interface not likely to be efficient • encapsulating format not the issue for NDG... • ...integrating domain-specific semantics efficiently across files and formats is! 6 NDG data integration Spatiotemporal Databases e-Science Institute, Edinburgh 01-Nov-2005 Information and file contents • same information in different file formats – want to expose information, not format (seen earlier) • in addition, semantic information structures may be composed across files 7 Integration – semantics Spatiotemporal Databases e-Science Institute, Edinburgh 01-Nov-2005 Want semantic access to information, not abstract data • getData(potential temperature from ERA-40 dataset in North Atlantic from 1990 to 2000) • not: getData(“era40.nc”, ‘PTMP’, 20:50, 300:340, 190:200) • or even worse: for j=1990:2000 getData(“era40_”+j+“.nc”, ‘PTMP’, 20:50, 300:340) Lossy is OK! • Care less about completeness of representation than semantic unification 8 NDG data integration Integration approaches: warehousing Integration approaches: wrapper/mediator dataInterface NDG mediatorService wrapper wrapper BODC BADC n co io rs ve n BODC co n dataInterface mediatorService at rm fo at rm fo o si er nv Spatiotemporal Databases e-Science Institute, Edinburgh 01-Nov-2005 BADC 9 Integration – semantics Spatiotemporal Databases e-Science Institute, Edinburgh 01-Nov-2005 Summary: What we require is • semantic access to information (within and across files); • and to use native (well-known) efficient APIs under the covers also: • scalability across providers • warehousing not an option (tera-scale!) • enhance access and use, ‘outwards-facing’ (e.g. impacts community, policymakers) • storage heterogeneity 10 Integration – semantics Spatiotemporal Databases e-Science Institute, Edinburgh 01-Nov-2005 Database data modelling • • • • Relational model (Codd, 1970) Entity-relationship model (Chen, 1976) Semantic data models Object-oriented data models (inheritance, aggregation, behaviour) File-based data modelling • Far less advanced • Abstract models (‘variables’, ‘arrays’, etc.; no ‘object’ file formats in widespread use for earth science data) • API-driven 11 Integration – semantics Spatiotemporal Databases e-Science Institute, Edinburgh 01-Nov-2005 Fundamentally, an information community is defined by shared semantics • semantics often (but not always) implicit • use information semantics for data integration Semantics as integration ‘key’ • common language across providers (and users) • supports wrapper/mediator architecture NDG Solution components: • semantic data model (Climate Science Modelling Language) • storage descriptor (wrapper) • data services (mediator) 12 CSML Spatiotemporal Databases e-Science Institute, Edinburgh 01-Nov-2005 Geographic ‘features’ • • • “abstraction of real world phenomena” [ISO 19101] Object models for data types – type or instance Encapsulate important semantics in universe of discourse Application schema • • • Defines semantic content and logical structure of datasets ISO standards provide conceptual toolkit: • spatial/temporal referencing • geometry (1-, 2-, 3-D) • topology • dictionaries (phenomena, units, etc.) GML – canonical encoding [from ISO 19109 “Geographic information – Rules for Application Schema”] 13 CSML Spatiotemporal Databases e-Science Institute, Edinburgh 01-Nov-2005 CSML aims: • provide semantic integration mechanism for NDG data • explore new standards-based interoperability framework • emphasise content, not container Design principles: • offload semantics onto parameter type (‘phenomenon’, observable, measurand) • e.g. wind-profiler, balloon temperature sounding • offload semantics onto CRS • e.g. scanning radar, sounding radar • ‘sensible plotting’ as discriminant • ‘in-principle’ unsupervised portrayal • explicitly aim for small number of weakly-typed features (in accordance with governance principle and NDG remit) 14 CSML Spatiotemporal Databases e-Science Institute, Edinburgh 01-Nov-2005 Semantic data model • Climate Science Modelling Language (CSML), http://ndg.nerc.ac.uk/csml • Weakly-typed conceptual models for range of information types • Independent of storage concerns • Based on ISO ‘geographic feature types’ framework • Defined on basis of geometric and topologic structure CSML feature type Description TrajectoryFeature Discrete path in time and space of a platform or instrument. PointFeature GridFeature PointSeriesFeature Single point measurement. Single ‘profile’ of some parameter along a directed line in space. Single time-snapshot of a gridded field. Series of single datum measurements. ProfileSeriesFeature Series of profile-type measurements. GridSeriesFeature Timeseries of gridded parameter fields. ProfileFeature Examples ship’s cruise track, aircraft’s flight path raingauge measurement wind sounding, XBT, CTD, radiosonde gridded analysis field tidegauge, rainfall timeseries vertical or scanning radar, shipborne ADCP, thermistor chain timeseries numerical weather prediction model, ocean general circulation model 15 CSML Spatiotemporal Databases e-Science Institute, Edinburgh 01-Nov-2005 CSML feature type examples: ProfileSeriesFeature ProfileFeature GridFeature 16 Spatiotemporal Databases e-Science Institute, Edinburgh 01-Nov-2005 Wrapper Numerical array descriptors • • • «Type» GML::AbstractGMLType +id +metaDataProperty +description +name provides ‘wrapper’ architecture for legacy data files proxy for numerical content within feature instances ‘Connected’ to data model numerical content through ‘xlink:href’ «Type» AbstractArrayDescriptor +arraySize[1] +uom[0..1] +numericType[0..1] +numericTransform[0..1] +regExpTransform[0..1] * +component 1 «Type» AggregatedArray +aggType[1] +aggIndex[1] Three subtypes: • • • InlineArray ArrayGenerator FileExtract (NASAAmes, NetCDF, GRIB) Composite design pattern for aggregation «Type» InlineArray +values[*] «Type» ArrayGenerator +expression[1] «Type» NASAAmesExtract +variableName[1] +index[0..1] «Type» AbstractFileExtract +fileName[1] «Type» NetCDFExtract +variableName[1] «Type» GRIBExtract +parameterCode[1] +recordNumber[0..1] +fileOffset[0..1] 17 Wrapper Spatiotemporal Databases e-Science Institute, Edinburgh 01-Nov-2005 File extract examples: <NDGNASAAmesExtract> <arraySize>526</arraySize> <numericType>double</numericType> <fileName>/data/BADC/macehead/mh960606.cf1</fileName> <variableName>CFC-12</variableName> </NDGNASAAmesExtract> <NDGNetCDFExtract gml:id="feat04azimuth"> <arraySize>10000</arraySize> <fileName>radar_data.nc</fileName> <variableName>az</variableName> </NDGNetCDFExtract> <NDGGRIBExtract> <arraySize>320 160</arraySize> <numericType>double</numericType> <fileName>/e40/ggas1992010100rsn.grb</fileName> <parameterCode>203</parameterCode> <recordNumber>5</ recordNumber> <fileOffset>289412</fileOffset> </NDGGRIBExtract> 18 Wrapper Spatiotemporal Databases e-Science Institute, Edinburgh 01-Nov-2005 Aggregated array • arrays may be aggregated along an ‘existing’ or ‘new’ dimension <AggregatedArray gml:id="globaltemperature"> <arraySize>180 360</arraySize> <aggType>existing</aggType> <aggIndex>1</aggIndex> <component> <NetCDFExtract> <arraySize>90 360</arraySize> <fileName>northern_hemisphere.nc</fileName> <variableName>TMP</variableName> </NetCDFExtract> </component> <component> <NetCDFExtract> <arraySize>90 360</arraySize> <fileName>southern_hemisphere.nc</fileName> <variableName>TMP</variableName> </NetCDFExtract> </component> </AggregatedArray> 19 Mediator Spatiotemporal Databases e-Science Institute, Edinburgh 01-Nov-2005 Data services (mediator) • Data services expose semantic model: • Mappings to third-party data models (e.g. file formats, OPeNDAP) • Canonical serialisation (e.g. ISO 19118 UML XML mapping) – Geography Markup Language • Example services: • netCDF file instantiation • OPeNDAP delivery • Open Geospatial Consortium (OGC) web services, e.g. Web Feature Service, Web Coverage Service • Pushed down to the file level, data access request should use optimised native file format-specific I/O 20 Spatiotemporal Databases e-Science Institute, Edinburgh 01-Nov-2005 Mediator NetCDF WCS WFS OPeNDAP .... instantiateNetCDF( DatasetID, FeatureID) <CSML> <CSML> <CSML> Provides semantic abstraction layer <CSML> (SAX) demarshalling CSMLAbstractFeature +writeNetCDF() AbstractFileExtract +read() filestore 21 Spatiotemporal Databases e-Science Institute, Edinburgh 01-Nov-2005 Using CSML Example of CSML use – MarineXML Data from different parts of the marine community conforming to a variety of schema (XSD) XSD For each XSD (for the source data) there is an XSLT to translate the data to the Feature Types (FT) defined by CSML. The FT’s and XSLT are maintained in a ‘MarineXML registry’ XML Biological Species Phenomena in the XSD must have an associated portrayal The FTs can then be translated to equivalent FTs for display in the ECDIS system S52 Portrayal Library XSD XML XSLT XSD XML XML Parser Chl-a from Satellite XSLT XSLT XSLT Marine XML GML (NDG) XML Feature Types Measured Hydrodynamics XSD XML Modelled Hydrodynamics The result of the translation is an encoding that contains the marine data in weakly typed (i.e. generic) Features XSLT SENC SeeMyDENC XSLT ECDIS acts as an example client for the data. with thanks to Keiran Millard, HR Wallingford Data Dictionary Features in the source XSD must be present in the data dictionary. 22 Using CSML Spatiotemporal Databases e-Science Institute, Edinburgh 01-Nov-2005 EU project – MarineXML <gml:definitionMember> <om:Phenomenon gml:id="taxon"> <gml:description>The taxon name</gml:description> <gml:name codeSpace="http://www.vliz.be">taxon</gml:name> </om:Phenomenon> </gml:definitionMember> </NDGPhenomenonDefinitions> <!--===================================================================--> <gml:FeatureCollection> <!-- ============================================================== --> <gml:featureMember> <NDGPointFeature gml:id="ICES_100"> <NDGPointDomain> <domainReference> <NDGPosition srsName="urn:EPSG:geographicCRS:4979" axisLabels="Lat Long" uomLabels="degree degree"> <location>55.25 6.5</location> </NDGPosition> </domainReference> </NDGPointDomain> <gml:rangeSet> <gml:DataBlock> <gml:rangeParameters> <gml:CompositeValue> <gml:valueComponents> <gml:measure uom="#tn"/> <gml:measure uom="#amount"/> <gml:measure uom="#gsm"/> </gml:valueComponents> </gml:CompositeValue> </gml:rangeParameters> <gml:tupleList> 'ANTHOZOA',63.1,missing 'Scoloplos armiger',66.1,missing 'Spio filicornis',10,missing 'Spiophanes bombyx',60.3,missing 'Capitellidae',131.8,missing 'Pholoe',10,missing 'Owenia fusiformis',23.4,missing 'Hypereteone lactea',6.8,missing 'Anaitides groenlandica',13.2,missing 'Anaitides mucosa',6.8,missing “MarineXML is an initiative of the IOC/IODE of “... there is a momentum such UNESCO to improve marinefrom dataorganisations exchange within as“The IHOcommunity. and WMO adopt consistent the marine The European Commission NDG formatto proved a robust recipient for approaches for the vocabulary oftotheir data along has provided funding contribution initiative the dataa from each community. Itthis produced the implementation of ISOtoStandards as part ofreference its 5th Framework Programme economical files with few redundant elements, prescribed by thethe [Open undertake a ‘pre-standardisation’ task of striking about right Geospatial balance between weak Consortium]...” identifying the approaches and strong typing.” the marine community should adopt regarding XML technology to achieve improved data exchange.” 23 Conclusions/future Spatiotemporal Databases e-Science Institute, Edinburgh 01-Nov-2005 Conclusions • Mechanism is lossy, in general • semantic integration is far more important than completeness of representation • Emphasis on content, not container • Mediator services can expose data model • Well-known community formats – use efficient legacy APIs • Initial semantic decoration can add context to entire workflow chain • Loose relationship between legacy file data model and semantic (feature) instance to which it is mapped 24 Conclusions/future Spatiotemporal Databases e-Science Institute, Edinburgh 01-Nov-2005 Current and future work (NDG) • Implement tooling: • CSML parsing/processing • Automated ‘scanner’: {files} CSML • Implement NDG data delivery (mediator) services layered over data model Further perspectives • Integrate with broader interoperability frameworks (e.g. ‘semantics repositories’ Feature Type Catalogues – WMO, IOC, INSPIRE) • Generalise approach: • meta-model for data modelling • ‘data storage description language’ for file mappings (DFDL role?) • canonicalised serialisation for workflows 25 Spatiotemporal Databases e-Science Institute, Edinburgh 01-Nov-2005 Conclusions/future Managing semantics conceptual model Class1 -End1 -End2 «datatype» DataType1 1 define data models * Class2 auto-generate XSD GML dataset auto-generated parser <gml:featureMember> <NDGPointFeature gml:id="ICES_100"> <NDGPointDomain> <domainReference> <NDGPosition srsName="urn:EPSG:geographicCRS:4979" axisLabels="Lat Long" uomLabels="degree degree"> <location>55.25 6.5</location> </NDGPosition> </domainReference> </NDGPointDomain> <gml:rangeSet> <gml:DataBlock> <gml:rangeParameters> <gml:CompositeValue> <gml:valueComponents> <gml:measure uom="#tn"/> <gml:measure uom="#amount"/> <gml:measure uom="#gsm"/> </gml:valueComponents> </gml:CompositeValue> </gml:rangeParameters> <gml:tupleList> 'ANTHOZOA',63.1,missing 'Scoloplos armiger',66.1,missing 'Spio filicornis',10,missing 'Spiophanes bombyx',60.3,missing 'Capitellidae',131.8,missing GML app schema populate dataset instances26 Spatiotemporal Databases e-Science Institute, Edinburgh 01-Nov-2005 Conclusions/future Parser: «interface» SAX2::ContentHandler +setDocumentLocator() +startDocument() +endDocument() +startElement() +endElement() +characters() +processingInstruction() +ignorableWhitespace() +startPrefixMapping() +endPrefixMapping() +skippedEntity() «interface» SAX2::DTDHandler +notationDecl() +unparsedEntity() «interface» SAX2::ErrorHandler +error() +fatalError() +warning() SAX2::DefaultHandler «interface» SAX2::EntityResolver +resolveEntity() Stack of Builders (for UML meta-model) • • current class, object, attribute specialised for particular UMLXML mapping Builder receives: • • filtered SAX events built object Builder returns: • • • ObjectParser -process() builderStack * ObjectBuilder -currentClass -currentObject -currentAttribute +xmlElement() : BuilderUnion +xmlCharacters() : BuilderUnion +xmlAttribute() : BuilderUnion +processObject() : BuilderUnion ObjectPropertyBuilder «union» BuilderUnion AbstractClass AbstractObject AbstractBuilder ISO19118Builder built object new object class new Builder (for inheritance through substitutionGroups) 27