Data Format Description Language (DFDL) WG
Martin Westhead, EPCC, University of Edinburgh, M.Westhead@epcc.ed.ac.uk
Alan Chappell, PNNL, chappella@battelle.org

Agenda
• Introduction and welcome - Martin Westhead 10mins
• Binary Format Description Language (BFD) - Alan Chappell 10mins
• Binary XML (BinX) - Stephen Rutherford 10mins
• DFDL - Martin Westhead 15mins
  – Big picture
  – Structural Description Language
  – Charter
  (20 mins Discussion)
• Examples repository - Alan Chappell 10mins
  – Bruce Barkstrom Examples at NASA
  (15mins Discussion)

Motivation
• There will never be a standard data format
  – E.g. XML – verbose, tree-based, explicit structure
  – Legacy formats
  – Application specific formats
  – One size will never fit all
• But could we provide a language for describing formats?
  – Transparency of physical representation
  – Automatic format conversion
  – Unambiguous description of data

There’s more…
Explicit structure enables:
• Standard transformation to/from XML representation
  – Could allow application to read/write XML
  – But provide underlying efficient binary representation
• Data stream/file becomes database
  – Point to parts of the structure
  – Extract parts of the structure
  – Modify parts of the structure
  – Integrate parts of different structures

And more…
• Generic tools possible
  – Browsing
  – Conversion and transformation
• Annotation of data
  – E.g. identify bits that depict a hurricane in an image
• Enables general semantic labels; many ontologies could be developed, e.g.:
  – S.I. units, SQL types, Time
  – Community specific labels, “starClass = whiteDwarf”
  – Application specific labels, “nodeColour = green”
• Could lead to a standard transformation language

Not fairy tales
• Based on implemented work
  – BinX http://www.epcc.ed.ac.uk/gridserve/WP5/Binx/
  – BFD, part of the Scientific Annotation Middleware project (http://www.scidac.org/SAM/)
• Generalized and extended a little
• Formal semantics
• Foundation for extensibility

Approach
• Separate out structure and semantics
• General structural language
  – Repetition
  – Pointers
  – References to data
  – New structures can be built (compositionality)
• Semantics
  – Hard to express so…we don’t
  – General labeling
  – Label semantics defined elsewhere (ontologies)
  – Labels can be added (extensibility)

Structure – arbitrary labels
[Figure: nested arbitrary labels over a sequence of things (thing 0, thing 1, ...); labels shown: fooSet, fooPair, foo, bunchThings]

Structure – example labels
[Figure: nested example labels over a sequence of bits (bit 0, bit 1, ...); labels shown: complex Array, complex, float, byte]
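One way to read the example-label diagram is as a tree of labels laid over a flat byte sequence. The sketch below is only an illustration, not BinX, BFD or DFDL code. It assumes the nesting is complex Array > complex > float > byte > bit, that a complex is two floats, and that a float is 4 bytes in big-endian IEEE 754 form; the function names `label_complex_array` and `decode` are hypothetical.

```python
import struct

FLOAT_SIZE = 4  # assumption: 4-byte IEEE 754 single-precision floats


def label_complex_array(buf: bytes):
    """Lay nested labels over buf: complex Array > complex > float > byte.

    Each byte node carries its eight bits as a string, standing in for
    the bit level of the diagram.
    """
    complex_size = 2 * FLOAT_SIZE  # assumption: complex := two floats
    if len(buf) % complex_size != 0:
        raise ValueError("buffer is not a whole number of complex values")
    complexes = []
    for c in range(0, len(buf), complex_size):
        floats = []
        for f in range(c, c + complex_size, FLOAT_SIZE):
            byte_nodes = [("byte", format(b, "08b")) for b in buf[f:f + FLOAT_SIZE]]
            floats.append(("float", byte_nodes))
        complexes.append(("complex", floats))
    return ("complex Array", complexes)


def decode(tree):
    """Interpret the labelled tree as Python complex numbers.

    Big-endian byte order is an assumption of this sketch.
    """
    values = []
    for _, floats in tree[1]:
        parts = []
        for _, byte_nodes in floats:
            raw = bytes(int(bits, 2) for _, bits in byte_nodes)
            parts.append(struct.unpack(">f", raw)[0])
        values.append(complex(parts[0], parts[1]))
    return values


if __name__ == "__main__":
    data = struct.pack(">4f", 1.0, 2.0, 0.5, -1.5)
    tree = label_complex_array(data)
    print(tree[0], len(tree[1]))
    print(decode(tree))
```

The structural step (`label_complex_array`) knows only sizes and nesting; the meaning of the label "float" is applied separately in `decode`, which mirrors the split between structure and semantics described in the Approach slide.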
Structural language
• Formal semantics
  – Structured binary sequence
  – Defines hierarchical structure over underlying sequence of binary values
• Language for describing hierarchical structure
  – Repetition
    • Explicit number of repeats
    • Termination characters
  – Data reference
    • Conditionals
    • Data size
  – Pointers
• Scope
  – As general as possible but
  – Must be concise and implementable
• Draft language definition on web page (www.epcc.ed.ac.uk/dfdl)

CSV file example
    char:=byte
    data:=[(char - [',']).*]
    field:=[data; [',']]
    finalField:=[data; ['\n']]
    row:=[field.*] :: [finalField]
    table:=[row.*]

Semantic labels
• Many ontologies possible
• Initial scope probably:
  – Basic types (floating point, integer, character)
  – Simple structures (structs, arrays, tables)
• Obvious extensions:
  – SQL types
  – XML Schema types
• Key WG goal:
  – Define form and requirements of new ontologies

What is an Ontology?
• XML Schema for new types
• Structural description of new types
• Definition of core API behaviour on new type
• API extensions
• Relationships to other types

WG goals
• Formal language for DFDL data structure
• Standard representation of this language in XML
• Requirements for DFDL ontology
• Basic types ontology
• Basic structures ontology

Currently under discussion
• Abstraction from the underlying binary
  – Compression, encoding, encryption
  – Physical vs. conceptual binary sequence
• Abstraction of description
  – complex:=[foo; foo]
  – Instantiate “foo:= float” or “foo:= double” at use time
• Filtering of results
  – Get to the data model and leave the format behind
  – CSV -> [[value; value; value]; [value; value; value]]

DFDL in the VO
• Generic tools
• Metadata possibilities
  – Ontologies can define relationships between types
  – E.g. polar to Cartesian
  – Standard classes over data objects

Getting involved
• Webpages: http://www.epcc.ed.ac.uk/dfdl
• Mailing list (dfdl@gridforum.org)
• My address: M.Westhead@epcc.ed.ac.uk
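As a rough illustration of the CSV file example and of the "Filtering of results" item, the sketch below hand-codes the six grammar rules as a small recursive parser over bytes. It is not generated from the draft DFDL language and is not part of BinX or BFD; the function names are hypothetical. One assumption goes beyond the slide: `data` is taken to stop at a newline as well as at a comma, since otherwise `finalField` could never be recognised.

```python
from typing import List, Tuple

COMMA = ord(",")
NEWLINE = ord("\n")


def parse_data(buf: bytes, pos: int) -> Tuple[bytes, int]:
    """data := [(char - [',']).*]

    Assumption: data also stops at '\n', so that finalField can end a row.
    """
    start = pos
    while pos < len(buf) and buf[pos] not in (COMMA, NEWLINE):
        pos += 1
    return buf[start:pos], pos


def parse_row(buf: bytes, pos: int):
    """row := [field.*] :: [finalField]

    field := [data; [',']] and finalField := [data; ['\\n']] differ only
    in their terminator, so the terminator decides which label applies.
    """
    fields = []
    while True:
        data, pos = parse_data(buf, pos)
        if pos >= len(buf):
            raise ValueError("row is not terminated by a newline")
        terminator = buf[pos]
        pos += 1
        if terminator == COMMA:
            fields.append(("field", data, b","))
        else:
            fields.append(("finalField", data, b"\n"))
            return ("row", fields), pos


def parse_table(buf: bytes):
    """table := [row.*]"""
    rows = []
    pos = 0
    while pos < len(buf):
        row, pos = parse_row(buf, pos)
        rows.append(row)
    return ("table", rows)


def filter_values(table) -> List[List[str]]:
    """Drop labels and separators: CSV -> [[value; value]; [value; value]]."""
    _, rows = table
    return [[data.decode("ascii") for _, data, _ in fields] for _, fields in rows]


if __name__ == "__main__":
    table = parse_table(b"a,b,c\n1,2,3\n")
    print(filter_values(table))
```

`parse_table` returns the full labelled structure, separators included, while `filter_values` keeps only the values. Keeping the two steps apart means the same labelled structure could feed other filters or conversions.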