XML for Data Grid Applications
Chip Watson
Thomas Jefferson National Accelerator Facility
May 31, 2016 PPDG Meeting

Why XML? -- Industry Trends

Strategy: Use web technologies, follow the success of the web...

E-commerce companies (especially B2B) are currently investing heavily in XML technologies...

Example news items:
– [December 11, 2000] "iPlanet Unveils Industry's First Full-Up B2B Commerce Platform… [based upon XML]"
– [December 08, 2000] "Schemantix (formerly Praxis) to Launch Schemantix Development Platform (SxDP) at XML 2000."
– "Microsoft is augmenting its OLE DB for OLAP protocol with new interfaces based on XML… 'The brass tacks on this is we're all going to run our analytical apps over the Internet, and the language these apps will use to communicate with their data sources will be XML,' says Clay Young, VP of marketing at online analytical processing software vendor Knosys Inc." -- InformationWeek, Dec 7, 2000

What is XML?

eXtensible Markup Language
– Like HTML, but with user-defined tags
– Tags refer to content, not presentation:

    <?xml version='1.0' encoding='ISO-8859-1'?>
    <directory name="/clas" owner="root" group="other" modified="Aug 22 08:34">
      <file name='97-12'/>
      <file name='98-02'/>
      <file name='98-03'/>
      <directory name='comm97'/>
      <directory name='e1'/>
    </directory>

Callouts on the example:
– Properties of node: the attributes of the outer <directory> element
– Node contents: the child <file> and <directory> elements
– XML has a tree data model

XML vs CORBA

• XML is more verbose
  – data transported as character strings (~2x for float)
  – data is self-describing, with string tags (~2x)
    (however, lists are separated by single whitespace, so string lists are carried with little overhead)
• CORBA is harder to deploy
  – requires ORB, complex libraries, name server, etc.
• Both are language neutral
  – XML supported in C/C++, Java, Perl, etc.

What about SOAP?
Simple Object Access Protocol

SOAP is a protocol specification for invoking methods on servers, services, components and objects (an RPC system).

SOAP codifies the existing practice of using XML and HTTP as a method invocation mechanism.

The SOAP specification mandates a small number of HTTP headers that facilitate firewall/proxy filtering.

The SOAP specification also mandates an XML vocabulary that is used for representing method parameters, return values, and exceptions.

Simple POST vs SOAP

• Simple POST
  – query contains tagged string values, like
    http://xxx.yyy.zzz/page?name=xyzzy&owner=watson
• SOAP
  – query contains structured arguments, even user-defined types (example to follow)

In either case, the response is an HTTP response of type XML, with arbitrary (tree-like) structure.

SOAP structure example

    <SOAP-ENV:Envelope
        xmlns:SOAP-ENV="http://schemas.xmlsoap.org/soap/envelope/"
        SOAP-ENV:encodingStyle="http://schemas.xmlsoap.org/soap/encoding/">
      <SOAP-ENV:Body>
        <ppdg:AddFile xmlns:ppdg="http://schemas.ppdg.org/soap/xmlns.ppdg">
          <directory>/clas/90-03/</directory>
          <file>test7.dat
            <owner name="watson"/>
            <activity name="calibration"/>
          </file>
        </ppdg:AddFile>
      </SOAP-ENV:Body>
    </SOAP-ENV:Envelope>

Analysis: Simple vs SOAP

• ReplicaCatalog & ReplicaHost (OO api)
  – need to send method name & [0-2] string args
• Future catalog queries
  – may need to send many selection criteria, but this could be done as a simple query string (hence 1 argument)
  – question: may want to "batch" requests, sending, for example, an array of file names to resolve?
    [could be done as many single calls, and let TCP buffer]
• Conclusion: Requirements do NOT dictate SOAP
  – May still choose SOAP for standardization reasons… although the proposer does not have a good track record here

Prototyping XML at Jlab

Goals:
• Get experience w/ XML
• Get experience w/ using XML in servlets
• Demonstrate feasibility of using XML as web protocol for ReplicaCatalog and ReplicaHost
• Deploy prototype replica system for experimental physics data stored in Jlab silo
  – currently OSM + custom java infrastructure
  – plan to replace OSM, resulting in pure java infrastructure

XML & HTML

[Diagram; labels: sql db, ldap db, corba obj, XML servlet, xml client, HTML servlet, style sheet, html client]

Two types of servlets are used: one generates XML; the other calls the first and uses a library (a few calls) to apply a style sheet to the XML and generate HTML.

Prototype Components

• ReplicaCatalog
  – java servlet producing XML
  – xsl style sheet to translate this to html for browsers
  – servlet to do formatting (via style sheet)
• ReplicaHost
  – java servlet producing XML
  – xsl style sheet to translate this to html for browsers
  – servlet to do formatting (via style sheet)
• Simple file transfer servers
  – currently bbftpd, but soon httpd, gsiftpd

Replica Catalog

• Implemented as Java servlet (Apache + Tomcat)
  – currently uses fork rsh ls /mss … to get listing of silo contents for demo purposes
  – will use mysql via jdbc for persistent store (very soon)
  – supports tree data model (maps existing silo system)
• Produces XML output
  for directory:
  • listing of one directory, contents are files + subdirectories
  • includes properties of this directory (owner, etc.)
  for file:
  • properties of the file (owner, etc.)
  • ReplicaHost(s) holding the file

Replica Host

• Gives access information (disk-resident, offline, etc.)
• If disk resident, locally translates file name (virtual path) to URL(s), indicating supported protocols, such as
    http://xxx.jlab.org/diskcache9/clas/file7.dat
    bbftp://bbftp.jlab.org/diskcache9/clas/file7.dat
    gsiftp://xxx.jlab.org/diskcache9/clas/file7.dat
• Future (within 1-2 months):
  – support request to stage to disk
  – support request to "pin" a file (advisory only)
  – support request to store a file (push and/or pull?)
  – manage update to catalog in response to local deletions of files
  – web pages to fetch any file via browser

Demo

• xml test of ReplicaCatalog viewed as xml
• processed with style sheet & viewed as html

Note: Directory Model Changed

Recommendation:
– Change the catalog data model to allow file system (tree) semantics in the logical name space.
– Hierarchical (apparently) containers
– Actual containers may still be flat:
    /a/b/c is one container
    /a/b/c/d/e is a separate container
    /a/b/c appears to contain "d" (even if not implemented that way in storage)

This will probably be more attractive to physicists and other users.

Future Activities

1. Finish SQL database for ReplicaCatalog
2. Finish integration of ReplicaHost and Jlab silo
3. Create exportable package for ReplicaHost
   – Disk cache manager (java based)
     • mountable by local clients
   – ReplicaHost (java servlet based)
   – File transfer daemons
     • http
     • bbftp
     • gsiftp
     • gridftp

PPDG Sub-project (1)

Protocol standardization
– choice of simple or SOAP
– standardization of method names and / or arguments for requests
– XML tag name standardization
– response standardization (e.g. one directory listing)

PPDG Sub-project (2)

1. Shared ReplicaCatalog servlet implementation
   – standardize java interface to local persistent store
   – implement reference implementations:
     1. above LDAP (compatible w/ or extending Globus solution)
     2. above JDBC (Jlab design, open to revisions of schema)
2. Shared ReplicaHost servlet implementation
   – standardize java interface to local silo, disk managers
   – implement reference implementations:
     1. CORBA calls to SRB
     2. RMI calls to Jlab disk & silo managers
     3. other?

PPDG Sub-project (2) (continued)

3. C/C++ and Java client libraries
   – for Java & C++, implementing an OO api with local browsing of xml data
4. Extend ReplicaHost to support queueing of transfer requests...
   ...to/from other ReplicaHosts
   – negotiate transfer protocol with other host
   – negotiate push/pull with other host
   ...to/from remote transfer daemon
   – protocol and direction fixed
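
Sketch: client-side browsing of a directory listing

The following is a minimal sketch of what "local browsing of xml data" in a client library might look like, combined with the simple query-string request style from the Simple POST vs SOAP comparison. Python is used only for brevity; the libraries proposed above are C/C++ and Java. The function names parse_listing and simple_query_url are hypothetical, and the listing and URL are the illustrative ones from the earlier slides, not output from a real ReplicaCatalog.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

# Directory listing in the shape shown on the "What is XML?" slide.
# Passed as bytes so the declared ISO-8859-1 encoding is honored by the parser.
LISTING = b"""<?xml version='1.0' encoding='ISO-8859-1'?>
<directory name="/clas" owner="root" group="other" modified="Aug 22 08:34">
  <file name='97-12'/>
  <file name='98-02'/>
  <file name='98-03'/>
  <directory name='comm97'/>
  <directory name='e1'/>
</directory>"""


def parse_listing(xml_bytes):
    """Split one directory listing into (properties, file names, subdirectory names).

    Hypothetical helper: assumes the root element is the listed directory,
    its attributes are the node properties, and its direct children are
    the node contents.
    """
    root = ET.fromstring(xml_bytes)
    props = dict(root.attrib)
    # findall with a bare tag name matches direct children only.
    files = [child.get("name") for child in root.findall("file")]
    subdirs = [child.get("name") for child in root.findall("directory")]
    return props, files, subdirs


def simple_query_url(base, **criteria):
    """Build a 'simple' request: tagged string values in a query string.

    Hypothetical helper; keyword order is preserved in the output.
    """
    return base + "?" + urlencode(criteria)


props, files, subdirs = parse_listing(LISTING)
url = simple_query_url("http://xxx.yyy.zzz/page", name="xyzzy", owner="watson")

print(props["owner"], files, subdirs)
print(url)
```

Because the response is a tree, an OO api could wrap each parsed node in an object and fetch a subdirectory's listing only when the user descends into it; that design choice is an assumption, not something the slides specify.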