DDS, A Seismic Processing Architecture

Reproducible research workshop, UBC, Vancouver, 2006

Randall L. Selzler (RSelzler @ Data-Warp.com)
Jerry Ehlers (Jerry.Ehlers @ BP.com)
Joseph A. Dellinger* (Joseph.Dellinger @ BP.com)


DDS ORIGINS: Amoco TRC, early 90’s

DDS began at the Amoco Tulsa Research Center at a time of great organizational strain. The job of the TRC was to do research and crunch data, not to write software. Creating software is expensive! Amoco’s solution was an edict that “everyone will use DISCO, or else”.


Else!

But DISCO just wasn’t good enough! And so chaos ensued... We were “mired in seismic processing diversity”. DDS grew up surrounded by:

• USP (Amoco internal trace-header based)
• SEPlib (ASCII header pointing to data cubes)
• SU (SEGY trace-header based)
• DISCO (proprietary monitor-based system)

.... and needed to be compatible with all of these!


Although formally cast as a research group, in fact the TRC also functioned as an “internal contractor” processing shop.

1) So to catch on, not only would any software have to be usable for quick-turnaround research, but
2) the ability to process large datasets efficiently and in parallel was also of vital importance. [Terabytes of data, Connection Machines, MPI, OpenMP]
3) The group had accumulated a considerable number and variety of computers. [All “Unix”, but CM5, Cray, Sun, SGI, Linux, Linux clusters, 32 and 64 bit...]
4) Finally, there was an urgent need for software that could accommodate all the various mutant SEGY formats coming into the shop, as well as DISCO, SEPlib, SU, and USP!

and out of the chaos came...


John Etgen was using SEPlib for migration algorithm research on the CM200, a machine that required massively parallel data I/O. He showed SEPlib to Randy Selzler: “I want something that looks like THIS, but can handle the large industrial-strength jobs I need to do!”

And thus DDS was born...


How SEPlib did it

A “header” file and a data file. The “header” file:

    ... processing history ...
    esize=4 (bytes)
    data_format=xdr_float
    in=data_location
    n1=trace_length
    n2=number_traces_per_record
    n3=number_records
    d1=sample_interval
    o1=starting sample
    etc...

The data file: a regularly sampled cube of IEEE 4-byte floats of dimension n1 x n2 x n3.

SEPlib was the system favored by the folks writing programs that worked on large data volumes instead of individual traces.


DDS can look a lot like SEPlib

    SEPlib header file                  DDS “dictionary” file
    ... processing history ...          ... processing history ...
    esize=4 (bytes)                     type=float4
    data_format=xdr_float               format=fcube
    in=data_location                    data= data location
    n1=trace_length                     axis= t offset cdp
    n2=number_traces_per_record         size.t = trace length
    n3=number_records                   size.offset=number traces per record
                                        size.cdp= number records
    d1=sample_interval                  delta.t= sample_interval
    o1=starting sample                  origin.t= starting sample
    label1=seconds                      units.t= seconds
    etc...                              etc...


DDS can look a lot like SEPlib

“dictionary” file:

    type=float4
    format=fcube
    data= data location
    axis= t offset cdp
    size.t = trace length
    size.offset=number traces per record
    size.cdp= number records
    delta.t= sample_interval
    origin.t= starting sample
    units.t= seconds
    etc...

data file: a regularly sampled cube of IEEE 4-byte floats of dimension size.t x size.offset x size.cdp

(command-line arguments look a LOT like SEPlib too)


DDS’s Generalizations

Dictionary:

    …
    axis= t y cmp
    …
    size.t= 1000 size.y= 96 size.cmp= 24
    …
    delta.t= 0.008 units.t= s
    …
    origin.y= 5000 units.y= m
    …
    format= segy
    data= oak39_@

• N-Dimensional Array of I/O Records
    • Densely populated for random access
    • Sequential access if sparse
• Meaningful Axis Names
    • t, x, y, z, w, kx, ky, kz, cmp, shot, offset, …
• Extensible Axis Attributes
    • Regular grid (size, origin, delta, units, …)
    • Variable grid (grid.z= 1 3 5 7 11, …)
    • Non-numeric (label.attr= Vp Vs rho)

[Figure labels: Binary Data, Card Header, Line Header]

Great for research! Exotic algorithms and unforeseen domains can be accurately represented and processed as easily as traditional ones.
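As a rough illustration of how axis attributes such as size, origin, delta, and units can be read out of a dictionary like the one shown above, here is a minimal Python sketch. It is not the DDS parser: the function names `parse_dictionary` and `describe_axes` are hypothetical, the tokenizing rule (a name is whatever precedes `=` in a whitespace-delimited token) is an assumption, escapes such as `\=` are not handled, and the defaults of origin 0 and delta 1 for missing attributes are assumptions made only for this example. The sample values are the ones printed on the slide.

```python
def parse_dictionary(text):
    """Parse 'name= value ...' definitions into {name: [value tokens]}.

    Assumption for this sketch: a token containing '=' starts a new
    definition, and every following token belongs to it until the next one.
    A later definition of the same name replaces the earlier one.
    """
    entries = {}
    current = None
    for token in text.split():
        name, sep, rest = token.partition("=")
        if sep and name:
            current = name
            entries[current] = [rest] if rest else []
        elif current is not None:
            entries[current].append(token)
    return entries


def describe_axes(entries):
    """Collect the regular-grid attributes of every axis named in 'axis='."""
    axes = []
    for name in entries.get("axis", []):
        axes.append({
            "name": name,
            "size": int(entries[f"size.{name}"][0]),
            # Defaults below are assumptions of this sketch, not DDS rules.
            "origin": float(entries.get(f"origin.{name}", ["0"])[0]),
            "delta": float(entries.get(f"delta.{name}", ["1"])[0]),
            "units": " ".join(entries.get(f"units.{name}", [])),
        })
    return axes


SAMPLE = """
axis= t y cmp
size.t= 1000 size.y= 96 size.cmp= 24
delta.t= 0.008 units.t= s
origin.y= 5000 units.y= m
format= segy
"""

if __name__ == "__main__":
    for axis in describe_axes(parse_dictionary(SAMPLE)):
        last = axis["origin"] + (axis["size"] - 1) * axis["delta"]
        print(axis["name"], axis["size"], axis["origin"], last, axis["units"])
```

Because the axes are looked up by name rather than by position, the same code works unchanged for a cube with axes `t offset cdp` or `kx ky w`, which is the point the slide makes about meaningful axis names.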
[Figure label: Traces…]


How USP did it

USP-format data file:

    historical line header (processing history and 3 data dimensions)
    traces:
        element count | trace header | trace samples
        element count | trace header | trace samples
        element count | trace header | trace samples

Unix Seismic Processing

USP was Amoco’s internally home-grown trace-based processing system, beloved of Amoco’s signal processors. USP is similar to SU in concept. USP uses longer trace headers than SU, but they still turned out to not be long enough! USP is still used as much as ever today.


SU and USP use fixed-format trace headers defined by include files

    /*
     * hdr.h - SU include file for segy offset array
     */
    static struct {
        char *key; char *type; int offs;
    } hdr[] = {
        { "tracl", "i", 0},
        { "tracr", "i", 4},
        { "fldr", "i", 8},
        { "tracf", "i", 12},
        { "ep", "i", 16},
        { "cdp", "i", 20},
        { "cdpt", "i", 24},
        { "trid", "h", 28},
        { "nvs", "h", 30},
        { "nhs", "h", 32},
        { "duse", "h", 34},
        { "offset", "i", 36},
        { "gelev", "i", 40},
        { "selev", "i", 44},
        { "sdepth", "i", 48},
        { "gdel", "i", 52},
        { ...


DDS also plays well with USP

DDS dictionary file:

    type=float4
    format=usp
    data= data location
    axis= t offset cdp comp
    size.t = trace length
    size.offset=number traces per record
    size.cdp= number records
    size.comp= number components
    delta.t= sample_interval
    origin.t= starting sample
    units.t= seconds
    etc...

USP-format data file:

    line header (three dimensions)
    traces:
        element count | trace header | trace samples
        element count | trace header | trace samples
        element count | trace header | trace samples

DDS knows what USP headers look like!


and SEGY...
DDS dictionary file:

    type=float4ibm
    format=segy
    data= data location
    axis= t offset cdp comp
    size.t = trace length
    size.offset=number traces per record
    size.cdp= number records
    size.comp= number components
    delta.t= sample_interval
    origin.t= starting sample
    units.t= seconds
    etc...

SEGY-format data file:

    EBCDIC cards
    binary header
    traces:
        trace header | IBM-format samples
        trace header | IBM-format samples
        trace header | IBM-format samples

Note DDS only bothers to convert back to SEGY’s archaic IBM floats when writing to disk!


DDS can speak SU

    editd in=minute2.usp \
          3s=16 3e=16 \
          2s=2 2e=32 2i=2 \
          out_format= su out_data= stdout: \
    | supswigp clip=.2 > wiggle.ps

(note: input format auto-detected)


DDS dictionaries can point at dictionaries!

Top-level dictionary:

    type=float4ibm
    format=segy
    slice.comp
    data= dict.comp1 dict.comp2 dict.comp3
    axis= t offset cdp comp
    size.t = trace length
    size.offset=number traces per record
    size.cdp= number records
    size.comp= number components

dict.comp1 (points at SEGY binary data file data.c1.segy):

    type=float4ibm
    format=segy
    data= data.c1.segy
    axis= t offset cdp
    size.t = trace length
    size.offset=number traces per record
    size.cdp= number records
    ...

dict.comp2:

    type=float4ibm
    format=segy
    ...
    data= data.c2.segy
    axis= t offset cdp

(dict.comp2 points at SEGY binary data file data.c2.segy)


DDS plays well with mutant SEGY

    bridge in= Atlantis_EQ.segy \
      in_format=segy \
      out_format=usp \
      comment="Component Type" \
      map:segy:usp.RcComp= "TotalStatic" \
      \
      comment="Src and rec locations" \
      map:segy:usp.SrPtXC= "SrcX / 10" \
      map:segy:usp.SrPtYC= "SrcY / 10" \
      map:segy:usp.SrPtEl= "15" \
      map:segy:usp.ShtDep= "SrcDepth / 10" \
      \
      map:segy:usp.RcPtXC= "GrpX / 10" \
      map:segy:usp.RcPtYC= "GrpY / 10" \
      map:segy:usp.GrpElv= "Spare.I4[10] / 10" \
      map:segy:usp.CabDep= "Spare.I4[10]" \
      map:segy:usp.DstSgn= "DstSgn / 10" \
      \
      comment="Rec point and line numbers" \
      map:segy:usp.DpPtLn= "Spare.I4[8]" \
      map:segy:usp.DpPtLt= "Spare.I4[9]" \
      \
      comment="Dead or Live" \
      map:segy:usp.StaCor= '( TrcIdCode - 1 ) * 30000' \
    |\
    editd in= stdin: 3e=106 out_data= raw.usp

Callouts on the slide mark three kinds of mapping: a straight map (RcComp= "TotalStatic"), a fixed number (SrPtEl= "15"), and an arithmetic calculation (the "/ 10" entries).


Data formats and mappings

• This is how DDS differs from SEPlib... The properties of the binary data, and all the elements within the binary data, are looked up in the “dictionary”.
• Even the array of trace samples is just another trace field as far as DDS is concerned.
• DDS knows a few default formats, but can use any format that you can define.
• It can also map to and from any format that you can define the necessary mappings for.
• This has the important side effect of documenting the data format, making future reproducibility possible.


DDS supports generic formats

In fact, besides having a few built-in default formats such as USP, SU, and SEGY that are convenient for geophysicists, there is nothing in the core of DDS that limits it to being a seismic processing system!
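The three kinds of header mapping used in the bridge command above (a straight copy of one field, a fixed number, and an arithmetic expression) can be sketched in a few lines of Python. This is only an illustration of the idea, not how DDS evaluates its map definitions: `evaluate` and `map_header` are hypothetical names, the expression grammar is limited to numbers, field names, `+ - * /`, and parentheses (so indexed fields such as `Spare.I4[10]` are not covered), `/` is treated as true division (an assumption), and the input header values are made up.

```python
import ast
import operator

# Arithmetic operators this sketch is willing to evaluate.
_BINARY_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}


def evaluate(expression, fields):
    """Evaluate a small arithmetic expression over named header fields."""
    def walk(node):
        if isinstance(node, ast.Expression):
            return walk(node.body)
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.Name):
            return fields[node.id]
        if isinstance(node, ast.BinOp) and type(node.op) in _BINARY_OPS:
            return _BINARY_OPS[type(node.op)](walk(node.left), walk(node.right))
        if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.USub):
            return -walk(node.operand)
        raise ValueError(f"unsupported expression: {expression!r}")

    return walk(ast.parse(expression.strip(), mode="eval"))


def map_header(in_header, mappings):
    """Build an output header: one expression per output field."""
    return {out: evaluate(expr, in_header) for out, expr in mappings.items()}


# Output field -> expression over input fields, copied from the slide.
MAPPINGS = {
    "RcComp": "TotalStatic",                # straight map
    "SrPtEl": "15",                         # fixed number
    "SrPtXC": "SrcX / 10",                  # arithmetic calculation
    "StaCor": "( TrcIdCode - 1 ) * 30000",  # dead-or-live flag
}

if __name__ == "__main__":
    # Hypothetical input header values, for illustration only.
    segy_header = {"TotalStatic": 3, "SrcX": 4512340, "TrcIdCode": 2}
    print(map_header(segy_header, MAPPINGS))
```

Keeping the mappings as data rather than code is what lets a mutant header layout be handled on the command line, with no recompilation, and leaves a readable record of how the headers were interpreted.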
Internal data formats

• Programs can define their own internal data formats as well, simply by writing definitions into their own internal dictionary:

    fdds_printf('MOD_FIELD', ' *+ float MyHeader1, MyHeader2;\n\0')

• DDS will then convert from the format of the data, as documented by its dictionary, to the internal format specified by the program.
• On output, the internal format will be converted back into whatever output format has been requested on the command line, or by default, the output format will be the same as the input format.


Leverage Diversity? Interoperate!

Data handling is fundamental…

[Diagram. Labels: Non-DDS Application; Format and API Emulation With Random Access I/O; USP Re-link (1998, Proof of Concept); DISCO Support (1997-2003); DDS Application; Generic I/O; Generic Read; Generic Write; API Emulation; Foreign Library; Foreign Format; Any DDS Supported Format; Disk File, Pipe/Socket, Tape.]


Are you scared yet?

• You can probably imagine that all this translating between formats can get very complicated...

    ...
    fmt:SAMPLE_TYPE= typedef float4 SAMPLE_TYPE;
    fmt:USP_ADJUST= typedef enum4 {USP_LINE_PAD \= 0, USP_TRACE_PAD \= 0, USP_HLH_SIZE \= 2236} USP_ADJUST;
    fmt:SEQUENCE= typedef USP_TRACE SEQUENCE;
    alias:fmt:USP_TRACE_PAD= fmt:USP_ADJUST
    alias:fmt:USP_HLH_SIZE= fmt:USP_ADJUST
    alias:fmt:USP_LINE_PAD= fmt:USP_ADJUST
    usp_NumRec= 2056
    ...

But still better than having to change your code or relink your code for every different mutant data format! It also makes it possible to interoperate with historical data formats without too much pain.
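The conversion path described above (data format in, program's internal format in the middle, requested format out) can be illustrated with a small Python sketch in which each record layout is a table of field name, type code, and byte offset, in the spirit of the SU hdr.h table. Nothing here is the DDS API: `decode`, `encode`, and `record_size` are hypothetical helpers, the second layout `OTHER` is invented, big-endian byte order is an assumption, and filling fields that the input lacks with zero is a choice made for this example.

```python
import struct

# (field name, struct type code, byte offset).
# SU_LIKE copies a few entries of the SU hdr.h table; OTHER is made up.
SU_LIKE = [("tracl", "i", 0), ("cdp", "i", 20), ("trid", "h", 28), ("offset", "i", 36)]
OTHER = [("cdp", "i", 0), ("offset", "i", 4), ("trid", "h", 8), ("MyHeader1", "f", 10)]


def record_size(layout):
    """Smallest buffer that holds every field of the layout."""
    return max(off + struct.calcsize(">" + code) for _, code, off in layout)


def decode(buf, layout, internal_fields):
    """Read the fields a program asked for; missing ones default to 0."""
    values = {
        name: struct.unpack_from(">" + code, buf, off)[0]
        for name, code, off in layout
    }
    return {name: values.get(name, 0) for name in internal_fields}


def encode(record, layout):
    """Write a record into the given layout; absent fields are written as 0."""
    buf = bytearray(record_size(layout))
    for name, code, off in layout:
        struct.pack_into(">" + code, buf, off, record.get(name, 0))
    return bytes(buf)


if __name__ == "__main__":
    # A made-up input header in the SU-like layout.
    raw = encode({"tracl": 1, "cdp": 1201, "trid": 1, "offset": -350}, SU_LIKE)

    # The program's internal format adds a field the input does not have.
    internal = decode(raw, SU_LIKE, ["cdp", "offset", "trid", "MyHeader1"])
    internal["MyHeader1"] = 2.5

    # On output, convert to whichever layout was requested.
    out = encode(internal, OTHER)
    print(len(raw), len(out), decode(out, OTHER, ["cdp", "offset", "MyHeader1"]))
```

The program only ever touches the internal record, so supporting another on-disk layout means adding another table, not changing or relinking the program.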
DDS scripting as a Rosetta stone

    /apps/global/bin/bridge \
      in= /hpc/dat13/zdsr01/Node/EQ/all.segy \
      in_format=segy out_format=usp \
      comment="Component Type" \
      map:segy:usp.RcComp= "TotalStatic" \
      comment="Src and rec locations" \
      map:segy:usp.SrPtXC= "SrcX / 10" \
      map:segy:usp.SrPtYC= "SrcY / 10" \
      map:segy:usp.SrPtEl= "15" \
      map:segy:usp.ShtDep= "SrcDepth / 10" \
      comment="Azimuth, Roll Tilt" \
      map:segy:usp.TVPT01= "100 * Spare.F4[11]" \
      map:segy:usp.TVPT02= "100 * Spare.F4[12]" \
      map:segy:usp.TVPT03= "100 * Spare.F4[13]" \
      comment="Dead or Live" \
      map:segy:usp.StaCor= '( TrcIdCode - 1 ) * 30000' \
      comment="Shot Time" \
      map:segy:usp.TVPT15=Date.DateYear \
      map:segy:usp.TVPT16=Date.DateDay \
      map:segy:usp.TVPT17=Date.DateHour \
      map:segy:usp.TVPT18=Date.DateMin \
      map:segy:usp.TVPT19=Date.DateSec \
      ....


In Conclusion: caveats

• Things aren’t so complicated if you use DDS as if it were SEPlib, but then what’s the point?
• Because so much functionality already exists in USP, there has been little motivation to flesh out DDS.
• The external distribution is a subset of the same stuff we use internally. There has been little effort put into improving the “packaging”.
• While there is some documentation, it is somewhat lacking!


In Conclusion: upsides

• The software infrastructure inside BP today is based almost entirely on DDS and USP. It is BP’s infrastructure both for research and for processing. BP’s advanced imaging team in Houston is “BP’s largest contractor”.
• The DDS I/O library was released publicly in 2003 on “freeusp.org”. The core of the USP system was released a year or so earlier on the same web site, along with some ARCO-heritage processing systems as well.
• By releasing USP and DDS, BP hoped to make it easier to share algorithms with academia and contractors.
• Randy Selzler now wants to create a successor to DDS, but that’s his talk, as the “prophet”, to give...