ORIGAMI AND GIPS: RUNNING A HYPERSPECTRAL SOUNDER PROCESSING SYSTEM ON A LIGHTWEIGHT ON-DEMAND DISTRIBUTED COMPUTING FRAMEWORK Maciej Smuga-Otto, Raymond Garcia, Graeme Martin, Bruce Flynn, Robert Knuteson Space Science and Engineering Center University of Wisconsin - Madison The scientific workflow need we are addressing involves the retrieval, processing and visualization of large volumes of data, often on the order of Terabytes, associated with hyperspectral sounder-imager instruments. Science workstations, compute clusters and storage area networks are the hardware components available to us. Likewise, certain commodity software packages for visualization, queueing, data storage management and grid resource management are readily available. While there are also off the shelf science workflow management and scientific programming packages out there, these are often proprietary and impose serious architectural constraints on the rest of the system. Origami: Step-by-step Origami is the outcome of a parallel effort in distributed computing at the SSEC - it is an experiment in rapid prototyping of distributed science workflows using lightweight web components. As such, it can be extended and other components This XML interface style has already spread to GEOCAT and LEOCAT, two other satellite data processing frameworks being developed at the SSEC. These frameworks utilize a fundamentally similar XML interface format as well as the suite of tools for managing it. • visualization McIDAS-V MATLAB, IDL User chooses desired product and searches for input data Metadata server Dispatcher process on cluster distributes work order among nodes Workstation • data discovery THREDDS Web portals Custom databases Cluster Cluster nodes run job, access raw data on SAN User is notified of completed work order Launch visualization of product (eg. McIDAS V) SAN Storage System Centralized Computing resources User visualization workstation to serve as a glue technology between GIPS running on of a true distributed computing system. To make GIPS compatible with an external distribution and workflow manager such as Origami, we found that it was beneficial to convert all algorithm interfaces to a common XML format. This XML interface file would be converted to an API-level interface for compiling into GIPS, and would also be used by the Origami engine to perform workflow and scheduling decisions. individual nodes, Processing raw data from the Geosynchronous Imaging Fourier Transform Spectrometer (GIFTS) instrument was the principal motivation for this work. Initially the GIFTS Information Processing System (GIPS) was meant to cover various aspects of distributed computing as well as hosting the GIFTS algorithms. Eventually, it was down-scoped to handle provisioning of the algorithms with appropriate data and metadata. This form of GIPS is suitable for running on a single compute node, and the need arose to find a distributed computing solution to match. Origami by-the-numbers: Workflow for processing and visualizing large volumes of data • data storage management Storage area networks: XSAN Distributed Volumes: LUSTRE Federated data service: SRB Service Layer • data provenance tracking Initial FFT (Ifg to Spectrum) Stage Complex Spectra (optional) Nonlinearity and tilt correction Complex Spectra Radiometric Calibration Stage Computing Cluster • computational payload GIPS GEOCAT, LEOCAT Individual binaries Separate by band, (and optionally by scan direction and pixel) Interferograms • queueing and resource management Sun Grid Engine OpenPBS custom scripts • metadata management Stream of incoming observations (interferograms) with metadata Data • scientific workflow globus and other grid technologies Component level architecture of the distributed processing system contributes to individual buy vs. build decisions. Metadata Retrieve records for time, band, scan direction, and (optionally) pixel index of incoming observations create context from blackbody observations use context to apply correction create context from blackbody observations use context to calibrate Real Calibrated Spectra Pipeline Reference Database Instance(s) Spectral Resampling Stage use context to calibrate Nonlinearity study Stage (research) Science user-provided files Task Dispatcher XML interface description interface file generator Radiometric Calibration Study Stage Algorithm Source Code Spectral Calibration Study Stage (research) Algorithm Interface files Real Calibrated Resampled Spectra Deploy Bundle reference (audit) metadata with outgoing calibrated spectra Compile Algorithm Binaries Deployment script Algorithm At the heart of GIPS is an algorithm pipeline that produces calibrated radiances from raw imaging interferometer data and is a candidate heavyweight ground processing task Integration of Origami and GIPS drove the creation of an XML interface for algorithms hosted by the system