3SAQS Technical Workshop
October 31 – November 1, 2013

Data Warehouse Status and Planning Update

Zac Adelman (UNC-IE)
Shawn McClure (CSU-CIRA)
Tom Moore (WGA-WRAP)

Summary of Past Quarter Activities
• Researched and experimented with large data transfer technologies (iRODS, Globus Connect, etc.)
• Configured a large dual RAID array on the primary file server (~20TB) and designed a third RAID array to bring the total storage capacity to 50TB+
• Imported the WestJump source data files onto the primary file server and organized them into a uniform folder structure (meteorology, emissions, results)
• Created an FTP site on the primary file server for facilitating direct, basic access to the source data files
• Made available the current inventory of source data files on the new FTP site
• Began the design of the content, format, and coding protocols for submitting model results and other data to the TSDW
• Began the design of the schema and code infrastructure for the “project overview and tracking” system
• Continued to refine the database, software, and website infrastructure supporting the data warehouse
• Continued to refine various pre-processing components
  o XML Generator for metadata
  o Boundary Conditions Generator
  o CAMx Post-Processing Utility
  o RDBMS data import system
• Refined the logical and physical file system design
• Refined the data verification and validation system

Operational Website Components
• User Login Form
• User Registration/Modification Form
• User Profile/Account Form
• User Feedback Form
• Dataset Request Form
• Database Query Wizard
• Raw Data Download
• Interactive Charts
• Dynamic Contour Maps
• Site Metadata Reports
• Monitoring Site Metadata Browser
• File Explorer
• FTP site Authentication and Authorization System

Possible Future Website Components
• Modeled Emissions Summary Tool
• Modeled-to-Observed Data Comparison Tool
• Air Quality Summary Reports
  o Visibility
  o Deposition
  o Ozone
  o Other
• Model Data Mapping Tool
• Source Apportionment Tool
• Various Unpublished Monitoring Data Tools
• Backend Web Services and Processing Components

Summary of Coming Activities
• Conduct additional use case tests
• Finalize the large data transfer system
• Import preexisting/legacy air quality studies and results
• Commence production-level data warehouse operations (hosting, data analysis and processing, maintenance, et cetera)
• Design visualization and analysis tools for modeling results and performance evaluation
• Design the “project overview and tracking” interface for the TSDW website

TSDW Architecture Diagram - Overview
[Figure: TSDW architecture overview. Labels recovered from the diagram: Users and Providers; Other Data Systems; IRMA; NRIS; AQS; NPS; BLM; USFS; EPA; States; External TSDW Interface; TSDW Website; TSDW FTP Site; Web Services; Data Services; Standard API; Standard HTTP API; JSON; XML; OGC; TSDW Software Libraries; Air Quality-Specific Software Libraries; Third Party Software Libraries; Generalized Software Libraries; Data Access Layer; TSDW Data Management; Data Files; RDBMS; Spatial DB; Data Acquisition and Import System; Source Data]

TSDW Data Flow Diagram - Overview
[Figure: TSDW data flow overview. Labels recovered from the diagram: Data Sources (State & Local Agencies, EPA, Mexico, Canada, etc); Source Categories; Meteorological Inputs (Weather Observations, Landuse/Landcover, Initial Conditions, Physics Options); Emissions Inventories (Point & Area Sources, Oil and Gas, Biogenic, Fire (anthro, natural), etc); Model Inputs; Monitoring Data; Land Use & Cover; 3SAQS; Boundary Conditions; AQS; Initial Conditions; Photolysis Rates; IMPROVE; CASTNet; Data Provider Processing; Model-Ready Processing (e.g. reformatting, regridding); Meteorological Models (e.g. WRF, MM5); Emissions Modeling (e.g. SMOKE); Met Data Processing (e.g. MCIP2); BEIS; MOVES; Three State Data Warehouse; File Server; Model-Ready Input Data; Database Server; Gridded Model Results; Air Quality Modelers; Photochemical Grid Modeling; CMAQ; CAMx; Web Services; DBMS-Ready Model Results; Website; Products, Reports, and Analyses; Planners, Stakeholders, and Users; Oil and Gas Permits; Recommendations]

TSDW Use Cases
Definition of "Use Case": A list of steps defining the interactions between a user and a system to achieve a specific goal. The "user" can be a human or an external system, depending on context.

Scopes of Use Cases: The subset of users to which the functionality of a given use case is made available
• Internal: The TSDW administration and development team
• External: A subset of external users that have been granted a specific role
• Public: The general public - anyone who visits the TSDW website

Potential User Roles:
• Administrators
• Project Managers
• Project Team Members
• Stakeholders
• Data Providers
• Planners
• Public

Use Case Description
Obtain and Manage Model Input Data (Scope: Internal)
1. Obtain model input data from data provider(s)
2. Copy model input data files to file server
3. Organize model input data on the file server
   a. File and folder naming convention
   b. Physical file system organization (what developers see)
   c. Logical file system organization (what the user sees)
   d. Dataset partitioning (temporal, spatial, functional, etc.)
4. Perform periodic backup of "active" model input data
5. Perform periodic archival of "inactive" model input data
6. Track and manage the versioning of the model input data

Use Case Description
Harvest File Metadata Using the XML Metadata Generator (Scope: Internal)
1. An administrator locates the desired root folder in the file system
2. An administrator executes the XML Generator program to produce XML files containing file metadata
3. (Ideally, the above two tasks could be automatically run as a "cron" task on a regular, periodic basis, rather than as a two-step manual process.)
4. The File Indexing Utility (FIU) processes the newly-generated files to extract the relevant file metadata
5. The FIU updates the RDBMS with the file metadata
6. The new file metadata is automatically reflected in the TSDW File Explorer Tool
Dependencies:
• The XML File Metadata Generator program
• The File Indexing Utility (FIU)
• The appropriate RDBMS schema, SQL scripts, and software libraries for managing source file metadata

Use Case Description
Download Model Input Data from TSDW, Online Method (Scope: External)
1. User logs into the TSDW website
2. User fills out the Dataset Request (DR) form
3. The user is redirected to the Dataset Request confirmation message/page
4. The DR form is passed to the Dataset Packaging System (DPS)
5. The DPS registers metadata about the request into the RDBMS
6. The DPS locates the physical files that are needed to fulfill the order
7. The DPS assembles, organizes, and compresses the component files into a downloadable "package"
8. The DPS creates a unique "PackageID" that will be linked with this package throughout its lifecycle
9. The DPS registers metadata about the package (including the "PackageID") into the RDBMS
10. The DPS notifies the requesting user of the package's availability
11. The user logs back into the TSDW website (if necessary)
12. The user initiates a session of the Dataset Transfer System (DTS) to download the files
13. The DTS registers metadata about the package "receipt" into the RDBMS
14. The DIS notifies the appropriate TSDW administrator(s) of the download
Dependencies:
• Dataset Request Form
• Dataset Request confirmation message/page
• Dataset Packaging System (DPS) (could be one-and-the-same with iRODS or Globus)
• Appropriate RDBMS schema and SQL scripts/commands for managing Dataset Request metadata
• Appropriate RDBMS schema for associating Dataset Requests with Users and Projects
• A high volume data transfer program such as iRODS or Globus Connect Server

Use Case Description
"Download" Model Input Data from TSDW, Offline Method (Scope: External)
1. User logs into the TSDW website
2. User fills out the Dataset Request (DR) form
3. The DR form is passed to the Dataset Packaging System (DPS)
4. The DPS registers metadata about the request into the RDBMS
5. The DPS locates the physical files that are needed to fulfill the order
6. The DPS creates a unique "PackageID" that will be linked with this package throughout its lifecycle
7. The DPS registers metadata about the package (including the "PackageID") into the RDBMS
8. The DPS notifies the requesting user of the order receipt and future hard drive shipment
9. The DPS sends a list of the files that comprise the order to a TSDW administrator
10. A TSDW administrator copies the selected files onto a hard disk drive (HDD) or drives
11. A TSDW administrator mails the drive(s) to the requesting user
12. A TSDW administrator records the shipment in the RDBMS
Dependencies:
• Dataset Request Form
• Dataset Request confirmation message/page
• Dataset Packaging System (DPS)
• Appropriate RDBMS schema and SQL scripts/commands for managing Dataset Request metadata
• Appropriate RDBMS schema for associating Dataset Requests with Users and Projects
• A manual process for copying data files onto hard disks and mailing them to users

Use Case Description
Download Boundary Conditions Generator (Scope: External)
1. User logs into the TSDW website
2. User navigates to the Modeling Utilities section of the website
3. User fills out the Boundary Conditions Generator (BCG) download form
4. The BCG download form is passed to the Utility Tracking System (UTS)
5. The UTS extracts information from the metadata file associated with the current BCG
6. The UTS associates this metadata with the appropriate User record in the RDBMS
7. The UTS redirects the user to a download link for the BCG
8. The user downloads the BCG and any associated instructions and configuration files
9. The DIS notifies the appropriate TSDW administrator(s) of the download
Dependencies:
• Boundary Conditions Generator (BCG) program
• BCG user guide
• BCG download form
• BCG download confirmation message/page and installation file link
• The appropriate RDBMS schema, SQL scripts, and software libraries for managing BCG download metadata

Use Case Description
Download the CAMx Post-Processing Utility (Scope: External)
1. User logs into the TSDW website
2. User navigates to the Modeling Utilities section of the website
3. User fills out the CAMx Post-Processing Utility (CPPU) download form
4. The CPPU download form is passed to the Utility Tracking System (UTS)
5. The UTS extracts information from the metadata file associated with the current CPPU
6. The UTS associates this metadata with the appropriate User record in the RDBMS
7. The UTS redirects the user to a download link for the CPPU
8. The user downloads the CPPU and any associated instructions and configuration files
9. The DIS notifies the appropriate TSDW administrator(s) of the download
Dependencies:
• CAMx Post-Processing Utility (CPPU) program
• CPPU user guide
• CPPU download form
• CPPU download confirmation message/page and installation file link
• The appropriate RDBMS schema, SQL scripts, and software libraries for managing CPPU download metadata

Use Case Description
Upload Model Results (Scope: External)
1. User logs into the TSDW website
2. User navigates to the Modeling Results Upload section of the website
3. User fills out the Modeling Results Upload form
   a. User provides a standard description of the model results
   b. User provides the "Package ID" of the model input data used
   c. User provides the Boundary Conditions Generator "Version ID", if relevant
   d. User provides the CAMx Post-Processing Utility "Version ID", if relevant
   e. User selects the files to upload
   f. User clicks the "Submit" button on the form
4. The Modeling Results Upload form is passed to the Data Import System (DIS)
5. The data files are uploaded and cataloged by the DIS
6. The DIS creates a unique "DatasetID" that will be linked to this upload throughout its lifecycle
7. The DIS registers metadata about the upload (including the "DatasetID") into the RDBMS
8. The DIS notifies the uploading user of the upload success or failure (generally, its "status")
9. The DIS places the file(s) into the appropriate location(s) on the TSDW file system
10. The DIS notifies the appropriate TSDW administrator(s) of the upload
Dependencies:
• Modeling Results Upload (MRU) form
• MRU system
• Appropriate RDBMS schema and SQL scripts/commands for managing MRU metadata

Use Case Description
Import Database-Ready Model Results (Scope: Internal)
1. An administrator locates the newly-imported model results (which have been generated by the CPPU and uploaded to the TSDW)
2. An administrator executes the appropriate scripts/commands using the Data Import System (DIS)
3. The DIS reads and imports the database-ready model results into the RDBMS
   a. The DIS verifies that all the necessary metadata is present in the RDBMS
   b. The DIS transforms the data into the appropriate schema for import
   c. The DIS maps source codes and names to internal codes and names, as needed
   d. The DIS imports the data from the source file(s) into the RDBMS
   e. The DIS makes/updates the appropriate metadata records in the RDBMS for tracking the imported model Dataset
   f. The imported model results become automatically available via the relevant tools on the TSDW website
Dependencies:
• The CAMx Post-Processing Utility (CPPU) for generating the database-ready model results
• The Data Import System (DIS)
• Appropriate RDBMS schema and SQL scripts/commands for managing Model Results metadata

Use Case Description
Visualize and Analyze Monitoring Data (Scope: External)
1. User logs into the TSDW website
2. The user chooses an appropriate visualization and/or analysis tool to use
3. Using the tool, the user specifies spatial, temporal, and other dimensional filters for the data as well as display and formatting options
4. The tool displays monitoring data in various output products, such as:
   a. Data summary tables
   b. Bar charts
   c. Line charts
   d. Pie charts
   e. Contour maps
Dependencies:
• An appropriate collection of monitoring data
• Specific design specifications for monitoring data output products
• An appropriate collection of online visualization tools and technologies

Use Case Description
Visualize and Analyze Model Results (Scope: External)
1. The user logs into the TSDW website
2. The user chooses an appropriate visualization and analysis tool to use
3. Using the tool, the user specifies spatial, temporal, and other dimensional filters for the data as well as display and formatting options
4. The tool displays model performance and evaluation results in various output products, such as:
   a. Normalized mean error and bias
   b. Mean normalized error and bias
   c. Root mean square error
   d. Correlation coefficients
   e. Soccer plots
   f. Box and whisker plots
   g. Bugle plots
   h. Spatial statistical plots
   i. Spatial concentration plots with observation overlays
Dependencies:
• An appropriate collection of model results data
• Specific design specifications for model results output products
• An appropriate collection of online visualization tools and technologies

Use Case Description
View Project Data and Metadata (Scope: External)
1. A user logs into the TSDW website
2. The user navigates to the Projects and Studies section of the TSDW website
3. The user views metadata associated with the projects that he/she has permission to view
   a. Name, purpose, description
   b. Contact information: project manager(s), contractors, etc.
   c. Associated datasets: Model input data downloaded, model results uploaded, etc.
   d. Analysis products: Charts, graphs, summaries, etc.
4. The user views data associated with the projects that he/she has permission to view
   a. Model input data
      i. Meteorological inputs
      ii. Emissions inputs
      iii. Initial and Boundary Conditions
      iv. Ancillary inputs (land use, land cover, photolysis)
   b. Model configuration metadata
   c. Model results
      i. Gridded results
      ii. Observation-paired results
   d. Monitoring data
Dependencies:
• Appropriate RDBMS schema and SQL scripts/commands for managing Project metadata
  o Projects
  o Users
  o Downloaded/Uploaded Datasets
  o Documents
  o Analysis products
• An online user interface for the Projects and Studies section of the TSDW website

Use Case Summary
• Obtain and Manage Model Input Data (Scope: Internal)
• Harvest File Metadata Using the XML Metadata Generator (Scope: Internal)
• Download Model Input Data from TSDW, Online Method (Scope: External)
• "Download" Model Input Data from TSDW, Offline Method (Scope: External)
• Download Boundary Conditions Generator (Scope: External)
• Download the CAMx Post-Processing Utility (Scope: External)
• Upload Model Results (Scope: External)
• Import Database-Ready Model Results (Scope: Internal)
• Visualize and Analyze Monitoring Data (Scope: External)
• Visualize and Analyze Model Results (Scope: External)
• View Project Data and Metadata (Scope: External)

Thanks.

Review of the 3SDW

Overall System Ecosystem and Architecture
[Figure: 3SDW ecosystem diagram. Labels recovered from the diagram: Guidance, Requirements, Feedback, Funding; NPS; WGA; CIRA; Architecture, Design, Implementation, Management, and Operation; Monitored; Aerosol; Deposition; Gaseous; Raw Data; AQS, VIEWS; Modeled; Emissions; Met; Air Quality; Results; WestJump, future modeling, etc; 3SDW; Tools; Documents; Modelers; Planners; Managers]
Processing stages shown in the diagram:
• Acquisition: Identification, Acquisition, Pre- and Postprocessing, Extraction
• Integration: Verification, Validation, QA/QC, Mapping, Flagging, Transformation
• Management: Storage, Backup, Restore, Security, Summarizing, Statistics
• Distribution: Searching, Querying, Filtering, Aggregating, Formatting, Packaging
• Presentation: Charting, Graphing, Mapping, Analyzing

• User Login Form
• User Registration/Modification Form
• User Profile/Account Form
• User Feedback Form
• Dataset Request Form
• Raw Data Download (Query Wizard)
• Time Series Charts (Query Wizard)
• Dynamic Contour Maps (Query Wizard)
• Site Metadata Report (Query Wizard)
• Monitoring Site Browser
• Modeled Emissions Summary Tool
• Modeled-to-Obs Comparison Tool
• Air Quality Summary Reports
• Model Data Mapping Tool
• Future Online Visualization Tools

First TSDW Modeling Use Case
Report and Results

First Use Case - Beta Test Steps
• Testers visited the TSDW website and registered with the system to create an account
• Testers visited the Data Request web page and entered their requests for the WestJump Base08b dataset
• Each request was stored in the database
• The system determined whether or not each request could be automatically filled or had to be manually assembled
• The system sent emails to the appropriate TSDW team members to notify them of the data requests
• TSDW team members assembled the dataset requests (copied the relevant data files onto hard drives)
• The datasets (hard drives) were delivered to the beta testers
• The system updated the dataset requests to reflect their “filled” status

First Use Case - Beta Test Steps (cont’d)
• Using the delivered datasets, testers ran the models and generated results
• Testers returned the model output results to the 3SDW
• The test results were assessed by TSDW team members
• Testing outcomes were summarized for the May 3-State AQ Study Technical Workshop
• The TSDW team refines the dataset ordering, download, packaging, and delivery system according to lessons learned
• The TSDW team develops the next Use Case testing scenario(s)

Summer (June – October) 2013
3SAQS Technical Work Review
Data Warehouse Activities

Summary of Coming Activities
• Implement the collaborative components of the warehouse
• Implement the ongoing news and updates section
• 3SDW on-line for NEPA air quality analysis projects by end of October
• Out-bound data delivery and in-bound data ingestion for NEPA and other air quality studies
• Data warehouse operations (hosting, data analysis and processing, maintenance, et cetera)
• Plans for storage/access/visualization for modeling results and evaluation tools
• Store UT BLM ARMS and other studies’ data in 3SDW after evaluation using protocols

Testing and Refinement Help
• All users, collaborators, and partners can help with testing
• Please report bugs – don’t endure them
• Use the website Feedback form
• Send direct email to team members
• Provide as much information as possible up-front
• Stay abreast of ongoing additions and updates
• Be an active part of the design process - make suggestions for features and refinements
• Don’t assume it can’t be done
• Don’t assume it can be done