Federated PM and Haze Data Warehouse Project
a sub-project of (enter your sticker & logo here)

St. Louis Midwest Supersite Project
RPO: Regional Planning Organization
SupSite: EPA Supersites
NARSTO: NARSTO PM
EPA: EPA Division1, Division2, Division2
Me: Me and my dog for our aerosol project

Nov 20, 2001, RBH

PM/Haze Data Flow in Support of AQ Management

[Diagram labels: FLM (Federal Land Managers); RPO (Regional Planning Orgs); EPA (Regul. & Research); Shared PM/Haze Data; SuperSite; NARSTO; Industry; Academic; Other: Private, Academic]

• PM and haze data are used for many parts of AQ management, mostly in the form of Reports
• There are numerous organizations in need of data relevant to PM/Haze
• The variety of pertinent (ambient, emission) data comes from many different sources
• Most interested parties (stakeholders) are both producers and consumers of PM and haze data
• To produce relevant reports, the data need to be ‘processed’ (integrated, filtered, aggregated)
• There is a general willingness to share data, but the resistances to data flow and processing are too high

Scientific and Administrative Rationale for Resource Sharing

Scientific Rationale:
• Regional haze and its precursors have a 1000-10000 km airshed (Smoke, Dust, Haze)
  - Data integration
• A substantial fraction of haze originates from natural sources or from out-of-jurisdiction man-made sources
• Cross-RPO data and knowledge sharing yields better operational and science support to AQ management

Management Rationale:
• Haze control within some RPOs cannot yield
• Data sharing saves money and ….

A Strategy for the Federated PM/Haze Data Warehouse

• Negotiate with the data providers to ‘open up’ their data servers for limited, controlled access in accordance with a clear ‘access contract’ with the Federated Warehouse
• Design an interface to the warehoused datasets that has simple data access and satisfies the data needs of most integrating users. (oxymoron ????)
• Facilitate the development of shared value-adding processes (analysis tools, methods) that refine the raw data to useful knowledge

Three-Tier Federated Data Warehouse Architecture

(Note: In this context, ‘Federated’ differs from ‘Federal’ in the direction of the driving force. ‘Federated’ is meant to indicate a driving force for sharing from the ‘bottom up’, i.e. from the members, not dictated from ‘above’ by the Feds.)

1. Provider Tier: Back-end servers containing heterogeneous data, maintained by the federation members
2. Proxy Tier: Retrieves designated Provider data and homogenizes it into common, uniform Datasets
3. User Tier: Accesses the Proxy Server and uses the uniform data for presentation, integration or processing

Federated Data Warehouse

User Tier: Data presentation, processing
Proxy Tier: Data homogenization, transformation
Provider Tier: Heterogeneous data in distributed SQL Servers

Federated Data Warehouse Interactions

• The Provider servers interact only with the Proxy Server in accordance with the Federation Contract
  - The contract sets the rules of interaction (accessible data subsets, types of queries)
  - Strong server security measures are enforced, e.g. through Secure Sockets Layer
• The data User interacts only with the generic Proxy Server using a flexible Web Services interface
  - Generic data queries, applicable to all data in the Warehouse (e.g. data sub-cube by space, time, parameter)
  - The data query is addressed to the Web Service provided by the Proxy Server
  - Uniform, self-describing data packages are passed to the user for presentation or further processing

Federated Data Warehouse

[Diagram labels: Proxy Tier; Provider Tier; Data Homogenization, etc.;
Heterogeneous Data; User Tier: Data Consumption (Presentation, Processing, Integration), Data Access & Use; adapter and server pairs: SQLDataAdapter1 / SQLServer1, SQLDataAdapter2 / SQLServer2, CustomDataAdapter / LegacyServer; Proxy Server; Member Servers; interfaces: Web Service, Uniform Query & Data; Fire Wall, Federation Contract]

Live Demo of the Data Warehouse Prototype

http://capita.wustl.edu/DSViewer/DSviewer.aspx

Currently, online data are accessible from the CIRA (IMPROVE) and CAPITA SQL servers.

Uniform Data Query regardless of the native schema: query by parameter, location, time, method

The hidden DataAdapter
- accepts the uniform query
- accesses the data server
- transforms the original to uniform data
- delivers uniform DataSets

A rudimentary viewer displays the data in a table for browsing.

‘Global’ and ‘Local’ AQ Analysis

• AQ data analysis needs to be performed at both global and local levels
• ‘Global’ refers to regional, national, and global analysis. It establishes the larger-scale context.
• ‘Local’ analysis focuses on the specific and detailed local features
• Both global and local analyses are needed for full understanding.
• Global-local interaction (information flow) needs to be established for effective management.

National and Local AQ Analysis

Integration for Global-Local Activities

Global and local activities are both needed, e.g. ‘think global, act local’
‘Global’ and ‘Local’ here refer to relative, not absolute, scale

Global Activity => Local Benefit
Global data, tools => Improved local productivity
Global data analysis => Spatial context; initial analysis
Analysis guidance => Standardized analysis, reporting

Local Activity => Global Benefit
Local data, tools => Improved global productivity
Local data analysis => Elucidate, expand initial analysis
Identify relevant issues => Responsive, relevant global work

Data Re-Use and Synergy

• Data producers maintain their own workspace and resources (data, reports, comments).
• Part of the resources is shared by creating common virtual resources.
• Web-based integration of the resources can be across several dimensions:
  - Spatial scale: Local-global data sharing
  - Data content: Combination of data generated internally and externally

[Diagram labels: Local User; Global User; Content; Shared part of resources; Virtual Shared Resources: Data, Knowledge, Tools, Methods]

• The main benefits of sharing are data re-use, data complementing and synergy.
• The goal of the system is to have the benefits of sharing outweigh the costs.

Federated Data Warehouse Features

• Data reside in their respective home environment, where they can mature. ‘Uprooted’ data in separated databases are not easily updated, maintained, or enriched.
• Abstract (universal) query/retrieval facilitates integration and comparison along the key dimensions (space, time, parameter, method)
• The open data query based on Web Services promotes the building of further value chains: Data Viewers, Data Integration Programs, Automatic Report Generators, etc.
• The data access through the Proxy server protects the data providers and the data users from security breaches, excessive detail
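
Illustrative sketch (not the prototype's code): the uniform query and the hidden data adapters described in these slides could look roughly like the Python below. It assumes two providers with different native schemas and a proxy that passes one query to every adapter. All class names, schemas, site names, parameter codes and values are hypothetical placeholders; the real prototype is an ASP.NET Web Service and is not reproduced here.

```python
from dataclasses import dataclass
from datetime import date
from typing import Iterable, List, Optional, Protocol


@dataclass(frozen=True)
class UniformQuery:
    """Hypothetical uniform query: a sub-cube by parameter, location and time."""
    parameter: str
    location: Optional[str] = None  # None means "all locations"
    start: date = date.min
    end: date = date.max

    def matches(self, location: str, day: date) -> bool:
        in_place = self.location is None or self.location == location
        return in_place and self.start <= day <= self.end


@dataclass(frozen=True)
class UniformRecord:
    """Hypothetical uniform record that every adapter must deliver."""
    parameter: str
    location: str
    day: date
    value: float
    method: str
    source: str


class DataAdapter(Protocol):
    def fetch(self, query: UniformQuery) -> Iterable[UniformRecord]: ...


class WideTableAdapter:
    """Provider storing one row per site and day, with a dict of parameter values."""

    def __init__(self, source: str, method: str, rows: List[dict]):
        self.source, self.method, self.rows = source, method, rows

    def fetch(self, query: UniformQuery) -> Iterable[UniformRecord]:
        for row in self.rows:
            if query.parameter in row["values"] and query.matches(row["site"], row["date"]):
                yield UniformRecord(query.parameter, row["site"], row["date"],
                                    float(row["values"][query.parameter]),
                                    self.method, self.source)


class LongTableAdapter:
    """Provider storing (site, native code, ISO date string, value) tuples."""

    def __init__(self, source: str, method: str, codes: dict, rows: List[tuple]):
        self.source, self.method, self.codes, self.rows = source, method, codes, rows

    def fetch(self, query: UniformQuery) -> Iterable[UniformRecord]:
        for site, code, iso_day, value in self.rows:
            day = date.fromisoformat(iso_day)  # homogenize the native date format
            if self.codes.get(code) == query.parameter and query.matches(site, day):
                yield UniformRecord(query.parameter, site, day, float(value),
                                    self.method, self.source)


class ProxyServer:
    """The only component the user talks to; it fans the query out to all adapters."""

    def __init__(self, adapters: Iterable[DataAdapter]):
        self.adapters = list(adapters)

    def query(self, q: UniformQuery) -> List[UniformRecord]:
        records = [rec for adapter in self.adapters for rec in adapter.fetch(q)]
        return sorted(records, key=lambda r: (r.day, r.location, r.source))


def build_demo_proxy() -> ProxyServer:
    """Two made-up providers; every name and number here is a placeholder."""
    wide = WideTableAdapter("ProviderA", "MethodX", [
        {"site": "SiteA", "date": date(2001, 11, 1), "values": {"PM25": 12.0, "SO4": 3.1}},
        {"site": "SiteB", "date": date(2001, 11, 2), "values": {"PM25": 8.5}},
    ])
    tall = LongTableAdapter("ProviderB", "MethodY", {"P01": "PM25"}, [
        ("SiteC", "P01", "2001-11-01", 20.0),
        ("SiteC", "P01", "2001-12-01", 15.0),
    ])
    return ProxyServer([wide, tall])
```

A user-tier program would call, for example, build_demo_proxy().query(UniformQuery("PM25", location="SiteC")) and receive uniform records without knowing either provider's schema. Supporting a new provider then means writing one more adapter, with no change on the user side.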