A New Enterprise Data Management Strategy for the US EPA

9/5/2007
Part 3: Integration of Data Tables
Brand L. Niemann, Senior Enterprise Architect,
US EPA Enterprise Architecture Team, and
Co-Chair, Federal Semantic Interoperability Community of Practice (SICoP)
Summary
Part 1 (1) outlined “A New Enterprise Data Management Strategy for the US EPA” based
on:
The premise of reusing the data and information, rather than changing the data systems
themselves, by putting the business and technical rules, logic, etc. into the data itself
using markup languages; and
The concepts and standards of the Semantic Web (also called the Data Web or Web 3.0),
where the most important tenets of reuse are:
Bring the data and the metadata back together.
Bring the structured and unstructured data and information back together.
Bring the data and information description and context back together.
Part 2 (2), also for the Metatopia 2007 Conference (3), shows how the use of high-quality
content based on considerable multi-disciplinary subject matter expertise can be reused to
build a knowledgebase that contains both an ontology and a database of instances. In this
case the knowledgebase contains an inventory of the data assets that shows the critical
importance of standardized metadata and cross-agency data sharing to the mission of the
US EPA and how the implementation of the data asset inventory supports the four
functionalities of DRM 3.0 and Web 3.0, namely, Integration of Data and Metadata,
Harmonization, Enhanced Search, and Mashups.
Part 3 shows how to integrate data tables within and across categories in the inventory of
the data assets. Systems for exporting relational data to RDF have existed since the
beginning of the Semantic Web (4, 5). Recently, the Semantic Web developers have
focused on SPARQL query-rewriters and interpreters to access relational data directly.
Both of these approaches share an expression of relational data in RDF. Access to this
structured data can increase the size and utility of the Semantic Web many times over (6).
This position paper for the upcoming W3C Workshop on RDF Access to Relational
Databases (6) makes Government data tables and relational databases (e.g. LandView 6
and 7 on DVD) (1) readily accessible for Semantic Web pilots.
Table of Contents
1. Introduction
2. Data Asset Inventory
3. Data Table Integration
4. Recommendations
5. References
1. Introduction
A New Enterprise Data Management Strategy for the US EPA has two parts so far (1, 2),
prepared for presentation at the upcoming Metatopia 2007 Conference (3). At the recent
W3C/WSRI Workshop entitled “Toward More Transparent Government on eGovernment
and the Web” (4), SICoP suggested that a clear message about the role of RDF in data
exchange and a series of pilots using government data sources would help educate the
Federal Government about the Semantic Web (aka the Data Web) and demonstrate its
value. The W3C has a new Semantic Web Layer Cake (5) in which RDF has
moved into the XML space and has been expanded with query and rules. The W3C also
has an upcoming Workshop on RDF Access to Relational Databases (6).
RDF has generated renewed discussion among the Data Management Community (7), and
the new Semantic Web Layer Cake has engendered considerable discussion among the
Ontology Community (8). The upcoming Workshop on RDF Access to Relational
Databases will draw members of the Semantic Web and relational database communities
together to examine commonalities, distinctions and next steps for expressing relational
data in RDF. Consumers and potential consumers of RDF data (e.g. SICoP) will provide
use cases and goals.
The ubiquity of relational data makes it an attractive next target for the Semantic Web.
Much of the data that is used in automation is stored in relational databases. RDF's
grounding in universal terms makes RDF attractive to the relational database community.
Expressing relational data in RDF allows that community to join relational data with data in other
databases or in other forms. Despite the ubiquity and utility of relational data, connecting
data between databases remains problematic and resource-intensive. Joining data between
independently-developed relational databases requires tedious scripting, data
warehousing, or tailored integration systems. RDF queries accessing multiple relational
databases have shown that RDF can be used to unify independent relational databases
and link in external sources, e.g. documents and data from the Web. The provenance of
relational data exported to RDF can itself be expressed in RDF. All of this data will be
available for access by query and rules languages (6).
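As a rough illustration of how RDF can unify independent relational databases, the sketch below exports two separate in-memory SQLite databases to triples and merges them with a set union. The schemas, the sample values, the `http://example.org/` namespace, and the helper names `export` and `objects` are all hypothetical and are not taken from any EPA or Census system.

```python
import sqlite3

BASE = "http://example.org/"  # hypothetical shared namespace


def export(db, table, key, refs=None):
    """Return the rows of `table` as a set of (subject, predicate, object) triples.

    refs maps a column to (table, key) so its value becomes a URI, not a literal.
    """
    refs = refs or {}
    cur = db.execute(f'SELECT * FROM "{table}"')
    columns = [d[0] for d in cur.description]
    triples = set()
    for values in cur.fetchall():
        row = dict(zip(columns, values))
        subject = f"{BASE}{table}/{key}={row[key]}"
        for column, value in row.items():
            if column in refs:
                ref_table, ref_key = refs[column]
                value = f"{BASE}{ref_table}/{ref_key}={value}"
            triples.add((subject, f"{BASE}{table}#{column}", value))
    return triples


# Two independent databases (hypothetical schemas and values).
sites = sqlite3.connect(":memory:")
sites.execute("CREATE TABLE site (id INTEGER PRIMARY KEY, name TEXT, county TEXT)")
sites.execute("INSERT INTO site VALUES (1, 'Site A', '00001')")

census = sqlite3.connect(":memory:")
census.execute("CREATE TABLE county (fips TEXT PRIMARY KEY, population INTEGER)")
census.execute("INSERT INTO county VALUES ('00001', 1000)")

# Merging is a plain set union, because both exports name counties with the same URIs.
graph = export(sites, "site", "id", {"county": ("county", "fips")}) | export(
    census, "county", "fips"
)


def objects(subject, predicate):
    """Return every object in the merged graph for the given subject and predicate."""
    return [o for s, p, o in graph if s == subject and p == predicate]


# Follow site -> county -> population across the two sources.
site = f"{BASE}site/id=1"
county = objects(site, f"{BASE}site#county")[0]
population = objects(county, f"{BASE}county#population")[0]
```

The join works here only because both exports agree on how a county is named. In practice, agreeing on those identifiers is the hard part, and a real pilot would use an RDF store and SPARQL rather than Python sets.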
In several deployed systems, the tuples in a relation are identified by a URL composed of
the table name and primary key attribute/value pairs. This provides the subject of a set of
triples, each expressing an attribute of that tuple. For these, the predicate is composed
of the table name and the attribute name. The objects are literals to express simple relational
attributes and URI references to express foreign key relationships to other tuples. Foreign
keys to multiple other tuples are simply expressed as repeated attributes (6).
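The mapping described above can be sketched in a few lines of Python. This is a minimal illustration, not the scheme of any particular deployed system: the base URI, the `;`-separated key syntax, the function names, and the sample table and column names are all assumptions made for the example.

```python
from urllib.parse import quote

BASE = "http://example.org/db/"  # hypothetical base URI


def row_uri(table, key):
    """Build a tuple's URI from the table name and primary-key attribute/value pairs."""
    pairs = ";".join(f"{quote(str(k))}={quote(str(v))}" for k, v in sorted(key.items()))
    return f"{BASE}{quote(table)}/{pairs}"


def row_to_triples(table, row, primary_key, foreign_keys=None):
    """Return (subject, predicate, object) triples for one row.

    foreign_keys maps a column name to (referenced table, referenced key column).
    Objects are tagged as ("literal", value) or ("uri", reference).
    """
    foreign_keys = foreign_keys or {}
    subject = row_uri(table, {k: row[k] for k in primary_key})
    triples = []
    for column, value in row.items():
        if value is None:
            continue  # in this sketch a NULL simply produces no triple
        predicate = f"{BASE}{quote(table)}#{quote(column)}"
        if column in foreign_keys:
            ref_table, ref_column = foreign_keys[column]
            obj = ("uri", row_uri(ref_table, {ref_column: value}))
        else:
            obj = ("literal", value)
        triples.append((subject, predicate, obj))
    return triples


# Hypothetical row from an "indicator" table whose topic_id references "topic".
row = {"id": 7, "name": "example indicator", "topic_id": 1}
triples = row_to_triples("indicator", row, ["id"], {"topic_id": ("topic", "id")})
for triple in triples:
    print(triple)
```

A column that refers to several other tuples would yield one triple per referenced tuple, all with the same predicate. The sketch does not model that case.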
2. Data Asset Inventory
The table below, from Part 2 (2) of A New Enterprise Data Management Strategy for the
US EPA, summarizes the data asset database.
Category              Concept   Indicator  Metadata       Instance  Data Tables  Provenance  Provenance
Topic                 Question  Name       Data Source /  Exhibit   Elements /   Agency:     Agency:
                                           Quality        Titles    Attributes   EPA         Non-EPA
Air                   3         27         27             58        71           23          4
Water                 7*        18         18             41        48           9           9
Land                  5         12         12             25        31           5           7
Human Health          3         18         18             39        49           0           18
Ecological Condition  5*        11         11             22        22           3           8
Total: 5              23        86         86             185       221          40          46
* One question without an indicator.
It is significant to note that more than half the indicators are from non-EPA agencies, so
data sharing and reuse are critical to EPA’s mission of reporting on the State of the
Environment.
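The "more than half" claim can be checked directly from the summary table. The short calculation below is a sketch that assumes the per-topic EPA and non-EPA counts are assigned to topics as the column totals (40 and 46) and the indicator counts imply.

```python
# Per-topic counts read from the summary table:
# (indicators, EPA-sourced indicators, non-EPA-sourced indicators).
counts = {
    "Air": (27, 23, 4),
    "Water": (18, 9, 9),
    "Land": (12, 5, 7),
    "Human Health": (18, 0, 18),
    "Ecological Condition": (11, 3, 8),
}

indicators = sum(n for n, _, _ in counts.values())
epa = sum(e for _, e, _ in counts.values())
non_epa = sum(x for _, _, x in counts.values())

# Each topic's indicators split exactly into EPA and non-EPA sources.
assert all(n == e + x for n, e, x in counts.values())

share = non_epa / indicators
print(f"{non_epa} of {indicators} indicators are non-EPA ({share:.1%})")
# prints "46 of 86 indicators are non-EPA (53.5%)"
```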
3. Data Table Integration
The individual data tables with their elements and attributes (recall Section 2) were
compiled into 5 multi-sheet spreadsheets (Microsoft Excel), one for each of the 5 topics
in the 2007 EPA Report on the Environment. The multi-sheet spreadsheet for “water” is
shown below for the index (table of contents) and the Exhibit 5-2 indicator data tables.
4. Recommendations
Part 1 (1) outlined “A New Enterprise Data Management Strategy for the US EPA”, Part 2
(2) showed a knowledgebase that contains an inventory of the data, and Part 3 shows how
to integrate data tables within and across categories in the inventory of the data assets.
This position paper for the upcoming W3C Workshop on RDF Access to Relational
Databases (6) makes Government data tables and relational databases (e.g. LandView 6
and 7 on DVD) (1) readily accessible for Semantic Web pilots.
5. References
(1) A New Enterprise Data Management Strategy for the US EPA, August 15, 2007.
Word: http://colab.cim3.net/file/work/SICoP/EPADRM3.0/BNiemann08152007.doc
PowerPoint: http://colab.cim3.net/file/work/SICoP/2007-11-06/BNiemann11062007.ppt
LandView 6 and 7 (in process): http://landview.census.gov
(2) A New Enterprise Data Management Strategy for the US EPA - Part 2: Inventory of
Data Assets, August 29, 2007.
Word: http://colab.cim3.net/file/work/SICoP/EPADRM3.0/BNiemann08292007.doc
(3) Metatopia 2007, November 5-7, 2007, Hosted by Data Management Association of
the National Capital Region.
Home Page: http://www.wilshireconferences.com/metatopia/index.html
Agenda: http://www.wilshireconferences.com/metatopia/agenda.html
Authors Abstract: http://www.wilshireconferences.com/metatopia/Sessions/e1.html
(4) Toward More Transparent Government Workshop on eGovernment and the Web,
United States National Academy of Sciences, Washington DC, USA, June 18-19, 2007,
W3C and Web Science Research Initiative.
http://www.w3.org/2007/06/eGov-dc/agenda.html
(5) Current Semantic Web Layer Cake.
http://www.w3.org/2007/03/layerCake.png
http://ontolog.cim3.net/forum/ontolog-forum/2007-07/msg00256.html
(6) W3C Workshop on RDF Access to Relational Databases, October 25-26, 2007,
Boston, MA, USA. http://www.w3.org/2007/03/RdfRDB/cfp
(7) Discussion of RDF at the Data Management Discuss Discussion Board, August 22,
2007. Some highlights: it is an old idea in a new web form; it has more flexibility than a
table model; it has a lot of practical problems that the industry is working out; and vendors
that have adopted it will do just fine if it does turn out that “RDF eats tables for lunch”.
http://tech.groups.yahoo.com/group/dm-discuss/
(8) Ontolog Forum Discussion of the Current Semantic Web Layer Cake, July 28, 2007,
to the present
http://ontolog.cim3.net/forum/ontolog-forum/2007-07/index.html
http://ontolog.cim3.net/forum/ontolog-forum/2007-08/index.html