AST_2011_Final_Project_Write_Up

SWQP: Semantic Web Quality Portal
Jin Guang Zheng, Ping Wang
Tetherless World Constellation, Rensselaer Polytechnic Institute, Troy, New York, USA
TWC-TR#14
ABSTRACT
When solving water quality problems in environmental science, scientists typically need to collect data
from various sources and analyze these data.
This process can be complex and time consuming. In this
paper, we present the Semantic Water Quality Portal (SWQP), a
web portal powered by semantic web technologies for water
related information discovery and analysis.
SWQP collects and integrates water related data from various sources, performs automatic reasoning and analysis
on the collected data, and presents the analyzed results in a user friendly interface. SWQP demonstrates
that semantic web technologies can ease the difficulty and complexity of solving water related problems.
More specifically, SWQP supports the following features that
are enabled by semantic web technologies: 1. Data
provenance information in a structured format using Proof
Markup Language (PML), together with provenance based
reasoning, 2. Automatic inference and reasoning
using the Web Ontology Language (OWL), 3. Visualization of water related data using SPARQL and Google
Visualization tools.
1
INTRODUCTION
Water quality problems have been a major concern for environmental scientists as well as local citizens. People have
devoted a tremendous amount of effort to solving these
problems. Identifying possibly polluted water sources, the pollutants in those sources, and the possible polluters responsible for the
pollution are a few of the problems that people are interested in.
To monitor and control water quality, the authorities1,2 have
been collecting data about water quality, pollutants, etc. for
years, and have set up regulations to identify polluted water
sources. With this tremendous amount of data, the process
of identifying polluted water sources, pollutants, etc. can
be complex and time consuming even for trained professionals, not to mention local citizens. Furthermore,
citizens or scientists may be interested in viewing the trend
of contaminants in a water source to gain insight into
the water source.
1. http://www.epa.gov/
2. http://www.usgs.gov/
Motivating Example: Imagine that children in a local area
start getting sick, with vomiting as a symptom. The parents suspect that something is wrong with the drinking water, so they contact the authorities and ask them to
check the water sources. The authorities then collect data
from water samples taken from the water sources, and also obtain data
from authorized agencies such as the Environmental Protection Agency (EPA) and the U.S. Geological Survey (USGS),
along with regulation data from the state. The authorities then
analyze the collected data. Finally, they
report the results to citizens and take further action. In
this use case, collecting data from various sources and performing the analysis (identifying polluted water sources, pollutants, etc.) can require both domain knowledge and significant human effort and time.
In this paper we propose the Semantic Water Quality Portal
(SWQP), a water quality portal enabled by semantic web technologies
for identifying polluted water sources, pollutants, and
possible sources of pollution. In SWQP, we have enabled the following features based on semantic web technologies: 1.
Provenance based data selection, integration, and reasoning,
2. OWL based automatic inference and reasoning, 3. Google
visualization over RDFized water related data.
2
METHODS
2.1
System Overview
The system architecture of SWQP is illustrated in Figure 1.
There are five major components in this system: 1. Data
Conversion, 2. Ontology, 3. Provenance, 4. Jena-Pellet
based reasoner, 5. Front-end interface and visualization.
Data Conversion Component: Two converters are
used in the system. The first is a general converter, which can convert any data in CSV format to
RDF format. The second is an ad hoc converter for
SWQP, which converts regulation data from PDF and
HTML formats to CSV format.
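As a rough illustration of the general CSV to RDF step, the following minimal sketch turns each CSV row into N-Triples. It is not the converter used in SWQP: the namespace `http://example.org/swqp/`, the column names, and the sample values are all assumptions made for this example, and column names are assumed to be safe to use inside a URI.

```python
import csv
import io

# Hypothetical namespace for illustration; not the portal's actual URIs.
BASE = "http://example.org/swqp/"


def csv_to_ntriples(csv_text, id_column):
    """Emit one subject per CSV row and one triple per non-empty cell."""
    triples = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        subject = f"<{BASE}record/{row[id_column]}>"
        for column, value in row.items():
            if column == id_column or value == "":
                continue
            predicate = f"<{BASE}prop/{column}>"
            # Escape backslashes and quotes so the literal stays well-formed.
            literal = value.replace("\\", "\\\\").replace('"', '\\"')
            triples.append(f'{subject} {predicate} "{literal}" .')
    return triples


# Made-up sample rows; the column names are assumptions.
sample = "site_id,contaminant,value_mg_per_l\nS1,Arsenic,0.02\nS2,Arsenic,0.004\n"
for line in csv_to_ntriples(sample, "site_id"):
    print(line)
```

A real converter would additionally type the literals and map columns to ontology terms rather than to generated property names.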
Luciano, J et al.
ETST_2011_Luciano_Joanne_A1
The visualization component has two parts: a trend visualization part
and a map visualization part. Details are discussed in Section
2.3.
2.2
Figure 1, SWQP System Architecture. This figure illustrates
how the SWQP components work together.
Ontology Component: In SWQP, we designed a core
regulation ontology. When data are converted to RDF [1]
format, we encode the data using the ontology, so that
we can perform reasoning on the data we collected. The
ontology itself is designed and encoded using OWL 2 [2].
A subset of the ontology is illustrated in Figure 2. Details are
discussed in Section 2.2.
In SWQP, we designed two types of ontology: a core EPA
ontology3 and regulation ontologies4. The core ontology is
called the EPA Ontology. It consists of
18 classes, 4 object properties, and 10 data properties, and imports several existing ontologies such as the SWEET ontology [3] and the time ontology [4]. It models complex relationships (e.g. subclass, disjointness) between classes such
as water sources, facilities, measurements, and contaminants, as illustrated in Figure 2. For example, a polluted
water source is modeled as the intersection of water source and
the class of instances that have a measurement over a threshold. With
this modeling, we are able to perform automatic reasoning such as “any water source that has a measurement over a certain threshold is a polluted water source”.
Besides the core EPA ontology, we also designed regulation
ontologies. The number of concepts and properties varies,
since each state can have a different number of water regulations. The main purpose of a regulation ontology is to model
the state and EPA regulation data. For example, in California, the state regulation data defines 0.01 mg/l as the threshold
for Arsenic. This information is encoded in the regulation
ontology. Combining the regulation ontology and the core ontology, we are able to perform reasoning over water
sources such as “any water source that contains more than 0.01 mg/l of Arsenic is a polluted water source.”
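The effect of combining a regulation threshold with the "polluted water source" definition can be sketched in plain Python. This is only an approximation of what an OWL reasoner would infer, not the ontology itself. The 0.01 mg/l Arsenic threshold for California comes from the text; the class and function names, the sample measurements, and the choice of a strict "greater than" comparison are assumptions.

```python
from dataclasses import dataclass

# Threshold from the text (California, Arsenic); the dictionary layout is illustrative.
THRESHOLDS_MG_PER_L = {("California", "Arsenic"): 0.01}


@dataclass
class Measurement:
    water_source: str
    contaminant: str
    value_mg_per_l: float


def polluted_sources(measurements, regulation):
    """Return the water sources with at least one measurement over its threshold."""
    polluted = set()
    for m in measurements:
        limit = THRESHOLDS_MG_PER_L.get((regulation, m.contaminant))
        # Assumption: "over a threshold" is read as strictly greater than the limit.
        if limit is not None and m.value_mg_per_l > limit:
            polluted.add(m.water_source)
    return polluted


# Made-up measurements for illustration only.
samples = [
    Measurement("Source A", "Arsenic", 0.02),
    Measurement("Source B", "Arsenic", 0.005),
]
print(polluted_sources(samples, "California"))
```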
2.3
Figure 2, subset of SWQP ontology
Provenance Component: Our provenance component captures two levels of provenance information: Data Level Provenance and Application Level
Provenance. Details are discussed in Section 2.4.
Back-end Reasoner Component: We also built a back-end
reasoner using Jena and Pellet. In SWQP, the reasoner
performs OWL 2 reasoning using our ontologies over the data we collected from various sources to determine polluted water
sources and polluting facilities.
Visualization Component: This component is responsible
for mashing up and presenting the data we collected from
various sources in a meaningful way.
Ontology and Reasoning
Visualization
In SWQP, we built two types of visualization to better present
the analyzed results and the water quality data we collected.
As mentioned above, the visualization contains two parts:
map visualization and trend visualization.
Map visualization: After the back-end reasoner finishes
analyzing whether a water source has been polluted or a
facility has violated any regulation, the map visualization
displays the results on Google Maps: each facility or
water source is presented on the map using
different markers to identify the type of the site (polluted
water source, facility, etc.). Figure 3 shows an example of
map visualization.
3. http://tw2.tw.rpi.edu/zhengj3/owl/epa.owl
4. http://tw2.tw.rpi.edu/zhengj3/owl/
The second level is application level provenance. The main
purpose of this level of provenance data is to provide an explanation to the user when a water source is marked as polluted
or a facility is marked as a violating facility, and to direct the
user to the quantitative data we used in our inference and
reasoning. For example, if the user selects a polluted water
source in the map visualization, a window pops up
and provides an explanation to the user.
3
DISCUSSION
Figure 3, Map Visualization
Trend visualization: This part of the visualization
presents the water quality data related to
the selected water site or facility as time series. With this
feature, the user can observe and analyze the trends of the
water quality data over time. For example, a user may be
interested in seeing how the amount of Arsenic in a water source
changes over time.
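Preparing such a time series amounts to selecting the measurements for one site and contaminant and ordering them by date. The sketch below shows only that selection step, under the assumption that each measurement is a dictionary with `site`, `contaminant`, `date` and `value` keys. The record layout and the sample values are invented for illustration, and the charting step itself is left out.

```python
from datetime import date


def build_series(measurements, site, contaminant):
    """Return (date, value) points for one site and contaminant, oldest first."""
    points = [
        (m["date"], m["value"])
        for m in measurements
        if m["site"] == site and m["contaminant"] == contaminant
    ]
    return sorted(points, key=lambda point: point[0])


# Made-up sample data for illustration only.
measurements = [
    {"site": "S1", "contaminant": "Arsenic", "date": date(2010, 6, 1), "value": 0.012},
    {"site": "S1", "contaminant": "Arsenic", "date": date(2009, 6, 1), "value": 0.008},
    {"site": "S2", "contaminant": "Arsenic", "date": date(2010, 6, 1), "value": 0.003},
]
series = build_series(measurements, "S1", "Arsenic")
for day, value in series:
    print(day.isoformat(), value)
```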
3.1
Data
The data sources of our portal span several government agencies, such as EPA, USGS, and federal and
state regulation agencies.
EPA Data: We get the permit compliance and enforcement
status of facilities regulated by the National Pollutant Discharge Elimination System (NPDES) under the Clean Water
Act5 (CWA) from ICIS-NPDES6, which is one of the EPA
information management systems. The compliance and enforcement status of a facility contains measurements of pollutants in the water discharged by the facility, and also the
threshold values for up to 5 test types for each pollutant.
USGS Data: We also fetch the National Water Information
System7 (NWIS) water quality data provided by USGS. The
NWIS water quality data gives measurements of substances
contained in water collected at USGS data-collection
stations.
Figure 4, Trend Visualization
2.4
Provenance
In SWQP, we generate two levels of provenance data. The
first level is data level provenance: when data are converted
to RDF using our data conversion component, we inject
provenance information about the data sources using PML [5].
These provenance data are used to support provenance
based data queries. For example, if a user is interested in
applying EPA regulations to the data collected from the USGS,
the data level provenance we generate makes it possible
to identify the sources of the collected data and query the appropriate data.
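The idea of a provenance based query can be sketched as follows: every record carries a tag naming its source, and the user picks which data source and which regulation source to combine. This is a simplified stand-in for PML metadata and a query over it, not the SWQP implementation. The record layout, the `StateX` regulation, and all numeric values are hypothetical.

```python
# All records and limits below are illustrative; only the idea of tagging
# each record with its source comes from the text.
measurements = [
    {"site": "W1", "contaminant": "Arsenic", "value": 0.015, "source": "USGS"},
    {"site": "F1", "contaminant": "Arsenic", "value": 0.020, "source": "EPA"},
    {"site": "W2", "contaminant": "Arsenic", "value": 0.004, "source": "USGS"},
]
regulations = [
    {"contaminant": "Arsenic", "limit": 0.01, "source": "EPA"},
    {"contaminant": "Arsenic", "limit": 0.002, "source": "StateX"},
]


def from_source(records, source):
    """Keep only the records whose provenance tag names the given source."""
    return [r for r in records if r["source"] == source]


def exceedances(measurements, regulations, data_source, regulation_source):
    """Apply the limits from one regulation source to the data from one data source."""
    limits = {
        r["contaminant"]: r["limit"]
        for r in from_source(regulations, regulation_source)
    }
    return [
        m["site"]
        for m in from_source(measurements, data_source)
        if m["contaminant"] in limits and m["value"] > limits[m["contaminant"]]
    ]


# Apply the (illustrative) EPA limits to the USGS measurements only.
print(exceedances(measurements, regulations, "USGS", "EPA"))
```

Switching `regulation_source` to another entry applies a different set of limits to the same data, which is the kind of choice the provenance based query is meant to give the user.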
5. http://www.epa.gov/agriculture/lcwa.html
6. http://www.epa-echo.gov/echo/compliance_report_water_icp.html
7. http://waterdata.usgs.gov/nwis
8. http://water.epa.gov/drink/contaminants/
Regulation Data: The water portal also makes use of water
regulations, which are lists of contaminants and their Maximum Contaminant Levels8 (MCLs). So far, we have encoded the national drinking water regulations from
EPA, and also the state drinking water regulations for California and Massachusetts.
3.2
Semantic Technologies/Claims
Semantic web technologies have been shown to
bring many benefits to various types of applications (e.g.
Semantic MediaWiki [6]) through semantic data integration, semantic query, etc. In SWQP, we have implemented
most of our features using semantic web technologies: ontology based reasoning, data integration, provenance based
query and reasoning, etc.
Claim 1: Semantic data integration helps SWQP integrate data from various sources, and eases
future data integration.
As mentioned above, SWQP integrates data from various
sources, such as EPA, USGS, and state regulation authorities.
The data from these sources are typically in different
formats. This heterogeneous nature of water related data
is one of the major challenges that researchers face when
they need to analyze these data: 1. It is difficult to
query the data and integrate them for a particular use,
2. Data are stored using different schemas, and the semantics of
the terms in different schemas can differ greatly from
each other.
SWQP overcomes these problems caused by heterogeneous data using semantic web technologies: 1. Data from
different sources are converted into RDF format and loaded
into a triple store. Then, we can use SPARQL [7] to fetch
data. 2. When we convert data, we use the EPA ontology as the
central schema to encode the converted data, so that we
have consistent semantics for the converted data.
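The two steps above can be mimicked with a toy in-memory triple list: source-specific field names are mapped to one shared vocabulary, after which a single pattern retrieves matching records regardless of origin. This sketch uses plain Python tuples in place of a real triple store and SPARQL. The field names, the shared property names, and the sample records are assumptions and do not reflect the actual EPA or USGS schemas.

```python
# Hypothetical field names for two sources; the shared property names
# stand in for terms of a central ontology.
FIELD_MAP = {
    "EPA": {"facility": "site", "pollutant": "contaminant", "amount": "value"},
    "USGS": {"station": "site", "substance": "contaminant", "result": "value"},
}


def to_triples(source, record_id, record):
    """Encode one source record as (subject, property, value) triples in the shared vocabulary."""
    mapping = FIELD_MAP[source]
    subject = f"{source}:{record_id}"
    return [(subject, mapping[field], value) for field, value in record.items()]


def match(triples, s=None, p=None, o=None):
    """Return the triples matching a pattern; None acts as a wildcard."""
    return [
        t for t in triples
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    ]


# Made-up records for illustration only.
store = []
store += to_triples("EPA", 1, {"facility": "Plant A", "pollutant": "Arsenic", "amount": 0.02})
store += to_triples("USGS", 7, {"station": "Creek B", "substance": "Arsenic", "result": 0.004})

# One pattern now finds Arsenic records from both sources.
arsenic_records = sorted(t[0] for t in match(store, p="contaminant", o="Arsenic"))
print(arsenic_records)
```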
Another benefit of semantic data integration is that it is much
easier to import data in the future. If we import more heterogeneous data, it will be difficult to use
other technologies to describe and store the data, since the
schemas in those technologies are typically fixed and difficult to
alter. With semantic web technology, all we need
to do is change the ontology by adding a few equivalence
statements or new properties and classes.
Claim 2: Automatic inference and reasoning supported by
semantic web technologies help SWQP perform automatic analysis of water quality, etc.
One of the most important tasks in water related problems is
analyzing the data. However, given the amount of data
consumed in the analysis process, this task
can be very complex and time consuming. For example, to
determine whether a water source is polluted, we need to
compare all measurements of all contaminants with the corresponding limits in the adopted water regulations. As mentioned in the Ontology and Reasoning
section, the ontology we designed allows us to state rules such
as “any water source that has a measurement over a certain
threshold is a polluted water source” and “0.01 mg/l
of Arsenic is a threshold”. Combining
these rules, reasoning over the ontology will produce “any water source that contains more than 0.01 mg/l of Arsenic is a polluted water source.”
Based on these inference rules, SWQP can perform automatic analysis to identify types of water sources, etc.
Claim 3: Provenance information encoded with semantic web
technologies helps SWQP gain trust from users.
The primary use of SWQP is to identify polluted water
sources and polluting facilities in the region specified by the
user. However, the answers from SWQP are not likely
to be trusted by users if it does not give them the
option to examine how the answers were reached. As pointed
out in [8], knowledge provenance, which includes source
identification, source authoritativeness, and deductive proof
traces, can be used to provide understandable explanations to
users.
The provenance support in SWQP is multifold. The
source meta-information is captured and encoded in PML
during the data collection stage. With the source meta-information, we not only enable users to identify the
source of the water quality data and water regulations, but
also provide users with provenance based queries. For example, if a
New York resident who has just moved from California thinks
that the California water regulations are stricter and can
identify water pollution better, he or she can choose to apply
the California water regulations to the New York water quality
data.
Besides source identification, we also provide proof for the
answers given by SWQP. Each polluted water source or
facility in the map visualization is accompanied by a link
that displays the trends of the water quality data. Users can easily check whether the reported pollution is
correct by observing the water quality data visualized
as time series.
To give users a more complete understanding of the
answers, we would like to provide deductive proof traces for
the answers in the near future.
3.3
Evaluation
There are no standard benchmarks we can use to evaluate
our approach, so we design our own evaluation approach.
We are considering evaluating SWQP by answering the following
questions: 1. How easy is it to deploy SWQP to analyze
water quality in other states? 2. How easy is it for users to use
SWQP to obtain analyzed results? 3. How easy is it to deploy SWQP to solve other environment related problems?
These evaluations may require human studies, in which we
invite people to use the system by performing certain tasks
(e.g. identifying the number of polluted water sources in their home
town) and answering various types of questions (e.g. on a scale of
1 to 5, how would you rate the system with respect to your experience?).
4
RELATED WORK
Three areas of work are related to SWQP:
semantic web portals, water quality ontologies, and provenance.
There is a diverse literature on semantic web portal systems
[9], for example, LOGD [10], Semantic Web Portal [11],
SEAL [12], and HealthFinland [13]. As web portals, these
systems provide different functionalities for users to interact
with the data, such as integrating data [10][11][12], visualizing data [10][11], and searching data [12][13]. Our work
differs from these systems in two aspects: 1. SWQP provides
automatic inference and reasoning, 2. SWQP captures provenance information and provides provenance related functionalities.
Water and environment related ontology research and development have long been of interest to the community. A considerable number of ontologies have been developed to
describe water [14][15] and environment related data [3].
The ontologies developed by Chau [14] and Parekh [15]
describe water data such as quality and contaminants for
simulation purposes. SWEET [3] is a more general ontology
that models and describes the environment we live in. The
ontology developed for SWQP aims to model water
quality, pollution, and related data for reasoning and for identifying
pollution. In this ontology, we also import some terms
from the SWEET ontology to describe and model some water
related information.
There has also been a considerable amount of research
on semantic provenance, especially in the field of e-Science.
myGrid [16] proposes the COHSE open hypermedia system,
which generates, annotates and links provenance data to
build a web of provenance documents, data, services and
workflows for experiments in biology. The Multi-Scale
Chemical Science [17] (CMCS) project develops a general-purpose infrastructure for collaboration across many disciplines. It also contains a provenance subsystem for tracking,
viewing and using data provenance. The survey in [18] presents a taxonomy of the provenance techniques used in e-science projects.
Provenance is critical in e-science, because users are not
likely to trust the results of scientific experiments if the
provenance of the results cannot be identified. Similarly, an
environmental portal also needs to support provenance to
gain trust from users. Currently, SWQP provides provenance information at both the data source level and the application level. Furthermore, SWQP provides provenance
based data query as discussed in Section 2.4. We expect this provenance
support to help SWQP gain the trust of users. In the
future, we could borrow the approaches used by the above
e-science systems to develop our own provenance subsystem. Alternatively, we could make use of an existing provenance infrastructure like [19], which supports the extraction,
maintenance and usage of the provenance of answers given by
web applications and services.
5
FUTURE WORK
In the near future, we would like to improve SWQP by supporting more interesting reasoning
and inference, and by providing complete explanations of the answers
given by SWQP via the provenance knowledge captured.
Inference and Reasoning: There are two interesting kinds of reasoning that could be supported by SWQP: 1. Health Effect Reasoning,
2. Flood Effect Reasoning.
Health Effect Reasoning: Drinking water from polluted
water sources may result in serious health problems, as
discussed in our motivating use case. By modeling these
health effects and reasoning over them, we will be
able to provide valuable solutions to some interesting problems. For example, in our motivating use case, if we are
able to model and infer what kinds of pollution cause people
to start vomiting, we will be able to identify more quickly the
polluted water source that causes the health problems.
Flood Effect Reasoning: In the water quality context, if a polluted water source floods, it may pollute nearby water
sources and therefore cause serious pollution problems. If
we can identify which water sources may be affected by the
source of the flood, we are in a better position to prevent or
alleviate this. Furthermore, if the flood source is a polluted
water source, we can predict the effects of the flood on
nearby water sources by semantic reasoning.
Knowledge Provenance: To provide users with a more complete explanation of the answers given by SWQP, we would
like to support building, linking and displaying proof traces
that track how the answers are derived from the source data.
Our proof traces would include all the manipulations the
data go through: downloading from the data source, conversion into RDF via the converters, loading into the triple store,
reasoning by the reasoner, and presentation with the presentation
tools. We would also like to support provenance granularity,
so that users can choose the granularity of the provenance they prefer.
6
CONCLUSION
In this paper, we presented the Semantic Water Quality Portal, a
web portal for identifying polluted water sources, pollutants,
etc. In this portal, we use semantic web technologies to
provide 1. Provenance data about water sources, with support for
provenance based data retrieval and reasoning, 2. OWL
based reasoning and inference to identify polluted water
sources, pollutants, etc., 3. Data visualization to present the
trend of the amount of contaminants in water sources.
As discussed in the previous section, in the near future we will work on
supporting more interesting reasoning and inference, such as health effect reasoning and reasoning over flood effects, and on providing understandable explanations of the analyzed results via provenance knowledge.
ACKNOWLEDGEMENTS
The authors would like to thank Evan Patton for his help with
use case development and various technical support,
and Tim Lebo for his help with the data conversion process.
TEAM CONTRIBUTIONS:
Ping:
• Use Tim’s converter to convert EPA and USGS
data.
• Preprocess regulation data to CSV format.
• Implement the data visualization part of the project.
• Write part of this final class write-up, and present
the visualization part of the demo.
Jin:
• Write the script to convert data to RDF format encoded
using the ontology.
• Design the ontology to support automatic reasoning
and inference.
• Re-implement the Jena-Pellet based back-end reasoner.
• Class related work: since this project is Ping’s out
of class project, I am responsible for most of the
project related write-up, presentation, etc.
REFERENCES
[1] Manola, F., Miller, E., McBride, B., (2004): RDF Primer,
<http://www.w3.org/TR/rdf-syntax/>
[2] Hitzler, P., Krotzsch, M., Parsia, B., Patel-Schneider, P., Rudolph,
S., (2009): OWL 2 Web Ontology Language Primer,
<http://www.w3.org/TR/owl2-primer/>
[3] Raskin, R., & Pan, M. (2005). Knowledge representation in the
semantic web for Earth and environmental terminology
(SWEET). Computers & Geosciences, 31(9), 1119-1125.
[4] Hobbs, J., Pan, F., (2006): Time Ontology in OWL,
<http://www.w3.org/TR/owl-time/>
[5] McGuinness, D., Silva, P., Ding, L., (2007): Proof Markup
Language (PML) Primer,
<http://inference-web.org/2007/primer/>
[6] Krotzsch, M., Vrandecic, D., Volkel, M., (2006) :
Semantic MediaWiki, ISWC
[7] Prud’hommeaux, E., Seaborne, A., (2008): SPARQL
QUERY LANGUAGE FOR RDF,
<http://www.w3.org/TR/rdf-sparql-query/>
[8] Pinheiro, P., Mcguinness, D. L., & Mccool, R. (2003):
Knowledge Provenance Infrastructure, in IEEE Data
Engineering Bulletin, vol. 26,
[9] Lausen, H., Ding, Y., Stollberg, M., Fensel, D., Hernandez, R.,
and Han, S. (2005): Semantic web portals: state-of-the-art
survey. Journal of Knowledge Management, vol. 9(5),
pp. 40--49
[10] Difranzo, D., Ding, L., Erickson, J. S., Li, X., Lebo, T., Michaelis,
J., et al. (2010). TWC LOGD: A Portal for Linking
Open Government Data. In: Semantic Web Challenge,
International Semantic Web Conference 2010.
[11] Ding, Y., Sun, Y., Chen, B., Borner, K., Ding, L., Wild, D.,
Wu, M., DiFranzo, D., Fuenzalida, A., Li, D., Milojevic,
S., Chen, S., Sankaranarayanan, M., Toma, I., (2010):
Semantic Web Portal: A Platform for Better Browsing
and Visualizing Semantic Data. International Conference
on Active Media Technology
[12] Maedche, A., Staab, S., Stojanovic, N., & Studer, R. (2001).
SEAL - A Framework for Developing SEmantic portALs.
In Proceedings of the 18th British National Conference
on Databases
[13] Suominen, O., Hyvönen, E., Viljanen, K. and Hukka, E.,
(2009): HealthFinland - A National Semantic Publishing
Network and Portal for Health Information. Web
Semantics: Science, Services and Agents on the World
Wide Web, 7(4), pp. 287--297
[14] Chau, K., (2007): An Ontology-based knowledge management
system for flow and water quality modeling.
Advances in Engineering Software. 38(3), 172-181.
[15] Parekh V., (2005): Applying Ontologies and Semantic Web
technologies to Environmental Sciences and Engineering. Master's Thesis, University of Maryland, Baltimore
County
[16] Zhao Jun, Goble Carole, and Stevens Robert.(2004):
Semantically linking and browsing provenance logs for
e-science. In Proc. of the 1st International Conference on
Semantics of a Networked World, Lecture Notes in
Computer Science, Paris, France.
[17] J. Myers, C. Pancerella, C. Lansing, K. Schuchardt, and B.
Didier, (2003): Multi-Scale Science, Supporting
Emerging Practice with Semantically Derived Provenance, in ISWC workshop on Semantic Web Technologies for Searching and Retrieving Scientific Data.
[18] Simmhan, Y. L., Plale, B., & Gannon, D. (2005). A
survey of data provenance in e-science. ACM SIGMOD
Record, 34(3), 31-36.
[19] Mcguinness, D. L., and Pinheiro, P. (2004). Explaining
answers from the semantic web: The inference web
approach. Journal of Web Semantics, 1, 397-413.