Towards Semantically-enabled Exploration and
Analysis of Environmental Ecosystems
Ping Wang, Linyun Fu, Evan W. Patton, Deborah L. McGuinness
Tetherless World Constellation
Rensselaer Polytechnic Institute
Troy, USA
{wangp5, ful2, pattoe, dlm}@cs.rpi.edu
Joshua Dein, Sky Bristol
U.S. Geological Survey
City, USA
{sbristol, fjdein}@usgs.gov
Abstract—We aim to provide a broad and deep range of decision
support tools for resource managers who need to examine large
complex ecosystems and make recommendations in the face of
many tradeoffs and conflicting drivers. We take a semantic
technology approach, leveraging background ontologies and the
growing body of open linked data. In previous work, we designed
and implemented a semantically-enabled environmental
monitoring framework called SemantEco and used it to build a
water quality portal named SemantAqua. In this work, we
significantly extend SemantEco to include knowledge required to
support resource decisions concerning endangered species and
their habitats. Our previous system included foundational
ontologies for modeling environmental regulation violations and
relevant human health effects. Our enhanced framework includes
foundational ontologies to support modeling of wildlife
observation and wildlife health impacts, thereby enabling deeper
and broader support for large ecosystem analysis in the face of
environmental pollution. Our results include a refactored and
expanded version of the SemantEco portal. Additionally, the
updated system is now compatible with the emerging best-in-class
Extensible Observation Ontology (OBOE). A wider range of
relevant data has been integrated, focusing on additions
concerning wildlife health. The resulting system stores and
exposes provenance concerning where the data came from, how it
was used, and also the rationale for choosing the data. In this
paper, we describe the system, highlight its research
contributions, and describe current and envisioned usage.
Keywords-Semantic Web; Semantic Environmental Informatics;
Provenance; Ecological Data Integration
I. INTRODUCTION
In many places around the world, wildlife and the
habitats on which they depend are deteriorating. For instance,
almost 40 percent of the United States’ freshwater fish species
are considered at risk or vulnerable to extinction according to
[1]. Aiming at preserving the environment and wildlife,
scientists and resource managers have initiated various efforts
to monitor ecological and environmental trends, investigate
causes and possible effects of pollution, and identify threats to
wildlife and their habitats [2].
Meanwhile, information technology experts have been
building environmental information systems and using
information technologies to improve access to information
concerning ecological and environmental data. In previous
work [3], [4], we proposed the Tetherless World Constellation
Semantic Ecology and Environment Portal (SemantEco) as
both an environment portal application and as an example of a
semantic infrastructure for environmental informatics
applications. In this paper, we extend the focus of SemantEco
beyond water quality and related health effects to a more
comprehensive effort including endangered species and their
related health effects. This extension broadens the focus to
an ecosystem perspective, with an emphasis on supporting
resource managers as they attempt to make decisions about
more complex ecosystems.
To realize these extension goals, make the portal more
reusable and extensible, and possibly lower barriers to adoption
by environmental and observational communities, several issues
needed to be addressed. The challenges fell into a number of
high-level categories, including terminology, data integration,
provenance, and scalability.
Terminology: In our previous iterations of SemantAqua 1
and SemantEco, we built our SemantEco ontology family
driven by use-case-generated requirements. This
approach worked well in that it yielded a relatively small
ontology that directly met our application needs. We
modularized the ontologies according to domain (thus there
is a water module 2, an air module 3, a general
contaminant-threshold layer 4, etc.). The basic ontology
structure has held up to a number of extensions, and one of its
nice properties is the relative simplicity of pollution detection
based on regulations. However, it was built by lay people with
respect to observations and environmental data. Initially, we did
not want to adopt larger environmental ontologies, since their
breadth was more than was needed and some lacked depth. Now,
however, as we move to a setting where more breadth is useful
because larger ecosystems are considered, and where we hope to
engage with environmentalists, it has become useful to make
connections to ontologies that are already familiar in
environmental communities interested in scientific observation
data.
1 http://aquarius.tw.rpi.edu/projects/semantaqua/
2 http://escience.rpi.edu/ontology/2/0/water.owl#
3 http://escience.rpi.edu/ontology/2/0/air.owl#
4 http://escience.rpi.edu/ontology/semanteco/2/0/pollution.owl#
Data Integration: Quality wildlife observation and
environmental data are becoming increasingly available on the
web. However, the data are often difficult to access, particularly
when users are interested in understanding the data well enough
to integrate data from multiple sources and use them in their own
applications. While some data sets and services are doing a
better job at providing some documentation, the documentation
is usually in natural language, possibly separated from and not
in sync with the data, or even incomplete. The semantics of the
data are not explicitly captured, and thus it is often difficult for
humans, let alone computers, to understand or reason over the
data.
Scalability: Our initial efforts focused on a few U.S. states'
worth of data. When we scaled up to include data that covered
the entire United States just for water quality and facilities
monitoring, we reached over 5 billion triples. As the project
now expands to endangered species and use cases begin to
address more frequent observations, the scale of the raw data
expands. Querying and reasoning over large collections of data
can be time consuming, thus we are exploring techniques that
allow the portal to scale while still doing relevant reasoning in
a timely manner.
Provenance: Our portal has always captured provenance
concerning where data was retrieved from, when the retrieval
happened, and any manipulations that were done to the data.
Further, it has used that information to provide some
provenance-aware features that allow access, for example, to
summary views that use only some resources or reason against
particular regulations. Our collaborator Sky Bristol from USGS
however asked for additional provenance – capturing the
rationale for data or data services choices as well as
manipulation choices. As pointed out by Sky, rationale that
explains the choices made during data manipulation is helpful
for portal users to obtain a deeper understanding of the data
integration. Additionally, the rationale can be invaluable in
helping data and data service providers determine the metadata
and service characteristics they should provide to see increased
usage of their services.
We use semantic technologies to respond to these
challenges. Our updated system is compatible with the
Extensible Observation Ontology (OBOE) [5], which aims to
support interoperable observation data. Our ontology
family complements OBOE by providing modules aimed at
supporting environmental monitoring and potential correlations
to observation data. Additionally, we integrated various
ecological and environmental data: wildlife observation data
from the Avian Knowledge Network (AKN 5) and the U.S.
Geological Survey (USGS 6); environmental criteria for wildlife
from the Environmental Protection Agency (EPA 7); water body
data from USGS; and health effects of contaminants on wildlife
from Wildpro 8. Our approach provides a formal encoding of
the semantics of the data and provides services for automatic
reasoning and visualizations over the data. Furthermore, we
compared the performance of a standard reasoner with our
customized rule-based reasoner over our data. Lastly, we
enhanced our provenance support by incorporating rationale as
provenance.
5 http://www.avianknowledge.net
6 http://www.usgs.gov/
7 http://www.epa.gov/
8 http://wildpro.twycrosszoo.org
In this paper, Sections 2 and 3 elaborate how semantic web
technologies have been used to extend and improve the portal,
including the extension for wildlife monitoring, connecting to
OBOE, capturing rationale as provenance, and reasoner
comparison. Section 4 reviews related work. Section 5
discusses impacts, highlights, and future directions. Section 6
presents the conclusion.
II. EXTENSION FOR WILDLIFE MONITORING
A. Use Case
The USGS provides integrated science and methodology to
support the Wyoming Landscape Conservation Initiative
(WLCI9): an effort to assess and enhance aquatic and terrestrial
habitats at a landscape scale in southwest Wyoming. One
vision that the USGS team has for WLCI is to produce a
decision support system for resource managers that facilitates
examination of the many tradeoffs and conflicting drivers at
work in the focus area, from energy and agricultural
development to fish and wildlife conservation. Our
USGS collaborators are interested in building both the decision
support system for Wyoming and the infrastructure that
supports the building of such systems for other states with
semantic science and technologies.
To this end, we designed the following use case to identify
necessary extensions to the portal for wildlife monitoring:
The resource manager chooses a geographic region of
interest by entering a zip code and the species of
concern in the species facet. The portal identifies
polluted water sources and polluting facilities, and
visualizes the results on a map using different icons.
Meanwhile, the portal displays the distribution of the
species in the region. Then, the resource manager views
the map to find out if the selected species might be
endangered by water pollution in the region. The
resource manager can click on polluting facilities or
polluted sites to investigate more about the pollution,
e.g. the health effects of the pollution on the species.
To realize the use case, we enhanced our ontology for
modeling the domain of wildlife observation, integrated
wildlife observation data according to linked data principles
[6], and developed visualizations to present the data and
provenance.
B. Ontology for Wildlife Monitoring
There are a number of existing ontologies for modeling and
publishing RDF data about species descriptions [7], [8]. After
reviewing these ontologies, we chose to reuse the Geospecies
ontology for modeling the domain of wildlife monitoring, as it
contains most of the concepts required by the use case. For
example, the Geospecies ontology defines the classes
Observation and SpeciesConcept, and links the two classes with
the object properties hasObservation and hasSpecies.
However, the Observation class from Geospecies does not
capture the observed habitat of the wildlife or the date of the
observation. Thus, in the extension 10, we introduce a new class
WildlifeObservation, which is a subclass of the Observation
class from Geospecies, enhanced with two properties:
hasHabitat and hasDate. A subset of our ontology extension is
illustrated in Fig. 1.
9 http://www.wlci.gov/
Figure 1. Subset of wildlife ontology
C. Wildlife Data Integration
1) Source Data
Bird observation data: One major source of bird
observation data is AKN, which is an international effort of
government and non-government institutions to understand the
patterns and dynamics of bird populations across the Western
Hemisphere. We obtain a subset of the eBird Reference Dataset
(3.0) [9] from AKN via its database query interface. The
dataset is based on reported observations from novice and
experienced bird observers and contains count data for bird
species, the location where the observation took place, and the
time the observation started.
Fish observation data: The National Fisheries Data
Infrastructure brings together local and regional fisheries
information systems and provides fisheries managers and
decision-makers with one source of comprehensive data and
information on fish species 12. We fetch fish observation data
from its query interface. The fish dataset includes the species
name, the hydrological unit code (HUC) 13 of the watersheds
where the fish species is observed, the date of the observation,
and the originating database.
Regulation Data: We integrate EPA's compilation of
national recommended water quality criteria [10], which is
presented as a summary table containing recommended water
quality criteria for the protection of aquatic life and human
health in surface water for approximately 150 contaminants.
Water Body Data: The USGS National Hydrography
Dataset (NHD) services are used to get HUC codes given
locations on the map 14, and the data for the water body
shapes 15.
Health Data: We obtain the health effects on wildlife from
the research effort Wildpro, an electronic encyclopedia and
library for wildlife.
Each contaminant can cause different adverse effects on
different wildlife species. For example, when exposed to
excessive zinc concentrations, mallards exhibited leg paralysis
and decreased food consumption, while invertebrates exhibited
decreased growth rate and increased mortality [2].
To help researchers investigate health impacts of
contaminants on wildlife species, we refactored our ontologies
to include the class HealthEffect to model the potential health
effects of overexposure to contaminants 11. The property
isCausedBy establishes the relationship between each effect
and its causing contaminant, and the property forSpecies links
each effect with its target species.
10 http://escience.rpi.edu/ontology/semanteco/3/0/wildlife.owl
11 http://escience.rpi.edu/ontology/semanteco/3/0/wildlifehealtheffect.owl
12 http://ecosystems.usgs.gov/fisheriesdata/querybystate.aspx
13 HUC8 is an 8-digit hydrological unit code identifying a sub-basin area
of size around 700 square miles. See
http://en.wikipedia.org/wiki/Hydrological_code
2) Data Conversion
The general-purpose csv2rdf4lod tool provides us with the
capability of quick and easy data integration [11]. We provide
the converter with declarative parameters that map properties
of the raw data to terms defined in ontologies. For example, the
field “Common Name” in the eBird dataset is mapped to the
property geospecies:hasCommonName 16. Using this mapping,
the converter is able to generate RDF triples compatible with our
ontologies from the raw tabular data provided by our data
sources. We use the same regulation ontology design and
conversion workflow as in our previous work [4] to map the
rules in wildlife regulations to OWL [12] classes.
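The column-to-property mapping described above can be illustrated with a minimal sketch. It does not reproduce csv2rdf4lod or its parameter format; it only shows the idea of turning tabular fields into triples through a declarative mapping. Only the "Common Name" mapping comes from the text; the "Observation Count" column, the example.org namespaces, and the function names are hypothetical.

```python
import csv
import io

GEOSPECIES = "http://rdf.geospecies.org/ont/geospecies#"
# Placeholder namespace, assumed for illustration only.
WILDLIFE = "http://example.org/wildlife#"

# Declarative mapping from raw column names to ontology properties.
# Only the "Common Name" entry is taken from the paper.
COLUMN_TO_PREDICATE = {
    "Common Name": GEOSPECIES + "hasCommonName",
    "Observation Count": WILDLIFE + "hasObservationCount",
}


def escape_literal(value):
    """Escape backslashes and double quotes for an N-Triples literal."""
    return value.replace("\\", "\\\\").replace('"', '\\"')


def rows_to_ntriples(csv_text, base_uri):
    """Emit one plain-literal triple per mapped, non-empty cell."""
    triples = []
    reader = csv.DictReader(io.StringIO(csv_text))
    for index, row in enumerate(reader, start=1):
        subject = "<%sobservation/%d>" % (base_uri, index)
        for column, predicate in COLUMN_TO_PREDICATE.items():
            value = row.get(column)
            if value:
                triples.append(
                    '%s <%s> "%s" .' % (subject, predicate, escape_literal(value))
                )
    return triples


# Invented sample row, not taken from the eBird dataset.
SAMPLE = "Common Name,Observation Count\nCanada Goose,12\n"

for line in rows_to_ntriples(SAMPLE, "http://example.org/"):
    print(line)
```

Keeping the mapping in a data structure rather than in code mirrors the declarative style the paragraph describes: supporting a new field means adding one mapping entry.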
3) Data Visualization
We support two types of visualizations: (1) map
visualization that displays the sources of the water pollution
together with species habitat in the context of geographic
regions and (2) time series visualization that depicts species
count over time with respect to a particular geographic region.
The map visualization gets the sources of the water
pollution from the back-end reasoner and the species habitats
by querying the triple store. We visualize clean and polluted
water sources, and clean and polluting facilities with different
markers. In the extension, we focus on waterfowl and fish, and
visualize water bodies that are their habitats. We then highlight
the water bodies with a different color. When the user clicks
one of the highlighted water bodies, the information about the
water body and the provenance of the information are shown in
the "Water Body Properties" tab of the pop-up window. When
the user clicks on a polluted site, a pop-up window shows
more details about the pollution: names of contaminants,
measured values, limit values, time of measurement, and health
effects on the species. Fig. 2 shows an example of our map
visualization. In the example, the portal applies EPA's water
quality criteria for aquatic life to the region with the zip code
98103 (Seattle, WA) and identifies a polluting site that is close
to bird habitats.
14 http://services.nationalmap.gov/ArcGIS/rest/services/nhd/MapServer
15 http://nhd.usgs.gov/data.html
16 http://rdf.geospecies.org/ont/geospecies#
The time series visualization retrieves species count data
within a particular geographic region by querying the triple
store and displays the data as a time series using the d3.js
library. Fig. 3 shows the count of Canada geese in
Washington state in 2007.
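The grouping step behind such a time series can be sketched as follows. This is only an illustration of summing counts per month for one species, state, and year. The record layout, field names, and sample counts are invented, and the portal itself obtains the data by querying the triple store.

```python
from collections import defaultdict


def monthly_totals(observations, species, state, year):
    """Sum observation counts per month for one species, state, and year."""
    totals = defaultdict(int)
    for obs in observations:
        if (obs["species"], obs["state"], obs["year"]) == (species, state, year):
            totals[obs["month"]] += obs["count"]
    # Sorted (month, total) pairs are convenient input for a plotting library.
    return sorted(totals.items())


# Invented sample records, not real eBird counts.
SAMPLE = [
    {"species": "Canada Goose", "state": "Washington", "year": 2007, "month": 1, "count": 4},
    {"species": "Canada Goose", "state": "Washington", "year": 2007, "month": 1, "count": 3},
    {"species": "Canada Goose", "state": "Washington", "year": 2007, "month": 2, "count": 5},
    {"species": "Mallard", "state": "Washington", "year": 2007, "month": 1, "count": 9},
    {"species": "Canada Goose", "state": "Oregon", "year": 2007, "month": 1, "count": 2},
]

print(monthly_totals(SAMPLE, "Canada Goose", "Washington", 2007))
```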
Figure 2. Map Visualization
Figure 3. Time Series Visualization
III. EXTENSION OF THE FRAMEWORK
To make the portal more reusable and extensible, and possibly
to lower barriers to adoption by environmental and observational
communities, we upgraded the portal in three respects: connecting
to a more general ontology family, enhancing provenance support,
and investigating reasoning performance.
A. Connect to OBOE
In our previous iterations of SemantAqua and SemantEco, we
built our SemantEco ontology family driven by
use-case-generated requirements; the resulting ontologies
are lightweight and directly meet our application needs. In
contrast, OBOE targets generic scientific observation and
measurement and serves as a convenient basis for adding
semantic annotations to scientific data. Thus, the SemantEco
ontology and OBOE differ in several ways, as follows.
1. With the SemantEco ontology, each observation record is
modeled using the class Measurement. OBOE contains the
additional class "Observation", and measurements are tied to the
corresponding observations.
2. With the SemantEco ontology, each observation record
generates only one Measurement entity as the subject, and all
fields of the observation record are directly linked to the
generated Measurement entity. According to OBOE, one
observation record can generate multiple Observation and
Measurement entities. For example, both the measured value
and the measurement date are modeled as observations that
contain measurements, and the measurement date is connected
to the measured value using the predicate "hasContext".
3. While the SemantEco ontology captures fields like
measurement value, unit, and date using datatype properties,
OBOE models them using entities. For instance, the unit mg/L
is encoded as oboe:MilligramPerLiter 17.
To address the differences, we incorporated an adapter
ontology from our SemantEco ontology family to OBOE and
developed a data converter for encoding the water observation
data according to OBOE. We refer to the previous version of
our SemantEco ontologies as version 2 and the new version as
version 3.
The updated data modeling is compatible with OBOE, except
for measurement values. We cannot encode measurement
values as entities, since our regulation ontology maps the rules
in regulations to OWL classes 18 and encodes allowable ranges
of regulated characteristics via numeric range restrictions on
datatype properties. Whether an observational item implies a
pollution event is reflected by whether the item is a member of
any class mapped from the regulations. Measurement values
must be encoded as numerical data to enable such
reasoning. So to capture the measurement value, instead of the
object property oboe:hasValue, we use the data property
hasNumbericValue, which is defined in an adapter ontology 19
from our SemantEco ontology family to OBOE. Fig. 4 and Fig. 5
depict the data representation of one example water observation
record generated according to SemantEco versions 2 and 3.
Figure 4. Measurement in OBOE
Figure 5. Measurement in our pollution ontology
We model a polluted site as something that is both a
measurement site and a polluted thing. The different observation
data models lead to different models of polluted things. In
version 2, a polluted thing is defined as something that has at
least one "measurement" that violates a regulation. With
version 3, a polluted thing is modeled as something that has at
least one observation of a regulation violation. An observation
of a regulation violation is modeled as an observation that has at
least one "measurement" that violates a regulation. Fig. 6 gives
the updated model of a polluted site.
Figure 6. Updated model of a polluted site.
It is necessary to change the regulation encoding to reflect
the updated model. We replaced some properties from our
ontology with similar properties defined in OBOE, e.g.
pol:hasCharacteristic 20 became oboe:ofCharacteristic. In
addition, we promoted some properties from data properties to
object properties. For example, pol:hasUnit originally had a
range of string and has been replaced by oboe:usesStandard
with a range of class oboe:Standard. The changes are relatively
minor, and our regulation converter supported them with little
effort.
We investigated the reasoning implications of using OBOE
for interpreting data. The data set we use contains 3517 water
measurements taken from 588 USGS monitoring sites in Kent
County, Rhode Island (county code: 003, state code: 44). Our
reasoner uses the Pellet OWL reasoner [13] together with the
Jena Semantic Web Framework [14] to reason over the data
and ontologies to answer the queries for polluted sites. Query
(1) is under SemantEco version 2, and query (2) is under
version 3. Identifying polluted sites over the dataset takes
63263.4 ms with version 3, versus 22477.2 ms with version 2,
roughly 2.8 times longer. Each time is an average over 5
executions of the query. This shows the tradeoff between
interoperability and reasoning performance: while OBOE brings
greater interoperability for the portal, it incurs longer
reasoning time.
select distinct ?pollutedSite
where{
?violation a pol:RegulationViolation.
?violation pol:hasSite ?pollutedSite. } (1)
select distinct ?pollutedSite
where{
?observation a oboe-pol:ObservationOfRegulationViolation.
?observation oboe-core:hasContext ?context.
?context oboe-core:ofEntity oboe-pol:SpatialLocationEntity.
?context oboe-core:hasMeasurement ?locationMea.
?locationMea oboe-core:hasValue ?pollutedSite. } (2)
Reasoning performance was tested on a Linux VServer
running Ubuntu 10.04 sharing its Linux 2.6.32 kernel with an
Ubuntu 10.04 host. The physical hardware was a Dell
PowerEdge R510 configured with a quad core Intel Xeon
E5620 with hyper-threading operating at 2.4 GHz and 32 GB
of 1333 MHz dual ranked memory. Tests using Pellet were
conducted using the 64-bit Java 1.6.0_26 runtime environment
(sun-java6-jre, version 6.26-1lucid1), and the virtual machine
was configured with 1 GB min heap size and 2 GB max heap
size.
17 http://ecoinformatics.org/oboe/oboe.1.0/oboe.owl#
18 e.g., http://purl.org/twc/ontology/swqp/region/ny; others are listed at
http://purl.org/twc/ontology/swqp/region/
19 http://escience.rpi.edu/ontology/semanteco/3/0/oboe-pollution.owl#
20 http://escience.rpi.edu/ontology/semanteco/2/0/pollution.owl#
B. Rationale as provenance
During the construction of an information portal, we make
various choices: which data sources to fetch data from, which
tools to use to integrate the data, what manipulations to conduct
over the data, etc. As pointed out by Sky Bristol from USGS, the
rationale that explains how we make these choices is very
important information. Rationale helps portal users obtain a
thorough understanding of the portal, and facilitates the
maintenance and reuse of the portal. Users would be more likely
to have confidence in the portal if the rationale behind the
construction of the portal and the presentation of the data is
acceptable to them. If other portal builders are interested in
reusing the architecture or workflow of the portal, they can
more easily decide whether to reuse a dataset or software agent
when given the rationale for why we selected that dataset or
software agent.
To encode rationale as provenance, we extend the Proof
Markup Language (PML) 2 [15]. PML 2 is a modular
explanation interlingua and contains three ontologies that focus
on three types of explanation metadata: provenance,
information manipulation or justifications, and trust. We
introduce the property pmlp:hasRationale, whose domain is the
class Identified Thing and whose range is String. Identified
things can be information, languages, and sources (including
organizations, persons, agents, and services). With
hasRationale, we can provide the rationale for why we chose to
adopt the identified things in simple text.
By extending the scope of provenance to include rationale,
we are able to capture important information that would
otherwise be lost when the original builders of the
portal leave the project. In Table I, we give three examples of
the rationale captured.
TABLE I. EXAMPLES OF THE RATIONALE CAPTURED
Identified thing | Type | Rationale
USGS | Data Organization | It is an authoritative government agency for science about the Earth, its resources, and the environment.
NWIS Dataset | Dataset | It is distributed via web services and can be accessed periodically with automated means.
csv2rdf4lod | Software | The open source tool provides a quick and easy way to convert tabular data into well-structured RDF. We have direct support from the author of the tool, who is our lab mate.
C. Compare the performance of different reasoners
Because they compute the full forward-chaining closure,
standard reasoners such as Pellet are much slower than general
rule reasoners loaded with only the necessary rules. For
example, a standard reasoner for the RDF 21 language would
include the following rule (encoded with Jena rule syntax 22):
 [rdf1: (?uuu ?aaa ?yyy) -> (?aaa rdf:type rdf:Property)]
Although this rule ensures the fulfillment of the semantics of
RDF, it is not useful in the query answering task of our
system. There are 14 such rules embedded in a standard RDFS
reasoner 23 and even more for an OWL-DL reasoner like Pellet.
We avoid the invocation of these rules to boost query
answering efficiency. For example, on the data set with 3517
measurements taken from 588 USGS monitoring sites in Kent
County, Rhode Island (county code: 003, state code: 44), a
specifically tailored rule reasoner with only rule (3) below
takes 1242 ms to answer query (4) below, while the Pellet
reasoner takes 64620 ms to answer the same query, roughly 52
times longer. The experiment was performed on a laptop with an
Intel Pentium P6000 CPU (1.86 GHz x 2, 3 MB L3 cache) and
4 GB DDR3 memory, running the 64-bit Windows 7 operating
system.
[Chloride: (?x pol:hasValue ?v) ge(?v, 10.0) (?x
pol:hasCharacteristic pol:Chloride) (?x repr:hasUnit 'mg/l')
-> (?x rdf:type pol:ExcessiveChlorideMeasurement)] (3)
select ?s ?x ?v where {
?x a pol:ExcessiveChlorideMeasurement.
?x pol:hasValue ?v.
?x pol:hasSite ?s.
} (4)
21 http://www.w3.org/TR/2004/REC-rdf-mt-20040210/#RDFRules
22 http://jena.sourceforge.net/inference/#RULEsyntax
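To make the effect of rule (3) and query (4) concrete, the following minimal sketch applies the same three conditions (characteristic is Chloride, unit is mg/l, value at least 10.0) to in-memory records and returns the site, measurement, and value for each match. It is not the Jena-based reasoner used in the experiment; the dictionary layout, identifiers, and sample values are hypothetical.

```python
# Hypothetical in-memory records standing in for RDF measurements.
MEASUREMENTS = [
    {"id": "m1", "site": "siteA", "characteristic": "Chloride", "unit": "mg/l", "value": 12.5},
    {"id": "m2", "site": "siteB", "characteristic": "Chloride", "unit": "mg/l", "value": 3.0},
    {"id": "m3", "site": "siteC", "characteristic": "Zinc", "unit": "mg/l", "value": 40.0},
]


def is_excessive_chloride(measurement, threshold=10.0):
    """Mirror the body of rule (3): Chloride, in mg/l, with value >= threshold."""
    return (
        measurement["characteristic"] == "Chloride"
        and measurement["unit"] == "mg/l"
        and measurement["value"] >= threshold
    )


def excessive_chloride_rows(measurements):
    """Mirror query (4): return (site, measurement id, value) for each match."""
    return [
        (m["site"], m["id"], m["value"])
        for m in measurements
        if is_excessive_chloride(m)
    ]


print(excessive_chloride_rows(MEASUREMENTS))
```

Only the conditions that the query depends on are evaluated here, which is the intuition behind the speed-up reported above: a tailored rule set skips entailments, such as rdf1, that never contribute to the answer.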
IV. RELATED WORK
In the ecology and environmental communities, there have been
research efforts that facilitate domain knowledge integration
via semantic approaches. These research projects focus on
different fields of ecological and environmental science. OBOE
focuses on encoding generic scientific observation and
measurement [5]. GeoSpecies is an effort for enabling species
data to be linked together as part of the Linked Data network
[7]. Chen et al. proposed a prototype system that integrates
water quality data from multiple sources [16]. As the goal of
SemantEco is to build a comprehensive ecological and
environmental information system, we need to model
knowledge spanning multiple fields. Thus, we designed a
family of ontologies for encoding water measurement, species
observation and the health effects of pollution on species and
utilize the ontologies for data integration and reasoning.
eScience can benefit from provenance for a number of
reasons. For example, provenance provides a context for data
interpretation and enables one to evaluate reliability of
experiment results and replicate scientific workflows [17].
Research projects such as myGrid [18] and CMCS [19] have
been conducted to build infrastructure that generates
provenance data and allows users to view and use provenance
data. The provenance support of this work differs from that of
previous projects in that we extend the scope of provenance
support to include rationale.
V. DISCUSSION AND FUTURE WORK
Ecological and environmental information systems benefit
from semantic science and technology in several ways.
Firstly, by encoding the domain knowledge required by the
information system with ontologies, we make the information
system easier to maintain and extend. In SemantEco, we
encode the environmental regulation rules and the health
effects of pollution on wildlife as OWL classes. If one
regulation rule becomes stricter, we only need to update the
OWL class to adopt the stricter threshold value. Similarly,
adding new OWL instances is sufficient for an extension like
introducing the health effect of a new contaminant. In contrast,
if we embedded the domain knowledge in the source code of the
information system, these changes would require us to modify
the source code and re-deploy the system, which is more
costly than changing the ontology files.
23 http://www.w3.org/TR/2004/REC-rdf-mt-20040210/#RDFSRules
Semantic technologies facilitate data integration, which is
common practice in building ecological and environmental
information systems. Converting observation data according to
our wildlife ontology yields a controlled vocabulary across the
datasets. For example, we map the fields "Latitude" and
"Longitude" of the eBird Reference Dataset to the properties
wgs:lat and wgs:long 24.
Resource managers need analysis results from the collected
ecological and environmental data to make informed decisions.
This often involves large amounts of data and the analysis can
require much time and effort to arrive at a decision with
significant impacts. Semantic technologies can be used to
lower the cost and shorten the time required by such
decision-making processes.
We use SPARQL [20] to perform appropriate data
aggregation, which is often used to get an overall
understanding of the datasets and can only be performed
when it is sensible to aggregate the data objects [5]. Not only
does SPARQL enable us to specify the constraints of the data
aggregation, it also supports aggregation functions including
COUNT, SUM, AVG, MIN, and MAX. For example, we
obtain the total counts of "Canada Goose" in Washington state
in 2007 with the SPARQL query as follows. The query result
can be provided in XML or JSON, which would be readily
consumed by different visualization toolkits (e.g. D3.js) to
produce a time series plot. This way, with SPARQL and
visualization toolkits, the portal enables users to review and
interact with the growing data resource in the form of maps and
other visualizations.
PREFIX wildlife:
  <http://www.semanticweb.org/ontologies/2012/2/wildlife.owl#>
PREFIX geospecies:
  <http://rdf.geospecies.org/ont/geospecies#>
SELECT ?month (SUM(?count) AS ?total)
WHERE { ?obv wildlife:hasState "Washington" ;
             geospecies:hasCommonName "Canada Goose" ;
             wildlife:hasYearCollected "2007" ;
             wildlife:hasMonthCollected ?month ;
             wildlife:hasObservationCount ?count . }
GROUP BY ?month                                          (5)
It can be challenging to retrieve, aggregate, and visualize
data that are not encoded in a semantic format. For instance,
the Washington Department of Fish and Wildlife provides species
distribution data in a spreadsheet. To retrieve and aggregate
these data, a resource manager has three options: do it
manually, write ad hoc programs, or write complex Excel macros.
All three options require considerable time and effort from the
resource manager.
The SemantEco portal has many potential directions to explore.
We can use semantic technologies to enable automatic analysis
over ecological and environmental data. For example, we model
"EndangeredSpot" as a place where some animals are reported as
dead or sick, and "CriticalSite" as a location that is both an
"EndangeredSpot" and a "PollutedSite", as shown in (7). Then,
if we feed a semantic reasoner with the ontology and data, it
will automatically identify the critical sites.
:EndangeredSpot rdfs:subClassOf [
  rdf:type owl:Class ;
  owl:intersectionOf ( :SickEventSpot :DeathEventSpot )
] .
:SickEventSpot owl:equivalentClass [
  rdf:type owl:Restriction ;
  owl:onProperty :hasWildlifeEvent ;
  owl:someValuesFrom :WildlifeSickEvent
] .
:DeathEventSpot owl:equivalentClass [
  rdf:type owl:Restriction ;
  owl:onProperty :hasWildlifeEvent ;
  owl:someValuesFrom :WildlifeDeathEvent
] .
:CriticalSite rdfs:subClassOf [
  rdf:type owl:Class ;
  owl:intersectionOf ( :PollutedSite :EndangeredSpot )
] .                                                      (7)
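To make the intended inference concrete, the sketch below mimics
in plain Python the classification a reasoner would be asked to
perform. It is only an approximation under stated assumptions:
the Location record and its fields are hypothetical, and the
sketch follows the prose definition, treating a location as an
"EndangeredSpot" when any sick or death event is reported there.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Location:
    """Hypothetical record for one monitored location."""
    name: str
    polluted: bool = False   # stands in for membership in PollutedSite
    sick_events: int = 0     # number of reported wildlife sick events
    death_events: int = 0    # number of reported wildlife death events


def is_endangered_spot(loc):
    # Prose definition: some animals are reported as dead or sick here.
    return loc.sick_events > 0 or loc.death_events > 0


def is_critical_site(loc):
    # A critical site is both polluted and an endangered spot.
    return loc.polluted and is_endangered_spot(loc)


def critical_sites(locations):
    """Return the names of all locations classified as critical."""
    return [loc.name for loc in locations if is_critical_site(loc)]


if __name__ == "__main__":
    sample = [
        Location("a", polluted=True, sick_events=1),
        Location("b", polluted=True),
        Location("c", death_events=2),
    ]
    print(critical_sites(sample))
```

An OWL reasoner works under the open-world assumption and
derives class membership from axioms rather than from explicit
flags, so this is an analogy for the outcome, not a substitute
for the reasoner.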
Such modeling and reasoning has a constraint: a monitoring site
must have records for both environmental observations and
wildlife health events. However, because environmental quality
and wildlife health are usually monitored at different sites,
the constraint usually does not hold. In such cases, we can
model "CriticalSite" as a "PollutedSite" having at least one
"EndangeredSpot" nearby. We can then get the location
information of each "PollutedSite" and "EndangeredSpot" with
query snippet (8), in which the wgs prefix refers to
http://www.w3.org/2003/01/geo/wgs84_pos, and use a SPARQL
filter to identify each "CriticalSite", as shown in (9).
To identify "EndangeredSpot" instances, the portal needs data
on wildlife health events, which are provided by web monitoring
systems such as the Wildlife Health Event Reporter (WHER; see
footnote 25). We are interested in linking our portal to WHER
to enable the portal to identify "EndangeredSpot" and
"CriticalSite" instances.
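The proximity test that filter (9) expresses can also be
sketched outside SPARQL. The Python below is a minimal
illustration, assuming coordinates are given as (latitude,
longitude) pairs in decimal degrees; the function names, site
names, and coordinates are hypothetical.

```python
def is_nearby(site, spot, delta):
    """True if site lies strictly inside the box of half-width delta around spot.

    site and spot are (lat, long) tuples; the strict inequalities
    mirror the four comparisons in filter (9).
    """
    site_lat, site_long = site
    spot_lat, spot_long = spot
    return (spot_lat - delta < site_lat < spot_lat + delta
            and spot_long - delta < site_long < spot_long + delta)


def nearby_critical_sites(polluted_sites, endangered_spots, delta):
    """Names of polluted sites with at least one endangered spot nearby."""
    return sorted(
        name
        for name, coords in polluted_sites.items()
        if any(is_nearby(coords, spot, delta)
               for spot in endangered_spots.values())
    )


if __name__ == "__main__":
    # Made-up coordinates for illustration only.
    polluted = {"p1": (47.0, -122.0), "p2": (46.0, -120.0)}
    spots = {"s1": (47.05, -122.05)}
    print(nearby_critical_sites(polluted, spots, 0.1))
```

Because delta is applied directly to degrees, the box is not a
fixed ground distance; its east-west extent shrinks at higher
latitudes. That may be acceptable for a coarse "nearby" test but
would need a proper distance computation for precise analysis.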
?pollutedSite wgs:long ?siteLong.
?pollutedSite wgs:lat ?siteLat.
?endangeredSpot wgs:long ?spotLong.
?endangeredSpot wgs:lat ?spotLat.                        (8)
FILTER ( ?siteLat < (?spotLat+"+delta+")
  && ?siteLat > (?spotLat-"+delta+")
  && ?siteLong < (?spotLong+"+delta+")
  && ?siteLong > (?spotLong-"+delta+"))                  (9)
Next, for modeling wildlife habitat, we plan to connect to the
HarmonISA project [21], which provides semantic descriptions of
land-use and land-cover categories. Furthermore, EPA's water
quality criteria [10] provide multiple types of thresholds and
would provide an important facet to complement our existing
datasets. These criteria include measures of acute pollution in
freshwater, chronic pollution in freshwater, acute pollution in
saltwater, and chronic pollution in saltwater. We currently
incorporate thresholds for acute pollution in freshwater for
two reasons: 1) we mainly focus on inland water bodies; 2) acute
pollution can affect both species that live near the polluted
water source and species that pass by the water source
occasionally. To support thresholds for chronic pollution, we
need to consider some additional factors, e.g., the time that
species stay near the polluted water source. We would require
species distribution models from animal experts to model
chronic pollution. Lastly, we plan to enhance our modeling of
rationale as provenance and to collect and integrate more data
on the health effects of pollution on species.
VI. CONCLUSION
We extended our SemantEco portal based on two driving factors:
to facilitate decision support systems for resource managers
and to make the portal more broadly reusable. Our extension
includes: support for wildlife monitoring; connections to OBOE;
integration of wildlife observation data as linked data;
enhanced provenance support through the incorporation of
rationale; and a performance comparison between a standard
reasoner and a customized rule-based reasoner.
REFERENCES
[1] National Fish Habitat Board, “Through a Fish’s Eye: The Status of
Fish Habitats in the United States 2010,” Washington D.C., 2010.
[2] M. S. Schwarz, K. R. Echols, M. J. Wolcott, and K. J. Nelson,
“Environmental contaminants associated with a swine concentrated
animal feeding operation and implications for McMurtrey national
wildlife refuge,” Grand Island, Nebraska, 2004.
[3] P. Wang et al., “A Semantic Portal for Next Generation Monitoring
Systems,” in Proceedings of the 10th International Semantic Web
Conference, 2011, pp. 253-268.
[4] P. Wang, “Semantically Enabling Next Generation Environmental
Informatics Portals,” RPI, 2012.
[5] J. Madin, S. Bowers, M. Schildhauer, S. Krivov, D. Pennington, and
F. Villa, “An ontology for describing and synthesizing ecological
observation data,” Ecological Informatics, vol. 2, no. 3, pp. 279-296,
2007.
[6] C. Bizer, T. Heath, and T. Berners-Lee, “Linked Data - The Story
So Far,” International Journal on Semantic Web and Information
Systems, vol. 5, no. 3, pp. 1-22, 2009.
[7] “GeoSpecies Knowledge Base.” [Online]. Available:
http://lod.geospecies.org. [Accessed: 15-Jan-2009].
[8] L. Dodds and T. Scott, “BBC Ontologies - The Wildlife Ontology,”
2010.
[9] M. A. Munson et al., “The eBird Reference Dataset, Version 3.0,”
Ithaca, NY, 2011.
[10] US EPA, “National Recommended Water Quality Criteria.”
[Online]. Available:
http://water.epa.gov/scitech/swguidance/standards/criteria/current/.
[Accessed: 20-Jun-2012].
[11] T. Lebo and G. T. Williams, “Converting governmental datasets into
linked data,” in Proceedings of the 6th International Conference on
Semantic Systems, 2010, pp. 38:1-38:3.
[12] P. Hitzler, M. Krötzsch, B. Parsia, P. F. Patel-Schneider, and S.
Rudolph, “OWL 2 Web Ontology Language Primer,” W3C
Recommendation 27 October 2009, 2009. [Online]. Available:
http://www.w3.org/TR/owl2-primer/. [Accessed: 05-Mar-2012].
[13] E. Sirin, B. Parsia, B. Cuenca Grau, A. Kalyanpur, and Y. Katz,
“Pellet: A practical OWL-DL reasoner,” Web Semantics: Science,
Services and Agents on the World Wide Web, vol. 5, no. 2, pp. 51-53,
Jun. 2007.
[14] J. J. Carroll, I. Dickinson, C. Dollin, D. Reynolds, A. Seaborne, and
K. Wilkinson, “Jena: implementing the semantic web
recommendations,” in Proceedings of the 13th International World
Wide Web Conference, 2004, pp. 74-83.
[15] D. L. McGuinness, L. Ding, P. P. D. Silva, and C. Chang, “PML 2:
A Modular Explanation Interlingua,” in Proceedings of the AAAI
2007 Workshop on Explanation-aware Computing, 2007, pp. 22-23.
[16] Z. Chen, A. Gangopadhyay, S. Holden, G. Karabatis, and M. Mcguire,
“Semantic integration of government data for water quality
management,” Government Information Quarterly, vol. 24, no. 4, pp.
716-735, Oct. 2007.
[17] Y. L. Simmhan, B. Plale, and D. Gannon, “A survey of data
provenance in e-science,” ACM SIGMOD Record, vol. 34, no. 3, pp.
31-36, Sep. 2005.
[18] J. Zhao, C. Goble, R. Stevens, and S. Bechhofer, “Semantically
Linking and Browsing Provenance Logs for E-science,” in
Proceedings of the 1st International Conference on Semantics of a
Networked World, 2004, vol. 3226, pp. 158-176.
[19] J. Myers, C. Pancerella, C. Lansing, K. Schuchardt, and B. Didier,
“Multi-scale science: supporting emerging practice with
semantically derived provenance,” in ISWC 2003 Workshop on
Semantic Web Technologies for Searching and Retrieving Scientific
Data, 2003.
[20] E. Prud’hommeaux and A. Seaborne, “SPARQL Query Language
for RDF,” W3C Recommendation, 2008. [Online]. Available:
http://www.w3.org/TR/rdf-sparql-query/. [Accessed: 06-Mar-2012].
[21] “HarmonISA - Harmonisation of Land-Use Data.” [Online].
Available: http://harmonisa.uni-klu.ac.at/. [Accessed: 20-Jun-2012].
25 http://www.whmn.org/wher/