SWQP: Semantic Web Quality Portal
Jin Guang Zheng, Ping Wang
Tetherless World Constellation, Rensselaer Polytechnic Institute, Troy, New York, USA
TWC-TR#14

ABSTRACT
When solving water quality problems in environmental science, scientists typically need to collect data from various sources and analyze them. This process can be complex and time consuming. In this paper, we present the Semantic Water Quality Portal (SWQP), a web portal powered by semantic web technologies for discovering and analyzing water related information. SWQP collects and integrates water related data from various sources, performs automatic reasoning and analysis on the collected data, and presents the analyzed results in a user friendly interface. SWQP demonstrates that semantic web technologies can reduce the difficulty and complexity of solving water related problems. More specifically, SWQP supports the following features enabled by semantic web technologies: 1. Providing data provenance information in a structured format using the Proof Markup Language (PML), together with provenance based reasoning, 2. Supporting automatic inference and reasoning using the Web Ontology Language (OWL), 3. Supporting visualization of water related data using SPARQL and Google Visualization tools.

1 INTRODUCTION
Water quality problems have been a major concern for environmental scientists as well as local citizens, and people have devoted a tremendous amount of effort to solving them. Identifying possibly polluted water sources, the pollutants in those sources, and the possible polluters are a few of the problems that people are interested in. To monitor and control water quality, the authorities (footnotes 1, 2) have been collecting data about water quality, pollutants, etc. for years, and have set up regulations to identify polluted water sources. With this tremendous amount of data, the process of identifying polluted water sources, pollutants, etc. can be complex and time consuming even for trained professionals, not to mention local citizens. Furthermore, citizens or scientists may be interested in viewing the trend of contaminants in a water source to gain insight into that water source.

1. http://www.epa.gov/
2. http://www.usgs.gov/

Motivation Example: Imagine that children in a local area start getting sick, with vomiting as a symptom. The parents suspect that something is wrong with the drinking water, so they contact the authorities and ask them to check the water sources. The authorities will then collect data from water samples taken from the water sources, and also obtain data from authorized agencies such as the Environmental Protection Agency (EPA) and the U.S. Geological Survey (USGS), as well as regulation data from the state. The authorities will then analyze the collected data, report the results to citizens, and take further action. In this use case, collecting data from various sources and performing the analysis (identifying polluted water sources, pollutants, etc.) could require both domain knowledge and significant human effort and time.

In this paper we propose the Semantic Water Quality Portal (SWQP), a water quality portal enabled by semantic web technologies for identifying polluted water sources, pollutants, and possible sources of pollution. In SWQP, we have enabled the following features based on semantic web technologies: 1. Provenance based data selection, integration, and reasoning, 2. OWL based automatic inference and reasoning, 3. Google visualization over RDFized water related data.

2 METHODS
2.1 System Overview
The system architecture of SWQP is illustrated in Figure 1. There are five major components in this system: 1. Data Conversion, 2. Ontology, 3. Provenance, 4. Jena-Pellet based reasoner, 5. Front-end interface and visualization.

Data Conversion Component: There are two converters used in the system.
One of the converters is a general converter, which can convert any data in CSV format to RDF format. The other is an ad-hoc converter for SWQP, which converts regulation data from PDF and HTML formats to CSV format.

Luciano, J et al. ETST_2011_Luciano_Joanne_A1

Figure 1, SWQP System Architecture. This figure illustrates how the SWQP components work together.

Ontology Component: In SWQP, we designed a core regulation ontology. When data are converted to RDF [1], we encode them using the ontology, so that we can perform reasoning over the data we collected. The ontology itself is designed and encoded using OWL 2 [2]. A subset of the ontology is illustrated in Figure 2. Details are discussed in Section 2.2.

Figure 2, subset of SWQP ontology

Provenance Component: We capture two levels of provenance information using our provenance component: Data Level Provenance and Application Level Provenance. Details are discussed in Section 2.4.

Back-end Reasoner Component: We also built a back-end reasoner using Jena and Pellet. In SWQP, the reasoner performs OWL 2 reasoning using our ontologies over the data we collected from various sources to determine polluted water sources and polluting facilities.

Visualization Component: This component is responsible for mashing up and presenting the data we collected from various sources in a meaningful way. It has two parts: trend visualization and map visualization. Details are discussed in Section 2.3.

2.2 Ontology and Reasoning
In SWQP, we designed two types of ontology: a core EPA ontology (footnote 3) and regulation ontologies (footnote 4). The core EPA ontology consists of 18 classes, 4 object properties, and 10 data properties, and imports several existing ontologies such as the SWEET ontology [3] and the time ontology [4]. It models complex relationships (e.g. subclass, disjointness) between classes such as water sources, facilities, measurements, and contaminants, as illustrated in Figure 2. For example, a polluted water source is modeled as the intersection of a water source and an instance that has a measurement over a threshold. Using this modeling, we can perform automatic reasoning such as “any water source that has a measurement over a certain threshold is a polluted water source”.

Besides the core EPA ontology, we also designed regulation ontologies. The number of concepts and properties varies, since each state can have a different number of water regulations. The main purpose of a regulation ontology is to model the state and EPA regulation data. For example, in California, the state regulation data defines 0.01 mg/l as the threshold for Arsenic. This information is encoded in the regulation ontology. Combining the regulation ontology and the core ontology, we can perform reasoning over water sources such as “any water source that contains 0.01 mg/l of Arsenic is a polluted water source.”

3. http://tw2.tw.rpi.edu/zhengj3/owl/epa.owl
4. http://tw2.tw.rpi.edu/zhengj3/owl/

2.3 Visualization
In SWQP, we built two types of visualization to better present the analyzed results and the water quality data we collected: map visualization and trend visualization.

Map visualization: After the back-end reasoner finishes analyzing whether a water source has been polluted or a facility has violated any regulation, the map visualization displays the results on a Google Map. Each facility or water source is shown with a different marker that identifies the type of the site (polluted water source, facility, etc.). Figure 3 shows an example of map visualization.
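The rule that a polluted water source is a water source with a measurement over a threshold, and the choice of a map marker by site type, can be sketched in a few lines of Python. This is a minimal illustration only, not SWQP's implementation (the portal performs this step with OWL 2 reasoning in Jena and Pellet). The names `Site`, `exceeds_threshold`, and `marker_for`, the marker labels, and the sample readings are all hypothetical, and treating "over a threshold" as a strict comparison is an assumption. Only the 0.01 mg/l Arsenic limit comes from the text.

```python
from dataclasses import dataclass, field

# Threshold taken from the California example in the text (mg/l).
THRESHOLDS = {"Arsenic": 0.01}


@dataclass
class Site:
    """A water source or facility with its measurements (hypothetical model)."""
    name: str
    kind: str  # "water_source" or "facility"
    measurements: dict = field(default_factory=dict)  # contaminant -> mg/l


def exceeds_threshold(site: Site, thresholds: dict) -> bool:
    """True if any measurement is strictly over its threshold (assumed semantics)."""
    return any(
        contaminant in thresholds and value > thresholds[contaminant]
        for contaminant, value in site.measurements.items()
    )


def marker_for(site: Site, thresholds: dict) -> str:
    """Pick a marker label for the map; the labels are illustrative only."""
    over = exceeds_threshold(site, thresholds)
    if site.kind == "water_source":
        return "polluted_water_source" if over else "clean_water_source"
    return "violating_facility" if over else "facility"


# Made-up sample sites and readings, for illustration only.
sites = [
    Site("Well A", "water_source", {"Arsenic": 0.02}),
    Site("Well B", "water_source", {"Arsenic": 0.005}),
    Site("Plant C", "facility", {"Arsenic": 0.03}),
]
markers = {site.name: marker_for(site, THRESHOLDS) for site in sites}
```

In an ontology-based system the same decision is derived by the reasoner from class definitions rather than written as procedural code. The sketch only makes the intended outcome of that inference concrete.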
Figure 3, Map Visualization

Trend visualization: This part of the visualization displays the water quality data related to the selected water site or facility as time series. With this feature, the user can observe and analyze trends in the water quality data over time. For example, a user may be interested in seeing how the amount of Arsenic in a water source changes over time.

Figure 4, Trend Visualization

2.4 Provenance
In SWQP, we generate two levels of provenance data. The first level is data level provenance: when data are converted to RDF using our data conversion component, we inject provenance information about the data sources using PML [5]. These provenance data are used to support provenance based data query. For example, if a user is interested in applying EPA regulations to the data collected from USGS, the data level provenance we generate makes it possible to identify the sources of the collected data and query the appropriate data.

The second level is application level provenance: the main purpose of this level of provenance data is to provide an explanation to the user when a water source is marked as polluted or a facility is marked as violating, and to direct the user to the quantitative data we used in our inference and reasoning. For example, if the user selects a polluted water source in the map visualization, a window pops up and provides an explanation.

3 DISCUSSION
3.1 Data
The data sources of our portal span several government agencies, such as EPA, USGS, and federal and state regulation agencies.

EPA Data: We obtain the permit compliance and enforcement status of facilities regulated by the National Pollutant Discharge Elimination System (NPDES) under the Clean Water Act (CWA; footnote 5) from ICIS-NPDES (footnote 6), which is one of the EPA information management systems. The compliance and enforcement status of facilities contains measurements of pollutants in the water discharged by the facilities, as well as the threshold values for up to 5 test types for each pollutant.

USGS Data: We also fetch the National Water Information System (NWIS; footnote 7) water quality data provided by USGS. The NWIS water quality data give measurements of substances contained in water collected at USGS data-collection stations.

Regulation Data: The water portal also makes use of water regulations, which are lists of contaminants and their Maximum Contaminant Levels (MCLs; footnote 8). For now, we have encoded the national level drinking water regulations from EPA, and also the state drinking water regulations for California and Massachusetts.

5. http://www.epa.gov/agriculture/lcwa.html
6. http://www.epa-echo.gov/echo/compliance_report_water_icp.html
7. http://waterdata.usgs.gov/nwis
8. http://water.epa.gov/drink/contaminants/

3.2 Semantic Technologies/Claims
Semantic web technologies have been shown to benefit various types of applications (e.g. Semantic MediaWiki [6]) through semantic data integration, semantic query, etc. In SWQP, we have implemented most of our features using semantic web technologies: ontology based reasoning, data integration, provenance based query and reasoning, etc.

Claim 1: Semantic data integration helps SWQP integrate data from various sources, and eases the process of future data integration.

As mentioned above, SWQP integrates data from various sources, such as EPA, USGS, and state regulation authorities. The data from these sources are typically in different formats. This heterogeneous nature of water related data is one of the major challenges that researchers face when they need to analyze these data: 1. It is difficult to query these data and integrate them for a particular use, 2.
Data are stored using different schemas, and the semantics of terms in different schemas can differ greatly. SWQP overcomes these problems caused by heterogeneous data using semantic web technologies: 1. Data from different sources are converted into RDF format and loaded into a triple store, so that we can use SPARQL [7] to fetch data, 2. When we convert data, we use the EPA ontology as the central schema to encode the converted data, so the converted data have consistent semantics.

Another benefit of semantic data integration is that it is much easier to import data in the future. If we import more heterogeneous data, it would be difficult to describe and store them with other technologies, whose schemas are typically fixed and difficult to alter. With semantic web technology, all we need to do is change the ontology by adding a few equivalence statements or new properties and classes.

Claim 2: Automatic inference and reasoning supported by semantic web technologies helps SWQP perform automatic analysis of water quality, etc.

One of the most important tasks in water related problems is analyzing the data. However, given the amount of data consumed in the analysis process, this task can be very complex and time consuming. For example, to identify whether a water source is polluted, we need to compare all measurements of all contaminants with the corresponding limits in the adopted water regulations. As mentioned in the Ontology and Reasoning section, the ontology we designed allows us to state rules such as “any water source that has a measurement over a certain threshold is a polluted water source” and “any measurement that contains 0.01 mg/l of Arsenic is a threshold”. Combining these rules, the ontology produces “any water source that contains 0.01 mg/l of Arsenic is a polluted water source.” Based on these inference rules, SWQP can perform automatic analysis to identify types of water sources, etc.

Claim 3: Provenance information encoded with semantic web technology helps SWQP gain trust from users.

The primary use of SWQP is to identify polluted water sources and polluting facilities in the region specified by the user. However, the answers from SWQP are not likely to be trusted by users if SWQP does not give them the option to examine how the answers were reached. As pointed out in [8], knowledge provenance, which includes source identification, source authoritativeness, and deductive proof traces, can be used to provide understandable explanations to users.

The provenance support in SWQP is multifold. The source meta-information is captured and encoded in PML during the data collection stage. With the source meta-information, we not only enable users to identify the source of the water quality data and water regulations, but also provide users with provenance based query. For example, a new New York resident who has just moved from California and thinks that the California water regulations are stricter and identify water pollution better could choose to apply the California water regulations to the New York water quality data.

Besides source identification, we also provide proof for the answers given by SWQP. Each polluted water source or facility in the map visualization is accompanied by a link from which the trends of the water quality data are displayed. Users can easily check whether the reported pollution is true or false by observing the water quality data visualized as time series. To give users a more complete understanding of the answers, we would like to provide deductive proof traces for the answers in the near future.
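The provenance based query described above, in which one authority's regulations are applied to data from a chosen source, can be sketched as follows. This is a hedged illustration rather than SWQP's code: in SWQP the source information is encoded in PML and the data are fetched with SPARQL. The record layout, the function `polluted_sites`, the station names, the readings, and the "ExampleLenient" regulation set with its limit are invented for illustration. Only the California Arsenic limit of 0.01 mg/l comes from the text, and treating "over the limit" as a strict comparison is an assumption.

```python
# Each measurement carries its data source as provenance (hypothetical layout).
MEASUREMENTS = [
    {"site": "Station 1", "contaminant": "Arsenic", "mg_per_l": 0.015, "source": "USGS"},
    {"site": "Station 2", "contaminant": "Arsenic", "mg_per_l": 0.008, "source": "USGS"},
    {"site": "Facility 3", "contaminant": "Arsenic", "mg_per_l": 0.030, "source": "EPA"},
]

# Each regulation set is keyed by the authority that issued it.
REGULATIONS = {
    "California": {"Arsenic": 0.01},      # limit quoted in the text
    "ExampleLenient": {"Arsenic": 0.02},  # made-up limit, for contrast only
}


def polluted_sites(measurements, regulations, data_source, regulation_source):
    """Apply one authority's limits to the data that came from one source."""
    limits = regulations[regulation_source]
    flagged = []
    for m in measurements:
        if m["source"] != data_source:
            continue  # provenance filter: keep only the requested data source
        limit = limits.get(m["contaminant"])
        # "Over the limit" is treated as a strict comparison (an assumption).
        if limit is not None and m["mg_per_l"] > limit and m["site"] not in flagged:
            flagged.append(m["site"])
    return flagged


strict = polluted_sites(MEASUREMENTS, REGULATIONS, "USGS", "California")
lenient = polluted_sites(MEASUREMENTS, REGULATIONS, "USGS", "ExampleLenient")
```

Changing only `regulation_source` changes which sites are flagged without touching the measurements. This is the kind of flexibility the New York and California example relies on.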
3.3 Evaluation
There are no standard benchmarks we can use to evaluate our approach, so we designed our own evaluation approach. We are considering evaluating SWQP against the following questions: 1. How easy is it to deploy SWQP to analyze water quality in other states, 2. How easy is it for a user to use SWQP to obtain analyzed results, 3. How easy is it to deploy SWQP to solve other environment related problems. These evaluations may require human studies, in which we invite people to use the system by performing certain tasks (e.g. identify the number of polluted water sources in your home town) and answering various types of questions (e.g. on a scale of 1 to 5, how would you rate the system with respect to your experience).

4 RELATED WORK
Three areas of work are related to SWQP: semantic web portals, water quality ontologies, and provenance.

There is a diverse literature on semantic web portal systems [9], for example LOGD [10], Semantic Web Portal [11], SEAL [12], and HealthFinland [13]. As web portals, these systems provide different functionalities for users to interact with the data, such as integrating data [10][11][12], visualizing data [10][11], and searching data [12][13]. Our work differs from these systems in two aspects: 1. SWQP provides automatic inference and reasoning, 2. SWQP captures provenance information and provides provenance related functionalities.

Research and development of water and environment related ontologies have long been of interest to the community. A considerable number of ontologies have been developed to describe water [14][15] and environment related data [3]. The ontologies developed by Chau [14] and Parekh [15] describe water data such as quality and contaminants for simulation purposes. SWEET [3] is a more general ontology that models and describes the environment we live in. The ontology developed for SWQP aims to model water quality, pollution, and related data for reasoning and for finding pollution.
In this ontology, we also import some terms from the SWEET ontology to describe and model some water related information.

There has also been a considerable amount of research on semantic provenance, especially in the field of e-Science. myGrid [16] proposes the COHSE open hypermedia system, which generates, annotates, and links provenance data to build a web of provenance documents, data, services, and workflows for experiments in biology. The Multi-Scale Chemical Science (CMCS) project [17] develops a general-purpose infrastructure for collaboration across many disciplines. It also contains a provenance subsystem for tracking, viewing, and using data provenance. Simmhan et al. [18] present a taxonomy of the provenance techniques used in e-science projects.

Provenance is critical in e-science, because users are not likely to trust the results of scientific experiments if the provenance of the results cannot be identified. Similarly, an environmental portal also needs to support provenance to gain trust from users. Currently, SWQP provides provenance information at both the data source level and the application level. Furthermore, SWQP provides provenance based data query, as discussed in Section 2.4. This provenance related work in SWQP should help gain trust from users. In the future, we could borrow the approaches used by the above e-science systems to develop our own provenance subsystem. Alternatively, we could make use of existing provenance infrastructure like [19], which supports the extraction, maintenance, and usage of the provenance of answers given by web applications and services.

5 FUTURE WORK
In the near future, we would like to improve SWQP by supporting more interesting reasoning and inference, and by providing complete explanations of the answers given by SWQP via the captured provenance knowledge.

Inference and Reasoning: Two interesting kinds of reasoning could be supported by SWQP: 1. Health Effect Reasoning, 2. Flood Effect Reasoning.

Health Effect Reasoning: Drinking water from polluted water sources may result in serious health problems, as discussed in our motivation use case. By modeling these health effects and reasoning over them, we will be able to provide valuable solutions for some interesting problems. For example, in our motivation use case, if we are able to model and infer what kind of pollution causes people to start vomiting, we will be able to identify more quickly the polluted water source that causes the health problems.

Flood Effect Reasoning: In the water quality context, if a polluted water source floods, it may pollute nearby water sources and therefore cause serious pollution problems. If we can identify which water sources may be affected by the source of a flood, we are in a better position to prevent or alleviate this. Furthermore, if the flood source is a polluted water source, we can predict the effects of the flood on nearby water sources by semantic reasoning.

Knowledge Provenance: To provide users with a more complete explanation of the answers given by SWQP, we would like to support building, linking, and displaying proof traces that track how the answers are derived from the source data. Our proof traces would include all the manipulations the data go through: being downloaded from the data source, converted into RDF by the converters, loaded into the triple store, reasoned over by the reasoner, and presented with the presentation tools. We would also like to support provenance granularity, so that users can choose the granularity of the provenance they prefer.

6 CONCLUSION
In this paper, we presented the Semantic Water Quality Portal, a web portal for identifying polluted water sources, pollutants, etc. In this portal, we use semantic web technologies to provide 1. Provenance data about water sources, with support for provenance based data retrieval and reasoning, 2. OWL based reasoning and inference to identify polluted water sources, pollutants, etc., 3. Data visualization to present the trend of the amount of contaminants in water sources. As discussed in the previous section, we will work on developing functionality to support more interesting reasoning and inference, such as health effect reasoning and reasoning over flood effects, and to provide understandable explanations of the analyzed results via provenance knowledge.

ACKNOWLEDGEMENTS
The authors would like to thank Evan Patton for his help with use case development and various technical support. We also thank Tim Lebo for his help with the data conversion process.

TEAM CONTRIBUTIONS
Ping: Used Tim’s converter to convert EPA and USGS data. Preprocessed regulation data to CSV format. Implemented the data visualization part of the project. Wrote part of this final class write-up and presented the visualization part of the demo.
Jin: Wrote the script to convert data to RDF encoded using the ontology. Designed the ontology to support automatic reasoning and inference. Re-implemented the Jena-Pellet based back-end reasoner. Class related work: since this project is Ping’s out-of-class project, I am responsible for most of the project related write-up, presentation, etc.

REFERENCES
[1] Manola, F., Miller, E., McBride, B. (2004): RDF Primer. <http://www.w3.org/TR/rdf-syntax/>
[2] Hitzler, P., Krotzsch, M., Parsia, B., Patel-Schneider, P., Rudolph, S. (2009): OWL 2 Web Ontology Language Primer. <http://www.w3.org/TR/owl2-primer/>
[3] Raskin, R., Pan, M. (2005): Knowledge representation in the semantic web for Earth and environmental terminology (SWEET). Computers & Geosciences, 31(9), 1119-1125.
[4] Hobbs, J., Pan, F. (2006): Time Ontology in OWL. <http://www.w3.org/TR/owl-time/>
[5] McGuinness, D., Silva, P., Ding, L. (2007): Proof Markup Language (PML) Primer. <http://inference-web.org/2007/primer/>
[6] Krotzsch, M., Vrandecic, D., Volkel, M. (2006): Semantic MediaWiki. ISWC.
[7] Prud’hommeaux, E., Seaborne, A. (2008): SPARQL Query Language for RDF. <http://www.w3.org/TR/rdf-sparql-query/>
[8] Pinheiro, P., Mcguinness, D. L., Mccool, R. (2003): Knowledge Provenance Infrastructure. IEEE Data Engineering Bulletin, vol. 26.
[9] Lausen, H., Ding, Y., Stollberg, M., Fensel, D., Hernandez, R., Han, S. (2005): Semantic web portals: state-of-the-art survey. Journal of Knowledge Management, 9(5), pp. 40-49.
[10] Difranzo, D., Ding, L., Erickson, J. S., Li, X., Lebo, T., Michaelis, J., et al. (2010): TWC LOGD: A Portal for Linking Open Government Data. In: Semantic Web Challenge, International Semantic Web Conference 2010.
[11] Ding, Y., Sun, Y., Chen, B., Borner, K., Ding, L., Wild, D., Wu, M., DiFranzo, D., Fuenzalida, A., Li, D., Milojevic, S., Chen, S., Sankaranarayanan, M., Toma, I. (2010): Semantic Web Portal: A Platform for Better Browsing and Visualizing Semantic Data. International Conference on Active Media Technology.
[12] Maedche, A., Staab, S., Stojanovic, N., Studer, R. (2001): SEAL - A Framework for Developing SEmantic portALs. In Proceedings of the 18th British National Conference on Databases.
[13] Suominen, O., Hyvönen, E., Viljanen, K., Hukka, E. (2009): HealthFinland - A National Semantic Publishing Network and Portal for Health Information. Web Semantics: Science, Services and Agents on the World Wide Web, 7(4), pp. 287-297.
[14] Chau, K. (2007): An Ontology-based knowledge management system for flow and water quality modeling. Advances in Engineering Software, 38(3), 172-181.
[15] Parekh, V. (2005): Applying Ontologies and Semantic Web technologies to Environmental Sciences and Engineering. Master's Thesis, University of Maryland, Baltimore County.
[16] Zhao, J., Goble, C., Stevens, R. (2004): Semantically linking and browsing provenance logs for e-science. In Proc. of the 1st International Conference on Semantics of a Networked World, Lecture Notes in Computer Science, Paris, France.
[17] Myers, J., Pancerella, C., Lansing, C., Schuchardt, K., Didier, B. (2003): Multi-Scale Science, Supporting Emerging Practice with Semantically Derived Provenance. In ISWC workshop on Semantic Web Technologies for Searching and Retrieving Scientific Data.
[18] Simmhan, Y. L., Plale, B., Gannon, D. (2005): A survey of data provenance in e-science. ACM SIGMOD Record, 34(3), 31-36.
[19] Mcguinness, D. L., Pinheiro, P. (2004): Explaining answers from the semantic web: The inference web approach. Journal of Web Semantics, 1, 397-413.