Semantic History: Towards Modeling and Publishing Changes of Online Semantic Data Jie Bao, Li Ding, and Deborah L. McGuinness Tetherless World Constellation Rensselaer Polytechnic Institute Troy, NY, 12180-3590, USA {baojie,dingl, dlm}@cs.rpi.edu Abstract. Effective revision tracking is important to maintain and use Semantic Web data for both publishers and readers. Information related to revisions in this setting often contains basic context information, semantic difference summary, and rationale summary. In this work, we present a general architecture for modeling and publishing revision history of social semantic Web data. This model has been implemented as an extension to an existing infrastructure, namely Semantic MediaWiki. We show a variety of applications that can be built using the framework, including provenance tracking, statistics, temporal reasoning and explanation. 1 Introduction The Social Web enables collaborative content publishing on the Web. When enhanced with Semantic Web elements, e.g., RDFa or embedded/synchronized RDF, the conventional text based Web pages are enriched into Social Semantic Web pages with additional semantic annotations and links to other Web data. For maintaining and using such social semantic Web pages, effective revision tracking is important for publishers as well as readers. For instance, when a page is maintained by a number of editors collaboratively, an editor often would like to get informed of the recent editing history without having to read through the entire page manually. For another example, upon revisiting a published page, a reader may be interested in knowing whether the page has been changed, how frequently the page has been updated, whether a known error in the page has been cleared, and why the change has been made. Indeed, the value of revision tracking has already been demonstrated through the wide adoption of version control systems in software development and content management systems, e.g., Subversion(SVN), Concurrent Versions System (CVS), and wikis. In tracking history, there are several aspects that we should consider: – Reusable history. History data recorded structurally in conventional version control systems only allows users to access the history data via restricted data access user interfaces. This may be improved by publishing these data on the Semantic Web, thus enabling other applications to reuse these data. For example, one may use the Google Visualization API1 to compare the editing frequency of editors of a wiki page. 1 http://code.google.com/apis/visualization/ – Linked history. Many elements of history change data may be linked to other Semantic Web data. For example, we may link the editor of a page’s revision to a person from a social network, and then answer a query such as “find revisions on my publication page made by a friend of mine”. – Fine-grained history. We often need fine-grained history information because the difference between semantic annotations of different revisions can be computed at different levels of desired granularity. For example, one may need to model history about a page or a set of pages, instead of only recording history at the RDF triple level. This would allow us to link and query the page data and history data at the same time. For example, we can query for changes to the affiliation of a person (in page data) in the last month (in history data). – User-oriented history. In addition to revision history annotations, we may add more annotations to better serve end users. Instead of using the history for just version control, we expect the history to be used by end users more frequently. Therefore, the history should be extended to incorporate end-user oriented annotations. For example, automatically generated summaries of revisions will help users grasp what semantic annotations have been updated on a page. In this work, we present a generic model for semantic history representation, focusing on the practical infrastructure for making semantic history publishable on the social semantic Web and showing the value of semantic history through working demos and deployed applications. Our contributions hence include: – A general framework for modeling revision history of online Semantic Web data, capturing both temporal changes to the semantic data and annotations to actions that led to the change. (Section 2) – We demonstrate with a Semantic Media-Wiki-based implementation how to automatically capture revision changes in social semantic Web applications. (Section 3). – We present several typical usage scenarios of our revision model, including provenance tracking, statistics and visualization, temporal reasoning and explanation, showing the strong modeling ability and practical value of our model. (Section 4) 2 From History to Semantic History Semantic history is not merely a collection of encodings of revision data. It also links revision data as a part of the Semantic Web data cloud and further exposes the currently hidden potential of history data to more end users. Figure 1 illustrates a model of semantic history that highlights its key components. An application that adopts this model is required to first create (e.g., by monitoring user actions or comparing revisions) history descriptions for Web entities in a structured way, such as generating some encoding of revision data with links to other relevant Semantic Web data. Second, it needs to publish semantic history data in a user friendly form, e.g., HTML, RSS or visualization, or in a format that facilitates easy machine processing like RDF. Fig. 1. A General Application Model of Semantic History A published Web page (with URL), potentially with some semantic annotations, may change over its lifetime. A change history description includes (among other information) two main groups of information: revision action descriptions and temporal data descriptions. Revision action description. An action description records information of the event that causes the revision, including the following information: – Basic context annotation: examples include the subject of the change, the revisions identification, author, timestamp, and user provided revision notes. These basic annotations are used to capture provenance information about the revision, e.g., who generated the identified revision using which online data set at what time. Most of the annotations can be automatically captured by a version control system except for the human input revision notes. – Revision summary: this gives a summary of resulting changes that may help users comprehension, potentially at different levels of granularity, e.g., at document level and at the instance level. – Rationale: this provides additional explanation about the action capturing the motivation. For example, it may contain user contributed annotations about the nature of the revision, and sources supporting the revision. It may also contain additional facts, either manually input or automatically generated, that may help identify the cause and impact of the revision action, e.g., syntactical or structural descriptions of data involved. Temporal data description. It describes the modification of semantic data (e.g., RDF triples) associated with the revision. It enables users to retrieve prior versions of the semantic data and compare any two prior versions. Our framework does not specify a particular way for representing temporal data, and allows an application to choose from different representations, e.g., Temporal RDF [7] or RSS feed. We also independently introduced with some practically interesting alternatives which are given in the following sections. We have implemented this generic application model of semantic history on the Semantic MediaWiki platform, which will be introduced in the next section. 3 Semantic MediaWiki Based Deployment In this section, we show how the basic model of semantic history we just introduced is applied in modeling revision history in Semantic Mediawiki. A demonstration site of its implementation is at http://tw.rpi.edu/semhis. Semantic MediaWiki (SMW) is a semantic wiki system that allows collective authoring of a shared repository of both semantic and non-semantic data. It is an extension of the popular Mediawiki (MW) platform - which powers Wikipedia - thus it also inherits the built-in change management mechanism of MW. These include2 : – Page history: for each page, every revision of the page is stored along with information about author, time and size of the revision; the difference between two revisions of a page can be computed. – Change summary: when a user submits a new edit, the user can input a short explanation in natural language to summarize the change. – Recent changes: the page “Special:RecentChanges” shows all changes that happened in a selected recent time span; the display can be filtered by page namespace, type of users (e.g., anonymous users or bots), and nature of the change (e.g. minor edits). – Action logs: the page “Special:Log” supports a limited search interface over all changes that happened in the wiki, e.g., by page title and by time period. The MW change management mechanism is limited in several ways when being used with SMW. 1. MW revision logs only record changes to a page, but in SMW we often need finegrained information about addition and deletion of semantic annotations on a page. 2. The user submitted change summary is for human consumption only and its meaning is not formally captured; thus, it lacks a built-in automated search or query revisions based on the change summary. For instance, one may want to find all revisions about a page that involve only editorial changes (e.g., typo fixing) but not factual changes. 3. Querying revisions is limited and cannot utilize knowledge (e.g., classification of pages) in the wiki. For instance, one may wish to query about changes about a logic topic made by users who are computer scientists. Another example is to ask about the list of countries on Wikipedia with missing GDP figures or whose GDP figures have not been changed in the past year. 4. Facts that are time-sensitive may be buried in revision history and thus cannot be easily used. For example, one may want to ask for the set of pages that belong to the category “Living people” on Wikipedia as of Jan. 1st 2007. 2 In this paper, we study MediaWiki 1.15.0 and Semantic MediaWiki 1.4.3. Fig. 2. Workflow of the Semantic History Extension of SMW To address these issues, we developed an extension to SMW called “Semantic History” that can automatically capture revision history of wiki pages, which itself is stored as semantic data in the wiki. In the following we give the details about the extension. Figure 2 describes the general workflow of the the Semantic History (SH) extension for SMW. It contains the following step in modeling and utilizing revision information in a SMW: 1. A user (potentially a software agent) makes some edit actions to a wiki page. Typical edit actions include creating new pages, modifying a existing page, moving a page and deleting/undeleting a page. The SH extension is transparent to the user, thus it does not require the user any special actions, nor breaks the usual workflow of wiki editing. 2. The SH extension monitors user actions and captures various types of revision changes happened on the wiki. At this step, the SH extension will – Create a revision instance that contains basic context information of the revision, e.g., revision id, author, timestamp, page location and editing summary (in plain text). This revision instance will be stored as a “hidden” wiki page. For example, an editing revision with id “1234” may be stored as a page “rev:1234”, where “rev:” is a name space for all revision pages and can be protected according to the wiki’s access control policy. – Compare triple changes before and after the editing, identify all triple additions and deletions during the process. Each triple is identified by a unique id and stored as a “hidden” wiki page along with its addition and deletion history. The output of this step thus is a set of wiki pages (i.e., revision pages and triple pages), marked up using a set of templates (details given below). We deliberately separate the template-based storage of revisions and the triple representation of revisions (which is generated in the next step) for the benefits of customization and extensibility. 3. Generate triple representation for the revision history via a set of predefined templates. For example, the template “SH Triple” will preform a reification of a triple into three triples, such as {{SH_Triple|s|p|o}} on a page “triple:123” will generate : – triple:123 subject s – triple:123 predicate p – triple:123 object o Other types of information, such as timestamp of a revision or a deletion revision associated with a triple, will be triplified in a similar fashion. 4. Add the triplified revision history back to the triple store of the SMW using SH templates. 5. Enable applications to query the SMW triple store about the semantic representation of the revision history for various purposes, e.g., statistics, visualization, temporal inference and explanation. Details of some typical use cases are discussed in the next section. Several issues in the workflow need further explanations. Using SMW for Data Storage. One design choice is how to store the revision history data. The SMW triple store is actually created on the top of a relational database. The SH extension, instead of directly storing data as additional tables in the relational database, relies on SMW as a meta layer for data storage. A set of templates are used as a meta model of revision history. They include: – Templates for revisions: “SH Rev” (basic information of a revision), “SH Minor” (if the revision is major), “SH Summary” (plain text summary of the revision) – Templates for triples: “SH Triple” (basic information of a triple) , “SH Add” (link to a revision that adds this triple), “SH Delete” (link to a revision that deletes this triple), “SH Obsolete” (flag for an obsolete triple, i.e., it has been deleted and not yet restored). Such a design brings a couple of advantages. First, this allows us to seamlessly integrate existing SMW-based tools (e.g., semantic query engine) with the SH extension since all revision related data is also in the SMW triple store. Second, the SH templates can be easily customized or extended based on a specific domain need. For instance, one may choose a vocabulary (category and property names) that is best aligned with other ontology terms on the wiki. Finally, since a template can render both semantic data and non-semantic text (e.g., layout and visual elements), using a template based approach makes it easier to create a user interface for browsing revisions and changed triples. Identifying a Triple. Each triple is identified by an id computed using a hash function from its subject, predicate and object values. Currently we use SHA-1 (a cryptographic hash function) to generate 40-character ids for triples. Parsing Semantic Editing Summary. We deliberately do not specify a syntax or a parser including how to explain the meaning of an editing summary. This would allow an application to implement their own syntax. Here we show two example syntaxes that have been proven useful on social Web applications. – The SMW annotation syntax3 : For example, one edit summary is: reason::data is outdated; source::CIA World Factbook; category:Fact Update; Update GDP numbers with the 2008 data It contains three semantic annotations in a format similar to the SMW syntax, and one non-semantic sentence explaining the change. This may be parsed into SMW scripts: [[reason::data is outdated]] [[source::CIA World Factbook]] [[category:Fact Update]] The non-semantic text will not be parsed and is stored in the original form. By semanticizing the editing summary, we will be able to perform more powerful queries over the revision data. For example, one may ask for revisions about a country that uses some information from an almanac (e.g., CIA World Factbook), we may use the following SMW query (based an ontology that contains categories Revision, Countries and Almanacs): {{#ask: [[Category:Revision]] [[about::<q>[[Category:Countries]]</q>]] [[source::<q>[[Category:Almanacs]]</q>]] }} – The Twitter-style syntax: one may use “#” to add tags to a summary. For instance, a sentence “fact update for 2008 #gdp #infobox change” maybe generate SMW annotations: [[tag::gdp]] [[tag::infobox]] (this revision is an infobox editing, and is about GDP number change). The source code of the Semantic History extension has been release at the MediawWiki site4 . 3 4 Its parser is implemented at http://tw.rpi.edu/proj/semhis.wiki/index.php/Template:SH Summary http://www.mediawiki.org/wiki/Extension:SemanticHistory 4 Usage Scenario Examples Applications may use revision history data for a variety of purposes. We demonstrate some of the SH extension’s potential with several hypothetical applications5 . Fig. 3. Provenance Tracking Provenance Tracking: One may be interested in tracking the triple changes such as who has changed it? when it was changed? Fig 3 shows an example that asks “Who has changed the first name of James Hendler?”. All triples related to this are retrieved by a semantic query. An generalization of the example is a “Semantic Recent Changes” page that lists all page-level and triple-level revisions in inverse chronological order. Statistics and Visualization: SMW provides tools for statistic queries and visualization of query results. We show two examples in Figure 4. In (a), we query about relative popularity of two types of pages, university and person, on the basis of total triple-level revision numbers to their instances. The query result is shown in a pie charter. In (b), we count the number of daily revisions for a specific week and visualize the result in a line charter. Temporal Reasoning: This example (Figure 5) shows the use of revision timestamps in inferring time-sensitive facts. To make the RPI Tetherelss World group publication list, we need to know the affiliation history of a person. Since Jie Bao became a number of the RPI in 2008, only the publications that are published after this date should be added to the list. This query can be done in two steps: – Look up triple changes with subject “Jie Bao” and predicate “affiliation”, and get the time span(from datatime value 20080226045135 to current) when the object was change to “RPI”; – Use the time span as a filter in querying Jie Bao’s publications. As shown in Figure 5, only papers published after time 20080226045135 qualify in the list. Fig. 4. Statistics and Visualization Application Examples Fig. 5. Temporal Reasoning with the Semantic History Extension Explanation: Due to the fact that a wiki page can use a template, and that template may generate new triples, a triple about page is not necessarily locally generated. For example, page “Jie Bao” generates the triple “Jie Bao Property:Member of ITA”; however, “Jie Bao” does not contain an explicit assertion for creating a Member of triple. Thus, we need to explore other possible explanations. Step 1: the triple ”Jie Bao Member of ITA” is created at revision rev:2934, on time 20090713102155, by author User:Baojie; however rev:2934 does not directly generate this triple. Step 2: rev:2934 uses template ”Template:Member of” of revision 2931. Step 3: rev:2931 may generate properties: member,member of Step 4; Thus, one possible explanation of ”Jie Bao Member of ITA” is the combination of rev:2934 to Jie Bao and rev:2931 to Template:Member of. This process can be encoded by automated queries and templates. Please see http:// tw.rpi.edu/proj/semhis.wiki/index.php/Template:Explain for details. 5 Live demos of some examples are at http://tw.rpi.edu/proj/semhis.wiki/index.php/Main Page#Examples 5 Related Work Temporal Knowledge Representation. The temporal aspect of data has been investigated in numerous communities including Database [13] and AI (e.g., Temporal Representation and Reasoning [4] and Temporal Description Logic[10]). Recently, there has been some work on adding temporal features to OWL, e.g., [8, 20, 11], to RDF[7], and to SPARQL[17, 15]. This work is useful for representing temporal data descriptions. Our work is different from this work in that we are focused on generating and publishing a Web history, and our work is not tied to one particular temporal knowledge representation formalism. It would be possible to adapt our generic model and the SMW-based implementation to represent wiki revisions with some of the aforementioned formalisms. The focus of our work is also different from the the above in that we present a generic revision change model not only addressing temporal changes in the semantic data, but also revision action descriptions, e.g., rationale information for the change. Detecting and Representing Changes. Automatically generating some meaningful changes of structured data has been investigated in database research[3]. History publishing was discussed in [16, 12] in the Wikipedia context. There are some similar work on Semantic Web data [9, 14]. However, the work [16, 12] does not support structured representation of revision information that can be consumed for general purposes, e.g., query, search and reasoning. The work [9, 14] provides means for detecting changes in ontology evolution, but does not address the publishing the change history information as semantic data. Our work addresses the issues of capturing and publishing revision data in an end user friendly way, which are missing from the aforementioned related work. Semantic Difference in RDF. Revisions on an RDF graph can be directly captured by the addition and deletion of triples introduced by the new version, and there has been a number of work from a syntactic perspective [2] or semantic perspective [6, 19] approaching this “diff” problem [1]. Diff is also investigated in synchronizing RDF graphs [5, 18]. This work can be seen as a complement of our work for detecting triplelevel changes. However, our work captures not only triple changes, but also provenance and rationale information associated with these changes. 6 Conclusion In this work, we investigate how to make the revisions of online data available in a meaningful way for common Web users using Semantic Web technologies. We described a generic application model that captures both revision action description and temporal data description for a revision. The model is implemented extending an existing platform, namely Semantic MediaWiki. We show that by encoding history information using semantic Web technologies, several interesting applications can be built, e.g., provenance tracking, temporal reasoning and explanation generation. Our future work will focus on developing additional services that may utilize semantic history information, including trust computations on semantic wikis, advanced explanation for revisions, and publishing of the revision data of Wikipedia as a part of the linked data cloud. Acknowledgments This work is partially supported by NSF #0524481, DARPA #FA8650-06-C-7605, #FA8750-07-D-0185, #55-002001, #F30602-00-2-0579, and ITA project W911NF-063-0001. References 1. T. Berners-Lee and D. Connolly. Delta: an ontology for the distribution of differences between rdf graphs. http://www.w3.org/DesignIssues/Diff (last visited on Oct 5 2009, Revision: 1.114), 2004. 2. J. J. Carroll. Signing RDF graphs. Technical Report HPL-2003-142, HP Lab, Jul 2003. 3. S. S. Chawathe and H. Garcia-Molina. Meaningful change detection in structured data. In SIGMOD Conference, pages 26–37, 1997. 4. M. Fisher, D. Gabbay, and L. Vila. Handbook of Temporal Reasoning in Artificial Intelligence (Foundations of Artificial Intelligence (Elsevier)). Elsevier Science Inc., New York, NY, USA, 2005. 5. J. N. Foster and G. Karvounarakis. Provenance and data synchronization. IEEE Data Eng. Bull., 30(4):13–21, 2007. 6. C. Gutierrez, C. Hurtado, and A. O. Mendelzon. Foundations of semantic web databases. In PODS ’04: Proceedings of the 23rd ACM symposium on principles of database systems, pages 95–106, 2004. 7. C. Gutiérrez, C. A. Hurtado, and A. A. Vaisman. Temporal rdf. In ESWC, pages 93–107, 2005. 8. J. R. Hobbs and F. Pan. An ontology of time for the semantic web. ACM Trans. Asian Lang. Inf. Process., 3(1):66–85, 2004. 9. M. C. A. Klein, A. Kiryakov, D. Ognyanov, and D. Fensel. Finding and characterizing changes in ontologies. In ER, pages 79–89, 2002. 10. C. Lutz, F. Wolter, and M. Zakharyaschev. Temporal description logics: A survey. In TIME, pages 3–14, 2008. 11. V. Milea, F. Frasincar, and U. Kaymak. Knowledge engineering in a temporal semantic web context. In ICWE, pages 65–74, 2008. 12. S. Nunes, C. Ribeiro, and G. David. Wikichanges - exposing wikipedia revision activity. In Procceedings of the 2008 International Symposium on Wikis (WikiSym) Porto, Portugal, 2008. 13. N. Pelekis, B. Theodoulidis, I. Kopanakis, and Y. Theodoridis. Literature review of spatiotemporal database models. Knowl. Eng. Rev., 19(3):235–274, 2004. 14. P. Plessers, O. D. Troyer, and S. Casteleyn. Understanding ontology evolution: A change detection approach. J. Web Sem., 5(1):39–49, 2007. 15. F. Rizzolo, Y. Velegrakis, J. Mylopoulos, and S. Bykau. Modeling concept evolution: a historical perspective. In 28th International Conference on Conceptual Modeling (ER), 2009. 16. M. Sabel. Structuring wiki revision history. In WikiSym ’07: Proceedings of the 2007 international symposium on Wikis, pages 125–130, New York, NY, USA, 2007. ACM. 17. J. Tappolet and A. Bernstein. Applied temporal rdf: Efficient temporal querying of rdf data with sparql. In ESWC, pages 308–322, 2009. 18. G. Tummarello, C. Morbidoni, R. Bachmann-Gmür, and O. Erling. Rdfsync: Efficient remote synchronization of rdf models. In ISWC/ASWC, pages 537–551, 2007. 19. M. Völkel, C. F. Enguix, S. R. Kruk, A. V. Zhdanova, R. Stevens, and Y. Sure. Semversion - versioning rdf and ontologies. Knowledge Web Deliverable 2.3.3.v1, University of Karlsruhe, June 2005. 20. C. A. Welty and R. Fikes. A reusable ontology for fluents in owl. In FOIS, pages 226–236, 2006.