AAAI Technical Report SS-12-06
Wisdom of the Crowd

Pragmatic Analysis of Crowd-Based Knowledge Production Systems with iCAT Analytics: Visualizing Changes to the ICD-11 Ontology

Jan Pöschko and Markus Strohmaier
Knowledge Management Institute, Graz University of Technology, Inffeldgasse 21a/II, 8010 Graz, Austria

Tania Tudorache and Natalya F. Noy and Mark A. Musen
Stanford Center for Biomedical Informatics Research, 1265 Welch Road, Stanford, CA 94305-5479, USA

Copyright © 2012, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

Abstract
While in the past taxonomic and ontological knowledge was traditionally produced by small groups of co-located experts, today the production of such knowledge has a radically different shape and form. For example, potentially thousands of health professionals, scientists, and ontology experts will collaboratively construct, evaluate and maintain the most recent version of the International Classification of Diseases (ICD-11), a large ontology of diseases and causes of deaths managed by the World Health Organization. In this work, we present a novel web-based tool, iCAT Analytics, that allows analysts to systematically investigate crowd-based processes in knowledge-production systems. To enable such investigation, the tool supports interactive exploration of pragmatic aspects of ontology engineering such as how a given ontology evolved and the nature of changes, discussions and interactions that took place during its production process. While iCAT Analytics was motivated by ICD-11, it could potentially be applied to any crowd-based ontology-engineering project. We give an introduction to the features of iCAT Analytics and present some insights specifically for ICD-11.

1 Introduction
Ontologies are widely used to represent taxonomic and other types of knowledge. With recent advances in collaborative and web technologies, large-scale ontology engineering projects increasingly seek out crowd-based strategies for supporting the process of knowledge production. Towards this end, researchers have proposed several ways in which these strategies can be applied (Simperl and Luczak-Rösch 2011). However, with increasing social and ontological complexity, such systems require effective analytical tools and methods to understand and interpret pragmatic aspects of crowd-based knowledge production, and how pragmatics (the way users interact with one another and with the system) influence the collective product (the ontology).

In this demonstration paper, we present an analytical tool, iCAT Analytics, that we developed to investigate the knowledge-production process behind the 11th revision of the International Classification of Diseases (ICD-11) managed by the World Health Organization (WHO). ICD is used worldwide to compile morbidity and mortality statistics, to monitor health-related spending, and to inform policy making. The most important contribution of ICD is that it enables the exchange of comparable data from different regions, and it allows the comparison of different populations over long periods of time (Israel 1978).

The previous revisions of ICD pursued a traditional knowledge-production process, where committees behind closed doors made the decisions on what to include in the classification. For the 11th revision, WHO pursues an open crowd-based approach, in which experts all around the world contribute to the content using a Web platform similar, at least in some ways, to Wikipedia. In the current alpha phase of the project, around 70 international experts work on the ICD-11 ontology using the ICD-11 Collaborative Authoring Tool (Tudorache et al. 2010) (iCAT; see Figure 1). iCAT is a custom-tailored version of WebProtégé, a general collaborative ontology-authoring tool (Tudorache et al. 2011). We designed iCAT specifically for the development of ICD-11. In the beta phase starting in May 2012, potentially thousands of other contributors will join this WHO effort to produce ICD-11 by 2015.

Figure 1: The iCAT platform allows users to edit the ICD-11 ontology with domain specific content. This view shows a hierarchy of the concepts (categories) in the ontology to the left and the detailed view of the properties for one particular concept to the right. Other views allow users to browse notes and to manage the hierarchy.

iCAT provides a collaborative Web-based platform that presents the underlying ICD-11 ontology in a user-friendly way. It shows users the disease characteristics (represented as OWL ontology properties) as simple Web forms that they need to fill in. WebProtégé and, hence, iCAT offer extensive collaboration features and functionality. In addition to editing the ontology collaboratively, users may add threaded notes attached directly to ontology classes that enable them to have contextualized discussion as part of the development process. WebProtégé tracks all the changes and notes that the users make in a structured log (Noy et al. 2006). We have been maintaining this log in iCAT since November 2009, and analyzing this data provides insight into the ICD-11 development process so far. See our previous work (Pöschko et al. 2012) for a detailed description of the dataset and the underlying model.

In order to understand the pragmatic history of such knowledge-production systems, and to gain both a quick overview and deeper insights into what areas are active, conflicted or neglected, we need effective analytical instruments. The main contribution of this demonstration paper is the introduction of a novel analytical tool that (i) has been applied to a very large collaborative ontology-engineering project and (ii) has the potential to increase our ability to make sense out of the complex dynamics and processes behind crowd-based knowledge production systems.

Table 1: Examples for the measures of conflict and author contributions on one concept with several properties. Letters represent individual edits made by the corresponding users. Vertical lines indicate session boundaries.

Property         Authors of changes (sorted by time)   Overrides   Edit sessions   Distinct authors by property
P1               AA|BB|C|AAA                           3           4               3
P2               BBBB|C|BB                             2           3               2
P3               A                                     0           1               1
overall measure                                        5           8               6

In this paper, we present a web-based tool that allows analysts to investigate the knowledge-production process for ICD-11. Specifically, the analysts can investigate the following interactive networks that the tool provides:
1. concepts, their properties, and their respective number of changes, notes, and various other measures;
2. authors and their relations through mutually edited concepts and overrides; and
3. properties attached to concepts and their relations through follow-up changes.

Providing this information has several purposes:
• Content editors see what concepts are “trending” and can plan their own efforts accordingly. They might be motivated by comparing their own contributions to others.
• Managing editors get an overview of the whole ontology engineering process and the current state of the ontology in terms of its history. They can evaluate collaboration between authors and set future goals and milestones accordingly.
• Ontology engineers see what parts of the ontology have been actively used and which parts have been neglected, giving them hints about possible improvements in the underlying model or at least in the communication of the meaning of certain properties to the editors.
The general goal is to provide further insights into specific knowledge production processes, with special focus on the social context of the production. Our tool is released via open-source software licenses (footnote 1) and is in active use in the development process for ICD-11 (footnote 2).

2 Materials and Methods
In this section, we define the dataset that we use in our analysis, present the measures for identifying conflicts in the knowledge production, describe the typical user interaction in iCAT Analytics, and provide implementation details.

Dataset
iCAT Analytics can naturally handle data from any ontology that is edited using WebProtégé and iCAT, which is a customization of WebProtégé. In addition, iCAT Analytics is easily extensible to visualize pragmatic aspects of arbitrary ontology-engineering projects, provided that data about changes and possibly notes is available. Specifically, the tool assumes the following datasets:
• the ontology characterized by concepts and relations among them. In the case of ICD-11, which is primarily a taxonomy, we focus on parent–child (“is-a”) relations;
• changes and notes to the ontology identified by their respective author, by the affected concept, its properties (if any) and a timestamp.
In the current stage of ICD-11, we deal with 119,382 changes and 27,181 notes by 68 different authors. The data in iCAT Analytics is updated frequently through automatic mechanisms, to reflect changes in the underlying process.

Defining and Measuring Conflict
A large part of our analysis focuses on areas of conflict in the ontology. Because there is no explicit notion of a revert in a non-version-controlled system such as ICD-11 (as opposed to Wikipedia, for instance), we define an alternative construct of “conflict” for the purpose of our study object (ICD-11). Specifically, we say that there is a conflict if there are subsequent edits that, at least partially, undo previous edits to a concept.
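The conflict measures reported in Table 1 (overrides, edit sessions, and distinct authors by property) can be reproduced from the time-ordered list of change authors per property. The following is a minimal sketch, not the tool's actual implementation: the function name `conflict_measures` and the input format are hypothetical, and it assumes that an override is counted once per session boundary, i.e., each time an author starts editing a property that a different author edited immediately before. This reading is the one that matches the numbers in Table 1.

```python
from itertools import groupby


def conflict_measures(edits_by_property):
    """Compute Table 1 style measures for one concept.

    edits_by_property (hypothetical input format) maps a property name to
    the list of author ids of its changes, sorted by time.
    Returns (per_property, totals).
    """
    per_property = {}
    for prop, authors in edits_by_property.items():
        # Collapse consecutive edits by the same author into one edit session.
        sessions = [author for author, _ in groupby(authors)]
        per_property[prop] = {
            # Assumption: one override per change of author between sessions.
            "overrides": max(len(sessions) - 1, 0),
            "edit_sessions": len(sessions),
            "distinct_authors": len(set(authors)),
        }
    # The overall measure of a concept is the sum over its properties.
    keys = ("overrides", "edit_sessions", "distinct_authors")
    totals = {key: sum(m[key] for m in per_property.values()) for key in keys}
    return per_property, totals


# The example concept from Table 1 (session boundaries are implied by the
# change of author, so they do not need to be stored explicitly).
table1 = {
    "P1": list("AABBCAAA"),  # AA|BB|C|AAA
    "P2": list("BBBBCBB"),   # BBBB|C|BB
    "P3": list("A"),         # A
}
per_property, totals = conflict_measures(table1)
print(per_property["P1"])
print(totals)
```

For the example concept, the sketch yields 3 overrides, 4 edit sessions and 3 distinct authors for P1, and overall measures of 5, 8 and 6, in agreement with Table 1.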
In this work, we focus on conflict in property values, leaving out changes to the hierarchy of the taxonomy. This approach is reasonable, as 76.8% of all changes affect property values. Another 10.9% of changes are class creation, and users hardly ever reverse a class-creation operation. We use the following three measures to analyze the changes in property values:
• Overrides is the number of times one author edits the same property as another author did previously.
• Edit sessions is the number of groups of consecutive changes by the same author on the same property.
• Distinct authors by property counts the number of distinct authors for each property and sums over all properties.
Table 1 illustrates these measures.

Footnote 1: http://github.com/poeschko/iCAT-Analytics
Footnote 2: http://icatanalytics.stanford.edu

User Interaction in iCAT Analytics
The typical way of interacting with iCAT Analytics is through visualizations of weighted networks, i.e., sets of nodes (with different sizes and possibly colors) and connecting edges (with different sizes). Table 2 presents a list of the available node weights. We use either the twopi (radial) or sfdp (multi-scale force-based “spring model”) layout of Graphviz (Ellson et al. 2003) for network layout. While the radial layout is better suited for a hierarchical taxonomy such as ICD-11, a force-based layout could be more appropriate for other ontologies and networks.

A user can browse the visual networks by scrolling around using common drag-and-drop principles, and by zooming in and out using either the mouse wheel or dedicated zoom buttons. A button allows a user to jump to the center of the graph quickly.
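To illustrate how a weighted taxonomy could be handed to Graphviz for the radial twopi layout mentioned above, the following sketch emits a DOT description of a small network. It is only an illustration under stated assumptions: the helper `to_dot`, the example node names, and the square-root scaling of node sizes are hypothetical and not taken from iCAT Analytics. The `layout` and `root` graph attributes and the `shape`, `fixedsize` and `width` node attributes are standard Graphviz attributes.

```python
import math


def to_dot(edges, weights, root, layout="twopi"):
    """Return Graphviz DOT text for a weighted taxonomy (hypothetical helper).

    edges:   iterable of (child, parent) pairs ("is-a" relations)
    weights: dict mapping node name -> non-negative weight (e.g. changes)
    root:    name of the node placed at the center of the radial layout
    layout:  Graphviz layout engine, e.g. "twopi" or "sfdp"
    """
    def quote(name):
        # DOT identifiers are quoted so that arbitrary titles are safe.
        return '"' + str(name).replace('"', '\\"') + '"'

    lines = [
        "digraph taxonomy {",
        f"  layout={layout};",
        f"  root={quote(root)};",
        '  node [shape=circle, fixedsize=true, label=""];',
    ]
    for node, weight in sorted(weights.items()):
        # Assumption: node width grows with the square root of the weight,
        # so that the node area is roughly proportional to the weight.
        width = 0.1 + 0.1 * math.sqrt(weight)
        lines.append(f"  {quote(node)} [width={width:.2f}];")
    for child, parent in edges:
        # Edges point from child to parent, i.e. towards the root.
        lines.append(f"  {quote(child)} -> {quote(parent)};")
    lines.append("}")
    return "\n".join(lines)


example = to_dot(
    edges=[("Skin diseases", "ICD"), ("Dermatitis", "Skin diseases")],
    weights={"ICD": 9, "Skin diseases": 4, "Dermatitis": 1},
    root="ICD",
)
print(example)
```

The resulting text can be saved to a file and rendered with the Graphviz command-line tools; passing `layout="sfdp"` selects the force-based alternative instead.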
We designed the general look-and-feel to resemble common applications such as Google Maps (footnote 3).

For large networks with tens of thousands of nodes (the network of concepts, in our case), it is not useful to display all of them at the same time, especially because in this case we focus on the attributes of individual nodes (their size and color) and not on the overall layout of the network. To account for this, iCAT Analytics displays only the most “important” nodes in a given view, where we compute the importance of a node based on the weight function the user chooses. Given the coordinates of the user view’s bounding box, we select the displayed part of the network in the following way:
1. The bounding box is divided into 10 × 10 raster boxes. For each box, the node with the highest weight is selected.
2. All nodes on a directed path from any selected node to the root node are selected, too.
3. All edges between any two selected nodes are selected.
Step 2 allows us to provide the context for each node (concept). Without information about the parents of a concept, it would not be possible for the user to make sense of individual nodes.

Showing all node titles at the same time would produce too much visual clutter. However, when the user moves the cursor over specific nodes, we show their title, highlight related nodes and show a shortened version of their respective titles (Figure 2).

Figure 2: Titles of nodes are shown as users move a cursor over them. Related nodes are highlighted and a short form of their respective title is shown as well.

Additional Rankings and Details
In addition to the network views for categories, authors, and properties, there are overview pages in iCAT Analytics that show the corresponding entities ranked by the different features. These pages allow users to find quickly the most (or least) changed concepts in the system, the most active users, etc., without having to scan through the whole network visualization (which provides an overview but not a linear ranking).

Clicking on nodes leads to a page with details for the corresponding concept, author, or property (Figure 3). This page provides information about the history of a concept, answering questions such as:
• When was it edited the last time?
• How was work on the concept distributed over time?
• How was work distributed among authors?

Figure 3: Category detail page showing a timeline of the number of changes and notes on the concept, a chart depicting the contributions of different authors, and a list of parents and children of the concept. Further down (not visible in this screenshot) would be a detailed list of all changes and notes.

Implementation Details
We implemented iCAT Analytics largely in Python (footnote 4) using the Django Web framework (footnote 5). For network calculations, we use NetworkX (Hagberg, Schult, and Swart 2008), employing Graphviz (Ellson et al. 2003) for computing graph layouts. We export the iCAT data using the Protégé (footnote 6) API and store it in a MySQL database. On the client side, we use JavaScript with AJAX (“Asynchronous JavaScript and XML”) to dynamically load and display parts of the networks. We maintain this part in a separate open-source project (footnote 7).

Footnote 3: http://maps.google.com
Footnote 4: http://python.org
Footnote 5: http://djangoproject.com
Footnote 6: http://protege.stanford.edu
Footnote 7: http://github.com/poeschko/nexp-js

3 Network Views
The user can select to display the network of categories, authors, or properties.

Network of Categories
The categories network shows the concepts in the ontology and their parent–child relations. The user can choose one of several node weight functions, which we use (1) to select the nodes that the tool displays (see the previous section), and (2) to size them accordingly. Table 2 presents a list of all features and the corresponding questions that analysis of these features can answer.

The color of the nodes reflects their display status, which is assigned by the editors of the ontology:
• red: the concept requires sufficiently more work;
• yellow: the concept is being edited, but it is not ready yet;
• blue: all aspects of the concept have been edited and it is ready for public consideration.
Nodes that have not been assigned a display status are gray.

Table 2: Features that users can select as node weights and to sort concepts, and the questions they address.

Feature: Question addressed

Changes and notes history
  Number of changes: Where are highly edited areas in the ontology?
  Number of notes: Where are highly discussed areas in the ontology?
  Changes + notes: Where are highly active areas in the ontology?
  Distinct authors of changes / notes: Which concepts attract many different authors?
  Authors Gini coefficient: Which concepts are edited more “democratically”, i.e., in a more evenly distributed manner? Contrarily, where are areas that are dominated by many changes by a single author?
  Overrides: Which concepts cause most dispute?
  Edit sessions: Where are highly active areas (modulo consecutive changes of the same property by the same author)?
  Distinct authors by property: Which concepts have many properties that are edited by many different authors?

Network features
  Number of parents: Which concepts have many parents? (This is particularly interesting in the case of ICD-11, as multiple parents were not possible in ICD-10 and are therefore introduced gradually.)
  Number of children: Which concepts have many children?
  Depth in network: Which concepts are very deep in the taxonomy?
  Betweenness centrality (directed): What are central concepts in the taxonomy?
  Betweenness centrality (undirected): (same question)
  Pagerank, Closeness centrality: (same question)

Figure 4 shows a screenshot of categories with their respective number of changes. This visualization provides a quick overview of the current production system state, displaying how the status is distributed and nested. For example, Figure 4 shows that one branch of the ontology (XII ’Diseases of the skin’) is almost entirely blue, meaning that it is close to being finished. Apart from that branch, blue nodes are rather spread out, suggesting that editors often mark a concept as ready even when its parents and children are not yet ready.

Figure 4: The main view of iCAT Analytics showing the ontology with concept nodes sized according to their respective number of changes, and edges denoting parent-child relations. The color of the nodes is blue for categories that are ready for public consideration, yellow for work-in-progress and red for categories that need sufficiently more work.

In addition to the overall view of concepts and corresponding features, iCAT Analytics also allows users to focus on individual authors and to view the network of concepts that they changed. This analysis can be interesting both to the users themselves and to managers to get an overview of where authors have been active, and what kind of contributions they have made. Figure 5 shows two examples of such networks. Figure 5(a) suggests an author who acts as a kind of “ontology manager” and mostly makes high-level changes across all branches of the ontology, whereas Figure 5(b) seems to represent a different kind of user, a “domain expert” (Falconer, Tudorache, and Noy 2011), who focuses on one particular part of the ontology.

Figure 5: Network of changed concepts by a single author. Node sizes correspond to the number of changes by the author, edges denote parent-child relations. (a) Changes in all parts of the ontology classify this author as an “ontology manager”. (b) Changes in one specific branch suggest this author being a “domain expert”.

Network of Authors
In the network of authors, nodes represent users in the ontology-engineering process. iCAT Analytics provides two different ways for linking these nodes:
1. Mutually touched categories shows edges between authors weighted according to the number of concepts that both authors edited or annotated; node sizes reflect the total number of changes by each author. This view gives an overview of the state of collaboration in a crowd-based knowledge-production system.
2. Overrides (Figure 6) weights edges according to the number of changes by one author that were overridden by another author; node sizes reflect the fraction of changes by an author that were overridden. This view answers both the question of who gets contradicted most often and the question of who contradicts them.

Figure 6: Part of the override graph of authors. Names are not shown to account for the authors’ privacy.

Network of Properties
In the network of properties, nodes correspond to the properties in the ontology and weighted edges indicate the number of follow-ups on a different property, i.e., the number of changes of a given property that were followed by a change on a given other property. This view can be further restricted to the follow-ups that happened within three hours. Figure 7 shows a portion of the resulting network in iCAT Analytics.

This network visualization aims to provide new insights for the creators of the ontology and the pragmatic usage of it. Strong connections between properties suggest that
• there is a strong semantic relation between them, and
• they should probably be placed close together in the user interface for the editors.

Figure 7: Part of the properties network.

4 Discussion and Conclusions
In this paper, we presented a novel web-based tool, iCAT Analytics, that enables users to explore pragmatic aspects of crowd-based knowledge-production systems. Our tool focuses on analyzing changes and notes that were made during the production process. The way we present this data visually allows users to get a quick overview of what happens in the ontology. Particularly, it indicates
• which areas in the ontology have been actively edited and which areas have been neglected;
• which concepts are edited more “democratically” than others, i.e., what are the relative contributions of different authors to the concept;
• how work is distributed among authors;
• which areas are disputed, i.e., have many concepts with conflicts among the editors;
• which authors collaborate with each other and to what extent they contradict each other;
• how properties in the ontology are used and in which order.

Examining these questions is already interesting for the limited collaboration that has happened so far in the process of ICD-11, but it will be even more useful to monitor crowd behavior and processes continuously when the system is open to a much broader public. Furthermore, iCAT Analytics can potentially be used in other knowledge production contexts that focus on ontologies as a collective product.

There are several extensions to the tool that would be interesting to pursue:
1. Providing a way to compare different “snapshots” of the ontology over time could be useful to monitor recent changes.
2. Integrating more aspects of “rewarding” authors for their contributions could encourage broader participation.
3. A deeper integration into iCAT itself (or other ontology engineering tools) would be desirable, especially in combination with 2.

Acknowledgements
We are grateful to our WHO collaborators for giving us the opportunity to participate in the ICD-11 project and to analyze the iCAT log data; especially, we want to thank Bedirhan Üstün for helpful discussions. This work was generously funded by a Marshall Plan Scholarship with support from Graz University of Technology. The work on iCAT and the generation of the change logs is partially supported by the NIGMS Grant 1R01GM086587.

References
Ellson, J.; Gansner, E.; Koutsofios, E.; North, S.; and Woodhull, G. 2003. Graphviz and Dynagraph – Static and Dynamic Graph Drawing Tools. In Junger, M., and Mutzel, P., eds., Graph Drawing Software. Springer-Verlag. 127–148.
Falconer, S. M.; Tudorache, T.; and Noy, N. F. 2011. An Analysis of Collaborative Patterns in Large-Scale Ontology Development Projects. In Proceedings of the Sixth International Conference on Knowledge Capture, 25–32. New York, NY: ACM.
Hagberg, A. A.; Schult, D. A.; and Swart, P. J. 2008. Exploring Network Structure, Dynamics, and Function using NetworkX. In Proceedings of the Seventh Python in Science Conference, 11–15.
Israel, R. A. 1978. The International Classification of Disease. Two hundred years of development. Public Health Rep. 93(2):150–152.
Noy, N. F.; Chugh, A.; Liu, W.; and Musen, M. A. 2006. A Framework for Ontology Evolution in Collaborative Environments. In International Semantic Web Conference - ISWC 2006, 544–558. Springer.
Pöschko, J.; Strohmaier, M.; Tudorache, T.; Noy, N. F.; and Musen, M. A. 2012. The Pragmatic History Behind our Semantic Future: Studying the Evolution of Large-Scale Ontology Engineering Projects and the Case of ICD-11. Journal of Biomedical Informatics. Under review.
Simperl, E., and Luczak-Rösch, M. 2011. Collaborative Ontology Engineering: A Survey. Knowledge Engineering Review. Accepted for publication.
Tudorache, T.; Falconer, S. M.; Nyulas, C. I.; Noy, N. F.; and Musen, M. A. 2010. Will Semantic Web Technologies Work for the Development of ICD-11? In Proceedings of the Ninth International Semantic Web Conference, 257–272. Berlin, Heidelberg: Springer.
Tudorache, T.; Nyulas, C.; Noy, N. F.; and Musen, M. A. 2011. WebProtégé: A Distributed Ontology Editor and Knowledge Acquisition Tool for the Web. Semantic Web Journal 11-165.