Semantic Wikis: Fusing Two Strands of the Semantic Web
Dr. Mark Greaves
Vulcan Inc.
markg@vulcan.com
© 2008 Vulcan Inc.

Talk Outline
The Argument for Semantic Wikis
– Two Strands of the Semantic Web
– Semantic Wikis: Bridging the Gap
– Lessons from the Design of SMW+
Semantic Wiki Experience with Vulcan’s Project Halo
– Question Answering in Science
– Wikis for Question Answering

Strand 1: The Semantic Strand of the Semantic Web
Semantic Web as RDBMS Integration Technology
– Semantic representation of schema relations
– Centralized workflows for ontology/data definition and management
– Powerful reasoning and inference
– Enterprise-oriented
Rooted in the original software/tools of the Semantic Web
– Initial triplestores and authoring systems were (mostly) stand-alone or within the confines of a controlled data set
– Early DARPA use cases were oriented around data integration
  • EII-style applications: BBN’s Foreign Clearance Guide for AMC
  • More XML-oriented than Web-oriented
The Primary Commercial use of Semantic Web for many years
– Examples: Siderean Seamark, Oracle RDF
– Still the most well-understood use cases for the semantic web
– Still extremely important commercially

Strand 2: The Web Strand of the Semantic Web
Semantic Web as a web-scale knowledge publishing technology
– Uncontrolled data dynamics, imperfect and voluminous data
– Anyone can publish with limited/no knowledge engineer involvement
– A massive base of socially-curated semantic data
– Balance between quantity and purity (issue with owl:sameAs links)
– Semantic data doesn’t have to be associated with HTML web text
Rooted in the original vision of the Semantic Web
– Took several years to start
to be realized
– Difficulty conceiving of massive numbers of overlapping ontologies and class hierarchies, and uncoordinated data publishing
– Hard problem is maintaining a set of informal, evolving, and partial agreements on vocabularies and ontologies
An exciting and emerging data set
– Examples: Yahoo!, Sindice, Linking Open Data
– Fairly poorly understood use cases (especially commercially)
– Web-oriented and web-scale is extremely attractive

What do Strand 2 Semantic Web Applications Do?
Strand 1 semantic web applications have enterprise use cases
– EII, E-science, Enterprise content management...
– Success of use cases requires unified data models, familiar to DB thinking
Strand 2 semantic web applications address a brand new use case type
– “Semantic Web should allow people to have a better online experience” – Alex Iskold, CEO of AdaptiveBlue
– Enhance the human activities of content creation, publishing, linking my data to other data, forming community, purchasing satisfying things, browsing, etc.
– Strongly linked to Web 2.0 business models (such as they are)
  • Improve the effectiveness/targeting of advertising
  • Knowledge management tools for communities
Strand 2 use cases still require Strand 1-style data consistency and vocabulary agreement
Can Strand 2 Semantic Web Applications Overcome the Data Chaos of the Emerging Semantic Web?

Semantic Wikis are in both Strands
Wikis are tools for Publication and Consensus
MediaWiki (software for Wikipedia, Wikimedia, Wikibooks, etc.)
– Most successful Wiki software
  • High performance: 10K pages/sec served, scalability demonstrated
  • LAMP web server architecture, GPL license
– Publication: simple distributed authoring model
  • Wikipedia: >2.5M English articles, >250M edits, >2.5M images, #8 Alexa traffic rank in August
– Consensus achieved by global editing and rollback
  • Fixpoint hypothesis, although consensus is not static
  • Gardener/admin role for contentious cases
Semantic Wikis apply the wiki idea to structured (typically RDFS) information
– Authoring includes instances, data types, vocabularies, classes
– Natural language text used for explanations
– Automatic list generation from structured data, basic analytics, database imports
– See e.g., http://wiki.ontoprise.com for one powerful semantic wiki
Semantic Wiki Hypotheses:
(1) Significant interesting non-RDBMS Semantic Data can be collected cheaply
(2) Wiki mechanisms can be used to maintain consensus on vocabularies and classes

Example: Semantic MediaWiki with Halo Extensions (SMW+)
Knowledge Authoring Capabilities
– Syntax highlighting when editing a page
– Semantic toolbar in edit mode
  • Displays annotations present on the page that is edited
  • Allows changing annotation values without locating the annotation in the wiki text
– Autocompletion for all instances, properties, categories and templates
– Increased expressivity through n-ary relations (available with the SMW 1.0 release)

Example: Semantic MediaWiki with Halo Extensions (SMW+)
Semantic Navigation Capabilities
– GUI-based ontology browser, enables browsing of the wiki's taxonomy and lookup of instance and property information
– Linklist in edit mode, enables quick access of pages that are within the context of the page being currently edited
– Search input field with autocompletion, to prevent typing errors and give a fast overview of relevant content

Example: Semantic MediaWiki with Halo Extensions (SMW+)
Knowledge Retrieval Capabilities
– Combined text-based and semantic search
– Basic reasoning in queries with sub-/super-category/-property reasoning and resolution of redirects (equality reasoning)
– GUI-based query formulation interface
Web service integration and import/export support for popular formats
Rule system developed for OWL-DLP and most of OWL-R
Fully open source under GPL, supported by Ontoprise

Cool Idea... But Does it Work?
User tests were performed in Chemistry
– 20 graduate students were each paid for 20 hours (over 1 month) to collaborate on semantic annotation for chemistry
– ~700 Wikipedia base articles
– US high-school AP exams were provided as content guidance
Gardening Statistics for Test Wiki
Initial Results (SMW+ 1.0)
– Sparse: 1164 pages (entities), avg 5 assertions per entity
  • 226 Relations (1123 relation-statements) and 281 attributes (4721 attribute-statements)
– Many bizarre attributes and relations
– Very difficult to use with a reasoner
User testing and quality results for SMW+ 1.1 extensions
– Initial SUS scoring (6 SMEs, AP science task) went from 43 to 61; final scores in the 70s
– 3 sessions using the Intrinsic Motivation Inventory (interest/value/usefulness); up 14%
– Aided by the consistency bot, users corrected 2072 errors (80% of those found) over 3 months
We have continued to build on this framework

Some Lessons Learned from SMW+ (and Freebase)
User Interface design matters
– This is core to MediaWiki’s success
– Formal usability testing with SMEs matters a lot
– Zero-training matters a lot
Gardening matters
– Users need support for debugging
– Gardeners can do large scale ontology editing
– Supports “Schema Last” data engineering
User-created ontologies are not always well-designed
– Flatter than normal
– Cheaper than normal
Natural language is necessary to augment bare RDF(S) semantics
– Supplemental semantics can be usefully carried in natural language

From Strand 2 Web to Strand 1 Semantics
Well-designed
semantic wikis make possible certain Strand 2 applications
– They enable local consensus-building on socially-published data
– They allow Strand 2 knowledge publication to go beyond search
Strand 1 semantic data can certainly support Strand 2 applications
– Example: use of other triplestore data in SMW+
How can you use Strand 2-collected data to support Strand 1 applications?
– Corporate uses of socially-curated data (Metaweb)
– Project Halo: Scientific question-answering

Talk Outline
The Argument for Semantic Wikis
– Two Strands of the Semantic Web
– Semantic Wikis: Bridging the Gap
– Lessons from the Design of SMW+
Semantic Wiki Experience with Vulcan’s Project Halo
– Question Answering in Science
– Wikis for Question Answering

Envisioning the Digital Aristotle for Scientific Knowledge
Inspired by Dickson’s Final Encyclopedia, the HAL-9000, and the broad SF vision of computing
– The “Big AI” Vision of computers that work with people
The volume of scientific knowledge has outpaced our ability to manage it
– This volume is too great for researchers in a given domain to keep abreast of all the developments
– Research results may have cross-domain implications that are not apparent due to terminology and knowledge volume
“Shallow” information retrieval and keyword indexing systems are not well suited to scientific knowledge management because they cannot reason about the subject matter
– Example: “What are the reaction products if metallic copper is heated strongly with concentrated sulfuric acid?” (Answer: Cu2+, SO2(g), and H2O)
Response to a query should supply the answer (possibly coupled with conceptual navigation) rather than simply list 1000s of possibly relevant documents

The Halo Project in One Slide
Project Halo: SME-based Authoring for scientific question-answering systems
Project Halo Goal: To determine whether tools can be built to facilitate robust knowledge formulation, query and evaluation by domain experts, with
ever-decreasing reliance on knowledge engineers
– Can SMEs build robust question-answering systems that demonstrate excellent coverage of a given syllabus, the ability to answer novel questions, and produce readable domain appropriate justifications using reasonable computational resources?
– Will SMEs be capable of posing questions and complex problems to these systems?
– Do these systems address key failure, scalability and cost issues encountered in expert systems?
Experimental Scope: Selected portions of the AP syllabi for chemistry, biology and physics
– Example: Balance the following reactions, and indicate whether they are examples of combustion, decomposition, or combination
  (a) C4H10 + O2 → CO2 + H2O
  (b) KClO3 → KCl + O2
  (c) CH3CH2OH + O2 → CO2 + H2O
  (d) P4 + O2 → P2O5
  (e) N2O5 + H2O → HNO3

AURA – Automated User-centered Reasoning and Acquisition System
Aura is a tool to help users formalize AP-level scientific knowledge
Aura can then reason with that knowledge
So users can ask questions and understand the answers

2006 Experimental Results for the Aura System
SME Group (vs. professional KE KBs): science grad student KBs, extensive natural language, ~$100 per syllabus page

Pilot Group: Halo Pilot System, percent correct
  Cycorp     37%
  SRI        44%
  Ontoprise  47%

Percentage correct
  Domain  Number of questions  SME1  SME2  Avg    KE
  Bio     146                  52%   24%   38%    51%
  Chem    86                   42%   33%   37.5%  40%
  Phy     131                  16%   22%   19%    21%

Knowledge Formulation
Time for KF
– Concept: ~20 mins for all SMEs
– Equation: ~70 s (Chem) to ~120 sec (Physics)
– Table: ~10 mins (Chem)
– Reaction: ~3.5 mins (Chem)
– Constraint: 14s Bio; 88s (Chem)
SME need for help
– 68 requests over 480 person hours (33%/55%/12%) = 1/day
Question Formulation
Avg time for SME to formulate a question
– 2.5 min (Bio)
– 4 min (Chem)
– 6 min (Physics)
– Avg 6 reformulation attempts
Usability
– SMEs requested no significant help
– Pipelined errors dominated failure analysis

Professional KE KBs: no natural language, ~$10K per syllabus page

System Responsiveness
– Biology: 90% answer < 10 sec
– Chem: 60% answer < 10 sec
– Physics: 45% answer < 10 sec
        Interpretation (Median/Max)  Answer (Median/Max)
  Bio   3s / 601s                    1s / 569s
  Chem  7s / 493s                    7s / 485s
  Phy   34s / 429s                   14s / 252s

How Can We Increase the Efficiency of SME Authoring?

Symbiosis Between Aura and SMW+
Classical Knowledge Engineering
– Expressive knowledge representation
– Sophisticated testing and debugging
Knowledge Engineering in Aura
– Acquires knowledge for deductive Q/A that can be used for answering AP questions in sciences
  • Uses a DL style class taxonomy, and logic programming style rules with many extensions
– Requires 40 hours of training for knowledge formulation
Semantic Web Knowledge Engineering
– Simple knowledge representation
– Quantity at some expense of quality
Knowledge Engineering in SMW+
– Tool for online authoring and consensus-building around semantic web content
– Captures knowledge at the level of RDFS
– Collective editing for quality control
– Gardening appropriate for scientific knowledge
– Almost walk up and use system
Can we use Semantic MediaWiki to capture knowledge that could be used for Q/A in AURA?
– Factual knowledge (e.g., atomic number for carbon is 6, solubility constraints, etc.)
– Taxonomic knowledge (e.g., eukaryotic and prokaryotic are two types of cells)
Knowledge creation would be faster, distributed, and cheaper

Example: Wikipedia Article on Organelle

Source Text of Article on Organelle in SMW+

Fact Box Summarizing the Annotations in SMW+

Ontology Browser for Test Biology Data in SMW+

Aura/SMW+ Use Case
Semantic Wiki includes relevant knowledge
Aura knowledge formulation engineer searches for knowledge during knowledge formulation
The KFE notices useful information in SMW+
The KFE maps the knowledge into Aura
– Currently uses a derivative of Ontomap
– Experimenting with FOAM support
– ETL workflow
The knowledge is translated into Aura and available for querying

AURA User Searches for Information

Aura User Notices Useful Information in Wiki

Aura User Maps Wiki Knowledge into Aura KB

Wiki Knowledge Available in Aura for Question-Answering

Conclusions
Two strands of semantic web applications
– Strand 1: Structured, enterprise-quality semantic data
  • Designed for powerful analytics and easier data fusion
– Strand 2: Lightweight web-scale semantic publishing
  • A revolution in AI if we can keep the quality up
Semantic Wikis have features from both strands
– Easy to see how semantic wikis can leverage Strand 1 data for Strand 2 support
– Harder to see how semantic wikis can leverage Strand 2 data for Strand 1 support
Vulcan’s Project Halo
– Use of SMW+ to use web-collected data in a question-answering application
– Addresses very hard AI problems in scaling up knowledge authoring
– Full evaluation of SMW+ and Aura in early 2009
  • Is mapping easier than authoring?

Thank You

Disclaimer: The preceding slides represent the views of the author only. All brands, logos and products are trademarks or registered trademarks of their respective companies.