Wikify Your Metadata!
Integrating Business Semantics, Metadata Discovery, and Knowledge Management

John O. Biderman, Harvard Pilgrim Health Care
Cameron McLean, World Class Objects
16 March 2010, 11:30

Outline
The Problem Statement
  – Genesis of the “Data Dictionary”
  – Build vs. Buy
About our Metadata environment
Introduction to MediaWiki and Semantic MediaWiki
Information Architecture
Generating Pages from Structured Source into the Wiki
User Content
Future Directions

© 2010 Harvard Pilgrim Health Care

About Harvard Pilgrim Health Care
Not-for-profit Health Plan serving approximately one million members in Massachusetts, New Hampshire, and Maine
Ranked the #1 commercial Health Plan in America for five consecutive years by U.S. News & World Report and the National Committee for Quality Assurance (NCQA)
Ranked #1 in Member Satisfaction in the Northeast region by J.D. Power and Associates
Rated in top 10 places to work in both the Boston Business Journal and The Boston Globe

Data Environment
Migrating from legacy monolithic core system to a componentized architecture integrated via SOA
In preparation for this, an Enterprise Data Warehouse (EDW) is in production
Analytics-intensive environment
  – Financial: Actuarial pricing, cost and utilization trend analysis, provider efficiency, etc.
  – Clinical: HEDIS, disease detection, quality of care, population studies, etc.
  – Sales & Marketing: product performance, consumer behavior, broker productivity, etc.

Problem Statement
Business data analysts’ sense of Impending Doom:
  – The planned shutdown of the legacy Data Warehouse was coming – everyone had to move to the EDW
A Semantic Challenge:
  – Semantics of the legacy warehouse were in the native terms of the old monolithic system
  – EDW semantics are based on the Enterprise Logical Data Model, in business terms independent of any source application

Problem Statement
“You can have the best data warehouse and the best BI tool in the world, but if we don’t have good descriptions of the data nobody will be able to use them.” - a user from the business
Available, quality metadata seen as key to EDW adoption
Executives set it as a priority

Problem Statement
HPHC has documented metadata for years – as an IT function
  – Mostly ODS- and warehouse-focused
  – Hard to get business or project team involvement
Metadata stored in a database and presented through the Enterprise Metadata Repository (a commercial tool)
Business users found the data definitions to be technical, sometimes unhelpful, without context
Presentation tool had poor navigation and search capability
Thus was born the “Data Dictionary” project
  – Executive sponsorship from COO and CFO
  – Business users involved in defining requirements

DD Key Driving Requirements
“Structured” and “Collaborative” components
  – Structured contains formal, approved, seldom-changing data definitions and notes
  – Collaborative is an area where business users can contribute knowledge, insights, and best practices about the data
    • Contributions may transcend individual data elements or subject areas
Governance
  – Data Stewardship committees approve Structured content and may recommend migration of Collaborative content
    • DD project dovetailed with rapidly evolving Data Governance program
Search
  – Content should be searchable across Structured and
    Collaborative areas
  – Search on key words and business concepts as well as literals

Business Context Diagram
[Diagram, labels only]
Participants: Data Stewardship Boards; Executive Oversight; Subject Matter Experts; Read-Only Data Dictionary Consumers
Stores: Metadata Database; Data Dictionary (Structured Content, Collaborative Content)
Flow labels: Oversight; Authorization; Social controls; Business definitions; Technical definitions; Lineage; Derivation logic; Other formal content; Watched topics; Structure collaborative content; Review collaborative content; Experience, comments, suggestions
Structured Content
  – Views, data elements, reports
  – Technical & business definitions
  – Data type, size, format
  – Lineage
  – Derivation logic
  – Etc.
Collaborative Content
  – Best practices
  – Questions about use
  – FAQs
  – Recommendations for changes to formal content
  – Etc.

DD Solution Assumptions
The Data Dictionary represents a business-friendly front end on metadata along with collaboration extensions
  – Addresses current issues of metadata usability and business ownership
Metadata will be stored and displayed through some metadata management tool
The Data Dictionary will leverage the metadata tool to:
  – Facilitate EDW adoption and implementation of a new BI tool
  – Share metadata more broadly
  – Elicit business contributions
  – Make metadata searchable and usable

Existing Metadata Management Process
[Diagram, labels only]
  – ERwin Data Models
  – ELDM Workbook
  – PDM Workbooks
  – ELDM Mapping Workbooks
  – Data Definitions Workbooks (Entity, Attribute)
  – ETL Specifications
  – Report/Extract Metadata
  – Excel Workbooks: metadata generation and collection tools, version-controlled through Harvest, designed to be loadable to metamodel via RTU
  – Custom Data Load App
  – CMC, Corporate Metadata Center (Oracle DB): home-grown metamodel, used for ad hoc queries, reports, impact analysis
  – Import Utility
  – ASP Templates
  – EMR
  – Browser
    Presentation
  – Enterprise Metadata Repository: complex data model, used only as presentation layer

Technology Option: Replace Repository with another tool
Survey the marketplace – see what commercial tools could replace part or all of the storage and presentation layers
[Diagram: the same metadata management process as on the previous slide, with two callouts marking replacement candidates, “possibly this” and “definitely this”]

Metadata Tools Review
Engaged consultant with extensive market research for two-day workshop
Researched ±15 products, received demos of 5
Conclusions:
  – Lots of good tools out there
  – Have come a long way in last several years, e.g.:
    • Visualization of data lineage
    • SOA readiness, integration of service registry/WSDL metadata
  – Early stages of adding collaborative components
  – Still aimed mostly at technical users with relatively complex User Interfaces
Plus, we were under the gun to deliver rapidly

Identified Solution
Metadata management and metadata presentation are two different problems that do not necessarily have one solution
Contemplated writing our own presentation layer, but…
Driving requirements – search, collaboration, ease of use – lent themselves to a Wiki
MediaWiki was already in-house and is Open Source with a plethora of plug-ins and extensions…
… and has programming interfaces that enable pushing of content into the tool outside the UI, plus ability to protect pages from editing (satisfies “structured” requirement)

General Solution for Data Dictionary
[Diagram, labels only]
  – Metadata generation and collection tools remain the same
  – CMC, Corporate Metadata Center (Oracle DB): supports ad hoc queries, reports, impact
    analysis; source for Web publication
  – Interface Programs to MediaWiki
  – MediaWiki Database (MySQL Database)
  – MediaWiki + Semantic Extensions
  – Presentation via Browser
  – Formal Metadata Capture Processes
  – Custom Data Load App
  – User-Contributed (Collaborative) Content

MediaWiki: It’s not just Wikipedia!

Simplified Markup Notation
For example, this wikitext. . .

    == Getting Started ==
    To the left is the navigation box. Data Dictionary has several paths to help you find what you need:
    * Navigate by [[EDW_View_Layers|view]] -- Find your departmental view and navigate by subject area.
    * Navigate by [[Analytic_Topics|analytical topic]] -- These are generic analytical opportunities that represent best practices in health care informatics. Data Dictionary shows you the data columns that pertain to each topic. The Data Dictionary can be augmented with other business taxonomies in future releases.
    * Look through complex [[EDW Derivations|derivations]] -- These are HPHC-standard ways for calculating measures that previously analysts had to program into their applications.
    * Use the [[EDW-DWH_Lexicon|Lexicon]] to help you migrate off DWH -- The Lexicon is a cross reference of the EDW Semantic View Layer to the legacy Data Warehouse and data marts -- CIRS, CCDB, and AURAmart.

Simplified Markup Notation
. . . results in this page display
[Screenshot of the rendered page]

Simplified Markup Notation
The same wikitext, annotated:
  – Headings with == ==
  – Bullets with *
  – Hyperlinks with [[ ]]: Wiki page name and displayed link

Simplified Markup Notation
MediaWiki translates it all into HTML…
The [[EDW Derivations|derivations]] link references this URL: https://. . ./ddw/index.php/EDW_Derivations

Native MediaWiki
Assign pages to Namespaces
  – Namespaces are specified by the system administrator and become part of the page name, e.g. “Edw:Member Liability Amount”
Assign pages to Categories
  – Categories are user-defined on a page and can be specified on the fly
Both participate in Search
Version history
  – Every wiki change is logged by user and date
  – Compare changes
  – Rollback

Wiki Templates
Declared in double braces {{ }}
Simplify page layout standardization
For example, Wikipedia references to disambiguation pages are through the “About” template. For the article titled “Wiki,” the template is invoked like this:

    {{About|the type of website}}

Results in this:

    This article is about the type of website. For other uses, see Wiki (disambiguation).

Beyond Hyperlinking
transclusion (trănz-kloo-zhən)
Dynamic inclusion of part or all of the text of one hypertext document into another.
See: Nelson, Ted 1982, Literary Machines (Mindful Press)

Semantic MediaWiki
Extends the markup notation
Supports assigning semantic Properties on a page, e.g.:
  – “Member Of” properties
    • Belongs to a Derivation
    • Belongs to a Subject Area
  – “Has A” properties
    • Has a description
  – Synonyms/Antonyms
    • Rx = Pharmacy = Drug
Semantic searches can find pages that are members of a property.
Tagged parts of a page can be transcluded onto other pages.

Semantic Properties
Example of semantic tags, in the form [[Property Name::Value]]:

    [[Category:View]]
    [[EDW Subject Area::CLAIM | ]]
    [[View Layer::Finance_Atomic_View | ]]
    [[View::CLAIM | ]]
    [[objectDesc::A notification or request for payment for health care services or products rendered to an HPHC member.]]

A section of text can be wrapped in a semantic tag

Semantic Queries

    {{#ask: [[Finance_atomic_view:+]] [[EDW Subject Area::CLAIM]]
     | ?objectDesc =
     | }}

Annotations:
  – [[Finance_atomic_view:+]]: within this namespace
  – [[EDW Subject Area::CLAIM]]: for a page whose “EDW Subject Area” property = “CLAIM”
  – ?objectDesc: return the text for the property “objectDesc”
Results in:

    A notification or request for payment for health care services or products rendered to an HPHC member.

A page with lots of content…
[Screenshot of the rendered page]

…but almost no Wikitext
Entire Wikitext for that page:

    __TOC__
    ==EDW View Layer Details==
    '''View Layer:''' Reference_View
    '''Description:''' Reference View Layer
    ==Views==
    {{ObjectList |[[Reference_View:+]] [[Category:View | Name]]}}
    {{HyperLinkSeeAlso | {{FULLPAGENAME}}}}
    [[Category:View Layer]]
    [[View Layer::Reference_View | ]]
    [[objectDesc::Reference View Layer | ]]

Annotation on {{ObjectList …}}: invokes a template inside of which is the semantic query to locate all pages that get listed in the table.

The Power of SMW
#ask and ye shall receive!
  – and its cousin #show
Normalized content
  – “Article of Record” content can be displayed by reference rather than in local Wikitext
Dynamic pages
  – Self-maintaining
Faceted querying
  – Properties group pages into like categories

or: It’s All in the Vocabulary

Leveraging Available Taxonomy
The Enterprise Logical Data Model is the über taxonomy
  – Native hierarchy
    • Subject Area
    • Facet
    • Entity
    • Attribute
EDW Architecture’s natural hierarchy
  – View “Layer”
  – View
  – Column
Business taxonomy
  – Seeded with “Analytic Topics”

Information Model
Map hierarchy and navigation
  – What content gets generated
  – Protected pages (“Structured content”)
  – What gets found by semantic query
Set page naming convention
Determine properties for each node:
  – Namespace
  – Category
  – Property tags

Wiki Structure
Each EDW “View Layer” content loaded into its own Wiki namespace
  – Supports filtered searching for users who generally have access to one layer only
EDW namespaces are protected
  – Page generator has its own wiki ID with admin rights
  – Supports the “Structured Content” requirement
“Business Annotations” section on the protected pages is editable by any user
  – Puts user comments more front and center than “Discussion” pages

Physical-to-Logical Mappings
All business data elements in the EDW are mapped to their counterparts in the ELDM
ELDM mappings are a semantic property of a physical column, e.g.:

    [[ELDM Attribute Name::provider contract identifier]]

In the SOA world, the ELDM is increasingly important as an application-neutral expression of enterprise data requirements
Data Dictionary is the first time the physical-to-logical relationships have been systematically exposed to the business users

Code Lookup
External Web app
Invoked from hyperlink on pages for
code columns that have a reference table
Queries the associated reference table in the data warehouse in real time

Components
Materialized views of Corporate Metadata Center database to:
  – Provide a frozen snapshot of the metadata (no shifting sands)
  – Flatten hierarchy for simpler querying
  – Allow for generation of changed pages only (comparing earlier snapshot with current)
Java program to read CMC data
  – Retrieves content
  – Recurses through data lineage to find source end point for a given target
Velocity template language to format into Wikitext
Selenium robot and MediaWiki APIs

Page Generation Process
[Diagram: Corporate Metadata Center → Materialized Views → Java (SQL Queries, Hash Tables) → Velocity Templates → WikiText → Selenium Bot through Wiki UI [development] or MediaWiki API calls (Java, http) [production] → MediaWiki (MySQL Database)]

Business Glossary
Business vernacular terms cross referenced to ELDM analogues – creates an association between “folksonomy” and the structured vocabulary

Links from User Contributions to Protected Pages
Proxy pages define semantic properties that associate user-contributed pages with protected pages

    [[links from::<page name>]]
    [[links to::<page name or url>]]

All generated pages reference a template that queries for these properties, dynamically adds a See Also section and transcludes the hyperlink

Participation Challenges
Organization is relatively primitive about Knowledge Management and Web 2.0
  – But…
Many people are intrigued by the capabilities of Semantic MediaWiki. Interest is piqued.
Need to get to some critical mass to demonstrate value and elicit more user contributions

How has it been received?
Well!
  – “The Data Dictionary provides clear and consistent EDW business definitions, search capabilities and a platform for user collaboration. The best part? The Data Dictionary is based on technology many users have already experienced with Wikipedia.” – Actuary
  – “I'm quite excited about the ease of collaboration provided by the Data Dictionary. When analysts learn something interesting from their analysis they can now post their findings in the Data Dictionary for others to see. The Data Dictionary will dramatically decrease learning time for data analysts transitioning to the EDW.” – Financial Analyst
300+ unique users
18,000+ generated pages
Data stewardship committees taking up responsibility for quality of data definitions
Implemented at 1/2 to 1/3 the cost of a commercial package with closer fit to user requirements

What’s coming?
Wiki forms for user contributions
  – Guide user on assigning properties
  – Simplify creation of proxies for transcluded hyperlinks
Less Wiki markup
  – More templates!
  – Allows bulk revision of page layouts
End-user UI on Semantic queries (Halo extension?)
More taxonomy
  – “Analytic Topics” were good as a demonstration but not terribly useful
  – Develop more business vocabulary and strategy for relating this to the metadata to support…
Conceptual Search
Extended scope:
  – SOA documentation
  – Integration into enterprise Knowledge Management

Conclusions
A “roll-your-own” approach to metadata presentation is a workable strategy
  – Requires a structured metadata store
Open Source tools are powerful and flexible
  – More functionality than you can afford to build yourself
  – Ever-evolving
    • Requires a version management strategy
A collaboration platform engages the business in knowledge sharing and transfers ownership of business metadata from IT
Semantic tools can build the connections between structured data taxonomy and the business vocabulary

The Vision
[Diagram, labels only: Business Intelligence; Semantics and Metadata; Data Management; Knowledge Management]
The Data Dictionary in Semantic MediaWiki sets a framework for this vision

john_biderman@harvardpilgrim.org
cmclean@wcobjects.com
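
Supplementary sketch: generating changed pages only
The Components slide describes formatting metadata into Wikitext and regenerating only the pages whose content changed between snapshots. The following is a minimal sketch of that idea, not the authors' implementation: the production pipeline used Java, Velocity templates, and materialized views, whereas this illustration uses Python and its standard string.Template as a stand-in. The row fields, the page layout, the page-title convention, and the function names are all assumptions made for illustration; only the semantic property names are taken from the slides.

```python
import hashlib
from string import Template

# Hypothetical page layout. The property names (EDW Subject Area, View Layer,
# objectDesc) come from the slides; the layout itself is an assumption.
VIEW_PAGE = Template(
    "==EDW View Details==\n"
    "'''View:''' $view\n"
    "'''Description:''' $description\n"
    "[[Category:View]]\n"
    "[[EDW Subject Area::$subject_area | ]]\n"
    "[[View Layer::$view_layer | ]]\n"
    "[[objectDesc::$description | ]]\n"
)


def page_title(row):
    """Assumed naming convention: the view layer acts as the wiki namespace."""
    return f"{row['view_layer']}:{row['view']}"


def render_page(row):
    """Format one flattened metadata row into wikitext."""
    return VIEW_PAGE.substitute(row)


def digest(text):
    """Fingerprint of a page's wikitext, used to compare snapshots."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def changed_pages(previous_digests, current_rows):
    """Return {title: wikitext} for pages that are new or differ from the
    earlier snapshot, given that snapshot as {title: digest}."""
    pages = {}
    for row in current_rows:
        text = render_page(row)
        title = page_title(row)
        if previous_digests.get(title) != digest(text):
            pages[title] = text
    return pages


if __name__ == "__main__":
    rows = [{
        "view_layer": "Finance_Atomic_View",
        "view": "CLAIM",
        "subject_area": "CLAIM",
        "description": "A notification or request for payment.",
    }]
    # First run: no earlier snapshot, so every page is generated.
    first_run = changed_pages({}, rows)
    snapshot = {title: digest(text) for title, text in first_run.items()}
    # Second run with unchanged metadata: nothing to push to the wiki.
    print(len(first_run), len(changed_pages(snapshot, rows)))
```

Only the pages returned by changed_pages would then be pushed to the wiki by whatever upload mechanism is in use, which keeps page version history free of no-op revisions.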