Wikify Your Metadata!
Integrating Business Semantics, Metadata
Discovery, and Knowledge Management
John O. Biderman
Harvard Pilgrim Health Care
Cameron McLean
World Class Objects
16 March 2010 11:30
Outline
 The Problem Statement
– Genesis of the “Data Dictionary”
– Build vs. Buy
 About our Metadata environment
 Introduction to MediaWiki and Semantic MediaWiki
 Information Architecture
 Generating Pages from Structured Source into the Wiki
 User Content
 Future Directions
© 2010 Harvard Pilgrim Health Care
About Harvard Pilgrim Health Care
 Not-for-profit Health Plan serving approximately
one million members in Massachusetts, New
Hampshire, and Maine
 Ranked the #1 commercial Health Plan in
America for five consecutive years by U.S.
News & World Report and the National
Committee for Quality Assurance (NCQA)
 Ranked #1 in Member Satisfaction in the
Northeast region by J.D. Power and Associates
 Rated in top 10 places to work in both the
Boston Business Journal and The Boston
Globe.
Data Environment
 Migrating from legacy monolithic core system to a
componentized architecture integrated via SOA
 In preparation for this, an Enterprise Data Warehouse
(EDW) is in production
 Analytics-intensive environment
– Financial: Actuarial pricing, cost and utilization trend analysis,
provider efficiency, etc.
– Clinical: HEDIS, disease detection, quality of care, population
studies, etc.
– Sales & Marketing: product performance, consumer behavior,
broker productivity, etc.
Problem Statement
Business data analysts’ sense of Impending Doom:
 The planned shutdown of the legacy Data Warehouse
was coming – everyone had to move to the EDW
A Semantic Challenge:
 Semantics of the legacy warehouse were in the native
terms of the old monolithic system
 EDW semantics are based on the Enterprise Logical
Data Model, in business terms independent of any
source application
Problem Statement
“You can have the best data warehouse
and the best BI tool in the world, but if we
don’t have good descriptions of the data
nobody will be able to use them.”
- a user from the business
 Available, quality metadata seen as key to EDW
adoption
 Executives set it as a priority
Problem Statement
 HPHC has documented metadata for years – as an IT
function
– Mostly ODS- and warehouse-focused
– Hard to get business or project team involvement
 Metadata stored in a database and presented through the
Enterprise Metadata Repository (a commercial tool)
 Business users found the data definitions to be technical,
sometimes unhelpful, without context
 Presentation tool had poor navigation and search capability
 Thus was born the “Data Dictionary” project
– Executive sponsorship from COO and CFO
– Business users involved in defining requirements
DD Key Driving Requirements
 “Structured” and “Collaborative” components
– Structured contains formal, approved, seldom-changing data definitions
and notes
– Collaborative is an area where business users can contribute
knowledge, insights, and best practices about the data
• Contributions may transcend individual data elements or subject
areas
 Governance
– Data Stewardship committees approve Structured content and may
recommend migration of Collaborative content
• DD project dovetailed with rapidly evolving Data Governance
program
 Search
– Content should be searchable across Structured and Collaborative
areas
– Search on key words and business concepts as well as literals
Business Context Diagram
[Diagram: the Data Dictionary combines Structured Content and Collaborative Content.]
 Structured Content (from the Metadata Database):
– Views, data elements, reports
– Technical & business definitions
– Data type, size, format
– Lineage
– Derivation logic
– Etc.
 Collaborative Content:
– Best practices
– Questions about use
– FAQs
– Recommendations for changes to formal content
– Etc.
 Actors: Executive Oversight; Data Stewardship Boards; Subject Matter Experts; Data Dictionary Consumers (read-only)
 Flow labels: oversight; authorization; social controls; business definitions; technical definitions; lineage; derivation logic; other formal content; structure collaborative content; review collaborative content; watched topics; experience, comments, suggestions
DD Solution Assumptions
 The Data Dictionary represents a business-friendly
front end on metadata along with collaboration
extensions
– Addresses current issues of metadata usability and business
ownership
 Metadata will be stored and displayed through some
metadata management tool
 The Data Dictionary will leverage the metadata tool to:
– Facilitate EDW adoption and implementation of a new BI tool
– Share metadata more broadly
– Elicit business contributions
– Make metadata searchable and usable
Existing Metadata Management Process
[Diagram: metadata is captured in generation and collection tools, loaded into the CMC, and presented through the EMR.]
 Metadata generation and collection tools (version-controlled through Harvest, designed to be loadable to the metamodel via RTU):
– ERwin Data Models
– Excel Workbooks: ELDM Workbook, PDM Workbooks, ELDM Mapping Workbooks, Data Definitions Workbooks (Entity, Attribute)
– ETL Specifications
– Report/Extract Metadata
 Load: Import Utility, Custom Data Load App
 CMC, Corporate Metadata Center (Oracle DB): home-grown metamodel, used for ad hoc queries, reports, impact analysis
 EMR, Enterprise Metadata Repository: complex data model, used only as presentation layer
 Presentation: ASP Templates, Browser
Technology Option: Replace
Repository with another tool
 Survey the marketplace – see what commercial tools could
replace part or all of the storage and presentation layers
[Diagram: the existing metadata management process repeated, with the callouts “possibly this” and “definitely this” marking the layers a commercial tool might replace.]
Metadata Tools Review
 Engaged consultant with extensive market research for
two-day workshop
 Researched about 15 products, received demos of 5
 Conclusions:
– Lots of good tools out there
– Have come a long way in last several years, e.g.:
• Visualization of data lineage
• SOA readiness, integration of service registry/WSDL metadata
– Early stages of adding collaborative components
– Still aimed mostly at technical users with relatively complex
User Interfaces
 Plus, we were under the gun to deliver rapidly
Identified Solution
 Metadata management and metadata presentation are
two different problems that do not necessarily have one
solution
 Contemplated writing our own presentation layer, but…
 Driving requirements – search, collaboration, ease of
use – lent themselves to a Wiki
 MediaWiki was already in-house and is Open Source
with a plethora of plug-ins and extensions…
 … and has programming interfaces that enable pushing
of content into the tool outside the UI, plus ability to
protect pages from editing (satisfies “structured”
requirement)
General Solution for Data Dictionary
[Diagram: solution architecture.]
 Formal Metadata Capture Processes: metadata generation and collection tools remain the same; Custom Data Load App
 CMC, Corporate Metadata Center (Oracle DB): supports ad hoc queries, reports, impact analysis; source for Web publication
 Interface Programs to MediaWiki
 MediaWiki MySQL Database
 MediaWiki + Semantic Extensions: presentation via Browser
 User-Contributed (Collaborative) Content
MediaWiki: It’s not just Wikipedia!
Simplified Markup Notation
For example, this wikitext. . .
== Getting Started ==
To the left is the navigation box. Data Dictionary has several paths
to help you find what you need:
* Navigate by [[EDW_View_Layers|view]] -- Find your departmental
view and navigate by subject area.
* Navigate by [[Analytic_Topics|analytical topic]] -- These are
generic analytical opportunities that represent best practices in
health care informatics. Data Dictionary shows you the data columns
that pertain to each topic. The Data Dictionary can be augmented
with other business taxonomies in future releases.
* Look through complex [[EDW Derivations|derivations]] -- These are
HPHC-standard ways for calculating measures that previously analysts
had to program into their applications.
* Use the [[EDW-DWH_Lexicon|Lexicon]] to help you migrate off DWH --
The Lexicon is a cross reference of the EDW Semantic View Layer to
the legacy Data Warehouse and data marts -- CIRS, CCDB, and
AURAmart.
Simplified Markup Notation
. . .results in this page display
Simplified Markup Notation
[Slide: the same wikitext, annotated with callouts.]
 Headings with ==
– == Getting Started ==
 Bullets with *
– * Navigate by [[EDW_View_Layers|view]] -- Find your departmental view and navigate by subject area.
 Hyperlinks with [[ ]]
– [[EDW Derivations|derivations]]: the Wiki page name comes before the |, the displayed link after it
Simplified Markup Notation
MediaWiki translates it all into HTML…
References this URL:
https://. . ./ddw/index.php/EDW_Derivations
Native MediaWiki
 Assign pages to Namespaces
– Namespaces are specified by the system administrator and
become part of the page name, e.g.
“Edw:Member Liability Amount”
 Assign pages to Categories
– Categories are user-defined on a page and can be specified on
the fly
 Both participate in Search
 Version history – Every wiki change is logged by user
and date
– Compare changes
– Rollback
Wiki Templates
 Declared in double braces {{ }}
 Simplify page layout standardization
 For example, Wikipedia references to disambiguation
pages are through the “About” template. For the article
titled “Wiki,” the template is invoked like this:
{{About|the type of website}}
Results in this:
This article is about the type of website. For other uses, see
Wiki (disambiguation).
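As a rough illustration of what template expansion does, here is a minimal Python sketch. It is not how MediaWiki's parser works internally, and the template body below is a simplified stand-in for Wikipedia's real “About” template. It only shows the idea that a positional argument fills a {{{1}}} placeholder and that {{PAGENAME}} resolves to the current page.

```python
import re

# Hypothetical, much-simplified template store: name -> body.
# {{{1}}} is a positional parameter; {{PAGENAME}} stands for the current page.
TEMPLATES = {
    "About": "This article is about {{{1}}}. "
             "For other uses, see {{PAGENAME}} (disambiguation).",
}


def expand(call: str, page_name: str) -> str:
    """Expand one {{Name|arg1|arg2}} call; nested templates are not handled."""
    inner = call.strip()[2:-2]  # drop the surrounding {{ and }}
    name, *args = [part.strip() for part in inner.split("|")]
    body = TEMPLATES[name]
    # Substitute positional parameters {{{1}}}, {{{2}}}, ...
    body = re.sub(r"\{\{\{(\d+)\}\}\}", lambda m: args[int(m.group(1)) - 1], body)
    return body.replace("{{PAGENAME}}", page_name)


print(expand("{{About|the type of website}}", "Wiki"))
```

Because the layout lives in the template rather than on each page, changing the template body changes every page that invokes it.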
Beyond Hyperlinking
transclusion (trănz-kloo-zhən)
Dynamic inclusion of part or all of the text
of one hypertext document into another.
See: Nelson, Ted 1982, Literary Machines (Mindful Press)
Semantic MediaWiki
 Extends the markup notation
 Supports assigning semantic Properties on a page, e.g.:
– “Member Of” properties
• Belongs to a Derivation
• Belongs to a Subject Area
– “Has A” properties
• Has a description
– Synonyms/Antonyms
• Rx = Pharmacy = Drug
 Semantic searches can find pages that are members of
a property.
 Tagged parts of a page can be transcluded onto other
pages.
Semantic Properties
 Example of semantic tags:
– Form: [[Property Name::Value]]
[[Category:View]]
[[EDW Subject Area::CLAIM | ]]
[[View Layer::Finance_Atomic_View | ]]
[[View::CLAIM | ]]
[[objectDesc::A notification or request for payment for health care
services or products rendered to an HPHC member.]]
– A section of text can be wrapped in a semantic tag (as with objectDesc)
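To make the tag format concrete, the following Python sketch pulls property/value pairs out of wikitext like the example above. It is an illustration only: Semantic MediaWiki does this inside its own parser, and the regular expression here is an assumption that covers just the simple [[Property::Value]] and [[Property::Value | caption]] forms.

```python
import re

# Matches [[Property::Value]] or [[Property::Value | caption]].
# [[Category:View]] and ordinary links use a single colon, so they are skipped.
ANNOTATION = re.compile(r"\[\[([^\[\]:|]+)::([^\[\]|]+)(?:\|[^\[\]]*)?\]\]")


def extract_properties(wikitext: str) -> dict[str, list[str]]:
    """Collect semantic property values by property name."""
    props: dict[str, list[str]] = {}
    for name, value in ANNOTATION.findall(wikitext):
        # Collapse line breaks inside a long value into single spaces.
        props.setdefault(name.strip(), []).append(" ".join(value.split()))
    return props


page = """[[Category:View]]
[[EDW Subject Area::CLAIM | ]]
[[View Layer::Finance_Atomic_View | ]]
[[View::CLAIM | ]]
[[objectDesc::A notification or request for payment for health care
services or products rendered to an HPHC member.]]"""

for name, values in extract_properties(page).items():
    print(name, "=", values)
```

The category tag is deliberately left out of the result: categories and properties are separate mechanisms, even though the bracket syntax looks similar.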
Semantic Queries
{{#ask:
[[Finance_atomic_view:+]]
[[EDW Subject Area::CLAIM]] |
?objectDesc = | }}
– [[Finance_atomic_view:+]]: within this namespace
– [[EDW Subject Area::CLAIM]]: for a page whose “EDW Subject Area” property = “CLAIM”
– ?objectDesc: return the text for the property “objectDesc”
Results in:
A notification or request for payment for health care
services or products rendered to an HPHC member.
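The logic of that query can be mimicked over a small in-memory set of pages. This Python sketch is a toy model, not Semantic MediaWiki's query engine: the Page class and the ask function are hypothetical names, and the second page is invented purely to show the namespace filter at work.

```python
from dataclasses import dataclass, field


@dataclass
class Page:
    namespace: str
    title: str
    properties: dict[str, str] = field(default_factory=dict)


def ask(pages, namespace, conditions, printout):
    """Return the `printout` property of every page in `namespace` whose
    properties match all (name, value) pairs in `conditions`."""
    results = []
    for page in pages:
        if page.namespace != namespace:
            continue
        if all(page.properties.get(k) == v for k, v in conditions.items()):
            if printout in page.properties:
                results.append(page.properties[printout])
    return results


# Illustrative pages only; the Reference_View entry is made up.
PAGES = [
    Page("Finance_atomic_view", "CLAIM", {
        "EDW Subject Area": "CLAIM",
        "objectDesc": "A notification or request for payment for health care "
                      "services or products rendered to an HPHC member.",
    }),
    Page("Reference_View", "CLAIM", {
        "EDW Subject Area": "CLAIM",
        "objectDesc": "Same subject area, different namespace.",
    }),
]

print(ask(PAGES, "Finance_atomic_view", {"EDW Subject Area": "CLAIM"}, "objectDesc"))
```

The namespace restriction and the property condition are both filters; the printout only selects which property of the matching pages to return.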
A page with lots of content…
…but almost no Wikitext
Entire Wikitext for that page:
__TOC__
==EDW View Layer Details==
'''View Layer:''' Reference_View
'''Description:''' Reference View Layer
==Views==
{{ObjectList |[[Reference_View:+]] [[Category:View | Name]]}}
{{HyperLinkSeeAlso | {{FULLPAGENAME}}}}
[[Category:View Layer]]
[[View Layer::Reference_View | ]]
[[objectDesc::Reference View Layer | ]]
– {{ObjectList}} invokes a template inside of which is the semantic query to locate all pages that get listed in the table.
The Power of SMW
 #ask and ye shall receive!
– and its cousin #show
 Normalized content
– “Article of Record” content can be displayed by reference rather
than in local Wikitext
 Dynamic pages
– Self-maintaining
 Faceted querying
– Properties group pages into like categories
or: It’s All in the Vocabulary
Leveraging Available Taxonomy
 The Enterprise Logical Data Model is the über taxonomy
– Native hierarchy
• Subject Area
• Facet
• Entity
• Attribute
 EDW Architecture’s natural hierarchy
– View “Layer”
– View
– Column
 Business taxonomy
– Seeded with “Analytic Topics”
Information Model
 Map hierarchy and navigation
– What content gets generated – Protected pages (“Structured
content”)
– What gets found by semantic query
 Set page naming convention
 Determine properties for each node:
– Namespace
– Category
– Property tags
Wiki Structure
 Each EDW “View Layer” content loaded into its own
Wiki namespace
– Supports filtered searching for users who generally have access
to one layer only
 EDW namespaces are protected
– Page generator has its own wiki ID with admin rights
– Supports the “Structured Content” requirement
 “Business Annotations” section on the protected pages
is editable by any user
– Puts user comments more front and center than “Discussion”
pages
Physical-to-Logical Mappings
 All business data elements in the EDW are mapped to
their counterparts in the ELDM
 ELDM mappings are a semantic property of a physical
column, e.g.:
[[ELDM Attribute Name::provider contract identifier]]
 In the SOA world, the ELDM is increasingly important as
an application-neutral expression of enterprise data
requirements
 Data Dictionary is the first time the physical-to-logical
relationships have been systematically exposed to the
business users
Code Lookup
 External Web app
 Invoked from hyperlink on pages for code columns that
have a reference table
 Queries the
associated reference
table in the data
warehouse in real
time
Components
 Materialized views of Corporate Metadata Center
database to:
– Provide a frozen snapshot of the metadata (no shifting sands)
– Flatten hierarchy for simpler querying
– Allow for generation of changed pages only – comparing earlier
snapshot with current
 Java program to read CMC data
– Retrieves content
– Recurses through data lineage to find source end point for a
given target
 Velocity template language to format into Wikitext
 Selenium robot and MediaWiki APIs
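Two of the steps listed above, generating changed pages only and recursing through lineage to a source end point, can be sketched as follows. This is Python for brevity rather than the Java the deck describes; the dictionaries stand in for the CMC materialized views, and every page and column name is invented for illustration.

```python
def changed_pages(previous: dict[str, str], current: dict[str, str]) -> list[str]:
    """Pages that are new, or whose generated text differs from the earlier snapshot."""
    return sorted(name for name, text in current.items() if previous.get(name) != text)


def source_end_points(target: str, lineage: dict[str, list[str]], _seen=None) -> set[str]:
    """Walk target -> source edges until reaching columns that have no further source."""
    seen = set() if _seen is None else _seen
    if target in seen:  # already visited: guards against cycles in the lineage data
        return set()
    seen.add(target)
    sources = lineage.get(target, [])
    if not sources:  # nothing feeds this column, so it is a source end point
        return {target}
    end_points: set[str] = set()
    for source in sources:
        end_points |= source_end_points(source, lineage, seen)
    return end_points


# Hypothetical snapshots: page name -> generated wikitext.
earlier = {"Edw:Paid Amount": "v1", "Edw:Member Id": "v1"}
latest = {"Edw:Paid Amount": "v2", "Edw:Member Id": "v1", "Edw:Claim Id": "v1"}
print(changed_pages(earlier, latest))

# Hypothetical lineage: each target column maps to the columns it is loaded from.
LINEAGE = {
    "EDW.CLAIM.PAID_AMT": ["STAGE.CLAIM.PAID_AMT"],
    "STAGE.CLAIM.PAID_AMT": ["LEGACY.CLM.PD_AMT"],
}
print(source_end_points("EDW.CLAIM.PAID_AMT", LINEAGE))
```

The sketch does not detect pages that disappeared between snapshots; a real generator would need a policy for those as well.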
Page Generation Process
[Diagram: page generation pipeline.]
 Corporate Metadata Center → Materialized Views → Java SQL Queries → Hash Tables → Velocity Templates → WikiText
 WikiText reaches the MediaWiki MySQL Database by one of two routes:
– Selenium Bot through Wiki UI [development]
– MediaWiki API calls (Java, http) [production]
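The templating step in this pipeline, turning a metadata record into wikitext, might look something like the sketch below. Python's string.Template stands in for Velocity, and the page layout and field names are assumptions modeled on the wikitext shown earlier in the deck, not the actual production template.

```python
from string import Template

# string.Template stands in for Velocity; layout and field names are illustrative.
VIEW_PAGE = Template("""==EDW View Details==
'''View:''' $view
'''Description:''' $description
[[Category:View]]
[[EDW Subject Area::$subject_area | ]]
[[View Layer::$view_layer | ]]
[[objectDesc::$description | ]]""")


def render_view_page(record: dict[str, str]) -> tuple[str, str]:
    """Return (page title, wikitext). The namespace prefix is the view layer."""
    title = f"{record['view_layer']}:{record['view']}"
    return title, VIEW_PAGE.substitute(record)


# Hypothetical record, as it might come out of the hash tables.
record = {
    "view": "CLAIM",
    "view_layer": "Finance_Atomic_View",
    "subject_area": "CLAIM",
    "description": "A notification or request for payment for health care "
                   "services or products rendered to an HPHC member.",
}

title, wikitext = render_view_page(record)
print(title)
print(wikitext)
```

Keeping the layout in a template means the generator code only supplies data, so a layout change does not require touching the query logic.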
Business Glossary
 Business vernacular terms cross referenced to ELDM
analogues – creates an association between
“folksonomy” and the structured vocabulary
Links from User Contributions to
Protected Pages
 Proxy pages define semantic properties that associate
user-contributed pages with protected pages
[[links from::<page name>]]
[[links to::<page name or url>]]
 All generated pages reference a template that queries
for these properties, dynamically adds a See Also
section and transcludes the hyperlink
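A rough Python sketch of the proxy-page lookup follows. It assumes that “links from” names the protected page and “links to” names the user-contributed page or URL; the deck does not spell out the direction, so treat that reading, the function name, and the sample pages as hypothetical. In the wiki itself this is done by a template running a semantic query, not by external code.

```python
def see_also(page_name: str, proxies: list[dict[str, str]]) -> str:
    """Build a See Also section from proxy pages whose 'links from' names this page."""
    targets = sorted(p["links to"] for p in proxies if p.get("links from") == page_name)
    if not targets:
        return ""  # no proxies point here, so no section is added
    lines = ["==See Also=="]
    for target in targets:
        # External URLs take single brackets in wikitext; page names take double.
        if target.startswith(("http://", "https://")):
            lines.append(f"* [{target}]")
        else:
            lines.append(f"* [[{target}]]")
    return "\n".join(lines)


# Invented proxy pages, each reduced to its two semantic properties.
PROXIES = [
    {"links from": "Finance_Atomic_View:CLAIM", "links to": "Claim adjustment tips"},
    {"links from": "Reference_View:PRODUCT", "links to": "Product code FAQ"},
]

print(see_also("Finance_Atomic_View:CLAIM", PROXIES))
```

Because the section is computed from the proxies at display time, a protected page gains or loses its See Also links without being regenerated.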
Participation Challenges
 Organization is relatively primitive about Knowledge
Management and Web 2.0
– But… Many people are intrigued by the capabilities of
Semantic MediaWiki. Interest is piqued.
 Need to get to some critical mass to demonstrate value
and elicit more user contributions
How has it been received?
 Well!
– “The Data Dictionary provides clear and consistent EDW business definitions,
search capabilities and a platform for user collaboration. The best part? The
Data Dictionary is based on technology many users have already
experienced with Wikipedia.” – Actuary
– “I'm quite excited about the ease of collaboration provided by the Data
Dictionary. When analysts learn something interesting from their analysis they
can now post their findings in the Data Dictionary for others to see. The Data
Dictionary will dramatically decrease learning time for data analysts
transitioning to the EDW.” – Financial Analyst
 300+ unique users
 18,000+ generated pages
 Data stewardship committees taking up responsibility for quality of
data definitions
 Implemented at 1/2 to 1/3 the cost of a commercial package with
closer fit to user requirements
What’s coming?
 Wiki forms for user contributions
– Guide user on assigning properties
– Simplify creation of proxies for transcluded hyperlinks
 Less Wiki markup – More templates!
– Allows bulk revision of page layouts
 End-user UI on Semantic queries (Halo extension?)
 More taxonomy
– “Analytic Topics” were good as a demonstration but not terribly useful
– Develop more business vocabulary and strategy for relating this to the
metadata to support…
 Conceptual Search
 Extended scope:
– SOA documentation
– Integration into enterprise Knowledge Management
Conclusions
 A “roll-your-own” approach to metadata presentation is
a workable strategy
– Requires a structured metadata store
 Open Source tools are powerful and flexible
– More functionality than you can afford to build yourself
– Ever-evolving
• Requires a version management strategy
 A collaboration platform engages the business in
knowledge sharing and transfers ownership of business
metadata from IT
 Semantic tools can build the connections between
structured data taxonomy and the business vocabulary
The Vision
[Diagram relating Business Intelligence, Semantics and Metadata, Data Management, and Knowledge Management.]
The Data Dictionary in Semantic MediaWiki sets a framework for this vision
john_biderman@harvardpilgrim.org
cmclean@wcobjects.com