Data Quality Assessment and Measurement
Laura Sebastian-Coleman, Ph.D., IQCP
Optum Data Management EDW
April 2014 – AM5, April 28, 8:30 – 11:45

Agenda
Welcome and thank you for attending!
• Introductory materials
– Abstract
– Information about Optum and about me
• Presentation sections will follow the outline in the abstract (details in a moment…)
– Challenges of measuring data quality
– DQ Assessment in context
» Initial Assessment Deep Dive
– Defining DQ Requirements
– Using measurement for improvement
• Discussion / Questions

Ground Rules
• I will try to stick to the agenda.
• But the purpose of being here is to learn from each other, so questions are welcome at any point.
• I will balance between the two.

Proprietary and Confidential. Do not distribute.

Abstract: Data Quality Assessment and Measurement
Experts agree that to improve data quality, you must be able to measure data quality. But determining what and how to measure is often challenging. The purpose of this tutorial is to provide participants with a comprehensive and adaptable approach to data quality assessment.
• The challenges of measuring data quality and how to address them.
• DQ assessment in context: Understand the goals, measurement activities, and deliverables associated with initial assessment, in-line measurement and control, and periodic reassessment of data. Review a template for capturing results of data analysis from these processes.
• Initial Assessment: Review an approach to initial assessment that allows capture of important observations about the condition of data.
• Defining DQ requirements: Learn how to define measurable characteristics of data and establish requirements for data quality. Review a template designed to solicit and document clear expectations related to specific dimensions of quality.
• Using measurement for improvement: Share examples of measurements that contribute to the ongoing improvement of data quality.
About Optum
• Optum is a leading information and technology-enabled health services business dedicated to helping make the health system work better for everyone.
• With more than 35,000 people worldwide, Optum delivers intelligent, integrated solutions that modernize the health system and help to improve overall population health.
• Optum solutions and services are used at nearly every point in the health care system, from provider selection to diagnosis and treatment, and from network management, administration, and payments to the innovation of better medications, therapies, and procedures.
• Optum clients and partners include those who promote wellness, treat patients, pay for care, conduct research, and develop, manage, and deliver medications.
• With them, Optum is helping to improve the delivery, quality, and cost effectiveness of health care.

About me
• 10+ years of experience in data quality in the health care industry.
• Have worked in banking, manufacturing, distribution, commercial insurance, and academia. These experiences have influenced my understanding of data, quality, and measurement.
• Published Measuring Data Quality for Ongoing Improvement (2013).
• Influences on my thinking about data:
– The challenge of how to measure data quality. Addressing this challenge, I have focused on the concept of measurement itself. Any problem of measurement is a microcosm of the general challenge of data definition and collection.
– The demands of data warehousing; specifically, integrating data from different sources, processing it so that it is prepared for consumption, and helping make it understandable.
• My thinking about data governance has been influenced by my position within an IT organization.
– DAMA says governance is a business function. But I think IT needs to step up as well.
– IT takes care of data.
Technical and non-technical people would be better off if we all recognized IT as data stewards and if IT acted responsibly to steward data.
– The quality of data (esp. in large data assets) depends on data management practices, which are IT's responsibility. (It depends on other things, too, but data management is critical.)
– Complex systems require monitoring and control to detect unexpected changes.

Challenges of Measuring the Quality of Data
Overview: Challenges of Measuring Data Quality
• Lack of consensus about the meaning of key concepts. Specifically,
– Data
– Data Quality
– Measurement/Assessment
» The only way to address a lack of consensus about meaning is to propose definitions and work toward consensus. In the next few slides, we will go in depth into the meaning of these terms.
• To start: Sometimes the term data quality is used to refer both to the condition of the data and to the activities necessary to support the production of high-quality data. I separate these into
» The quality of the data / the condition of data
» Data quality activities: those required to produce and sustain high-quality data
• Lack of clear goals and deliverables for the data assessment process
» These we will discuss in detail in DQ Assessment in Context.
• Lack of a methodology for defining "requirements", "expectations", and other criteria for the quality of data. These criteria are necessary for measurement.
» This challenge we will discuss in detail in Defining Data Quality Requirements.

Assumptions about Data and Data Quality
• In today's world, data is both valuable and complex.
• The processes and systems that produce data are also complex.
• Many organizations struggle to get value out of their data because
– They do not understand their data very well.
– They do not trust the systems that produce it.
– They think the quality of their data is poor – though they can rarely quantify data quality.
• Poor data quality is not solely a technology problem – but we often
– Blame technology for the condition of data and
– Jump to the conclusion that tools can solve DQ problems. They don't.
• Technology is required to manage data and to automate DQ measurement – without automation, comprehensive measurement is not possible. There's too much data.
• Data is something people create.
– It does not just exist out in the world to be collected or gathered.
– To understand data requires understanding how data is created.
• Poor data quality results from a combination of factors related to processes, communications, and systems within and between organizations.

Assumptions, continued…
• Given the importance of data in most organizations, ALL employees have a stewardship role, just as all employees have an obligation not to waste other resources.
• Given how embedded data production is in non-technical processes, ALL employees contribute to the condition of data.
– Raising awareness of how they contribute will help improve the quality of data.
• Sustaining high-quality data requires data management, not just technology management.
– Data management, like all forms of management, includes knowing what resources you have and using those resources to reach goals and meet objectives.
– Technology should be a servant, not a master; a means, not an end; a tail, not a dog.
• Producing high-quality data requires a combination of technical and business skills (including management skills), knowledge, and vision.
– No one can do it alone.
• Better data does not happen by magic. It takes work.
• People make data. People can make better data.
• Why don't they?
What we want data to be
Reasonable. Reliable. Rational. Ready to use. A bit technical, but basically comprehensible.

How data sometimes seems
Powerful. Packed with knowledge. But threatening. And ambiguous. And for those reasons, interesting… And, of course, somewhat magical. Still… it is difficult to tell whose side data is on; whether it is good or evil.

What data seems to be turning into
Big Data is BIG – monstrous, even. And also powerful & threatening. Moving faster than we can control. Neither rational nor ready to use. And yet… a potential weapon. If only it would behave.

Definition: Data
• Data's Latin root is dare, "to give"; datum, its past participle, means "something given." In math and engineering, the terms data and givens are used interchangeably.
• The New Oxford American Dictionary (NOAD) defines data as "facts and statistics collected together for reference or analysis."
• ASQ defines data as "a set of collected facts" and identifies two kinds of numerical data: "measured or variable data … and counted or attribute data."
• ISO defines data as "re-interpretable representation of information in a formalized manner suitable for communication, interpretation, or processing" (ISO 11179).
• Observations about the concept of data
– Data tries to tell the truth about the world ("facts")
– Data is formal – it has a shape
– Data's function is representational
– Data is often about quantities, measurements, and other numeric representations ("facts")
– Things are done with data: reference, analysis, interpretation, processing

Data
• Data: Abstract representations of selected characteristics of real-world objects, events, and concepts, expressed and understood through explicitly definable conventions related to their meaning, collection, and storage.
• Each piece of the definition is important:
– abstract representations – Not "reality" itself.
– of selected characteristics of real-world objects, events, and concepts – Not every characteristic.
– expressed through explicitly definable conventions related to their meaning, collection, and storage – Defined in ways that encode meaning. Choices about how to encode are influenced by the ways that data will be created, used, stored, and accessed.
– and understood through these conventions – Interpreted through decoding.
• These concepts are clearly at work in systems of measurement and in the scientific concept of data – something that you plan for (designing an experiment) and test both for veracity (are the measurements correct?) and purpose (are the measurements telling me what I need to know?).

Definition: Data Quality
Data Quality / Quality of Data:
• The level of quality of data represents the degree to which data meets the expectations of data consumers, based on their intended use of the data.
• Data also serves a semiotic function – it serves as a sign of something other than itself. So data quality is also directly related to the perception of how well data effects (brings about) this representation.
Observations:
• High-quality data meets expectations for use and for representational effectiveness to a greater degree than low-quality data.
• Assessing the quality of data requires understanding those expectations and determining the degree to which the data meets them. Assessment requires understanding
– The concepts the data represents
– The processes that created the data
– The systems through which the data is created
– The known and potential uses of the data

Data Quality Activities
• The data quality practitioner's primary function is to help an organization improve and sustain the quality of its data so that it gets optimal value from its data.
• Activities that improve and sustain data quality include:
– Defining / documenting quality requirements for data
– Measuring data to determine the degree to which it meets these requirements
– Identifying and remediating root causes of data quality issues
– Monitoring the quality of data in order to help sustain quality
– Partnering with business process owners and technology owners to improve the production, storage, and use of an organization's data
– Advocating for and modeling a culture committed to quality
• Assessment of the condition of data and ongoing measurement of that condition are central to the purpose of a data quality program.

Measurement is always about comparing two things….

Definition: Measurement
• Measurement: The process of measurement is the act of ascertaining the size, amount, or degree of something.
– Measuring always involves comparison. Measurements are the results of comparison.
– Measurement most often includes a means to quantify the comparison.
• Observation: Measurement is both simple and complex.
– Simple because we do it all the time and our brains are hard-wired to understand unknown parts of our world in terms of things we know.
– Complex because, for those things we have not measured before, we often do not have a basis for comparison, the tools to execute the comparison, or the knowledge to evaluate the results.
» If you don't believe me, imagine trying to understand "temperature" in a world without thermometers.
– Measuring the quality of data is perceived as complex or difficult because we often do not know what we can or should compare data against.

Assessment goes further than measurement
Assessment is not just about comparison… it's about drawing conclusions. Drawing conclusions depends on understanding implications and how to act on them.

Definition: Assessment
• Assessment is the process of evaluating or estimating the nature, ability, or quality of a thing.
• Data quality assessment is the process of evaluating data to identify errors and understand their implications (Maydanchik, 2007).
Observations about assessment
• Like measurement, assessment requires comparison.
• Further, assessment implies drawing a conclusion about—evaluating—the object of the assessment, whereas measurement does not always imply one.
• But as with data quality measurement, with data assessment we do not always know what we are comparing data against. For example, how do we know what is wrong? What counts as an "error"?

Measurement/Assessment
Measurement is knowing that the temperature outside is 30 degrees F below zero. Assessment is knowing that it's cold outside. You can act on the implications of an assessment: get a coat! Or, better yet, stay inside.

Benefits of Measurement
• An objective, repeatable way of characterizing the condition of the thing being measured.
• For measurement to work, people must understand the meaning of the measurement.
• A beginning point for change / improvement of the thing that needs improvement.
• A means of confirming that improvement has taken place.

Data Quality Assessment in Context
Overview: DQ Assessment in Context
• Goals:
– Understand the goals, measurement activities, and deliverables associated with
» Initial assessment
» In-line measurement and control
» Periodic reassessment of data
– Review a template for capturing results of data analysis from these processes.
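The measurement/assessment distinction above (knowing the temperature is 30 below zero versus knowing it's cold) can be sketched in a few lines of Python. This is a toy illustration, not part of the DQAF; the thresholds and function name are my own assumptions.

```python
def assess_temperature(temp_f: float) -> str:
    """Turn a measurement (a number) into an assessment (a conclusion
    you can act on) by comparing it against thresholds.
    The thresholds here are illustrative, not authoritative."""
    if temp_f <= 0:
        return "dangerously cold: stay inside"
    if temp_f <= 40:
        return "cold: get a coat"
    return "comfortable"

# The slide's example: a measurement of -30 F
print(assess_temperature(-30))
```

The same pattern underlies data quality assessment: a measurement result only becomes actionable once it is compared against an expectation and a conclusion is drawn.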
Order of information
• Challenges of data quality assessment
• Overview of the DQAF: Data Quality Assessment Framework
– What the DQAF is
– The data quality dimensions it includes
– Relation of DQAF measurement types to data quality dimensions and to specific measurements
– Objects of measurement and the data quality lifecycle
• Context diagrams and deliverables
• Template review

Data Quality Assessment
• Ideally, data quality assessment enables you to describe the condition of data in relation to particular expectations, requirements, or purposes in order to draw a conclusion about whether it is suitable for those expectations, requirements, or purposes.
– A big challenge: Few organizations articulate expectations related to the expected condition or quality of data. So at the beginning of an assessment process, these expectations may not be known or fully understood. The assessment process includes uncovering and defining expectations.
• We envision the process as linear….
• But in most cases, it is iterative and sometimes requires multiple iterations….

Data Quality Assessment
• Data assessment includes evaluation of how effectively data represents the objects, events, and concepts it is designed to represent.
– If you cannot understand how the data works, it will appear to be of poor quality.
• Data assessment is usually conducted in relation to a set of dimensions of quality that can be used to guide the process, esp. in the absence of clear expectations:
– How complete the data is
– How well it conforms to defined rules for validity, integrity, and consistency
– How it adheres to defined expectations for presentation
• Deliverables from an assessment include observations, implications, and recommendations.
– Observations: What you see
– Implications: What it means
– Recommendations: What to do about it
DQAF – Data Quality Assessment Framework
• A descriptive taxonomy of measurement types designed to help people measure the quality of their data and use measurement results to manage data.
– Conceptual and technology-independent (i.e., it is not a tool)
– Generic – it can be applied to any data
• Initially defined in 2009 by a multi-disciplinary team from Optum and UHC seeking to establish an effective approach for ongoing measurement of data quality. The basis for Measuring Data Quality for Ongoing Improvement.
• Focuses on objective characteristics of data within five quality dimensions:
– Completeness
– Timeliness
– Validity
– Consistency
– Integrity
• Defines measurement types that
– Measure characteristics important to most uses of data (i.e., related to the basic meaning of the data)
– Represent a reasonable level of IT stewardship of data. That is, types that enable data management.

Using the DQAF
• The intention of the DQAF was to provide a comprehensive description of ways to measure, and I will describe it that way.
• But it does not have to be applied comprehensively.
• It can be applied to one attribute or rule.
• The goal is to implement an optimal set of specific measurements in a specific system (i.e., implementing all the types should never be the goal of any system).
• Implementing an optimal set of specific measurements requires:
– Understanding the criticality and risk of data within a system.
– Associating critical data with measurement types.
– Building the types that will best serve the system by
» Providing data consumers a level of assurance that data is sound based on defined expectations
» Providing data management teams information that confirms that data moves through the system in expected condition

Using the DQAF
• The different kinds of assessment are related to each other.
– Initial assessment drives the process by separating data that meets expectations from data that does not, and by helping identify at-risk and critical data for ongoing measurement.
– Monitoring and periodic measurement identify data that may cease to meet expectations and data for which there are improvement opportunities.
• The concept of data quality dimensions provides the initial organizing principle behind the DQAF:
Data Quality Dimension: A data quality dimension is a general, measurable category for a distinctive characteristic (quality) possessed by data. Data quality dimensions function in the way that length, width, and height function to express the size of a physical object. They allow understanding of quality in relation to a scale and in relation to other data measured against the same scale. Data quality dimensions can be used to define expectations (the standards against which to measure) for the quality of a desired dataset, as well as to measure the condition of an existing dataset. Dimensions provide an understanding of why we measure: for example, to understand the level of completeness, validity, and integrity of data.

DQAF Terminology
• Measurement Type:
– Within the DQAF, a measurement type is a subcategory of a dimension of data quality that allows a repeatable pattern of measurement to be executed against any data that fits the criteria required by the type, regardless of specific data content.
– The measurement results of a particular measurement type can be stored in the same data structure regardless of the data content.
– Measurement types describe how measurements are taken, including what data to collect, what comparisons to make, and how to identify anomalies. For example, all measurements of validity can be executed in the same way. Regardless of specific content, validity measurements include collection of data and comparison of values to a specified domain.
• Specific Metric:
– A specific metric describes particular data that is measured and the way in which it is measured.
– Specific metrics describe what is measured. For example, a metric to measure the validity of procedure codes on a medical claim table, or one to measure the validity of ZIP codes on a customer address table.

Dimensions, Measurement Types, Specific Metrics

Example Measurement Types

DQAF Dimension Definitions
Completeness: Completeness is a dimension of data quality. As used in the DQAF, completeness implies having all the necessary or appropriate parts; being entire, finished, total. A data set is complete to the degree that it contains required attributes and a sufficient number of records, and to the degree attributes are populated in accord with data consumer expectations. For data to be complete, at least three conditions must be met: the data set must be defined so that it includes all the attributes desired (width); the data set must contain the desired amount of data (depth); and the attributes must be populated to the extent desired (density). Each of these secondary dimensions of completeness would be measured differently.
Timeliness: Timeliness is a dimension of data quality related to the availability and currency of data. As used in the DQAF, timeliness is associated with data delivery, availability, and processing. Timeliness is the degree to which data conforms to a schedule for being updated and made available. For data to be timely, it must be delivered according to schedule.
Validity: Validity is a dimension of data quality, defined as the degree to which data conforms to stated rules. As used in the DQAF, validity is differentiated from both accuracy and correctness.
Validity is the degree to which data conform to a set of business rules, sometimes expressed as a standard or represented within a defined data domain.
Consistency: A dimension of data quality. As used in the DQAF, consistency can be thought of as the absence of variety or change. Consistency is the degree to which data conform to an equivalent set of data, usually a set produced under similar conditions or a set produced by the same process over time.
Integrity: Integrity is a dimension of data quality. As used in the DQAF, integrity refers to the state of being whole and undivided or the condition of being unified. Integrity is the degree to which data conform to data relationship rules (as defined by the data model) that are intended to ensure the complete, consistent, and valid presentation of data representing the same concepts. Integrity represents the internal consistency of a data set.

DQAF Terminology
• Assessment Category:
– In the DQAF, an assessment category is a way of grouping measurement types based on where in the data life cycle the assessment is likely to be taken.
– Assessment categories pertain to both the frequency of the measurement (periodic or in-line) and the type of assessment involved (control, measurement, assessment).
– They include: initial assessment, process control, in-line measurement, periodic measurement, and periodic assessment.
• Measurement (or Assessment) Activities:
– Measurement activities describe the goals and actions associated with work carried out within an assessment category. Measurement activities differ depending on when, within the data lifecycle, they are carried out and on which objects of measurement they target.
– Measurement activities correspond closely with DQAF measurement types.
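The repeatable pattern behind a validity measurement type, described earlier (collect values, compare them to a specified domain), can be sketched in Python. This is a minimal illustration, not DQAF code; the function name and the ZIP-code sample data are my own assumptions.

```python
def measure_validity(values, valid_domain):
    """Generic validity measurement: compare each collected value against
    a defined domain and quantify the comparison. The same pattern (and
    the same result structure) applies to any content that fits the type:
    procedure codes, ZIP codes, status codes, etc."""
    total = len(values)
    invalid = [v for v in values if v not in valid_domain]
    valid_count = total - len(invalid)
    return {
        "total": total,
        "invalid_count": len(invalid),
        "percent_valid": round(100.0 * valid_count / total, 1) if total else None,
        "invalid_values": sorted(set(invalid)),
    }

# A specific metric applies the type to particular data, e.g. a hypothetical
# ZIP-code check of the kind named under "Specific Metric" above:
zips = ["02138", "10001", "ABCDE", "10001"]
known_zips = {"02138", "10001", "60601"}
result = measure_validity(zips, known_zips)
```

Because the result structure is the same regardless of content, results from many specific metrics of the same type can be stored and trended together, which is the point of defining measurement types.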
• Object of Measurement:
– In the DQAF, objects of measurement are groupings of measurement types based on whether types focus on process or content, or on a particular part of a process (e.g., receipt of data) or kind of content (e.g., the data model).
– Content-related objects of measurement include: the data model, content based on row counts, content of amount fields, date content, aggregated date content, summarized content, cross-table content (row counts, aggregated dates, amount fields, chronology), and overall database content.
– Process-related objects of measurement include: receipt of data, condition of data upon receipt, adherence to schedule, and data processing.

Context of Data Quality Assessment: goals by assessment category
• Initial one-time assessment – Goal: gain understanding of the data and data environment.
• Automated process controls and in-line measurement – Goal: manage data within and between data stores with controls and ongoing measurement.
• Periodic measurement – Goal: manage data within a data store.
Measurement activities across these categories include: understand the business processes represented by the data; review and understand processing rules; ensure correct receipt of data; take preprocessing data set size measurements; measure data content completeness; take postprocessing data set size measurements; measure cross-table integrity of data; assess completeness, consistency, and integrity of the data model; assess the condition of data (profile data); inspect the initial condition of data; take preprocessing timing measurements; measure data validity; take postprocessing timing measurements; assess the overall sufficiency of database content; assess the sufficiency of metadata and reference data; assess data criticality and define measurement requirements; measure data set content; measure data consistency; and assess the effectiveness of measurements and controls. Support processes and skills underlie all categories.
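Among the activities listed above, the preprocessing/postprocessing data set size measurements are among the simplest automated process controls to implement. A minimal sketch, with function name, parameters, and tolerance logic assumed by me rather than taken from the DQAF:

```python
def row_count_control(pre_count: int, post_count: int,
                      expected_rejects: int = 0, tolerance: int = 0) -> dict:
    """In-line size control: compare data set size before and after a
    processing step and flag unexpected loss (or gain) of records.
    `expected_rejects` is the number of records the step is expected to
    drop (e.g., known filter rules); `tolerance` allows small deviations."""
    dropped = pre_count - post_count
    deviation = abs(dropped - expected_rejects)
    return {"dropped": dropped, "within_control": deviation <= tolerance}

# e.g. 10,000 rows received, 9,990 loaded, 10 known rejects: in control
check = row_count_control(10_000, 9_990, expected_rejects=10)
```

A timing control follows the same shape, comparing elapsed processing time or delivery time against a schedule rather than comparing row counts.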
Functions in Assessment: Collect, Calculate, Compare, Conclude
Use DQAF dimensions and measurement types to help with this process.

Results of Data Assessment
The following three slides associate deliverables with each of the measurement activities. Through these deliverables:
• Metadata is produced, including:
– Expectations related to the quality of data, based on dimensions of quality
– Objective description of the condition of data compared to those expectations
– Documentation of the relation of data's condition to processes and systems – rules, risks, relationships
• Data and process improvement opportunities can be identified and quantified, so that decisions can be made about which ones to address.

Initial Assessment

In-Line Measurement & Control

Periodic Measurement

Initial Assessment: Capturing Observations and Conclusions about the Condition of Data
Deep Dive on Initial Assessment: Data Analysis Results Template
• One of the challenges in data quality measurement is a lack of clear goals and deliverables for the data assessment process. I hope the preceding materials can help you clarify your goals for any measurement activities within an assessment.
• The Data Analysis Template should help you formulate your deliverable.
• Components
– Analysis Protocol Checklist
– Observation Sheet
– Supporting components – purpose and usage, content overview, definitions of terms, etc.
– Summarized analysis questions
Show template now…

Analysis Protocol Checklist
• A tool to enable analysts to execute data profiling in a consistent way.
• Describes the actions that should be taken during any data analysis sequence.
• Includes prompts and questions that help guide analysts in discovering potential risks within source data.
• Although the list includes a set of discrete actions that can be described individually, many of these can be executed simultaneously; for example, when reviewing the cardinality of a small set of valid values, analysts can and should be assessing the reasonability of the distribution of values.
• The checklist ensures that nothing is missed when data is profiled.

Examples from Analysis Protocol Checklist (protocols 1–5)
• Protocol 1 – Complete Data Set: Review the overall set of files to be included in profiling. Make initial comparisons between source metadata (source ERD, UDW model).
• Protocol 2 – Complete overall content: Complete the Profiling metadata tab. Identify any limitations or risks.
• Protocol 3 – Column Cardinality Analysis: Sort data values by minimum value, low to high (Column Analysis). Review columns where Min and Max values are null.
• Protocol 4 – Column Cardinality Analysis: Sort cardinality, low to high (Column Analysis). Analyze all columns where Cardinality = 1; analyze all columns where Cardinality = 2.
• Protocol 5 – Column Value Distribution Analysis: Sort cardinality, low to high (Column Analysis); use the View Details button for a second level of drill-down and select the Frequency Distribution tab. Review the value distribution for columns where Cardinality = 2 or 3.
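The checklist's cardinality, null, and distribution prompts correspond to statistics that are easy to compute automatically. A minimal Python sketch of such a column profile (function name and output fields are my own, standing in for what a profiling tool would report):

```python
from collections import Counter

def profile_column(values):
    """Minimal column profile of the kind the checklist walks through:
    cardinality, null percentage, min/max, and top frequency-distribution
    values. Analysts would review these outputs for reasonability."""
    total = len(values)
    non_null = [v for v in values if v is not None]
    freq = Counter(non_null)
    return {
        "cardinality": len(freq),
        "pct_null": round(100.0 * (total - len(non_null)) / total, 1) if total else None,
        "min": min(non_null) if non_null else None,
        "max": max(non_null) if non_null else None,
        "top_values": freq.most_common(3),
    }

# Hypothetical indicator column with nulls and a small value domain:
profile = profile_column(["P", "C", "P", None, "U", "P", None])
```

The computation is the easy part; the checklist's contribution is the questions the analyst asks of these numbers (Are these expected values? Is the distribution reasonable for this data set?).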
Examples from Analysis Protocol Checklist (protocols 11–14: Column Data Class Analysis)
• Protocol 11 – Dates: If a column contains date information, there are inherent date formats, as well as physical and logical ranges of valid values. Is the data type consistent with Date? Which date fields are nullable? Does this make sense? Is there a dependency or other relatedness between multiple date columns? Between the date column and other "non-date" columns? Do the Min and Max dates make sense? Are the column names consistent with a Date? Why or why not? When necessary, reassign the Selected Data Class. Review all other characteristics for each Date column.
• Protocol 12 – Quantities: Is the data type consistent with a Quantity? Are the values consistent with Quantity? Are the column names consistent with Quantity? Does the source distinguish between fields related to dollar amounts and fields related to other kinds of quantity data? Does the data format make sense for this column? (e.g., dollars have a precision of 2 decimals; counts of members are integers.) When necessary, reassign the Selected Data Class. Review all other characteristics for each Quantity column.
• Protocol 13 – Codes: The Code class represents instances where there is a finite set of valid values, as in those from a code table. Does the source also supply the code tables? (See Structure for referential integrity between reference and core data.) Is the cardinality consistent with what is known about the specific code set? Is the column name consistent with a code? What are the values? Are there invalid values present? Are there expected values which are missing? When necessary, reassign the Selected Data Class. Review all other characteristics for each Code column.
• Protocol 14 – Unknown: What is the cardinality of the column? Based on what you observe about the column, how should it be classified? Why? When necessary, reassign the Selected Data Class. Review all other characteristics for each Unknown column.

Examples from Analysis Protocol Checklist (protocols 43–50: Structure Analysis)
• Protocol 43 – Record Types: Identify patterns in the population of records based on status codes, record type codes, dates, or other critical fields that may be used to differentiate records of different kinds. Determine how many record types appear to be present in the data and whether records can be classified based on the type of transaction or type of data present in the record.
• Protocol 44 – Change over time: Determine how records representing variations of the same information can be understood in relation to each other. For example, one record may be an original claim; another may be an adjustment to the same claim.
• Protocol 45 – State Data: Based on record types and change over time, determine whether we can characterize the different states of data in the data set and the events that might trigger a new record or an update to an existing record.
• Protocol 46 – Age of Records: Identify any content or structural differences between older and more recent records. Older records may have been produced under different business processes.
• Protocol 49 – Source Naming Conventions, General: Review findings from overall column analysis to identify any general naming conventions.
• Protocol 50 – Naming Convention consistency: Review findings from overall column analysis to identify any inconsistencies or peculiarities in source naming conventions.

Observation List
• Designed to capture discrete, specific observations for knowledge-sharing purposes.
• Observations can be made at the column, table, file, or source level.
• Observations will be used to inform other people about the condition of data and will be repurposed as metadata. Observations should be formulated with these ends in mind.
• Each observation is recorded and associated with a relevance category, so that its importance is understood and can be confirmed.

Example Observations
• PROCEDURE_MODIFIER_2 (Naming Conventions – Consistency; Finding): Appears to be the same data as SRV_CDE_MOD, PROCEDURE_MODIFIER_3, and proc_mod_4_cd; why the different naming convention? Confirm. Category: Naming Conventions.
• PROCEDURE_MODIFIER_2 (Nullability, General; Informational): Null = 98.3%. Relevance: Limited use field.
• DIAGNOSIS_IDENTIFIER (Nullability, General; Finding): Null = 24.6%; values: 1-9, A, B, C. Are these expected values? Any missing values? Reasonable for this data set? The nulls seem to be related to PAY_RULE_INDIC = null. Is Diagnosis needed for all procedures? Relevance: Related fields; business rule.
• PAY_RULE_INDIC (Value Dist, Valid Values; Informational; Data Content): Null = 24.6%; values: P = 52.4%, C = 21.3%, U = 1.8%. Are these expected values? Any missing values? Reasonable for this data set? Relevance: Unexpected percentage NULL.

Example Observations (continued)
• FIRST_DATE_OF_SRVC (Value Dist, Range of Values; Finding; Source Edit Checks): First DOS; no reasonability checks in source. Values from 1915, the 1920s, the 1970s, 2020, and later in 2013. Future dates are allowed. Relevance: Distribution of values is questionable.
• RENDERING_PROVIDER_NPI, rndr_prov_npi_new_id (Related Columns; Risk; Business Rule): Two fields both appear to contain the rendering provider NPI. Which should be used? Relevance: Related fields.
• DETAIL_RECORD_NUM (Field Length; Informational; Structure, Keys): Defined = 2; actual = 3. Sizing opportunity. Relevance: Significant difference between actual and defined field lengths.
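Because observations are meant to be shared and repurposed as metadata, they benefit from a structured form. A minimal sketch in Python, echoing the nullability example above (the record structure and the 20% threshold are illustrative assumptions, not the deck's actual template):

```python
from dataclasses import dataclass

@dataclass
class Observation:
    column: str
    checklist_step: str
    obs_type: str        # "Finding" or "Informational"
    observation: str
    relevance: str       # a relevance category, so importance can be confirmed

def null_rate_observation(column, values, threshold=20.0):
    """Return an Observation when the percentage of NULLs exceeds
    an (illustrative) threshold; otherwise return None."""
    pct_null = 100.0 * sum(v is None for v in values) / len(values)
    if pct_null > threshold:
        return Observation(
            column=column,
            checklist_step="Nullability, General",
            obs_type="Finding",
            observation=f"Null={pct_null:.1f}%. Expected? Reasonable for this data set?",
            relevance="Unexpected percentage NULL",
        )
    return None

# Hypothetical column sample: 2 of 4 values are null
obs = null_rate_observation("PAY_RULE_INDIC", ["P", None, "C", None])
```

Formulating the observation as a question ("Expected? Reasonable?") keeps it useful for confirmation with business SMEs, as the slides recommend.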
Relevance Categories
• High cardinality where low is expected; low cardinality where high is expected
• Component for compound key; foreign key; source natural key; source primary key; logic for natural keys
• Identifying relationships; non-identifying relationships
• Inconsistent data type; inconsistent format; multiple formats in one field
• Difference between inferred and defined field lengths
• Multiple default values; unexpected percentage of records defaulted; default value differs from target; default value differs from source documentation
• Granularity differs from source documentation; granularity differs from target; inconsistent granularity
• 100% NULL, not OK (field is not populated but should be); 100% NULL but OK (field is not populated and is not expected to be)
• Related columns within a table; related columns across tables
• Invalid values present in column; non-standard values for a standardized code
• Value set differs from source documentation; value set differs from target
• Data populated as expected
• Distribution of values appears unreasonable; distribution of values is questionable
• High-frequency values are unexpected; low-frequency values are unexpected
• Unexpected percentage NULL
• Inconsistent naming conventions; naming convention differs from target

Summarized Analysis
• Tracks high-level findings at the attribute or rule level.
• Contains Yes/No questions so that analysts can reach conclusions and roll up findings.
• Questions are based on the dimensions of quality in the DQAF, and they can be associated with measurement types in the DQAF.

Summarized Analysis – Sample questions
• Does cardinality of values make sense? (Y, N, N/A): Cardinality refers to the number of items in a set. For data measurement results, cardinality means the number of values returned from the core data. High cardinality is expected on fields like procedure code, which have a large set of valid values, whereas fields like gender code have only a few values and therefore will return only a few rows. This field captures high-level reasonability related to cardinality. Unreasonable conditions include: high cardinality where low is expected, low cardinality where high is expected, and inconsistent cardinality. Note that cardinality is not always the number of rows returned. If a measurement is broken down by a source system code, then it will return a set of rows for each source, and the reasonableness of cardinality must be understood at the source level.
• Are default values present? (Y/N): Valid values = Y/N/CND (Cannot Determine). The requirements process should capture whether population of a field is optional or mandatory. This question asks whether default values are actually present in the data. In most cases, defaults are valid values. However, depending on other expectations, a high level of defaults may be a problem. If there is more than one functional default value, capture that fact in the observation sheet. For Match Rate consistency measurements, the presence of default values indicates that records have not matched.
• Is the level of default population reasonable? (Y, N, N/A, CND): This field captures high-level reasonability related to the level of defaulted records. Unreasonable levels occur when business rules indicate a field should be populated under specific conditions and it is defaulted instead. For Match Rate consistency measurements, this field should be used to capture whether the level of non-matches is reasonable. The general intention of match processes is to return 100% of records. If defaults are present, we should document the factors that have caused them. In some cases, these may be reasonable. For example, if a field defaults for particular sources because source data is not yet available, but the field is populated for sources for which data is available, then the overall level is reasonable. If there are known issues associated with the data, then the answer to this question is N, and the issues should be noted in the observations.

Summarized Analysis – Sample questions (continued)
• Are invalid values present? (Y, N, N/A): Values = Y/N. The data model should record whether NULL is allowed or not. If it is allowed, then the answer is Y. We need to record this information in order to ensure we report accurately on levels of validity.
• Does the level of invalid values represent a problem? (Y, N, CND): For all instances where the answer to "Are invalid values present?" is Y, determine whether the level of invalid values presents a risk. This question can be answered based on the distinct number of invalid values, the percentage of rows containing invalid values, or both. A special case: when the only invalid value is NULL and NULL is populated at a reasonable level, the answer here should be No.
• Does the distribution of values make sense, based on what the data represents? (Y, N, CND): This field captures high-level reasonability based on knowledge of what the field represents. Some assessment can be based on common sense (in a large set of claim data, a high level of procedure codes for office visits makes sense, whereas a large set for heart transplants would be questionable). Other assessment can be made based on similar data in other systems, or on defined rules in standard documentation. Questions should be directed at business process SMEs.
• Are there identifiable patterns based on dates data was delivered or processed (e.g., trends, spikes, etc.)? (Y, N, CND): Valid values = Y, N, CND (Cannot Determine). Baseline assessment of many consistency measures will focus on determining the degree to which there is a legitimate expectation that data will be consistent. In order to draw a conclusion, data needs to be looked at over time. Assessment of validity measures includes not only the level of validity, but also changes over time. Analysts should also identify any known events (e.g., project releases, the introduction of new sources, etc.) that have an impact on the consistency of data. Seeing changes over time requires graphing the data. Do not answer this question until the data has been reviewed graphically.

Summarized Results – Samples
• Are there patterns based on age of records, status, type codes, or any other fields related to the content of the records? (Y, N, CND): This field prompts analysts to look across the record set at any fields that might provide a deeper understanding of the reasonability of the data.
• Are there any other issues related to data content? (Y, N): This field is intended as a catch-all for any problems identified or questions raised in the course of analysis that do not fit into one of the previous categories. When such issues and questions are identified, we should not only address them for themselves; we should also review them for possible improvements to the data assessment process and this template.
• Overall reasonability: does result content make sense based on what we know of the data? (Y, N): This field should capture analysis of reasonability at the highest level. For validity and consistency measurements, it should be populated based on analysis using the Data Content Protocol. The pink fields that follow summarize a series of observations based on that protocol. They should be populated before completing the high-level field.
• Is this data meeting quality expectations? (Y, N, CND): Like the reasonability field, the quality-expectations field captures a high-level conclusion. In most cases, it should be filled out after the other data content analysis has been completed.

Results of Assessment
• I had previously asserted that few organizations articulate expectations related to the expected condition or quality of data.
• The assessment process includes uncovering and defining expectations.
• The assessment allows you to establish facts about the data and to answer questions about its condition:
– Is the data reasonable?
» Is it complete?
» Is it valid?
» Is there integrity between related tables?
– Is the data in the condition it needs to be in for use?
– If not, what is not acceptable about the data?
• The results should be in a sharable form: "data-tized" observations.
• From these findings, data consumers can define measurable requirements for the expected condition of data.

Defining Data Quality Requirements

Overview: Defining Data Quality Requirements
Defining DQ requirements: Learn how to define measurable characteristics of data and establish requirements for data quality. Review a template designed to solicit and document clear expectations related to specific dimensions of quality.

Order of Ideas
• Definitions of terms – requirement, data quality requirement, expectations, risks
• Input for the requirements process
• Asking questions
• Capturing output

Data Quality Requirements
• A requirement defines a thing or action that is necessary to fulfill a purpose.
• Data quality requirements describe characteristics of data necessary for it to be of high quality.
• Data quality content requirements define the expected condition of data in terms of quality characteristics, such as completeness, validity, consistency, and integrity.
• Data quality measurement requirements define how a particular characteristic should be measured.
• The DQ requirements process should identify data quality expectations and risks in order to make recommendations for how to measure the quality of data.
– Expectations are based on business processes and rules and on what specific data is designed to represent.
– Risks can be associated with business processes that produce data, source systems that supply data to a downstream data asset, or technical processes related to data movement and storage within the downstream data asset (e.g., transformation rules within ETL).
– Expectations and risks can be expressed in relation to dimensions of quality (i.e., the data is considered complete / valid / consistent, if…).

Assessment / Requirements Relationship
• We envision the requirements-to-assessment process as linear.
• But in most cases it is iterative, and it sometimes requires multiple iterations, because the assessment process includes uncovering and defining expectations. Likewise, the requirements process often entails assessment.
• (Diagram: we think requirements start before assessment, but they may emerge during assessment, or you may figure them out afterward.)

Input to the Requirements Process
• Data content requirements
• Business process flows
• Business rules
• Entity and attribute definitions
• Source data model
• Target data model
• Source-to-target mapping specification or functional specification
• Transformation rules
• Profiling / data assessment results
• Any other documentation that provides information about business expectations for the data
Show requirements template now…

Which Data to Focus On
• Measurements inform you about the condition of the data within a system.
• But most organizations do not have the capacity to measure everything, nor is it beneficial to do so.
• Measurement requirements should focus on critical and at-risk data. More on this in a few slides…

Risks in Business or Technical Processes
High risk = candidate measurement
• Business process complexity (high, medium, low): This field should assess the level of complexity within the business process that creates the data. The purpose of this evaluation is to identify risks associated with data production.
• Population rule complexity (high, medium, low): This field should record a high-level assessment of the complexity of the population of a field. Complexity provides a way to assess technical risk associated with data population. Complexly populated fields are candidates for consistency measurements. Valid values include: High, Medium, Low. Fields that require only formatting changes or are a direct move are considered low complexity. Fields that require minor transformations are medium complexity. Fields that are derived from multiple inputs are high complexity. Population rule complexity will be populated only when input to the DQ Measurement Requirements template includes the mapping spreadsheet / functional specification.
• Column category (direct move field, amount field, indicator, match process, other derivation, codified data in a core table, ref table field, system-generated field): This field is used to characterize the kind of data in the column, using categories that are helpful in determining whether to take a measurement and in deciding which type of measurement to apply. Values are based largely on the guidelines for standard measurement processes. Valid values include: Match Process, Codified Data, Ref Table field, UDW system-generated field. The column category will be populated only when input to the DQ Measurement Requirements template includes the mapping spreadsheet / functional specification.
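The population-rule complexity guideline above (direct moves and formatting-only changes are low, minor transformations medium, multi-input derivations high) can be sketched as a simple classifier. This is an illustrative sketch; the transformation descriptors are invented, not fields of the actual requirements template:

```python
def population_rule_complexity(transformation):
    """Assess technical risk of a field's population rule, following the
    guideline: direct move / formatting-only = low, minor transformation =
    medium, derivation from multiple inputs = high."""
    kind = transformation["kind"]             # illustrative descriptor
    inputs = transformation.get("inputs", 1)  # number of source fields
    if inputs > 1 or kind == "derivation":
        return "High"    # derived from multiple inputs: consistency-measurement candidate
    if kind in ("direct_move", "format_only"):
        return "Low"     # formatting change or direct move
    return "Medium"      # minor transformation

# Hypothetical mapping-spec entries
assert population_rule_complexity({"kind": "direct_move"}) == "Low"
assert population_rule_complexity({"kind": "minor_transform"}) == "Medium"
assert population_rule_complexity({"kind": "derivation", "inputs": 3}) == "High"
```

As the slide notes, this assessment only makes sense when the mapping spreadsheet or functional specification is available as input.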
Risks in Business or Technical Processes (continued)
High risk anywhere here = candidate measurement
• Are there any known risks within the business process that produces this data? (Y, N, CBD): This field should capture whether there are any known limitations of the business process that produces the data. If there is no information about known risks, populate with CBD (cannot be determined).
• Describe known business process risks: This field should describe known limitations of the business process that produces the data. If this information is documented elsewhere, include a link or reference to it. If additional space is needed, create another tab.
• Are there any known risks or issues associated with this data in the source systems that supply it? (Y, N, CBD): This field should capture whether there are any known limitations of the data within the systems that supply it. Risks can be associated with direct or originating systems. If there is no information about known risks, populate with CBD.
• Describe known source system risks: This field should describe known risks of the systems that supply the data. If this information is documented elsewhere, include a link or reference to it. If additional space is needed, create another tab.

Expectations for Completeness of Column Population
High criticality = candidate measurement
• Attribute criticality (high, medium, low): This field should record a high-level assessment of an attribute's criticality for business purposes. The analyst should populate it based on data knowledge and common sense. The draft assessment should be reviewed by the business. Highly critical fields are candidates for consistency measurements. Valid values include: High, Medium, Low. High indicates that the data is very critical. Attributes may be critical in and of themselves, or they may be critical because they serve as input into derivation processes.
• Population expectation (Mandatory vs. Optional): This field should record whether a field is expected always to be populated (mandatory) or whether population is not always expected (optional). Ideally, information about optionality should be captured in the model and obtained via the MID. Valid values include: Mandatory, Optional, TBD.
• If optional, identify the conditions under which the field is populated: For fields where population is not mandatory, this field records the conditions under which the data will not be populated (or under which it will be populated, whichever is simpler to express).
• Defaults allowed? (Y/N): If the data quality requirements process is executed as part of development, this field indicates whether defaults are allowed. Such information should be captured in the model. If the process is executed against existing data, and actual data is available for inspection, the field should record whether defaults are present.
• Standard default value: In this field, capture the specific value that is expected to be populated when the field is defaulted.
• Under what conditions are defaults allowed?: For fields where defaults are allowed, this field records the conditions under which the data will or might be defaulted.

Expectations for Validity of Column Population
Clear criteria for validity = clear criteria for measurement
• Criteria for validity: This field captures all defined criteria for validity. For example, it may name a range of valid values, a source of valid values, or a rule that is associated with determining validity. The criterion for a DIAGNOSIS Code field might be: ICD diagnosis codes valid at the time of the claim; or see the DIAGNOSIS_CODE table. It is not the intention of this field to capture valid values or to duplicate information in code tables.
• Business rules associated with the population of the field: This field captures any business rules associated with the field. For example, if a health condition is related to a workers' compensation claim, then the workers' comp indicator must be 'Y' and the workers' comp claim number must be populated. Often rules state the relationship between fields, so ensure that you note the information for both fields.
• Other expectations based on business processes: This field should include any additional data quality expectations shared by business SMEs or other data consumers: for example, whether fields are related to each other, whether there are differences in population based on different types of records, etc.
• Action if data does not meet expectations (Keep / Reject): This field records what action the business wants the data store to take if data does not meet the expectations for population, validity, or defined business rules. Valid values are to keep the data and allow the record to be inserted despite the defect, or to reject the record.

Compare Documented Requirements to Results of Data Analysis
• Percentage of records not populated (NULL or other default value): If profiling information is available, record the percentage of defaults for the column.
• Test of population conditions for optional fields (under what conditions is the field populated / not populated?): For fields where population is not mandatory, this field tests the documented conditions under which the data will not be populated (or under which it will be populated, whichever is simpler to express).
• Defaults present? (Y/N): This field records whether or not default values are present. Valid values include: Y (defaults are present), N (defaults are not present), Multi (more than one functional default is present), and CND (cannot determine whether defaults are present).
• Unexpected defaults? (Y/N) (default differs from documentation, more than one functional default, etc.): In this field, capture whether there are any unexpected characteristics related to rows where the field is defaulted. For example, are any values other than the standard default being used to default the field (functional defaults)? Is the standard value not being used at all? Is more than one value being populated when the field is defaulted?
• Default percentage reasonable? (Y/N): This field should be populated with the analyst's assessment of whether the level of defaulted data is reasonable, based on an understanding of what the data represents. If the response is No, the reasons for drawing this conclusion should be recorded in the "Observations on existing population" field.

Compare Documented Requirements to Results of Data Analysis (continued)
• Criteria for validity met? (Y/N): This field should record whether actual values in the data meet the documented expectations for validity. For example, if the criteria for a Procedure Code field stipulate that the field should contain only industry-standard codes, but it also contains homegrown procedure codes, then it has not met the criteria for validity.
• Observations on existing population: This field should capture any observations on existing data: for example, why the default level is unreasonable, what the functional defaults are, whether the distribution of values in the column is reasonable, etc. If there is a need to record multiple observations, consider creating an observation tab based on the Baseline Assessment template; or, if there is a distinct category of observation, add a column to this template to capture it.
• Known issues from data analysis / profiling: This field should indicate whether previous analysis identified any known issues with the data. The field does not need to detail those issues, but it should reference other documents.
• Risk assessment: This field captures a high-level assessment of the risk associated with the attribute, based on knowledge of the source or gained through analysis. Valid values: High, Medium, Low.

Measurement ROI
• Once you have identified risks and assessed criticality, you can associate any data element or rule with one of four quadrants.
• High-risk, high-criticality data is the data that is worth monitoring.

Measurement Decision Tree – Criticality
• Initial assessment produces assessed data sets.
• Where data does not meet expectations, critical data feeds improvement projects.
• Where data meets expectations, critical data gets ongoing monitoring; less critical data gets periodic reassessment.

Measurement Decision Tree – Criticality / Risk
• Initial assessment produces assessed data sets.
• Critical data: high risk → risk mitigation; low risk → ongoing monitoring.
• Less-critical data: high risk → ongoing monitoring; low risk → periodic reassessment.

Results of Requirements Process
• Similar to those of the assessment process.
• A set of assertions about the expected condition of the data, focused on:
– Completeness
– Consistency
– Validity
– Integrity
• These can be defined as measurements:
– For customer address, ZIP code must always be valid for addresses in the US.
– Measure / monitor the level of invalid ZIP codes.
• They can be shared with data consumers to ensure there is knowledge of the actual condition of the data and consensus about the expected condition.
• This information is valuable as metadata.
• Patterns of requirements can be associated with DQAF measurement types so that common processes can be set up to take sets of similar measurements.
Show examples from filled-in template…
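The ZIP code requirement above can be turned into a repeatable validity measurement. A minimal sketch, with illustrative assumptions: the field names are invented, and a five-digit pattern stands in for a real check against a reference table of valid ZIP codes:

```python
import re

# Illustrative validity rule: US ZIP codes are 5 digits,
# optionally followed by a 4-digit extension (ZIP+4).
ZIP_PATTERN = re.compile(r"^\d{5}(-\d{4})?$")

def invalid_zip_rate(records):
    """Measure the level of invalid ZIP codes among US addresses."""
    us = [r for r in records if r.get("country") == "US"]
    if not us:
        return 0.0
    invalid = sum(1 for r in us if not ZIP_PATTERN.match(r.get("zip") or ""))
    return 100.0 * invalid / len(us)

# Hypothetical customer address records
records = [
    {"country": "US", "zip": "06103"},
    {"country": "US", "zip": "6103"},       # leading zero lost: invalid
    {"country": "US", "zip": "06103-1234"},
    {"country": "US", "zip": None},          # not populated: invalid
    {"country": "CA", "zip": "K1A 0B1"},     # non-US: outside the rule
]
rate = invalid_zip_rate(records)   # 2 invalid out of 4 US records
```

Run on a schedule, a measurement like this yields the trend-over-time data that the monitoring examples in the next section depend on.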
Measurement from Requirements Through Support (diagram)

Using Measurement for Improvement

Using Measurements for Improvement
Goal: Share examples of measurements that contribute to the ongoing improvement of data quality.
• Member Coverage to Medical Claim match process: complex derivation across data domains.
• Earliest Service Date derivation: complex derivation within a data domain.
• Member reconciliation process: comparison of source and target data.
• All three show the benefit of initial assessment and ongoing measurement / monitoring.

Member Coverage ID Population on Medical Claim Data
• Member Coverage ID is populated through a lookup to member data.
• Business rule: each medical claim should be associated with one and only one member with medical coverage at the time of first service.
• The field was populated at only 88% (12% defaulted), and the population rate was declining. The low point was 83% population.
• The root cause appeared to be overlapping timelines in the member coverage data. This problem was addressed and the rate improved, but then leveled out at ~88%.
• An additional root cause was identified: the population of Earliest Service Date on the medical claim records.

Measurement Results Before and After Logic Change (graph)
Keep this graph in mind. We will come back to it.

Earliest Service Date on Medical Claim Data
• The attribute was being populated in an existing data warehouse table.
• Assessment showed 13% of records had defaulted dates (1/1/1900).
• Background: claims are stored at two levels:
– The Header or Event level contains data related to ALL services rendered through the claim (member and subscriber information, provider information, information on how the claim was submitted, status of the claim, and the date range for the set of services provided).
– The Service or Line level includes details related to each service (procedure codes, service dates, type of service line, places of service, etc.).
• There can be one or many service records for each header record.
• Populating Earliest Service Date on the header record requires a lookup to the service records.
• Review of the derivation rules showed that the logic was being applied only to a subset of service lines, based on the service line type code.
• The solution included extending this logic to all service lines.

Measurement Results Before and After Logic Change (graph)

Measurement Results Before and After the Second Logic Change
This is the graph I asked you to keep in mind. (Control chart of individual values: % MBR COV ID populated, by month, as of March 2014; X-bar = 88.7, UCL = 93.60, LCL = 83.80.)

Source / Target Reconciliation Measurement
The flow chart shows a generic comparison. In the case of the measurement we were taking, exact correspondence was expected between source and target records.

Overstated in Target – Records Active in Target System / Inactive in Source
(Graph: count of overstated records, OVR_UNDR_IND set to '2', by load date, 2011–2014.)
• When there is an overstatement, the problem should self-correct (see 7/2011 and 11/2011). In 2013, it did not self-correct.
• Root cause: the target system did not receive de-activation records. The same records keep getting measured.

Understated in Target – Target System Is Missing Records
(Graph: count of understated records, OVR_UNDR_IND set to '1', by load date, 2011–2013.)
• If this number spikes, the target system is missing records.
• Root cause: an out-of-date exclusion rule was preventing records from being loaded.

Parting Thoughts
• Know your data.
• If you don't already know your data, get to know it through assessment and measurement.

THANK YOU! Questions?
Contact information
Laura Sebastian-Coleman, Ph.D., IQCP
Laura.Sebastian-Coleman@optum.com
860 221 0422