Course Name: Business Intelligence
Year: 2009
Data Quality and Information Compliance (14th Meeting)
Source of this Material: Loshin, David (2003). Business Intelligence: The Savvy Manager's Guide. Chapter 9.
Bina Nusantara University

The Business Case
According to the 2001 PricewaterhouseCoopers Global Data Management Survey, fully 75% of the respondents (senior-level executives) had experienced significant problems as a result of defective data. These problems included:
• Extra costs to prepare reconciliations
• Delays or scrapping of new systems
• Failure to collect receivables
• Inability to deliver orders
• Lost sales
According to the Data Warehousing Institute's report Data Quality and the Bottom Line, poor-quality customer data costs U.S. businesses $611 billion a year (and this refers just to customer data; consider how much all other kinds of poor-quality data can cost).

More Than Just Names and Addresses
What is particularly interesting is that some of these assessments of the commercial costs of poor data quality are based on relatively simple metrics related to incorrect names and addresses. Data quality is defined in terms of how each data consumer wants to use the data, so we must discuss some dimensions across which data quality can be measured.
• Data Quality Dimensions
Although there are potentially many dimensions of data quality dealing with data models, data domains, and data presentation, the ones that usually attract the most attention are those that deal with data values, namely accuracy, completeness, consistency, and timeliness.
• Accuracy
Accuracy refers to the degree to which data values agree with an identified source of correct information.
• Completeness
Completeness refers to the expectation that data instances contain all the information they are supposed to.
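The completeness dimension above can be turned into a simple measurement. The following is a minimal sketch, not taken from Loshin's text; the required-field list and the sample customer records are invented for illustration.

```python
# Measuring the completeness dimension: the fraction of required
# fields in a record that actually carry a value.

REQUIRED_FIELDS = ["name", "address", "phone"]  # hypothetical attribute list

def completeness(record: dict) -> float:
    """Return the share of required fields that are non-empty in a record."""
    filled = sum(1 for f in REQUIRED_FIELDS if record.get(f) not in (None, ""))
    return filled / len(REQUIRED_FIELDS)

# Invented sample records: one complete, one with two missing values.
customers = [
    {"name": "Ann Lee", "address": "12 Oak St", "phone": "555-0101"},
    {"name": "Bob Tan", "address": "", "phone": None},
]

for c in customers:
    print(c["name"], round(completeness(c), 2))
```

In practice such a score would be aggregated per column or per table and compared against the data consumers' defined expectations rather than inspected record by record.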
More Than Just Names and Addresses (cont.)
• Consistency
Consistency refers to data values in one data set being consistent with values in another data set.
• Currency/Timeliness
Currency refers to the degree to which information is current with the world that it models. Currency can measure how up to date information is and whether it is correct despite possible time-related changes. Timeliness refers to the time expectation for accessibility of information. Timeliness can be measured as the time between when information is expected and when it is readily available for use.
• Data Quality Expectations
The level of data quality is determined by the data consumers in terms of meeting or beating their own defined expectations.

Types of Errors
In this section we look at some common sources of errors.
• Attribute Granularity
The granularity of an attribute refers to how much information is embedded within that attribute.
• Finger Flubs and Transcription Errors
Many data errors creep in right at the introduction into the processing stream. At a data entry or transcription stage, individuals or systems are likely to introduce variations in spellings, abbreviations, phonetic similarities, transposed letters inside words, misspelled words, miskeyed letters, and so on.
• Floating Data
Imprecision in metadata modeling may result in a lack of clarity as to whether certain kinds of data go into specific fields.
• Implicit and Explicit Nullness
The question of nullness is interesting, because the absence of a value may provide more fodder for making inferences than the presence of an incorrect value. Examples of explicit nulls are shown in Figure 14-1.

Types of Errors (cont.)
Figure 14-1
• Semistructured Formats
Data records in regular databases are formatted in a structured way; information in free text is said to be unstructured.
Somewhere in the middle is what we call semistructured data, which is essentially free text with some conceptual connectivity.
• Strict Format Conformance
We naturally tend to view information in patterns, and frequently data modelers will reverse this order and impose a pattern or set of patterns on a set of attributes that may actually prevent those fields from being properly filled in.

Types of Errors (cont.)
• Transformation Errors
Errors may be introduced during the extraction and transformation process. If the transformation rules are not completely correct, or if there is a flaw in the transformation application, errors will be created where none originally existed.
• Overloaded Attributes
Either as a reflection of poor data modeling or as a result of changing business concerns, there may be information stored in one data field that actually contains more than one data value.

Data Cleansing
A large part of the cleansing process involves identification and elimination of duplicate records. The data cleansing process consists of phases that try to understand what is in the data, transform it into a standard form, identify incorrect data, and then attempt to correct it.
• Parsing
The first step in data cleansing is parsing, which is the process of identifying meaningful tokens within a data instance and then analyzing token streams for recognizable patterns. The parsing process segregates each word, attempts to determine the relationship between each word and previously defined token sets, and forms sequences of tokens.
• Standardization
Standardization is the process of transforming data into a form specified as a standard. That standard may be one defined by some governing body, or it may just refer to the segregation of subcomponents within a data field that has been successfully parsed.
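The parsing and standardization steps above can be sketched in a few lines. This is a toy illustration, not the book's method: the token set (street-type abbreviations) and the sample address are invented, and a production cleansing tool would use far richer token dictionaries and pattern matching.

```python
# Parsing: split a raw string into tokens and classify each one against a
# predefined token set. Standardization: emit the tokens in a standard form.

STREET_TYPES = {"st": "Street", "ave": "Avenue", "rd": "Road"}  # hypothetical token set

def parse_address(raw: str):
    """Tokenize a raw address string and tag each token with a type."""
    tokens = raw.replace(",", " ").split()
    parsed = []
    for tok in tokens:
        key = tok.lower().rstrip(".")
        if key in STREET_TYPES:
            parsed.append(("street_type", STREET_TYPES[key]))  # standardized form
        elif tok.isdigit():
            parsed.append(("number", tok))
        else:
            parsed.append(("word", tok.title()))
    return parsed

def standardize(parsed) -> str:
    """Join the standardized token values back into a single string."""
    return " ".join(value for _, value in parsed)

print(standardize(parse_address("12 oak st.")))
```

Note that this same token-lookup mechanism is what drives abbreviation expansion: once "st." is recognized as a street-type token, a transformational rule rewrites it as "Street".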
Data Cleansing (cont.)
• Abbreviation Expansion
An abbreviation is a compact representation of some alternate recognized entity, and finding and standardizing abbreviations is another rule-oriented aspect of data cleansing. Abbreviations must be parsed and recognized, and then a set of transformational business rules can be used to change abbreviations into their expanded form.
• Correction
Once the components of a string have been identified and standardized, the next stage of the process attempts to correct those data values that are not recognized and to augment correctable records with the corrected information. The correction process is based on maintaining a set of incorrect values and their corrected forms.
• Updating Missing Fields
One aspect of data cleansing is being able to fill fields that are missing information. It is important to understand, though, that without a well-documented and agreed-to set of rules determining how to fill in a missing field, it can be counterproductive and dangerous to fill in missing values.

Business Rule-Based Information Compliance
Information compliance is a concept that incorporates the definition of business rules for measuring the level of conformance of sets of data with client expectations.
• A Data Quality Rule Framework
Our framework for articulating quality expectations looks at how the data is used and how we can express rules from this holistic approach. It can be decomposed into the definition of metadata-like reference data sets and assertions that relate to values, records, columns, and tables within a collection of data sets.
• Domains
A domain is any value set that has structural rules or explicit semantic rules governing validity. Either way, these expectations restrict the values that an attribute takes. Whether the rules are syntactic or semantic, we can define an explicit set of restrictions on a set of values within a type and call that a domain.
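A domain and the rules asserted against it can be sketched directly in code. The following is a hedged illustration under invented assumptions: the state-code domain, the null representations, and the field names are all hypothetical, and a real compliance application would load domains and rules from managed reference data.

```python
# A domain as an explicit value set, plus two rule types asserted over it:
# domain membership (value must come from the domain) and null conformance
# (a missing value must use one of the defined null representations).

STATE_DOMAIN = {"NY", "CA", "TX"}            # hypothetical domain
NULL_REPRESENTATIONS = {None, "", "N/A"}     # hypothetical defined null forms

def check_domain_membership(value, domain) -> bool:
    """Domain membership rule: the attribute's value is drawn from the domain."""
    return value in domain

def check_null_conformance(value, nulls) -> bool:
    """Null conformance rule: an absent value uses a defined null representation."""
    return value in nulls

def conforms(value) -> bool:
    """A value conforms if it is a domain member or a well-formed null."""
    return (check_domain_membership(value, STATE_DOMAIN)
            or check_null_conformance(value, NULL_REPRESENTATIONS))

for v in ["NY", "N/A", "ZZ"]:
    print(v, conforms(v))
```

Counting how many records pass rules like these, run after run, is what supports the continuous monitoring and improvement cycle described later in this material.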
• Mappings
A mapping is a relation between domain A and domain B, defined as a set of pairs of values {a, b} such that a is a member of domain A and b is a member of domain B.

Business Rule-Based Information Compliance (cont.)
• Null Conformance
If nulls are allowed, the rule specifies that if a data attribute's value is null, it must use one of a set of defined null representations.
• Value Restrictions
A value restriction describes some business knowledge about a range of values; a value restriction rule constrains values to be within the defined range.
• Domain and Mapping Membership
Domain membership asserts that an attribute's value is always taken from a previously defined data domain. A mapping membership rule asserts that the relation between two attributes or fields is restricted based on a named mapping.
• Completeness and Exemption
A completeness rule specifies that when a condition is true, a record is incomplete unless all attributes on a provided list are non-null. An exemption rule says that if a condition is true, then the attributes in a named list should not have values.

Business Rule-Based Information Compliance (cont.)
• Consistency
Consistency refers to maintaining a relationship between two (or more) attributes based on the content of those attributes. A consistency rule indicates that if a particular condition holds true, then a consequent must also be true.
• Continuous Data Quality Monitoring and Improvement
We iteratively improve data quality by identifying sources of poor data quality, asserting a set of rules about our expectations for the data, and implementing a measurement application using those rules.

Management Issues
Data quality is probably the most critical aspect of the business intelligence process, and it is a problem that needs to be addressed.
But be assured that some investment must be made (preferably at the start of the project) in ensuring high levels of data quality.
• Pay Now or Pay (More) Later
The most critical management issue associated with data quality is the lack of a true understanding of its importance. No BI program can be successful if the data that feeds that program is faulty.
• Personalization of Quality
The exposition of data quality problems may be viewed by an employee as a personal attack, and therefore something to be avoided at all costs.
• Data Ownership and Responsibilities
A formal structure for data ownership and a responsibility chain should be defined before the data integration component of implementation.

Management Issues (cont.)
• Correction Versus Augmentation
Another important distinction is between the repetitive application of the same corrections to the same data each time that data set is extracted from its sources and brought into the BI environment, and a true data quality improvement process.