Course Name: Business Intelligence
Year: 2009

Data Profiling
13th Meeting

Source of this Material
(2). Loshin, David (2003). Business Intelligence: The Savvy Manager’s Guide. Chapter 8

Bina Nusantara University

The Business Case
No business intelligence program can be built without information, and that information may come from many different sources and providers, each of which may have little or no stake in the success of the BI program. When faced with an integration process that incorporates disparate data sets of dubious quality, data profiling is the first step toward adding value to that data. Data profiling automates the initial processes of what we might call inferred metadata resolution: discovering what the data items really look like and providing a characterization of that data for the next steps of integration.

Data Profiling Activities
Data profiling is a hierarchical process that attempts to build an assessment of the metadata associated with a collection of data sets. The bottom level of the hierarchy characterizes the values associated with individual attributes. At the next level, the assessment looks at relationships between multiple columns within a single table. At the highest level, the profile describes relationships that exist between data attributes across different tables. One other item to keep in mind while profiling data is that the most significant value is derived from discovering business knowledge that has been embedded in the data itself.

Data Model Inference
When presented with a set of data tables of questionable origin, a data consumer may want to verify or discover the data model that is embedded within that set. This is a hierarchical process that first focuses on exposing information about the individual columns within a single table, and then resolves relationships between different tables to generate a proposed data model for the data set.
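The two stages just described can be illustrated with a minimal sketch. It assumes each table is a list of dictionaries. The table names, column names, and helper functions (candidate_keys, proposed_references) are hypothetical and are not taken from the source material. Stage 1 looks at the columns of one table, and stage 2 compares value sets across two tables to propose a reference.

```python
def candidate_keys(rows):
    """Stage 1: columns whose values are all present and unique in one table."""
    keys = []
    for col in rows[0]:
        values = [r[col] for r in rows]
        if None not in values and len(set(values)) == len(values):
            keys.append(col)
    return keys


def proposed_references(child_rows, parent_rows):
    """Stage 2: child columns whose values all appear in a parent candidate key."""
    proposals = []
    for key in candidate_keys(parent_rows):
        key_values = {r[key] for r in parent_rows}
        for col in child_rows[0]:
            if {r[col] for r in child_rows} <= key_values:
                proposals.append((col, key))
    return proposals


# Hypothetical sample tables, for illustration only.
customers = [
    {"cust_id": 1, "city": "Jakarta"},
    {"cust_id": 2, "city": "Bandung"},
    {"cust_id": 3, "city": "Jakarta"},
]
orders = [
    {"order_id": 10, "cust": 1},
    {"order_id": 11, "cust": 3},
    {"order_id": 12, "cust": 1},
]

print(candidate_keys(customers))               # ['cust_id']
print(proposed_references(orders, customers))  # [('cust', 'cust_id')]
```

A result such as ('cust', 'cust_id') is only a proposal: value containment on a small sample can occur by coincidence, so an analyst would still need to confirm the relationship.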
• Simple Type Inference
When a data set is first introduced into a data environment, even if the set is accompanied by a corresponding data definition, the analyst may choose to verify the declared types through the profiling process. This is done through simple type inference, which is a process of resolving the view of each column’s data type to its most closely matched system data type. Data type inference is an iterative analysis of a value set to refine the perceived data type. Simple type inference centers only on assigning system data types (such as integer, decimal, date) to columns.

• Table Model Inference and Relational Model Inference
There are two approaches to resolving a relational model from a collection of data tables. The first is a brute-force approach that uses the results of overlap analysis to determine whether any pair of columns exhibits a key relationship. The second is a more semantic approach that evaluates column names to see whether there is any implied relation.

Attributes Analysis
Attribute analysis is a process of looking at all of the values populating a particular column as a way to characterize that set of values. Attribute analysis is the first step in profiling because it yields a significant amount of metadata relating to the data set. The result of any of these analyses provides greater insight into the business logic that is applied to each column. Typically this evaluation revolves around the following aspects of a data set.

• Range Analysis
Although relating a value set to a simple type already restricts the set of values that a column can take, most data types still allow for a practically unlimited number of possible values.
Range analysis can be used in an intelligence application to explore minimum and maximum values of interest, perhaps related to customer activity and monthly sales or to the prices customers are being charged for the same products at different retail locations.

• Sparseness
The degree of sparseness may indicate some business meaning regarding the importance of that attribute. Depending on the value set, it probably means one of two things.

• Format Evaluation
It is useful to look for the existence of patterns that might characterize the values assigned to a column. We can use a discovered pattern definition as a validation rule, which we would then add to a metadata database of domain patterns. Simple examples of rule-based data domains include telephone numbers, ZIP codes, and Social Security numbers.

• Cardinality and Uniqueness
The cardinality of a value set is the number of distinct values that exist within a column. Cardinality is interesting because it relates to different aspects of the correctness of the value set and because how it relates exposes business knowledge. Cardinality can be used to find columns whose values are unique, from which candidate keys can be inferred.

• Frequency Distribution
The frequency distribution of values yields the number of times each of the distinct values appears in a value set. Frequency distribution is also useful when looking for variations from the norm that might indicate something suspicious.

• Value Absence
There are actually two problems associated with the absence of values that can be explored through data profiling. The first involves looking for values that are not there, and the second is to look for nonvalues that are there.

• Abstract Type Recognition
An abstract type is a more semantically descriptive qualification of a type definition that conveys business meaning.
Typically, abstract data types are represented by some kind of semantic definition, including:
  - Constructive assertion
  - Value enumeration
  - Pattern conformance
The goal of abstract type analysis is to propose an abstract data type for a specific column based on a suggestive statistical conformance to one defined abstract type.

• Overloading
It is possible that as an attribute’s data values are checked against known domains, the profiling process will see significant matches against more than one domain. This might indicate that two attributes’ worth of information is in one column, where the same column is being used to represent more than one actual attribute (Fig. 13-1). Alternatively, the use of more than one domain by a single attribute might indicate that more complex business rules are in effect, such as the existence of a split attribute, which is characterized by the use of different domains based on other data quality or business rules. Overloading can appear in other ways as well, such as the compaction of multiple pieces of data into a single character string.

Figure 13-1

Relationship Analysis
Relationship analysis, frequently referred to in the data profiling space as cross-column analysis, focuses on establishing relationships between sets of data. The goal of these processing stages is to identify relationships between value sets and known reference data, to identify dependencies between columns (either in the same table or across different tables), and to identify key relationships between columns across multiple tables.

• Domain Analysis
Domain analysis covers two tasks: identifying data domains and identifying references to data domains. The brute-force method for identifying enumerated domains is to look at all possible value sets.

• Functional Dependency
A functional dependency establishes a relationship between two sets of attributes.
If the relationship is causal (i.e., the dependent attribute’s value is filled in as a function of the defining attributes), that is an interesting piece of business knowledge that can be added to the growing knowledge base. If the relationship is not causal, then that piece of knowledge can be used to infer information about normalization of the data. If a pair of data attribute values is consistently bound together, then those two columns can be extracted from the targeted table and the instance pairs inserted uniquely into a new table and assigned a reference identifier.

• Key Relationships
A table key is a set of attributes that can be used to uniquely identify any individual record within the table. When one table’s key is used as a reference from another table, that key is called a foreign key. Modern relational databases enforce a constraint known as referential integrity, which states that if an attribute’s value is used in table A as a foreign key to table B, then that key value must exist in one record in table B. There are two aspects to profiling key relationships: identifying that a key relationship exists, and identifying what are called orphans in a violated referential integrity situation.

Management Issues
The most significant management issues involve the relatively steep costs of good data profiling tools and the performance of these tools. It is worthwhile to explore why the performance of these tools is questionable. First, some of the algorithms used in data profiling are quite computationally intensive, and it is not unusual for some of the analysis to require both large amounts of computational resources (memory, disk space) and time to complete successfully.
Second, because the computations are summaries of frequency analysis and counts, the results presented tend to be almost endless, with long lists of values, each of which may have appeared only once in a column.
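One possible way to keep frequency results readable is to report only the most common values and to collapse values that occur once into a single count. The sketch below is an illustration under that assumption, not a description of how any profiling tool works. The function name summarize_frequencies, the top_n cutoff, and the sample column are hypothetical.

```python
from collections import Counter


def summarize_frequencies(values, top_n=5):
    """Condense a column's frequency distribution into a short report.

    `values` is any iterable of column values; `top_n` is a hypothetical
    cutoff for how many of the most common values are listed individually.
    """
    counts = Counter(values)
    # Values seen exactly once are counted instead of being listed one by one.
    singletons = sum(1 for c in counts.values() if c == 1)
    return {
        "rows": sum(counts.values()),
        "distinct": len(counts),  # cardinality of the value set
        "top": counts.most_common(top_n),
        "singletons": singletons,
    }


# Hypothetical sample column, for illustration only.
column = ["NY", "NY", "CA", "NY", "CA", "TX", "WA", "OR"]
report = summarize_frequencies(column, top_n=2)
print(report)
# {'rows': 8, 'distinct': 5, 'top': [('NY', 3), ('CA', 2)], 'singletons': 3}
```

The trade-off is that rare values are hidden behind a count, so an analyst looking for suspicious outliers would still need a way to drill into them.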