Course Name: Business Intelligence
Year: 2009
Data Profiling
13th Meeting
Source of this Material
(2).
Loshin, David (2003). Business Intelligence:
The Savvy Manager’s Guide. Chapter 8
Bina Nusantara University
The Business Case
No business intelligence program can be built without information, and that
information may come from many different sources and providers, each of
which may have little or no stake in the success of the BI program.
When faced with an integration process incorporating disparate data sets of
dubious quality, data profiling is the first step toward adding value to that data.
Data profiling automates the initial processes of what we might call inferred
metadata resolution: discovering what the data items really look like and
providing a characterization of that data for the next steps of integration.
Data Profiling Activities
Data profiling is a hierarchical process that attempts to build an assessment of
the metadata associated with a collection of data sets.
The bottom level of the hierarchy characterizes the values associated with
individual attributes. At the next level, the assessment looks at relationships
between multiple columns within a single table. At the highest level, the profile
describes relationships that exist between data attributes across different
tables.
One other item to keep in mind while profiling data is that the most significant
value is derived from discovering business knowledge that has been embedded
in the data itself.
Data Model Inference
When presented with a set of data tables of questionable origin, a data
consumer may want to verify or discover the data model that is embedded
within that set. This is a hierarchical process that first focuses on exposing
information about the individual columns within a single table, and then
resolves relationships between different tables to generate a proposed data
model for a data set.
• Simple Type Inference
When a data set is first introduced into a data environment, even if the set is
accompanied by a corresponding data definition, the analyst may choose to verify the
corresponding types through the profiling process. This is done through simple type
inference, which is a process of resolving the view of each column’s data type to its
most closely matched system data type. Data type inference is an iterative analysis of
a value set to refine the perceived data type. Simple type inference centers only on
assigning system data types (such as integer, decimal, date) to columns.
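The iterative refinement described above can be sketched in Python. This is only an illustration under stated assumptions: the function name infer_simple_type, the three candidate types, and the ISO date format are hypothetical choices, not part of the source material.

```python
from datetime import datetime

def infer_simple_type(values, date_format="%Y-%m-%d"):
    """Return the narrowest type name that every non-empty value fits.

    Assumption: values arrive as strings, and "integer", "decimal" and
    "date" are the only system types under consideration.
    """
    candidates = ["integer", "decimal", "date"]

    def fits(value, type_name):
        try:
            if type_name == "integer":
                int(value)
            elif type_name == "decimal":
                float(value)
            else:
                datetime.strptime(value, date_format)
            return True
        except ValueError:
            return False

    non_empty = [v.strip() for v in values if v and v.strip()]
    # Each value narrows the list of types that are still plausible.
    for value in non_empty:
        candidates = [t for t in candidates if fits(value, t)]
        if not candidates:
            break
    return candidates[0] if candidates and non_empty else "string"
```

Each value removes the types it cannot be parsed as, so the perceived type is refined one value at a time. A column that fits none of the candidates falls back to "string".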
• Table Model Inference and Relational Model Inference
There are two approaches to resolving a relational model from a collection of data
tables. The first is a brute-force approach that uses the results of overlap analysis to
determine whether any pair of columns exhibits a key relationship. The second
approach is more of a semantic approach that evaluates column names to see if
there is any implied relation.
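A minimal Python sketch of the brute-force approach follows. It assumes the tables fit in memory as dictionaries of column lists. The names overlap_ratio and find_key_candidates and the threshold parameter are hypothetical.

```python
def overlap_ratio(candidate_fk, candidate_pk):
    """Fraction of distinct values in candidate_fk that also occur in candidate_pk."""
    fk_values = set(candidate_fk)
    if not fk_values:
        return 0.0
    return len(fk_values & set(candidate_pk)) / len(fk_values)

def find_key_candidates(tables, threshold=1.0):
    """tables: {table_name: {column_name: [values]}}.

    Return ((table, column), (table, column)) pairs where the second column
    holds unique values and covers at least `threshold` of the first
    column's distinct values.
    """
    columns = [(t, c, vals) for t, cols in tables.items() for c, vals in cols.items()]
    pairs = []
    # Brute force: compare every column with every column of the other tables.
    for t1, c1, v1 in columns:
        for t2, c2, v2 in columns:
            if t1 == t2:
                continue
            is_unique = len(set(v2)) == len(v2)
            if is_unique and overlap_ratio(v1, v2) >= threshold:
                pairs.append(((t1, c1), (t2, c2)))
    return pairs
```

The nested loop makes the cost grow with the square of the number of columns, which is one reason this approach is called brute force.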
Attributes Analysis
Attribute analysis is a process of looking at all of the values populating a
particular column as a way to characterize that set of values. Attribute analysis
is the first step in profiling because it yields a significant amount of metadata
relating to the data set. The results of these analyses provide greater
insight into the business logic that is applied to each column.
Typically this evaluation revolves around the following aspects of a data set.
• Range Analysis
Relating a value set to a simple type already restricts the set of values that a column
can take, but most data types still allow for an infinite number of possible choices. Range
analysis can be used in an intelligence application to explore minimum values of
interest, perhaps related to customer activity and monthly sales or to the prices
customers are being charged for the same products at different retail locations.
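As an illustration, a range profile of a numeric column might be computed as below. The function name range_profile and the optional low and high bounds are assumptions made for this sketch. Reporting the maximum alongside the minimum is also an addition here.

```python
def range_profile(values, low=None, high=None):
    """Summarize the observed range of a column, ignoring missing values.

    low/high are optional expected bounds (hypothetical parameters);
    values outside them are reported for review.
    """
    present = [v for v in values if v is not None]
    out_of_range = [
        v for v in present
        if (low is not None and v < low) or (high is not None and v > high)
    ]
    return {
        "min": min(present, default=None),
        "max": max(present, default=None),
        "out_of_range": out_of_range,
    }
```

For example, applying it to the prices charged for one product across retail locations would show the lowest and highest price and any price outside the expected band.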
• Sparseness
The degree of sparseness may indicate some business meaning regarding the
importance of that attribute. Depending on the value set, it probably means one of
two things.
• Format Evaluation
It is useful to look for the existence of patterns that might characterize the values
assigned to a column. We can use a discovered pattern definition as a validation rule,
which we would then add to a metadata database of domain patterns. Simple
examples of rule-based data domains include telephone numbers, zip codes, and
social security numbers.
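One simple way to discover such patterns, sketched here in Python, is to reduce each value to a character-class mask and count the masks. The convention of "9" for digits and "A" for letters, and the function names, are assumptions of this example.

```python
from collections import Counter

def value_pattern(value):
    """Map digits to '9' and letters to 'A'; keep punctuation as is."""
    return "".join(
        "9" if ch.isdigit() else "A" if ch.isalpha() else ch
        for ch in value
    )

def pattern_profile(values):
    """Count how often each pattern occurs in a column of strings."""
    return Counter(value_pattern(v) for v in values)
```

A dominant mask such as 999-9999 is a candidate validation rule, and values with rarer masks are the ones worth reviewing.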
• Cardinality and Uniqueness
The cardinality of a value set is the number of distinct values that exist within a
column. Cardinality is interesting because it relates to different aspects of the
correctness of the value set and because the way it relates exposes business
knowledge. Cardinality can be used to find columns whose values are unique, from
which candidate keys can be inferred.
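The link between cardinality and candidate keys can be shown with a short Python sketch. It assumes a table held as a dictionary of column lists. The function names are hypothetical.

```python
def cardinality(values):
    """Number of distinct values in a column."""
    return len(set(values))

def candidate_keys(table):
    """table: {column_name: [values]}.

    A column is a single-column candidate key when it has no missing
    values and its cardinality equals the number of rows.
    """
    return [
        name for name, values in table.items()
        if values and None not in values and cardinality(values) == len(values)
    ]
```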
• Frequency Distribution
The frequency distribution of values yields the number of times each of the distinct
values appears in a value set. Frequency distribution is also useful when looking for
variations from the norm that might indicate something suspicious.
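A frequency distribution is straightforward to compute with the Python standard library. In the sketch below, the rare_values helper and its 5% cutoff are illustrative assumptions, not a rule from the source.

```python
from collections import Counter

def frequency_distribution(values):
    """Number of times each distinct value appears."""
    return Counter(values)

def rare_values(values, max_share=0.05):
    """Values whose share of all rows is at or below max_share (assumed cutoff)."""
    counts = Counter(values)
    total = sum(counts.values())
    return sorted(v for v, n in counts.items() if n / total <= max_share)
```

A value such as "Ny" that appears once among many occurrences of "NY" is the kind of variation from the norm that deserves a second look.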
• Value Absence
There are actually two problems associated with the absence of values that can be
explored through data profiling. The first involves looking for values that are not there,
and the second involves looking for nonvalues that are there.
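Both problems can be counted in one pass, as in the Python sketch below. The list of placeholder strings treated as nonvalues is an assumption for illustration. In practice it would come from knowledge of the source systems.

```python
# Hypothetical placeholder strings that stand in for "no value".
PLACEHOLDERS = {"n/a", "na", "none", "null", "unknown", "-", "?"}

def absence_profile(values, placeholders=PLACEHOLDERS):
    """Count truly missing entries and nonvalues that merely fill the field."""
    missing = sum(1 for v in values if v is None or str(v).strip() == "")
    nonvalues = sum(
        1 for v in values
        if v is not None and str(v).strip().lower() in placeholders
    )
    return {"total": len(values), "missing": missing, "nonvalues": nonvalues}
```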
• Abstract Type Recognition
An abstract type is a more semantically descriptive qualification of a type definition
that conveys business meaning. Typically abstract data types are represented by
some kind of semantic definition, including:
  - Constructive assertion
  - Value enumeration
  - Pattern conformance
The goal of abstract type analysis is to propose an abstract data type for a specific
column based on a suggestive statistical conformance to one defined abstract type.
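The idea of statistical conformance can be sketched in Python as follows. The two abstract types shown (one defined by pattern conformance, one by value enumeration), the 90% threshold, and all names are assumptions made for this example.

```python
import re

WEEKDAYS = {"monday", "tuesday", "wednesday", "thursday",
            "friday", "saturday", "sunday"}

# Hypothetical abstract types: one pattern-based, one enumeration-based.
ABSTRACT_TYPES = {
    "us_zip_code": lambda v: re.fullmatch(r"\d{5}(-\d{4})?", v) is not None,
    "weekday": lambda v: v.lower() in WEEKDAYS,
}

def propose_abstract_type(values, types=ABSTRACT_TYPES, threshold=0.9):
    """Return (type_name, share) for the best-conforming abstract type.

    type_name is None when no type reaches the threshold share.
    """
    if not values:
        return None, 0.0
    best_name, best_share = None, 0.0
    for name, conforms in types.items():
        share = sum(1 for v in values if conforms(v)) / len(values)
        if share > best_share:
            best_name, best_share = name, share
    if best_share < threshold:
        return None, best_share
    return best_name, best_share
```

A threshold below 100% lets a few dirty values through, so the proposal is a suggestion for an analyst to confirm rather than a verdict.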
• Overloading
It is possible that as an attribute’s data values are checked against known domains, the
profiling process will see significant matches against more than one domain. This
might indicate that two attributes’ worth of information is in one column, where
the same column is being used to represent more than one actual attribute
(Fig. 13-1).
Alternatively, the use of more than one domain by a single attribute might
indicate that more complex business rules are in effect, such as the existence
of a split attribute, which is characterized by the use of different domains based
on other data quality or business rules.
Overloading can appear in other ways as well, such as the compaction of
multiple pieces of data into a single character string.
Figure 13-1
Relationship Analysis
Relationship analysis, frequently referred to in the data profiling space as
cross-column analysis, focuses on establishing relationships between sets of
data. The goal of these processing stages is to identify relationships between
value sets and known reference data, to identify dependencies between
columns (either in the same table or across different tables), and to identify key
relationships between columns across multiple tables.
• Domain Analysis
Domain analysis covers two tasks: identifying data domains and identifying
references to data domains. The brute-force method for identifying enumerated
domains is to look at all possible value sets.
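The second task, identifying references to data domains, can be illustrated with a small Python sketch that measures how much of a column's value set falls inside each known domain. The function name match_domains and the example domains are hypothetical.

```python
def match_domains(column_values, known_domains):
    """known_domains: {domain_name: set_of_values}.

    Return, per domain, the share of the column's distinct values
    that belong to that domain.
    """
    distinct = set(column_values)
    if not distinct:
        return {}
    return {
        name: len(distinct & domain) / len(distinct)
        for name, domain in known_domains.items()
    }
```

A column whose share is close to 1.0 for a domain probably draws its values from that domain, and the values that fall outside are candidates for correction.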
• Functional Dependency
A functional dependency establishes a relationship between two sets of attributes. If
the relationship is causal (i.e., the dependent attribute’s value is filled in as a function
of the defining attributes), that is an interesting piece of business knowledge that can
be added to a growing knowledge base. If the relationship is not causal, then that
piece of knowledge can be used to infer information about normalization of the data.
If a pair of data attribute values is consistently bound together, then those two
columns can be extracted from the targeted table and the instance pairs inserted
uniquely into a new table and assigned a reference identifier.
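The two steps, testing whether a dependency holds and extracting the bound pairs into a reference table, can be sketched in Python. Rows are assumed to be dictionaries. The function names and the ref_id column are hypothetical.

```python
def holds_functional_dependency(rows, determinant, dependent):
    """True if equal determinant values always imply equal dependent values."""
    seen = {}
    for row in rows:
        key = tuple(row[c] for c in determinant)
        value = tuple(row[c] for c in dependent)
        # setdefault stores the first value seen for this key.
        if seen.setdefault(key, value) != value:
            return False
    return True

def extract_reference_table(rows, columns):
    """Collect each distinct combination of `columns` once and number it."""
    ids = {}
    for row in rows:
        pair = tuple(row[c] for c in columns)
        ids.setdefault(pair, len(ids) + 1)
    return [
        {"ref_id": ref_id, **dict(zip(columns, pair))}
        for pair, ref_id in ids.items()
    ]
```

The check only shows that the data is consistent with a dependency. Whether the relationship is causal remains a business question.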
• Key Relationships
A table key is a set of attributes that can be used to uniquely identify any individual
record within the table. When one table’s key is used as a reference in another table,
that key is called a foreign key. Modern relational databases enforce a constraint
known as referential integrity, which states that if an attribute’s value is used in table
A as a foreign key to table B, then that key value must exist in one record in table B.
There are two aspects to profiling key relationships: identifying that a key relationship
exists, and identifying what are called orphans in a violated referential integrity
situation.
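Finding orphans reduces to a membership test, as in this Python sketch. Rows are assumed to be dictionaries, and the function and parameter names are hypothetical.

```python
def find_orphans(child_rows, fk_column, parent_rows, pk_column):
    """Return child rows whose foreign key value has no match in the parent table.

    Rows with a missing (None) foreign key are not treated as orphans here.
    """
    parent_keys = {row[pk_column] for row in parent_rows}
    return [
        row for row in child_rows
        if row[fk_column] is not None and row[fk_column] not in parent_keys
    ]
```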
Management Issues
The most significant management issues involve the relatively steep costs of
good data profiling tools and the performance of these tools. It is worthwhile to
explore the questionable performance of these tools. First, some of the
algorithms used in data profiling are actually quite computationally intensive,
and it is not unusual for some of the analysis to require both large amounts of
computational resources (memory, disk space) and time to complete
successfully. Second, because the computations are summaries of frequency
analysis and counts, the results presented tend to be almost endless, with long
lists of values, each of which may have appeared only once in a column.