Course Name: Business Intelligence
Year: 2009
Information Integration
15th Meeting
Source of this Material
Loshin, David (2003). Business Intelligence: The Savvy Manager's Guide. Chapter 10.
The Business Case
The business intelligence process revolves around the ability to collect, aggregate,
and, most importantly, leverage the integration of different data sets; the ability to
collect that data and place it in a data warehouse provides the means by which that
leverage can be obtained.
The only way to get data into a data warehouse is through an information integration
process, and the only way to consolidate information for data consumption is likewise
through an information integration process.
ETL: Extract, Transform, Load
A basic premise of constructing a data warehouse is that data sets from multiple
sources are collected and then added to a data repository from which analytical
applications can source their input data.
This extract/transform/load (ETL) process is the sequence of applications that extract
data sets from the various sources, bring them to a data staging area, apply a
sequence of processes to prepare the data for migration into the data warehouse, and
actually load them. Here is the general theme of an ETL process (a minimal Python
sketch follows the list).
• Get the data from the source location.
• Map the data from its original form into a data model that is suitable for
  manipulation at the staging area.
• Validate and clean the data.
• Apply any transformations to the data that are required before the data sets
  are loaded into the repository.
• Map the data from its staging area model to its loading model.
• Move the data set to the repository.
• Load the data into the warehouse.
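The sketch below is a minimal Python illustration of that general theme, not the
chapter's own implementation; the source table, warehouse table, and column names are
all assumptions, and an in-memory SQLite database stands in for both the source system
and the warehouse.

    import sqlite3

    def extract(source_conn):
        # Get the data from the source location.
        return source_conn.execute(
            "SELECT customer_id, name, signup_date FROM customers").fetchall()

    def transform(rows):
        # Validate and clean the data, and map it to the loading model.
        staged = []
        for customer_id, name, signup_date in rows:
            if customer_id is None:                       # validation: drop rows without a key
                continue
            staged.append((int(customer_id),              # data type conversion
                           (name or "").strip().upper(),  # data cleansing
                           signup_date or "1900-01-01"))  # null conversion
        return staged

    def load(warehouse_conn, staged):
        # Move the prepared data set to the repository and load it.
        warehouse_conn.executemany(
            "INSERT INTO dim_customer VALUES (?, ?, ?)", staged)
        warehouse_conn.commit()

    if __name__ == "__main__":
        source = sqlite3.connect(":memory:")
        source.execute("CREATE TABLE customers (customer_id, name, signup_date)")
        source.executemany("INSERT INTO customers VALUES (?, ?, ?)",
                           [(1, " alice ", "2009-01-15"), (2, "bob", None), (None, "x", None)])
        warehouse = sqlite3.connect(":memory:")
        warehouse.execute("CREATE TABLE dim_customer (customer_id, name, signup_date)")
        load(warehouse, transform(extract(source)))
        print(warehouse.execute("SELECT * FROM dim_customer").fetchall())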
ETL: Extract, Transform, Load (cont…)
• Staging Architecture
The first part of the ETL process is to assemble the infrastructure needed for
aggregating the raw data sets, for applying the transformations, and for the
subsequent preparation of the data to be forwarded to the data warehouse. This is
typically a combination of a hardware platform and appropriate management software
that we refer to as the staging area. The architecture of the staging process can be
seen in Figure 15-1.
Figure 15-1: Architecture of the staging process
ETL: Extract, Transform, Load (cont…)
• Extraction
A lot of extracted data is formed into flat load files that can be either easily
manipulated by processes at the staging area or forwarded directly to the warehouse.
How data should be extracted may depend on the scale of the project, the number
(and disparity) of data sources, and how far into the implementation the developers
are. Extraction can be as simple as a collection of simple SQL queries or complex
enough to require ad hoc, specially designed programs written in a proprietary
programming language.
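As an illustration of the simple end of that spectrum, the sketch below (the orders
table and the file name are assumptions) runs a single SQL query and writes the
result set to a flat load file for the staging area.

    import csv
    import sqlite3

    # An in-memory SQLite database stands in for the source system.
    source = sqlite3.connect(":memory:")
    source.execute("CREATE TABLE orders (order_id, customer_id, amount)")
    source.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                       [(100, 1, 25.0), (101, 2, 40.0)])

    # Extract with one SQL query and form the result into a flat load file.
    with open("orders_extract.csv", "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["order_id", "customer_id", "amount"])  # header row
        writer.writerows(source.execute(
            "SELECT order_id, customer_id, amount FROM orders"))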
• Transformation
What is discovered during data profiling is put to use as part of the ETL process to
help in the mapping of source data to a form suitable for the target repository,
including the following tasks (a brief transformation sketch follows the list).
 Data type conversion
 Data cleansing
 Integration
 Referential integrity checking
 Derivations
ETL: Extract, Transform, Load (cont…)
 Denormalization and renormalization
 Aggregation
 Audit information
 Null conversion
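A brief sketch of a few of these tasks applied to staged records, with hypothetical
field names: data type conversion, data cleansing, null conversion, and a simple
derivation.

    def transform_record(rec):
        amount = float(rec.get("amount") or 0.0)             # type conversion + null conversion
        name = (rec.get("name") or "").strip().title()       # data cleansing
        order_date = rec.get("order_date") or "1900-01-01"   # null conversion (default date)
        year = int(order_date[:4])                           # derivation: order year
        return {"name": name, "amount": amount,
                "order_date": order_date, "order_year": year}

    staged = [{"name": " alice smith ", "amount": "25.5", "order_date": "2009-03-01"},
              {"name": None, "amount": None, "order_date": None}]
    print([transform_record(r) for r in staged])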
• Loading
The loading component of ETL is centered on moving the transformed data into the
data warehouse. The critical issues include the following (a small load sketch
appears after the list).
 Target dependencies
 Refresh volume and frequency
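A small load sketch, assuming a hypothetical fact_orders table and staged rows that
have already been mapped to the loading model; the batch size is an assumption that
stands in for decisions about refresh volume and frequency.

    import sqlite3

    warehouse = sqlite3.connect(":memory:")
    warehouse.execute("CREATE TABLE fact_orders (order_id, customer_id, amount)")

    staged_rows = [(100, 1, 25.0), (101, 2, 40.0), (102, 1, 15.5)]

    BATCH_SIZE = 2  # assumed; in practice driven by refresh volume and frequency
    for i in range(0, len(staged_rows), BATCH_SIZE):
        warehouse.executemany("INSERT INTO fact_orders VALUES (?, ?, ?)",
                              staged_rows[i:i + BATCH_SIZE])
        warehouse.commit()

    print(warehouse.execute("SELECT COUNT(*) FROM fact_orders").fetchone())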
• Scalability
There are two flavors of operations that are addressed during the ETL process. One
involves processing that is limited to the data instances within a single data set,
and the other involves the resolution of issues involving more than one data set. The
more data sets that are being integrated, the greater the amount of work that needs
to be done for the integration to complete.
Enterprise Application Integration and Web Services
Similar to the way that ETL processes extract and transform information from
multiple data sources into a target data warehouse, there are processes for
integrating and transforming information between active processes and applications,
essentially making them all work together. This process, called enterprise
application integration (EAI), provides for interacting applications a function
similar to the one that ETL provides for data sources.
• Enterprise Application Integration
Enterprise application integration (EAI) is meant to convey the perception of multiple
applications working together as if they were all a single application. The basic goal
is for a business process to be able to be cast as the interaction of a set of
available applications and for all applications to be able to communicate properly
with each other. Enterprise application integration is not truly a product or a tool,
but rather a framework of ideas comprising different levels of integration (a small
messaging sketch follows the list), including:
 Business Process Management
 Communications Middleware
 Data Standardization and Transformation
Enterprise Application Integration and Web Services (cont…)
 Application of Business Rules
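The sketch below is a highly simplified illustration of these levels, with all
application and topic names invented for the example: a tiny in-process message bus
stands in for communications middleware, the publishing application standardizes its
data into a canonical message, and the consuming application applies a business rule
to that message.

    from collections import defaultdict

    class MessageBus:
        """Stand-in for communications middleware: routes messages by topic."""
        def __init__(self):
            self.subscribers = defaultdict(list)

        def subscribe(self, topic, handler):
            self.subscribers[topic].append(handler)

        def publish(self, topic, message):
            for handler in self.subscribers[topic]:
                handler(message)

    def order_entry_app(bus):
        # Data standardization: map the internal record into a canonical message.
        internal = {"id": 7, "cust": "ACME", "amt": "99.90"}
        canonical = {"order_id": internal["id"],
                     "customer_name": internal["cust"],
                     "amount": float(internal["amt"])}
        bus.publish("order.created", canonical)

    def billing_app(message):
        # The consuming application applies its own business rule to the message.
        print(f"Invoice {message['order_id']}: "
              f"{message['amount']:.2f} for {message['customer_name']}")

    bus = MessageBus()
    bus.subscribe("order.created", billing_app)
    order_entry_app(bus)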
• Web Services
Web services are business functions made available over the Internet that are
constructed according to strict specifications. Conformance to a strict standard
enables different, disparate clients to interact. By transforming data into an
Extensible Markup Language (XML) format based on a predefined schema and providing
object access directives that describe how objects are communicated with, web
services provide a higher level of abstraction than what is assumed by general EAI.
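As a small illustration of the data side of that idea (the element names are
assumptions, and no actual service is called), the sketch below transforms a record
into an XML document that a service expecting a predefined schema could consume.

    import xml.etree.ElementTree as ET

    order = {"order_id": 7, "customer_name": "ACME", "amount": 99.90}

    # Map each field of the record onto an element of the assumed schema.
    root = ET.Element("Order")
    for field, value in order.items():
        ET.SubElement(root, field).text = str(value)

    payload = ET.tostring(root, encoding="unicode")
    print(payload)  # <Order><order_id>7</order_id>...</Order>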
Record Linkage and Consolidation
Consolidation is a catchall term for those processes that make use of collected
metadata and knowledge to eliminate duplicate entities and merge data from
multiple sources, among other data enhancement operations. That process is
powered by the ability to identify some kind of relationship between any
arbitrary pair of data instances. The key to record linkage is the concept of
similarity: a measure of how close two data instances are to each other. The measure
can be an exact (hard) match or a more approximate one, in which case similarity is
judged by whether a score falls above or below a threshold.
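A minimal scoring sketch of that idea, in which the comparison fields, the equal
weighting, and the threshold are all assumptions: two records are compared field by
field with an approximate string similarity, and the averaged score is tested against
a threshold.

    from difflib import SequenceMatcher

    def similarity(a, b):
        """Approximate similarity of two records as the average of per-field
        string similarities (0.0 = disjoint, 1.0 = identical)."""
        fields = ["name", "street", "city"]          # assumed comparison fields
        scores = [SequenceMatcher(None, str(a.get(f, "")).lower(),
                                  str(b.get(f, "")).lower()).ratio()
                  for f in fields]
        return sum(scores) / len(scores)

    r1 = {"name": "John A. Smith", "street": "12 Main St", "city": "Springfield"}
    r2 = {"name": "Jon Smith", "street": "12 Main Street", "city": "Springfield"}

    score = similarity(r1, r2)
    MATCH_THRESHOLD = 0.8                            # assumed, application-dependent
    print(score, "match" if score >= MATCH_THRESHOLD else "no match")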
• Scoring Precision and Application Context
One of the most significant insights into similarity and difference measurements is
the issue of application context and its impact on both measurement precision and the
similarity criteria. Depending on the kind of application that makes use of
approximate searching and matching, the thresholds will most likely change.
• Elimination of Duplicates
The elimination of duplicates is a process of finding multiple representations of the
same entity within the data set and eliminating all but one of those representations
from the set. The elimination of duplicates is essentially a process of clustering
similar records together and then reviewing the corresponding similarity scores with
respect to a pair of thresholds.
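A minimal clustering sketch of that review, with the similarity function and both
thresholds chosen purely for illustration: a score above the upper threshold is
treated as a confirmed duplicate, while a score between the two thresholds is queued
for human review.

    from difflib import SequenceMatcher

    AUTO_MERGE = 0.90   # assumed upper threshold: confidently the same entity
    REVIEW = 0.70       # assumed lower threshold: uncertain, flag for review

    def name_similarity(a, b):
        return SequenceMatcher(None, a["name"].lower(), b["name"].lower()).ratio()

    def deduplicate(records):
        survivors, review_queue = [], []
        for rec in records:
            duplicate = False
            for kept in survivors:
                score = name_similarity(rec, kept)
                if score >= AUTO_MERGE:
                    duplicate = True          # keep only the first representation
                    break
                if score >= REVIEW:
                    review_queue.append((rec, kept, score))
            if not duplicate:
                survivors.append(rec)
        return survivors, review_queue

    records = [{"name": "John A. Smith"}, {"name": "John A Smith"}, {"name": "Jane Doe"}]
    print(deduplicate(records))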
• Merge/Purge
Merge/purge is similar to the elimination of duplicates, except that whereas duplicate
elimination is associated with removing duplicates from a single data set, merge/purge
involves the aggregation of multiple data sets followed by the elimination of
duplicates.
• Householding
Householding is a process of reducing a number of records into a single set
associated with a single household. A household could be defined as a single
residence, and the householding process is used to determine which individuals live
within the same residence.
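A minimal sketch of that grouping, assuming a naively normalized street address as the
household key (real householding logic is considerably more involved).

    from collections import defaultdict

    def normalize_address(addr):
        # Crude normalization: lowercase, strip punctuation, collapse whitespace.
        return " ".join(addr.lower().replace(".", "").replace(",", "").split())

    people = [
        {"name": "John Smith", "address": "12 Main St., Springfield"},
        {"name": "Mary Smith", "address": "12 Main St Springfield"},
        {"name": "Jane Doe",   "address": "99 Oak Ave, Springfield"},
    ]

    households = defaultdict(list)
    for person in people:
        households[normalize_address(person["address"])].append(person["name"])

    print(dict(households))  # two households: John and Mary together, Jane alone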
• Improving Information Currency
There are other applications that make use of a consolidation phase during data
cleansing. One application is the analysis of currency and correctness. In the
consolidation phase, when multiple records associated with a single entity are
combined, the information in all the records can be used to infer the best overall set
of data attributes.
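A minimal survivorship-style sketch of that inference, under the assumption (made for
illustration only) that the most recently updated non-null value is the best value for
each attribute; the field and timestamp names are hypothetical.

    def consolidate(records):
        merged = {}
        # Process records oldest to newest so later values overwrite older ones.
        for rec in sorted(records, key=lambda r: r["last_updated"]):
            for field, value in rec.items():
                if field != "last_updated" and value is not None:
                    merged[field] = value
        return merged

    records = [
        {"name": "J. Smith", "phone": "555-0100", "email": None,
         "last_updated": "2008-05-01"},
        {"name": "John Smith", "phone": None, "email": "js@example.com",
         "last_updated": "2009-02-10"},
    ]
    print(consolidate(records))
    # {'name': 'John Smith', 'phone': '555-0100', 'email': 'js@example.com'}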
Management Issues
• Data Ownership
How are you to direct your team to maintain a high level of data quality within the
warehouse? There are three ways to address this: correct the data in the warehouse,
try to effect some changes to the source data, and leave the errors in the data.
• Activity Scheduling
How should the activities associated with the integration process be scheduled? The
answer depends on the available resources, the relative quality of the supplied data,
and the kinds of data sets that are to be propagated to the repository.
• Reliability of Automated Linkage
Although our desire is for automated processes to properly link data instances as part
of the integration process, there is always some doubt that the software is actually
doing what we want it to do.
End of Slide