Course Name: Business Intelligence
Year: 2009
Information Integration
15th Meeting

Source of this Material: (2) Loshin, David (2003). Business Intelligence: The Savvy Manager's Guide. Chapter 10.

The Business Case
The business intelligence process revolves around the ability to collect, aggregate, and, most importantly, leverage the integration of different data sets; the ability to collect that data and place it in a data warehouse provides the means by which that leverage can be obtained. The only way to get data into a data warehouse is through an information integration process, and the only way to consolidate information for data consumption is through an information integration process.

ETL: Extract, Transform, Load
A basic premise of constructing a data warehouse is that data sets from multiple sources are collected and then added to a data repository from which analytical applications can source their input data. This extract/transform/load process is the sequence of applications that extract data sets from the various sources, bring them to a data staging area, apply a sequence of processes to prepare the data for migration into the data warehouse, and actually load them. Here is the general theme of an ETL process.
• Get the data from its source location.
• Map the data from its original form into a data model that is suitable for manipulation at the staging area.
• Validate and clean the data.
• Apply any transformations to the data that are required before the data sets are loaded into the repository.
• Map the data from its staging area model to its loading model.
• Move the data set to the repository.
• Load the data into the warehouse.

ETL: Extract, Transform, Load (cont…)
• Staging Architecture
The first part of the ETL process is to assemble the infrastructure needed for aggregating the raw data sets and for the application of the transformations and the subsequent preparation of the data to be forwarded to the data warehouse. This is typically a combination of a hardware platform and appropriate management software that we refer to as the staging area. The architecture of the staging process is shown in Figure 15-1.
Figure 15-1: The staging architecture.

ETL: Extract, Transform, Load (cont…)
• Extraction
A lot of extracted data is formed into flat load files that can be either easily manipulated in process at the staging area or forwarded directly to the warehouse. How data should be extracted may depend on the scale of the project, the number (and disparity) of data sources, and how far into the implementation the developers are. Extraction can be as simple as a collection of simple SQL queries or complex enough to require ad hoc, specially designed programs written in a proprietary programming language.
• Transformation
What is discovered during data profiling is put to use as part of the ETL process to help in the mapping of source data to a form suitable for the target repository, including the following tasks (a small sketch follows this list).
- Data type conversion
- Data cleansing
- Integration
- Referential integrity checking
- Derivations
- Denormalization and renormalization
- Aggregation
- Audit information
- Null conversion
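The sketch below is a minimal illustration of a few of these steps in Python, using an in-memory SQLite database to stand in for both the source system and the warehouse. The table names (customers_src, dim_customer), the column names, and the specific conversions are all hypothetical; a production ETL job would typically be built with dedicated tooling rather than a hand-written script.

import sqlite3
from datetime import date

# A minimal ETL sketch: extract with a simple SQL query, apply a few of the
# transformation tasks listed above (data cleansing, data type conversion,
# null conversion, a derivation), and load the result into a target table.
# All table and column names here are hypothetical.

conn = sqlite3.connect(":memory:")   # stands in for source and warehouse alike
cur = conn.cursor()

# Hypothetical source data (normally this lives in an operational system).
cur.execute("CREATE TABLE customers_src (id TEXT, name TEXT, birth_year TEXT, region TEXT)")
cur.executemany(
    "INSERT INTO customers_src VALUES (?, ?, ?, ?)",
    [("001", "  Alice Smith ", "1975", "NY"),
     ("002", "Bob Jones", None, "ca"),
     ("003", "Carol Byrne", "1988", None)],
)

# Hypothetical warehouse target.
cur.execute("CREATE TABLE dim_customer (id INTEGER, name TEXT, birth_year INTEGER, "
            "age INTEGER, region TEXT)")

# Extract: a simple SQL query pulls the raw records into the staging step.
rows = cur.execute("SELECT id, name, birth_year, region FROM customers_src").fetchall()

transformed = []
for cust_id, name, birth_year, region in rows:
    # Data cleansing: trim stray whitespace.
    name = name.strip()
    # Data type conversion: ids and years arrive as text, the target wants integers.
    cust_id = int(cust_id)
    birth_year = int(birth_year) if birth_year is not None else None
    # Null conversion: replace missing regions with an explicit default code.
    region = region.upper() if region is not None else "UNKNOWN"
    # Derivation: compute an age attribute from the birth year.
    age = date.today().year - birth_year if birth_year is not None else None
    transformed.append((cust_id, name, birth_year, age, region))

# Load: move the prepared records into the warehouse table.
cur.executemany("INSERT INTO dim_customer VALUES (?, ?, ?, ?, ?)", transformed)
conn.commit()

for row in cur.execute("SELECT * FROM dim_customer"):
    print(row)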
ETL: Extract, Transform, Load (cont…)
• Loading
The loading component of ETL is centered on moving the transformed data into the data warehouse. The critical issues include the following.
- Target dependencies
- Refresh volume and frequency
• Scalability
There are two flavors of operations that are addressed during the ETL process. One involves processing that is limited to the data instances within a single data set, and the other involves the resolution of issues involving more than one data set. The more data sets that are being integrated, the greater the amount of work that needs to be done for the integration to be completed.

Enterprise Application Integration and Web Services
Similar to the way that ETL processes extract and transform information from multiple data sources into a target data warehouse, there are processes for integrating and transforming information between active processes and applications to make them all work together. This process, called enterprise application integration (EAI), provides for interacting applications a function similar to the one that ETL provides for data sets.
• Enterprise Application Integration
Enterprise application integration (EAI) is meant to convey the perception of multiple applications working together as if they were all a single application. The basic goal is for a business process to be able to be cast as the interaction of a set of available applications, and for all applications to be able to properly communicate with each other. Enterprise application integration is not truly a product or a tool, but rather a framework of ideas comprising different levels of integration, including:
- Business Process Management
- Communications Middleware
- Data Standardization and Transformation
- Application of Business Rules

Enterprise Application Integration and Web Services (cont…)
• Web Services
Web services are business functions available over the Internet that are constructed according to strict specifications. Conformance to a strict standard enables different, disparate clients to interact. By transforming data into an Extensible Markup Language (XML) format based on a predefined schema and providing object access directives that describe how objects are communicated with, web services provide a higher level of abstraction than that assumed by general EAI.

Record Linkage and Consolidation
Consolidation is a catchall term for those processes that make use of collected metadata and knowledge to eliminate duplicate entities and merge data from multiple sources, among other data enhancement operations. That process is powered by the ability to identify some kind of relationship between any arbitrary pair of data instances. The key to record linkage is the concept of similarity. This is a measure of how close two data instances are to each other, and it can be a hard (exact) measure or a more approximate measure, in which case the similarity is judged based on scores above or below a threshold.
• Scoring Precision and Application Context
One of the most significant insights into similarity and difference measurements is the issue of application context and its impact on both measurement precision and the similarity criteria. Depending on the kind of application that makes use of approximate searching and matching, the thresholds will most likely change (a small sketch of threshold-based scoring follows).
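As a rough illustration of threshold-based scoring, the sketch below compares pairs of name values using a generic string-similarity ratio from Python's standard library and classifies each pair against an upper and a lower threshold. The field values, the threshold settings, and the single-field comparison are all simplifications assumed for the example; real record-linkage engines typically combine weighted scores across many attributes and use purpose-built similarity measures.

from difflib import SequenceMatcher

# A minimal sketch of similarity scoring with a pair of thresholds.
# Pairs scoring above MATCH_THRESHOLD are treated as the same entity,
# pairs below REVIEW_THRESHOLD as different entities, and anything in
# between is flagged for review. Values and thresholds are hypothetical.

MATCH_THRESHOLD = 0.90   # above this: assume the records refer to the same entity
REVIEW_THRESHOLD = 0.70  # below this: assume the records are different entities

def similarity(a: str, b: str) -> float:
    """Approximate similarity between two field values, in the range [0, 1]."""
    return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()

def classify(a: str, b: str) -> str:
    score = similarity(a, b)
    if score >= MATCH_THRESHOLD:
        return f"match ({score:.2f})"
    if score >= REVIEW_THRESHOLD:
        return f"possible match, review ({score:.2f})"
    return f"non-match ({score:.2f})"

pairs = [
    ("Elizabeth R. Johnson", "Elizabeth R Johnson"),
    ("Elizabeth R. Johnson", "Liz Johnson"),
    ("Elizabeth R. Johnson", "Robert Smith"),
]

for left, right in pairs:
    print(f"{left!r} vs {right!r}: {classify(left, right)}")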
• Elimination of Duplicates
The elimination of duplicates is a process of finding multiple representations of the same entity within the data set and eliminating all but one of those representations from the set. It is essentially a process of clustering similar records together and then reviewing the corresponding similarity scores with respect to a pair of thresholds.

Record Linkage and Consolidation (cont…)
• Merge/Purge
Merge/purge is similar to the elimination of duplicates, except that whereas duplicate elimination is associated with removing duplicates from a single data set, merge/purge involves the aggregation of multiple data sets followed by the elimination of duplicates.
• Householding
Householding is a process of reducing a number of records into a single set associated with a single household. A household could be defined as a single residence, and the householding process is used to determine which individuals live within the same residence.
• Improving Information Currency
There are other applications that make use of a consolidation phase during data cleansing. One application is the analysis of currency and correctness. In the consolidation phase, when multiple records associated with a single entity are combined, the information in all the records can be used to infer the best overall set of data attributes.

Management Issues
• Data Ownership
How are you to direct your team to maintain a high level of data quality within the warehouse? There are three ways to address this: correct the data in the warehouse, try to effect some changes to the source data, or leave the errors in the data.
• Activity Scheduling
How should the activities associated with the integration process be scheduled? The answer depends on the available resources, the relative quality of the supplied data, and the kinds of data sets that are to be propagated to the repository.
• Reliability of Automated Linkage
Although our desire is for automated processes to properly link data instances as part of the integration process, there is always some doubt that the software is actually doing what we want it to do.

End of Slide