Business Intelligence Dr. Mahdi Esmaeili 1 2 Step 8: Database Design Operational Databases BI Target Databases •Geared toward eliminating redundancy, coordinating updates, and repeating the same types of operations many times a day, every day (for example, airline reservations, deposits and withdrawals from bank accounts, hotel room reservations). •Geared toward supporting a wide range of queries and reports. Queries and reports may vary from one business analyst to another or from one department to another. All of the queries and reports may not run on the same day and may not run every day (for example, quarterly trend analysis reports on regional sales, monthly order fulfillment report). •Most of the transactional systems require subsecond •Although response time is important, subseconds cannot response time. be expected. Typical response times are seconds, minutes, or hours. •Highly normalized to support consistent updates and •Highly denormalized to provide quick retrieval of a wide maintenance of referential integrity. range and a large amount of data. Data that belongs together from an analytical reporting perspective is usually stored together. •Store very little derived data. Data is usually derived •Store large amounts of derived data. This saves time for dynamically when needed. the queries and reports. •Do not store historical data. Historical records are archived. •Store large amounts of historical data, often at some level of summarization, but just as often at a detailed level. •Lightly summarized, mostly for reporting purposes. •Many levels of precalculated, summarized data, from lightly summarized to highly summarized. 3 Logical Database Design • The Star Schema • The Snowflake Schema • The Constellation Schema Physical Database Design • • • • • • • • Implementation Options Physical Dataset Placement Partitioning Clustering Indexing Reorganizations Backup and Recovery Parallel Query Execution 4 5 Deliverable Resulting • Physical data model • Physical design of the BI target databases • Data definition language • Data control language • Physical BI target databases • Database maintenance procedures 6 Roles Involved in This Step • Application lead developer • Data administrator • Database administrator • ETL lead developer 7 Step 9: Extract/Transform/Load Design 8 Integrated BI Implementation Strategy 9 Sets of ETL Programs 1 Initial Load Initial population of BI target databases with current operational data 2 Historical Load Initial population of BI target databases with archived historical data 3 Incremental Load Ongoing population of BI target databases with current operational data 10 ETL Processes 11 Designing the Transformation Programs Using the 80/20 rule, 80 percent of ETL work occurs in the "T" (transform) portion when extensive data integration and data cleansing are required, while extracting and loading represent only 20 percent of the ETL process 12 Source Data Problems • Inconsistent primary keys • • • • • Inconsistent data values Different data formats Inaccurate data values Synonyms and homonyms Embedded process logic 13 Common Use of ETL Tools in the ETL Process 14 Evaluating ETL Tools 1. Perform a cost-benefit analysis to compare licensing (buying) an ETL product with building the ETL process in-house 2. Compile a list of ETL products and vendors that are likely to meet your requirements 3. Compare the ETL products and vendors to your weighted data transformation requirements 4. Evaluate each ETL product objectively and prepare a scorecard that compares the product features and their effectiveness 5. Check the vendors' client references by talking with people at organizations that already use the tools you are considering 6. Narrow the list of ETL products and vendors to a short list of two or three candidates 7. Arrange for product demos since "seeing is believing." 8. Test the vendor products even though it takes away time from your project schedule 15 16 Deliverable Resulting • Source-to-target mapping document • ETL process flow diagram • ETL program design document • Staging area 17 Roles Involved in This Step • Data quality analyst • Database administrator • ETL lead developer • Subject matter expert 18 Step 10: Meta Data Repository Design 19 Centralized Meta Data Repository 20 Custom-Built Centralized Meta Data Repository Advantages Disadvantages •A customized database design incorporates all meta data requirements. •Full-time staff is needed to maintain the meta data repository database and the meta data reports. •The front end for access and the interfaces to •The front end for access and the interfaces to tools (ETL, OLAP, and so on) are custom designed tools must be programmed and maintained, both to meet all requirements. of which are time-consuming processes. •Reports as well as help functions are designed exactly as desired. •The meta data repository would have to be enhanced periodically (sometimes redesigned) because it cannot be built with all functionality from the start. •Technicians have full control over the design and •Content may become out of synch with the functionality of the meta data repository. proprietary dictionaries of the tools and the DBMS 21 Licensed Centralized Meta Data Repository Advantages Disadvantages •Time is saved by not having to design and •The "plain vanilla" version of the licensed build a meta data repository database, product will probably not satisfy all meta interfaces, front end, and reports. data requirements. Therefore, a full-time administrator is needed to maintain and enhance the licensed product. •Most licensed meta data repository products come with interfaces, and most come with a full set of APIs. •There will be a learning curve to become familiar with the product's architecture, interfaces, and APIs. •If the meta data repository product is certified for the tools where the meta data resides, it will provide the tool interfaces. •The more sophisticated the meta data repository product is, the more expensive it is, and the more skills the technicians need to maintain it. 22 Decentralized Meta Data Repository 23 Decentralized Meta Data Repository Advantages Disadvantages •Various owners can maintain and manage their own sets of meta data separately. •Controlling redundancy across multiple meta data repositories and keeping the meta data consistent is difficult. •Meta data repository databases are smaller and easier to use because each database contains only those meta data components that are of interest to a specific group of business people. •It will take longer to maintain and manage multiple databases on multiple platforms. There could also be synchronization problems with new DBMS releases. •Each meta data repository can have its own meta model, that is, its own customized design. •Communication among the custodians of the various meta data repositories will have to increase. Plus, it will require maintaining a meta-meta model, which is an integrated (merged) overall architecture of multiple meta models. •Reports can be customized for each individual meta •Relating meta data across various databases may data repository. be difficult. For example, business meta data is not automatically linked to technical meta data if they reside on different databases. •A gateway makes the name and location of the meta data repository transparent to the person accessing it. •The architecture of this solution is more complicated and the learning curve to use multiple databases with potentially different designs may be high. 24 Distributed XML-Enabled Meta Data Solution 25 Distributed XML-Enabled Meta Data Solution Advantages Disadvantages •XML tags enable access of meta data across any type of data •The initial tagging of all meta data with XML storage through standardized categorization and tagging of meta tags is a manual and laborious process. Plus, data components. XML tagging cannot be used for all meta data. •Meta data never has to be duplicated or moved from its original •XML tags add to the storage requirements source (except for reporting purposes). for the dictionary databases that store meta data (DBMS and tool dictionaries). •A gateway makes the location of the meta data transparent to the person accessing it. •A meta-meta model has to be created as a map of all the various types of meta data storage, each of which is designed according to its own unique meta model. •Standard Web search engines should be able to locate any meta •DBMS and tool vendors must follow data anywhere. industry standards[*] for meta data XML tags in order to enable seamless meta data access across all products. Multiple standards need to be supported. •Meta data and business data can be coupled and transmitted simultaneously. •Not all DBMSs and tools are XML-enabled. This is a bleeding-edge and unproven technology. 26 Designing a Meta Data Repository • Entity-Relationship Design • Object-Oriented Design 27 Entity-Relationship Design Advantages Disadvantages •E-R designs are easy to read and easy to •Changes and enhancements may require a understand. database redesign, as well as unload-ing and reloading the meta data repository. •Because of the intuitive and explicit nature of the design, queries can be written with relatively simple SQL statements. •The physical meta model is fairly large, with many objects and many relation-ships, which makes the architecture somewhat complex. •E-R designs are easy to implement as relational database structures. •Meta data objects and their relationships must be very well defined and under-stood for the physical meta model to be accurate 28 Object-Oriented Design Advantages Disadvantages •OO designs are extremely flexible; they will not need any database redesigns when changes are necessary. •Since the object named Object contains all instances (rows) of meta data, this table will become very large. This may affect access performance. •OO designs are simplistic and therefore easy to maintain and to enhance. •Queries are much more difficult to write and will require many recursive JOINs. Advanced SQL knowledge is required. •OO designs are easy to implement as object-oriented database structures. •OO designs require a high learning curve. The very abstract physical data model is difficult to comprehend, and the extensive rules take time to understand 29 Licensing (Buying) a Meta Data Repository (Expansion capabilities) • • • • • • • • • Adding meta data objects Adding relationships Changing inappropriate relationships Adding meta-meta data attributes to the meta data objects Changing the size and length of meta-meta data components Customizing vendor-provided reports Creating and storing code for additional reports Importing meta data from other tools Exporting meta data to other tools 30 Example of Product Evaluation Criteria with Weights Criterion # Product Evaluation Criteria Weight 1 Product closely matches our logical meta model 10 2 Product can satisfy our important meta data requirements 6 3 Product can satisfy our optional meta data requirements 1 4 Product can be expanded 8 5 Product has interfaces 9 6 Product has a Web front end 4 7 Product has APIs 9 31 Example of Product Ratings Criterion # Product 1 2 3 4 5 6 7 Autumn Dictionary 3 7 6 0 9 9 10 Helixor 9 0 2 8 6 0 5 Leeches Repository 6 6 1 6 4 0 7 Springrep 8 2 0 10 10 2 10 Tamroller MDR 7 5 5 0 6 2 7 32 Example of Weighted Product Ratings Product 1 2 3 4 5 Springrep 80 12 0 80 90 Autumn Dictionary 30 42 6 0 Helixor 90 0 2 Leeches Repository 60 36 Tamroller MDR 70 30 6 7 Total Score 8 90 360 81 36 90 285 64 54 0 45 255 1 48 36 0 63 244 5 0 54 8 63 230 33 34 Deliverable Resulting • Physical meta model • Data definition language for the meta data repository • Data control language for the meta data repository • Meta data repository programming specifications 35 Roles Involved in This Step • BI infrastructure architect • Data administrator • Meta data administrator 36