Business Intelligence
Unit 1: Important Concepts

INTRODUCTION
• Organizations need business intelligence
• Business intelligence (BI) – knowledge about your customers, competitors, business partners, competitive environment, and internal operations that lets you make effective, important, and strategic business decisions
• IT tools help process information to create business intelligence according to:
  – OLTP
  – OLAP
• Online transaction processing (OLTP) – the gathering of input information, processing that information, and updating existing information to reflect the gathered and processed information
  – Databases support OLTP
  – Operational database – a database that supports OLTP
• Online analytical processing (OLAP) – the manipulation of information to support decision making
  – Databases can support some OLAP
  – Data warehouses support only OLAP, not OLTP
  – Data warehouses are special forms of databases that support decision making

What Is a Data Warehouse?
• Data warehouse – logical collection of information
  – gathered from operational databases
  – used to create business intelligence that supports business analysis activities and decision-making tasks

“A data warehouse is simply a single, complete, and consistent store of data obtained from a variety of sources and made available to end users in a way they can understand and use it in a business context.”
-- Barry Devlin, IBM Consultant

• Multidimensional
  – Rows and columns
  – Also layers
• Many times called hypercubes

Data Warehouses
• A record of an enterprise's past transactional and operational information
• Designed to favor efficient data analysis and reporting
• Data warehousing is not meant for current "live" data
• Large amounts of data – sometimes subdivided into smaller logical units (dependent data marts)

What Are Data-Mining Tools?
• Data-mining tools – software tools that you use to query information in a data warehouse
  – Query-and-reporting tools
  – Intelligent agents
  – Multidimensional analysis tools
  – Statistical tools

Data Warehouses
Components of a data warehouse:
• Sources -> Data Source Interaction
• Data Transformation
• Data Warehouse (Data Storage)
• Reporting (Data Presentation)
• Metadata

Data Warehouses
ADVANTAGES
Complete control over the four main areas of data management systems:
• Clean data
• Query processing: multiple options
• Indexes: multiple types
• Security: data and access

Data Warehouses
DISADVANTAGES
• Adding new data sources takes time and carries a high associated cost
• Data owners lose control over their data, raising ownership, security, and privacy issues
• Long initial implementation time and high associated cost
• Difficult to accommodate changes in data types and ranges, data source schema, indexes, and queries

OLTP vs. OLAP
• OLTP: Online Transaction Processing – describes processing at operational sites
• OLAP: Online Analytical Processing – describes processing at the warehouse

OLTP Database vs. Data Warehouse
• Relational databases – group data using common attributes found in the data set
• The objectives are different
  – OLTP database: designed for real-time business operations
  – Data warehouse: designed for analysis of business measures by categories and attributes

OLTP database                              Data Warehouse
Mostly updates                             Mostly reads
Many small transactions                    Queries are long and complex
MB - GB of data                            GB - TB of data
Current snapshot                           History
Raw data                                   Summarized, reconciled data
Thousands of users (e.g., clerical users)  Hundreds of users (e.g., decision-makers, analysts)

SUMMARY: four questions for you
For each pair, decide which statement describes an OLTP database and which describes a data warehouse.

Question 1
  1. Designed for analysis of business measures by categories and attributes.
  2. Designed for real-time business operations.
Answer: statement 1 is the data warehouse; statement 2 is the OLTP database.

Question 2
  1. Optimized for a common set of transactions, usually adding or retrieving a single row at a time per table.
  2. Optimized for bulk loads and large, complex, unpredictable queries that access many rows per table.
Answer: statement 1 is the OLTP database; statement 2 is the data warehouse.

Question 3
  1. Optimized for validation of incoming data during transactions; uses validation data tables.
  2. Loaded with consistent, valid data; requires no real-time validation.
Answer: statement 1 is the OLTP database; statement 2 is the data warehouse.

Question 4
  1. Supports few concurrent users relative to OLTP.
  2. Supports thousands of concurrent users.
Answer: statement 1 is the data warehouse; statement 2 is the OLTP database.
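The contrast drawn in the summary questions (many small single-row transactions versus read-mostly queries that touch many rows) can be illustrated with a minimal sketch. It uses Python's built-in sqlite3 module only for convenience; the sales table, its columns, and the helper names are hypothetical and are not part of the course material. A real deployment would keep the operational database and the warehouse as separate systems.

```python
import sqlite3


def build_store():
    """Create an in-memory database with a hypothetical sales table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE sales ("
        " id INTEGER PRIMARY KEY,"
        " region TEXT NOT NULL,"
        " product TEXT NOT NULL,"
        " quantity INTEGER NOT NULL)"
    )
    return conn


def record_sale(conn, region, product, quantity):
    """OLTP-style work: one small transaction that writes a single row."""
    with conn:  # commits on success, rolls back if an exception is raised
        conn.execute(
            "INSERT INTO sales (region, product, quantity) VALUES (?, ?, ?)",
            (region, product, quantity),
        )


def units_by_region(conn):
    """OLAP-style work: a read-only query that summarizes many rows."""
    rows = conn.execute(
        "SELECT region, SUM(quantity) FROM sales GROUP BY region ORDER BY region"
    )
    return dict(rows.fetchall())


conn = build_store()
record_sale(conn, "northeast", "soup", 3)
record_sale(conn, "northeast", "crackers", 1)
record_sale(conn, "south", "soup", 2)
summary = units_by_region(conn)
print(summary)
```

Each call to record_sale touches one row and commits immediately, which is the access pattern an OLTP database is tuned for. units_by_region never writes; it reads every row and returns a summary, which is the pattern a data warehouse is tuned for.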
Data, Information & Knowledge
• Data is just symbols
• Information is data that are processed to be useful; it provides answers to "who", "what", "where", and "when" questions
• Knowledge is the application of data and information; it answers "how" questions

Data
• Data is raw. It simply exists and has no significance beyond its existence (in and of itself).
• It can exist in any form, usable or not. It does not have meaning of itself.
• In computer parlance, a spreadsheet generally starts out by holding data.

Information
• Information is data that has been given meaning by way of relational connection.
• This "meaning" can be useful, but does not have to be.
• In computer parlance, a relational database makes information from the data stored within it.

Knowledge
• Knowledge is the appropriate collection of information, such that its intent is to be useful.
• Summaries of information in a database are one example; modeling and simulation tools likewise exercise some type of stored knowledge.

Examples - Supermarket
• OLTP – The event is 3 cans of soup and 1 box of crackers bought; update the database to reflect that event
• OLAP – Last winter, in all stores in the northeast, how many customers bought soup and crackers together?
• Data mining – Are there any interesting combinations of foods that customers frequently bought together?
Copyright © 2005 Pearson Addison-Wesley. All rights reserved.

Database designing rules
Rule 1: What is the nature of the application (OLTP or OLAP)?
Rule 2: Break your data into logical pieces, make life simpler
Rule 3: Do not get overdosed with rule 2
Rule 4: Treat duplicate non-uniform data as your biggest enemy
Rule 5: Watch for data separated by separators
Rule 6: Watch for partial dependencies
Rule 7: Choose derived columns carefully
Rule 8: Do not be hard on avoiding redundancy, if performance is the key
Rule 9: Multidimensional data is a different beast altogether
Rule 10: Centralize name value table design
Rule 11: For unlimited hierarchical data self-reference PK and FK

Normal form examples
• 1NF: first name, middle name, and surname are kept in different columns
• 2NF: a column such as Syllabus should depend on the whole primary key (roll no. and standard), not on just one part of it
• 3NF: an Average column depends on marks and number of subjects, which are non-key columns, so it is a derived column
Normalization rules are important guidelines, but treating them as set in stone is calling for trouble.

Rule 1: What is the nature of the application (OLTP or OLAP)?
• Transactional: the end user is more interested in CRUD, i.e., creating, reading, updating, and deleting records. The official name for this kind of database is OLTP.
• Analytical: the end user is more interested in analysis, reporting, forecasting, etc.
  – fewer inserts and updates
  – the main intention is to fetch and analyze data as fast as possible
  – the official name for this kind of database is OLAP
• In other words, if you think inserts, updates, and deletes are more prominent, go for a normalized table design; otherwise create a flat, denormalized database structure.

Rule 2: Break your data into logical pieces, make life simpler
• This is the first rule of 1st normal form.
• If your queries use too many string-parsing functions such as substring, charindex, etc., apply this rule.
• E.g. a query for student names containing “Koirala” but not “Harisingh” becomes very complex when the whole name sits in one field.
• The better approach is to break this field into further logical pieces so that you can write clean and optimal queries.

Rule 3: Do not get overdosed with rule 2
• Is the decomposition really needed? The decomposition should be logical.
• It is rare that you will operate on the ISD codes of phone numbers separately (unless your application demands it).
• So it is wiser to leave the phone number as it is; splitting it can lead to more complications.

Rule 4: Treat duplicate non-uniform data as your biggest enemy
• Focus on duplicate data and refactor it; it creates confusion.
• For instance, in the diagram for this rule, “5th Standard” and “Fifth standard” mean the same thing.
• One solution is to move the data into a separate master table and refer to it via foreign keys.
• E.g. create a new master table called “Standards” and link to it with a simple foreign key.

Rule 5: Watch for data separated by separators
• The 2nd rule of 1st normal form says to avoid repeating groups.
• In the example, too much data is stuffed into the syllabus column. Such fields are termed “repeating groups”.
• Queries that manipulate this data become complex, and their performance degrades.
• Columns that hold data stuffed with separators need special attention; a better approach is to move those fields to a different table and link them with keys for better management.

Rule 6: Watch for partial dependencies
• Watch for fields that depend only partially on the primary key.
• E.g. the primary key is created on roll number and standard. The syllabus is associated with the standard in which the student is studying, not directly with the student.
• Move the syllabus field and attach it to the Standards table.
• This rule is the 2nd normal form: every non-key column should depend on the full primary key, not on part of it.
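Rules 2, 4, 5, and 6 can be seen together in a minimal sketch using Python's built-in sqlite3 module. All table names, column names, and sample rows here are hypothetical and chosen only to mirror the student example; in particular, this sketch assumes roll_no alone identifies a student, which is a simplification of the composite key (roll number and standard) described above.

```python
import sqlite3

SCHEMA = """
CREATE TABLE standards (
    standard_id INTEGER PRIMARY KEY,
    name TEXT NOT NULL UNIQUE            -- Rule 4: one spelling per standard
);
CREATE TABLE syllabus_subjects (         -- Rule 5: one subject per row, no separators
    standard_id INTEGER NOT NULL REFERENCES standards(standard_id),
    subject TEXT NOT NULL,
    PRIMARY KEY (standard_id, subject)   -- Rule 6: syllabus hangs off the standard
);
CREATE TABLE students (
    roll_no INTEGER PRIMARY KEY,         -- assumption: roll_no alone is the key
    first_name TEXT NOT NULL,            -- Rule 2: name split into logical pieces
    last_name TEXT NOT NULL,
    standard_id INTEGER NOT NULL REFERENCES standards(standard_id)
);
"""


def build_school():
    """Create the hypothetical normalized schema and load a few sample rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when enabled
    conn.executescript(SCHEMA)
    conn.executemany(
        "INSERT INTO standards VALUES (?, ?)",
        [(1, "5th Standard"), (2, "6th Standard")],
    )
    conn.executemany(
        "INSERT INTO syllabus_subjects VALUES (?, ?)",
        [(1, "English"), (1, "Maths"), (2, "Science")],
    )
    conn.executemany(
        "INSERT INTO students VALUES (?, ?, ?, ?)",
        [(1, "Shiv", "Koirala", 1), (2, "Raju", "Harisingh", 1), (3, "Anita", "Koirala", 2)],
    )
    conn.commit()
    return conn


def students_with_surname(conn, surname):
    """Plain equality on last_name; no substring or charindex parsing needed."""
    rows = conn.execute(
        "SELECT first_name FROM students WHERE last_name = ? ORDER BY roll_no",
        (surname,),
    )
    return [r[0] for r in rows]


def syllabus_for(conn, roll_no):
    """Reach the syllabus through the student's standard with a join."""
    rows = conn.execute(
        "SELECT subj.subject FROM students s "
        "JOIN syllabus_subjects subj ON subj.standard_id = s.standard_id "
        "WHERE s.roll_no = ? ORDER BY subj.subject",
        (roll_no,),
    )
    return [r[0] for r in rows]


conn = build_school()
koiralas = students_with_surname(conn, "Koirala")
subjects = syllabus_for(conn, 1)
print(koiralas, subjects)
```

With foreign keys enabled, a student row that points at a standard missing from the standards table is rejected, so a second spelling such as “Fifth standard” cannot slip in unnoticed. The cost is the extra join in syllabus_for, which is the trade-off the later rules on derived columns and redundancy weigh against performance.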
Rule 7: Choose derived columns carefully
• OLTP applications: getting rid of derived columns is usually a good idea.
• OLAP: with a lot of summations and calculations, these kinds of fields are necessary to gain performance.
• The 3rd normal form: “No column should depend on other non-primary key columns”.
• Look at the situation and then decide whether you want to implement the 3rd normal form.

Rule 8: Do not be hard on avoiding redundancy, if performance is the key
• If performance is what you need, think about denormalization.
• Normalization: queries have to join many tables.
• Denormalization: fewer joins, which improves performance.

Rule 9: Multidimensional data is a different beast altogether
• OLAP projects mostly deal with multidimensional data.
• E.g. get sales per country, customer, and date, where each sales figure sits at the intersection of three dimensions.

Rule 10: Centralize name value table design
• Name and value tables have a key and some data associated with that key.
• E.g. a currency table and a country table, each holding only a key and a value.
• For such tables, creating one central table and differentiating the data with a type field makes more sense.

Rule 11: For unlimited hierarchical data self-reference PK and FK
• Unlimited parent-child hierarchy.
• E.g. a multi-level marketing scenario where a salesperson can have multiple salespeople below them.
• For such scenarios, a foreign key that references the primary key of the same table (a self-reference) achieves this.

Business Models
• Depend on business requirements
• E.g. e-commerce business model

Data Marts
• Data warehouses can support all of an organization’s information
• Data marts hold subsets of an organization-wide data warehouse
• Data mart – subset of a data warehouse in which only a focused portion of the data warehouse information is kept

Assignment 1
• Differentiate between OLTP and OLAP.
• Explain the design aspects of OLTP & OLAP.
• What is BI & what are its components?

References
• OLTP vs. OLAP presentations (ppts)
• Notes by Shivprasad Koirala