02-1-Datawarehousing Components

Topics being covered:

REVIEW - GENERAL
-- OLTP vs OLAP
-- WHAT IS A DATA WAREHOUSE
DATA WAREHOUSING TERMINOLOGY
-- Some terminology to be aware of
WHY DATA WAREHOUSING IS NEEDED
-- WHY is it separated?
-- Some of the reasons for the separation are historical
-- Combining of more than one system
-- Company information cannot be adequately analyzed
-- Performance degradation
-- Operational data is volatile
-- Data stored longer for analysis
-- Data warehouse closer to business structure
MORE OF WHAT IS DATA WAREHOUSING
-- Basic Functions of a Data Warehouse
DATA WAREHOUSE ARCHITECTURE
BENEFITS OF DW

Document1 by rt -- 15 March 2016 1 of 12

REVIEW GENERAL

The business needs to know what is going on.
Every business has data in operational systems.
Management asks questions about the business, not about the data:
- Are we selling more this week than usual?
- Why did something sell well in one store but not in the other store?

Users will modify the question when they hear the answer:
- We discovered we are selling more this week.
- Now, which product sold more than usual, and was there a reason? Did the month have 5 weekends this year and only 4 last year?
- Having discovered which product, now tell me whether that product makes us any money (gross margin).
- Why did that product do well in that store and not in a similar store?

The more answers management sees, the more additional questions are generated in order to analyze what is truly happening in the business, to a customer, or in a region.

Each question that was not anticipated requires the professional (you again) to take the time to write programs to retrieve the data. This takes valuable time.
This is the vicious circle mentioned before in the notes.

A COMMON PROBLEM

Answers are often late in coming.
Someone, a professional, has to build a report (that is you, and to a lesser extent the ...).
Answers usually come from one database.

REVIEW OLAP vs OLTP

OLTP (Online Transaction Processing) | OLAP (Online Analytical Processing)
------------------------------------ | -----------------------------------
Processes real-time transactions | Decision oriented
Report types: activity tracking, customer services, lists (products, employees, customers, equipment) | Report types: patterns, trends, anomalies
Detail then summary (bottom-up reporting) | Summary then detail (top-down reporting)
Static report | Slice and dice to manipulate the report
Thousands or millions of transactions per day | Fewer "transactions" (meaning requests) per day
Small number of rows accessed per transaction | Large number of rows accessed per request, often all rows
Data updated frequently | Read only; updated periodically ("loading production data")
Volatile | Less volatile
Relational modeling: optimized for transaction processing, normalized | Dimensional modeling: optimized for querying large volumes of data

OLAP: how it tends to view data
- Takes a more global view of data.
- Data is more summarized.
- OLAP is not focused on detailed information.
- Not the individual order, but the total sales by product over a time period. This is a more summarized look. It allows us to look for patterns.

REVIEW - WHAT IS A DATA WAREHOUSE

Here are several ways to look at the same thing.

Data warehousing is a methodology for delivering informational answers to analytical business questions via a data warehouse.

A simple answer could be: a data warehouse is managed data situated after and outside the operational systems.

A data warehouse is a tool that stores numerical data to measure or analyze the business, usually over time periods.
A data warehouse is a store of data derived from the data generated in the normal day-to-day, or operational, side of the business.
The operational data comes from transactions. Sales, bank deposits, product moving from raw materials to finished goods, registering students, allocating classrooms, and recording student marks are some examples.

MORE ABOUT A DATA WAREHOUSE

- It is a database designed to answer analytical questions, making use of summarization and dimensionality.
- It uses historical data.
- It is fed with (relational) data. The data has been:
  - Extracted from existing source(s),
  - Cleaned,
  - Transformed, and
  - Integrated.
- End users use OLAP or Business Intelligence tools to access it.
- It does not generate new data; it simply turns existing data into relevant information.
- It is implemented using a standard RDBMS.

SPECIAL NOTE:
Originally the course had a lot of PowerPoints that were a pain to update. They were converted to Word documents. What you see is a variety of print sizes and boxes.
These notes were not originally meant as student notes, but as personal notes to remind me about the subject. When you read them you will feel that the same thing gets said more than once on the same page.
These notes are being converted to have greater value to students. They will not be written in book form, but will continue to be written with boxes and coloured emphasis.
Example: the next line reminded me to draw on the board a …

Rough Diagram of overall system

DATA WAREHOUSING TERMINOLOGY

This is some terminology used in the industry that you need to know. (It appears on tests also.)

Data Mart
A subset of a data warehouse that includes information from one department is a data mart.
Data marts can be combined into an enterprise DW.

Data Mining
A process used for decision support and for discovering information from deep within a database.

Decision Support System (DSS)
A set of programs, based on statistical models or trend analysis, that assists in management decisions.

On-Line Analytical Processing (OLAP)
Database software tools that make use of an intermediary database to store summary information and pre-defined calculations.

On-Line Transactional Processing (OLTP)
A system that keeps track of an establishment's daily transactions and updates the warehouse at periodic intervals.

Data Cleansing
Data cleansing ensures that the data extracted from the operational database contains valid information. You can evaluate the data on a logical or on a technical level.

WHY DATA WAREHOUSING IS NEEDED

NOTE: Very important
The primary concept of data warehousing is that the data stored for business analysis is most effectively accessed if it is separated from the data in the operational system.

WHY is it separated?

- Some of the reasons for the separation are historical.
Throughout the history of systems development, the primary emphasis has been given to the operational systems and the data they process.
It is not practical to keep data in the operational systems indefinitely, and only as an afterthought was a structure designed for archiving the data that the operational system had processed.
  - In the past, historical data was stored on tape or archived as it became inactive.
  - Analytical reporting was done using the data from these archives.
Sometimes mirror data sources were used to reduce the impact on the resources needed to run the daily operational systems.

- Combining of more than one system
Data warehousing systems are most successful when data can be combined from more than one operational system.
Data warehousing can combine multiple source applications such as sales, marketing, finance and production.
Data warehousing data is built around time dimensions. Time is a filtering criterion for most data warehousing activity.
Data warehousing queries combine or group data over different time periods.
  - Queries are weekly, monthly, quarterly, or yearly.
  - Year-to-year comparisons are common: comparing data from different time periods.

- Company information cannot be adequately analyzed in the form in which it is stored.
Operational relational databases are optimized for data entry, since operational data is generated continuously.
  - Data is spread over multiple tables to reduce redundancy.
  - Transaction data is too detailed; it contains data not needed for analysis.
  - It requires a lot of joins to get information, and joins take a lot of processing resources.
  - Data is not in the proper format for analysis.
  - All of the above and more.

WHY is it separated? (Continued)

- Performance degradation
Running transaction processing and query processing on the same system can greatly slow it down.
The most important reason for separating data for business analysis from the operational data has always been the potential performance degradation on the operational system.
High performance and quick response time are critical for operational systems. A 5-second delay may seem small, but take into account the effect it has on other operations, or multiply it by the number of orders processed. Think of how you feel about the Internet and slow response time.
Slow processing of orders can result in lost sales. Slow processing of monthly statements can result in slower receipt of payments, adversely affecting cash flows.
Operational activities are more predictable.
  - Operational activities are often measured, and the effects of long response times (telecommunication costs, extra staff, or extra operational hours) are factors in the cost of operations.
DW queries are not as easily predefined.
  - Activity against a data warehouse is hard to predict.
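The 5-second delay point above can be made concrete with a little arithmetic. The sketch below multiplies the delay by a daily order volume; the 5 seconds comes from the notes, while the figure of 10,000 orders per day and the function name `total_delay_hours` are assumptions chosen only for illustration.

```python
def total_delay_hours(delay_seconds: float, orders_per_day: int) -> float:
    """Cumulative waiting time per day, in hours, if every order waits delay_seconds."""
    return delay_seconds * orders_per_day / 3600


DELAY_SECONDS = 5        # the delay mentioned in the notes
ORDERS_PER_DAY = 10_000  # hypothetical volume, for illustration only

lost_hours = total_delay_hours(DELAY_SECONDS, ORDERS_PER_DAY)
print(f"{lost_hours:.1f} hours of cumulative waiting per day")
```

Under these assumed numbers, a delay that looks negligible per order adds up to more than half a day of waiting across the operation, which is why analytical queries are kept off the operational system.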
- Operational data is volatile
Inventory data in an operational system changes with every transaction.
Carrying all the changes to the data warehouse is impossible.

- Data needed for analysis is stored longer
Data is saved in a DW for longer periods than in transaction systems.
A product on an assembly line has many processes and many transactions applied to it. Data from operations is transferred to the data warehouse when most of the activity is completed.
Data in a data warehouse represents a snapshot in time.
  - It is not changed.
  - It is historical in nature.

- Data warehouse model aligns with the business structure
Data is archived after it becomes inactive. The primary reason for archiving data is performance.
Data warehouses are designed to be the archives for the operational data, and the data is saved for much longer periods. Storing data for more than 5 years is common.
A data warehouse design/model looks closer to the business structure than to an application model: a business model rather than an ERD model.

- Data warehouse model looks closer to the business structure
Notes to read
A data warehouse looks more like the way the business is structured. The tables/entities in the data warehouse appear closer to how the actual business entities appear.
Management doesn't think in terms of INVOICE and INVOICE LINE. The invoice line comes about by normalizing an invoice.
    INVOICE [IID, CID, DATE]
    INVOICE LINE [OID, PID, QTY-SOLD, Selling Price]
Management thinks in terms of customers, products, orders, and distributors.
Parts of the organization may deal with customers only within a small set of the activities that a customer has with the organization. Different parts of an organization may have a very narrow view of a business entity such as a customer.
In the college, the registration department is concerned with the dealings it has with a student along the lines of obtaining the data from an application, sending out an invoice, collecting the money, and allowing access to the program of choice.
The educational department that the student chose has a limited set of dealings with the student/customer, and it isn't about whether tuition has been paid. The faculty is concerned with data that leads to transcripts.
The data warehouse needs to view the student as a whole. The "parts" of the college would provide information to be extracted into a whole view of the student.
In building a "whole" view of the student, the analysis needs to gather all of the attributes from all of the parts that interact with that student. This might be considered an "enterprise" view.

MORE OF WHAT IS DATA WAREHOUSING (continued)

Basic Functions involved with Data Warehousing

Software to do the following:
- Data Extraction: extract the information from the operational database and load it into the data warehouse.
- Data Cleaning: remove inconsistencies from the data coming from various sources.
  Example: inconsistency in storing the province of Ontario as ON, Ont, or Ont.
- Data Loading
- Data Storage
- Warehouse Management: manage the metadata (stored database definitions) and the repository.
- Data Mining: look for hidden patterns in the information.
- Data Visualization or Business Intelligence: software used to present the data in the warehouse to the users.

DATA WAREHOUSE ARCHITECTURE (do not memorize this)
(Source unknown)

High Level View (diagram labels)
HOST DATA SOURCES
Extract & Transform
Clean & Scrub
E. T. L.
Extract, Transform, Load
Schedule & Move
ENTERPRISE DATA STORE
CLIENT ACCESS
Enquiry, Data Mining and Presentation Tools
Store Data
Access Data
META DATA MANAGEMENT

SUMMARY -- BENEFITS OF DW

- A DW makes it easy to see what your business is doing, what it does and does not do well, where the opportunities are, and where the problems are.
- Without a DW, a business is blinded by too much detailed data, and it is difficult to access that data for answers.
- Save $$ by reducing the time and labour needed to produce information and reports.
- End users can get new insights with little or no IT intervention.
- Simplicity of the query (see below).

Write the query and ask: what is this looking for?

Standard query

    SELECT   P.BRAND, SUM(F.DOLLARS), SUM(F.UNITS)
    FROM     SALESFACT F, PRODUCT P, TIME T
    WHERE    F.PRODUCTKEY = P.PRODUCTKEY
    AND      F.TIMEKEY = T.TIMEKEY
    AND      T.QUARTER = '1 Q 2007'
    GROUP BY P.BRAND
    ORDER BY P.BRAND

Note the simplicity of the query. That means we could develop some code to automate the queries and reduce the need for expensive IT people.
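To try the standard query, here is a minimal sketch using Python's built-in sqlite3 module. The table and column names come from the query in the notes. The column types, the sample rows, the brand names, and the helper names `build_sample_warehouse` and `sales_by_brand` are assumptions made up for illustration. TIME is written in double quotes only to keep it unambiguous as a table name.

```python
import sqlite3


def build_sample_warehouse() -> sqlite3.Connection:
    """Create a tiny in-memory star schema filled with made-up rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE PRODUCT (PRODUCTKEY INTEGER PRIMARY KEY, BRAND TEXT);
        CREATE TABLE "TIME" (TIMEKEY INTEGER PRIMARY KEY, QUARTER TEXT);
        CREATE TABLE SALESFACT (
            PRODUCTKEY INTEGER REFERENCES PRODUCT(PRODUCTKEY),
            TIMEKEY    INTEGER REFERENCES "TIME"(TIMEKEY),
            DOLLARS    REAL,
            UNITS      INTEGER
        );
        -- Hypothetical data: two Acme products and one Zenith product.
        INSERT INTO PRODUCT VALUES (1, 'Acme'), (2, 'Acme'), (3, 'Zenith');
        INSERT INTO "TIME" VALUES (10, '1 Q 2007'), (11, '2 Q 2007');
        INSERT INTO SALESFACT VALUES
            (1, 10, 100.0, 4),
            (2, 10,  50.0, 1),
            (3, 10,  75.0, 3),
            (1, 11, 999.0, 9);
    """)
    return conn


# The standard query from the notes, with the quarter passed as a parameter.
STAR_QUERY = """
    SELECT   P.BRAND, SUM(F.DOLLARS), SUM(F.UNITS)
    FROM     SALESFACT F, PRODUCT P, "TIME" T
    WHERE    F.PRODUCTKEY = P.PRODUCTKEY
    AND      F.TIMEKEY = T.TIMEKEY
    AND      T.QUARTER = ?
    GROUP BY P.BRAND
    ORDER BY P.BRAND
"""


def sales_by_brand(conn: sqlite3.Connection, quarter: str) -> list:
    """Return (brand, total dollars, total units) rows for one quarter."""
    return conn.execute(STAR_QUERY, (quarter,)).fetchall()


conn = build_sample_warehouse()
q1 = sales_by_brand(conn, "1 Q 2007")
for brand, dollars, units in q1:
    print(brand, dollars, units)
```

Because only the quarter changes between runs, the same query text can be reused for any period. This is one way the automation mentioned in the note could begin.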