Unit No. 01: Data Warehouse Fundamentals

What is a Data Warehouse?
A Data Warehouse (DW) is a relational database that is designed for query and analysis rather than transaction processing. It includes historical data derived from transaction data from single and multiple sources. Bill Inmon defines a data warehouse as a "subject-oriented, non-volatile, integrated, time-variant collection of data in support of management's decisions."
A Data Warehouse is kept separate from the operational DBMS. It stores a huge amount of data, typically collected from multiple heterogeneous sources such as files, DBMSs, etc. The goal is to produce statistical results that help in decision making. For example, a college might want to see quickly how the placement of CS students has improved over the last 10 years, in terms of salaries, counts, etc.
A data warehouse is an information system that contains historical and cumulative data from single or multiple sources. It simplifies the reporting and analysis process of the organization. It is also a single version of truth for the company for decision making and forecasting.

Characteristics of a Data Warehouse
o Subject-Oriented
o Integrated
o Time-Variant
o Non-Volatile

Subject-Oriented
A data warehouse is subject-oriented as it offers information regarding a theme instead of the company's ongoing operations. These subjects can be sales, marketing, distribution, etc. A data warehouse never focuses on the ongoing operations; instead, it puts emphasis on modelling and analysis of data for decision making. It also provides a simple and concise view around the specific subject by excluding data which is not helpful to support the decision process.

Integrated
In a Data Warehouse, integration means the establishment of a common unit of measure for all similar data from dissimilar databases. The data also needs to be stored in the Data Warehouse in a common and universally acceptable manner. A data warehouse is developed by integrating data from varied sources such as mainframes, relational databases, flat files, etc. Moreover, it must keep consistent naming conventions, formats, and coding. This integration helps in effective analysis of data. Consistency in naming conventions, attribute measures, encoding structures, etc. has to be ensured.

Time-Variant
The time horizon for a data warehouse is quite extensive compared with operational systems. The data collected in a data warehouse is recognized with a particular period and offers information from a historical point of view. It contains an element of time, explicitly or implicitly. One place where data warehouse data displays time variance is in the structure of the record key: every primary key contained in the DW should have, either implicitly or explicitly, an element of time, such as the day, week, or month. Another aspect of time variance is that once data is inserted in the warehouse, it cannot be updated or changed.

Non-Volatile
A data warehouse is also non-volatile, meaning the previous data is not erased when new data is entered into it. Data is read-only and periodically refreshed. This also helps to analyze historical data and understand what happened and when. It does not require transaction processing, recovery, or concurrency control mechanisms. Activities like delete, update, and insert, which are performed in an operational application environment, are omitted in a data warehouse environment. Only two types of data operations are performed in data warehousing: data loading and data access.
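The sketch below is a minimal illustration (not from the original text) of the time-variant and non-volatile characteristics: the fact table's primary key carries an explicit date element, and loading only appends rows. The table name, columns, and values are illustrative assumptions based on the college placement example above.

```python
# Minimal sketch: a time-variant, non-volatile fact table in SQLite.
# Column names and figures are hypothetical, used only for illustration.
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

cur.execute("""
    CREATE TABLE placement_fact (
        snapshot_date   TEXT NOT NULL,   -- explicit time element in the key
        department      TEXT NOT NULL,
        students_placed INTEGER,
        avg_salary      REAL,
        PRIMARY KEY (snapshot_date, department)
    )
""")

def load_snapshot(rows):
    # Non-volatile load: INSERT only. No UPDATE or DELETE is ever issued,
    # so earlier snapshots remain available for historical analysis.
    cur.executemany("INSERT INTO placement_fact VALUES (?, ?, ?, ?)", rows)
    conn.commit()

load_snapshot([("2023-06-30", "CS", 180, 650000.0)])
load_snapshot([("2024-06-30", "CS", 210, 720000.0)])

# Read-only analysis over the accumulated history, e.g. the salary trend for CS.
for row in cur.execute(
    "SELECT snapshot_date, avg_salary FROM placement_fact "
    "WHERE department = 'CS' ORDER BY snapshot_date"
):
    print(row)
```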
Need for a Data Warehouse
An ordinary database can store MBs to GBs of data, and that too for a specific purpose. For storing data of TB size, the storage shifts to a Data Warehouse. Besides this, a transactional database does not lend itself to analytics. To effectively perform analytics, an organization keeps a central Data Warehouse to closely study its business by organizing, understanding, and using its historical data for taking strategic decisions and analysing trends.

Data Warehouse Architecture
A data warehouse architecture is a method of defining the overall architecture of data communication, processing, and presentation that exists for end-client computing within the enterprise. Each data warehouse is different, but all are characterized by standard vital components.
Production applications such as payroll, accounts payable, product purchasing, and inventory control are designed for online transaction processing (OLTP). Such applications gather detailed data from day-to-day operations. Data Warehouse applications are designed to support the user's ad-hoc data requirements, an activity recently dubbed online analytical processing (OLAP). These include applications such as forecasting, profiling, summary reporting, and trend analysis.
Production databases are updated continuously, either by hand or via OLTP applications. In contrast, a warehouse database is updated from operational systems periodically, usually during off-hours. As OLTP data accumulates in production databases, it is regularly extracted, filtered, and then loaded into a dedicated warehouse server that is accessible to users. As the warehouse is populated, it must be restructured: tables de-normalized, data cleansed of errors and redundancies, and new fields and keys added to reflect the needs of users for sorting, combining, and summarizing data.
Data warehouses and their architectures vary depending upon the elements of an organization's situation. Three common architectures are:
o Data Warehouse Architecture: Basic
o Data Warehouse Architecture: With Staging Area
o Data Warehouse Architecture: With Staging Area and Data Marts

Data Warehouse Architecture: Basic
Operational System
An operational system is a term used in data warehousing to refer to a system that processes the day-to-day transactions of an organization.
Flat Files
A flat file system is a system of files in which transactional data is stored, and every file in the system must have a different name.
Meta Data
A set of data that defines and gives information about other data. Metadata is used in a Data Warehouse for a variety of purposes, including:
o Metadata summarizes necessary information about data, which can make finding and working with particular instances of data easier. For example, author, date created, date modified, and file size are examples of very basic document metadata.
o Metadata is used to direct a query to the most appropriate data source.
Lightly and Highly Summarized Data
This area of the data warehouse stores all the predefined lightly and highly summarized (aggregated) data generated by the warehouse manager. The goal of the summarized information is to speed up query performance. The summarized record is updated continuously as new information is loaded into the warehouse.
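As a rough sketch of what "lightly summarized" data means, the snippet below (illustrative names only, not from the original text) aggregates detailed sales rows into a monthly summary table that the warehouse manager would rebuild as new information is loaded.

```python
# Minimal sketch: building a lightly summarized (aggregated) table from detail rows.
# Table and column names are hypothetical.
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE sales_detail (sale_date TEXT, store TEXT, amount REAL)")
cur.execute("CREATE TABLE sales_monthly_summary (month TEXT, store TEXT, total REAL)")

cur.executemany("INSERT INTO sales_detail VALUES (?, ?, ?)", [
    ("2024-01-05", "Pune", 100.0),
    ("2024-01-17", "Pune", 250.0),
    ("2024-02-02", "Mumbai", 400.0),
])

# Rebuild the summary; queries against it avoid scanning the detail rows,
# which is the point of keeping pre-aggregated data in the warehouse.
cur.execute("DELETE FROM sales_monthly_summary")
cur.execute("""
    INSERT INTO sales_monthly_summary
    SELECT substr(sale_date, 1, 7), store, SUM(amount)
    FROM sales_detail
    GROUP BY substr(sale_date, 1, 7), store
""")

print(cur.execute("SELECT * FROM sales_monthly_summary").fetchall())
```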
End-User Access Tools
The principal purpose of a data warehouse is to provide information to business managers for strategic decision-making. These users interact with the warehouse using end-user access tools. Examples of end-user access tools include:
o Reporting and Query Tools
o Application Development Tools
o Executive Information Systems Tools
o Online Analytical Processing Tools
o Data Mining Tools

Data Warehouse Schemas
A schema is defined as a logical description of a database where fact and dimension tables are joined in a logical manner. A Data Warehouse is maintained in the form of Star, Snowflake, and Fact Constellation schemas.

Star Schema
A Star schema contains a fact table and multiple dimension tables. Each dimension is represented by only one dimension table, and the dimension tables are not normalized. Each dimension table contains a set of attributes.
Characteristics:
o In a Star schema, there is only one fact table and multiple dimension tables.
o Each dimension is represented by a single dimension table.
o Dimension tables are not normalized.
o Each dimension table is joined to a key in the fact table.
The following illustration shows the sales data of a company with respect to four dimensions, namely Time, Item, Branch, and Location. There is a fact table at the center. It contains the keys to each of the four dimensions. The fact table also contains the measures, namely dollars sold and units sold.
Note: each dimension has only one dimension table, and each table holds a set of attributes. For example, the location dimension table contains the attribute set {location_key, street, city, province_or_state, country}. This constraint may cause data redundancy. For example, "Vancouver" and "Victoria" are both cities in the Canadian province of British Columbia, so the entries for such cities repeat the values of province_or_state and country.

Snowflake Schema
Some dimension tables in the Snowflake schema are normalized. The normalization splits up the data into additional tables, as shown in the following illustration. Unlike the Star schema, the dimension tables in a Snowflake schema are normalized. For example, the item dimension table of the Star schema is normalized in the Snowflake schema and split into two dimension tables, namely item and supplier. Now the item dimension table contains the attributes item_key, item_name, type, brand, and supplier_key. The supplier_key is linked to the supplier dimension table, which contains the attributes supplier_key and supplier_type.
Note: due to the normalization in the Snowflake schema, redundancy is reduced; the schema therefore becomes easier to maintain and saves storage space.

Fact Constellation Schema (Galaxy Schema)
A fact constellation has multiple fact tables. It is also known as a Galaxy Schema. The following illustration shows two fact tables, namely Sales and Shipping. The sales fact table is the same as that in the Star schema. The shipping fact table has five dimensions, namely item_key, time_key, shipper_key, from_location, and to_location. The shipping fact table also contains two measures, namely dollars sold and units sold. It is also possible to share dimension tables between fact tables. For example, the time, item, and location dimension tables are shared between the sales and shipping fact tables.
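The following is a minimal DDL sketch of the star/snowflake example described above, written for SQLite. The fact table columns, the location, item, and supplier attributes follow the text; the time and branch dimension attributes and all column types are assumptions added for illustration.

```python
# Minimal sketch of the star schema (with the snowflaked item/supplier split).
# time_dim and branch_dim attributes are hypothetical; the rest follow the text.
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
CREATE TABLE location_dim (
    location_key INTEGER PRIMARY KEY,
    street TEXT, city TEXT, province_or_state TEXT, country TEXT
);
CREATE TABLE time_dim   (time_key   INTEGER PRIMARY KEY, day TEXT, month TEXT, year INTEGER);
CREATE TABLE branch_dim (branch_key INTEGER PRIMARY KEY, branch_name TEXT);

-- Snowflake variant: the item dimension is normalized into item + supplier.
CREATE TABLE supplier_dim (supplier_key INTEGER PRIMARY KEY, supplier_type TEXT);
CREATE TABLE item_dim (
    item_key INTEGER PRIMARY KEY,
    item_name TEXT, type TEXT, brand TEXT,
    supplier_key INTEGER REFERENCES supplier_dim(supplier_key)
);

-- The fact table holds the keys of the four dimensions plus the measures.
CREATE TABLE sales_fact (
    time_key     INTEGER REFERENCES time_dim(time_key),
    item_key     INTEGER REFERENCES item_dim(item_key),
    branch_key   INTEGER REFERENCES branch_dim(branch_key),
    location_key INTEGER REFERENCES location_dim(location_key),
    dollars_sold REAL,
    units_sold   INTEGER
);
""")
print("star/snowflake schema created")
```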
Online Analytical Processing (OLAP)
OLAP stands for Online Analytical Processing. OLAP systems have the capability to analyse database information from multiple systems at the current time. The primary goal of an OLAP service is data analysis, not data processing. OLAP consists of a type of software tool that is used for data analysis to support business decisions.

OLAP Examples
Any type of Data Warehouse system is an OLAP system. Uses of OLAP systems include:
o Spotify analyses the songs played by users to come up with a personalized homepage of their songs and playlists.
o Netflix's movie recommendation system.

Online Transaction Processing (OLTP)
OLTP (online transactional processing) enables the rapid, accurate data processing behind ATMs and online banking, cash registers and e-commerce, and scores of other services we interact with each day. It enables the real-time execution of large numbers of database transactions by large numbers of people, typically over the internet. Online transaction processing provides transaction-oriented applications in a 3-tier architecture. OLTP administers the day-to-day transactions of an organization.

OLTP Examples
A typical example of an OLTP system is an ATM center: the person who authenticates first receives the amount first, and the amount to be withdrawn must be present in the ATM. Uses of OLTP systems include:
o An ATM center is an OLTP application. OLTP handles the ACID (Atomicity, Consistency, Isolation, and Durability) properties during data transactions via the application.
o It is also used for online banking, online airline ticket booking, sending a text message, and adding a book to a shopping cart.

Benefits of OLTP Services
o OLTP services allow users to perform read, write, and delete operations quickly.
o OLTP services support growing numbers of users and transactions and give real-time access to data.
o OLTP services help to provide better security by applying multiple security features.
o OLTP services help in better decision making by providing accurate, current data.
o OLTP services provide data integrity, consistency, and high availability of the data.

Drawbacks of OLTP Services
o OLTP has limited analysis capability, as it is not intended for complex analysis or reporting.
o OLTP has high maintenance costs because of frequent maintenance, backups, and recovery.
o OLTP services are hampered whenever there is a hardware failure, which leads to the failure of online transactions.
o OLTP services many times experience issues such as duplicate or inconsistent data.
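The sketch below (illustrative, not from the original text) shows the kind of short, atomic transaction an OLTP system runs, using the ATM example: the withdrawal either commits as a whole or rolls back, which is the atomicity and consistency part of ACID. The account table and amounts are hypothetical.

```python
# Minimal sketch of an OLTP-style transaction (ATM withdrawal) in SQLite.
import sqlite3

conn = sqlite3.connect(":memory:", isolation_level=None)  # explicit transactions
cur = conn.cursor()
cur.execute("CREATE TABLE account (account_id INTEGER PRIMARY KEY, balance REAL)")
cur.execute("INSERT INTO account VALUES (1, 500.0)")

def withdraw(account_id, amount):
    cur.execute("BEGIN")
    try:
        (balance,) = cur.execute(
            "SELECT balance FROM account WHERE account_id = ?", (account_id,)
        ).fetchone()
        if balance < amount:
            raise ValueError("insufficient funds")
        cur.execute(
            "UPDATE account SET balance = balance - ? WHERE account_id = ?",
            (amount, account_id),
        )
        cur.execute("COMMIT")      # durable once committed
        return True
    except Exception:
        cur.execute("ROLLBACK")    # nothing is half-applied
        return False

print(withdraw(1, 200.0))  # True: balance becomes 300.0
print(withdraw(1, 900.0))  # False: transaction rolls back, balance stays 300.0
```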
Difference between OLAP and OLTP
o Definition: OLAP is well known as an online database query management system; OLTP is well known as an online database modifying system.
o Data source: OLAP consists of historical data from various databases; OLTP consists of only operational, current data.
o Method used: OLAP makes use of a data warehouse; OLTP makes use of a standard database management system (DBMS).
o Application: OLAP is subject-oriented and is used for data mining, analytics, decision making, etc.; OLTP is application-oriented and is used for business tasks.
o Normalization: in an OLAP database, tables are not normalized; in an OLTP database, tables are normalized (3NF).
o Usage of data: OLAP data is used in planning, problem-solving, and decision making; OLTP data is used to perform day-to-day fundamental operations.
o Task: OLAP provides a multi-dimensional view of different business tasks; OLTP reveals a snapshot of present business tasks.
o Purpose: OLAP serves to extract information for analysis and decision-making; OLTP serves to insert, update, and delete information from the database.
o Volume of data: OLAP stores a large amount of data, typically TB or PB; OLTP data is relatively small, as historical data is archived, typically MB or GB.
o Queries: OLAP queries are relatively slow because the amount of data involved is large, and may take hours; OLTP queries are very fast, as they operate on roughly 5% of the data.
o Update: the OLAP database is not often updated, so data integrity is unaffected; in an OLTP database, the data integrity constraint must be maintained.
o Backup and recovery: OLAP only needs backup from time to time; in OLTP, the backup and recovery process is maintained rigorously.
o Processing time: in OLAP, the processing of complex queries can take a long time; OLTP is comparatively fast because its queries are simple and straightforward.
o Types of users: OLAP data is generally managed by CEOs, MDs, and GMs; OLTP data is managed by clerks and managers.
o Operations: OLAP involves only read and rarely write operations; OLTP involves both read and write operations.
o Updates: in OLAP, data is refreshed on a regular basis with lengthy, scheduled batch operations; in OLTP, the user initiates data updates, which are brief and quick.
o Nature of audience: the OLAP process is focused on the customer; the OLTP process is focused on the market.
o Database design: OLAP design is focused on the subject; OLTP design is focused on the application.
o Productivity: OLAP improves the efficiency of business analysts; OLTP enhances the end user's productivity.

Online Analytical Processing Server (OLAP)
An Online Analytical Processing (OLAP) server is based on the multidimensional data model. It allows managers and analysts to get insight into the information through fast, consistent, and interactive access to it. This section covers the types of OLAP servers and the operations on OLAP.

Types of OLAP Servers
We have four types of OLAP servers:
o Relational OLAP (ROLAP)
o Multidimensional OLAP (MOLAP)
o Hybrid OLAP (HOLAP)
o Specialized SQL Servers

Relational OLAP
ROLAP servers are placed between the relational back-end server and client front-end tools. To store and manage warehouse data, ROLAP uses a relational or extended-relational DBMS. ROLAP includes the following:
o Implementation of aggregation navigation logic.
o Optimization for each DBMS back end.
o Additional tools and services.

Multidimensional OLAP
MOLAP uses array-based multidimensional storage engines for multidimensional views of data. With multidimensional data stores, the storage utilization may be low if the data set is sparse. Therefore, many MOLAP servers use two levels of data storage representation to handle dense and sparse data sets.

Hybrid OLAP
Hybrid OLAP is a combination of both ROLAP and MOLAP. It offers the higher scalability of ROLAP and the faster computation of MOLAP. HOLAP servers allow storing large volumes of detailed information, while the aggregations are stored separately in a MOLAP store.

Specialized SQL Servers
Specialized SQL servers provide advanced query language and query processing support for SQL queries over star and snowflake schemas in a read-only environment.
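To make the sparsity point concrete, the rough sketch below (all numbers are made up for illustration) compares a dense MOLAP-style cube, which allocates a cell for every coordinate combination, with a sparse representation that keeps only the populated cells.

```python
# Minimal sketch: why MOLAP array storage wastes space on sparse data sets.
items, locations, quarters = 1000, 500, 8
total_cells = items * locations * quarters   # cells a dense cube would allocate
actual_facts = 20_000                        # cells that actually hold sales

print("dense cells allocated:", total_cells)                  # 4,000,000
print("cells actually used  :", actual_facts)
print("storage utilization  :", actual_facts / total_cells)   # 0.5%

# A sparse representation stores only populated cells, e.g. keyed by coordinates.
# MOLAP servers typically combine both: dense blocks for dense regions of the
# cube and a sparse structure elsewhere (the "two levels" mentioned above).
sparse_cube = {(3, 42, 1): 1200.0, (3, 42, 2): 950.0}  # (item, location, quarter) -> dollars_sold
print("sparse entries stored:", len(sparse_cube))
```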
OLAP Operations
Since OLAP servers are based on a multidimensional view of data, we discuss OLAP operations on multidimensional data. The OLAP operations are:
o Roll-up
o Drill-down
o Slice and dice
o Pivot (rotate)

Roll-up
Roll-up performs aggregation on a data cube in either of the following ways:
o By climbing up a concept hierarchy for a dimension
o By dimension reduction
The following diagram illustrates how roll-up works. Roll-up is performed by climbing up the concept hierarchy for the dimension location. Initially the concept hierarchy was "street < city < province < country". On rolling up, the data is aggregated by ascending the location hierarchy from the level of city to the level of country, so the data is grouped into countries rather than cities. When roll-up is performed, one or more dimensions from the data cube are removed.

Drill-down
Drill-down is the reverse operation of roll-up. It is performed in either of the following ways:
o By stepping down a concept hierarchy for a dimension
o By introducing a new dimension
The following diagram illustrates how drill-down works. Drill-down is performed by stepping down the concept hierarchy for the dimension time. Initially the concept hierarchy was "day < month < quarter < year". On drilling down, the time dimension descends from the level of quarter to the level of month. When drill-down is performed, one or more dimensions are added to the data cube. It navigates from less detailed data to highly detailed data.

Slice
The slice operation selects one particular dimension from a given cube and provides a new sub-cube. Consider the following diagram that shows how slice works. Here the slice is performed for the dimension "time" using the criterion time = "Q1", forming a new sub-cube by selecting that one dimension.

Dice
Dice selects two or more dimensions from a given cube and provides a new sub-cube. Consider the following diagram that shows the dice operation. The dice operation on the cube is based on the following selection criteria, which involve three dimensions:
o (location = "Toronto" or "Vancouver")
o (time = "Q1" or "Q2")
o (item = "Mobile" or "Modem")

Pivot
The pivot operation is also known as rotation. It rotates the data axes in view in order to provide an alternative presentation of the data. Consider the following diagram that shows the pivot operation.
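The sketch below (illustrative, not from the original text) shows how roll-up, drill-down, slice, and dice map onto SQL over a small, denormalized sales table; pivot is omitted because it only changes the presentation of the result, not its content. Real OLAP servers apply these operations to a multidimensional cube; the queries here merely mimic the idea, and the table and rows are hypothetical.

```python
# Minimal sketch: SQL analogues of the OLAP operations over a toy sales table.
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("""CREATE TABLE sales
               (city TEXT, country TEXT, quarter TEXT, month TEXT,
                item TEXT, dollars_sold REAL)""")
cur.executemany("INSERT INTO sales VALUES (?, ?, ?, ?, ?, ?)", [
    ("Toronto",   "Canada", "Q1", "Jan", "Mobile", 100.0),
    ("Vancouver", "Canada", "Q1", "Feb", "Modem",  150.0),
    ("Chicago",   "USA",    "Q2", "Apr", "Mobile", 200.0),
])

# Roll-up: climb the location hierarchy from city to country.
rollup = "SELECT country, SUM(dollars_sold) FROM sales GROUP BY country"

# Drill-down: step down the time hierarchy from quarter to month.
drilldown = "SELECT quarter, month, SUM(dollars_sold) FROM sales GROUP BY quarter, month"

# Slice: fix one dimension (time = 'Q1') to obtain a sub-cube.
slice_q1 = "SELECT * FROM sales WHERE quarter = 'Q1'"

# Dice: restrict two or more dimensions at once.
dice = """SELECT * FROM sales
          WHERE city IN ('Toronto', 'Vancouver')
            AND quarter IN ('Q1', 'Q2')
            AND item IN ('Mobile', 'Modem')"""

for name, query in [("roll-up", rollup), ("drill-down", drilldown),
                    ("slice", slice_q1), ("dice", dice)]:
    print(name, cur.execute(query).fetchall())
```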
ETL (Extract, Transform, and Load) Process

What is the ETL Process?
The mechanism of extracting information from source systems and bringing it into the data warehouse is commonly called ETL, which stands for Extraction, Transformation, and Loading. The ETL process requires active inputs from various stakeholders, including developers, analysts, testers, and top executives, and is technically challenging. To maintain its value as a tool for decision-makers, a data warehouse needs to change as the business changes. ETL is a recurring activity (daily, weekly, or monthly) of a data warehouse system and needs to be agile, automated, and well documented.

How ETL Works
ETL consists of three separate phases:

Extraction
o Extraction is the operation of extracting information from a source system for further use in a data warehouse environment. This is the first stage of the ETL process.
o The extraction process is often one of the most time-consuming tasks in ETL.
o The source systems might be complicated and poorly documented, and thus determining which data needs to be extracted can be difficult.
o The data has to be extracted several times in a periodic manner to supply all changed data to the warehouse and keep it up-to-date.

Cleansing
The cleansing stage is crucial in a data warehouse because it is supposed to improve data quality. The primary data cleansing features found in ETL tools are rectification and homogenization. They use specific dictionaries to rectify typing mistakes and to recognize synonyms, as well as rule-based cleansing to enforce domain-specific rules and to define appropriate associations between values. The following examples show why data cleansing is essential:
o If an enterprise wishes to contact its users or its suppliers, a complete, accurate, and up-to-date list of contact addresses, email addresses, and telephone numbers must be available.
o If a client or supplier calls, the staff responding should be able to quickly find the person in the enterprise database, but this requires that the caller's name or company name is listed in the database.

Transformation
Transformation is the core of the reconciliation phase. It converts records from their operational source format into a particular data warehouse format. If we implement a three-layer architecture, this phase outputs our reconciled data layer. The following issues must be rectified in this phase:
o Loose text may hide valuable information. For example, "XYZ Pvt Ltd" does not explicitly show that this is a private limited company.
o Different formats can be used for the same data. For example, a date can be saved as a string or as three integers.
The main transformation processes aimed at populating the reconciled data layer are:
o Conversion and normalization, which operate on both storage formats and units of measure to make data uniform.
o Matching, which associates equivalent fields in different sources.
o Selection, which reduces the number of source fields and records.
Cleansing and transformation processes are often closely linked in ETL tools.

Loading
The load is the process of writing the data into the target database. During the load step, it is necessary to ensure that the load is performed correctly and with as few resources as possible. Loading can be carried out in two ways:
1. Refresh: the data warehouse data is completely rewritten, which means the older data is replaced. Refresh is usually used in combination with static extraction to populate a data warehouse initially.
2. Update: only the changes applied to the source information are added to the data warehouse. An update is typically carried out without deleting or modifying pre-existing data. This method is used in combination with incremental extraction to update data warehouses regularly.
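To tie the three phases together, the sketch below is a minimal, end-to-end illustration (not from the original text): it extracts a few operational rows, transforms them (cleansing, type conversion, and date-format normalization), and loads them using the incremental "update" strategy, leaving previously loaded rows untouched. All table names, fields, and sample values are hypothetical.

```python
# Minimal ETL sketch: extract -> transform (cleanse/normalize) -> load (update).
import sqlite3

# Extraction: in practice this reads from source systems or flat files;
# here the operational rows are simply hard-coded.
source_rows = [
    {"order_id": 1, "customer": " alice ", "amount_usd": "120.50", "order_date": "05/01/2024"},
    {"order_id": 2, "customer": "BOB",     "amount_usd": "80",     "order_date": "17/01/2024"},
]

def transform(row):
    # Cleansing and conversion: trim and standardize names, cast amounts to
    # numbers, and normalize the date format to ISO (YYYY-MM-DD).
    day, month, year = row["order_date"].split("/")
    return (row["order_id"],
            row["customer"].strip().title(),
            float(row["amount_usd"]),
            f"{year}-{month}-{day}")

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("""CREATE TABLE orders_fact
               (order_id INTEGER PRIMARY KEY, customer TEXT,
                amount REAL, order_date TEXT)""")

# Loading with the update strategy: only rows not already present are inserted,
# so pre-existing warehouse data is neither deleted nor modified.
cur.executemany(
    "INSERT OR IGNORE INTO orders_fact VALUES (?, ?, ?, ?)",
    [transform(r) for r in source_rows],
)
conn.commit()
print(cur.execute("SELECT * FROM orders_fact").fetchall())
```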