Building Data Warehousing CIT858: Data Mining & Data Warehousing Bajuna Salehe The Institute of Finance Management: Computing and IT Dept. Data Transformation • Data extracted from transactional databases must go through several kinds of data transformation on its way to a data warehouse: – Data from different transactional databases being merged to form the data warehouse tables – Data will often be aggregated as it is being extracted from the transactional databases and prepared for the data warehouse – Units of measure used for attributes in different transactional databases must be reconciled as they are being merged into common data warehouse tables Data Transformation – Coding schemes used for attributes in different transactional databases must be reconciled as they are being merged into common data warehouse tables – Sometimes values from different attributes in transactional databases are combined into a single attribute in the data warehouse (e.g., employee name) Data Loading • After all of the extracting, cleaning, and transforming, the data is ready to be loaded into the data warehouse • Data will be loaded into a “loading” or working area in the database – Some of the previous steps may have been done in the database – Data may have to go through a number of stages dividing up the data and merging with other data – When the above has been done the Star Schemas are populated with the new, time specific data Data Loading (cont…) • A schedule for regularly updating the data warehouse must be put in place – Frequency of updates is important – Time taken to get to this point is important Data Warehouse Queries • Types of queries that a data warehouse is expected to answer ranges from the relatively simple to the highly complex and is dependent on the type of end-user access tools used • End-user access tools include: – – – – Reporting, query, and application development tools Executive information systems (EIS) OLAP tools Data mining tools Steps in Building DW • Users specify information needs • Analysts and users create a logical and physical design • Sources of data is scrubbed, extracted and transformed • Data is transferred and loaded into the warehouse periodically • Users are given the access to warehouse • The warehouse is maintained in terms of changing requirements Data Warehouse Schema/Dimensional Modeling • The entity-relationship data model (ERD) is commonly used in the design of relational databases, where a database schema consists of a set of entities and the relationships between them. • Such a data model is appropriate for on-line transaction processing (OLTP) Data Warehouse Schema/Dimensional Modeling • A data warehouse, however, requires a concise, subject-oriented schema that facilitates on-line data analysis (OLAP). • The most popular data model for a data warehouse is a multidimensional model. Dimensional Modeling • Dimensional modeling/Multidimensional Modeling is a new method of designing database for analytical purpose. – In dimensional model design; simplicity and understandability of data is crucial to help business users (e.g CEOs and other top leaders) to analyse the business performance and make right decision. Dimensional Modeling • Data warehouses and OLAP tools are based on a dimensional data model. • This model views data in the form of a data cube. • A data cube allows data to be modeled and viewed in multiple dimensions. It is defined by dimensions and facts. Dimensional Modeling • For example, AllElectronics may create a sales data warehouse in order to keep records of the store’s sales with respect to the dimensions time, item, branch, and location. • These dimensions allow the store to keep track of things like monthly sales of items and the branches and locations Dimensional Modeling • A multidimensional data model is typically organized around a central theme, like sales, for instance. • This theme is represented by a fact table. Facts are numerical measures which look like quantities by which we want to analyze relationships between dimensions. Dimensional Modeling • Examples of facts for a sales data warehouse include shillings sold (sales amount in shillings), units sold (number of units sold), and amount budgeted. • The fact table contains the names of the facts, or measures, as well as keys to each of the related dimension tables. Dimensional Modeling • Data cube can be of 2 – D or 3 – D geometric structures. • For example in the company AllElectronics sales data for items sold per quarter in the city of Arusha. The data is in 2 – D representation. • The sales for Arusha are shown with respect to the time dimension (organized in quarters) and the item dimension (organized according to the types of items sold). Dimensional Modeling • The fact or measure displayed is shillings sold (in thousands). • You can view the sales data with the third dimension. For instance, you can view data according to time and item as well as location for the cities Dar, Tanga, Mwanza, Kagera and Lindi. Dimensional Modeling • These 3 – D data are represented as a series of 2-D tables or in the form of a 3-D data cube. Data Warehouse Schema/Dimensional Modeling • Dimensional model can exist in the following forms – a star schema – a snowflake schema – a fact constellation schema. • The major focus will be on the star schema which is commonly used in the design of many data warehouse. Star Schema • This is the most common modeling paradigm for designing data warehouse. • In this model a data warehouse consists of: – a large central table (fact table) containing the bulk of the data, with no redundancy – a set of smaller attendant tables (dimension tables), one for each dimension. • The diagram below show an example of star schema Star Schema • Star schema of a data warehouse for sales. Star Schema • A star schema for AllElectronics sales is shown in Figure in the above slide. Sales are considered along four dimensions namely, time, item, branch, and location. • The schema contains a central fact table for sales that contains keys to each of the four dimensions, along with two measures: dollars sold and units sold. • To minimize the size of the fact table, dimension identifiers (such as time key and item key) are system-generated identifiers. Star Schema • Notice that in the star schema, each dimension is represented by only one table, and each table contains a set of attributes. • For example, the location dimension table contains the attribute set {location key, street, city, province or state, country} Note on Dimensional Modeling • In doing dimensional modeling for designing DW there are generally four steps:– Select the business process to model. – Select the grain for business process – Choose the dimension that apply to each fact table. – Identify the numeric facts/measurements that will populate each fact table row. Typical Data Warehouse Queries • Examples include: – What was total IFM revenue in 3rd quarter of 2006? – What was total revenue for property sales for each type of property in Tanzania in 2006? – What are the three most popular areas in each city for the renting of property in 2003 and how does this compare with the figures for the previous two years? – What would be effect on property sales in the different regions of Europe if legal costs went up by 3.5% and Government taxes went down by 1.5% for properties over €250,000? – What is monthly revenue for property sales at each branch office, compared with rolling 12-monthly prior figures? Benefits Of Data Warehousing • Gives the data you want, in a suitable format • Removes inconsistency of reporting • Gives one consistent picture of the data. i.e. It provides single manageable structure for decision support data. • Potential high returns on investment Benefits Of Data Warehousing • Enable users to run complex queries on data that traverses a number of business areas. • Competitive advantage • Increased productivity of corporate decision-makers