Chapter 2: Data Warehousing Architecture

Q. What is architecture?
• Architecture is the combination of the science and the art of designing and constructing physical structures. In an information system, the architecture supports better communication and planning, increases flexibility, and facilitates learning. The architecture of the DW, and the blueprint that will drive its construction, are critical to the success or failure of the program and its projects.

Q. What is the value of architecture?
In an information system, the architecture adds value to the system in the same way a blueprint does for a construction project.
• Communication: The architecture plan is a good communication tool at many levels. It can function as a communication tool within the team and with other information system teams, giving them a sense of where they fit in the process and what they need to accomplish.
• Planning: The architecture brings all the details of the project together in one place, shows how they fit together, and provides a cross-check for the project plan. The architecture also uncovers technical requirements and dependencies that do not come out as part of the planning process.
• Flexibility and Maintenance: Creating an architecture is really about anticipating as many issues as possible and building a system that can handle those issues as a matter of course, rather than letting them become problems. This makes the data warehouse more flexible and easier to maintain.
• Learning: The architecture plays an important role as documentation for the system. It can help new members of the team get up to speed more quickly on the components, contents, and connections. Building a data warehouse follows a set of standardized procedures; it cannot depend on personal beliefs and myths.
• Productivity and Reuse: Because the warehouse processes and database contents can be understood more quickly, it becomes easier for a developer to reuse existing processes than to build from scratch. Building a new data source is easier if you can use the generic load utilities and work from existing examples.

Q. Explain the global warehouse architecture:
A global data warehouse is one that will support all, or a large part, of the corporation, with a high degree of data access and usage across departments or lines of business. That is, it is designed and constructed based on the needs of the enterprise as a whole. It can be considered a common repository for decision support data that is available across the entire organization, or a large subset thereof.

The term global is used here to reflect the scope of data access and usage, not the physical structure. The global data warehouse can be physically centralized or physically distributed throughout the organization. A physically centralized global warehouse is used by the entire organization and resides in a single location. A distributed global warehouse is also used by the entire organization, but it distributes the data across multiple physical locations within the organization and is managed by the IS department.

The figure shows the two ways that a global warehouse can be implemented. In the top part of the figure, the data warehouse is distributed across three physical locations. In the bottom part of the figure, the data warehouse resides in a single, centralized location.

Independent data mart:
Independent data mart architecture implies stand-alone data marts that are controlled by a particular workgroup, department, or line of business and are built solely to meet their needs. There may, in fact, not even be any connectivity with data marts in other workgroups, departments, or lines of business.
Data for these data marts may be generated internally. The top part of the figure depicts the independent data mart structure. Although the figure depicts the data coming from operational or external data sources, it could also come from a global data warehouse if one exists.

Interconnected data mart:
Interconnected data mart architecture is basically a distributed implementation. Although separate data marts are implemented in a particular workgroup, department, or line of business, they can be integrated, or interconnected, to provide a more enterprise-wide or corporate-wide view of the data. In fact, at the highest level of integration, they can become the global data warehouse. End users in one department can therefore access and use the data on a data mart in another department. This architecture is depicted in the bottom part of the figure. Although the figure depicts the data coming from operational or external data sources, it could also come from a global data warehouse if one exists.

Comprehensive data warehousing architecture:
[Figure: comprehensive data warehousing architecture. Recoverable labels: Data Sources (Transaction Data, Other Internal Data, External Data, Web Data); ETL Software (Extract, Clean/Scrub, Transform, Load); Staging Area; Operational Data Stores; Data Warehouse; Meta Data; Data Marts (Finance, Marketing, Sales); Data Analysis Tools and Applications (Queries, Reporting, DSS/EIS, Data Mining); Users (Analysts, Managers, Executives, Operational Personnel, Customers/Suppliers). Sample products named: IBM, IMS, VSAM, Oracle, Sybase, Informix, SAP, SAS, Harte-Hanks, Ascential, Informatica, Firstlogic, Sagent, Teradata, Essbase, Cognos, MicroStrategy, Microsoft, Siebel, Business Objects, Web Browser.]

This figure provides an example of a comprehensive data warehousing architecture. It illustrates the large number of possible source systems, ETL processes, sample products, the data warehouse, dependent data marts, meta data, data access tools and applications, and various kinds of users.
Inmon calls a comprehensive architecture like this one a Corporate Information Factory.

Typical Components of a Data Warehouse Architecture
• Operational data sources
• Operational data store (ODS)
• Load Manager
• Warehouse Manager
• Query Manager
• Detailed Data
• Lightly & Highly Summarized Data
• Archive & Backup Data
• Meta Data
• End-User Access Tools

Operational data
• Without source systems, there would be no data
• The data sources for the data warehouse are supplied as follows:
– Operational data held in network databases
– Departmental data held in file systems
– Private data held on workstations and private servers
– External systems such as the Internet, commercially available databases, or databases associated with an organization's suppliers or customers

Operational data store (ODS)
• Is a repository of current and integrated operational data used for analysis. It is often structured and supplied with data in the same way as the data warehouse, but may in fact simply act as a staging area for data to be moved into the warehouse
– ODS objectives: to integrate information from day-to-day systems and allow operational lookup, relieving day-to-day systems of reporting and current-data analysis demands
– An ODS can be a helpful step towards building a data warehouse because it can supply data that has already been extracted from the source systems and cleaned

Load Manager
• Called the frontend component
• Performs all the operations associated with the extraction and loading of data into the warehouse
• These operations include simple transformations of the data to prepare it for entry into the warehouse
• The data is extracted directly from the operational systems or, more commonly, from the operational data store, and then loaded into the data warehouse
• Its size and complexity vary between data warehouses; it may be constructed using a combination of vendor data loading tools and custom-built programs
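The load manager's operations described above (extract, simple transformation, load) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the source rows, the `orders_detail` table, and the function names are hypothetical, and an in-memory SQLite database stands in for the warehouse. A real load manager would typically combine vendor loading tools with custom programs.

```python
import sqlite3
from datetime import datetime, timezone

# Hypothetical rows as they might arrive from an operational system or ODS.
SOURCE_ROWS = [
    {"order_id": 1, "customer": "  alice ", "amount": "120.50"},
    {"order_id": 2, "customer": "BOB", "amount": "80"},
]


def extract(rows):
    """Extract step: yield source rows one at a time."""
    yield from rows


def transform(row, loaded_at):
    """Simple transformation: tidy names, cast amounts, stamp the load time."""
    return (
        row["order_id"],
        row["customer"].strip().title(),
        float(row["amount"]),
        loaded_at,
    )


def load(conn, records):
    """Load step: write the prepared records into a warehouse table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS orders_detail ("
        "order_id INTEGER PRIMARY KEY, customer TEXT, "
        "amount REAL, loaded_at TEXT)"
    )
    conn.executemany("INSERT INTO orders_detail VALUES (?, ?, ?, ?)", records)
    conn.commit()


def run_load(conn, rows):
    """Run extract, transform, and load in order; return the row count."""
    loaded_at = datetime.now(timezone.utc).isoformat()
    records = [transform(row, loaded_at) for row in extract(rows)]
    load(conn, records)
    return len(records)


# An in-memory SQLite database stands in for the warehouse (assumption).
conn = sqlite3.connect(":memory:")
count = run_load(conn, SOURCE_ROWS)
```

Keeping the three steps as separate functions mirrors the way the load manager is described: each step can be replaced by a vendor tool or a custom program without touching the others.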
Warehouse Manager
• Performs all the operations associated with the management of the data in the warehouse, as follows:
– Analysis of data to ensure consistency
– Transformation and merging of source data from temporary storage into the data warehouse tables
– Creation of indexes and views
– Backing up and archiving data
• Constructed using vendor data management tools and custom-built programs
• Generates query profiles to determine which indexes and aggregations are appropriate

Query Manager
• Called the backend component
• Performs all the operations associated with the management of user queries
– Directing queries to the appropriate tables and scheduling the execution of queries
• Constructed using vendor end-user access tools, data warehouse monitoring tools, database facilities, and custom-built programs
• Query manager complexity depends on the end-user access tools and the database

Detailed Data
• Stores all the detailed data in the database schema
• On a regular basis, detailed data is added to the warehouse to supplement the aggregated data

Lightly and Highly Summarized Data
• Stores all the pre-defined lightly and highly aggregated data generated by the warehouse manager
• This data is transient, as it is subject to change on an ongoing basis in order to respond to changing query profiles
• The purpose of summary information is to speed up the performance of queries
• Producing the summaries is costly; on the other hand, it removes the requirement to continually perform summary operations (such as sort or group by) when answering user queries
• The summarized data is updated continuously as new data is loaded into the warehouse

Archive/Backup Data
• Stores detailed and summarized data for the purposes of archiving and backup
• It may be necessary to back up online summary data if this data is kept beyond the retention period for detailed data
• The data is transferred to storage archives such as magnetic tape or optical disk

Meta Data
• This area of the warehouse stores all the metadata definitions
used by all the processes in the warehouse
• Metadata is used for a variety of purposes:
– Extraction and loading processes: metadata is used to map data sources to a common view of information within the warehouse
– Warehouse management process: metadata is used to automate the production of summary tables
– Query management process: metadata is used to direct a query to the most appropriate data source
• End-user access tools use metadata to understand how to build a query

End-user Access Tools
• Users interact with the warehouse using end-user access tools
• These tools can be categorized into five main groups:
– Data reporting and query tools (for example, Query by Example in the MS Access DBMS)
– Application development tools (applications used to access major database systems such as Oracle and Sybase)
– Executive information system (EIS) tools (for sales, marketing, and finance)
– Online analytical processing (OLAP) tools (allow users to analyze the data using complex, multidimensional views drawn from multiple databases)
– Data mining tools (allow the discovery of new patterns and trends by mining a large amount of data using statistical and mathematical tools)

Data Warehousing: Data flows

Inflow
• The processes associated with the extraction, cleansing, and loading of the data from the source systems into the data warehouse
• Cleansing includes removing inconsistencies, adding missing fields, and cross-checking for data integrity
• Transformation includes adding date/time stamp fields, summarizing detailed data, and deriving new fields to store calculated data
• The relevant data is extracted from multiple, heterogeneous, and external sources (commercial tools are used)
• The data is then mapped and loaded into the warehouse

Upflow
• The processes associated with adding value to the data in the warehouse through summarizing, packaging, and distribution of the data
• Summarizing the data works by selecting, projecting, joining, and grouping relational data into views that are more convenient and useful to the end users.
Summarizing also goes beyond simple relational operations to involve sophisticated statistical analysis, including identifying trends, clustering, and sampling the data
• Packaging the data involves converting the detailed or summarized information into more useful formats, such as spreadsheets, text documents, charts, other graphical presentations, private databases, and animation
• Distributing the data to appropriate groups increases its availability and accessibility

Downflow
• The processes associated with archiving and backing up data in the warehouse
• The effectiveness and performance of the warehouse are maintained by transferring older data of limited value to storage archives such as magnetic tape, optical disk, or other digital storage devices
• If the databases in a warehouse are very big, partitioning is a useful design option: it fragments a table storing an enormous number of records into smaller tables, thus preserving data warehouse performance
• The downflow of data includes the processes that ensure the current state of the data warehouse can be rebuilt following data loss or software/hardware failures. Archived data should be stored in a way that allows the data to be re-established in the warehouse when required

Outflow
• The processes associated with making the data available to the end users
• This involves two activities: data accessing and delivering
• Data accessing is concerned with satisfying the end users' requests for the data they need. The main problem here is creating an environment in which the users can effectively use the query tools to access the most appropriate data source
• The delivering activity makes it possible to deliver information to the users' systems and workstations. This activity is referred to as a type of "publish-and-subscribe" process. The data warehouse publishes several "business objects" that are revised periodically by monitoring usage patterns.
Users subscribe to the set of business objects that best meets their needs.

Implementation Choices
• Top Down Implementation
• Bottom Up Implementation
• A Combined Approach

Top Down Implementation
A top down implementation requires more planning and design work to be completed at the beginning of the project. This brings with it the need to involve people from each of the workgroups, departments, or lines of business that will be participating in the data warehouse implementation. Decisions concerning data sources to be used, security, data structure, data quality, data standards, and an overall data model will typically need to be completed before actual implementation begins. The top down implementation can also imply more of a need for an enterprise-wide or corporate-wide data warehouse with a higher degree of access to the data across workgroups, departments, or lines of business.

This approach is depicted in the figure. As shown, with this approach it is more typical to structure a global data warehouse. If data marts are included in the configuration, they are typically built afterward, and they are more typically populated from the global data warehouse than directly from the operational or external data sources.

A top down implementation can result in more consistent data definitions and the enforcement of business rules across the organization from the beginning. However, the cost of the initial planning and design can be significant. It is a time-consuming process and can delay actual implementation, benefits, and return on investment. For example, it is difficult and time-consuming to determine, and get agreement on, the data definitions and business rules among all the different workgroups, departments, and lines of business participating.

The top down implementation approach can work well when there is a good centralized IS organization that is responsible for all hardware and other computer resources.
A top down implementation will also be difficult in organizations where the workgroup, department, or line of business has its own IS resources. Such groups are typically unwilling to wait for a more global infrastructure to be put in place.

Bottom Up Implementation
A bottom up implementation involves the planning and designing of data marts without waiting for a more global infrastructure to be put in place. This approach is more widely accepted today than the top down approach because immediate results from the data marts can be realized and used as justification for expanding to a more global implementation. In contrast to the top down approach, data marts can be built before, or in parallel with, a global data warehouse. And as the figure shows, data marts can be populated either from a global data warehouse or directly from the operational or external data sources.

The bottom up implementation approach has become the choice of many organizations, especially business management, because of the faster payback. It enables faster results because data marts have a less complex design than a global data warehouse. In addition, the initial implementation is usually less expensive in terms of hardware and other resources than deploying the global data warehouse. With careful planning, monitoring, and design guidelines, the data redundancy among the data marts can be minimized. Multiple data marts may, however, bring with them an increased load on operational systems because more data extract operations are required.

A Combined Approach
As we have seen, there are both positive and negative considerations when implementing with the top down and the bottom up approaches. In many cases the best approach may be a combination of the two. As a first step, simply identify the lines of business that will be participating. A high-level view of the business processes and data areas of interest to them will provide the elements for a plan for implementation of the data marts.
As data marts are implemented, develop a plan for how to handle the data elements that are needed by multiple data marts. This could be the start of a more global data warehouse structure or simply a common data store accessible by all the data marts. In some cases it may be appropriate to duplicate the data across multiple data marts. This is a trade-off between storage space, ease of access, and the impact of data redundancy, along with the requirement to keep the data in the multiple data marts at the same level of consistency.

There are many issues to be resolved in any data warehousing implementation. Using the combined approach can enable resolution of these issues as they are encountered, and in the smaller scope of a data mart rather than a global data warehouse. Careful monitoring of the implementation processes and management of the issues could result in gaining the benefits of both implementation techniques.
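As a rough illustration of the planning step in the combined approach, the data elements needed by more than one data mart can be found with a simple inventory. This is only a sketch under stated assumptions: the mart names, the element names, and the `shared_elements` helper are all hypothetical and do not come from any particular tool.

```python
from collections import defaultdict

# Hypothetical data marts and the data elements each one needs.
MART_ELEMENTS = {
    "finance": {"customer_id", "invoice_amount", "fiscal_period"},
    "marketing": {"customer_id", "campaign_id", "region"},
    "sales": {"customer_id", "region", "order_total"},
}


def shared_elements(mart_elements):
    """Map each element used by two or more marts to the marts that use it."""
    users = defaultdict(set)
    for mart, elements in mart_elements.items():
        for element in elements:
            users[element].add(mart)
    # Keep only elements that more than one mart depends on.
    return {
        element: sorted(marts)
        for element, marts in users.items()
        if len(marts) > 1
    }


shared = shared_elements(MART_ELEMENTS)
```

Elements that appear in the result are the candidates for a common data store or an emerging global warehouse structure; elements used by a single mart can stay local to it.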