Unit 5 A – Data warehouse What is a data warehouse? A simple answer to this question is that a data warehouse is managed data situated after and outside operational systems. This answer is based on the concept that fundamentally distinguishes an organization's operation data (data that used to run the organization) and information data (data used to manage the organization.) For example, a bank's operation data are the most up-to-date financial data for each customer, which are stored in different banking hosts. A given customer's saving account data will be stored in a saving account database, while his/her stock trading data will be stored in a stock exchange database. Operation data can be dispersed amongst different places in an organization, and they are usually overwritten by new data as transactions go on. On the other hand, the same bank may have a data warehouse, i.e. a single data repository that stores all the financial data (all the transaction records) of all its customers over a period of time (say several years). This huge amount of accumulated archival data will serve as the bank's information data on which business intelligence analyses can be done. From an application point of view, a data warehouse can be defined as any centralized data repository that can be queried for business benefit. Warehousing makes it possible to: extract archived operational data; overcome inconsistencies between different legacy data formats; integrate data throughout an enterprise, regardless of location, format or communication requirements; and incorporate additional or expert information. Characteristics of data warehouses As defined by Bill Inmon (1992)(an information system guru and a well-known data warehousing expert), one can easily recognize the following characteristics of a data warehouse: Subject-oriented -- data in the warehouse are organized by subject instead of application, e.g. an insurance company would organize their data by customer, premium and claim, instead of by different products (auto, life, etc.). A data warehouse contains only the information necessary for decision support processing. Integrated -- encoding of data is often inconsistent, e.g. gender might be coded as 'm' and 'f' or 0 and 1, but when data are moved from the operational environment into the data warehouse they assume a consistent coding convention. Time-variant -- a data warehouse contains a place for storing data that are 5 to 10 years old, or older. These data are used for comparisons, trends and forecasting, and are not updated. Non-volatile -- data are not updated or changed in any way once they enter a data warehouse. They are only loaded and accessed. These unique characteristics usually pose many challenges to the warehouse designer or architect when she considers how to implement a data warehouse for an organization. These challenges include the considerations of the following criteria for a data warehouse: Load Performance -- as data warehouses require incremental loading of new data on a periodic basis, they must not artificially constrain the volume of data. Data Quality Management -- a data warehouse must ensure local consistency, global consistency and referential integrity despite 'dirty' sources and massive database size Query Performance -- a data warehouse must not be slowed or inhibited by the performance of its RDBMS. Terabyte Scalability -- data warehouse sizes are growing at astonishing rates so their RDBMSs must not have any architectural limitations. They must support modular and parallel management. Mass User Scalability -- access to warehouse data must not be limited to an elite few. A data warehouse has to support hundreds, even thousands, of concurrent users while maintaining acceptable query performance. Networked Data Warehouses -- data warehouses rarely exist in isolation -- users must be able to look at and work with multiple warehouses from a single client workstation. Warehouse Administration -- since a data warehouse is large-scale and time-cyclic, it must offer administrative ease and flexibility. Integration of dimensional analysis by RDBMS -- dimensional support must be inherent in the warehouse RDBMS to provide the highest performance for relational OLAP tools. Advanced Query Functionality -- end users require advanced analytic calculations, sequential and comparative analysis, and consistent access to detailed and summarized data Data warehousing architecture Data warehouses, as we have discussed, are different from corporate databases in the sense that they serve an organization's information needs instead of its operational needs. Depending on various factors such as the availability of resources and business needs, there are a number of different ways to structure data warehouses. In this unit, we are not going to look into the technical details of data warehouse architecture -- i.e. the hardware, software and physical layout. Instead, we will look into the architecture of data warehouses from the organization perspective as well as a business-needs perspective. Organization architecture and implementation choices There are a number of architecture choices for data warehouses as seen from the organization perspective. These architecture choices will be based on managerial factors such as: The organization's existing physical and management structure -- Is the organization highly centralized in one location or highly dispersed with many branch offices? Is the management structure highly centralized or is there relatively independent departmental decision making? Technical capability and resources availability -- Does the IS department control all the technical expertise and resources or are there departmental computing capabilities and resources? Commitment and style of management -- Is senior management highly committed to data warehouse development or is it only driven by needs from frontline users? Depending on these various organizational factors, there are three broad types of data warehouse architecture: Global architecture -- this architecture is built upon a fully integrated data warehouse across all the departments and lines of business of an enterprise. The architecture is global in the sense of data access instead of the physical organization of the data warehouse. A globalized data warehouse can be either physically centralized or distributed. However, a global architecture can only be effectively managed and maintained by a centralized IS department. Independent data mart architecture -- the term 'data mart' refers to a data warehouse on a smaller scale, usually small, stand-alone data warehouses developed by workgroups, departments or lines of business within an organization. Implementation and maintenance of these data marts are usually drawn from a department's own resources and expertise. Interconnected data mart architecture -- several independent data marts within an organization can be integrated or interconnected to provide sharing of access to data marts across departments or workgroups. In addition to these architectural choices, there are also several different approaches for the implementation of data warehouses that also depend on various organizational factors. These choices are: Top-down approach -- this implementation approach usually leads to a global architecture, and usually features more senior management involvement and more resources from the organization. More consistent data definition and business rules should be the result. However, the cost for implementation will be high and more time will be required. Bottom-up approach -- this approach is more flexible in the sense that it is usually driven by business needs in individual departments and lines of business. This approach does not necessarily end up in data mart architectures. The usual scenario is that a global data warehouse be built incrementally with initial data mart implementations expanding and being joined up. Combined approach -- this approach is in a sense the best approach since, with good project management control, this approach can be a 'balancing act' that combines the advantages of both the top-down and bottom-up approaches. Data architecture for data warehouses In the previous subsection, we talked about the architecture of data warehouses from the organization's perspective. You learned how a data warehouse could be organized in terms of the ownership, management and control of data and usage. This architectural design is also directly related to how a data warehouse is implemented. In this subsection, we will look at the architecture of a data warehouse from another perspective; that is, how the data contained in a data warehouse should be organized to meet different business needs. To begin our consideration of architecture from this business-needs perspective, we first take a look at the different types of data that can be stored in a data warehouse. There are three basic types of data, namely: real-time data derived data reconciled data As you've seen, there are different methodologies for data architecture design in data warehouses. The Enterprise Data Model (EDM) is one of the methodologies that enables all the data elements to be defined consistently throughout the whole organization. Usually an EMD exercise is divided into phases. Each phase will include different amounts of information. A full-scale EDM exercise can be very resource intensive and time-consuming and may not be practical for some organizations. Hence, a trimmed-down version of EDM can be adopted in which some of the core components that are required for data warehouse modelling can be extracted and grouped. The following reading presents the details of EMD and a simplified version of it. The benefits and drawbacks of EMD are also presented. In the physical design phase of data modelling, a very important aspect of the design is the 'granularity' of the data. This concept is concerned with the degree of summarization of the data elements. The level of data granularity will directly affect the ability to answer queries in future data analysis and data mining. The following reading gives you the details of how to proceed with a choice of data granularity. Data partitioning is very important in data architecture design. It affects the efficiency and flexibility of accessing data, and also affects the maintenance aspects of the data warehouse such as the scalability, portability and ease of sharing and archiving data. There are two different perspectives in data partitioning: physical partitioning and logical partitioning. Physical partitioning refers to how data are structured and grouped according to the physical design of a data warehouse. Logical partitioning refers to how data are structured and grouped based on the characteristics of the data, such as customer, product, time period, accounts, etc. Logical data partitioning will facilitate access of data according to business needs. However, physical data partitioning can overlap with logical data partitioning.