Understanding Data Warehouse Management: A Primer

John Deitz
Viasoft, Inc.
Status: 02 September 1998

Preface

In this primer, I describe a business management practice called "data warehousing" and many of the management issues that contribute to making data warehousing successful. This is a fairly non-technical coverage of the subject, which should be useful to Sales, Services, Development and Practices personnel who work with Viasoft's solutions.

In one respect, the advent and evolution of data warehousing marks a return to basic values for running a successful business. IT organizations have long been caught up in the technology and application systems needed to automate the business. Although these applications are called "information systems", it is closer to the truth that they are merely "business transaction" systems – generating little real "information" that could be useful for analyzing the business. Savvy business managers, who rise above the whir and confusion of the IT technology engine, recognize the need to accurately define the key concepts and processes at play in their business, and to begin measuring business operations against them. It takes little effort to see that the reports produced, and the raw data processed, while running the business offer virtually no insights into:

- Customer segmentation, profiling and retention
- Market basket or cross-product linkage analysis
- Churn and trend analysis
- Fraud and abuse analysis
- Forecasting and risk analysis

These are not the "raw" elements of data found in highly normalized transaction system database schemas; rather, they are aggregations and syntheses of data into "information" … and ideally into useful knowledge. Data warehousing is the process of creating useful information that can be used in measuring and managing the business.

Table of Contents

1. What's a Data Warehouse or Data Mart?
   1.1 How are Data Warehouses Produced?
   1.2 What is the Purpose of the Data Warehouse?
       Measurement
       Discovery
       Trend Analysis
2. How do Data Warehouse Databases Differ from Production Ones?
   2.1 Decision Support versus Transactional Systems
   2.2 Data Quality and Understanding
   2.3 General Data Usage Patterns
       Time Series Data
       Summarized or Aggregated Data
   2.4 Separate DSS and OLTP Systems
   2.5 DSS-specific Usage Patterns
   2.6 Operational Data Stores (ODS)
3. Data Warehouse Architectures
   3.1 Marts and Warehouses
       What do you mean by "Architecture"?
   3.2 Physical Warehouse Architecture
       Centralized
       Independent Data Mart
       Dependent Data Marts, with Distribution Data Warehouse
       Operational Data Stores - with Marts and Data Warehouse
       Virtual Data Warehouse
       "Hub-and-Spoke" Architecture
   3.3 Warehouse Framework Architecture
       Production or External Database Layer
       Data Access Layer
       Transformation Layer
       Data Staging Layer
       Data Warehouse Layer
       Information Access Layer
       Application Messaging Layer
       Meta Data Directory Layer
       Process Management Layer
4. What is Data Mart/Warehouse Management?
   4.1 Managing Data Administration: Defining "Information"
       Terms and Semantics Directory (Glossary)
       Value Domains and Abbreviations
       Naming Conventions
       Data Type and Format Standards
       Automated Data Quality Measurement
       Information Owners
       Managing IT-related Meta Data
   4.2 Managing Business Segment Analysis & Mart Design
       Exploration
       Schema Design
       Locating Data Sources
       Prototyping
   4.3 Managing Database Administration
       Managing Production Data Store Schema
       Managing Data Warehouse, Mart and ODS Schema
       Applying Data Types, Naming Conventions and Format Descriptors
   4.4 Managing Data Valid Value Domains
   4.5 Managing Data Extraction
       Identifying and Qualifying Candidate Stores to Draw From
       Managing Queries and Data Extraction
   4.6 Managing Data Transformation
       Data Conjunction
       Data Aggregation
       Data Cleansing
       Data Dimensioning
   4.7 Managing Data Traceability
   4.8 Managing Schedules and Dependencies
   4.9 Managing Data Quality and Usage
       Use-based Data Quality Audits
       Data Quality Re-Design
       Data Quality Training
       Data Quality Continuous Improvement
       Data Quality S.W.A.T. Tactics
       Data Cleansing Management
   4.10 Managing Data Warehouse Architectures
   4.11 Managing Operations, Facilities and End User Tools
   4.12 Managing Basic Database Operations
   4.13 Managing Rules, Policies and Notifications
   4.14 Managing Meta Data Definition, Population and Currency
   4.15 Managing History, Accountability and Audits
   4.16 Managing Internal Access, External Access and Security
   4.17 Managing Systems Integration Aspects
   4.18 Managing Usage, Growth, Costs, Charge-backs and ROI
       Usage
       Growth
       Cost Management
       Return on Investment
   4.19 Managing Change – the Evolving Warehouse
5. Future of Data Warehousing
       Objectives
       Business Factors
       Technology Factors
       Knowledge Engineering
       More Effective Infrastructure & Automation
       More Effective Communication
6. Summary
       Conclusion

1. What's a Data Warehouse or Data Mart?

A "data warehouse" is a database, or group of databases, where business people can access business-related data; a data warehouse may aggregate business information from several organizations, and/or may serve as a central repository for standards, definitions, value domains, business models, and so on. Bill Inmon, recognized as the father of the data warehouse concept, defines a data warehouse as "a subject-oriented, integrated, time variant, non-volatile collection of data in support of management's decision-making process". Richard Hackathorn, another data warehouse pioneer, says the "goal of data warehouse is to provide a single image of business reality". Both of these definitions have merit, as this primer will illustrate.

A "data mart" is a (usually smaller) specialized data warehouse that covers one subject area, such as finance or marketing, and may serve only one distinct group of users. Since a data mart focuses on meeting specific user business needs, it can capture the commitment and excitement of users. Also, the limited scope makes it easier for IT to grasp the business requirements and work with the users effectively. Most successful data warehousing efforts begin through the successes of one or more marts.

A data mart or warehouse is a collection of "clean data". This data usually originates, ultimately, in the production transaction systems that automate the business. But it is not the same "raw" data processed by transaction systems, and more specifically it is not in the same transaction-optimal form; rather it is very selective data, organized into specific "dimensions", and optimized for access by decision support systems (DSS), sometimes also called executive information systems (EIS).

These are the simple definitions for data warehouse and mart. The remainder of this paper will illustrate that the process of data warehousing is far from simple. Recent history provides many examples of unsuccessful warehouse implementations costing millions of dollars. A number of things – warehouse/mart architecture, data understanding, data access, data movement, merging of data from heterogeneous sources, data cleansing and transformation, mart database design, and business user access/query tools – all have to come together successfully for a data warehouse to succeed. In short, data warehousing cannot be done successfully without significant management measures in place.

For the remainder of this paper, I will use the term "data warehouse" to refer equally to marts or warehouses – except in discussions of architecture where I draw distinctions between the two.

1.1 How are Data Warehouses Produced?

Data warehouses are usually populated with data originating from the production business systems – in particular, from the databases underlying those systems. That means existing data must be extracted from its source, cleansed, transformed (modified, aggregated and dimensioned), staged, distributed to remote sites (perhaps) and finally loaded into the data stores comprising the data warehouse or marts.
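To make these steps concrete, here is a minimal, illustrative sketch in Python. It is not drawn from any particular product or project: the record layout, the valid value domain and the aggregation are all invented for the example, and staging and distribution are omitted for brevity.

    from collections import defaultdict

    def extract(source_rows):
        # In practice, this step reads from production databases or flat files.
        return list(source_rows)

    def cleanse(rows):
        # Map stray codes onto the valid value domain; drop what cannot be fixed.
        valid_regions = {"EAST", "WEST"}
        fixes = {"E": "EAST", "W": "WEST", "east": "EAST", "west": "WEST"}
        cleaned = []
        for region, amount in rows:
            region = fixes.get(region, region)
            if region in valid_regions:
                cleaned.append((region, amount))
        return cleaned

    def transform(rows):
        # Aggregate detailed records into the summarized form the mart expects.
        totals = defaultdict(float)
        for region, amount in rows:
            totals[region] += amount
        return dict(totals)

    def load(totals, mart):
        mart.update(totals)

    mart = {}
    raw = [("E", 120.0), ("WEST", 80.0), ("??", 5.0), ("east", 40.0)]
    load(transform(cleanse(extract(raw))), mart)
    print(mart)   # {'EAST': 160.0, 'WEST': 80.0}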
The following figure illustrates the overall flow:

Figure 1a: Steps in Building a Data Warehouse

Actually, this figure grossly simplifies the effort that goes into designing, creating and managing a data warehouse. Depending on the nature and quality of existing production systems, and the needs of business users querying the data warehouse, each step may require significant research, planning and coordination.

1.2 What is the Purpose of the Data Warehouse?

Basically, the purpose of a data warehouse is to put accurate information about the business into the hands of competent analysts or decision-makers. This information must present very crisp and unambiguous facts about business subjects (such as customer, product, brand, contract, representative, region, and so on) and provide useful perspectives on the current operations of the business. Here I look at just a couple roles of the warehouse.

Measurement

The analysts use the data warehouse to measure the performance of the business; these measurements are often computed from millions of detailed records of data, or sometimes from aggregations (summaries) of the data. When business performance is weak or unexpected, the analysts "drill down" to deeper levels of detail to find out why; this can mean looking at the data (i.e. querying) in a new and different way – perhaps on a one-time basis.

Here's an example: a well-designed mart might allow an analyst to measure the effectiveness of the company's investment in inventory. Called GMROI, or Gross Margin Return on Inventory, this measurement is computed as follows:

            (Total Quantity Shipped) * (Value at Latest Selling Price – Value at Cost)
    GMROI = ---------------------------------------------------------------------------
            (Daily Average Quantity on Hand) * (Value at Latest Selling Price)

Obviously, this measurement could not be made without the right, accurate detailed data organized in a very particular pattern, and the ready ability to query the data in a number of interesting ways. Most successful data warehouses are based on a "star schema" database design that minimizes the amount of data to be stored, while optimizing the potential of query performance. (I will discuss this further later on, because not all star schema designs have the same performance potential.)

Discovery

Analysts also use data warehouse data for a relatively new form of analysis called "data mining". Data mining involves deep analysis of potentially huge volumes of data to discover patterns and trends that no one would think to look for. Certain business events or trends can occur due to the convergence of highly unrelated facts; data mining makes it possible to see new facts, and how they are correlated. Data mining is "new" because only recently have the database and query technologies matured enough to support it with acceptable performance.

Trend Analysis

The data in a data warehouse is usually organized into "dimensions". For example, a data mart designed to store information about stores selling products might be organized into the following dimensions: Store, Product, and Time. Time is a very common and important dimension, because it allows the measurements taken today to be compared with the same measurements taken last week, last month or last year. This is what Bill Inmon referred to as "time variant" in the quote I used earlier to define data warehouses. When a number of time-phased measurements are considered in succession, we can do trend analysis, as the small sketch below illustrates.
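This is an illustrative sketch only – the fact rows are reduced to (year, quarter, units_sold), and the figures are invented. It indexes each measurement by its position on the Time dimension, then compares every period with the same period one year earlier:

    facts = [(1996, "Q4", 870), (1997, "Q4", 940), (1998, "Q4", 1010)]

    # Index each measurement by its position on the Time dimension.
    by_period = {(year, quarter): units for year, quarter, units in facts}

    # Compare each period with the same period one year earlier -- the essence
    # of "time variant" data: the same measurement, taken at successive times.
    for (year, quarter), units in sorted(by_period.items()):
        prior = by_period.get((year - 1, quarter))
        if prior is not None:
            change = 100.0 * (units - prior) / prior
            print(f"{quarter} {year}: {units} units ({change:+.1f}% vs. {quarter} {year - 1})")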
One important use of trend analysis is to better understand the past. Trend analysis is very effective in highlighting a declining market or a shift in sales from one brand to another. Perhaps a more powerful use of trend analysis is to project the future; this is done by extending a trend into the future through the application of various statistical models.

It should be easy to see how the notions of Measurement, Discovery and Trend Analysis contribute (in very meaningful ways) to assessing how effectively a business is operating, and where management actions should be focused.

2. How do Data Warehouse Databases Differ from Production Ones?

What's the difference between a data warehouse database and all the other production databases that the business has? Why not just use the production databases as the basis for data warehouse queries? These are good questions – and important ones – as we explore the success criteria for data warehousing initiatives. In this section, I'll discuss the differences and rationales.

2.1 Decision Support versus Transactional Systems

Decision support systems and executive information systems have different usage and performance characteristics from production systems. Production systems such as order entry, general ledger or materials control generally access and update the record of a single business object or event: one order, one account, or one lot. Transactions are generally pre-defined, and require the database to provide very fast access to, and locking for, one record at a time. In fact, the requirement that production databases be "transaction-optimal" heavily influences the schema design of these databases. These database schemas are often highly normalized to maximize the opportunity for concurrent access to database data; that is, they are designed so that the smallest amount of data will be locked (i.e. unavailable to others) at any one time. In this way, transactions can have pinpoint accuracy and be highly efficient.

In contrast, databases supporting DSS must be able to retrieve large sets of aggregate or historical data within a reasonable response time. The data must be well-defined, clean, and organized into specific dimensions that support business analysis.

2.2 Data Quality and Understanding

The data records processed by production systems are usually concatenations of the master records of the key databases with contextual information. IT systems that have evolved over the years have been tuned to cater to the data anomalies found in this data; that is, they "correct" anomalies "on the fly". Such anomalies include missing or inaccurate codes, discrepancies between order header and detail records, and garbage found in fields due to electronic transmission errors. The programmers of these systems are usually so far removed from the data entry points of the system that it is easier (and more convenient) to adjust values during processing than to correct the source of the data.

It is common for data in production stores to get "tainted" – becoming application specific. For example, a primary datastore of customer information may be pre-filtered to contain only "active accounts", while to the casual observer (outside IT) it may appear to encompass all accounts. Also, programmers and early database designers have traditionally been lazy about naming standards, assuming that only a small technical audience would see the names they contrived for data elements.
Hence, an "outsider" cannot trust the meaning of terms such as customer, order, product, and so on – such terms were often used loosely in the past. These subtleties and miscellaneous filters make production stores the wrong, or at least incomplete, sources of information for most decision-making purposes.

2.3 General Data Usage Patterns

As noted above, transaction systems are tuned for item-at-a-time processing, and more importantly, update processing. These systems are called on-line transaction processing (OLTP) systems. The need of such systems to update data, while minimizing the scope of database locks, places significant constraints on how the data is laid out (schema design) and how it is accessed.

By contrast, decision support systems operate against schemas that facilitate querying (perhaps very granular) information in a myriad of ways; this is a read-only mode of access. The tools used for these systems are called on-line analytical processing (OLAP) tools.

A particular pattern of schema design, called the "star schema", has proven very powerful in decision support systems. Star schemas can be used to design dimensional data stores. The term "star" was coined because the schema configuration consists of a core fact table which has relationships (foreign keys, in relational OLAP systems) to a number of dimension tables:

    Time Dimension      Sales Fact        Product Dimension
    --------------      ------------      -----------------
    time_key            product_key       product_key
    day_of_week         time_key          description
    month               store_key         brand
    quarter             dollars_sold      category
    year                units_sold
    holiday_flag        dollars_cost      Store Dimension
                                          ---------------
                                          store_key
                                          store_name
                                          address
                                          floor_plan_type

    Figure 2a: Example of a Star Schema

Another schema pattern, the snowflake schema, is sometimes used for data warehouses. The snowflake approach is an extension of the star schema idea; in contrast to star schemas, which are commonly de-normalized, snowflake schemas are often highly normalized, with dimension tables branching into further tables. This creates the "snowflake" pattern:

    Figure 2b: Pattern of a Snowflake Schema (a central fact table surrounded by dimension tables, which branch into further normalized tables)

Several pioneers of data warehousing have strongly discouraged the use of snowflake schemas for data warehouses of appreciable size. Too many joins are necessary to access the data, which leads to unacceptable performance. Also, the schema is more difficult for users to learn and use. Still, there will be sites where you run across the term, so I mention it here for completeness.

It should be clear to IS professionals that these schemas differ considerably from production database schemas in key respects:

Time Series Data

As mentioned earlier, the Time dimension allows analysts to view how business behaviors change over time; it is very common to find a Time dimension in OLAP star schemas. This dimension is essential for measuring whether the business will reach its business goals, and for comparing today's state of some segment of the business to a past state (last month, last year, etc.) of the identical segment.

Summarized or Aggregated Data

In our example above, note that the central fact table contains elements (dollars_sold, units_sold) that are aggregations of more detailed information (perhaps gathered from production data stores). A more subtle observation is that these "facts", by themselves, have no meaning. The meaning of each fact is entirely dependent on the dimensions attached to it, which provide a valid "context" for the fact. In our example above, the dollars_sold fact is more accurately: dollars_sold of a Product at a Time in a Store.
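To make the star pattern concrete, here is a minimal sketch using SQLite, with the tables of Figure 2a reduced to a few columns. The table contents are invented for the example; the query shows one way of "cutting" the facts by the Product and Time dimensions:

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE time_dim    (time_key INTEGER PRIMARY KEY, quarter TEXT, year INTEGER);
        CREATE TABLE product_dim (product_key INTEGER PRIMARY KEY, brand TEXT, category TEXT);
        CREATE TABLE store_dim   (store_key INTEGER PRIMARY KEY, store_name TEXT);
        CREATE TABLE sales_fact  (product_key INTEGER, time_key INTEGER, store_key INTEGER,
                                  dollars_sold REAL, units_sold INTEGER, dollars_cost REAL);

        INSERT INTO time_dim    VALUES (1, 'Q1', 1998), (2, 'Q2', 1998);
        INSERT INTO product_dim VALUES (10, 'Acme', 'Snacks'), (11, 'Zenith', 'Beverages');
        INSERT INTO store_dim   VALUES (100, 'Downtown');
        INSERT INTO sales_fact  VALUES (10, 1, 100, 5000.0, 400, 3500.0),
                                       (10, 2, 100, 6200.0, 470, 4100.0),
                                       (11, 1, 100, 1800.0, 900, 1500.0);
    """)

    # Aggregate dollars_sold by brand and quarter -- the bare fact is
    # meaningless until the dimension tables supply its context.
    for row in conn.execute("""
        SELECT p.brand, t.quarter, t.year, SUM(f.dollars_sold)
        FROM sales_fact f
        JOIN product_dim p ON p.product_key = f.product_key
        JOIN time_dim    t ON t.time_key    = f.time_key
        GROUP BY p.brand, t.quarter, t.year
    """):
        print(row)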
Lastly, it is worth pointing out that, from a usage point of view, transactions are often rigidly defined and mechanically executed. By contrast, DSS/EIS users need the flexibility to form ad hoc queries which "cut" the facts into any combination of the available dimensions. In his report, "Data Warehouses: An Architectural Perspective", Jack McElreath nicely summarizes the difference in data usage:

    Production Data                          Warehouse Data
    ---------------                          --------------
    Short-lived, rapidly changing            Long-lived, static
    Requires record-level access             Data is aggregated into sets **
    Repetitive standard transactions         Ad hoc queries; some periodic reporting
    Updated in real time                     Updated periodically with mass loads
    Event-driven – process generates data    Data-driven – data governs process

    ** Which is why warehouse data is friendly to relational DBs.

    Figure 2c: Production and Warehouse Data are Very Different

In summary, DSS/EIS users need data that accurately describes the organization (as they visualize it), is accessible and internally consistent, and is organized for access by analytical tools. DSS users are analysts or managers that think about the big picture, long term. Typical queries might list total sales for each of the last five years, items that have been out of stock more than 15% of the time, or customers with the most orders in 1997.

2.4 Separate DSS and OLTP Systems

There are very rare cases in which the databases designed for OLTP are also accessed by DSS systems. In these situations, the production database serves as a virtual data warehouse. To be successful, this requires a fairly technical warehouse user, very clean data in the production store, and excess bandwidth in the processing window of that store. In addition, the analyst must concede that the data is not stable (it is changing constantly), is not time phased, and performance will likely be degraded by normal business processing. Given these limitations, it is easy to see why this strategy is not employed in serious, large-scale data warehousing solutions. However, there are a few OLAP tools designed to enable virtual data warehouses.

In even fewer cases, there is a need for a production system to use DSS-type functions. This is called on-line complex processing (OLCP) – yes, there's an acronym for everything! These cases are rare indeed.

A defining characteristic of successful data warehousing is a clean separation of production and decision support functionality. By separating these two very different processing patterns, the data warehouse architecture enables both production and decision support systems to focus on what they do best, and thereby provide better performance and functionality for each.

2.5 DSS-specific Usage Patterns

Within the decision support system community, users in various roles place different demands on the warehouse. A few of the recognized roles, and their demands on the DSS system, are discussed below:

Information Farmer: The information farmer is a consumer of well-understood data through pre-defined queries. This user knows what to look for, and accesses small amounts of data with predictable patterns.

Information Tourist: The information tourist is a consumer with unpredictable access to data. The tourist is a regular user of the warehouse/mart, so the load placed on the system is known; but s/he accesses larger amounts of data, and potentially different data each time.
Custom queries are used.

Information Explorer: The information explorer is involved with information mining. This user has very erratic access patterns, and can access very large amounts of very detailed data. This can place very unreasonable performance loads on a warehouse or mart.

Mart Prototyper: The mart prototyper experiments with warehouse and mart schemas. Like the information explorer, his/her usage patterns are erratic. Also, several schemas may be implemented and discarded during the development process, placing administrative loads on the DBMS.

It is easy to see that the explorer and prototyper users place unreasonable demands on the typical data warehouse environment; a random relational query that is not supported by well-tuned indices and access plans can overload the DBMS and bring the decision support system to its knees. For this reason, many companies isolate these functions onto entirely different database platforms, where they do not threaten the performance of the mainstream warehouse.

One new technology specifically addresses the explorer and prototyper audience. The technology is called Nucleus, and it is packaged into two new products, Nucleus Exploration Warehouse/Mart and Nucleus Prototype Warehouse/Mart, marketed by Sand Technology, Inc. Nucleus is a hybrid database technology that uses a "tokenization" scheme to store and automatically index all data that is loaded; the virtue of tokenization is that each unique data value is stored only once, thereby providing data compression. Nucleus boasts highly-tuned algorithms to access data, automated server fault recovery, and greatly reduced database administration requirements. Nucleus looks and acts like a relational database, accessed through a standard ODBC read/write interface. But under the covers, the storage technology has the effect of compressing data (up to 60 percent), while supercharging the performance of data access. Bill Inmon, of Pinecone Systems, and other data warehouse experts are already voicing support for the technology – remarkable, since the products have been on the market less than a year.

2.6 Operational Data Stores (ODS)

In the last few years, the data warehousing industry has begun to mature, and with this maturity have come new ways to view and integrate decision support technology. The concept of Operational Data Stores (ODSs) has evolved out of this industry experience, and has recently been promoted heavily by Bill Inmon. An ODS is not a warehouse, nor is it a mart. Rather it is another store used, in conjunction with warehouses and marts, within some warehousing architectures. An ODS is a regularly-refreshed container of detailed data used for very short term, tactical decision making, and is often used as one of the feeds to a central data warehouse hub. I discuss ODSs further in the next section.

3. Data Warehouse Architectures

Quite recently, some of the venerable pioneers and gurus of data warehousing history have taken the proverbial "step back" to assess what works and what doesn't. Billions of dollars have been lost in unsuccessful data warehousing initiatives; and there are certainly a lot of ways to create poor ones. A key focus of these leaders today is: data warehouse architecture.
Shaku Atre, acknowledged expert in the data warehousing, client-server and database fields, speaks about architecture in her latest 1998 report:

"Because data warehousing is playing a more critical role, organizations need to ensure that their data warehousing capability is able to meet requirements that change rapidly. You need an approach that delivers quick results now, but provides a flexible and extendible framework for the future; this means you need to build the right architecture from the beginning."

"Without the right architecture, there can be no effective long term strategy."

In Data Warehousing Fundamentals: What You Need to Know to Succeed, Bob Lambert states:

"A data warehouse is a very complex system that integrates many diverse components: personal computers, DSS/EIS software, communications networks, servers, mainframes, and different database management system packages, as well as many different people and organizational units with different objectives. The overall requirement for data warehouse is to provide a useful source of consistent information describing the organization and its environment. Even though this requirement can be simply stated, it is a moving target, buffeted by accelerating change in business conditions and information technology. Successful data warehouse projects require architectural design."

In this section, we look at data warehouse architectures through the eyes of the industry's leading experts. To do this, we need to again separate the ideas of data mart and data warehouse.

3.1 Marts and Warehouses

Based on the nature of star schemas, it is easy to deduce that an average business could have several data marts, each focused on a specific area of analysis. One analyst reported in 1997 that most companies have already developed three or more data marts. Many data marts are operated more-or-less autonomously within departments until they prove successful. But after a few successful marts are in full swing, it makes sense to pool resources and refine the basic processes used in populating warehouses. (I'll explore these processes in a later section.)

In businesses with a strong central IT organization, there is often a desire to create a central data warehousing "infrastructure" that mart-builders can take advantage of. In fact, some central organizations have had the clout to prevent marts from being created until a set of central services is established, thus impeding some business objectives. There is a fair amount of controversy between these two camps. In part, it comes down to whether marts are independent or dependent (on a central warehouse). Sometimes this is just a "turf war"; but there are many other factors to consider having to do with time and resources.

Suppose a body of information is essential to 5 marts, and this body of information is aggregated from 4 different sources and cleaned in preparation for use. If each mart is independent, then each mart performs the extractions from the 4 sources, merges the data, performs its own cleaning, and populates the mart database; that's a total of 5 x 4 = 20 extractions, 5 merge functions and 5 cleanings. Besides the duplication of effort, what are the chances that these operations are conducted in precisely the same way, yielding precisely the same data?

Suppose instead that the same data were processed once (centrally) and then distributed to the 5 marts. This would result in 4 extractions, 1 merge function, 1 cleaning, and 5 distributions; the short sketch below tallies both cases.
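This is a trivial, illustrative sketch only, using the counting rules just stated (the function names are mine, purely for illustration):

    def independent(marts, sources):
        # Every mart extracts from every source, then merges and cleans on its own.
        return {"extractions": marts * sources, "merges": marts, "cleanings": marts}

    def centralized(marts, sources):
        # Extract and prepare the data once, then distribute the result to each mart.
        return {"extractions": sources, "merges": 1, "cleanings": 1,
                "distributions": marts}

    print(independent(5, 4))  # {'extractions': 20, 'merges': 5, 'cleanings': 5}
    print(centralized(5, 4))  # {'extractions': 4, 'merges': 1, 'cleanings': 1, 'distributions': 5}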
An important side benefit is that the data would be consistent, and have precisely the same meaning, in each of the marts. Therefore, it would be safe to combine information from the different marts into a higher-level report.

Aside from redundant processing, data replication is another important issue to consider. Data that is aggregated centrally (for processing) and then distributed to marts naturally resides in multiple locations at a time. This can consume significant DASD resources that must be planned for and managed.

What do you mean by "Architecture"?

In the following sections, we'll look at two aspects of architecture relating to data warehouses. First we explore the physical architecture – basically the decision of whether the warehouse (with marts) is centralized or distributed. Second, we explore the framework architecture – that is, the relationships between data, operations on data, and movement of data.

3.2 Physical Warehouse Architecture

There are several architectures that data warehouses can be based on. Each has pros and cons. One key to data warehousing success is to first understand the possible options, then understand the unique business needs of the company, and finally select architectures that meet those needs.

Centralized

In this approach, a company gathers production data from across the organizations of the company into a central store. This data covers many different subject areas, and/or lines of business. The advantages of the centralized approach include the degree of control, accuracy and reliability it provides, along with economies of scale. But one of the main problems with the centralized approach is that it runs directly into any organizational or political issues that are causing difficulties for the company. Building such a data warehouse will commonly require great cooperation between IT and users, between business units and central management, and among various business units. This is a big reason, without even considering technology challenges, why centralized, monolithic data warehouses tend to be expensive, complex and take a long time to build. In other words, in order to serve an enterprise, the data warehouse must reflect its complexity.

Usually, a centralized warehouse will draw from a wide variety of data sources. It will be a big job for IT to transform and combine the data into forms better suited for analysis, especially when large volumes of non-relational mainframe data are used. This data will also need to be "cleaned" – an effort-intensive process of removing inconsistency and redundancy that needs to involve users. In a centralized data warehouse, it can be difficult to motivate users to participate in cleaning data if they don't see an immediate payoff for themselves. (Note: I'll discuss later how data usage directly contributes to data quality.)

The pros and cons of the central warehouse are:

    Pros:
    - Engineered by IT as an enterprise system to ensure quality of design.
    - Able to share components and thus justify industrial-strength tools for data cleansing and transformation.
    - Generally use high performance platforms able to scale and handle growth.
    - Highly manageable.

    Cons:
    - Big and complex to maintain.
    - Take a long time to plan and implement, because projects try to accomplish too much.
    - Expensive (often three to five million dollars).
    - Come up against organizational and political walls.
Independent Data Mart

The data mart is typically a specialized data warehouse that covers one subject area, such as Finance or Marketing. Since a data mart is focused on meeting specific business user needs, it is generally easier to get users more involved in the data quality, cleansing and transformation activities. Because the audience is usually limited to a group of end users performing the same function, a data mart avoids much of the interdepartmental conflict that occurs between business units with different business processes, different world views and different priorities.

In this architecture, a company builds a series of data marts that each model a particular subject area. By narrowing the scope, a data mart avoids much of the complexity of centralized warehousing – like trying to model the entire company. The initial data volume and number of data sources may also be far less than in the centralized architecture, which may mean the marts can run on smaller, more cost-effective machines.

Figure 3a: Centralized vs. Mart Architecture

An independent data mart is a stand-alone system that does not connect to other data marts. The advantage of an independent mart is that its politics and technical complexity are localized, making it the quickest and least expensive to deploy. The primary risk with an independent mart arises when you build a series of them. When a mart reflects the different business assumptions of its department, it can become an island of information within the organization. A secondary risk is that the marts will each require diverse data feeds from legacy sources. Also, each mart stores its own transformations of the corporate data (along with summaries and indexes), driving overall data volumes up 3 to 7 times.

The pros and cons of independent data marts are:

    Pros:
    - Narrower and more contained scope.
    - Can be built relatively quickly.
    - Lower initial costs for hardware, software and staff.

    Cons:
    - Hidden costs, which multiply as you build a series of marts.
    - Potential duplication of effort.
    - Can be poor in quality and inconsistent.
    - Tend not to scale up very well.

Dependent Data Marts, with Distribution Data Warehouse

Another architecture, the dependent data mart, shares most characteristics of the independent mart – that is, it still concentrates on one subject area. The primary difference is that a dependent mart "depends" on a central data warehouse that stages and transforms data for many marts. This architecture does a good job of balancing department needs (for flexibility and focus) with enterprise needs (consistency, control and manageability). The big problem with dependent marts comes when management decides they will not actualize any marts until the central warehousing facilities are in place – which could take years.

Operational Data Stores - with Marts and Data Warehouse

Yet another architecture incorporates the use of Operational Data Stores (ODSs) with dependent marts and a distribution warehouse. ODSs are usually deployed for tactical reasons. An ODS is basically a subject-oriented, time-variant and volatile (frequently changing or updated) image of production system data. An ODS may be implemented over a star schema, but this schema may be quite different from downstream warehouse schemas; the ODS schema and content tend to be geared towards very short-term tactical analysis rather than trend analysis. The data in an ODS may, or may not, be cleansed.
An ODS may be implemented for one or more of a variety of purposes:

Better organized and more accessible view of production data. Production data is often stored in indexed files or old DBMS types, such as IMS. Users accessing these stores require specialized skills, and are usually programmers. When production data is surfaced in an ODS based on relational technology, it becomes more available to growing numbers of analysts through highly friendly ODBC-based tools.

Production system extension. Once production data is available in a more accessible form, production system analysts often want to leverage it. This can lead to extensions to production systems that operate on ODS data. These applications may generate "new data" that never existed in the production systems; when this occurs it can be useful to propagate the new data "back" into production databases. When ODSs are used in this manner, it is easy to see how "the line gets blurred" between production systems and the ODS, leading to some very significant management challenges.

Production system migration. An ODS can be used as part of a strategy to migrate production systems from antiquated DBMSs to new (usually relational) ones. Using a populated ODS that "mirrors" the content of production data, new applications can be developed and tested, while the old applications continue to serve their existing functions. Eventually, the old applications and DBMS can be phased out.

Short-window tactical analysis. The mainstream purpose for most ODSs is to provide "current moment" tactical analysis of the business. This differs from typical data mart analysis, which often covers longer-term trends, and depends on stable, immutable data.

Staging point on the way to the distribution warehouse. Very often, data from an ODS gets loaded into the data warehouse (subject to established data cleansing and transformation rules). This data propagation must be managed carefully, and on a very time-sensitive basis, to ensure that the data contains a proper subset (periodic slice) of information; in other words, such propagation must be planned around the fact that the ODS data is frequently refreshed and updated.

The following figure illustrates the options available when an ODS is part of the data warehousing architecture:

Figure 3b: Environment with Operational Data Store

The operational data store concept can be worked into most data warehouse architectures. The decision for-or-against an ODS is usually based on the unique needs of the particular data warehousing environment – especially the quality of, and accessibility to, production system data.

Virtual Data Warehouse

A virtual data warehouse isn't really a physical data warehouse at all, but rather a unified way to access production data that resides on diverse systems. Thus, although it appears to the warehouse users that they are working with a dedicated system, the data does not really get moved into one – it remains in the operational stores. This approach may meet the needs of less dynamic organizations that do not really need a full-fledged warehouse, or of any organization not quite ready for real data warehousing. Perhaps the most serious limitation of the virtual warehouse approach is that the data is not cleansed, transformed or reformatted in ways that support better analysis, thus seriously hampering its effectiveness.
Virtual warehouses may find use when a real data warehouse can't be cost-justified, for proofs of concept, when an interim solution is needed while a warehouse is built, or when only infrequent access to production stores is needed.

"Hub-and-Spoke" Architecture

The "hub-and-spoke" architecture attempts to combine the speed and simplicity of independent data marts with the efficiency, manageability and economies of scale of the centralized data warehouse. Shaku Atre terms this architecture the "managed data mart" approach. The managed data mart approach is very similar to the distribution warehouse and dependent marts architecture. The company deploys marts across the enterprise and manages them through a central facility where data is staged and transformed. This central facility can be built incrementally over time.

The hub-and-spoke architecture is a state of mind about how data warehousing should be managed. The "hub" delivers common services and performs common tasks – most notably data preparation and management. The "spokes" include data sources (inputs), central staging facilities, and the target warehouses, marts and user query applications (destinations).

Figure 3c: Managed Mart Approach

This is a hybrid approach that combines data marts with a central data warehousing component. The "hub" is not a full-fledged data warehouse. For example, it isn't designed to provide user data access; rather, the hub's role is that of a central "clearing house" and management center for the enterprise's data warehousing capability. The hub receives data from various sources; then cleans, transforms and integrates it – according to the needs of the data warehouses, data marts, and user applications that it services.

The hub can function in two ways. It can serve as a transient data hub that does not maintain any long-term data; in this mode, it holds data only long enough to clean it, transform it and stage it until marts can accept it. Alternately, it can be a distribution data warehouse that maintains cleansed data for the data marts. In this mode, it can grow in size and function over time. Either way, because services are centralized, the hub can help a company enforce consistent standards for the design of marts, and can dramatically simplify the creation of new marts from existing data pools.

Of the architectures discussed, the hub-and-spoke approach allows scalability of every part of the data warehousing capability: data sources, central component, and data marts. For large corporate undertakings, experts seem to agree that it is the best approach. But it can also be noted that the other architectures may have their place during the evolution of a data warehousing solution.

3.3 Warehouse Framework Architecture

A data warehouse "framework" is a means to define and understand the locations of data, access to data, movement of data and operations on data.
At a high level, the data warehouse environment can be segmented into several interconnected "layers" of functionality:

- Production or External Database Layer
- Data Access Layer
- Transformation (aggregation, cleansing, transformation and dimensioning) Layer
- Data Staging Layer
- Data Warehouse Layer
- Information Access Layer
- Application Messaging Layer
- Meta Data Directory Layer
- Process Management Layer

These layers are illustrated below:

Figure 3d: Data Warehouse Framework

I'll discuss each of these layers briefly:

Production or External Database Layer

Production systems process data to meet daily business operation needs. Historically these databases have been created to provide efficient processing for a relatively small number of well-defined transactions; therefore, they are difficult to access for general information query purposes. As noted earlier, there are also significant problems with data quality in these stores. Increasingly, large organizations are acquiring additional data from outside databases or electronic transmissions. This information can include demographic, econometric, competitive and purchasing trends.

Data Access Layer

The figure above portrays two different Data Access layers: one over the data warehouse, and another over the production databases. Predominantly, SQL is the access language used by Information Access tools to retrieve data warehouse information through this layer. In contrast, the data access methods used over production and external data sources may be quite different – and possibly quite archaic. The Data Access layer not only spans different DBMSs and file systems running on the same hardware, it spans manufacturers and network protocols as well.

Transformation Layer

The Transformation layer is responsible for the aggregation, cleansing, transformation and dimensioning of data gathered from production and external sources. This is where the bulk of the data preparation activities take place. This layer may use services of the Staging layer to privately store intermediate data results during the transformation processes.

Data Staging Layer

An important component of the framework is the Data Staging layer. This layer handles the necessary staging, copying and replication of data between the production systems, preparation tools and data warehouse stores; in distributed warehouse architectures (such as distribution marts or hub-and-spoke), this layer can also stage data movement between marts.

Data Warehouse Layer

The core (physical) Data Warehouse layer is where the cleansed, organized data used as the basis for DSS and OLAP is stored. This data is commonly stored in relational databases, although multi-dimensional and OLAP databases are also used.

Information Access Layer

The end users of the data warehouse deal directly with the Information Access layer. In particular, it represents the tools that end users employ to access information, and the office tools used to graph, report and analyze information retrieved from the data warehouse. A number of OLAP and business query tools are available in this layer.

Application Messaging Layer

The Application Messaging layer is pervasive across the production system and data warehousing environments.
It is responsible for transporting data around the enterprise's computing network. The Messaging layer may include "middleware" – technology used to transfer data across different platforms and "equalize" data formats between tools/applications. Messaging can also be used to collect transactions or messages and later deliver them to a specific location at a particular time.

Meta Data Directory Layer

It is hard to imagine a large-scale data warehousing solution that does not take significant advantage of a repository to manage meta data about the warehousing and production system environments. Meta data is the "data about data" within the enterprise. Meta data about data structures, data models, production databases, value domains, transformation rules, authorities, schedules (and many other subjects) is necessary to effectively manage even a small data warehouse. Some solutions from data warehouse tool vendors include a meta data repository, but the common weakness of vendor-included repositories is that the scope of their meta data collection is limited to the scope of the tool. This means that a large data warehousing environment may end up with more than one "partial solution" meta data repository. The Meta Data Directory layer provides access to the meta data repository (perhaps more than one). Ideally, a common meta data repository can be implemented to aggregate data from (or coordinate access to) the repositories that occur in the environment.

Process Management Layer

The Process Management layer is involved in sequencing and scheduling the various tasks involved in building and maintaining the data warehouse, and in building and maintaining the meta data repository.

Together, these layers provide the infrastructure for data warehousing activities and management.

4. What is Data Mart/Warehouse Management?

In this section, we take a closer look at data warehouse management. What gets managed, and why is it important? Many roles and responsibilities are involved with managing successful data warehouse implementations. Depending on the architecture selected, a significant number of people may be involved; in small environments, a few people may serve multiple roles.

4.1 Managing Data Administration: Defining "Information"

Recall that the objective of data warehousing is to create an accurate and consistent view of the business that can be analyzed, and from which decisions can be made. This means that the business must be well understood, and the goal(s) of the data warehouse should be crystal clear. But in large organizations, and even within departments, the definitions of the information to be analyzed may be poorly understood.

Many enterprise resource planning (ERP) systems begin by defining the core business "objects" involved in any business, and then proceed to define systems around these themes. A good data warehousing solution does the same. This is about getting back to basics: clear and unambiguous themes, single definitions, specified ranges of acceptable values, and so on. Managing the definition of information sits at the heart of data quality management. Companies are often reluctant to manage information quality, because it is a very cumbersome, largely manual process unless it is supported by active repository services.

What does it mean to manage the "definition" of information? Let's discuss some of the key initiatives.
Terms and Semantics Directory (Glossary)

For most companies, it is essential that a corporate "glossary" or directory of terms be compiled. This directory stores information about the terms that describe the business, especially core business themes such as Customer, Vendor, Order, Line Item, Invoice, Sales Region, Service Region, Mail Drop, Zip Code, Country and so on. As companies grow, these terms can shift in meaning; for example, a term like "customer" may come to refer to an internal, as well as external, entity. Good business communication depends on everyone using well-understood terms that convey consistent semantics.

The term directory stores a semantic description of each business term, and may also provide a cross reference to related information, including:

- sub-classifications of the term (i.e. terms which refine the theme),
- business models the theme appears in,
- business processes the theme is involved in,
- common value domains (ranges of valid values) for the term,
- data types, structures or elements that implement the theme,
- business rules which involve the theme, and so on.

The directory also stores terms that are company internal. Some of these terms include acronyms for key management information systems, coined "business lingo" phrases, various codes used in business systems, and so on. Ideally, the terms glossary is part of a comprehensive business information directory (BID).

Value Domains and Abbreviations

Value domains specify the range of valid values for business terms and technical elements. Value ranges for business terms may be general, while the value domains for technical elements may be specific to the technical element.

Why are value domains important? Suppose you are querying a production database, and filtering on the value of a column. You may (or may not) have visibility to the data type – for example, that it is a YES/NO binary flag. Assuming you have this information, what predicates should you use in your query? Probably only the production system programmers know what values to expect in this column: NULL, a HIGH-VALUES code, a LOW-VALUES code, Y, y, N, n, 1, 0, space, T, t, F, f, and perhaps other characters. This illustrates the problem, and why an analyst constructing a query needs to know that a YES/NO flag has valid values Y or N (or perhaps Y or unknown) … period.

Value domains play a critical role in data warehousing practices, because they provide the basis for assessing the quality of data from production systems, and "cleansing" it, before it is made available in data warehouses. To support the cleansing operation, it is also useful to identify the set of common erroneous values, and associate each value with the appropriate correct value (or a single standard "unknown" value); this information can be generated into "data conversion maps" employed by the data cleansing tools.
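To make the conversion map concrete, here is a minimal sketch in Python of cleansing the YES/NO flag discussed above. The particular erroneous values mapped, and the "?" marker for "unknown", are illustrative assumptions rather than a fixed standard:

    # Minimal sketch: a data conversion map for a YES/NO flag.
    # The erroneous values and the "?" unknown marker are illustrative.
    YES_NO_DOMAIN = {"Y", "N"}

    CONVERSION_MAP = {
        "y": "Y", "1": "Y", "T": "Y", "t": "Y",
        "n": "N", "0": "N", "F": "N", "f": "N",
        " ": "?", "": "?", None: "?",
    }

    def cleanse_flag(raw):
        """Return a domain value, or "?" (unknown) if the value is unmapped."""
        if raw in YES_NO_DOMAIN:
            return raw
        return CONVERSION_MAP.get(raw, "?")

    assert cleanse_flag("t") == "Y"
    assert cleanse_flag(None) == "?"

In practice, a cleansing tool would generate such maps from the repository's value domain definitions rather than hard-coding them.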
Value domains also play a critical role in the development of on-line processing validation routines. Poorly crafted or incomplete validation at data entry is a key cause of inaccurate data in production databases in the first place.

Identifying a definitive set of corporate abbreviations for each primary business term is also highly useful. These abbreviations may be organized by size, i.e. a 5-character abbreviation, a 4-character one, a 3-character one, and so on. (Some organizations allow only ONE abbreviation to be defined.) The valid sets of abbreviations are used by quality standards initiatives, such as the standard naming conventions used by programmers and data structure/schema designers.

Naming Conventions

Naming conventions apply the rigor of business terms, semantics and abbreviations to the technical IT world. Naming conventions can be defined for various (usually technical) elements such as file names, module names, and data element/structure names; ideally they are also defined for less technical things, such as business processes. Consistent naming conventions should be outlined for virtually all items that comprise "information". Although naming conventions are usually attached to specific technical elements, it should be clear that they also have important linkages to business terms, abbreviations and data types.

Good conventions are built around an ordered sequence that includes some combination of type (or term), name, qualification, and context; these are sometimes called object label standards. In such standards, abbreviations are used for types/terms and contexts, and entity abbreviations are used for name and qualification. These conventions have been employed by application generation tools for years. Some organizations establish a standard for data element definitions that includes a standard "abbreviated name". These abbreviated names are highly useful in the construction of naming conventions that have length constraints (such as COBOL items).
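As an illustration, here is a minimal sketch in Python of an object label standard that composes a name from abbreviated parts and enforces a length constraint. The abbreviation registry is hypothetical, and the 30-character limit is an assumption modeled on COBOL data item names:

    # Minimal sketch: building names under an object label standard.
    # The abbreviation registry below is hypothetical.
    ABBREVIATIONS = {
        "SALES-REGION": "SLSRGN",
        "CUSTOMER": "CUST",
        "INVOICE": "INV",
        "NUMBER": "NBR",
    }

    MAX_LENGTH = 30   # e.g. a COBOL-style data name length limit

    def object_label(*parts):
        """Join abbreviated parts in order (context, name, qualification, type)."""
        label = "-".join(ABBREVIATIONS.get(p.upper(), p.upper()) for p in parts)
        if len(label) > MAX_LENGTH:
            raise ValueError(f"'{label}' exceeds {MAX_LENGTH} characters")
        return label

    print(object_label("Sales-Region", "Customer", "Invoice", "Number"))
    # prints SLSRGN-CUST-INV-NBR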
Data Type and Format Standards

Data type standards define the common data forms (and formats) that your organization understands as "information". Some examples include: Text, Number, Date, Time, Time Stamp, Identifier (ID), Code, URL, and so on. The standards also define the content guidelines for each respective type. Certainly most "year 2000" analysis projects can illustrate just how many forms DATEs and TIME STAMPs appear in. However, the data quality issues relating to other data types remain largely unexplored – until now, in the face of data warehousing. For example, a large manufacturing organization may have several different formats for the thing called "part number" – perhaps some alpha and some numeric. It is highly useful to relate standard data types with the various enterprise assets that implement them, including business themes, codes, data structures/schema, and so on.

Automated Data Quality Measurement

Data type, form, value and naming standards are high priorities for IT groups, but they are very difficult to manage unless adherence to the standards can be easily measured, and unless there are ways for the standards to be actively deployed in IT systems; without these, effort and attention wane. However, today's e-commerce and data warehousing initiatives are increasing corporate awareness of the impact that data quality has on the ability to understand (query and analyze) and modify the behaviors (shift tactics) of the business. These capabilities are highly correlated with competitive advantage.

Information Owners

It is common for the responsibility for data quality to go unassigned in sizeable organizations. IT managers refuse to accept responsibility for the content of data that "flows through" their systems, business managers disavow knowledge of the data values "stored in" IT systems, database administrators can't control what applications put in data stores, and users perpetuate and suffer from poor information content. Who should be responsible?

Corporations serious about business engineering, and needing to leverage data quality in data marts, are assigning explicit ownership for specific segments of information; a data administration group typically takes on this role. The responsibility for quality is most effectively assigned at the business unit or department level, although it can also be assigned more centrally. There are some key benefits to assigning this responsibility:

- A central point of contact is established that immediately understands a quality issue, and can prioritize the business value of multiple issues.
- A role with authority for the business unit gets first-hand, intimate knowledge of the information available (and unavailable) to the unit, and a keen understanding of which information is used (and which is not).
- A stronger link is forged between the goals of the business unit and the IT data, systems and marts that support them.

The "data quality analysis" and "data cleansing" operations of the data warehouse are key opportunities to establish these standards, and to put them to active work in critical decision systems.

Managing IT-related Meta Data

There is a significant amount of other IT-related meta data to be managed besides the database schemas. This includes, but may not be limited to:

- meta data from traditional application analysis tools (such as ESW or EPM),
- meta data from vendor-specific repositories,
- meta data from business process, conceptual design, or database design (CASE) modeling tools,
- meta data from application environment and management tools (such as Tivoli),
- meta data from middleware or messaging tools (such as MQSeries),
- meta data from message brokers or component managers (ORBs, etc.),

… and varied other sources. Managing this varied body of meta data in a meaningful way (by inter-relating it) is perhaps the greatest challenge for IT (if not business management) in the coming years. It also represents one of the greatest opportunities for companies in the repository and information directory industry. The challenge is in organizing and presenting the key information while avoiding the common problems of myopia (seeing too little, too small a scope) and macro-phobia (seeing a picture too big to be meaningful). The vehicle for presenting this information is sometimes called a "business information directory", a BID, or some similar name. Certain industry professionals, including John Zachman, have spent careers in pursuit of a highly usable information directory. Other organizations, such as Enterprise Engines, Inc., have pursued an alternate strategy of integrating business application generation (Java-based) with a strategic business design framework. Both of these approaches serve to shorten the cycle between a business decision and a corresponding shift in the policies enforced by business systems. This is the stuff that makes enterprises truly agile. Without meta data correlation spanning business process definition, such agility is a pipe dream for large corporations. A BID solution, or similar integration between business planning and application system engineering, brings agility into the realm of feasibility. But a BID, by itself, is not the full solution. Enterprises that are steeped in traditional values and archaic management structures may experience great difficulty "re-engineering themselves" to take advantage of this new wave of information.
The data administration function should play a role in defining the data transformation maps and rules used in data warehousing; these management subjects are discussed in later sections. The data administration role should also be held accountable for maintaining the meta data linkages between production (or data mart) schemas and the associated business themes; this subject, too, is discussed later.

4.2 Managing Business Segment Analysis & Mart Design

Good data mart design begins with analyzing the business segment that the mart will service. I group the management of these practices together because they are heavily linked during data warehouse development.

Exploration

Data exploration involves rummaging through, and creatively assessing, the ODSs or production data available to build warehouses and marts from. This task requires a keen sense of the business objectives and issues, and an open mind about how [seemingly obscure] data might be used to derive facts and insights about the business. Exploration is similar to data mining; the explorer may run a series of queries and look for patterns, or analyze the distinct values found in columns of interest. But exploration doesn't usually involve special technology; rather, it is done through schema analysis and basic database queries.

Schema Design

In his book The Data Warehouse Toolkit, Ralph Kimball describes a number of data mart star (dimensional) schemas geared toward analyzing different aspects of the business. These schemas can be used as analysis models, and serve as a good starting point for thinking outside the traditional IT "transaction-based, fully-normalized" box. Below are just a few, ranging from the simple to the more complex:

- Grocery Store (retail analysis)
- Warehouse (inventory analysis)
- Shipments (discounts, ship modes and other analyses)
- Value Chain (demographics)

Others include Financial Services, Subscriptions, Insurance, and non-fact tracking methods.

Building a dimensional data warehouse is a process of matching the needs of the business user community to the realities of available data. That sounds remarkably like what IT business systems are supposed to do in the first place. This disconnect – between what business users need to run the business, and what IT systems typically supply – is the driver for the entire data warehousing opportunity. In practice, most data warehousing implementations become evolutionary, long-term projects. The "facts" (synthesized out of production data and deposited in marts) offer new views of the business. This leads to further analysis, confirmation and information mining, which in turn ultimately lead to refinements in core IT business systems.

Why won't data warehouses simply replace IT systems? The answer remains: the inherent differences between "transaction processing" and "analysis and decision support" keep these two worlds from converging.
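Returning to the star schemas described above, here is a minimal sketch in Python (using the standard sqlite3 module) of a small dimensional schema and the kind of business query it is designed to serve. The table names, columns and figures are illustrative, not taken from Kimball's designs:

    # Minimal sketch: a star schema with one fact table and its dimensions.
    # Tables, columns and sample values are illustrative only.
    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE time_dim    (time_key INTEGER PRIMARY KEY, day TEXT, month TEXT);
        CREATE TABLE product_dim (product_key INTEGER PRIMARY KEY, name TEXT, category TEXT);
        CREATE TABLE store_dim   (store_key INTEGER PRIMARY KEY, name TEXT, region TEXT);
        CREATE TABLE sales_fact  (time_key INTEGER, product_key INTEGER, store_key INTEGER,
                                  dollars_sold REAL, units_sold INTEGER);
    """)
    conn.execute("INSERT INTO time_dim VALUES (1, '1998-09-02', '1998-09')")
    conn.execute("INSERT INTO store_dim VALUES (1, 'Store 12', 'West')")
    conn.executemany("INSERT INTO sales_fact VALUES (?, ?, ?, ?, ?)",
                     [(1, 1, 1, 1250.0, 100), (1, 1, 1, 300.0, 25)])

    # A typical dimensional query: dollars sold, sliced by region and month.
    query = """
        SELECT s.region, t.month, SUM(f.dollars_sold)
        FROM sales_fact f
        JOIN store_dim s ON s.store_key = f.store_key
        JOIN time_dim  t ON t.time_key  = f.time_key
        GROUP BY s.region, t.month
    """
    for row in conn.execute(query):
        print(row)   # ('West', '1998-09', 1550.0)

Note how the business analyst's question ("dollars by region and month") maps directly onto joins from the fact table to its dimensions; this is the accessibility that fully normalized transaction schemas lack.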
At least nine decision points affect the database design for a dimensional data warehouse:

- the business processes, and hence the identity of the core fact tables,
- the grain (data granularity) of each fact table,
- the dimensions of each fact table,
- the facts, including pre-calculated facts,
- the dimension attributes, with complete descriptions and proper terminology,
- how to track slowly changing dimensions,
- the aggregations, heterogeneous dimensions, mini-dimensions, query modes and other physical storage decisions,
- the historical duration of the database, and
- the urgency (frequency) with which the data is extracted and loaded into the data warehouse.

These factors must be addressed essentially in the order given. As fact tables and dimensions are determined, the mart designer can begin to work with Database Administration to develop the mart's database schema.

Locating Data Sources

The search for relevant data sources is another responsibility that the mart designer may undertake. I discuss this topic later, in the section Managing Data Extraction.

Prototyping

A significant role of mart design is prototyping the mart and testing its behavior. Queries must be developed to use the new schema; these access paths will drive requirements for how the mart tables will be indexed. Indexing is one of the key performance tuning mechanisms available to database designers, and the one that consumes the most DASD resource. Finally, queries must be tested to determine that they are using optimum data access paths, and using indices as expected.

4.3 Managing Database Administration

Several kinds of database administration (DBA) are necessary to support large-scale data warehousing projects. In this context, I refer mainly to the schema management of production systems, the schema management of warehouses and marts, and the cooperative administration of meta data models (which reflect enterprise and technical themes) in the corporate or tactical repositories. Database administration should get involved with the valid data value domains for technical data structures, since an understanding of possible value ranges and scope can contribute to data normalization decisions. And the database administration group should be held accountable for actuating naming conventions and format standards in the schemas and data structures produced for IT.

Managing Production Data Store Schema

The Database Administration function should have accountability for the schema design, implementation and tuning of production data stores. I use the term "accountability" because, in some organizations, this function may simply oversee these operations – due to staffing constraints of the DBA function. This group must promote and police adherence to naming convention standards, especially for new schemas and shared record layouts, but also for existing schemas … if management and resources allow for this re-engineering.

Managing Data Warehouse, Mart and ODS Schema

The Database Administration function should have accountability for the schema design, implementation and tuning of warehouse and mart data stores as well. In organizations new to data warehousing, the fact and dimension theme designs should ideally be done by (or in conjunction with) the mart designer; the mart designer will have keen sensibilities about the analysis that the mart must support, and care should be taken that these insights are not contaminated early in pursuit of blind data optimizations.
As the database administration function matures in the management and design of mart information, it can play a key role in re-using the schema design patterns that have been deployed in the past, and any persistent "clean data" pools in the warehouse. It is even more critical that the DBA group adhere to standards for the mart and data warehouse schemas. A less technical class of user (the business analyst) will be using these table and column names in countless queries. Queries against different marts may later be correlated into aggregate reports, so semantics must be consistent. To accelerate usage of mart schemas, the DBA can prototype a number of basic queries that illustrate valid usage patterns to prospective end users.

Applying Data Types, Naming Conventions and Format Descriptors

The DBA group, in cooperation with the data operations group, shares responsibility for managing data types. This function includes applying the data standards of the organization (including valid content and format), and, perhaps more important, developing an auditing and continuous improvement plan for data structures used in implemented systems. This can be a political issue: the time the organization spends improving its data quality and understanding may be viewed (by some) as better spent on activities that more directly contribute to strategic initiatives. In other words: do we decide to prepare ourselves for growth, or skip over that detail in order to grow sooner?

Also see the related management topics, Managing Data Valid Value Domains and Managing Data Traceability, below.

4.4 Managing Data Valid Value Domains

An administrative group must be assigned to pay special attention to the valid value domains for technical elements (codes, IDs and so on). These domains, and supplementary cross-references of "bad values" to appropriate domain values, will greatly aid the data cleansing functions of the data warehouse. This responsibility may be assigned by business unit; if the central data warehouse "distribution hub" is adopted, it could be coordinated centrally. The most essential aspect is that the accountability for data quality remains close to the end users of the data. See the related Managing Data Quality and Usage topic below.

4.5 Managing Data Extraction

Data extraction is the process of pulling the data from production data stores that will ultimately be used by one or more marts. Very often, this is a complex process that involves using 3rd-party extraction tools, which are usually applicable to commercial database sources. It may also involve home-grown solutions to extract data from complex data structures, or odd-form data stores peculiar to a given IT environment. I break this section into two segments: qualification and extraction. These functions typically occur at different times during the data warehouse development project, and may be performed by different organizations.

Identifying and Qualifying Candidate Stores to Draw From

Ahead of the task of extraction is the task of identifying and qualifying the "candidate" production data stores that data could be drawn from. The magnitude of this task depends on the size, number and distribution of the production system data stores. In exceptionally large global organizations, these stores could be managed in multiple geographic locations. This task may not be performed by the data extraction team; if the mart designer is fairly technical, he/she may conduct this research.
However, there are real benefits to having the data extraction team involved with "data sourcing". First, this gives the extraction team early information about the formation of a mart – most importantly, some insight into the problem to be solved; this knowledge can be important for selecting the right sources. Second, the extraction team tends to be a fairly technical group with intimate knowledge of problems or incompleteness in certain data pools; this should greatly help in the selection process. In a complex environment, data mining might be employed to locate suitable data sources. Data mining relies on specialized tools to search production stores for certain clusters of information.

Managing Queries and Data Extraction

A key challenge in data extraction is to ensure that the data content extracted matches the expected scope of prospective mart users; this means applying a number of filters or selection criteria. Close communication is important on these subjects; this is another reason why the extraction team should have a close relationship with the mart design team. The extraction team manages the queries and tools, and/or develops programs to accomplish the extractions. For these tasks, intimate knowledge of the data structures, or ready access to accurate meta data about the structures, is crucial.

For most data warehousing situations, a given data extraction will be performed on a very precise schedule, with very particular dependencies on the completion of specific operation phases. For example, extractions of data from the inventory database might be dependent on completion of data entry for today's warehouse shipments. Honoring these dependencies is key to providing accurate and semantically complete information to later processing steps. See the related topic Managing Schedules and Dependencies below.

Data extraction for data warehouse purposes is necessarily "downstream" from the data administration of the production system data stores. When data structures in the production data stores change, the relevant DW extraction processes must immediately be altered to stay in sync; otherwise, the data warehouse population mechanisms will break down.

4.6 Managing Data Transformation

Data transformation is perhaps the "muddiest" topic associated with data warehousing. I discuss it separately so that I can be clear about its definitions; but in fact, data transformations may occur in concert with other operations, such as data extraction. In this section, I cover several transformations common to data warehousing.

Before proceeding, it is worth mentioning that a number of other utilities come into play that don't do actual transformations. The most obvious ones are simple sorting and merging utilities; others include database unloads and proprietary 4GL extractions. While not complex, these utilities consume time, and must be represented in the overall dependency scheme that populates a warehouse.

The management of any data transformation activity, such as those discussed below, requires knowledge of data element and record key offsets within the data stores being processed. Also required are the sequences of, and dependencies between, transformation activities, the schedules on which they are performed, and the associations between the originating, intermediate and final data stores involved in any transformation thread. Management needs specific to particular transformations are discussed in the relevant sections below.
Data Conjunction

Data conjunction is the process of gathering data from different sources, and potentially different formats, and transforming it into a common record profile. For example, customer information may be gathered from several business subsystems or regional databases; each of these sources may impose different constraints or value formats on the data. In order to make this data available, in total, to subsequent data warehouse processing, it must be transformed into a consistent format. The management of conjunction requires knowledge of: each data schema that will be merged, the target schema, mappings of the source schemas to the target one, incidental transformations associated with the mapping, the job streams or programs that perform the operation, and the scheduling of the operation.

Data Aggregation

Data aggregation is the process of making meaningful summarizations of data before the data is made available to subsequent data warehouse processing. Typically, these summarizations are made along the "dimensions" that will ultimately be used in data marts. See the related topic Data Dimensioning below. For example, in Figure 2a I discussed a fact table, called Sales Fact, that served as the foundation of a star schema. Columns in this table, such as dollars_sold and units_sold, represent aggregations of information along the dimensions of time, product and store.

It is possible for data to go through several levels of aggregation, especially if it is used in multiple marts with different focuses. For example, detailed sales data might be aggregated to feed a regional sales analysis mart, then later re-aggregated to feed an enterprise-global inventory movement analysis mart. The management of aggregation requires knowledge of the schema of the detailed data, the schema of the target data (usually a fact table and dimension tables), mappings of the source schemas to the target one, the job streams or programs that perform the operation, and the scheduling of the operation.

Data Cleansing

Data cleansing is the process of making data conform to established formats, lengths and value domains so that it can be effectively ordered, merged, aggregated and so on for use in data marts. Data cleansing is necessary because, typically, production data is subject to a number of quality problems. These problems are introduced through poor data entry validation, electronic data transmission errors, and business processing anomalies. Most aging business systems have "hard coded" business rules (informally maintained over the years) which reject, tolerate or fix data quality anomalies on the fly (during processing). But in order to make this data usable for decision support systems, or any relational query mechanism, the values in the data must be made to conform to consistent standards. The management of data cleansing requires additional knowledge of the valid value domains for technical data elements. The linkage between a data element and a value domain may be through a business element or technical element definition, if such information is gathered in a repository. See also Managing Data Valid Value Domains above, and Managing Data Quality and Usage below.

Data Dimensioning

Data dimensioning is the process of deriving a fact table, and corresponding dimension tables, from cleansed data; it is a specialized kind of aggregation (described above). Dimensioned data is the cornerstone of most data warehouse analysis.
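As a minimal sketch of what dimensioning produces, here is a Python fragment that rolls cleansed detail records up to the time/product/store grain of the Sales Fact example above; the detail records themselves are invented for illustration:

    # Minimal sketch: aggregating cleansed detail into fact rows along
    # the time, product and store dimensions. The detail is illustrative.
    from collections import defaultdict

    detail = [
        # (date, product, store, dollars, units) -- cleansed transactions
        ("1998-09-01", "P-100", "S-12", 10.0, 2),
        ("1998-09-01", "P-100", "S-12", 15.0, 3),
        ("1998-09-01", "P-200", "S-12", 40.0, 1),
    ]

    facts = defaultdict(lambda: [0.0, 0])
    for date, product, store, dollars, units in detail:
        key = (date, product, store)        # the grain of the fact table
        facts[key][0] += dollars            # dollars_sold
        facts[key][1] += units              # units_sold

    for (date, product, store), (dollars_sold, units_sold) in sorted(facts.items()):
        print(date, product, store, dollars_sold, units_sold)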
The management of data dimensioning relies on knowledge of the semantics of the dimensions, and of their relationship to a particular "fact". This meta data should originate from the business model, and be correlated with the "mart themes" which support that model. Dimensions should also be associated with the specific star schemas that implement them. All of these relationships can be maintained in a repository.

4.7 Managing Data Traceability

Providing data traceability is perhaps the most challenging aspect of data warehousing. Users of a particular warehouse or mart require knowledge of the production stores and ODSs that the data was drawn from, and of the transformations made to it. Other trace information may also be required, such as the schedule on which the warehouse is augmented, and the cut-off times for the source data stores and processing.

There are two sides to this information: the "operations plan" for the warehouse, and the history of activities involved in warehouse operations (i.e. how the plan was executed). In this section, I discuss the operations plan, and defer the subject of tracking history to the section Managing History, Accountability and Audits below.

Virtually every data movement in the preparation and operation of the warehouse, in a sense, defines the warehouse. This is why traceability is so important. Information must be associated with specific sources, and be confirmed to follow well-defined semantics, in order to be useful in decision support systems. Otherwise, the information may carry unacceptable risks. In this section, I identify many of the facts and activities that must be traceable:

- Extraction or Query. Any program or query definition used to extract data from a data source according to well-defined criteria.
- Movement. Any movement of data into/from a process, onto/off archival media (tape), or transmitted through an electronic channel.
- Transformation. Any transformation of data into a similar or different configuration. Also the systematic cleansing of data.
- Dependencies. Any dependency of one process on another process, of any process on data, and of any data on a process. Also, the correlation between business concepts and the domains and marts in the warehouse.
- Schedules. The identified times (and durations, if appropriate) when source data must be available, when an operations activity shall take place, when warehouse data is made available, and when warehouse access will be predictably suspended. Also the planned dates/times of changes to current schema, activities, roles, ownership, dependencies, and so on (engineering change control).
- Content Ownership. The person, role or organization responsible for the definition of the content, and the policies that govern its quality.
- Operation Ownership. The person, role or organization responsible for a particular movement or transformation conducted by warehouse operations.
- History. A transactional history of activities managing the warehouse, including consummated engineering change control milestones, exceptions, corrections, reorganizations or disaster recovery.
- Exceptions. Any deviation from the operations schedule (beyond policy tolerance) or failure (beyond tolerance) of any planned activity or process.
- Corrections, Recovery and/or Re-loads. Any adjustments to warehouse or mart data (which is typically immutable). Such adjustments may be required to recover from a processing exception or database disaster.
- Rules and Policies. The rules that define warehouse operations, especially those implemented through the operations schedule. The policies that describe the acceptable tolerances for activity completions (such as return codes), and the actions to be taken upon any exception.
- Tools Used. The tools used in any operation, activity or aspect of the warehouse. These range from schema design tools, meta data gathering and management tools, and utilities, to any extraction, transformation or movement tools. Also included is the tool that implements the scheduler.

These traceable elements are described in greater detail in the adjoining sections of this paper.

4.8 Managing Schedules and Dependencies

All operation activities, engineering change control, data readiness deadlines and data availability windows are best managed through a central (or regional) scheduling system. Ideally, the scheduling system should reflect the operations and plans of the warehouse, expressed in terms of the meta data defining the objects and activities; likewise, historical information (which may be gathered from the scheduling system and execution environment) should refer to the same meta data definitions. For this reason, common "job schedulers" may be inadequate for the purposes of operating the warehouse.

The schedule will reflect a number of critical dependencies that must be maintained between warehouse activities in order to maintain the integrity of the warehouse and associated marts. These dependencies must be expressed and accessible so that a "recovery schedule" can be constructed on-the-fly (manually or automatically) when a processing exception disrupts subsequent portions of the schedule. An intelligent schedule management system may allow tailoring according to rules. Examples of rules include: election of an alternate processing flow if a specific exception occurs, or automatic resumption of a pending process when a required data store becomes available. A good scheduling system is essential for managing the day-to-day "threads" of activities involved in data extraction, transformation, movement or transmission, warehouse loading, and warehouse-to-mart data distributions.
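To illustrate the dependency knowledge such a scheduler works from, here is a minimal sketch in Python: a valid run order derived from declared dependencies, plus the set of downstream tasks a "recovery schedule" must cover when one task fails. The task names are invented:

    # Minimal sketch: deriving a run order and a recovery set from
    # declared task dependencies. Task names are illustrative.
    from graphlib import TopologicalSorter

    DEPENDS_ON = {
        "extract_inventory": {"shipment_data_entry"},
        "cleanse_inventory": {"extract_inventory"},
        "load_warehouse":    {"cleanse_inventory"},
        "distribute_marts":  {"load_warehouse"},
    }

    # A valid run order honoring every dependency.
    print(list(TopologicalSorter(DEPENDS_ON).static_order()))

    def affected_by(failed):
        """Return every task downstream of a failed task."""
        downstream, changed = {failed}, True
        while changed:
            changed = False
            for task, deps in DEPENDS_ON.items():
                if task not in downstream and deps & downstream:
                    downstream.add(task)
                    changed = True
        return downstream - {failed}

    print(sorted(affected_by("extract_inventory")))
    # ['cleanse_inventory', 'distribute_marts', 'load_warehouse']

A production scheduler would, of course, carry far more than this – calendars, tolerances, alternate flows – but the dependency graph is the core from which recovery schedules are constructed.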
4.9 Managing Data Quality and Usage

Without data quality, even the most rigorously managed warehouse has little inherent value. Data quality is influenced by every extraction, conjunction, transformation, movement, and any other data warehouse related activity. It is influenced by the right data being ready at the right time; and it is influenced by the valid value domains used in cleansing the data. Considering all the things that can go wrong, most data warehousing sites agree that data quality must be managed as a specific risk.

In Data Quality and Systems Theory, Ken Orr discusses managing data quality through a feedback-control system (FCS). "From the FCS standpoint, data quality is easy to define: Data quality is the measure of agreement between the data views presented by an information system and that same data in the real world." Orr discusses a number of data quality rules deduced from the FCS research:

- Data that is not used cannot be correct for very long.
- Data quality in an information system is a function of its use, not its collection.
- Data quality will, ultimately, be no better than its most stringent use.
- Data quality problems tend to become worse with the age of a system.
- The less likely some data attribute (element) is to change, the more traumatic it will be when it finally does change.
- Laws of data quality apply equally to data and to meta data (data about data).

Clearly, if an organization is not using data, then over time, real world changes will be ignored and the quality of that data will decline. Orr relates this behavior to the scientific phenomenon of atrophy, i.e. if you don't use a part of the body, it atrophies. Something similar happens to unused data: if no one uses it, then there is no basis for ensuring its quality. What does this have to do with data quality? Production system schemas are chock full of elements that were included "in case someone might want the information later". This data, even if populated with some consistency, has inherently poor quality because no users rely on it (perhaps since the inception of the system). It is a common mistake for warehouse designers to assume all elements existing in production system schemas are fair game for DSS; in truth, only a handful of production system programmers know anything about the quality of many elements.

Not only does data quality suffer as a system ages; so does the quality of its meta data. This begins as the people responsible for entering the data learn which fields are not used; they then either make little effort to enter correct data, or they begin using the data elements for other purposes. The consequence is that the data and the meta data cease to agree with the real world.

Very often, corporate accountability for data quality is assigned within the business units that consume it. Successful efforts follow a "use-based" approach of audits, re-design, training and continuous measurement.

Use-based Data Quality Audits

Auditing can be done through statistical sampling of the data pool. Elements can be ranked according to use-based criteria, including:

- How interested are users in the data?
- What is the data model, data design and meta data?
- Who uses the data today, how is it used, and how often is it used?
- How do data values compare to real world perceptions?
- How current is the data?

Quality audits should be conducted on a regular, periodic basis; besides addressing gross anomalies, auditors should also identify trends from a sequence of audits.
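A minimal sketch in Python of such a sampling audit, measuring how well a sampled pool conforms to a valid value domain; the domain, the pool and its roughly 5% spurious content are invented for the example:

    # Minimal sketch: auditing domain conformance by random sampling.
    # The domain and the data pool below are illustrative.
    import random

    VALID_REGION_CODES = {"N", "S", "E", "W"}

    def audit_conformance(pool, domain, sample_size=100, seed=1998):
        sample = random.Random(seed).sample(pool, min(sample_size, len(pool)))
        conforming = sum(1 for value in sample if value in domain)
        return conforming / len(sample)

    pool = ["N"] * 800 + ["S"] * 150 + ["?"] * 30 + [""] * 20   # ~5% spurious
    print(f"conformance: {audit_conformance(pool, VALID_REGION_CODES):.0%}")

Run over a sequence of audits, a falling conformance rate is exactly the kind of trend the auditors should be watching for.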
Data Quality Re-Design

In order to improve data quality, Orr says it is mandatory to improve the linkage between data throughout the system. The first step is a careful examination of which data serves a critical role, and how that data is used. Data usage is typically manifest in two areas: the basic business processes, and decision support. The goal of use-based redesign is to eliminate the flow of extraneous data through the system, and to identify inventive ways of ensuring that mainstream data is used more strenuously. Only so much resource is available to the quality assurance effort. The bottom line is this: if certain data cannot be maintained correctly, then it is questionable whether that data provides any value to the enterprise; perhaps it should be eliminated.

Data Quality Training

Both users and managers must come to understand the fundamentals of data quality before quality can improve. These parties must understand the steps taken in the organization to refine data quality, and understand the risks to data quality when data is not used, or used infrequently. It is unreasonable to expect that users and managers will intuit these facts on their own. General training ensures that a common perspective on data quality is promoted across organizations.

Data Quality Continuous Improvement

Management controls must be established to ensure that data quality policies and procedures are followed; a key component of these controls is measurement. Measurement and quality programs go hand-in-hand. As data quality re-design occurs, the data quality audits must be re-done for the redesigned systems. This begins the cycle of improvement for the new system. Data that is truly vital must be physically sampled and audited, and ideally these audits should be verified by authorities external to the local organization responsible for quality.

Data Quality S.W.A.T. Tactics

The departmental data quality authority must be mobilized to address quality problems quickly, especially when these problems are polluting the warehouse data pools. Accountability must be assigned to review and trace the data anomalies that appear – errant data in the data warehouse pools, exceptions that shake out of the data cleansing and transformation processes, and data redundancy that should be resolved during data conjunction. In summary, the data quality problems that must be actively managed are:

- Dirty data – inaccurate, invalid or missing data.
- Redundant data – from redundant records, with or without common keys, often with conflicting values.
- Inconsistent data – resulting from inconsistent transformation processes or rules.

An organization that is not dedicated to the active management of data quality is likewise not dedicated to a fruitful data warehouse environment.

Data Cleansing Management

In the section Managing Data Transformation above, I discussed the management of data cleansing processes. It is important to draw a distinction between managing the data cleansing activities, and managing the goals of those activities. The cleansing activity itself is largely mechanical; but the definition of the cleansing process (including the application of standards, value domains and conjunction matrices) is fairly creative work that often requires significant subject area expertise. The latter is better managed outside the operations group, and with departmental (users of the data) accountability.

4.10 Managing Data Warehouse Architectures

Data warehousing environments of any size must be designed with a particular architecture in mind. A common problem occurs when successful mart implementations are re-scaled, or combined in some fashion, to become centralized data warehouses. The success of the data warehouse project is significantly compromised when the warehouse architecture is simply left to "happen". Architecture must be a conscious effort. The independence or dependence of specific marts must be well known. The economies of scale possible through hub-and-spoke architectures must be calculated. The short-term and long-term usage patterns of marts and ODSs must be assessed and managed to.

In many respects, data warehouse environments are "living, breathing animals". They consume massive amounts of physical and people resources. Some portions grow dynamically while other portions decline. Requirements continually shift based on DSS feedback loops. DSS systems are often the most expensive initiatives in the enterprise, and therefore the most heavily scrutinized.
Some of the architecture aspects to be managed are:

- the logical network of warehouse, marts, ODSs and staging domains, to identify dependencies and key information flows,
- functional and logistical analysis of where and when data extraction, transformation, staging and distribution occur,
- the physical network of platforms, facilities and middleware used to implement the warehouse plan,
- resource load planning and budgeting for the initial and subsequent growth phases of the warehouse project,
- coordination of MIS, IT and Data Administration resources,
- selection of appropriate tools, or design of home-grown solutions, to enable the operations and management of the warehouse, and
- strategic planning and oversight of the growth plan.

It is often necessary to dedicate specific resources to the development and growth of the data warehouse architecture; otherwise, resources can get torn between the priorities of the warehouse environment and other operational priorities in the enterprise.

4.11 Managing Operations, Facilities and End User Tools

After planning and implementation are underway, running the warehouse on a day-to-day basis is largely a job of scheduling and logistics. Highly integrated activities cannot simply be "farmed out" to satellite organizations without a strong infrastructure in place. Even when a strong infrastructure exists, the rate of change during early warehousing efforts can be significant.

The logistics manager of the warehousing project needs a clear view of the hardware, software, tools, utilities, middleware, physical resources, intranet and internet capabilities, messaging services and other facilities that are employed in running the warehouse. Secondarily, s/he needs current information about the loading, or the available capacity, of these resources. These requirements are a tall order. Facilities are augmented all the time, and capacities can fluctuate dynamically. By definition, managing these operations includes estimating, planning and monitoring transaction and query workloads. Capacity planning is crucial for ensuring that growth is managed, and that intermittent crises don't bring the warehouse environment to its knees.

The tools employed by the end users of the warehouse (primarily query and OLAP tools), and the catalog of existing queries, are perhaps the most visible elements of the warehouse on a daily basis; this is where "the rubber hits the road" and the value potential of the warehousing effort is realized. Users continually and subjectively assess the quality and performance of the warehouse. Since satisfying users is the objective of the data warehouse project, it is important that the user community has some visibility to the logistics involved in operating the warehouse. This visibility includes the planned refreshment cycle, operational milestones (stages of processing), availability of certain data to the central warehouse, availability of new data to marts (distribution schedule and duration), and the "release" of data for official use following verification. Such visibility allows the user community to better understand the operational capabilities and constraints of the "warehouse engine", implicitly sets expectations about quality and availability, and engenders a team dynamic between the users and deliverers of the warehouse.

4.12 Managing Basic Database Operations

Data warehouses, ODSs and marts are largely implemented using commercial database technology.
This technology requires particular patterns of administration, including: backups, replication, reorganization, data migrations, column additions, index adjustments, recovery, maintenance and so on. In the data warehouse environment, the administrative overhead of these activities can be significantly higher than that of the production system environment, because of the dynamics of the warehouse and its logistical complexity. Additional resources should be allocated, if not dedicated, to handle this workload; the time and resource overhead of these activities must also be factored into logistics plans.

4.13 Managing Rules, Policies and Notifications

A data warehousing environment is managed through a number of rules and policies that affect day-to-day decision making and ultimately contribute to information quality. For example, what happens if one of the sources for a data conjunction is not available due to some catastrophe? Should the data conjunction still be performed, and the data warehouse loaded anyway? Do users need to be notified? Situations such as these need to be considered, and policies established to handle them, well in advance of their occurrence. Policies must reflect the critical nature of the data, and the effects on the business users when current data is not available. Such policies may vary by data mart, since the criticality of data and the sophistication of users likewise vary. The important thing is that they be established before a crisis occurs.

Communication becomes paramount when the operations of the warehouse are spread across a number of organizations. Peer organizations must be notified when problems disrupt the operations schedule. Data quality administrators must be notified when quality thresholds are exceeded. Users must be notified when there is a change in the delivery schedule, or when databases are unavailable due to failure or maintenance. Meta data about the data warehouse and mart environments, and the people/roles responsible for critical events, makes effective communication possible.

Ideally, the complete "supply chain" of data delivery should be visible to any participant in this environment. For example, a mart user should be able to determine what queries are available, where the mart data comes from, who is responsible along the way, and the cycles of data availability and refreshment. An operation task manager should be able to determine which predecessor operations he is impacted by, which successor operations he affects, who the quality authority is for an operation, and which marts are ultimately affected. A production systems manager should be able to map production data to business themes, and trace the threads of business data from the production system to the data warehouse and marts. Each of these roles needs to know who to notify when something goes wrong (and who to praise when things go right). Rules and policies take special effort to administer because they can be so difficult to automate.

4.14 Managing Meta Data Definition, Population and Currency

Throughout this paper I've identified the importance of meta data in managing the data warehouse environment. Meta data identifies the objects to be managed, the processes, the responsibilities, and the higher critical thinking about business themes and processes, data quality, and mart objectives.
Meta data must be managed with much the same (or greater) rigor as business data, because it ultimately influences how business data is utilized and understood. The previous section, Managing Rules, Policies and Notifications, discussed the critical role that meta data about supply chains, roles and responsibilities can play in keeping the warehousing community well-informed. An active meta data repository can be a powerful vehicle for active and passive communication across management, operations and end users. When the key roles in the warehouse environment keep their meta data current, anyone else can assess the state of the environment for themselves: this is passive communication. Conversely, when a crisis occurs, meta data can be used to ensure that impacted roles are actively notified.

The value of the data warehouse is built on credibility. Such credibility is possible only when there is a common understanding of business themes and processes, and of the data that business decisions are made from. Many of the same decisions made regarding data mart data must likewise be made for meta data. When an enterprise-wide repository is populated, it essentially becomes the "meta data mart". So the following decisions must be made about the data:

- Who is the user of the data, and what is that user trying to accomplish?
- How should the data be organized to make it more accessible to the target users?
- How shall change control be managed? As meta data evolves, how important is versioning of historical configurations?
- How often, and through what methods, will meta data be refreshed or augmented?
- How should the threads between business themes and initiatives, production systems, database schema, standards, content quality and responsibilities be modeled in the meta data? How will these threads (relationships) get established? Which are most crucial?
- What interfaces are necessary to make the relevant meta data available to the global audience?
- How will sensitive meta data be protected? Who will manage the integrity of the meta data – the Data Administration group?

… and so on. Effective meta data management affects the entire operation. The information systems (IS) group, which often gets saddled with most of the data warehouse operation responsibilities, is most directly impacted. The common perception is that IS is responsible for the integrity of the data and how it is used, so IS has a vested interest in making sure everyone uses it as well as they can; otherwise, the repercussions will come back to IS.

4.15 Managing History, Accountability and Audits

Data warehouse projects are some of the most mission-critical, expensive and visible initiatives in the corporation. A significant number of operations go into the day-to-day care and feeding of a large warehousing environment, and there are many opportunities for the objectives of the warehouse to be compromised. As DSS activities become more crucial to the business, it is necessary to establish certain "assurances". Corporate auditors may perform regular operation audits (over and above those performed within the immediately responsible organizations). Such audits may require that logs be captured of critical processing, and that a record of local audits, change control and oversight be maintained.

Ideally, logs should capture measurement criteria. For example, daily reports from the data cleansing operations might indicate the volume of spurious values in crucial value domains. These reports can be logged; any exceptions over prescribed thresholds can be escalated for immediate attention. If the steps taken to resolve these problems are also logged, then the auditors have a check and balance, and a measure of performance (time to correct the problem).
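A minimal sketch in Python of that logging and escalation loop; the domain name, the volumes and the 2% policy threshold are invented for the example:

    # Minimal sketch: log a daily cleansing report and escalate when the
    # spurious-value rate exceeds a policy threshold. Figures are illustrative.
    import logging

    logging.basicConfig(level=logging.INFO)
    log = logging.getLogger("warehouse.cleansing")

    SPURIOUS_THRESHOLD = 0.02   # policy: escalate above 2% spurious values

    def log_daily_report(domain_name, records_seen, spurious_found):
        rate = spurious_found / records_seen
        log.info("domain %s: %d of %d values spurious (%.2f%%)",
                 domain_name, spurious_found, records_seen, rate * 100)
        if rate > SPURIOUS_THRESHOLD:
            # In practice this would notify the domain's quality authority.
            log.warning("ESCALATION: %s spurious rate %.2f%% exceeds policy",
                        domain_name, rate * 100)

    log_daily_report("region_code", records_seen=120000, spurious_found=3600)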
Logs such as these must be retained for a prescribed period, perhaps through specific archival procedures that facilitate the audits. Because of the critical importance of meta data, the management practices involved in maintaining meta data should also be audited.

4.16 Managing Internal Access, External Access and Security

By definition, the data used in critical decision support systems is "sensitive". The definitions of the warehouse and marts paint a picture of how the business is strategically managed. If this information falls into the wrong hands, competitive advantage can be compromised. Conversely, if the information is not placed into the right hands, it quickly diminishes in strategic value.

More and more progressive companies are exploring the value of making certain mart data available to the supply chain (also called the business ecosystem) they are involved with. For example, a regional distributor may be granted access to the parts delivery marts of the manufacturers it buys parts from, and may in turn grant its customers access to its own regional warehouse stock and delivery queue marts. These collaborations save organizations money (reduced inventory) and allow them to provide greater customer service (visibility to the event horizon).

The management of data accessibility can be complex in "open" environments. Internal information consumers must be administered on a need-to-know basis, while mart prototypers and data miners may need special (more comprehensive) access. Data access for outside entities must be administered across firewalls, and through web or virtual private network (VPN) mechanisms. It may be necessary to encrypt especially sensitive data that travels beyond protected boundaries. Decisions about data access are necessarily made at business unit levels, but must be administered at technical IT levels. Administering how internal and external users access DSS information is essential to the success of the warehouse. This includes monitoring usage, which we explore in a later section.

4.17 Managing Systems Integration Aspects

A complex data warehousing environment can require significant systems integration. Each interface between applications, databases, ERP systems, EDI sources and OLAP tools represents a potential risk for compromising the value of information, or impeding performance. Even the localized architecture between the production systems, operational data stores, warehouse and marts can yield many systems integration issues. The team responsible for the data warehouse architecture (as discussed in the section Managing Data Warehouse Architectures) can handle many of the systems integration issues. However, this team may lack the skills necessary to interface with ERP systems, or to integrate significant new technologies. Additional resources with specialized skills may need to be assigned to the warehouse project on a one-time, periodic or full-time basis. Managing the interfaces with ERP packages, OLAP tools and other special utilities may necessitate additional kinds of meta data.
Where possible, meta data extraction tools should be purchased or developed to keep the meta data in the corporate or tactical repositories in step with the meta data stored independently by packages, tools or management subsystems.

4.18 Managing Usage, Growth, Costs, Charge-backs and ROI

Data warehouse projects are some of the most expensive and most visible initiatives in the enterprise, so it comes as no surprise that they come under heavy scrutiny. Planners and mart designers need to consider usage and growth. Project managers, data administrators and database administrators need to accrue various project costs (relating to design and operations) to the warehousing effort. Accountants need to distribute costs across the benefiting organizations, because no one department should have to carry the budget responsibility for the overall warehouse. And executive management wants to ensure that a return on investment is realized. This section discusses a few of the issues involved.

Usage

Departments want to know who is using the marts, how often, and through which queries. In the earlier section Managing Data Quality and Usage, we discussed that data which goes unused represents the greatest quality risk. The data query environment can be monitored with various devices (such as SQL exits) to measure usage; where such monitors are unavailable, periodic polls can be taken. Unused queries can be removed from general availability, and unused data can be eliminated from the relevant schemas at an appropriate change control juncture.

Growth

Data warehouse projects are known to experience dynamic growth in their consumption of resources. Unless huge volumes of DASD are idle and available, advance planning is a must. At least some of the growth planning can be done from a combination of planned "mart starts" and historical experience of warehouse and mart growth; of course, the latter is not available for the first warehousing project, so industry metrics or peer experience can be substituted.

Cost Management

Cost management is an accounting function, but actual cost data must be gathered from IT operations, from the contributing data administration and database administration groups, and from end users. Various tools may be capitalized and depreciated, while others may be expensed against one or more startup projects. Centralized data warehouse costs can be aggregated; then a "slice" of the expenses can be charged back to each benefiting business organization. This distribution of expenses can be done simply (with an equal distribution), but most using organizations argue for distribution according to use. This makes the usage monitoring efforts all the more important.
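A minimal sketch in Python of usage-based distribution, using query counts as the usage measure; the departments and all figures are invented:

    # Minimal sketch: charging aggregated warehouse costs back to business
    # units in proportion to measured usage. All figures are illustrative.
    monthly_cost = 250000.00
    query_counts = {"Sales": 5400, "Marketing": 2700, "Finance": 900}

    total_queries = sum(query_counts.values())
    for dept, count in query_counts.items():
        share = monthly_cost * count / total_queries
        print(f"{dept}: ${share:,.2f}")
    # Sales: $150,000.00   Marketing: $75,000.00   Finance: $25,000.00

Query counts are only one possible usage measure; rows returned or CPU consumed may distribute costs more fairly, which again depends on the usage monitoring discussed above.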
4.19 Managing Change – the Evolving Warehouse

As business users start to work with integrated business data and discover new value in the information, they often change the requirements. Keeping pace with these demands means developers must make enhancements quickly and cost-effectively, and facilities planners must ensure that adequate CPU and DASD resources are available.

Change management is especially important. There are a number of activities, across departments and organizations, that must be coordinated. When schema or file layouts change, the programs and utilities that use those resources must be changed in coordination; indeed, changes to business elements can and should ripple all the way through warehouse processing, and ultimately influence or alter end user queries. These large-scale changes must be staged for implementation in the future, at a point when all affected parties can be ready. The ripple effects of uncoordinated implementations can be catastrophic and take weeks to undo.

Strong planning and management communication is essential for keeping politics at bay, and for keeping the warehouse focused on key business initiatives. The warehouse environment has the potential to become the “hub” of business decision making; left unchecked, it can easily spawn a new and highly unmanageable breed of applications running against ODS and warehouse data. Excited business areas can exert political forces that warp the original objectives of the warehouse, and lead the warehouse project beyond its constraints of processing window, data quality and general feasibility.

The data warehouse project requires a strong management presence that can maintain and evolve the warehouse vision, while keeping pace with target business needs. The holder of the vision must continually re-cast the role of the data warehouse as it evolves in concert with production systems and business initiatives.

5. Future of Data Warehousing

We can expect data warehousing to continue evolving in the next few years. Perhaps the best way to deduce how data warehousing will evolve is to revisit why it originated in the first place, then evaluate the factors that continue to influence it.

Objectives

Recall that data warehousing was initially invented to work around a serious problem: the operational data buried in production systems was not available, nor clean enough, nor organized properly, for use by decision support systems. Data warehousing solves this problem with a number of solutions that (not surprisingly) retrieve, clean, organize and stage data, and finally make that data available for business queries.

In some ways, current data warehousing is a brute-force methodology that works around problems that mainstream technology cannot solve today. These problems include:

- A single structure of data is insufficient to support varied audiences with separate interests.
- A single body of data cannot support concurrent access from all potential information consumers, and their respective usage patterns, with any semblance of performance.
- Operational data is forever evolving, while data used for analysis must remain relatively stable.
- Organizations require a closer bond between business initiatives and the IT systems that activate them, in order to be more responsive and competitive.

At the heart of data warehousing is “reaching deeper truths”. Many corporations are finally asking the question: “What do we really know about the most important thing in our business – the customer?”. Knowledge that a customer exists is not enough; nor, perhaps, are simple demographics enough.
The deeper truths are discovered by understanding the behavior of the customer, and this means observing patterns: buying trends, shifts in preferences, and so on. It has been said that 90% of the proprietary data we’re protecting today has no value; yet we spend an enormous amount to protect it. Data warehousing is about wringing value from data.

Business Factors

Changes to the nature and positioning of data warehousing will be driven by business factors. Issues such as competitive advantage, gaining market share, focus on strategic objectives, and streamlining business operations will continue to be the main reasons for data warehousing. But the nature of business is changing due to new opportunities like e-commerce and the advent of more powerful enterprise resource planning (ERP) systems.

We can expect increasing interest in the integration of data warehouses or marts into “supply chain” scenarios, since mart data is the most relevant, cleanest, and most semantically pure data that the enterprise can offer to the outside world.

Over time, I anticipate growing pressure to actually replace production systems with ODS or data warehouse-like systems that serve both the production and DSS worlds. There are real technology barriers that make this impossible today; but as these barriers dissolve, people are going to gravitate to the data that is the most accessible and easiest to understand. And this body of people will increasingly come with a business perspective rather than a technical one, i.e. with a predisposition to ODS or mart-like data.

Deeper integration with ERP systems will also drive a shift to better data organization. ERP systems visualize the business in a well-defined, clean and organized fashion. This “information map” corresponds well to ODS or mart data structures, but maps poorly to most production system data structures. So integration with ERP systems will more likely happen through the ODSs or marts.

All this said, the cost of migrating from an established data warehousing architecture to “something new” will be an inhibitor for many organizations, unless the migration can be tied to a specific business initiative with measurable returns that can cover the costs.

Technology Factors

Advancements in technology, and the decreasing costs of CPU and DASD hardware, will continue to drive changes to the data warehouse infrastructure. Data warehousing is fundamentally composed of many low-level technology tasks (extraction, cleaning, moving, distribution, etc.). We can expect technology to improve in remarkable ways; the Nucleus technology mentioned in an earlier section is a good example.

In addition to doing the current processes of data warehousing better, we can expect technology to allow data warehousing to be done differently, and perhaps more simply. For example, data technology is being developed that allows common access, through standards like SQL via ODBC, to databases of different kinds and structures (a simple sketch of this appears at the end of this section). It is not that far-fetched to anticipate technologies that will allow us to combine and view archaic legacy data stores through a “data mart lens”.

Web systems are perhaps on the fringes of a new frontier of consumer behavior monitoring. Remote web sites are able to collect increasingly significant information from the clients that access them. And the new automated “information gathering” services and specialty search engines are a natural proving ground for gathering hard data about consumer behaviors. We can expect this “behavior demographics” information to be sold back into vertical industries, and incorporated into data mining and warehousing initiatives. We can also expect new information delivery mechanisms, such as intranet- or Internet-based publishing or search facilities, to be more commonly utilized on the back end of data warehouses.
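The following is a minimal sketch (in Python, using the pyodbc package) of the kind of common access described above: the same SQL statement issued against two unlike database engines through their ODBC drivers. The data source names, table and columns are hypothetical illustrations.

    # Minimal sketch of common access across unlike databases via ODBC.
    # The DSNs, table and column names are hypothetical; each DSN would be
    # configured to point at a different database engine.
    import pyodbc

    for dsn in ("DSN=LegacyStock", "DSN=RegionalMart"):
        conn = pyodbc.connect(dsn)
        cursor = conn.cursor()
        # One SQL dialect, regardless of the engine behind the driver.
        cursor.execute("SELECT part_no, qty_on_hand FROM stock "
                       "WHERE qty_on_hand < 10")
        for row in cursor.fetchall():
            print(dsn, row.part_no, row.qty_on_hand)
        conn.close()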
Knowledge Engineering

Data warehousing should enjoy increasing synergies with “sister initiatives” such as knowledge engineering and data mining. There is a growing overlap between these fields, and for good reason: they are all involved in deducing knowledge from data. As discussed in this paper, central knowledge engineering is a critical success factor in allowing global organizations to act globally, rather than just locally. Acting globally only happens when people can communicate clearly on important subjects, and cooperate (rather than bump into each other) on enterprise-wide projects.

More Effective Infrastructure & Automation

Data warehousing is a complex environment. It requires cooperation and coordination among many departments; it utilizes many different tools; and it requires significant ongoing management and administration. It is also one of the most dynamically changing environments in organizations today. Data warehousing scenarios are simply too complex to be administered manually or through trivial scheduling tools. A significant portion of the infrastructure, and of the administration, must be automated. Since the quality of warehouse data is inherently tied to the consistency of the processes that create it, there is a growing opportunity for process flow automation.

The goal of automation is to reduce human interactions (human error) and leverage the consistency and predictability of the computer. Ultimately, the goal is for humans to make fewer “mechanical” decisions, and instead respond to events and deal with exceptions. For data warehousing, this implies that a very strong repository of meta data is available about all the business information, activities and deliverables occurring in the warehouse environment, complete with detailed dependencies between warehouse processes and assigned responsibilities. Few repository technologies available today can meet this challenge; without such technology, the risk to data warehouse projects will remain very high.

More Effective Communication

The number of parties that must communicate effectively in the design, planning, day-to-day operations, distribution and mart management activities of the data warehouse can be staggering. This highlights a significant challenge that will undoubtedly be addressed in the future: communication. Effective communication requires a sound information foundation. Certainly, meta data repositories play a key role in laying this foundation and enabling collaboration. Various work group tools (such as Lotus Notes) also provide some communication infrastructure, but may not provide the necessary continuity between the communication and the context of the communication.

Being able to communicate more easily with more people is not the problem. Rather, it is communicating more discretely, and more contextually, with just the right people. In order for this to occur, a more automated process infrastructure must be in place. People must be trained to leverage it, and to communicate through it. And the process infrastructure must be capable of monitoring events, raising exception conditions for review, and notifying responsible people or subsystems; the sketch below illustrates the pattern.
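As an illustration, here is a minimal sketch (in Python) of such a monitoring step: log an event, raise an exception condition when a threshold is exceeded, and notify the responsible party. The step name, threshold and responsibility assignments are hypothetical.

    # Minimal sketch of event monitoring with exception escalation and
    # notification. The step names, threshold and responsible parties are
    # hypothetical illustrations; a real infrastructure would page or
    # e-mail rather than merely log.
    import logging

    logging.basicConfig(level=logging.INFO)
    log = logging.getLogger("warehouse.monitor")

    RESPONSIBLE = {"nightly_extract": "data administration team"}
    REJECT_THRESHOLD = 0.02   # escalate if more than 2% of rows are rejected

    def review_step(step_name, rows_read, rows_rejected):
        """Log a process step's result; escalate when rejects exceed the threshold."""
        rate = rows_rejected / rows_read if rows_read else 1.0
        log.info("%s: %d read, %d rejected (%.2f%%)",
                 step_name, rows_read, rows_rejected, rate * 100)
        if rate > REJECT_THRESHOLD:
            log.warning("%s exceeded threshold; notifying %s",
                        step_name, RESPONSIBLE.get(step_name, "operations"))

    review_step("nightly_extract", rows_read=50_000, rows_rejected=1_500)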
We can expect such an infrastructure to evolve as organizations strive to “close the gap” between the business objective, and the implementation of that objective as realized in IT and DSS applications.

6. Summary

Data warehousing is a significant undertaking for most organizations. Data warehousing projects are notoriously expensive, and history has shown that the overly ambitious projects are prone to failure. The risks are high, but so are the rewards when data warehouses and marts are conceived and employed strategically. As with e-commerce, many companies will reach the point where they can no longer afford not to do some strategic decision support; the competition alone will make it imperative.

This primer has discussed a number of management issues inherent in the data warehouse environment. Chapter 4 identified nearly twenty management aspects that require attention; even if several of these can be consolidated, the breadth of processes, roles, controls and communication that must come together to make data warehousing succeed is plain to see. We are using the breakdown of management issues outlined in Chapter 4 for the development of use cases and solution points for data warehouse management.

A number of roles (actors, in “use case” parlance) are involved in the data warehouse problem space. The spectrum of different needs, ranging from the very technical to the very non-technical, is perhaps the most striking aspect. Just some of the identified roles include:

- End User (Common, Explorer, Prototyper)
- Application developer
- Database administrator
- Data administrator
- Systems programmer
- Network administrator
- Managers (Business unit, DW Operations and IT)
- Data warehouse project administrator

It is clear that a single solution cannot hope to satisfy all these roles; but it is likewise clear that a common information infrastructure must support them all.

There is an obvious need for meta data solutions to support the data warehouse environment. In fact, many of the existing warehouse management tools come with narrowly focused meta data repositories. This is good news and bad news: it is good news that meta data is being leveraged (data warehousing would not be feasible without it); but it is bad news that there are many scattered repositories with no coordination between them. The ongoing challenge for meta data delivery is to provide excellent integration of meta data sources, intuitive organization of meta data subjects, high availability of the meta data store, and focused views for different roles.

Conclusion

It is worth restating the initial premise of data warehousing: Data warehousing is the process of creating useful information that can be used in measuring and managing the business.

Whether data warehousing survives in its current form, or evolves to a more optimal form, will depend on emerging technology and the ingenuity of leaders in this market; regardless, it is clear that the business requirements for strategic decision support will not go away. It is also clear that the complex issues of data warehouse management will not go away either.