Snapshots from Hell: Practical Issues in Data Warehousing

Tyson Lee, LEAP Consulting, Palo Alto, CA

Introduction

Data warehouse projects have a reputation for being complex, costly, and almost certain to fail. According to a recent report from Information Week, the rate of data warehouse project failures, while down from analysts' estimates of 60% to 90% in the 1990s, still hovers around 40%. Lack of planning and poor design also contributed to inflexible, "stovepipe" projects that quickly became legacy baggage in the face of an ever-changing business environment. Instead of reaping the benefits of data warehousing, many practitioners fear the nightmare of delayed timelines and overrun budgets. Some even describe their data warehousing experience as a living hell.

In this paper, the author draws on practical experience to illustrate industry best practices in good data warehouse design, construction and on-going maintenance. Topics include:

• Fundamental Concepts of Data Warehousing
• Zen of Data Warehousing
• Myths and Facts of Data Warehousing
• Validation Issues
• Examples Using SAS®

Database vs. Data Warehouse

On July 21, 2003, keyword searches on Google for 'database' and 'data warehouse' yielded 48.8 million and 612 thousand hits respectively, while the most widely searched keyword, 'sex', yielded 184 million hits. By this measure, while database has become quite a prevalent term, data warehouse remains much less known to the general public.

By definition, both a database and a data warehouse are collections of data. They differ in their purposes and structures.

A relational database (RDB) is typically used for data acquisition and storage, applications known as on-line transaction processing (OLTP). Therefore an RDB has to satisfy certain rules (normalization rules, for instance) to guarantee data integrity and update efficiency. A database is transaction oriented and dynamic. As a result, a database is not friendly to end users for queries and decision-support analyses.

A data warehouse, on the other hand, is "a subject-oriented, integrated, time-variant, nonvolatile collection of data in support of management's decision-making process" (Building the Data Warehouse, Bill Inmon). In a repository data warehouse, the data structure is defined such that it ensures full integration of data and applications in a new environment. Data are pre-processed and relatively stable; tables are merged and de-normalized to facilitate applications known as on-line analytical processing (OLAP). Unlike in an RDB, data redundancy is highly encouraged in a data warehouse environment.

                        Database            DW
  Objective             Transactional       Analytical
  Data Structure        Normalized          Denormalized
  Organization          Separated           Integrated
  Update Frequency      Dynamic             Static
  Interface             Data Acquisition    Decision Support
  System Administrator  DBA                 DWA

Over the years, data warehouses have been implemented in a number of industries including market research, financial risk management, investment modeling and business intelligence. In the pharmaceutical and biotechnology industry, repository data warehouses are commonly used in e-submissions to provide users with integrated, analytical information rather than raw data.
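To make the normalized-versus-de-normalized contrast above concrete, the step below merges two normalized source tables into one wide analysis table. This is only a minimal sketch: the library CLIN, the data sets DEMOG and VITALS, and the target table PT_VITALS are hypothetical stand-ins (with PT as the patient id) and are not part of the case study discussed later in this paper.

*** Sketch: build a permanent, de-normalized analysis ***;
*** table by merging two normalized source tables.    ***;
PROC SORT DATA=CLIN.DEMOG;  BY PT; RUN;
PROC SORT DATA=CLIN.VITALS; BY PT; RUN;

DATA WAREHSE.PT_VITALS;
   MERGE CLIN.DEMOG  (IN=INDEM)
         CLIN.VITALS (IN=INVIT);
   BY PT;
   IF INDEM AND INVIT;   * keep patients present in both sources;
RUN;

The merged table repeats the demographic fields on every vitals record, exactly the kind of redundancy a warehouse tolerates in exchange for simpler analytical queries.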
Zen of Data Warehousing

Yin Yang

In Chinese philosophy, Zen refers to the coexistence and harmony of two opposite components. Due to the complexity of data warehousing, it is invaluable to leverage this philosophy as a set of guiding principles in data warehouse design, implementation, and project management.

Future is Now

"Time is the best teacher. Unfortunately, it killed all its pupils." It is naïve to assume that data warehousing won't bring about technological, process or organizational changes. Data warehousing, in many ways, is an exercise in chasing a moving target. You have probably heard the old adage that there are three types of people in the world: those who make things happen, those who watch things happen, and those who ask, "What happened?" In data warehousing, either you are an agent of change, or you are destined to become a victim of change. A successful practitioner not only anticipates change but embraces it, building the data warehouse with change in mind from the beginning. When designing the data warehouse, one should always keep the schema flexible to accommodate future requirements. Change control procedures should also be implemented to satisfy regulatory requirements and allow speedy implementation.

Slow is Fast

The book In Search of Excellence, by Thomas Peters and Robert Waterman, Jr., reveals an astounding fact: according to a university study, most managers do not regularly invest large blocks of time in planning. Worse, the time they do use is fragmented into tiny segments of activity; the average time allotted to any particular task was only nine minutes. Nine minutes! You cannot even enjoy a good candy bar in nine minutes.

Data warehousing should not be hastened. With the '3D' (Define, Design, Deliver) approach to data warehousing, one is expected to spend a large amount of time up front in the initial phase of definition and planning. The rule of thumb is this: cut a corner by one hour in the definition and planning phase, and you can expect to pay the price of ten hours and one hundred hours respectively in the subsequent phases, a payback ratio of 1:10:100. A slow, systematic start can set the project on the right track and will pay off in the long run. The key is doing things right the first time.

Less is More

It is inconceivable that any sophisticated data warehouse can address all business scenarios and needs. The '80/20' rule tells us that you can spend 20% of the effort to build a data warehouse that will meet 80% of the needs. However, if you try to perfect it by addressing the remaining 20% of the requirements, you will end up spending 80% of the effort. This is where a lot of projects have failed: they underestimated the effort/payback ratio. The key is to have a few excellent 'quick hits' in your data warehousing effort that target top priorities – the mission-critical business issues with the greatest ROI. Don't try to warehouse all your data.

Small is Big

Despite the fact that data redundancy is encouraged in a data warehouse for the ease of analytical processing, elegant design and a small footprint are still very important. In other words, redundancy should not be created casually. In many cases, views should be used rather than permanent tables. The advantages of views are dynamic linking and a small footprint; the downside is the performance sacrificed at run time.
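As a minimal sketch of this trade-off, the step below defines a SQL view instead of a permanent, de-normalized table. The source names reuse the hypothetical CLIN.DEMOG and CLIN.VITALS tables introduced earlier, with PT as the patient id and INVSITE assumed to live only in DEMOG; they are illustrative only, not part of the case study later in the paper.

*** Sketch: a SQL view keeps the footprint small and   ***;
*** always reflects the current source tables, but the ***;
*** join is re-executed every time the view is read.   ***;
PROC SQL;
   CREATE VIEW WAREHSE.PT_VITALS_V AS
   SELECT V.*, D.INVSITE
   FROM CLIN.VITALS AS V
        INNER JOIN CLIN.DEMOG AS D
        ON V.PT = D.PT;
QUIT;

A view like this stores only the query definition, so it costs almost nothing to keep, while a permanent merged table trades storage for faster reads.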
Myths of Data Warehousing

Myth #1: We can do well without data warehousing.

A lot of companies out there judge the consequences of data warehousing to be a greater risk than the risk of not doing it. This is particularly true in today's stalled economy. However, history has taught us many lessons about doing nothing. Montgomery Ward did nothing, or very little, to ensure its survival. Xerox did nothing as the future of traditional copying became grimmer and grimmer, and Xerox's own inventions – the mouse, the graphical user interface – were grabbed by Apple. The airlines did nothing, even as they saw Southwest achieving the supposedly unthinkable. Sears did nothing as two guys who got fired from their last job – Bernie Marcus and Arthur Blank – built Home Depot into a chain of more than 1,400 stores. Procter and Gamble and Kraft General Foods did nothing as Starbucks redefined coffee. I could go on and on, but the key point is that doing nothing bears a greater risk when your competitors are gaining a better understanding of the business through data warehousing and business intelligence. As the poet Charles Wright has said, "What we refuse defines us."

Myth #2: A data warehouse is just a fancy name for analysis files.

Can you imagine an automaker with the mentality that "a car is just a platform running on wheels" building top-quality automobiles over time? Just as you would not call an Excel spreadsheet a database, a loosely tied set of analysis files does not necessarily qualify as a data warehouse. Building a data warehouse involves careful planning, deliberate design, and systematic implementation. One has to add in bells and whistles such as data schemas (conceptual, logical and physical data models), data standards, derivation rules, and metadata management. Calling analysis files a data warehouse understates the complexity of data warehousing. Practitioners with this mentality are doomed to fail in the face of changing needs from diverse users.

Myth #3: Standardization slows down the data warehouse project.

Albeit time-consuming, standardization is an essential part of data warehousing. Projects that do not take data standards seriously end up paying a higher toll. Instead of reinventing the wheel, try to use published, well-adopted data standards for reference. Examples are CDISC in the biopharmaceutical industry and HL7 in the healthcare industry.

Myth #4: A DBA can naturally be a good DWA (Data Warehouse Administrator).

The DWA role requires a different skill set than the one possessed by most DBAs. Inadequate user involvement is ultimately responsible for most data warehousing failures. A good DWA wears many hats: subject matter expert (SME), efficient project manager, and savvy leader and organizer. He or she needs to be able to work with many cross-functional teams to promote data standards and collaboration so that the final deliverables bring information to end users' fingertips.

Validation Issues

It is important to verify the results before applying any permanent data changes to tables in a data warehouse. In particular, in the biopharmaceutical industry, FDA regulation 21 CFR Part 11 makes it important to document the changes. In SAS, PROC COMPARE can be a powerful tool for comparing two data sets.

By default, the COMPARE procedure generates the following five reports:

1. data set summary
2. variables summary
3. observation summary
4. values comparison summary
5. value comparison results for all variables judged unequal

These reports are helpful but lengthy. Options in the COMPARE procedure can be specified to produce customized reports. The following macro prints out the differences in a data set before and after a change.

****************************************;
* Macro to compare a data set before   *;
* and after a data update              *;
****************************************;
%MACRO COMPARE(DATASET, KEYS);

PROC COMPARE BASE=WAREHSE.&DATASET
             COMPARE=TEMP.&DATASET
             NOPRINT OUT=RESULT
             OUTNOEQUAL OUTBASE OUTCOMPARE OUTDIF;
   ID &KEYS;
RUN;

PROC PRINT DATA=RESULT;
   BY &KEYS;
   ID &KEYS;
   TITLE "Discrepancies in &DATASET before and after";
RUN;

%MEND COMPARE;

However, PROC COMPARE does not provide much information on added or deleted observations. In this case, SQL can be used to generate a customized validation report. A more sophisticated approach is to develop a set of audit trail utilities using a variety of tools.
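A minimal sketch of such a SQL report follows, using the same WAREHSE and TEMP libraries and the INV/PT keys as the program in the next section; the table name DEMOG is a hypothetical stand-in. Key values returned by the first query exist only in the original table (deleted observations); those returned by the second exist only in the updated copy (added observations).

*** Sketch: report observations dropped or added by an ***;
*** update, which PROC COMPARE does not itemize.       ***;
PROC SQL;
   TITLE "Observations removed by the update";
   SELECT INV, PT FROM WAREHSE.DEMOG
   EXCEPT
   SELECT INV, PT FROM TEMP.DEMOG;

   TITLE "Observations added by the update";
   SELECT INV, PT FROM TEMP.DEMOG
   EXCEPT
   SELECT INV, PT FROM WAREHSE.DEMOG;
QUIT;
TITLE;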
Data "Where" House: Knowing Where to Fix

A more critical issue is maintaining the data warehouse over the long run. Specifically, when requests come in to fix data problems, it is critical to know which tables to fix. Since redundant data have been populated throughout a data warehouse, identifying where to fix is tricky, and an effective approach can dramatically improve your effectiveness.

Case Study: In a clinical trial data warehouse, it is decided to disqualify one investigator from the final ISE (Integrated Summary of Efficacy) analysis. Also, a few patients were transferred from one site to another in the long-term follow-up phase. Global data updates are needed to address these changes.

In this case, a data dictionary comes in handy. A data dictionary, also known as metadata, contains data about the data. In SAS, dictionary tables are generated at run time and are read-only. The following SAS dictionary tables are available by default:

MEMBERS: members of each library and general information about them.
TABLES: detailed information on the members of each library.
COLUMNS: information on variables and their attributes.
CATALOGS: catalog entries in libraries.
VIEWS: views defined in libraries.
INDEXES: information on indexes defined for data files.

For the benefit of readers, let's simplify the case and assume that investigator id, investigator site and patient id are common fields for all tables in the clinical research data warehouse. Otherwise, we can still use the same technique; we just have to take one further step and use the COLUMNS table to identify which tables to fix.
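As a minimal sketch of that extra step, the query below builds the TO_CHG table from DICTIONARY.COLUMNS instead of DICTIONARY.TABLES, so that only warehouse members actually containing the investigator id variable (INV, as in the program below) are selected for updating. This variant could replace the "Get table names" query in the program that follows; the rest of the driver macro would be unchanged.

*** Sketch: restrict the list of tables to fix to ***;
*** those that contain the variable INV.          ***;
PROC SQL NOPRINT;
   CREATE TABLE TO_CHG AS
   SELECT DISTINCT MEMNAME
   FROM DICTIONARY.COLUMNS
   WHERE LIBNAME = 'WAREHSE'
     AND UPCASE(NAME) = 'INV';
QUIT;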
Program:

**********************************************;
* Global changes to the data warehouse:      *;
* - exclude Dr. Jones (id=875)               *;
* - change investigator site for patients    *;
**********************************************;

*** Get table names in the library ***;
PROC SQL NOPRINT;
   CREATE TABLE TO_CHG AS
   SELECT MEMNAME
   FROM DICTIONARY.TABLES
   WHERE LIBNAME = 'WAREHSE';
QUIT;

*** The data update macro ***;
%MACRO CHG_SITE;
   DATA TEMP.&DATASET;
      SET WAREHSE.&DATASET (WHERE=(INV NE '875'));
      IF PT IN ('2488','2487','6213','2486') AND
         INVSITE='7371' AND PHASE=2 THEN INVSITE='0661';
   RUN;
%MEND;

*** Macro to apply global changes ***;
%MACRO CHG_ALL;
   %DO I = 1 %TO &SQLOBS;
      DATA _NULL_;
         I = &I;
         SET TO_CHG POINT=I;
         CALL SYMPUT('DATASET', COMPRESS(MEMNAME));
         STOP;
      RUN;
      %CHG_SITE
      %COMPARE(&DATASET, INV PT)
   %END;
%MEND;

*** Call the macro for global changes ***;
%CHG_ALL

This program automatically updates the tables in the warehouse one after another and prints the differences before and after each change. A couple of tips:

• The SQL procedure sets up a macro variable, SQLOBS, which contains the number of rows processed.
• The COMPRESS system option needs to be set to NO; otherwise the POINT= option will not work.

More Considerations

We live in a fast-paced, dynamic information age. Data warehousing turns quantitative raw data into more user-friendly, qualitative information for better decision-making. However, data warehousing is a painstaking, laborious process, and due to its complexity, this seemingly value-added effort can quickly turn into a nightmare if mismanaged. Because of the characteristics of a data warehouse, special considerations are needed to build and maintain a data warehouse environment. We must consider both efficiency (doing it right) and effectiveness (doing the right thing) in designing, building and maintaining a data warehouse.

First of all, a data warehouse may not have the sophisticated audit trail and rollback capabilities that a traditional RDBMS has. Data backup prior to each change is critical, and data validation is essential, even for a simple change.

Appropriate documentation of data warehousing activities needs to be maintained. This is extremely important in a regulated environment such as a clinical research system. The documentation should include the original requests, test plan, expected and actual results, implementation procedure, etc. A SAS audit trail facility can also be developed.

A critical issue in maintaining a data warehouse is how to identify which tables to fix. A SAS data dictionary can be very useful here; knowing where to fix fundamentally improves the effectiveness of data warehouse maintenance.

Due to the data redundancy in a data warehouse, efficiency is often a concern in building and maintaining it. A small efficiency gain in a macro program is multiplied when the program is applied repeatedly and globally. Consider the trade-off between computer resources and human effort; after all, efficiency is measured by a combination of CPU, I/O and human costs.

Use a data-driven approach rather than writing a new program each time. Use the macro facility to generate SAS program code instead of writing it explicitly. SAS/ASSIST® can also be utilized as a CASE tool for code generation. All of these techniques are geared towards minimizing the chances for human error in building and updating a data warehouse. Build your data warehouse with a flexible data schema to accommodate future business requirement changes. Let your data warehouse work as hard as you do!

Acknowledgements

The author is thankful to Lindley Frahm and Oscar Cheung for reviewing the manuscript and providing invaluable feedback.

Author Information

Tyson Lee provides technical and management consulting services. He has over 12 years of industry experience working for market leaders and has served as Associate Director of Biometrics. He spearheaded the data warehousing initiative at Genentech and won the prestigious Technological Achievement Award at Eli Lilly and Company. Tyson is a recognized contributor and speaker at WUSS, SUGI and PharmaSUG. He has won a best paper award at WUSS and has been inducted into the International Who's Who in Information Technology. Tyson holds an MBA from UC Berkeley's Haas School of Business and an MA in Econometrics & Computer Science.

Contact:
Tyson Lee
LEAP Consulting
Palo Alto, CA 94303
ttlee88@yahoo.com