ISSUES AFFECTING THE DATA WAREHOUSE EFFICIENCY

Jaee Ranavde
B.E., Mumbai University, India, 2004

PROJECT

Submitted in partial satisfaction of the requirements for the degree of

MASTER OF SCIENCE
in
BUSINESS ADMINISTRATION
(Management Information Systems)
at
CALIFORNIA STATE UNIVERSITY, SACRAMENTO

FALL 2010

ISSUES AFFECTING THE DATA WAREHOUSE EFFICIENCY

A Project
by
Jaee Ranavde

Approved by:
__________________________________, Committee Chair
Dr. Monica Lam, Ph.D.
____________________________
Date

Student: Jaee Ranavde

I certify that this student has met the requirements for format contained in the University format manual, and that this project is suitable for shelving in the Library and that credit is to be awarded for the project.

__________________________, Dean                ________________ Date
Sanjay Varshney, Ph.D., CFA
College of Business Administration

Abstract
of
ISSUES AFFECTING THE DATA WAREHOUSE EFFICIENCY
by
Jaee Ranavde

Data warehousing has become enormously important to businesses around the world. Companies commonly use data warehouses to analyze trends over time. A data warehouse can support views of day-to-day operations, but its primary function is to facilitate strategic planning based on long-term overviews of data. From such overviews, business models, forecasts, and other reports and projections can be made. Given the importance data warehousing has gained, it becomes necessary to study the factors that hinder its efficiency. These problems have to be studied in detail to isolate the reasons for their existence. The aim of this project is to understand the issues affecting data warehousing efficiency. Understanding these issues will help us design better data warehouses in the future.

_______________________, Committee Chair
Dr. Monica Lam, Ph.D.
_______________________
Date

TABLE OF CONTENTS

List of Figures
Chapter
1. BACKGROUND AND OVERVIEW
2. THE GOAL OF THE PROJECT
   2.1 Objective 1
   2.2 Objective 2
   2.3 Objective 3
   2.4 Objective 4
3. LITERATURE REVIEW
   3.1 Data Warehouse Architecture Goals
   3.2 Data Warehouse Users
   3.3 How Users Query the Data Warehouse
   3.4 Principles of Data Warehouse Design and Implementation
      3.4.1 Organizational Consensus
      3.4.2 Data Integrity
      3.4.3 Implementation Efficiency
      3.4.4 User Friendliness
      3.4.5 Operational Efficiency
      3.4.6 Scalability
      3.4.7 Compliance with IT Standards
   3.5 Data Warehouse Issues
      3.5.1 Loading and Cleansing Data
      3.5.2 Cost and Budgeting Issues
      3.5.3 Data Warehousing Security Issues
      3.5.4 Maintenance Issues for Data Warehousing Systems
4. SIGNIFICANCE OF THE PROJECT
5. RESEARCH METHODOLOGY
6. CASE STUDIES
   6.1 John Deere – Interview with Project Manager
   6.2 Kyobo Life Insurance – Case Study One
   6.3 Scandinavian Airlines – Case Study Two
   6.4 Philips Consumer Electronics – Case Study Three
7. SURVEY
   7.1 Sample Population
   7.2 Sample Frame
   7.3 Survey Response
   7.4 Data Collection via Online Survey
8. CASE STUDIES AND SURVEY OBSERVATION
9. CONCLUSION
Appendix. Questionnaire
Bibliography
LIST OF FIGURES

Figure 1: Number of Respondents from Different Data Warehouse Sizes
Figure 2: Respondents' Roles in the Data Warehouse
Figure 3: Importance of Each Feature to Respondents in Different Roles
Figure 4: Issues Faced by Respondents in Different Roles
Figure 5: Importance of Various Features in Different Sized Data Warehouses
Figure 6: Issues in Different Sized Data Warehouses

Chapter 1
BACKGROUND AND OVERVIEW

A data warehouse is a repository of an organization's electronically stored data. Data warehouses are designed to facilitate reporting and analysis, and they are commonly used by companies to analyze trends over time. Companies can use a data warehouse to view day-to-day operations, but its primary function is facilitating strategic planning based on long-term overviews of data. From such overviews, business models, forecasts, and other reports and projections can be made.

Data warehousing has become necessary for an enterprise of any size to make intelligent decisions, and it enables competitive advantage. Data warehousing captures data and their relationships, and it is the foundation for Business Intelligence (BI). It clearly draws the distinction between data and information: data warehousing emphasizes organizing, standardizing, and formatting facts in such a way that information can be derived from them, while BI is concerned with acting on that information. The primary goal of any data warehouse is to integrate data from disparate sources into a centralized store, where data can be used across the enterprise for decision support.

Data warehousing is used not only as a method to organize data but also to protect information from large-scale data loss in unforeseen incidents such as natural calamities and mass data corruption. Many multinational companies use data warehousing systems to protect their important data from such causes. The economy has to continue to grow even in times of disaster. Therefore, it is very important that all useful data be backed up in a data warehouse, checked, and maintained at safe locations.

Chapter 2
THE GOAL OF THE PROJECT

It is clear that data warehouse design and management are critical to the successful functioning of a data warehouse system. There are many theoretical explanations of how to design an efficient and effective data warehouse, but many companies still face hurdles when working in a data warehouse environment; they face various issues that hinder its smooth functioning. Therefore, there is a strong need to understand the issues affecting data warehouse efficiency. The goal of this project is to identify these issues and to verify whether they hold in practice by using an interview, case studies, and an online survey.

2.1 Objective 1

The project first attempts to understand the critical success factors in designing a data warehouse environment.

2.2 Objective 2

The project then tries to understand the current major issues faced by businesses, through case studies and an online survey.

2.3 Objective 3

The issues affecting data warehouse efficiency, in theory and in practice, are discussed.

2.4 Objective 4

Finally, a few ways to overcome or minimize these issues are discussed.
The overall goal of the project is to understand the critical factors for a successful data warehouse design and how they should be taken into consideration when designing a data warehouse environment.

Chapter 3
LITERATURE REVIEW

A data warehouse is a repository of an organization's electronically stored data, designed to facilitate reporting and analysis. This definition focuses on data storage. However, the means to retrieve and analyze data, to extract, transform, and load data, and to manage the data dictionary are also considered essential components of a data warehousing system. Many references to data warehousing use this broader context. Thus, an expanded definition of data warehousing includes business intelligence tools, tools to extract, transform, and load data into the repository, and tools to manage and retrieve metadata.

In contrast to data warehouses are operational databases that support day-to-day transaction processing. Data warehouses are expected to provide storage, functionality, and responsiveness to queries beyond the capabilities of today's transaction-oriented databases, and they are also intended to improve the data access performance of databases. Traditional databases balance the requirement of data access with the need to ensure the integrity of the data. In present-day organizations, users of data are often completely removed from the data sources. Many people only need read access to data, but they still need very rapid access to a larger volume of data than can be conveniently downloaded to the desktop. Often such data comes from multiple databases. Because many of the analyses performed are recurrent and predictable, software vendors and systems support staff have begun to design systems to support these functions. There is now a need to provide decision makers, from middle management upward, with information at the correct level of detail to support decision-making. Data warehousing, online analytical processing (OLAP), and data mining provide this functionality.

Before embarking on the design of a data warehouse, it is imperative that the architectural goals of the data warehouse are clear and well understood. Because the purpose of a data warehouse is to serve users, it is also critical to understand the various types of users, their needs, and the characteristics of their interactions with the data warehouse.

3.1 Data Warehouse Architecture Goals

A data warehouse exists to serve its users, the analysts and decision makers. A data warehouse must be designed to satisfy the following requirements [MSDN Microsoft Library]:
- Deliver a great user experience; user acceptance is the measure of success.
- Function without interfering with OLTP systems.
- Provide a central repository of consistent data.
- Answer complex queries quickly.
- Provide a variety of powerful analytical tools, such as OLAP and data mining.

Most successful data warehouses that meet these requirements have these common characteristics [MSDN Microsoft Library]:
- Are based on a dimensional model.
- Contain historical data.
- Include both detailed and summarized data.
- Consolidate disparate data from multiple sources while retaining consistency.
- Focus on a single subject, such as sales, inventory, or finance.

Data warehouses are often quite large. However, size is not an architectural goal; it is a characteristic driven by the amount of data needed to serve the users.
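To make the dimensional model mentioned above concrete, the sketch below builds a minimal star schema for a single subject area (sales) and derives one summarized result from the detailed fact rows. The table and column names are invented for illustration, and SQLite is used only so the example is self-contained; this is a sketch of the idea, not a prescription for any particular warehouse platform.

```python
import sqlite3

# A minimal, hypothetical star schema: one fact table surrounded by dimensions.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_date    (date_key INTEGER PRIMARY KEY, full_date TEXT, month INTEGER, year INTEGER);
CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY, product_name TEXT, category TEXT);
CREATE TABLE dim_store   (store_key INTEGER PRIMARY KEY, store_name TEXT, region TEXT);

-- The fact table keeps detailed measures plus foreign keys to each dimension.
CREATE TABLE fact_sales (
    date_key     INTEGER REFERENCES dim_date(date_key),
    product_key  INTEGER REFERENCES dim_product(product_key),
    store_key    INTEGER REFERENCES dim_store(store_key),
    units_sold   INTEGER,
    sales_amount REAL
);

INSERT INTO dim_date    VALUES (20101001, '2010-10-01', 10, 2010), (20101002, '2010-10-02', 10, 2010);
INSERT INTO dim_product VALUES (1, 'Widget', 'Hardware'), (2, 'Gadget', 'Hardware');
INSERT INTO dim_store   VALUES (1, 'Sacramento', 'West');
INSERT INTO fact_sales  VALUES (20101001, 1, 1, 10, 250.0), (20101002, 2, 1, 4, 120.0);
""")

# Summarized data is derived from the detailed rows with a simple aggregate query.
summary = conn.execute("""
    SELECT d.year, p.category, SUM(f.units_sold) AS units, SUM(f.sales_amount) AS sales
    FROM fact_sales f
    JOIN dim_date    d ON d.date_key = f.date_key
    JOIN dim_product p ON p.product_key = f.product_key
    GROUP BY d.year, p.category
""").fetchall()
print(summary)   # [(2010, 'Hardware', 14, 370.0)]
```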
3.2 Data Warehouse Users

The success of a data warehouse is measured by its acceptance among users. Without users, historical data might as well be archived to magnetic tape and stored away. Successful data warehouse design starts with understanding the users and their needs. Data warehouse users can be divided into four categories: statisticians, knowledge workers, information consumers, and executives.

Statisticians: There are typically only a handful of sophisticated analysts, the statisticians and operations research types, in any organization. Though few in number, they are some of the best users of the data warehouse, and their work can contribute to closed-loop systems that deeply influence the operations and profitability of the company. These people are often very self-sufficient and need only to be pointed to the database and given some simple instructions about how to get to the data and what times of day are best for performing large queries to retrieve data to analyze with their own sophisticated tools [MSDN Microsoft Library].

Knowledge Workers: A relatively small number of analysts perform the bulk of new queries and analyses against the data warehouse. These are the users who get the "Designer" or "Analyst" versions of user access tools. They will figure out how to quantify a subject area. After a few iterations, their queries and reports typically get published for the benefit of the information consumers. Knowledge workers are often deeply engaged with the data warehouse design and place the greatest demands on the ongoing data warehouse operations team for training and support [MSDN Microsoft Library].

Information Consumers: Most users of the data warehouse are information consumers; they do not compose true ad hoc queries. They use static or simple interactive reports that others have developed, and they usually interact with the data warehouse only through the work product of others. This group includes a large number of people, and the published reports are highly visible. A strong communication infrastructure must be set up for distributing information widely and for gathering feedback from these users to improve the information sites over time [MSDN Microsoft Library].

Executives: Executives are a special case of the information consumers group. Few executives actually issue their own queries, but an executive's slightest musing can generate a flurry of activity among the other types of users. A wise data warehouse designer should develop a sophisticated digital dashboard for executives, provided it is easy and economical to do so [MSDN Microsoft Library].

3.3 How Users Query the Data Warehouse

Information for users can be extracted from the data warehouse relational database or from the output of analytical services such as OLAP or data mining. Direct queries to the data warehouse relational database should be limited to those that cannot be accomplished through existing tools, which are often more efficient than direct queries and impose less load on the relational database [MSDN Microsoft Library].

Reporting tools and custom applications often access the database directly. Statisticians frequently extract data for use by special analytical tools. Analysts may write complex queries to extract and compile specific information not readily accessible through existing tools. Information consumers do not interact directly with the relational database but may receive e-mail reports or access web pages that expose data from the relational database.
Executives use standard reports or ask others to create specialized reports for them.

3.4 Principles of Data Warehouse Design and Implementation

A data warehouse can quickly descend into chaos if it is not designed, implemented, and maintained correctly. The following 'Seven Principles of Data Warehousing' help keep a data warehouse design and implementation on the road to achieving the desired results [http://www.academictutorials.com/data-warehousing/].

3.4.1 Organizational Consensus

From the outset of the data warehousing effort, there should be a consensus-building process that helps guide the planning, design, and implementation process. If the knowledge workers and managers see the data warehouse as an unnecessary intrusion, or worse, a threatening intrusion, into their jobs, they won't like it and won't use it. Every effort must be made to gain acceptance for, and minimize resistance to, the data warehouse. If the stakeholders are involved early in the process, they are much more likely to embrace the data warehouse, use it and, hopefully, champion it to the rest of the company [http://www.academictutorials.com/data-warehousing/].

3.4.2 Data Integrity

The critical function of data warehousing, as of any business intelligence (BI) project, is to provide a single version of the truth about organizational data. The path to this brass ring begins with achieving data integrity in the data warehouse. Therefore, any design for the data warehouse should begin by minimizing the chances for data replication and inconsistency. It should also promote data integration and standardization. Any reasonable methodology chosen to achieve data integrity should be implemented with the end result in mind [http://www.academictutorials.com/data-warehousing/].

3.4.3 Implementation Efficiency

To help meet the needs of the company as early as possible and minimize project costs, the data warehouse design should be straightforward and efficient to implement. This is truly a fundamental design issue. A technically elegant data warehouse could be designed; however, if that design is difficult to understand or implement, or does not meet user needs, the data warehouse project will be mired in difficulties and cost overruns almost from the start. It is wise to opt for simplicity in the design plans and to choose, to the most practical extent, function over beautiful form. This choice will help the project stay within budgetary constraints, and it will go a long way toward meeting user needs effectively [http://www.academictutorials.com/data-warehousing/].

3.4.4 User Friendliness

User friendliness and ease-of-use issues, though they are addressed by technical people, are really business issues. This is because if the end business users do not like the data warehouse, or if they find it difficult to use, they will not use it, and all the work will be for nothing. To help achieve a user-friendly design, the data warehouse should leverage a common front end across the company, based on user roles and security levels. It should also be intuitive enough to have a minimal learning curve for most users. Of course, there will be exceptions. The rule of thumb is: the least technical users should find the interface reasonably intuitive [http://www.academictutorials.com/data-warehousing/].

3.4.5 Operational Efficiency

This principle is really a corollary to the principle of implementation efficiency. Once implemented, the data warehouse should be easy to support and should facilitate rapid responses to business change requests.
Errors and exceptions should also be easy to remedy, and support costs should be moderate over the life of the data warehouse. The reason this principle is a corollary to the implementation efficiency principle is that operational efficiency can be achieved only with a data warehouse design that is easy to implement and maintain. Again, a technically elegant solution might be attractive, but a practical, easy-to-maintain solution will yield better results in the long run [http://www.academictutorials.com/data-warehousing/].

3.4.6 Scalability

Scalability is often a big problem with data warehouse design. The solution is to build in scalability from the start. Choose toolsets and platforms that support future expansions of data volumes and types as well as changing business requirements. It is also a good idea to look at toolsets and platforms that support integration of, and reporting on, unstructured content and document repositories [http://www.academictutorials.com/data-warehousing/].

3.4.7 Compliance with IT Standards

Perhaps the most important IT principle to keep in mind is not to reinvent the wheel when building the data warehouse. That is, the toolsets and platforms chosen to implement the data warehouse should conform to and leverage existing IT standards. The existing skill sets of IT and business users must be leveraged. In a way, this is a corollary of the user friendliness principle: the more the users know going in, the easier they will find the data warehouse to use once they see it [http://www.academictutorials.com/data-warehousing/].

3.5 Data Warehouse Issues

There are certain issues surrounding data warehouses that companies need to be prepared for. A failure to prepare for these issues is one of the key reasons why many data warehouse projects are unsuccessful.

3.5.1 Loading and Cleansing Data

One of the first issues companies need to confront is that they are going to spend a great deal of time loading and cleansing data (a small sketch of a typical cleansing step is given a few paragraphs below). Some experts have said that the typical data warehouse project will require companies to spend 80% of their time on this task. While the percentage may or may not be as high as 80%, one thing to understand is that most vendors will understate the amount of time it takes. While cleansing the data can be complicated, extracting it can be even more challenging. No matter how well a company prepares its project management, it must face the fact that the scope of the project will probably be broader than estimated. While most projects will begin with specific requirements, they will conclude with data. Once the end users see what they can do with the data warehouse after it is completed, it is very likely that they will place high demands on it. While there is nothing wrong with this, it is best to find out what the users of the data warehouse will need next rather than only what they want right now.

Another issue that companies will have to face is problems with the systems that feed information into the data warehouse. When a company enters this stage for the first time, it will find that problems that have been hidden for years suddenly appear. Once this happens, business managers will have to decide whether the problem can be fixed via the transaction processing system or in a read-only data warehouse. It should also be noted that a company will often be responsible for storing data that has not been collected by its existing systems.
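As an illustration of the cleansing work referred to above, the sketch below shows one small, hypothetical standardization step: trimming and normalizing a free-text customer-name field and dropping exact duplicates before the extract is loaded. The file and column names are invented for the example; real ETL tools package many such steps, but the underlying work often looks much like this.

```python
import csv

def standardize_name(raw: str) -> str:
    """Trim, collapse internal whitespace, and title-case a free-text name."""
    return " ".join(raw.strip().split()).title()

def cleanse(rows):
    """Yield rows with standardized customer names, dropping exact duplicates."""
    seen = set()
    for row in rows:
        row["customer_name"] = standardize_name(row["customer_name"])
        key = (row["customer_name"], row["account_id"])
        if key in seen:
            continue            # duplicate source record; skip it
        seen.add(key)
        yield row

# Example: cleanse a (hypothetical) extract file before it is loaded.
with open("customer_extract.csv", newline="") as src:
    for clean_row in cleanse(csv.DictReader(src)):
        print(clean_row)
```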
Storing data that the existing systems have never collected can be a headache for the developers who run into the problem, and the only way to solve it is to begin storing that data in the system. Many companies will also find that some of their data is not being validated by the transaction processing programs. In a situation like this, the data will need to be validated.

When data is placed in a warehouse, a number of inconsistencies will occur within fields, many of which hold descriptive information. One of the most common issues arises when no controls are placed on customer names. This causes headaches for the warehouse user who wants to run an ad hoc query selecting a specific customer by name. The developers of the data warehouse may find themselves having to alter the transaction processing systems, and they may also be required to purchase certain forms of technology.

One of the most critical problems a company may face is a transaction processing system that feeds information into the data warehouse with too little detail. This occurs frequently in a data warehouse that is tailored toward products or customers. Some developers refer to this as a granularity issue. Regardless, it is a problem to avoid at all costs: it is important to make sure that the information placed in the data warehouse is rich in detail [http://www.exforsys.com/tutorials].

3.5.2 Cost and Budgeting Issues

Many companies also make the mistake of not budgeting enough for the resources connected to the feeder system structure. To deal with this, companies will want to construct a portion of the cleansing logic on the feeder system platform. This is especially important if the platform happens to be a mainframe: the cleansing process involves a great deal of sorting, and mainframe utilities are often proficient in this area. Some users choose to construct aggregates on the mainframe as well, since aggregation also requires a lot of sorting. It should also be noted that many end users will not use the training that they receive for the data warehouse [Dain, Hansen, Cost Savings are the New Black for Data Warehousing (March 19, 2009)]. However, it is important that they be taught the fundamentals of using it, especially if the company wants them to use the data warehouse frequently.

Typical data warehouse project costs derive from two sources: technology and staffing. The balance between these costs varies by a project's environment, but every project includes both. Technology costs often include system hardware, database-related software, and reporting software. Staffing costs cover the individuals who gather requirements as well as those who model, develop, and maintain the data warehouse. In reviewing the marketplace for data warehousing hardware and software, organizations can easily "break the bank" when procuring these technologies. As tempting as it may be to seek out a large SAN with clustered high-availability database servers and farms of Web servers to support a data warehouse, it is not always appropriate or necessary. It is quite possible, and sometimes far more effective, to build a data warehouse solution on relatively conservative hardware [Bill Inmon, Controlling Warehouse Costs (October 1998)].
Further, although there are a number of outstanding software solutions that address each step in the data flow, it is critical to first find the right toolset before selecting an expensive or feature-rich product. A common organizational mistake in data warehousing is to ignore the currently owned technology and software capabilities and move ahead quickly, if imprudently, to purchase a different product [Nathan, Rawling, Data Warehousing on a Shoestring Budget (May, 2008)]. An organization often has the toolset in place to effectively meet its data warehousing demands but simply needs the right partner to fully implement its current technology.

3.5.3 Data Warehousing Security Issues

Data warehousing systems present special security issues, which include:
- the degree of security appropriate to summaries and aggregates of data;
- the security appropriate for the exploration data warehouse, specifically designed for browsing and ad hoc queries;
- the uses and abuses of data encryption as a method of enhancing privacy.

Many data structures in the data warehouse are completely devoid of sensitive individual identities by design and, therefore, do not require protection appropriate for the most private and sensitive data. For example, when data has been aggregated into summaries by brand or region, as is often the case with data warehousing, the data no longer presents the risk of compromising the private identities of individuals. However, the data can still have value as competitive intelligence about market trends, and thus requires careful handling to keep it out of the hands of rival firms. Relaxed security does not mean a lack of commitment to security. The point is that differing levels of security requirements ought to remind us that one-size-fits-all solutions are likely to create trouble.

Another special security problem presented by data warehousing is precisely the reason why such systems exist. Data warehouses are frequently used for browsing and exploring vast reams of data; undirected exploration and knowledge discovery is provided by an entire class of data mining tools. The point is to find novel combinations of products and issues. Whether authentic or mythical, the market basket analysis example whereby diapers are frequently purchased with beer is now a classic case: the father going to the convenience store for "emergency" disposable diapers and picking up a six-pack on the way out suggests a novel product placement. It is hard to say in advance what restrictions would disable such an exploratory data warehouse; therefore, the tendency is to define an unrestricted scope for the exploration. A similar consideration of undirected knowledge discovery applies to simple ad hoc access to the data warehouse. When a business analyst uses end-user self-service tools, such as those by Business Objects, Information Builders, Cognos, or Oracle, to issue queries directly against the data without intermediate application security, the end user has access to all the data in the data warehouse. Given privacy and security imperatives, it may be necessary to render the data anonymous prior to unleashing such an exploratory, ad hoc process. That creates complexity where the goal is (sanctioned) cross-selling and up-selling: the identity must be removed in such a way that it can be recovered, as the purpose is often to make an offer to an individual customer.
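One common way to remove identities recoverably, as described above, is keyed pseudonymization: replace the customer identifier with a keyed hash before handing the data to exploratory users, and keep the mapping back to the real identity in a separately secured store. The sketch below is a minimal illustration of that idea; the key handling, column names, and in-memory mapping are assumptions made for the example, not a description of any specific product.

```python
import hmac
import hashlib

SECRET_KEY = b"replace-with-a-managed-secret"   # assumed to live in a key vault, not in code

def pseudonymize(customer_id: str) -> str:
    """Deterministic keyed token for a customer identifier (HMAC-SHA256)."""
    return hmac.new(SECRET_KEY, customer_id.encode(), hashlib.sha256).hexdigest()

recovery_map = {}   # restricted mapping table; kept out of the exploration data mart

def anonymize_row(row: dict) -> dict:
    """Return a copy of the row with the identity replaced by a recoverable token."""
    token = pseudonymize(row["customer_id"])
    recovery_map[token] = row["customer_id"]
    out = dict(row)
    out["customer_id"] = token
    return out

row = {"customer_id": "C-1001", "region": "West", "purchase_amount": 42.50}
print(anonymize_row(row))   # safe to expose for browsing and ad hoc queries
# Looking the token up in recovery_map later yields "C-1001" when a sanctioned offer is made.
```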
Encryption of data has its uses, especially if the data must be transmitted over an insecure medium such as the Internet. However, encryption does not by itself solve access control: an employee, his or her manager, and the human resources clerk all require access to the employee's record, so encrypting the data will not distinguish between their access levels. It is misguided to believe that if encrypting some data improves security, encrypting all the data improves security even more. Blanket, global encryption degrades performance, lessens availability, and requires complex encryption key administration. Encryption is a computationally intense operation. It may not impact performance noticeably when performed for one or two data elements, but when performed arbitrarily for an entire table, the result may very well be a noticeable performance impact. It might make sense to encrypt all the data on a laptop PC that is being taken off site if the data is extremely sensitive; if the PC is lost or stolen, only encryption will guarantee that the data is not compromised. However, an even better alternative is selective encryption combined with organizational steps to make sure the physical site is secure and the media containing the data are handled diligently.

As a general rule, proven security practices and solutions developed to secure networks will be appropriate and can be extended to protect the data warehouse. In other cases, data warehouses present special challenges and situations because the data is likely to be a target that encourages hackers to try to gain access to the system. These practices extend from organizational measures to high-technology controls. The requirement for authentication implies certain behavior: on-site staff should wear their corporate identification badges and be required to sign an agreement never to share a user ID or password with anyone. Based on the enterprise's specific confidentiality rules, newer technologies that are still emerging may be selectively used. It is essential that database administrators work together with their security colleagues to define policies and implement them using the role-based access control provided with the standard relational database data control language [http://www.academictutorials.com].

3.5.4 Maintenance Issues for Data Warehousing Systems

Another important aspect of data warehousing is the maintenance of these systems. Adequate knowledge about business and feeder system changes that will affect the data warehouse is of utmost concern for anyone doing systems maintenance. In a data warehousing environment, data is fed from more sources than in a typical transaction processing system. Although intelligent use of the data extraction, cleansing, and loading tools and the information catalogs can greatly ease the burden, many changes will still require a fair amount of effort. Keeping informed about, and assessing the impact of, technically driven changes to the feeder systems may be more difficult than keeping track of business-driven changes.

Maintenance issues are hard to handle. Some of the concerns are as follows:
- Figuring out if, when, and how to purge data. There comes a point when it does not make business sense to hold certain data in the warehousing system, usually because of some type of capacity limit or because restructuring certain data is not worth the effort. At this point, the data must be purged to less expensive, alternative means of storage.
- Determining which queries and reports should be written by IS and which should be written by users.
- The temptation to store data in the data warehouse "for data's sake".
- Balancing the need to build aggregate structures for processing efficiency against the desire not to build a maintenance nightmare.
- Uncertainty over whether to create certain reports and queries in the data warehousing system or in the "feeder" transaction processing system.
- Pressure to implement a means to interactively correct data in the data warehouse (and perhaps send corrections back to the transaction processing system).
- Uncertainty about which tools are most appropriate for a given task.
- Figuring out how to test the effect of structural changes on user-written queries and reports.
- Determining how problems with feeder system update processing affect data warehouse update processing.
- The fact that the business changes the meanings of attributes over time and that these changes can be overlooked.
- Reworking the implemented security.
- The need to keep reconciling the feeder systems with the data warehouse systems.

In short, maintaining a data warehouse architecture may be much harder than establishing it. It is also far more expensive (and complex) to maintain a data warehouse than to build one [http://www.academictutorials.com].

Through this literature review we can conclude that the three main issues to be handled to keep a data warehouse working efficiently are maintaining data integrity, maintaining high system performance, and doing both in a cost-efficient manner.

Chapter 4
SIGNIFICANCE OF THE PROJECT

The project is descriptive and explanatory in nature. The descriptive part of the project studies the various data warehousing issues and the solutions to these issues, whereas the explanatory part attempts to understand these issues in the industry, that is, in the real world. The literature review of the project includes information obtained from whitepapers, journals, periodicals, and press releases. The understanding of the real issues in the industry is obtained through interviews, a survey, and case studies published by companies. This project gives a better understanding of the current issues and how they affect the data warehouse environment: not just the theoretical implications, but, through the case studies, how these issues affect the real world. This will in turn help in understanding how to design a better data warehouse environment.

Chapter 5
RESEARCH METHODOLOGY

The approach used to gather information on the subject included extensive research and a survey. The research included a thorough study of various books, white papers, journals, periodicals, and library archives. Information was also obtained by means of search engines. Industry professionals with experience in data warehousing environments were interviewed about problem cases and how they worked out solutions. A small survey of people working in data warehousing environments was conducted to understand more about the issues they face and how important overcoming these issues is for them.

Chapter 6
CASE STUDIES

6.1 John Deere – Interview with Project Manager

Company background

Deere & Company was founded in 1837 and has grown from a one-man blacksmith shop into a corporation that today does business around the world and employs more than 50,000 people. The company is collectively called John Deere.
John Deere consists of three major business segments: agriculture and turf, construction and forestry, and credit. These segments, along with the support operations of parts and power systems, are focused on helping customers be more productive as they help to improve the quality of life for people around the world. The company's products and services are primarily sold and serviced through John Deere's dealer network.

John Deere is the world's leading manufacturer of farm equipment. The company also produces and markets North America's broadest line of lawn and garden tractors, mowers, golf course equipment, and other outdoor power products. John Deere Landscapes provides irrigation equipment and nursery supplies to landscape service professionals across the United States. John Deere is the world's leading manufacturer of forestry equipment and is a major manufacturer of construction equipment in North America. John Deere Credit is one of the largest equipment finance companies in the U.S., with more than 2.4 million accounts and a managed portfolio of nearly $23 billion (U.S.). In addition to providing retail, wholesale, and lease financing to help facilitate the sale of John Deere agricultural, construction and forestry, and commercial and consumer equipment, John Deere Credit also offers revolving credit, operating loans to farmers, crop insurance (as a Managing General Agent), and debt financing for wind energy. Today, John Deere Credit has approximately 1,900 employees worldwide and operations in 19 countries.

Resource

For this case study, Mr. Paresh Deshpande, a project manager at John Deere, was interviewed on October 12, 2010 at 10:30 am. The case provided here focuses on the issues the team faced while improving DB2 UPDATE performance. It discusses how to design Informatica mappings that do a large number of UPDATEs to a DB2 table so that they can be partitioned to do multiple updates simultaneously. The goal was to improve the performance of the system.

Situation

Deere Credit was converting most of its loans from a system called M010 to a new system called Profile. One of the new features of Profile is daily interest calculation. This is something that M010 did only a few times a month, but Profile would do it nightly. When M010 calculated interest, it caused an update in a table for every open agreement. Typically, this meant updates to about 350,000 rows. Updating this many rows one at a time would typically take 35-45 minutes. Since M010 did this only about three times a month, this was not considered a problem. With Profile, however, this many updates would happen every night, adding approximately 20 minutes to the normal nightly processing, which the team wanted to avoid. It was important that the nightly processing completed in a timely manner because Profile could not be brought online until most of the batch processing was done. Doing 350,000 DB2 UPDATEs became the bottleneck in the nightly processing, and hence there was a need to find a faster way of doing them.

Another problem at Deere Credit was that DB2 was generally configured to use page-level locking. This means that when a row in a table is updated, DB2 locks not only that particular row but also any other rows that happen to be in the same page on disk. This improves DB2 performance, but it can cause deadlock situations to occur more frequently when multiple connections are updating the same table.
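Deadlocks of this kind are typically handled by automatically retrying the unit of work that was chosen as the deadlock victim; the solution described below assumes exactly such retry logic, implemented at Deere by restarting failed Informatica maps. Purely as an illustration of the retry idea, assuming a generic Python DB-API connection and a hypothetical check for DB2's deadlock indicators (SQLSTATE 40001 / SQLCODE -911), a minimal wrapper might look like the sketch below; it is not the implementation used at Deere.

```python
import time

def looks_like_deadlock(exc: Exception) -> bool:
    """Hypothetical check: DB2 signals deadlock/timeout with SQLSTATE 40001 / SQLCODE -911."""
    text = str(exc)
    return "40001" in text or "-911" in text

def update_with_retry(conn, sql, params, max_attempts=3, backoff_seconds=5):
    """Run one UPDATE as its own unit of work, retrying if it is chosen as a deadlock victim."""
    for attempt in range(1, max_attempts + 1):
        try:
            cursor = conn.cursor()
            cursor.execute(sql, params)
            conn.commit()
            return cursor.rowcount
        except Exception as exc:               # DB-API drivers raise driver-specific errors
            conn.rollback()
            if not looks_like_deadlock(exc) or attempt == max_attempts:
                raise
            time.sleep(backoff_seconds)        # brief pause before retrying the unit of work
```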
Since one UPDATE will cause all records on that page to be locked, simply creating multiple UPDATE threads will almost certainly cause deadlocks.

Solution

The solution was to use Informatica's partitioning feature to run multiple UPDATE streams simultaneously. Informatica has the ability to thread mappings: when a mapping is threaded, several instances of the mapping run simultaneously. Source data is sent to each thread depending on how the partitions are configured. Each thread gets its own connection to the database and appears to the database as a different user. When carefully done, this allows multiple UPDATEs to occur simultaneously against the same table, leading to a performance improvement.

Further, to avoid the deadlocks caused by page-level locking, the team developed a technique to minimize (but not completely eliminate) updates to the same DB2 page in different Informatica partitions. The attempt was to send all rows that are on the same physical page to the same Informatica thread. This minimizes the chance that two separate connections are trying to update data on the same DB2 page, but it does not totally eliminate the deadlock situation. The solution assumes that automatic deadlock retry logic is in place for restarting Informatica maps that fail due to DB2 deadlocks; it does not completely avoid deadlocks, but it minimizes the chance of their occurrence.

Several other steps were taken to enhance performance. The DBAs try to keep the rows in a table in a particular physical order, usually primary key order. This is called being 'in cluster', and it is important that the table being updated stay 'in cluster'. The table will probably never be 100% in cluster, but it should be at least 90% in cluster; if it falls below that level, the table should be reorganized to put it back in cluster. The rows being updated need to be sorted in the same order that they are in the physical table, and a sorter transformation should be used to guarantee this order. For the sorter transformation to work properly in a partitioned Informatica map, all data should be 'funneled' through a single partition point so that all records go through the same sorter.

Once data is sorted, it needs to be assigned a group number. Group numbers start at zero and are incremented each time the record count exceeds GroupSize. For example, if GroupSize is 100, records 1-100 are assigned group number 0, records 101-200 are assigned group number 1, and so on. This number is then used to decide which update partition to use. The number of records in a single group should be about 10 times larger than the number of rows that fit on a single DB2 page. This size seems to be a good balance between performance and deadlock occurrence. The larger the group size, the more likely it is that nearby records will be sent to the same partition, but the slower the processing becomes. For example, a group size of 1 would be very fast but would almost certainly deadlock. Likewise, a group size larger than the number of rows to be updated would send all rows to the same partition, so it would never deadlock, but it would effectively disable partitioned updates.

Informatica normally decides internally when to commit database changes, and this does not work well with partitioned updates. Informatica typically does several thousand updates before committing, and this is not frequent enough to avoid deadlocks.
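A minimal sketch of the grouping and routing idea described above, written in Python purely for illustration (at Deere this logic lived in Informatica expression transformations and hash-key partition points, not in Python): rows are assumed to arrive already sorted in clustering-key order, consecutive rows share a group number, and the group number is hashed to choose one of the update partitions, so rows on the same DB2 page rarely end up on different connections. The group size and partition count shown are illustrative values.

```python
GROUP_SIZE = 1000        # roughly 10x the rows per DB2 page (illustrative value)
NUM_PARTITIONS = 6       # Deere found that six partitions worked well

def assign_groups(sorted_rows, group_size=GROUP_SIZE):
    """Assign consecutive, already-sorted rows to groups 0, 1, 2, ..."""
    for index, row in enumerate(sorted_rows):
        yield index // group_size, row

def partition_for(group_number, num_partitions=NUM_PARTITIONS):
    """Route an entire group to one update partition via a hash of its group number."""
    return hash(group_number) % num_partitions

# Rows that sit near each other on disk share a group number, so they are routed
# to the same partition and rarely collide on the same DB2 page.
sample_rows = ({"agreement_id": i} for i in range(1, 3001))   # stand-in for sorted source rows
for group_number, row in assign_groups(sample_rows):
    partition = partition_for(group_number)
    # send_to_partition(partition, row)   # placeholder for the real partitioned UPDATE stream
```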
The Transaction Control transformation must be used to commit transactions frequently. Committing frequently decreases the chance of a deadlock, but it also decreases performance: the smaller the commit interval, the less likely a deadlock is to occur, but the more performance is impacted.

When a group of records comes close to being finished, Informatica seems to 'cache' the last few records and not send them through. This leaves the possibility that some records will be left in an uncommitted state, which can cause deadlocks. To prevent this, when a group of records is sent to a partition, the first few records and the last few records are committed after every update. This cuts the chances of a deadlock between the rows in this group and the rows in the groups preceding and following it.

Because deadlocks can and do occasionally happen when partitioning updates, it is best to skip records that have already been updated in a previous run. Besides the performance benefit, beginning the update in a different position helps to prevent deadlocks from occurring in the same spot during the update process.

When a mapping is restarted because of a deadlock, it is run with an alternate configuration called 'failsafe' mode. This is an alternate configuration of commit points and group sizes that is guaranteed to complete. Normally the commit point in the restart configuration is set to 1 to guarantee that no deadlocks will occur on the second attempt. The commit size, group size, and edge size should all be defined in the session parameter file, along with the alternate values to use when restarting. The values of these parameters may need to be tweaked from time to time.

In addition to the mapping changes, the workflow needs to be configured for partitioning. Deere found that six partitions worked well for them; this setting can also be tuned. They observed increased deadlocks and some decrease in performance with more than 8 partitions. A partition point needs to be set on the sorter transformation so that all rows go through one partition when being sorted and when going through the expression transformation; this ensures that all records with similar primary keys are assigned to the same group. The transaction control transformation should also have a partition point based on a hash key on the assigned group number, which ensures that all rows with the same group number are sent to the same database connection.

6.2 Kyobo Life Insurance – Case Study One

Company Background

Kyobo Life Insurance is one of South Korea's top life insurance firms. The company provides life insurance and asset management products to more than 10 million customers. Its offerings include traditional life, health and disability, and retirement and pension products for individuals and businesses. Products are distributed through financial planning agents. Kyobo Life also offers personal-use and mortgage loans, and it has some international operations. The company was founded in 1958 by Shin Yong-Ho, father of CEO Shin Chang-Jae. The Kyobo Group conglomerate operates under a common management structure in a variety of sectors including real estate, investment banking, and a bookstore.
Kyobo Life Insurance has earned quite a few accolades as an industry leader: it was named one of the Top 30 Admired Companies in Korea, won the Customer Satisfaction Grand Prize awarded by KMAC for 5 years in a row, and ranked as the #1 insurance company in the KCSI (Korean Customer Satisfaction Index) 2004 survey. These are the results of Kyobo Life's paradigm shift from volume to value. Relying on a solid customer base, Kyobo Life is striving to maximize customer value through qualitative growth. With a firmly established open and high-performance corporate culture, Kyobo Life will expand into foreign markets and aims to become the 'most-preferred brand in Northeast Asia' by the end of this decade.

Situation

Kyobo Life wanted to establish transparent and efficient management. In addition to ensuring transparency in finance and administration, the company also wanted to manage four major outcomes, including management planning and profit management, and to simplify the complicated interfaces between IT systems, which had come to resemble spaghetti. The effort was called the value innovation project. Implementing this project would provide the business advantage of making diverse management information available for quick decision-making. Users could directly extract analytical and statistical data, raising user productivity and satisfaction. The goal was to reduce data extraction time from 5 days to 1 day, improve data quality, and unify data across all business units for single access.

Solution

In the former computing environment, limited statistical and analytical data was available only in certain areas such as insurance, credit, and accounting. To get more diverse data, users had to consult the Information System Office. Sybase IQ can quickly process large amounts of data, has a unique way of storing data, and is capable of compressed storage. Sybase IQ was used to create the EDW (Enterprise Data Warehouse), and data marts were implemented for 14 subject areas in 4 groups. Implementing Sybase IQ on IBM systems allowed Kyobo Life to efficiently allocate resources without any interruption in service.

The result of this implementation was that the quality of analysis data improved, and the availability of diverse information improved decision-making. As users could now perform analysis and extract statistical data on their own, average data extraction time was reduced from 5 days to 1 day, improving user productivity and satisfaction.

To realize these goals, Kyobo Life carried out the value innovation project, which had three primary goals: to establish a responsible management system, to support strategic decision-making, and to integrate and accelerate financial data. The company wanted to become more competitive in the insurance industry through value management, which was achieved by the value innovation project. The project consisted of three parts: enterprise resource planning (ERP), the enterprise data warehouse (EDW), and enterprise application integration (EAI). Kyobo Life employed a "big-bang" method in which both the EDW and the ERP were built concurrently. The scale of the effort was also noteworthy, as the size of the EDW gained attention across Korea as well as abroad.

Kyobo Life installed the EDW because the existing computing environment only provided statistical and analysis data in a few areas such as insurance, credit, and accounting. To access more diverse data, users had to consult the Information System Office, sometimes leading to long waiting periods.
Although many statistical systems were available, they did not provide enough strategic information. To address this issue, Kyobo Life installed an information infrastructure to make it easier to create, manage, and use information. The company also decided to integrate its data and shift its computing focus from business processing to analysis. To keep all the data in one place, Kyobo set up and applied data standards to integrate enterprise-wide information, provided speedy and correct information related to profitability and the four major project focus areas, and developed an infrastructure that allowed the direct use of data. In this process Kyobo Life adopted Sybase IQ, which was different from the other databases used in insurance, credit, and stocks and bonds, and chose IBM WebSphere DataStage as the ETL tool. To build and deploy the EDW, Kyobo Life worked closely with both Sybase and IBM. The combined software, hardware, and services technical teams contributed unique skill sets to design the data architecture, optimize user analysis, provide a data governance structure, and assure data integrity for the system.

Kyobo Life combined insurance, credit (host), stocks and bonds (SAP CFM), special accounting, and personnel (SAP HR) through EAI to create a unified system that provides office data such as account closing management, bill management, funds management, tax management, budgeting, and financial accounting. The ETL tool was used to load data nightly into the EDW and make it available to the entire information system. The EDW contains large amounts of data on customers, insurance contracts, products, and insurance premiums. Yet its greatest advantage is its ease of use. Related data are divided into 4 major categories: customers/activities; contracts/financial transactions; commissions; and investment/finance/managerial accounting. The warehouse is set up with 14 subject areas including customers, activities, communication, contracts, products, fund setting, financial transactions, closing, organization, market, investment, and fixed assets/real estate. This data design makes it easy for users to analyze data. As a result, 700 employees use the EDW to obtain analysis data.

A spokesperson says [Lee, Hae Seok, Kyobo Life Insurance Case Study. http://www.sybase.com/detail?id=1054785], "After the system was installed, enterprise-wide standards were applied to business data from various channels and data was integrated through the ETL tool to improve the quality of analysis data. Diverse management information is available for quick decision-making. In addition, as users can directly extract analytical and statistical data, data extraction time was reduced from 5 days to less than a day, raising user productivity and satisfaction."

In addition to these benefits, Kyobo Life Insurance uses the enterprise-wide data offered by the EDW to conduct analysis from diverse viewpoints, thereby improving transactional processes and work efficiency. The EDW, a single channel that provides the information system with data, improved the flexibility and scalability of the system and raised development productivity. The company shifted from providing task-related information to providing analytical enterprise-wide information, and it is transforming into a "real-time corporation" by providing forecast data based on real-time analysis.
To this end, Kyobo Life now holds management performance reporting sessions using the management information system based on EDW data, and it maintains data quality by reinforcing each department's sense of data ownership. Kyobo Life also uses the data to perform the information system's analysis and statistical work. To expand the user base, increase use, and improve competencies, the company provides convenient user tools, open training courses, and a specialist license system. Kyobo Life trains instructors at each branch office, who then give training courses for their respective branches [Lee, Hae Seok, Kyobo Life Insurance Case Study].

6.3 Scandinavian Airlines – Case Study Two

Company Background

Scandinavian Airlines (SAS) is the flag carrier of Denmark, Norway, and Sweden. It is the largest airline in Scandinavia, flies to 150 destinations, employs 32,000 staff, and carries 40 million passengers annually. It is a founding member of the Star Alliance, the world's first and largest airline alliance, headquartered in Frankfurt, Germany. Founded in 1997, the alliance's name and emblem represent its five founding airlines: Air Canada, Lufthansa, Scandinavian Airlines, Thai Airways International, and United Airlines. Star Alliance has since grown considerably and now has 28 member airlines. SAS has its head office in Solna, near Stockholm, Sweden.

Situation

Scandinavian Airlines (SAS) is the largest airline in Scandinavia, but with major changes reshaping the airline industry, particularly low-cost competition and rising fuel prices, SAS executives face significant challenges. To operate effectively, SAS employees deal with large volumes of operational and customer data. The existing systems for processing this information depended on an aging data warehouse powered by IBM DB2 mainframe technology, which had become expensive to operate and maintain. The company's IT team faced increasing difficulties providing employees with timely access to data. In particular, the existing system suffered from scalability issues and low performance due to priority conflicts. This made it difficult for the team to increase analysis capabilities or meet business demands for new reports and online queries.

SAS decided to investigate the advantages of improved data management and more flexible support for business analysis and reporting. In particular, the company wanted to improve the availability of business intelligence relating to reservations, customer behavior, and agent activities. Petterson, an IT architect at SAS, said that they needed a new information management environment that could significantly reduce IT management and maintenance costs, deliver enhanced analysis and business intelligence tools, and give more employees across the airline reliable access to critical business data.

Solution

SAS executives realized this was the start of an ongoing strategic process required to replace aging mainframe applications with new technology solutions that could meet the dynamic needs of a leading airline. They conducted a thorough evaluation of the issues and likely cost benefits, and turned to Microsoft to provide essential support. Microsoft is a long-term partner of SAS and participates in regular forums with the airline's managers to help identify critical business challenges and provide strategic IT solutions to resolve them. The Microsoft team worked with SAS to consider the business case.
Using advice about best practices from Microsoft, SAS senior managers developed a comprehensive, dependable strategy to overcome their immediate business data issues and meet future business analysis needs. The strategy was also influenced by key technology partner Intel Corporation, which worked closely with SAS and Microsoft to define a road map of increasing processor capacity based on Intel processors. Additionally, the Intel Lab played a central role in benchmarking, while Intel Services assisted in testing the scalability of the proposed solution. A feasibility study followed, which led to the planning and preparation of a proof of concept that would be the initial phase of the overall strategy. The proof of concept focused on two tasks: the migration of data from 600 DB2 tables in the mainframe environment to a Microsoft SQL Server 2005 database running in a Microsoft Windows Server 2003 environment on a new HP Superdome server platform based on Intel processors, and the testing of 25 COBOL pre-processing programs from the mainframe environment using emulation technologies provided by partner Micro Focus International. Rather than retrain developers in new technologies, the decision was taken to introduce the Microsoft .NET Framework, an integral component of Windows Server 2003 that provides an up-to-date development environment capable of supporting the company's COBOL programmers, with a user-friendly programming model and runtime for Web services, Web applications, and smart client applications.

SQL Server 2005 would help the SAS IT team resolve technical issues related to data warehousing. In the long term, the intention was to use the advanced integration capabilities within SQL Server 2005, in particular its extract, transform, and load (ETL) functionality for managing bulk data transfers. SQL Server 2005 Analysis Services was another important element of the new solution, providing a combination of transactional and online analytical processing (OLAP) benefits. Its Unified Dimensional Model (UDM) would allow SAS to combine various relational and OLAP data models into a single dimensional model that gives users fast access to accurate business data. With plans to extend access to business data to users across the business, security was a key issue for the IT team. To meet these challenges, it decided to use Microsoft Active Directory directory services as a replacement for the existing mainframe security layer. Active Directory allows the team to efficiently manage all the security policies and permissions needed to ensure that authorized employees have reliable access to the data warehouse [Microsoft Case Studies, Scandinavian Airlines].

Benefits

By deploying the latest Microsoft technologies, SAS reduced development times and costs for creating new business intelligence functionality. In addition, it increased employee access to key business information, enhanced network security, and maximized the value of existing technologies and IT skills. By investing in Microsoft technologies, SAS minimized the issues of its aging environment while gaining the opportunity to extend business intelligence functionality at lower development cost. Excellent support is also protecting technology investments.

Improved Scalability Provides More Users with Faster Access to Business Intelligence

In the DB2 mainframe environment, it was no longer cost effective or practical to separate the data analysis and transactional environments.
As a result, the performance of applications accessing the data warehouse was often impeded because priority was given to transactional systems. Inevitably, business analysts suffered delays and were hindered by inaccurate data. The scalability of the new environment ensured that analytical and transactional systems were now separate. Using SQL Server 2005, more users can be added as required while response times improve. Thanks to the ease of integration between Microsoft technologies, SAS will also be able to offer low-end data analysis capabilities to the wider range of employees working with existing Microsoft desktop environments [Microsoft Case Studies, Scandinavian Airlines].

Helping Managers Reduce Infrastructure Costs

Low-cost operators and rising fuel prices are increasing pressure on airline managers to reduce costs. As a result, the cost-effectiveness of IT operations is a key requirement for SAS. Following the migration of its data warehouse to SQL Server 2005, the company predicts substantial savings; Petterson said that this initiative would reduce data warehousing costs by 50 percent [Microsoft Case Studies, Scandinavian Airlines. http://www.microsoft.com/casestudies/Case_Study_Detail.aspx?CaseStudyID=200330]. The Intel processors will also contribute significantly to operational savings. Compared to a mainframe environment, this technology requires a lower initial investment and less stringent environmental control, resulting in reduced operational costs [Microsoft Case Studies, Scandinavian Airlines].

Improved Operational Efficiency

As well as helping SAS managers significantly reduce the cost of hardware and software, the new Intel and Microsoft strategy is achieving resource efficiencies by maximizing productivity. Petterson says: "Now we can implement standard analysis tools that offer access to shared data across the business. We can also eliminate the costs of duplicated and disparate departmental systems. Because it is easier for our managers and analysts to understand the whole business, they can draw conclusions across organizational units and data boundaries. Before, they only had a restricted, departmentalized view of operations. As a result, we are making substantial operational savings through increased productivity and efficiency. For example, the commercial insights gained from improved access to business data are helping the sales team target prospects more effectively, eliminate unproductive sales channels, and save on advertising costs." [Microsoft Case Studies, Scandinavian Airlines].

Providing Employees with More Accurate Business Intelligence

It is essential that SAS managers can respond quickly to challenges from their competitors. The essential capabilities are rapid access to accurate business data, effective analysis, and reporting that provides timely business intelligence. SQL Server 2005 will deliver these benefits, ensuring SAS managers have access to the information they need to work effectively. The UDM capability provided by SQL Server 2005 Analysis Services will close the gap between users and business data, delivering valuable operational and data management benefits [Microsoft Case Studies, Scandinavian Airlines].

Improving Competitiveness

The new solution will provide self-help environments for SAS business analysts, who will be able to quickly design their own queries and reports.
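The following is a toy, self-contained Python example of the kind of self-service question an analyst might answer, such as passenger load per route. The schema and figures are invented assumptions; SAS analysts would run comparable queries through SQL Server 2005 and its Analysis Services tools rather than through SQLite.

# self_service_query.py - a toy illustration of an ad-hoc analyst query
# (passenger load per route). The table, flight numbers, and figures are
# invented; this is not the SAS warehouse schema.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE flights (flight_no TEXT, route TEXT, seats INTEGER, passengers INTEGER);
    INSERT INTO flights VALUES
        ('SK403', 'CPH-ARN', 180, 171),
        ('SK404', 'ARN-CPH', 180, 122),
        ('SK903', 'CPH-EWR', 245, 233);
""")

# Load factor per route: passengers carried as a share of seats offered.
for route, load in conn.execute("""
        SELECT route, ROUND(100.0 * SUM(passengers) / SUM(seats), 1)
        FROM flights GROUP BY route ORDER BY 2 DESC"""):
    print(f"{route}: {load}% load factor")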
With analysts able to design their own queries and reports, the business will be more responsive to changing circumstances. It will be easier to take full advantage of every new business opportunity and improve competitive advantage. For example, by better understanding passenger loads per service, SAS could accelerate pricing and scheduling decisions for capacity flights [Microsoft Case Studies, Scandinavian Airlines].

Delivering Better Services to Customers

Using the advanced analysis capabilities of SQL Server 2005, SAS will be able to enhance relationships with its customers and increase per-customer revenues. The new Microsoft solution would allow detailed analysis of customer behavior - such as Internet reservations and check-ins, and call center interactions - and would also help reduce the impact of delays and cancellations on customers. The benefits would be better sales services and products, and improved customer service overall [Microsoft Case Studies, Scandinavian Airlines].

6.4 Philips Consumer Electronics – Case Study Three

Company Background

Philips is a global leader across its healthcare, lighting, and lifestyle portfolio, and Philips Consumer Electronics (Philips CE) is one of the world's top consumer electronics businesses, delivering technology that enhances everyday lives. Philips CE manufactures and markets a huge range of consumer goods, from flat screen televisions to DVD recorders, and has a significant global presence.

Situation

Local operational units and business units at Philips CE had previously handled their own data warehousing, using locally selected tools. This inevitably led to duplication of data and of IT costs, and an increase in the interfaces between disparate systems. The resulting framework was inefficient because it could not evolve swiftly or cost-effectively in response to changing business needs. Philips CE is never static: meeting the specific challenges of the consumer electronics markets - fast-changing product lines and competitive landscape, numerous customers and channels to market - requires constant evolution. Competition drives increasing sophistication in business controls, expanding supply chains introduce new data sources, and evolving business processes drive adaptations and integration of data models and data definitions. Han Slaats, Program Manager of Enterprise Data Warehousing at Philips CE, states that the process of data warehousing is key, because the whole point is to deliver the right data at the right time in the right place; to do so, data warehousing has to be treated as a service that changes to meet the needs of the business. Philips CE needed a strategic data warehousing tool to populate its best-practice framework for information management, which draws on the concept of adaptive enterprise data warehousing as a central service rather than a number of stand-alone projects across the company. By focusing resources on repeatable solutions and a single technological skill set, Philips CE aimed to reduce operational costs and increase the speed of ROI from data warehousing.

Solution

Philips CE moved into a richer information world through an advanced data warehousing strategy based on the Kalido Information Engine (Kalido). With a board-level view of data warehousing as a process and a service, rather than as a one-off project, the company has built a best-practice intellectual framework for information management that will allow it to reduce operational costs and deliver greater value to the business.
The Philips CE vision of best practice and its focus on generating a core of data warehousing expertise drove the development of a central service organization for information management. The company partnered with systems integrator Atos Origin and selected Kalido as its global strategic data warehousing tool. Through fast, flexible, and repeatable implementations, Kalido delivers a federated data warehousing solution at high speed and with low implementation and management costs. Philips CE thereby evolved a best-practice approach to data management that would enable it to reduce IT expenditure and deliver value-added information to the business [Kalido Case Study, Philips Builds Best-Practice Information Management Architecture].

Benefits

Ability to handle change

The new best-practice framework at Philips CE defined a central service organization to manage all corporate data warehousing using a single strategic technological tool. The criteria governing the selection of the tool were that it should be highly flexible, driven by business metadata, quick to implement, and cost-effective to maintain. Philips CE worked with systems integrator Atos Origin to select a tool to populate its data warehousing concept, and following a successful proof of concept by Atos Origin, it selected Kalido as its key strategic data warehousing tool [Kalido Case Study, Philips Builds Best-Practice Information Management Architecture].

Fast ROI

Kalido is a non-intrusive, adaptive enterprise data warehousing solution, so it fitted with Philips CE's existing technology and removed the need for costly and inflexible standardization of source systems. Its approach to database design adapted to changes in data and data structures without altering database definitions or any of the code that accesses them. This meant that it could deliver integrated information consistently even while changes in business structures were occurring. By designing an architecture based on a central service model and a single technological tool, Kalido, Philips CE ensured fast ROI on data warehousing projects, excellent ongoing support, and increased value to the business as a whole with subsequent projects. The central core of data warehousing competence is enriched with each new project, so successive implementations can be planned and completed more quickly and cost-effectively. As new instances of Kalido are put in place, more enterprise data is put into a comparable structure, and the total picture of data within the enterprise becomes ever richer. The strategy was not the creation of a single, unified data warehouse for the whole organization; however, bringing all data warehouse projects into a single service domain with one supporting technology meant that if Philips CE chose to integrate reporting across Business Units, it would be able to develop a common data model quickly and cost-effectively. Furthermore, a single skill set for data warehousing reduced training and support costs and accelerated the implementation of repeatable Kalido solutions. This led to significant savings on the cost of deploying data warehouses and to significant reductions in data warehouse implementation time [Kalido Case Study, Philips Builds Best-Practice Information Management Architecture].

Chapter 7 SURVEY

At the outset of the study, it was planned to conduct a study of a few cases to highlight the current issues in the data warehousing environment.
After gathering the cases, a need was felt to support the conclusions drawn with a small online survey. This online survey was conducted to get a better understanding of which features are important to professionals working in a data warehousing environment, and to further validate the conclusions drawn from the cases studied.

7.1 Sample Population

The population of subjects for the survey comprised people who work in a data warehousing environment and whom the investigator had access to. This is a convenience sample that provides data for an exploratory study; it is by no means a random, scientific survey. Differences in gender, location, organization, and work experience were not taken into consideration for the purpose of this study. The size of the data warehouse and the work position of each subject were gathered to see whether they made any difference in the subject's viewpoint on the problem.

7.2 Sample Frame

The sample frame for the survey included professionals who had experience working in a data warehousing environment. The survey was drafted using Google Docs, which lets the author create a form that can be emailed and records all responses in a spreadsheet. The survey was emailed to acquaintances, past colleagues, and friends working in data warehousing environments, and it was also made available on a social networking site.

7.3 Survey Response

Emails were sent to 48 prospective professional respondents, of whom 25 participated in the survey, a response rate of about 52 percent. The questions used in the survey were straightforward and not too detailed, and the answers were gathered in a way that generated nominal data, which is easy to analyze. The survey is presented in the Appendix of this project report.

7.4 Data Collection via Online Survey

An online survey was chosen as the method of data collection because it was the lowest-cost option and covered a large geographic area with no increase in cost. This method also made it possible to get responses from all over the world and from otherwise inaccessible respondents, and it is convenient for the respondents, who can think about the questions and answer at their leisure. Finally, an online survey helps in collecting responses rapidly and accurately.

Chapter 8 CASE STUDIES AND SURVEY OBSERVATION

John Deere is a very large organization, and the case described by Mr. Deshpande suggested that John Deere strives to keep up the performance of its system. In this case, migration to a new system was going to affect overall performance. Updating over 350,000 rows every night would add approximately 20 minutes to the normal nightly processing, which the team wanted to avoid. It was important that the nightly processing completed in a timely manner because Profile could not be brought online until most of the batch processing was done, so any slowdown extended the period during which the system was unavailable. Another problem was that deadlocks occurred more frequently when multiple connections updated the same table, due to the page-locking configuration of DB2. As illustrated in the case, the team addressed the situation by using Informatica's partitioning feature (a simplified sketch of the partitioning idea follows below) and made additional efforts to improve overall performance and keep the system up and running. Improving the performance of the system was given higher priority in this case.
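The following Python sketch illustrates the general partitioning idea discussed above: splitting a large nightly update into non-overlapping key ranges so that several workers can run in parallel without updating the same rows (and thus without contending for the same pages). It is a simplified illustration under invented assumptions; John Deere's actual solution used Informatica's partitioning feature against DB2, and the data, key ranges, and update step here are placeholders.

# partitioned_update.py - illustrative sketch of range-partitioning a nightly
# update so several workers run in parallel on disjoint key ranges.
# The rows, partition count, and "apply" step are invented for illustration.
from concurrent.futures import ThreadPoolExecutor

def partition_by_key(rows, num_partitions):
    """Split update rows into non-overlapping buckets by customer_id ranges."""
    rows = sorted(rows, key=lambda r: r["customer_id"])
    size = (len(rows) + num_partitions - 1) // num_partitions
    return [rows[i:i + size] for i in range(0, len(rows), size)]

def apply_partition(partition):
    """Apply one partition's updates; in practice each worker would hold its own
    database connection and touch only its own key range, avoiding contention."""
    for row in partition:
        pass  # placeholder for the UPDATE ... WHERE customer_id = ? statement
    return len(partition)

if __name__ == "__main__":
    updates = [{"customer_id": i, "balance": i * 1.5} for i in range(350_000)]
    partitions = partition_by_key(updates, num_partitions=4)
    with ThreadPoolExecutor(max_workers=4) as pool:
        counts = list(pool.map(apply_partition, partitions))
    print(f"updated {sum(counts)} rows across {len(counts)} partitions")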
Other issues faced by Mr. Deshpande as a project manager relate to cost management and are associated with software and hardware updates. Eventually the technology becomes obsolete and maintenance costs increase. John Deere uses Informatica as its ETL tool and has to keep up with changing technology: when the software version changes, the hardware needed to run the new version must also be upgraded to support it. Apart from this, he also has to deal with data management, which includes policies related to data retention, data storage, data mapping, and data recovery.

In the case of Kyobo Life Insurance, one of South Korea's top life insurance firms, the objective was to maximize customer value through qualitative growth. This was pursued through the 'Innovation Project', which would provide the business advantage of making diverse management information available for quick decision-making. The goals were to reduce data extraction time from 5 days to 1 day, improve data quality, and unify data across all business units for single access. This was achieved with Sybase IQ, which can quickly process large amounts of data, has a unique way of storing data, and is capable of compressed storage; Sybase IQ was used to create the Enterprise Data Warehouse at Kyobo. It is evident from this case that a company as large as Kyobo, which provides life insurance and asset management products to more than 10 million customers, needs a system that makes diverse management information available to improve the decision-making process. The data from all the channels was integrated through the ETL tool to improve the quality of analysis data. This case illustrates that well-integrated data yields better information, which in turn leads to improved decisions.

Scandinavian Airlines (SAS), the largest airline in Scandinavia, was facing significant challenges because its existing systems for processing operational and customer information depended on an aging data warehouse powered by IBM DB2 mainframe technology. As infrastructure becomes obsolete, it becomes expensive to operate and maintain. Moreover, low-cost operators and rising fuel prices were increasing pressure on airline managers to reduce costs. Therefore, it became necessary for SAS to seek a solution that could help reduce costs and also meet the dynamic needs of a leading airline. With the help of Microsoft, SAS came up with an IT solution to increase performance and reduce implementation and maintenance costs. This case illustrates that the solution for every organization is unique: based on its requirements, capabilities, and budget, every organization needs to come up with a solution that fits its needs. For example, to reduce implementation costs, SAS and Microsoft decided that instead of retraining developers they would use the Microsoft .NET Framework, which provides an up-to-date development environment capable of supporting the company's COBOL programmers.

At Philips Consumer Electronics, local operational units and business units had previously handled their own data warehousing using locally selected tools, which led to duplication of data and of IT costs and an increase in the interfaces between disparate systems. An organization that changes as much as Philips CE requires constant evolution, and the framework in use was inefficient because it could not evolve swiftly or cost-effectively in response to changing business needs.
Similar to the Scandinavian Airlines case, Philips CE came up with a solution that fit its business needs. Philips CE moved to the Kalido Information Engine (Kalido); with a board-level view of data warehousing as a process and a service, the company built a best-practice intellectual framework for information management that would allow it to reduce operational costs and deliver greater value to the business. Rather than creating a single, unified data warehouse for the whole organization, Philips CE brought all data warehouse projects into a single service domain with one supporting technology and thereby improved the efficiency of its systems. Furthermore, a single skill set for data warehousing reduced training and support costs and accelerated the implementation of repeatable Kalido solutions, which led to significant savings on the cost of deploying data warehouses and to significant reductions in data warehouse implementation time. This case further validates the point that solutions to improve efficiency vary from organization to organization based on their needs, abilities, and budget.

From the literature review, it was deduced that the three main issues to be handled to keep a data warehouse working efficiently are maintaining data integrity, maintaining a high-performance system, and doing both in a cost-efficient manner. The four case studies provide sufficient evidence that these are the three key issues around which solutions are designed for any organization. Later in the project it was decided to conduct a small survey to find out how professionals in different positions and in different-sized data warehouses viewed the importance of these issues, and which issues arise across different data warehouse sizes.

The survey asked the respondents their position in the organization and the size of the data warehouse in their organization. They were also asked to rate a set of features based on their importance to them: easy access to data, efficient performance, data currency (recency), data accuracy, and implementation and maintenance costs. Finally, they were asked to rate which of the following issues they faced at their organization: problems in accessing/restoring data, inefficiency in performance, inaccurate data, or obsolete infrastructure. The distribution of the survey respondents is shown in Figure 1 and Figure 2.

Figure 1: Number of Respondents from Different Data Warehouse Sizes

Figure 2: Respondents' Roles in the Data Warehouse

Among the responses received, 7 respondents worked in a small data warehouse (smaller than 1000 GB), 11 worked in a medium-sized data warehouse (between 1000 GB and 10 TB), and 7 worked in a large data warehouse (greater than 10 TB). With respect to their roles in the data warehouse, 11 were Application Programmers, 4 were Database Administrators, 3 were End Users, and 7 were Project Managers.
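As an illustration of how the counts behind Figures 1 and 2, and the average ratings used in the figures that follow, could be computed, the following Python sketch tallies a CSV export of the response spreadsheet. The column names are hypothetical assumptions; the actual Google spreadsheet headings may differ.

# survey_tally.py - a minimal sketch of tallying respondent counts and average
# feature ratings from a CSV export of the response spreadsheet.
# Column names ("role", "dw_size", feature columns) are hypothetical.
import csv
from collections import Counter, defaultdict

def summarize(csv_path: str):
    role_counts = Counter()
    size_counts = Counter()
    rating_sums = defaultdict(float)   # (role, feature) -> sum of ratings
    rating_n = defaultdict(int)        # (role, feature) -> number of ratings
    features = ["easy_access", "performance", "currency", "accuracy", "cost"]
    with open(csv_path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            role_counts[row["role"]] += 1
            size_counts[row["dw_size"]] += 1
            for feature in features:
                rating_sums[(row["role"], feature)] += float(row[feature])
                rating_n[(row["role"], feature)] += 1
    averages = {key: rating_sums[key] / rating_n[key] for key in rating_sums}
    return role_counts, size_counts, averages

if __name__ == "__main__":
    roles, sizes, avgs = summarize("survey_responses.csv")
    print(roles, sizes)
    # A lower average means higher importance, because the questionnaire
    # uses 1 as the highest rank.
    for (role, feature), avg in sorted(avgs.items()):
        print(f"{role:25s} {feature:12s} {avg:.2f}")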
Observations from the Survey:

Figure 3: Importance of Each Feature to Respondents in Different Roles

Figure 3 was plotted by calculating the average rating of each feature for each of the four roles. Similarly, Figure 4 was plotted by calculating the average rating of each issue for each of the four roles played in the data warehouse. The observations from Figure 3 and Figure 4 with respect to the positions held by the respondents in their organizations are as follows:

Project Managers: Project Managers (7 respondents), across all data warehouse sizes, rated efficient performance as the most important feature. This was followed by data accuracy, implementation and maintenance costs, and data recency, in that order. They placed the least importance on easy access to data. From Figure 4, we also observe that on average the project managers have more issues with performance inefficiencies, and they rated the issue of obsolete infrastructure higher than respondents in any other role.

Figure 4: Issues Faced by Respondents in Different Roles

Application Programmers: Application Programmers (11 respondents) also valued efficient performance as the most important feature. Their outlook differs from the Project Managers' in that they rated data accuracy, data recency, and ease of data access as the next most important features, in that order. The least important feature to them was implementation and maintenance costs.

Database Administrators: The observations for Database Administrators (4 respondents) were similar to those made for the Application Programmers.

End Users: The observations for End Users (3 respondents) were similar to those made for the Application Programmers and Database Administrators.

These observations confirm that Project Managers have to deal with more than the day-to-day operation of the system. High-performance systems are important, but project managers also have to consider the implementation costs of every project, keep maintenance costs to a minimum, and keep track of new software updates and how they will affect the current hardware. The Project Manager respondents in the survey have therefore valued implementation and maintenance costs over features like data recency and ease of access.

Figure 5: Importance of Various Features in Different Sized Data Warehouses

Figure 5 was plotted by calculating the average rating of each feature for each of the three data warehouse sizes. Similarly, Figure 6 was plotted by calculating the average rating of each issue for each of the three data warehouse sizes. The observations made from Figure 5 and Figure 6 with respect to the size of the data warehouse in the respondents' organizations are as follows:

<1000 GB: In smaller data warehouses, the respondents (7) valued performance the most.
The data-related features were given the next level of importance, followed by implementation costs. Another observation was that in smaller data warehouses most of the respondents rated obsolete technology as their main problem; the other predominant problem observed was data inaccuracy, followed by inefficient performance. As a result of these responses, the average ratings for obsolete technology and inaccurate data come out the same. A few respondents mentioned reasons such as a lack of resources to support larger data warehouses, which could explain the performance inefficiencies and data inaccuracies in their environments.

Figure 6: Issues in Different Sized Data Warehouses

<10 TB: In medium-sized data warehouses, the respondents (11) also valued performance the most, followed by the data-related features and implementation costs. The commonly faced problems in these data warehouses were performance- and data-related, which could be attributed to the way the data warehouse is designed: these warehouses carry huge amounts of data (running into terabytes), and the data stored in them must be clean and non-redundant for the system to perform efficiently.

>10 TB: The observations for the respondents (7) from data warehouses larger than 10 TB were similar to those for data warehouses between 1 TB and 10 TB. Today, many data warehouses fall into this category. The amount of data stored keeps growing, so it needs to be stored in a manner that improves performance and reduces redundancy; data warehouse design and architecture are of prime importance where data is so abundant.

In conclusion, the survey helps us understand the case studies better. As seen in all four cases, which involve larger data warehouses, the organizations struggled to achieve better performance and better data quality. The cost issue is the only one that varies from business to business and from organization to organization based on their requirements.

Chapter 9 CONCLUSION

The data warehouse is key to making well-informed business decisions in any business sector. It is subject to complex ad hoc as well as regular queries, mostly in support of decision-making processes. Data warehousing poses various challenges, such as migrating data from legacy systems, maintaining data quality, upholding the performance of the system, and managing the system cost-effectively. The main focus of this project was to investigate, through case studies and a survey, the factors that affect the efficiency of the data warehouse. The study confirmed that the main problem any organization is trying to tackle involves one of the three main issues discussed throughout this project report.

Maintaining a high-performance system

As data warehousing has become more and more important to businesses, increasing data warehouse performance has become vital. With many people depending on the data in the data warehouse to do their jobs, data warehouse performance can have a profound impact on overall company performance. Many companies rely on numerous ways to improve data warehouse performance, including clearing obsolete data, increasing storage space, and improving overall data warehouse architecture and design, to keep the data warehouse and the company functioning at their best [DeWitt, Madden, and Stonebraker, How to Build a High-Performance Data Warehouse].
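As a concrete illustration of the first of these measures, clearing obsolete data, the following Python sketch archives and then deletes rows older than a retention window in small batches. It is a minimal sketch under invented assumptions (an SQLite warehouse file with an existing sales fact table, a seven-year retention period, and a fixed batch size); a production warehouse would implement the same idea with its own DBMS facilities, such as partitioning or bulk archiving.

# purge_obsolete.py - sketch of a retention-based purge of obsolete rows.
# Assumes an existing "sales" table with a "sale_date" column (hypothetical).
# Rows older than the retention window are copied to an archive table and
# deleted in batches so the warehouse stays responsive during the purge.
import sqlite3
from datetime import date, timedelta

RETENTION_DAYS = 7 * 365   # keep seven years of history (assumed policy)
BATCH_SIZE = 10_000        # delete in batches to limit lock time

def purge(warehouse_path: str = "edw.db") -> int:
    cutoff = (date.today() - timedelta(days=RETENTION_DAYS)).isoformat()
    conn = sqlite3.connect(warehouse_path)
    conn.execute("CREATE TABLE IF NOT EXISTS sales_archive AS SELECT * FROM sales WHERE 0")
    removed = 0
    while True:
        rows = conn.execute(
            "SELECT rowid FROM sales WHERE sale_date < ? LIMIT ?",
            (cutoff, BATCH_SIZE)).fetchall()
        if not rows:
            break
        ids = [r[0] for r in rows]
        marks = ",".join("?" * len(ids))
        conn.execute(f"INSERT INTO sales_archive SELECT * FROM sales WHERE rowid IN ({marks})", ids)
        conn.execute(f"DELETE FROM sales WHERE rowid IN ({marks})", ids)
        conn.commit()
        removed += len(ids)
    conn.close()
    return removed

if __name__ == "__main__":
    print(purge(), "obsolete rows moved to the archive")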
Data warehouse performance tends to degrade as more data is collected over time. Increased data mining, while important to the business, increases the overall load on the system, and more people making use of the system also increases the load as a larger number of queries are issued by various employees. Removing obsolete information means that queries can be processed more quickly and return more relevant results, making overall data warehouse maintenance an important part of improving data warehouse performance. Infrastructure is another important factor in data warehousing [Gavin Powell, Oracle Data Warehouse Tuning for 10g (2005)]. A data warehouse system can be functioning at the highest possible level for the available technology, yet only two or three years later it can be considered obsolete. Improving the data warehouse architecture, on both the hardware level and the programming level, can also greatly increase data warehouse performance. Updating processors, adding storage space, and using newer, more streamlined query protocols can greatly improve performance. In addition, these changes in overall data warehouse design can make a dramatic difference in the amount of data that can be stored as well as the speed at which the system can process individual queries. Another approach that can help improve data warehouse performance is training. Data warehousing was originally designed to support decision making at a high executive level, but the overall usefulness of business intelligence has led to many other people using the data for a variety of purposes. In some cases, these employees have not received adequate training and do not know how to construct efficient queries to retrieve the information they need. For these employees, training on the use of the system and on how to query the data effectively can lead to great improvements in data warehouse performance.

Maintaining data quality

The main factors to consider when looking to maintain data quality are data integrity, the data input source and methodology used, the frequency of data import, and the audience. Data integrity is a concept central to data warehouse quality, as it relates to the rules governing the relationships between the data, dates, definitions, and business rules that shape the relevance of the data to the organization [Larry P. English, Improving Data Warehouse and Business Information Quality: Methods for Reducing Costs and Increasing Profits (1999)]. Keeping the data consistent and reconcilable is the foundation of data integrity. Steps used to maintain data warehouse quality must include a cohesive data architecture plan, regular inspection of the data, and the use of rules and processes to keep the data consistent wherever possible. The easiest way to maintain data warehouse quality is to implement rules and checkpoints in the data import program itself (a small illustrative sketch appears below). Data that does not follow the appropriate pattern is not added to the data warehouse but instead requires user intervention to correct, reconcile, or change it [Panos Vassiliadis, Data Warehouse Modeling and Quality Issues (June 2000)]. In many organizations, changes of this kind can be implemented only by the data warehouse architect, which greatly increases data warehouse quality. The accuracy and relevance of the data are essential to maintaining data warehouse quality.
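To make the idea of rules and checkpoints in the data import program concrete, the following Python sketch validates each incoming row and routes failures to a reject file for user intervention instead of loading them. The field names and validation rules are hypothetical and are not taken from any of the case-study systems.

# import_checks.py - a minimal sketch of import-time rules and checkpoints:
# rows that fail validation go to a reject file for later correction, and only
# accepted rows would be passed on to the warehouse load step.
import csv
from datetime import datetime

def valid(row: dict) -> bool:
    """Checkpoint rules (hypothetical): required keys present, premium is a
    positive number, and the effective date parses as YYYY-MM-DD."""
    try:
        return (row["contract_id"].strip() != ""
                and float(row["premium"]) > 0
                and bool(datetime.strptime(row["effective_date"], "%Y-%m-%d")))
    except (KeyError, ValueError):
        return False

def import_with_checks(extract_path: str, reject_path: str = "rejects.csv"):
    accepted, rejected = [], []
    with open(extract_path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            (accepted if valid(row) else rejected).append(row)
    if rejected:
        with open(reject_path, "w", newline="", encoding="utf-8") as f:
            writer = csv.DictWriter(f, fieldnames=rejected[0].keys())
            writer.writeheader()
            writer.writerows(rejected)
    return accepted, len(rejected)

if __name__ == "__main__":
    good, bad = import_with_checks("contracts_20101001.csv")
    print(f"{len(good)} rows accepted, {bad} rows written to rejects.csv")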
The timing and frequency of the data import also have a large impact on quality [Ang & Teo, Management issues in data warehousing: insights from the Housing and Development Board (December 6, 1999)]. Data warehouse quality is easiest to maintain and support if the users are knowledgeable and have a solid understanding of the business processes. Training the users not only in how to build queries but also in the underlying data warehouse structure enables them to identify inconsistencies much faster and to highlight potential issues early in the process [Ang & Teo, Management issues in data warehousing: insights from the Housing and Development Board (December 6, 1999)]. Any changes to the data tables, structure, or linkages, and the addition of new data fields, must be reviewed with the entire team of users and support staff in order to ensure a consistent understanding of the risks and challenges that might occur.

Managing Costs

Data warehouse projects need not be multi-million dollar undertakings. Any organization, regardless of size, can benefit from data warehousing technologies, even when working with a limited budget. Although it may seem difficult, choices can be made that allow the benefits of data warehousing to be realized while also minimizing costs. By balancing technology choices and carefully positioning the business, an organization can quickly create cost-effective solutions using data warehousing technologies. A few simple rules that can help an organization develop a data warehouse on a small budget include using what it already has, using its existing knowledge, using what is free (software or hardware), buying only what it must, thinking and building in phases, and using each phase to finance or justify the remainder of the project [Brian Babineau, IBM Information Infrastructure Initiative Tames the Information Explosion (April 2009)]. A common organizational mistake in data warehousing is to ignore currently owned technology and software capabilities and move ahead quickly to purchase a different product [Nathan Rawling, Data Warehousing on a Shoestring Budget (May 2008)]. An organization often has the toolset in place to effectively meet its data warehousing demands but simply needs the right partner to fully implement its current technology.

APPENDIX

Questionnaire

Issues affecting the data warehouse efficiency

This questionnaire is for my MS project, which deals with issues that affect data warehouse efficiency. The goal is to narrow down the major issues. Kindly fill in this questionnaire based on your experience at your workplace. Thank you for contributing!

1. What is your role in the data warehousing environment in your organization?
a) Manager/Director
b) Project Manager
c) Database Administrator
d) Data Architect
e) Applications Programmer
f) System Administrator
g) End User

2. What is the size of the data warehouse you work on?
a) <1000 GB
b) <10 TB
c) >10 TB

3. Please rate the following features based on their importance to you (from 1 to 5, 1 being the highest).
a) Easy access to data
b) Efficient performance
c) Data currency (recency)
d) Data accuracy
e) Implementation and maintenance costs
f) Other (please specify)

4. Which of the following problems do you face in your data warehousing environment? Briefly describe the issue.
a) Problem in accessing/restoring data
b) Inefficiency in performance
c) Inaccurate data
d) Obsolete infrastructure

5. Please describe briefly what other problem areas you come across in your work environment.
BIBLIOGRAPHY

Powell, Gavin. Oracle Data Warehouse Tuning for 10g (2005). Elsevier Inc.

English, Larry P. Improving Data Warehouse and Business Information Quality: Methods for Reducing Costs and Increasing Profits (1999). Wiley Computer Publishing.

Vassiliadis, Panos. Data Warehouse Modeling and Quality Issues (June 2000). Retrieved from http://www.cs.uoi.gr/~pvassil/publications/publications.html

Ang, James, and Teo, Thompson S.H. Management issues in data warehousing: insights from the Housing and Development Board (December 6, 1999). Retrieved from http://www.cse.dmu.ac.uk/~ieb/Data%20warehouse%20tutorial.pdf

Babineau, Brian. IBM Information Infrastructure Initiative Tames the Information Explosion (April 2009). Retrieved from http://whitepapers.businessweek.com/detail/RES/1287588950_343.html

DeWitt, David J., Madden, Samuel, and Stonebraker, Michael. How to Build a High-Performance Data Warehouse. Retrieved from http://db.csail.mit.edu/madden/high_perf.pdf

Inmon, Bill. The Data Warehouse Environment: Quantifying Cost Justification and Return on Investment (November 2000). Retrieved from http://www.crmodyssey.com/Documentation/Documentation_PDF/Data%20Warehouse_Environment_Cost_Justification_and_ROI.pdf

Data Warehouse Design Considerations. Retrieved from http://msdn.microsoft.com/en-us/library/aa902672(SQL.80).aspx

Rawling, Nathan. Data Warehousing on a Shoestring Budget (May 2008). Retrieved from http://esj.com/articles/2008/05/07/data-warehousing-on-a-shoestring-budget--part-1-of-3.aspx

Hansen, Dain. Cost Savings are the New Black for Data Warehousing (March 19, 2009). Retrieved from http://blogs.oracle.com/dataintegration/2009/03/cost_savings_are_the_new_black_for_data_warehousing.html

Lee, Hae Seok. Kyobo Life Insurance Case Study. Retrieved from http://www.sybase.com/detail?id=1054785

Microsoft Case Studies, Scandinavian Airlines. Retrieved from http://www.microsoft.com/casestudies/Case_Study_Detail.aspx?CaseStudyID=200330

Kalido Case Study, Philips Builds Best-Practice Information Management Architecture. Retrieved from http://www.kalido.com/Collateral/Documents/EnglishUS/CS-Phillips.pdf

Microsoft MSDN Library <http://msdn.microsoft.com/en-us/library/default.aspx>

<http://www.exforsys.com/tutorials/data-warehousing.html>

<http://www.deere.com/en_US/deerecom/usa_canada.html>

<http://www.usa.philips.com/about/company/index.page>

<http://www.dwinfocenter.org>

<http://www.academictutorials.com/data-warehousing>