Data Warehousing Implementations: A Review

Data Warehousing Implementations: A Review Abdulrahman R. Alazmi Data Warehousing Implementations: A Review Abdulrahman R. Alazmi College of Engineering, Kuwait University, abdulrahmanr.alazmi@ku.edu.kw Abstract Any organizational body of substantial size has huge data logs that are stored in Data Warehouses; these Data Warehouses can store large amounts of data that chronicles the company’s transactions. The importance of Data Warehousing lies in their criticality in providing managers and business experts the information needed to make decisions that would shape the future of the company, because the Data Warehouses provide a single common model for multiple data from multiple sources. Data Warehouses are often large and can obtain data from several sources from the many systems used by the company; this would make the Data Warehouses difficult as they grow in size. The two main dominant approaches in implementing Data Warehouses are the Top-down, and Bottom-up. These approaches have two flavors as well, “dimensional,” and “normalized.” This review would shed light on the many approaches and implementations in data Warehousing designs, which include the dimensional, normalized, hybrid, and federated designs. In addition, how the company’s business intelligence is affected under these implementations. Keywords: Data Warehouse, Data warehouse Architecture, Top-down, Bottom-up, Dimensional, normalization, Hybrid, Federated, Decision Support Systems 1. Introduction Data Warehousing DW is used for analysis, reporting, and data storage, the data they store are of large size and time-variant. A DW serves as the middleware between the transactional applications and decision support applications, and are often fed from the company’s Customer Relationship Management CRM software, and\or the Enterprise Resource Planning ERP applications [1] [2] [3] [4] [5]. DW has three functional stages: staging, integration, and access. The first is used when obtaining the data. The second is used by the DW to keep abstraction among several heterogeneous data. The third is used when extracting data from the DW. Among the tasks assigned to DW are data dictionaries, data extraction, and transformation. Managers use DWs repositories as input for data mining and market research, while this may not be the sole goal of a DW solution, it still is among its chief objectives [6] [4] [7] [8] [9] [10]. Some business analyst use spreadsheet software for data analysis with On-Line Transaction Processing OLTP tools. However, the data stored in the DW and its level of quality would determine any data mining support efficiency [1] [9]. A high quality data is up to date, consistent, and comparable [10]. Considering using a spreadsheet over a DW solution, a user would have to use the transactional data, which may contain redundancies. Moreover, since this option uses raw data, which need to be queried and summarized several times, this complicates and prolong the efficiency of the data mining procedure. On the other hand, a DW solution would provide OLAP tools with much flexibility to deal with data summaries, provide consistent view of data across several departments, and would eliminate the need for a user to the sometimes complex scripting languages, or the dependence on expertise, history, and guesses [1] [3]. DW helps in data mining for knowledge discovery, risk assessment, and in providing input for decision support systems [13]. A successful DW deployment would help the company to transform its raw data to information; this information can then be used to extract knowledge and intelligence for the strategies and future policies of the company, while avoiding low quality data, such as incomplete data, or mistranslations [9] [14] [15] [16]. DWs are not synonym for a database; they do not include all the transactional data stored historically. However, DWs may contain parts of the transactional data, summarized data for analysis and decision-making support. A good installation of a DW can help in identifying ways to increase revenues, find trends in the customers’ behavior, and find weaknesses in any desired department. DW are an eager approach to data integration, as data are extracted from the operational log, transformed and merged with other relevant information, and are ready for any query made to it by the clients [10] [17] [18]. Moreover, DW solutions are far more expensive than a database system, and the risks involved as well as the potential of benefits are huge [9] [13]. The success of a DW deployment will International Journal on Data Mining and Intelligent Information Technology Applications(IJMIA) Volume4, Number1, June 2014 9 Data Warehousing Implementations: A Review Abdulrahman R. Alazmi depend on its relation to business intelligence and performance management, as with any business intelligence systems. The adoption of a DW solution is proportionality increasing with the increase in the company size, number of staff, and the company’s age, not only that but in other fields such as Human Resource HR, Finance, and storage for the company’s old legacy systems [5] [10][19]. The way the DW was organized and structured would determine the speed and efficiency of the extracted data; some approaches provide simplicity in build and operation but may provide a challenge when adding new structures of data. While other approaches may be difficult to plan at first, but provide scalability adaptability as the DW grow in size and spectrum. A well-constructed DW would balances both sides, the ease of implementation, and the ease of expansion and addition, while not compromising extracted data accuracy, efficiency, and speed [20] [21] [17]. The review in this paper would give a study of the trends in designing DW organizations and structure, and give some results – to a degree- as what are the gains and benefit in each design implementation and their affect over decision-making. Fig. 1 shows the DW solution architecture. This paper is organized as follows; in Section II the related works will be discussed, these cover the different approaches in DW designs and studies done to evaluate them; Section III will discuss the approaches and the different implementations; in Section IV, results of evaluation done to each approach and implementation will be given. In section V, future issues in DW will be briefly discussed. Finally, section VI will draw the conclusions and final remarks of the study. Operational Data Storage Area Business Intelligence Data Marts Strategic Planning CRM Input ETL Tools DATABAS E Data Dictionaries OLAP Tools DATA WAREHOUS E Data Mining ANALY SIS DATA WAREHOUSE ENVIRONMENT Figure 1. A Data Warehouse environment 2. Related Work Deciding the approach and implementation of the DW is fundamental in the data storage and data analysis needs of the company. Several works chronicle this phase, where the basic needs plus the nature of the company would govern the choices in selecting what strategy is to be followed in implementing the DW. A deployed DW solution would also set the quality and efficiency of the analysis and reporting tools used by the end-users, such as managers and business analysts [1] [22]. For example, having sufficient models for each department well represented in the DW tables and relations with minimal redundancy, would make the data analysis accurate. A DW is a solution to data mining and business analysis, not a product, and therefore, it maintaining the DW is a continuing process that the chosen design and implementations would govern, and needs to be updated, maintained, and optimized, in according to choice made [7] [19] [16]. The survey done by T. Ariyachandra and H. Watson [23] covers the same topic as in this study, which is an answer to which data warehouse architecture, is best? The study identifies five data warehousing architectures, independent data marts, bus, hub and spoke, centralized, 10 Data Warehousing Implementations: A Review Abdulrahman R. Alazmi and federated. The first is the basic building blocks of any DW. They are the summarized transactional data. The second architecture, the bus, begins with a single data mart, and then gradually builds the bus with additional conforming data marts which have shared dimensions. The third, begins with an enterprises wide view, and gradually builds the needs of the departments. The data in this model is third normal form 3N [24]. The fourth, the centralized architecture is a logical version of the third. There are no independent data marts, only atomic level data, summarized, and dimensional views. The fifth, the federated, is when several decision support systems and several DW solutions exist. It is the meta-data management among these system, to provide a centralized view and control in the entire project. In our study, we shall study most of the architectures identified here, yet the classification differs, the second is the Kimball/Bottom-up approach and implementation, the third is the Bill Inmon/Top-down approach and implementation, the fourth is the Hybrid approach, and the fifth is the federated approach. The study had 20 data warehousing experts participating in its survey form (Bill Inmon and Ralph Kimball, two DW pioneers, were among them). Some of the survey results were the following: the most dominant architecture is the hub and spoke, followed by the bus. A DW solution supports specific business units slightly more than the entire enterprise. The bus architecture implementation took less time than the federated or hub and spoke architectures. The study concludes that no architecture is the dominant winner, yet each comes close to one another, and it depends on the company’s needs rather than the architecture itself. In our study, we wish to survey the aspects in these architectures, while not empirically, but rather based on the literature and technologies in the DW field. In [25] the same authors, T. Ariyachandra and H. Watson, have done a survey that involves the organizational factors that will drive the decision in selecting the data warehouse’s architecture. These factors include information interdependence, urgency, resource constraints, and strategic views among many others. Their survey was based on literature, theory and expert interviews; it spans independent to dependent data marts, enterprise data warehouses and all the way to the federated data warehouses. The DW solution implementation needs the implementers to take a look at the whole of the enterprise need of data, and to have in mind the huge budget that is wagered on the project. The IT people in the organization and their prowess provide the premises for selecting architecture, and must be aware of the emerging technologies that provide new tools and add-on packages to serve new and existing DW solutions. In this review we shall focus on the DW architectures’ features, rather than the organizational factors. In [22] the paper discusses a U.S. retailing company and its decision in selecting a DW solution. Above the many software and hardware choices a company needs to decide on for its data storage DW, the approach and implementations of the DW must also be decided. The two dominant paradigms are the dimensional and normalized, invented and popularized each by their founders, Ralph Kimball and Bill Inmon, respectively. Both approaches are further discusses in the next section. The choice made is based on what paradigm is more suited for the retailing company. While the paper doesn’t give detailed comparison (which is in this review we will see) but the choice was the normalized approach. Reasons for the choice are its broadness in view, its adaptability to requirements change, it application-independence, its atomic level stored data. The review given in [20] shows an ontological study over the various DW approaches and implementations. The paper reviews 30 DW commercial processes used in the market. The goal is to provide a roadmap for the standardization of the many heterogeneous components found in DW solutions. This could be the standardization of the metadata, since DW are metadata repositories. This would enable companies and enterprises to have a better view in selecting what to include in its DW solutions, and their interoperability. The DW design has two main components: the DW Architecture DWA and DW Process DWP (called DW Approach and DW Implementation in this paper, respectively). The first is that DWs are used with Extraction, Transformation, and Loading ETL tools. These enable for an On-Line Analytical Processing OLAP, which allows for complex and summarized data to be analyzed more efficiently, as opposed to OLTP, where transactional data is queried for analysis [26] [8]. OLTP systems perform structured, predefined and repetitive tasks and queries for data update and day to day information; they deal with smaller sets of data than data found in the 11 Data Warehousing Implementations: A Review Abdulrahman R. Alazmi DW repositories [18]. The second, the DWP, includes the business requirements analysis, data formats, and deployment. While the study clearly draws the for the many steps in designing and implementing a DW solution such as Requirement Analysis by interview or templates among many others; architectural design to be hybrid, dimensional, or normalized; the Extraction to be immediate or deferred; and finally the end-user application can be OLAP or Client-Server among many others. The study doesn’t compare a design choice over another, but instead, gives a gallery of choices to choose from. In this study we hope we can clarify about the much debated on subject of selecting which approach to optimize an enterprise. Since data quality and management are affected by the DW structure, the findings in [27] explore this situation. Measurement of data quality implies having some parameters for each data set to define its value such as its completeness, and integrity. The approach defines a DW implementation to support better data mining and data quality measurement support, the DW contained software version control and code information. A dimensional approach was used, because of its high level abstraction of data [28]. Additional facts were added to describe the state of the programs, while additional dimensions were added to describe program event to the information. Two analytical tools were developed to enhance the modified DW data mining and quality levels. The first is change history miner, which enables the end-user to discover what programs are usually modified together, and so have inter-dependence between them. The second tool is the corrective maintenance dashboard; this tool enables a single view for multiple systems against quality factors, sometimes even manual data quality checks by experts are needed. In this review the effects of the DW structure on data analysis tools will show [16]. In [29] it is stated that the Business Intelligence BI tools such as the reporting and analysis tools, and data mining capabilities are affected by the DW solutions implementation, a lack of business requirement or an inappropriate modeling. A business-model driven DW design process is proposed to remedy the situation of DW implementation that would improve the BI and data mining support. Several aspects add to the problem, these include issues relating miscommunication between the Information Technology IT and business experts, issues involving changes to the requirements during development, and issues involving. The businessmodel driven DW process is composed of business-model software that governs the whole development cycle, a staging area for multiple data sources that enables a solution wide object model for all the system data, and BI metadata to promote easy configurations in the DW for new changes [19]. In [30] the paper discusses the importance of DW design schema and its effect on the OLAP tools. Deployment and installation of a DW solution is of considerable cost to any firm of ant size, therefore designing a schema that would benefit the company is integral in justifying such high costs. Efficient OLAP requirements elicitation methods are required to develop a good DW schema. The paper suggest four stages for this process, acquisition of OLAP requirements, generation of the star schema, DW schema generation, and mapping the Data Mart/DW. These processes may involve natural language processing for OLAP requirements. The building of DW solutions and their choices and challenges have been discussed in [31]. The paper mentions the two dominant DW design styles, the “Bill Inmon” style, and the “Ralph Kimball” style, two styles of which shall be discussed in this review. The paper sections the DW implementation as two important steps. First are the DW design schema, which is built from the requirements of the business itself, BI, and data mining [19]. The second step, which is the decision-making support, and OLAP integration, this step dictates most of the operational life time of the DW. DW is composed of ETL tools, Restitution tools to deal with BI needs, and Meta-data management to relate and connect data marts stored in DW storage. The paper also demonstrates problems in DW implementations such as dynamic data sources, where a new wrapper has to be used, or the data is inconsistent with the data model in the DW; security in handling data; and data scrubbing where redundant data is to be removed. Future issues include real-time enterprises, and federated implementations, which is discussed in this review. 3. DW Design Approaches Since the inception of DW, the main purpose of the DWs was to serve as a middleware between the operational systems and decision support software [32]. This was no easy task, as companies had data 12 Data Warehousing Implementations: A Review Abdulrahman R. Alazmi flowing from several sources and in different formats, also the decision support software were multiple and would have redundant data across several software. DWs had to clean, transform, and present the pieces of datum for the end-users, hence the three principle functions (staging, integration, and access) also known as ETL process. The choice in choosing the design of the architecture for a DW solution would be decided by a managerial decision, but also affected by the current Information Systems ISs used, resources available in terms of employees, infrastructure, and budget and time constraints [7] [15] [16]. The DWs have seen many milestones. Big contributions are credited to Ralph Kimball, and Bill Inmon, both of which are responsible for the “dimensional” and “normalized” respectively. These two being the main two approaches in DWs designs. Both of these approaches can be represented in Entity Relationship ER diagrams, and both of these approaches can be implemented as Bottom-up or Topdown implementations [33] [34]. Dimensional and normalized approaches are not mutually exclusive [28]. Several new “hybrid” approaches do exist and have received acceptance, as well as “federated” approaches [35] [36]. The choice in selecting a good approach and implementation is fundamental in a DW installation. The questions such as how is the information will be represented, the summarized views will be updated, the additions of new information how will it be carried out? [18]. Independent data marts are also an approach considered to a separate approach [23] [7]. These are data marts that are independent of a DW solution. They shall not be discussed in this review individually, since they are a fundamental building block in developing most of the approaches and implementation which will be discussed next. However, implementing data marts is part of the discussion in the review. 3.1. Dimensional Approach In [37] Ralph Kimball has introduced the DW approach based on dimensions and facts, in which data stored to be extracted and transformed is stored as such, called the dimensional Model. This style is referred to as a star schema [22]. Dimensions are reference information for facts; facts are numerical figures in transactional data. If bill information were stored in a Dimensional DW, dimensions would be items bought or sold, while a fact would be the quantity of the billed item. Information stored in a dimensional DW would have one table with the facts, linked by many dimensions; hence, this model is also called the star schema. Fig. 2 shows an example of a dimensional modeling schema. Dimensional DWs are easy to use by business professionals and managers, those who have little computer science knowledge of the database and data storage concepts, since facts and dimensions are easy to understand and relate [28] [38]. In addition, data marts can be queried faster, since fact and dimension tables are condensed, and they span fewer tables and relations. Its author Ralph Kimball has been an avid advocate for the dimensional model, and had defended his model from criticism such as that the star schema has summarizations and loses some details. Both the dimensional modeling and ER model have similarities and can be represented in terms of the other. Only terminology and the degree of normalization differ from one model to the other [7]. The survey done in [39] gives an overview and analysis of the many dimensional modeling schemas; some such as the Husemann et al. suggest the use of an ER schema first. Pros:  Data is prearranged according to the required outputs [31].  Uses fact and dimensions, often referred to as the “star schema”, can constructs “snowflake” structures, and “constellation” schemas as well, where several fact tables are joined by a shared dimension [28].  Dimensional modeling can summarize data marts easily, and is dominant among business analyst, since no need of the data structure and layout is needed [28] [13].  Mostly used, and optimized with the Bottom-up implementation [37].  Subject oriented and provide a user view of the data [40]. Cons:  The dimensional model is very asymmetric. A single join can relate a single fact table to several dimension tables [7].  A single star join, while it may span several dimensions from different departments such as customers, sales, and time, it is however, usable to a single departmental need. A different department might as well define another star join that would also span several dimensions, just 13 Data Warehousing Implementations: A Review Abdulrahman R. Alazmi to serve it needs. As the process continues, the DW would unnecessarily grow in size, and redundancies would arise [41].  Each new star join is built individually. This will consume as much time and effort as the first and every star join. There is no foundation to build upon, and quicken the process [41]. Figure 2. Dimensional Modeling 3.2. Normalized Approach Bill Inmon, another pioneer in the field of DW, has compared his approach against the dimensional approach in [41]; his model is the “normalized” DW approach, introduced in the early 1990s. In this design, the data stored in the warehouse follow a third normalization form 3N [24] [42]. The first step would be to identify each entity, and the relationships among them. All information to be introduced in the DW must be normalized, which means key information is stored in tables each. If an entry has more than one key value, these keys are stored on separate tables, and a look-up table is made to relate both. Data dependent on value other than the key value would be normalized until every piece of information in the table is dependent only on the key value [24]. Fig. 3 has an example of a 3N relational model. This approach is highly scalable, since each key piece of information is maintained separately. On the other hand, this may make the DW size huge considerably, and querying could span over several tables, across many join operations under relationships. Accessing data in normalized DW require understanding of the DW schema, it keys, relationships, and existing tables. Inmon’s approach has been used to help structure textual data to structured sets of identifiers in [43] [44], given the fact that most DW storage are predominantly numbers, so in order to process text was a challenge, especially to enhance the queries made against the DW [38]. In [45] a natural language method to extract requirements for data mart OLAP processing is proposed. The Inmon approach can help in formatting these textual inputs. Pros:  Data is of atomic 3N relational form [31].  Easily adaptive to new changes [42].  3N form eliminates data redundancy [24].  Independent of the application [22] [31]. Cons:  3N data repositories do not produce efficient queries; sometimes they are slow and need to be done on several queries.  OLAP tools as opposed to OLTP tools prefer de-normalization among data to enhance performance. This is an inherited trait from the nature of OLAP statements such as: “find the 14 Data Warehousing Implementations: A Review Abdulrahman R. Alazmi total sales made by all the customers last week”, unlike an OLTP statement which should read: “find the sales for a particular customer”  Since 3N form makes the data normalized, atomic, and with no redundancy, large number of tables and relationships would exist. These can exist over several servers; this can negatively affect the queries’ performance [38].  More appropriate to databases not DW [7]. 3.3. Bottom-up Implementation In the Bottom-up implementation [46] the data marts are assigned first, as shown in Fig. 4, this would realize the outcome needed from DW storage early in its development to managers and business experts, dimensional DWs are more suitable to this implementation. As more data marts are developed they are added incrementally, and a global policy would slowly emerge from these data marts. Data marts are the summary report of the stored, cleaned, combined, and extracted data. Data marts range from simple facts and dimensions to facts, dimensions, as well as summarized data. Building DW in a bottom-up implementation centers around the building of the first data mart, and then building more data marts, possibly in a different department, and finally linking them in the Bus. The Bus is another fundamental aspect in bottom-up design. The Bus links data marts with shared dimensions – drilling across- [7] [15] [16]. Maintaining the Bus as the DW grows is critical in managing the DW design as it grows in several departments, as integration between data marts happens if each is touched upon by the Bus. The Bottom-up implementation builds iteratively, and some think that this is how the BI and data mining tools are used, not from an enterprise wide single view [47]. Figure 3. 3rd Normal Form Pros:  Given that the data mart are built incrementally, in an average period 9 month for each. This gives the business team to have a better understanding of the data marts [47].  This implementation gives a better estimate of Return On Investment ROI [47].  Can quickly be adapted to solve temporary problems, since the data marts are built individually [47].  Better adapted to a Dimensional approach [37].  Easy implementation, as only individual data marts are developed iteratively.  Less initial cost, as the system would be built incrementally. Cons:  Since in the Bottom-up implementation data marts are easy and quick to build, they may not be compatible with the overall strategy and business model of the company. They may solve immediate problems quickly, but might affect the overall BI of the company [47] [14].  In the long term, maintaining the many data marts that exist, and would most likely make “multiple sources of truth”, and removing redundancies might negatively affect BI and data mining practices.  The Bus must be maintained throughout the building of new data marts, otherwise inefficient and misleading results might be produced [37]. 15 Data Warehousing Implementations: A Review Abdulrahman R. Alazmi Figure 4. Bottom-up Implementation 3.4. Top-down Implementation The Top-down approach, promoted by Bill Inmon, has also seen many successful implementations and applications [41]. Is the design approach where the DW, is the centralized container for the entire company. Fig. 5 demonstrates the implementation. It is built by defining a normalized data model, the data stored is the datum defined at their lowest form. The Top-down approach defines the whole business model for all the data to be used in the company at its first step, the tables, data, and the relationships among them. Top-down implementation requires that every department, business lines, and workgroups in the company to gather, plan, and illicit the requirements needed, that would satisfy the needs for every branch. Data marts needed for Business Intelligence BI are extracted from the DW, which sits as the central Corporate Information Factory CIF and becomes a “single source of truth”. Since the DW design in the Top-down implementation, is broad and large, the tables with high relevance to other tables are grouped under subject areas, and most likely share relationships with them. DW under these implementations has very consistent data marts, since all data marts are computed from the centralized repository of the DW. Given this fact, such DWs are highly scalable, and can adapt to future changes in the data marts requirements. Top-down implementation is preferred to be used the Inmon style normalized DW, since the data are kept under the lowest and in 3N form. This allows the Top-down approach to broad and keeps all data application independent [22]. Pros:  According to [47] this approach eases maintenance, meta-data management, ETL tools integration, and the making of new data mart.  Better adapted to a Normalized approach [41].  In the existence of good IS and centralized Information Technology IT infrastructure that provide reliable hardware and software support, this approach have beneficial effects [7]. Cons:  The Top-down implementation is a very wide and broad project and it is a considerable undertaking that might take a large amount of time and human power [14].  The methodology used in implementation, due to time of development, might not satisfy the end users once launched and online.  After implementation begins no changes can be made to the deployment data structures. 3.5. Hybrid Approach These models use both implementations, the Top-down and Bottom-up in their designs. It focuses on utilizing the user-friendly and performance optimization of the Bottom-up, while keeping the scalability and integration of the Top-down designs. Also the Hybrid approach can be defined as the 16 Data Warehousing Implementations: A Review Abdulrahman R. Alazmi mixture of both approaches, the dimensional and normalized [48] [49] [50]. Fig. 6 illustrates this arrangement. Other research describe the Hybrid approach as using a database, ETL tools, and OLAP tools, in this survey, we consider as the mixture of a normalized and dimensional modeling, as with most implementations and usages [49]. Figure 5. Top-down Implementation Figure 6. A Hybrid Approach This approach starts with an ER diagram for the data marts; these data marts represent the company’s basic needs, and can grow in size as new requirements emerge. The data marts are developed in a dimensional model, and then ETL tools are used to extract data from sources to a nonpersistent staging area (otherwise known as Operational Data Store ODS) and then into the data marts. By now the data marts, containing summarized and atomic data are loaded into the DW, or sometimes even moved to a temporary storage space for cleansing before going into the DW [8] [51]. These data marts might face normalization processes along the way in order to reduce redundancy in the system. Finally, query tools are used to get data marts and atomic data from the DW. This method provides a single source of truth, which is the DW. Pros: The next set is from [49]:  Has the best of implementations, the dimensional and the normalized. Because data can be stored in the normalized form, yet aggregate dimensional queries can be made against them.  When using Materialized View Aggregates MVAs, the Hybrid approach performs better than the dimensional and normalized approaches.  Using ODS helps in data integration from multiple, constantly changing data sources [51] [50] [14].  Rapid development of the DW, independent of the data marts [50]. 17 Data Warehousing Implementations: A Review Abdulrahman R. Alazmi Cons: The next set is from [49]:  Since the hybrid approach is the sum of the dimensional and normalized approaches, it size if the sum of both, minus the union data.  The hybrid approach has lesser performance than the dimensional when using dimensional queries.  The Hybrid approach’s design is more complex to the normalized and dimensional alternatives. Planners need to have good background and experience in both the Top-down and Bottom-up designs to balance the hybrid design, by selecting which department to develop next, and to what degree the scope of the view should be [7]. 3.6. Federated Approach This architecture follows a hub-and-spoke design. Where the hub is linked to all the spokes, and can send and receive information from them. The federated approach, illustrated in Fig. 7, suggests the integration of many DW, and quite possibly other source such as BI tools and CRM feeds. The needs for this approach comes from the fact that a company or enterprise might have a plethora of DW, BI, and decision-making solutions, and to accommodate this, a central hub, which is the federated DW, is found, to allow managers and consultants to have a broad view into the many assets in the company [13] [35] [48]. A federated DW is a more plausible choice than building a very huge and single DW to accommodate all the many DW solutions a company has. This approaches centers around two main concepts, the common business model and the common staging area. The first ensures the consistent use of data names and definitions across the heterogeneous analytical and DW software in the company. New data marts made in any of the many DW solutions would have to conform to the common business model. The second, is used to reduce the redundancy in data marts, enhance performance, and helps in maintaining ETL tools. Because different data marts require differently configured ETL tools, keeping track of these new ETL can be problematic. Therefore, the common staging area would help in restricting and sectioning the ETL tools objectives. First data are staged unto the common staging area, and then fed into the best data marts from the many DWs to optimize performance. This process can also make use of data reengineering and data profiling tools [48]. Figure 7. A Federated DW Architecture In the federated approach it is important to keep the “single source of truth” aspect maintained throughout the system. Therefore, issues such as privacy between the many sources must be maintained, as an extra functionality [40]. There have been several attempts to use the federated databases concepts to implement on a federated DW, details can be found in [52]. The emphasis on using an access layer to remove the data heterogeneous nature is the same as the common staging area in the federated DW. Pros:  When large data repositories need to be analyzed and queried against one another, yet due to legacy formats, legal standards, different sources, or different data models such integration might not be possible using a standard single DW solution. The federated implementation can 18 Data Warehousing Implementations: A Review Abdulrahman R. Alazmi solve this situation by providing a global hub and spoke architecture that governs the many data sources [40].  When acquisitions and reorganizations happen in the company, a federated DW would complement the situation [23].  The staging area in a federated DW provides better performance and optimization for ETL tools [31] [48]. Cons:  Security in a federated data warehouse has a major impact on its functionality. Maintaining data privacy, while using a uniform data model and a unified staging area, becomes difficult. Some organization and standard do not even allow their information to be copied or available in such staging areas. Carefully planned security policies must be deployed when dealing with many sources holding critical information that are sometimes protected by the law, or under standards [40].  When data is private and confidential, the query deployed from a federated DW, maybe be launched, but the processing of the query would be done outside the federated staging area, usually by third parties or the organization holding the data private. This may affect the speed and efficiency of a federated DW [40].  The federated approach had scored the lowest in usage according to [23]. 4. Evaluation Results In this section we will compare the DW architectures we have discussed against common and critical criteria. The normalized and dimensional are mixed with the Top-down and Bottom-up implementations, respectively, as it is the common case for such approaches. Table 1 and Table 2 have the results. The results show that to a degree no DW implementation is an all-out winner, but on the other hand, it shows how each has its area of excellence and areas of weakness. However, the first two show that they are the most common and used overall. This shows just how much the newer trends, such as the Hybrid and Federated approaches, are less in use due to their varied costs, design requirements, and usage. 5. Future Trends and Challenges This section provides the future issues in DW architecture designs, some of which were touched upon by the literature, but no study shows their effect on the selection of DW architecture have been thoroughly explored. 5.1. Security Security has been a major issue in almost every information system, DWs are no exception. The amounts of different architectures we have discussed in this review have different ways of dealing with the data before being committed to the DW repositories. Classic methods suggest the encryption of data before entering the DW. Other methods are imported from the database and federated databases domains, which suggest the use of requirement analysis, logical and physical design. But they are not suitable for DWs because they originated in the database domain, where data is atomic, and operational [53]. 5.2. Outsourcing DW Outsourcing a DW solution might be an attractive choice, since the resources and time for such a project are substantial. Outsourcing a DW might only include parts of the DW, such as the data model, ETL tools, or even the BI transactions [54]. The immediate advantages in outsourcing a DW include the access to global expertise in DW implementations, reduction of planning and resources needed to build the DW solution, 19 Data Warehousing Implementations: A Review Abdulrahman R. Alazmi considering its complexity, and the reduction of costs and flexibility in maintenance and upgrade [5]. However, the promise of lesser costs of DW implementations by an external body, possibly a third party, should not be the sole reason to adopt DW outsourcing [55]. The disadvantages included are the risks of having the company’s BI and data marts exposed to an outside source; switching from one vendor to another might produce contract issues and difficulties; and the design of the DW and its data marts might not be what the company desires but what the vendor provides. In addition, the party responsible for the DW should be aware of how the original company business models work [56]. DW architecture Top-down (normalized) Initial Cost The highest of the surveyed implementations [23] Table 1. Result 1 Initial View Additions The whole organization of the company, department by department [23] Well formulated and documented. The process is as easy as the level of normalization used [24] [42] Bottom-up (Dimensional) The Lowest. Since only data marts are implemented first [46] Departmental. Then it grows to fulfill the organization as needs grow [28] Can be problematic if the data marts do not conform to the bus. Can make way for redundancies. A carefully revised dimensional modeling schema is needed [57] Hybrid Moderate to High. Depends on the amount of vertical (top-down) or horizontal (bottom-up) trends used [49]. Moderate. Can begin by one department or more. Can support independent data marts or an entire DW [58] Structured and scalable. Redundancies can exist due to the dimensional notations used [49] [41] [14] Federated High. Many solutions have to be modified to be compatible with the hub and spoke design [23]. While other existing systems might needn’t modifications at all [25] The whole system of DW solutions, ODSs, and or any data source [25] [52] Must be compliant to the unified business model. Scalable overall [35] [52] 20 Data Warehousing Implementations: A Review Abdulrahman R. Alazmi Table 2. Result 2 DW architecture Maintenance Domain Efficiency Top-down (normalized) Easy to locate. Difficult to optimize because of the presence of many tables and relationship due to 3N [24] The design governs the needs of all the company’s departments [23] 3N is Optimized for databases. For DW and OLAP processing, optimizing the queries can be difficult, because they will most likely cross many tables. At worst case across multiple servers [38]. Bottom-up (Dimensional) Difficult. Because there would occur many overlaps in data marts’ summaries from department to another [41] The design can cover the whole company, when mature. Mostly it is company-wide in most implementation [23] Optimized for OLAP. Dimensional tables and facts can span many data fields, which give very fast querying. Easy to understand [38] Hybrid Needs knowledge of both the Top-down and Bottom-up designs. Designs are complex to understand [7] [49] Company-wide. More details might grow in several departments as needed. Depends on the balance between the Topdown and Bottom-up mix. Using an ODS will unify the ETL tools and models [58] Depending on the schema. However, it is faster than the Relational (normalized) but slower than a pure dimensional design. A Hybrid design include both an ODS and a DW, the ODS maybe the bottleneck and cause slowdown [58] Federated Well formulated if all data sources comply with the unified business model. A large process overall [35] [40] [52] Company-wide. And can even be more, can be the administrator over several data systems in the company [23] [25] The common staging area provides grounds for ETL tools to optimize the data marts before committing them to a DW [31][48]. 6. Concluding Remarks In our study we have evaluated some of the most used designs, which are the Top-down, Bottom-up, Hybrid, and Federated DW designs. As with [23], there were no clear winner in choosing which design implementation, but in the analysis it has been shown that the differences in these designs, in terms of costs, adaptability, scalability, and efficiency, do have contrasting values. These results would help researchers to investigate further these designs, as well as IT or managers in deciding what DW flavor to adopt. A DW design choice ultimately be affected by the needs of the company, a manufacturing firm would have different BI and data mining needs than a monetary body, and therefore a different DW design and implementation would be selected. The Top-down used with the normalized design, had the highest spectrum in terms of the business data model used in the company followed by the Federated, its initial cost is high, but it provided high scalability. The Bottom-down used with the dimensional design, had the highest lowest initial cost, simplest data models, but procured difficult data mart management, as redundancies and difficulties in adding new data marts did appear. The Hybrid implementation had stroke a balance between the two aforementioned implementations. But it had high complexity designs, because it mixes a normalized with dimensional notations in its data models. Its performance was close to a dimensional model in terms of OLAP efficiency. The Federated implementation had high initial costs, a system wide view that can span all of the company’s data sources. The Federated approach needs a reliable and fault 21 Data Warehousing Implementations: A Review Abdulrahman R. Alazmi tolerant hub and spoke infrastructure to provide high interoperability. If the common business model and common staging area are well configured, this approach can enhance the OLAP and BI transactions done to the DWs data marts. Research on DW and its designs needs further empirical studies to show just how much these designs differ in usage and actual performance [23] [5]. Security, DW outsourcing, ROI on DW projects, and DW migration, portability, and interoperability need further investigation in DW, and their effects on the company’s BI. 7. References [1] The Data Warehousing Information Center. LGI Systems Incorporated. Available at: http://www.dwinfocenter.org/against.html. Retrieved on 23-October-2011 [2] Data Warehouse Platform Acceleration. Syncsort. Available at: http://www.syncsort.com/Portals/0/Resources/DMX_Solution_DataWarehouseAcceleration_scree n.pdf. Retrieved 25-October-2011 [3] Frank Hayes. The Story So Far. COMPUTERWORLD. Available at: http://www.computerworld.com/s/article/70102/The_Story_So_Far?taxonomyId=009. Retrieved 2011-October-23 Sunday [4] Colin White. The Federated Data Warehouse. Information Management. Available at: http://www.information-management.com/issues/20000301/1953-1.html. Retrieved 2011-October23 [5] Kimball Group. Resources .Available at: http://www.kimballgroup.com/html/articles_search/articles1997/9708d15.html. Retrieved 25October-2011 [6] Jeff Lawyer, and Shamsul Chowdhury, “Best Practices in Data Warehousing to Support Business Initiatives and Needs,” In the 37th Annual Hawaii International Conference on System Sciences, pp. 80223a, 2004. [7] Ralph Kimball, and Margy Ross, The Data Warehouse Toolkit: The Complete Guide to Dimensional Modeling, Wiley Publishing: USA, 2002. [8] Bill Inmon. The Problem with Dimensional Modeling. DM Review Magazine. Available at: http://www.uniriotec.br/~tanaka/SAIN/03-TheProblemWithDimensionalModeling-InmonDMReview-2000.pdf. Retrieved 23-October-2011 [9] Ming-Chang Lee, “ The Combination of Knowledge Management and Data mining with Knowledge Warehouse” in International Journal of Advancements in Computing Technology, vol. 1, no. 1, pp. 39 -45, 2009 [10] Kornelije Rabuzin, “Data Warehousing and Business Intelligence: Understanding What is Really Going on in Higher Education”, in the International Journal of Information Processing and Management, vol. 4, no. 4, pp. 121-129, 2013 [11] Edgar Frank Codd, “A relational model of data for large shared data banks”, In Communications of the ACM, vol. 13, no. 6, pp. 377-387, 1970. [12] InformationWeek. Available at: http://www.informationweek.com/software/030917/615warehouse1_1.jhtml. Retrieved 2011October-23 [13] Infonitive Research. Design & Architectural Approaches for Data Warehouses: Inmon vs Kimball, Federated, Hybrid. Infonitive. Available at: http://infonitive.com/?p=255. Retrieved 23-October2011 [14] Arun Sen, and Atish Sinha, “Toward Developing Data Warehousing Process Standards: An Ontology-Based Review of Existing Methodologies,” In the IEEE Transactions On Systems, Man, And Cybernetics—Part C: Applications And Reviews, vol. 37, no. 1, January 2007. [15] Wilson Mar. Testing Data Warehouses DSS. Available at: http://www.wilsonmar.com/1datawh.htm. Retrieved 23-October-2011 [16] Exforsys Inc. Design of the data warehouse: Kimball Vs Inmon. Available at: http://www.exforsys.com/tutorials/msas/data-warehouse-design-kimball-vs-inmon.html. Retrieved 23-October-2011 22 Data Warehousing Implementations: A Review Abdulrahman R. Alazmi [17] Abhishek Mehta, Tableau Software. Big Data: Powering the Next Industrial Revolution. Available at: http://www.tableausoftware.com/learn/whitepapers/big-data-revolution. Retrieved 23-October2011 [18] 1Keydata. Bill Inmon vs. Ralph Kimball. Available at: http://www.1keydata.com/datawarehousing/inmon-kimball.html. Retrieved 23-October-2011 [19] Bill Inmon. Data Warehousing in a Healthcare Environment. Available at: http://www.tdan.com/view-articles/4584. Retrieved 23rd -October-2011 [20] Methanias Colaço, Manoel Mendonça, and Francisco Rodrigues, “Data Warehousing in an Industrial Software Development Environment,” In the 33rd Annual IEEE Software Engineering Workshop (SEW), pp. 131-135, October 2009. [21] Kalido. Business-Model-Driven Data Warehousing: Keeping Data Warehouses Connected to Your Business. Available at: http://www.kalido.com/Collateral/Documents/English-US/WPBizModelDWa.pdf. Retrieved 25th – October- 2011 [22] David Haertzen, Data Warehousing Tutorial. Data Warehousing and Business Intelligence Introduction. Available at: http://infogoal.com/datawarehousing/overview.htm. Retrieved on the 15th of November 2011 [23] Rashed Salem, Omar Boussaid, and Jerome Darmont, “Conceptual Workflow for Complex Data Integration Using AXML,” In the International Conference on Machine and Web Intelligence ICMWI, pp. 380-385, 2010. [24] Matteo Golfarelli, and Stefano Rizzi, Data Warehouse Design: Modern Principles and Methodologies, McGraw-Hill Osborne Media, Italy, 2009. [25] Anindya Datta, Bongki Moon, and Helen Thomas, “A Case for Parallelism in data Warehousing and OLAP,” in Proceedings of the 9th International Workshop on Database and Expert Systems Applications, pp. 226-231, 1998. [26] Bill Inmon, “Building the Data Warehouse”, Wiley Publishing: Canada, 1994. [27] Olivier Teste, “Towards Conceptual Multidimensional Design In Decision Support Systems,” In Proceedings of the 5th East-European Conference on Advances in Databases and Information Systems ADBIS, pp. 77-88, 2001. [28] Inmon Data Systems. Data warehousing in the health care environment. Available at: http://inmoncif.com/registration/whitepapers/DATA%20WAREHOUSING%20IN%20THE%20H EALTHCARE%20ENVIRONMENTR1.pdf. Retrieved on the 14th of November 2011 [29] Christopher Chute, Scott Beck, Thomas Fisk, and David Mohr, “The Enterprise Data Trust at Mayo Clinic: a semantically integrated warehouse of biomedical data,” in the Journal of the American Medical Informatics Association, vol. 17, no. 2, pp. 131-135, 2010. [30] Jamel Feki, Ahlem Nabli, Hanene Ben-Abdallah , and Faiez Gargouri, “An Automatic Data Warehouse Conceptual Design Approach,” In the 2nd Encyclopedia of Data Warehousing and Mining, John Wang, IGI Global: USA, 2008. [31] Srikanth Rokkam, and Yijie Han, “Data Warehousing: A Functional Overview”, In the International Conference on Data Mining DMIN, pp. 362-369, 2008. [32] Information Management Software, IBM. Data warehouses: improve strategic decision making with coherent views of data. Available at: ftp://ftp.software.ibm.com/software/data/bi/smb-why-warehouse.pdf. Retrieved on the 16th of November of 2011 [33] Christopher Dorobek, GCN. Experts: A good data warehouse improves decision-making. Available at: http://gcn.com/articles/1999/05/experts-a-good-data-warehouse-improvesdecisionmaking.aspx. Retrieved on the 16th of November of 2011 [34] G. Sweety Peter. Data Warehouse-An Overview. Available on: http://www.peterindia.net/DataWarehousingView.html. Retrieved on the 16th of November of 2011 [35] Jerome Darmont, Omar Boussaid, Fadila Bentayeb, "Warehousing Web Data", in the 4th International Conference on Information Integration and Web-based Applications and Services (iiWAS 02), Bandung, Indonesia, pp. 148-152, 2002. [36] Thilini Ariyachandra, and Hugh Watson, “Which Data Warehouse Architecture is best?” in Communications of the ACM, vol. 51, no. 10, pp. 146-147 , 2008. [37] James Madison, ORACLE. Building a Hybrid Data Warehouse Model. Available at: 23 Data Warehousing Implementations: A Review Abdulrahman R. Alazmi http://www.oracle.com/technetwork/articles/madison-models-086845.html. Retrieved on the 14th of November 2011 [38] William McKnight, McKnight Consulting Group. Hybrid Approaches to Business Intelligence. Available at: http://www.information-management.com/issues/20040101/7908-1.html. Retrieved on the 15th of November 2011 [39] Nevena Stolba, Marko Banek, and A. Min Tjoa, “The Security Issue of Federated Data Warehouses In the Area of Evidence-Based Medicine,” in the Proceedings of the 1st Conference on Availability, Reliability and Security ARES, pp. 11, April 2006. [40] Julia Vowler. ComputerWeekly. Data warehouse design from top to bottom. Available at: http://www.computerweekly.com/feature/Data-warehouse-design-from-top-to-bottom. Retrieved on the 15th of November 2011 [41] Microsoft TechNet. Chapter 12 - Data Warehousing Framework. Available at: http://technet.microsoft.com/en-us/library/cc966470.aspx. Retrieved on the 15th of November of 2011 [42] Chuck Ballard, Dirk Herreman, Don Schau, Rhonda Bell, Eunasaeng Kim, and Ann Valencic, IBM. Data Modeling Techniques for Data Warehousing. Available at: www.redbooks.ibm.com/redbooks/pdfs/sg242238.pdf. Retrieved on the 15th of November of 2011 [43] Data Warehousing and Business Intelligence. Available at: http://knol.google.com/k/data-warehousing-and-business-intelligence. Retrieved of 15th of November of 2011 [44] The Hybrid Structure. Logimethods. Available at: http://www.logimethods.com/thoughtleadership-hybrid-data-warehouse.php. Retrieved on the 17th of November 2011 [45] Michel Schneider, “A general model for the design of data warehouses”, In the International Journal of Production Economics, vol. 112, no. 1, pp. 309-325, 2008. [46] Oscar Romero, and Alberto Abello, “A Survey of Multidimensional Modeling Methodologies”, In the International Journal of Data Warehousing & Mining, vol. 5, no. 2, pp. 1-23, 2009. [47] Sajjad Ahmad, and Rohiza Ahmad, “An Improved Security Framework for Data Warehouse: A Hybrid Approach”, In Proceedings of the International Symposium in Information Technology ISIT, pp. 1586-1590, 2010. [48] Stefan Berger, and Michael Schrefl, “From Federated Databases to a Federated Data Warehouse System”, In the 41st Annual Hawaii International Conference on System Science, pp. 394 - 394, 2008. [49] Thilini Ariyachandra, and Hugh Watson, “Key organizational factors in data warehouse architecture selection”, In Decision Support Systems, vol. 49, no. 2, pp. 200–212, 2010. [50] Fahmi Bargui, Jamel Feki, Hanene Ben-Abdallah, “A natural language approach for data mart schema design”, In the 9th International Arab Conference on Information Technology ACIT, 2008. [51] K. Ram Ramamurthy, Arun Sen, and Atish Sinha, “An empirical investigation of the key determinants of data warehouse adoption”, In Decision Support Systems, vol. 44, no. 4, pp. 817841, 2008. [52] Mohammed Tafti, “IT Outsourcing: A Knowledge-Management Perspective”, In Issues in Information Systems, vol. 8, no. 2, pp. 488-493, 2007. [53] Sid Adelman, Information management. Should the Data Warehouse be Outsourced? Available at: http://www.information-management.com/issues/20060501/1053405-1.html. Retrieved on the 5th of December 2011 [54] Roger Llewellyn, Kognitio. ComputerWorldUK. Why IT Departments are Outsourcing Data Warehousing. Available at: http://www.computerworlduk.com/in-depth/outsourcing/2003/why-it-departments-areoutsourcing-data-warehousing/. Retrieved on the 5th of December 2011 [55] Marwa Farhan, Mohammed Marie, Yehia Helmi, and Laila El-Fangary, “Transforming Conceptual Model into Logical Model for Temporal Data Warehouse Security: A Case Study”, International Journal of Advanced Computer Science and Applications, vol. 3, no. 3, pp. 115-122, 2012. [56] Arindam Paul, Varuni Ganesan, Jagat Challa, and Yashvardhan Sharma, “HADCLEAN: A hybrid approach to data cleaning in data warehouses”, In the Proceedings of International Conference on Information Retrieval & Knowledge Management CAMP, pp. 136-142, 2012. 24 Data Warehousing Implementations: A Review Abdulrahman R. Alazmi [57] Mahmoud Boufaida, and Louradi Bradji, “Open User Involvement in Data Cleaning for Data Warehouse Quality”, International Journal of Digital Information and Wireless Communications, vol. 1, no. 2, pp.536-544, 2011. [58] Junwei Di, Zhanbao Gao, and Limei Zhang, “PHM framework design based on data warehouse”, In the IEEE Conference on Prognostics and System Health Management PHM, pp. 1-5, 2012. 25

Data Warehousing Implementations: A Review

Related documents

Products

Support

Data Warehousing Implementations: A Review

Related documents

Add this document to collection(s)

Add this document to saved

Suggest us how to improve StudyLib