Data Warehousing Implementations: A Review

advertisement
Data Warehousing Implementations: A Review
Abdulrahman R. Alazmi
Data Warehousing Implementations: A Review
Abdulrahman R. Alazmi
College of Engineering, Kuwait University, abdulrahmanr.alazmi@ku.edu.kw
Abstract
Any organizational body of substantial size has huge data logs that are stored in Data
Warehouses; these Data Warehouses can store large amounts of data that chronicles the company’s
transactions. The importance of Data Warehousing lies in their criticality in providing managers and
business experts the information needed to make decisions that would shape the future of the company,
because the Data Warehouses provide a single common model for multiple data from multiple sources.
Data Warehouses are often large and can obtain data from several sources from the many systems
used by the company; this would make the Data Warehouses difficult as they grow in size. The two
main dominant approaches in implementing Data Warehouses are the Top-down, and Bottom-up.
These approaches have two flavors as well, “dimensional,” and “normalized.” This review would shed
light on the many approaches and implementations in data Warehousing designs, which include the
dimensional, normalized, hybrid, and federated designs. In addition, how the company’s business
intelligence is affected under these implementations.
Keywords: Data Warehouse, Data warehouse Architecture, Top-down, Bottom-up, Dimensional,
normalization, Hybrid, Federated, Decision Support Systems
1. Introduction
Data Warehousing DW is used for analysis, reporting, and data storage, the data they store are of
large size and time-variant. A DW serves as the middleware between the transactional applications and
decision support applications, and are often fed from the company’s Customer Relationship
Management CRM software, and\or the Enterprise Resource Planning ERP applications [1] [2] [3] [4]
[5]. DW has three functional stages: staging, integration, and access. The first is used when obtaining
the data. The second is used by the DW to keep abstraction among several heterogeneous data. The
third is used when extracting data from the DW. Among the tasks assigned to DW are data dictionaries,
data extraction, and transformation. Managers use DWs repositories as input for data mining and
market research, while this may not be the sole goal of a DW solution, it still is among its chief
objectives [6] [4] [7] [8] [9] [10]. Some business analyst use spreadsheet software for data analysis
with On-Line Transaction Processing OLTP tools. However, the data stored in the DW and its level of
quality would determine any data mining support efficiency [1] [9]. A high quality data is up to date,
consistent, and comparable [10]. Considering using a spreadsheet over a DW solution, a user would
have to use the transactional data, which may contain redundancies. Moreover, since this option uses
raw data, which need to be queried and summarized several times, this complicates and prolong the
efficiency of the data mining procedure. On the other hand, a DW solution would provide OLAP tools
with much flexibility to deal with data summaries, provide consistent view of data across several
departments, and would eliminate the need for a user to the sometimes complex scripting languages, or
the dependence on expertise, history, and guesses [1] [3]. DW helps in data mining for knowledge
discovery, risk assessment, and in providing input for decision support systems [13]. A successful DW
deployment would help the company to transform its raw data to information; this information can then
be used to extract knowledge and intelligence for the strategies and future policies of the company,
while avoiding low quality data, such as incomplete data, or mistranslations [9] [14] [15] [16].
DWs are not synonym for a database; they do not include all the transactional data stored
historically. However, DWs may contain parts of the transactional data, summarized data for analysis
and decision-making support. A good installation of a DW can help in identifying ways to increase
revenues, find trends in the customers’ behavior, and find weaknesses in any desired department. DW
are an eager approach to data integration, as data are extracted from the operational log, transformed
and merged with other relevant information, and are ready for any query made to it by the clients [10]
[17] [18]. Moreover, DW solutions are far more expensive than a database system, and the risks
involved as well as the potential of benefits are huge [9] [13]. The success of a DW deployment will
International Journal on Data Mining and Intelligent Information Technology Applications(IJMIA)
Volume4, Number1, June 2014
9
Data Warehousing Implementations: A Review
Abdulrahman R. Alazmi
depend on its relation to business intelligence and performance management, as with any business
intelligence systems. The adoption of a DW solution is proportionality increasing with the increase in
the company size, number of staff, and the company’s age, not only that but in other fields such as
Human Resource HR, Finance, and storage for the company’s old legacy systems [5] [10][19].
The way the DW was organized and structured would determine the speed and efficiency of the
extracted data; some approaches provide simplicity in build and operation but may provide a challenge
when adding new structures of data. While other approaches may be difficult to plan at first, but
provide scalability adaptability as the DW grow in size and spectrum. A well-constructed DW would
balances both sides, the ease of implementation, and the ease of expansion and addition, while not
compromising extracted data accuracy, efficiency, and speed [20] [21] [17]. The review in this paper
would give a study of the trends in designing DW organizations and structure, and give some results –
to a degree- as what are the gains and benefit in each design implementation and their affect over
decision-making. Fig. 1 shows the DW solution architecture.
This paper is organized as follows; in Section II the related works will be discussed, these cover the
different approaches in DW designs and studies done to evaluate them; Section III will discuss the
approaches and the different implementations; in Section IV, results of evaluation done to each
approach and implementation will be given. In section V, future issues in DW will be briefly discussed.
Finally, section VI will draw the conclusions and final remarks of the study.
Operational
Data
Storage
Area
Business
Intelligence
Data Marts
Strategic
Planning
CRM Input
ETL
Tools
DATABAS
E
Data
Dictionaries
OLAP
Tools
DATA
WAREHOUS
E
Data
Mining
ANALY
SIS
DATA WAREHOUSE ENVIRONMENT
Figure 1. A Data Warehouse environment
2. Related Work
Deciding the approach and implementation of the DW is fundamental in the data storage and
data analysis needs of the company. Several works chronicle this phase, where the basic needs
plus the nature of the company would govern the choices in selecting what strategy is to be
followed in implementing the DW. A deployed DW solution would also set the quality and
efficiency of the analysis and reporting tools used by the end-users, such as managers and
business analysts [1] [22]. For example, having sufficient models for each department well
represented in the DW tables and relations with minimal redundancy, would make the data
analysis accurate. A DW is a solution to data mining and business analysis, not a product, and
therefore, it maintaining the DW is a continuing process that the chosen design and
implementations would govern, and needs to be updated, maintained, and optimized, in
according to choice made [7] [19] [16].
The survey done by T. Ariyachandra and H. Watson [23] covers the same topic as in this
study, which is an answer to which data warehouse architecture, is best? The study identifies
five data warehousing architectures, independent data marts, bus, hub and spoke, centralized,
10
Data Warehousing Implementations: A Review
Abdulrahman R. Alazmi
and federated. The first is the basic building blocks of any DW. They are the summarized
transactional data. The second architecture, the bus, begins with a single data mart, and then
gradually builds the bus with additional conforming data marts which have shared dimensions.
The third, begins with an enterprises wide view, and gradually builds the needs of the
departments. The data in this model is third normal form 3N [24]. The fourth, the centralized
architecture is a logical version of the third. There are no independent data marts, only atomic
level data, summarized, and dimensional views. The fifth, the federated, is when several
decision support systems and several DW solutions exist. It is the meta-data management
among these system, to provide a centralized view and control in the entire project. In our study,
we shall study most of the architectures identified here, yet the classification differs, the second
is the Kimball/Bottom-up approach and implementation, the third is the Bill Inmon/Top-down
approach and implementation, the fourth is the Hybrid approach, and the fifth is the federated
approach.
The study had 20 data warehousing experts participating in its survey form (Bill Inmon and
Ralph Kimball, two DW pioneers, were among them). Some of the survey results were the
following: the most dominant architecture is the hub and spoke, followed by the bus. A DW
solution supports specific business units slightly more than the entire enterprise. The bus
architecture implementation took less time than the federated or hub and spoke architectures.
The study concludes that no architecture is the dominant winner, yet each comes close to one
another, and it depends on the company’s needs rather than the architecture itself. In our study,
we wish to survey the aspects in these architectures, while not empirically, but rather based on
the literature and technologies in the DW field.
In [25] the same authors, T. Ariyachandra and H. Watson, have done a survey that involves
the organizational factors that will drive the decision in selecting the data warehouse’s
architecture. These factors include information interdependence, urgency, resource constraints,
and strategic views among many others. Their survey was based on literature, theory and expert
interviews; it spans independent to dependent data marts, enterprise data warehouses and all the
way to the federated data warehouses. The DW solution implementation needs the implementers
to take a look at the whole of the enterprise need of data, and to have in mind the huge budget
that is wagered on the project. The IT people in the organization and their prowess provide the
premises for selecting architecture, and must be aware of the emerging technologies that
provide new tools and add-on packages to serve new and existing DW solutions. In this review
we shall focus on the DW architectures’ features, rather than the organizational factors.
In [22] the paper discusses a U.S. retailing company and its decision in selecting a DW
solution. Above the many software and hardware choices a company needs to decide on for its
data storage DW, the approach and implementations of the DW must also be decided. The two
dominant paradigms are the dimensional and normalized, invented and popularized each by
their founders, Ralph Kimball and Bill Inmon, respectively. Both approaches are further
discusses in the next section. The choice made is based on what paradigm is more suited for the
retailing company. While the paper doesn’t give detailed comparison (which is in this review
we will see) but the choice was the normalized approach. Reasons for the choice are its
broadness in view, its adaptability to requirements change, it application-independence, its
atomic level stored data.
The review given in [20] shows an ontological study over the various DW approaches and
implementations. The paper reviews 30 DW commercial processes used in the market. The goal
is to provide a roadmap for the standardization of the many heterogeneous components found in
DW solutions. This could be the standardization of the metadata, since DW are metadata
repositories. This would enable companies and enterprises to have a better view in selecting
what to include in its DW solutions, and their interoperability.
The DW design has two main components: the DW Architecture DWA and DW Process
DWP (called DW Approach and DW Implementation in this paper, respectively). The first is
that DWs are used with Extraction, Transformation, and Loading ETL tools. These enable for
an On-Line Analytical Processing OLAP, which allows for complex and summarized data to be
analyzed more efficiently, as opposed to OLTP, where transactional data is queried for analysis
[26] [8]. OLTP systems perform structured, predefined and repetitive tasks and queries for data
update and day to day information; they deal with smaller sets of data than data found in the
11
Data Warehousing Implementations: A Review
Abdulrahman R. Alazmi
DW repositories [18]. The second, the DWP, includes the business requirements analysis, data
formats, and deployment. While the study clearly draws the for the many steps in designing and
implementing a DW solution such as Requirement Analysis by interview or templates among
many others; architectural design to be hybrid, dimensional, or normalized; the Extraction to be
immediate or deferred; and finally the end-user application can be OLAP or Client-Server
among many others. The study doesn’t compare a design choice over another, but instead, gives
a gallery of choices to choose from. In this study we hope we can clarify about the much
debated on subject of selecting which approach to optimize an enterprise.
Since data quality and management are affected by the DW structure, the findings in [27]
explore this situation. Measurement of data quality implies having some parameters for each
data set to define its value such as its completeness, and integrity. The approach defines a DW
implementation to support better data mining and data quality measurement support, the DW
contained software version control and code information. A dimensional approach was used,
because of its high level abstraction of data [28]. Additional facts were added to describe the
state of the programs, while additional dimensions were added to describe program event to the
information. Two analytical tools were developed to enhance the modified DW data mining and
quality levels. The first is change history miner, which enables the end-user to discover what
programs are usually modified together, and so have inter-dependence between them. The
second tool is the corrective maintenance dashboard; this tool enables a single view for multiple
systems against quality factors, sometimes even manual data quality checks by experts are
needed. In this review the effects of the DW structure on data analysis tools will show [16].
In [29] it is stated that the Business Intelligence BI tools such as the reporting and analysis
tools, and data mining capabilities are affected by the DW solutions implementation, a lack of
business requirement or an inappropriate modeling. A business-model driven DW design
process is proposed to remedy the situation of DW implementation that would improve the BI
and data mining support. Several aspects add to the problem, these include issues relating
miscommunication between the Information Technology IT and business experts, issues
involving changes to the requirements during development, and issues involving. The businessmodel driven DW process is composed of business-model software that governs the whole
development cycle, a staging area for multiple data sources that enables a solution wide object
model for all the system data, and BI metadata to promote easy configurations in the DW for
new changes [19].
In [30] the paper discusses the importance of DW design schema and its effect on the OLAP
tools. Deployment and installation of a DW solution is of considerable cost to any firm of ant
size, therefore designing a schema that would benefit the company is integral in justifying such
high costs. Efficient OLAP requirements elicitation methods are required to develop a good DW
schema. The paper suggest four stages for this process, acquisition of OLAP requirements,
generation of the star schema, DW schema generation, and mapping the Data Mart/DW. These
processes may involve natural language processing for OLAP requirements.
The building of DW solutions and their choices and challenges have been discussed in [31].
The paper mentions the two dominant DW design styles, the “Bill Inmon” style, and the “Ralph
Kimball” style, two styles of which shall be discussed in this review. The paper sections the
DW implementation as two important steps. First are the DW design schema, which is built
from the requirements of the business itself, BI, and data mining [19]. The second step, which is
the decision-making support, and OLAP integration, this step dictates most of the operational
life time of the DW. DW is composed of ETL tools, Restitution tools to deal with BI needs, and
Meta-data management to relate and connect data marts stored in DW storage. The paper also
demonstrates problems in DW implementations such as dynamic data sources, where a new
wrapper has to be used, or the data is inconsistent with the data model in the DW; security in
handling data; and data scrubbing where redundant data is to be removed. Future issues include
real-time enterprises, and federated implementations, which is discussed in this review.
3. DW Design Approaches
Since the inception of DW, the main purpose of the DWs was to serve as a middleware between the
operational systems and decision support software [32]. This was no easy task, as companies had data
12
Data Warehousing Implementations: A Review
Abdulrahman R. Alazmi
flowing from several sources and in different formats, also the decision support software were multiple
and would have redundant data across several software. DWs had to clean, transform, and present the
pieces of datum for the end-users, hence the three principle functions (staging, integration, and access)
also known as ETL process. The choice in choosing the design of the architecture for a DW solution
would be decided by a managerial decision, but also affected by the current Information Systems ISs
used, resources available in terms of employees, infrastructure, and budget and time constraints [7]
[15] [16].
The DWs have seen many milestones. Big contributions are credited to Ralph Kimball, and Bill
Inmon, both of which are responsible for the “dimensional” and “normalized” respectively. These two
being the main two approaches in DWs designs. Both of these approaches can be represented in Entity
Relationship ER diagrams, and both of these approaches can be implemented as Bottom-up or Topdown implementations [33] [34]. Dimensional and normalized approaches are not mutually exclusive
[28]. Several new “hybrid” approaches do exist and have received acceptance, as well as “federated”
approaches [35] [36]. The choice in selecting a good approach and implementation is fundamental in a
DW installation. The questions such as how is the information will be represented, the summarized
views will be updated, the additions of new information how will it be carried out? [18]. Independent
data marts are also an approach considered to a separate approach [23] [7]. These are data marts that
are independent of a DW solution. They shall not be discussed in this review individually, since they
are a fundamental building block in developing most of the approaches and implementation which will
be discussed next. However, implementing data marts is part of the discussion in the review.
3.1. Dimensional Approach
In [37] Ralph Kimball has introduced the DW approach based on dimensions and facts, in which
data stored to be extracted and transformed is stored as such, called the dimensional Model. This style
is referred to as a star schema [22]. Dimensions are reference information for facts; facts are numerical
figures in transactional data. If bill information were stored in a Dimensional DW, dimensions would
be items bought or sold, while a fact would be the quantity of the billed item. Information stored in a
dimensional DW would have one table with the facts, linked by many dimensions; hence, this model is
also called the star schema. Fig. 2 shows an example of a dimensional modeling schema.
Dimensional DWs are easy to use by business professionals and managers, those who have little
computer science knowledge of the database and data storage concepts, since facts and dimensions are
easy to understand and relate [28] [38]. In addition, data marts can be queried faster, since fact and
dimension tables are condensed, and they span fewer tables and relations. Its author Ralph Kimball has
been an avid advocate for the dimensional model, and had defended his model from criticism such as
that the star schema has summarizations and loses some details.
Both the dimensional modeling and ER model have similarities and can be represented in terms of
the other. Only terminology and the degree of normalization differ from one model to the other [7].
The survey done in [39] gives an overview and analysis of the many dimensional modeling schemas;
some such as the Husemann et al. suggest the use of an ER schema first.
Pros:
 Data is prearranged according to the required outputs [31].
 Uses fact and dimensions, often referred to as the “star schema”, can constructs “snowflake”
structures, and “constellation” schemas as well, where several fact tables are joined by a shared
dimension [28].
 Dimensional modeling can summarize data marts easily, and is dominant among business
analyst, since no need of the data structure and layout is needed [28] [13].
 Mostly used, and optimized with the Bottom-up implementation [37].
 Subject oriented and provide a user view of the data [40].
Cons:
 The dimensional model is very asymmetric. A single join can relate a single fact table to
several dimension tables [7].
 A single star join, while it may span several dimensions from different departments such as
customers, sales, and time, it is however, usable to a single departmental need. A different
department might as well define another star join that would also span several dimensions, just
13
Data Warehousing Implementations: A Review
Abdulrahman R. Alazmi
to serve it needs. As the process continues, the DW would unnecessarily grow in size, and
redundancies would arise [41].
 Each new star join is built individually. This will consume as much time and effort as the first
and every star join. There is no foundation to build upon, and quicken the process [41].
Figure 2. Dimensional Modeling
3.2. Normalized Approach
Bill Inmon, another pioneer in the field of DW, has compared his approach against the dimensional
approach in [41]; his model is the “normalized” DW approach, introduced in the early 1990s. In this
design, the data stored in the warehouse follow a third normalization form 3N [24] [42]. The first step
would be to identify each entity, and the relationships among them. All information to be introduced in
the DW must be normalized, which means key information is stored in tables each. If an entry has
more than one key value, these keys are stored on separate tables, and a look-up table is made to relate
both. Data dependent on value other than the key value would be normalized until every piece of
information in the table is dependent only on the key value [24]. Fig. 3 has an example of a 3N
relational model.
This approach is highly scalable, since each key piece of information is maintained separately. On
the other hand, this may make the DW size huge considerably, and querying could span over several
tables, across many join operations under relationships. Accessing data in normalized DW require
understanding of the DW schema, it keys, relationships, and existing tables. Inmon’s approach has
been used to help structure textual data to structured sets of identifiers in [43] [44], given the fact that
most DW storage are predominantly numbers, so in order to process text was a challenge, especially to
enhance the queries made against the DW [38]. In [45] a natural language method to extract
requirements for data mart OLAP processing is proposed. The Inmon approach can help in formatting
these textual inputs.
Pros:
 Data is of atomic 3N relational form [31].
 Easily adaptive to new changes [42].
 3N form eliminates data redundancy [24].
 Independent of the application [22] [31].
Cons:
 3N data repositories do not produce efficient queries; sometimes they are slow and need to be
done on several queries.
 OLAP tools as opposed to OLTP tools prefer de-normalization among data to enhance
performance. This is an inherited trait from the nature of OLAP statements such as: “find the
14
Data Warehousing Implementations: A Review
Abdulrahman R. Alazmi
total sales made by all the customers last week”, unlike an OLTP statement which should read:
“find the sales for a particular customer”
 Since 3N form makes the data normalized, atomic, and with no redundancy, large number of
tables and relationships would exist. These can exist over several servers; this can negatively
affect the queries’ performance [38].
 More appropriate to databases not DW [7].
3.3. Bottom-up Implementation
In the Bottom-up implementation [46] the data marts are assigned first, as shown in Fig. 4, this
would realize the outcome needed from DW storage early in its development to managers and business
experts, dimensional DWs are more suitable to this implementation. As more data marts are developed
they are added incrementally, and a global policy would slowly emerge from these data marts.
Data marts are the summary report of the stored, cleaned, combined, and extracted data. Data marts
range from simple facts and dimensions to facts, dimensions, as well as summarized data. Building
DW in a bottom-up implementation centers around the building of the first data mart, and then building
more data marts, possibly in a different department, and finally linking them in the Bus. The Bus is
another fundamental aspect in bottom-up design. The Bus links data marts with shared dimensions –
drilling across- [7] [15] [16]. Maintaining the Bus as the DW grows is critical in managing the DW
design as it grows in several departments, as integration between data marts happens if each is touched
upon by the Bus. The Bottom-up implementation builds iteratively, and some think that this is how the
BI and data mining tools are used, not from an enterprise wide single view [47].
Figure 3. 3rd Normal Form
Pros:
 Given that the data mart are built incrementally, in an average period 9 month for each. This
gives the business team to have a better understanding of the data marts [47].
 This implementation gives a better estimate of Return On Investment ROI [47].
 Can quickly be adapted to solve temporary problems, since the data marts are built individually
[47].
 Better adapted to a Dimensional approach [37].
 Easy implementation, as only individual data marts are developed iteratively.
 Less initial cost, as the system would be built incrementally.
Cons:
 Since in the Bottom-up implementation data marts are easy and quick to build, they may not be
compatible with the overall strategy and business model of the company. They may solve
immediate problems quickly, but might affect the overall BI of the company [47] [14].
 In the long term, maintaining the many data marts that exist, and would most likely make
“multiple sources of truth”, and removing redundancies might negatively affect BI and data
mining practices.
 The Bus must be maintained throughout the building of new data marts, otherwise inefficient
and misleading results might be produced [37].
15
Data Warehousing Implementations: A Review
Abdulrahman R. Alazmi
Figure 4. Bottom-up Implementation
3.4. Top-down Implementation
The Top-down approach, promoted by Bill Inmon, has also seen many successful implementations
and applications [41]. Is the design approach where the DW, is the centralized container for the entire
company. Fig. 5 demonstrates the implementation. It is built by defining a normalized data model, the
data stored is the datum defined at their lowest form. The Top-down approach defines the whole
business model for all the data to be used in the company at its first step, the tables, data, and the
relationships among them. Top-down implementation requires that every department, business lines,
and workgroups in the company to gather, plan, and illicit the requirements needed, that would satisfy
the needs for every branch.
Data marts needed for Business Intelligence BI are extracted from the DW, which sits as the central
Corporate Information Factory CIF and becomes a “single source of truth”. Since the DW design in the
Top-down implementation, is broad and large, the tables with high relevance to other tables are
grouped under subject areas, and most likely share relationships with them.
DW under these implementations has very consistent data marts, since all data marts are computed
from the centralized repository of the DW. Given this fact, such DWs are highly scalable, and can
adapt to future changes in the data marts requirements. Top-down implementation is preferred to be
used the Inmon style normalized DW, since the data are kept under the lowest and in 3N form. This
allows the Top-down approach to broad and keeps all data application independent [22].
Pros:
 According to [47] this approach eases maintenance, meta-data management, ETL tools
integration, and the making of new data mart.
 Better adapted to a Normalized approach [41].
 In the existence of good IS and centralized Information Technology IT infrastructure that
provide reliable hardware and software support, this approach have beneficial effects [7].
Cons:
 The Top-down implementation is a very wide and broad project and it is a considerable
undertaking that might take a large amount of time and human power [14].
 The methodology used in implementation, due to time of development, might not satisfy the
end users once launched and online.
 After implementation begins no changes can be made to the deployment data structures.
3.5. Hybrid Approach
These models use both implementations, the Top-down and Bottom-up in their designs. It focuses
on utilizing the user-friendly and performance optimization of the Bottom-up, while keeping the
scalability and integration of the Top-down designs. Also the Hybrid approach can be defined as the
16
Data Warehousing Implementations: A Review
Abdulrahman R. Alazmi
mixture of both approaches, the dimensional and normalized [48] [49] [50]. Fig. 6 illustrates this
arrangement. Other research describe the Hybrid approach as using a database, ETL tools, and OLAP
tools, in this survey, we consider as the mixture of a normalized and dimensional modeling, as with
most implementations and usages [49].
Figure 5. Top-down Implementation
Figure 6. A Hybrid Approach
This approach starts with an ER diagram for the data marts; these data marts represent the
company’s basic needs, and can grow in size as new requirements emerge. The data marts are
developed in a dimensional model, and then ETL tools are used to extract data from sources to a nonpersistent staging area (otherwise known as Operational Data Store ODS) and then into the data marts.
By now the data marts, containing summarized and atomic data are loaded into the DW, or sometimes
even moved to a temporary storage space for cleansing before going into the DW [8] [51]. These data
marts might face normalization processes along the way in order to reduce redundancy in the system.
Finally, query tools are used to get data marts and atomic data from the DW. This method provides a
single source of truth, which is the DW.
Pros:
The next set is from [49]:
 Has the best of implementations, the dimensional and the normalized. Because data can be
stored in the normalized form, yet aggregate dimensional queries can be made against them.
 When using Materialized View Aggregates MVAs, the Hybrid approach performs better than
the dimensional and normalized approaches.
 Using ODS helps in data integration from multiple, constantly changing data sources [51] [50]
[14].
 Rapid development of the DW, independent of the data marts [50].
17
Data Warehousing Implementations: A Review
Abdulrahman R. Alazmi
Cons:
The next set is from [49]:
 Since the hybrid approach is the sum of the dimensional and normalized approaches, it size if
the sum of both, minus the union data.
 The hybrid approach has lesser performance than the dimensional when using dimensional
queries.
 The Hybrid approach’s design is more complex to the normalized and dimensional alternatives.
Planners need to have good background and experience in both the Top-down and Bottom-up
designs to balance the hybrid design, by selecting which department to develop next, and to
what degree the scope of the view should be [7].
3.6. Federated Approach
This architecture follows a hub-and-spoke design. Where the hub is linked to all the spokes, and can
send and receive information from them. The federated approach, illustrated in Fig. 7, suggests the
integration of many DW, and quite possibly other source such as BI tools and CRM feeds. The needs
for this approach comes from the fact that a company or enterprise might have a plethora of DW, BI,
and decision-making solutions, and to accommodate this, a central hub, which is the federated DW, is
found, to allow managers and consultants to have a broad view into the many assets in the company
[13] [35] [48]. A federated DW is a more plausible choice than building a very huge and single DW to
accommodate all the many DW solutions a company has.
This approaches centers around two main concepts, the common business model and the common
staging area. The first ensures the consistent use of data names and definitions across the
heterogeneous analytical and DW software in the company. New data marts made in any of the many
DW solutions would have to conform to the common business model. The second, is used to reduce the
redundancy in data marts, enhance performance, and helps in maintaining ETL tools. Because different
data marts require differently configured ETL tools, keeping track of these new ETL can be
problematic. Therefore, the common staging area would help in restricting and sectioning the ETL
tools objectives. First data are staged unto the common staging area, and then fed into the best data
marts from the many DWs to optimize performance. This process can also make use of data
reengineering and data profiling tools [48].
Figure 7. A Federated DW Architecture
In the federated approach it is important to keep the “single source of truth” aspect maintained
throughout the system. Therefore, issues such as privacy between the many sources must be
maintained, as an extra functionality [40]. There have been several attempts to use the federated
databases concepts to implement on a federated DW, details can be found in [52]. The emphasis on
using an access layer to remove the data heterogeneous nature is the same as the common staging area
in the federated DW.
Pros:
 When large data repositories need to be analyzed and queried against one another, yet due to
legacy formats, legal standards, different sources, or different data models such integration
might not be possible using a standard single DW solution. The federated implementation can
18
Data Warehousing Implementations: A Review
Abdulrahman R. Alazmi
solve this situation by providing a global hub and spoke architecture that governs the many
data sources [40].
 When acquisitions and reorganizations happen in the company, a federated DW would
complement the situation [23].
 The staging area in a federated DW provides better performance and optimization for ETL
tools [31] [48].
Cons:
 Security in a federated data warehouse has a major impact on its functionality. Maintaining
data privacy, while using a uniform data model and a unified staging area, becomes difficult.
Some organization and standard do not even allow their information to be copied or available
in such staging areas. Carefully planned security policies must be deployed when dealing with
many sources holding critical information that are sometimes protected by the law, or under
standards [40].
 When data is private and confidential, the query deployed from a federated DW, maybe be
launched, but the processing of the query would be done outside the federated staging area,
usually by third parties or the organization holding the data private. This may affect the speed
and efficiency of a federated DW [40].
 The federated approach had scored the lowest in usage according to [23].
4. Evaluation Results
In this section we will compare the DW architectures we have discussed against common and
critical criteria. The normalized and dimensional are mixed with the Top-down and Bottom-up
implementations, respectively, as it is the common case for such approaches. Table 1 and Table 2 have
the results.
The results show that to a degree no DW implementation is an all-out winner, but on the other hand,
it shows how each has its area of excellence and areas of weakness. However, the first two show that
they are the most common and used overall. This shows just how much the newer trends, such as the
Hybrid and Federated approaches, are less in use due to their varied costs, design requirements, and
usage.
5. Future Trends and Challenges
This section provides the future issues in DW architecture designs, some of which were
touched upon by the literature, but no study shows their effect on the selection of DW
architecture have been thoroughly explored.
5.1. Security
Security has been a major issue in almost every information system, DWs are no exception.
The amounts of different architectures we have discussed in this review have different ways of
dealing with the data before being committed to the DW repositories.
Classic methods suggest the encryption of data before entering the DW. Other methods are
imported from the database and federated databases domains, which suggest the use of
requirement analysis, logical and physical design. But they are not suitable for DWs because
they originated in the database domain, where data is atomic, and operational [53].
5.2. Outsourcing DW
Outsourcing a DW solution might be an attractive choice, since the resources and time for
such a project are substantial. Outsourcing a DW might only include parts of the DW, such as
the data model, ETL tools, or even the BI transactions [54].
The immediate advantages in outsourcing a DW include the access to global expertise in DW
implementations, reduction of planning and resources needed to build the DW solution,
19
Data Warehousing Implementations: A Review
Abdulrahman R. Alazmi
considering its complexity, and the reduction of costs and flexibility in maintenance and
upgrade [5]. However, the promise of lesser costs of DW implementations by an external body,
possibly a third party, should not be the sole reason to adopt DW outsourcing [55]. The
disadvantages included are the risks of having the company’s BI and data marts exposed to an
outside source; switching from one vendor to another might produce contract issues and
difficulties; and the design of the DW and its data marts might not be what the company desires
but what the vendor provides. In addition, the party responsible for the DW should be aware of
how the original company business models work [56].
DW
architecture
Top-down
(normalized)
Initial Cost
The highest of the
surveyed
implementations [23]
Table 1. Result 1
Initial View
Additions
The whole organization
of the company,
department by
department [23]
Well formulated and
documented. The process is
as easy as the level of
normalization used [24]
[42]
Bottom-up
(Dimensional)
The Lowest. Since only
data marts are
implemented first [46]
Departmental. Then it
grows to fulfill the
organization as needs
grow [28]
Can be problematic if the
data marts do not conform
to the bus. Can make way
for redundancies. A
carefully revised
dimensional modeling
schema is needed [57]
Hybrid
Moderate to High.
Depends on the amount
of vertical (top-down) or
horizontal (bottom-up)
trends used [49].
Moderate. Can begin by
one department or more.
Can support independent
data marts or an entire
DW [58]
Structured and scalable.
Redundancies can exist due
to the dimensional notations
used [49] [41] [14]
Federated
High. Many solutions
have to be modified to
be compatible with the
hub and spoke design
[23]. While other
existing systems might
needn’t modifications at
all [25]
The whole system of
DW solutions, ODSs,
and or any data source
[25] [52]
Must be compliant to the
unified business model.
Scalable overall [35] [52]
20
Data Warehousing Implementations: A Review
Abdulrahman R. Alazmi
Table 2. Result 2
DW
architecture
Maintenance
Domain
Efficiency
Top-down
(normalized)
Easy to locate.
Difficult to optimize
because of the
presence of many
tables and
relationship due to
3N [24]
The design governs the
needs of all the
company’s departments
[23]
3N is Optimized for
databases. For DW and
OLAP processing,
optimizing the queries can be
difficult, because they will
most likely cross many
tables. At worst case across
multiple servers [38].
Bottom-up
(Dimensional)
Difficult. Because
there would occur
many overlaps in
data marts’
summaries from
department to
another [41]
The design can cover the
whole company, when
mature. Mostly it is
company-wide in most
implementation [23]
Optimized for OLAP.
Dimensional tables and facts
can span many data fields,
which give very fast
querying.
Easy to understand [38]
Hybrid
Needs knowledge of
both the Top-down
and Bottom-up
designs. Designs are
complex to
understand [7] [49]
Company-wide. More
details might grow in
several departments as
needed. Depends on the
balance between the Topdown and Bottom-up
mix. Using an ODS will
unify the ETL tools and
models [58]
Depending on the schema.
However, it is faster than the
Relational (normalized) but
slower than a pure
dimensional design. A
Hybrid design include both
an ODS and a DW, the ODS
maybe the bottleneck and
cause slowdown [58]
Federated
Well formulated if
all data sources
comply with the
unified business
model. A large
process overall [35]
[40] [52]
Company-wide. And can
even be more, can be the
administrator over
several data systems in
the company [23] [25]
The common staging area
provides grounds for ETL
tools to optimize the data
marts before committing
them to a DW [31][48].
6. Concluding Remarks
In our study we have evaluated some of the most used designs, which are the Top-down, Bottom-up,
Hybrid, and Federated DW designs. As with [23], there were no clear winner in choosing which design
implementation, but in the analysis it has been shown that the differences in these designs, in terms of
costs, adaptability, scalability, and efficiency, do have contrasting values. These results would help
researchers to investigate further these designs, as well as IT or managers in deciding what DW flavor
to adopt. A DW design choice ultimately be affected by the needs of the company, a manufacturing
firm would have different BI and data mining needs than a monetary body, and therefore a different
DW design and implementation would be selected.
The Top-down used with the normalized design, had the highest spectrum in terms of the business
data model used in the company followed by the Federated, its initial cost is high, but it provided high
scalability. The Bottom-down used with the dimensional design, had the highest lowest initial cost,
simplest data models, but procured difficult data mart management, as redundancies and difficulties in
adding new data marts did appear. The Hybrid implementation had stroke a balance between the two
aforementioned implementations. But it had high complexity designs, because it mixes a normalized
with dimensional notations in its data models. Its performance was close to a dimensional model in
terms of OLAP efficiency. The Federated implementation had high initial costs, a system wide view
that can span all of the company’s data sources. The Federated approach needs a reliable and fault
21
Data Warehousing Implementations: A Review
Abdulrahman R. Alazmi
tolerant hub and spoke infrastructure to provide high interoperability. If the common business model
and common staging area are well configured, this approach can enhance the OLAP and BI
transactions done to the DWs data marts.
Research on DW and its designs needs further empirical studies to show just how much these
designs differ in usage and actual performance [23] [5]. Security, DW outsourcing, ROI on DW
projects, and DW migration, portability, and interoperability need further investigation in DW, and
their effects on the company’s BI.
7. References
[1] The Data Warehousing Information Center. LGI Systems Incorporated. Available at:
http://www.dwinfocenter.org/against.html. Retrieved on 23-October-2011
[2] Data Warehouse Platform Acceleration. Syncsort. Available at:
http://www.syncsort.com/Portals/0/Resources/DMX_Solution_DataWarehouseAcceleration_scree
n.pdf. Retrieved 25-October-2011
[3] Frank Hayes. The Story So Far. COMPUTERWORLD. Available at:
http://www.computerworld.com/s/article/70102/The_Story_So_Far?taxonomyId=009. Retrieved
2011-October-23 Sunday
[4] Colin White. The Federated Data Warehouse. Information Management. Available at:
http://www.information-management.com/issues/20000301/1953-1.html. Retrieved 2011-October23
[5] Kimball Group. Resources .Available at:
http://www.kimballgroup.com/html/articles_search/articles1997/9708d15.html. Retrieved 25October-2011
[6] Jeff Lawyer, and Shamsul Chowdhury, “Best Practices in Data Warehousing to Support Business
Initiatives and Needs,” In the 37th Annual Hawaii International Conference on System Sciences,
pp. 80223a, 2004.
[7] Ralph Kimball, and Margy Ross, The Data Warehouse Toolkit: The Complete Guide to
Dimensional Modeling, Wiley Publishing: USA, 2002.
[8] Bill Inmon. The Problem with Dimensional Modeling. DM Review Magazine. Available at:
http://www.uniriotec.br/~tanaka/SAIN/03-TheProblemWithDimensionalModeling-InmonDMReview-2000.pdf. Retrieved 23-October-2011
[9] Ming-Chang Lee, “ The Combination of Knowledge Management and Data mining with
Knowledge Warehouse” in International Journal of Advancements in Computing Technology, vol.
1, no. 1, pp. 39 -45, 2009
[10] Kornelije Rabuzin, “Data Warehousing and Business Intelligence: Understanding What is Really
Going on in Higher Education”, in the International Journal of Information Processing and
Management, vol. 4, no. 4, pp. 121-129, 2013
[11] Edgar Frank Codd, “A relational model of data for large shared data banks”, In Communications
of the ACM, vol. 13, no. 6, pp. 377-387, 1970.
[12] InformationWeek. Available at:
http://www.informationweek.com/software/030917/615warehouse1_1.jhtml. Retrieved 2011October-23
[13] Infonitive Research. Design & Architectural Approaches for Data Warehouses: Inmon vs Kimball,
Federated, Hybrid. Infonitive. Available at: http://infonitive.com/?p=255. Retrieved 23-October2011
[14] Arun Sen, and Atish Sinha, “Toward Developing Data Warehousing Process Standards: An
Ontology-Based Review of Existing Methodologies,” In the IEEE Transactions On Systems, Man,
And Cybernetics—Part C: Applications And Reviews, vol. 37, no. 1, January 2007.
[15] Wilson Mar. Testing Data Warehouses DSS. Available at:
http://www.wilsonmar.com/1datawh.htm. Retrieved 23-October-2011
[16] Exforsys Inc. Design of the data warehouse: Kimball Vs Inmon. Available at:
http://www.exforsys.com/tutorials/msas/data-warehouse-design-kimball-vs-inmon.html. Retrieved
23-October-2011
22
Data Warehousing Implementations: A Review
Abdulrahman R. Alazmi
[17] Abhishek Mehta, Tableau Software. Big Data: Powering the Next Industrial Revolution. Available
at: http://www.tableausoftware.com/learn/whitepapers/big-data-revolution. Retrieved 23-October2011
[18] 1Keydata. Bill Inmon vs. Ralph Kimball. Available at:
http://www.1keydata.com/datawarehousing/inmon-kimball.html. Retrieved 23-October-2011
[19] Bill Inmon. Data Warehousing in a Healthcare Environment. Available at:
http://www.tdan.com/view-articles/4584. Retrieved 23rd -October-2011
[20] Methanias Colaço, Manoel Mendonça, and Francisco Rodrigues, “Data Warehousing in an
Industrial Software Development Environment,” In the 33rd Annual IEEE Software Engineering
Workshop (SEW), pp. 131-135, October 2009.
[21] Kalido. Business-Model-Driven Data Warehousing: Keeping Data Warehouses Connected to Your
Business.
Available
at:
http://www.kalido.com/Collateral/Documents/English-US/WPBizModelDWa.pdf. Retrieved 25th – October- 2011
[22] David Haertzen, Data Warehousing Tutorial. Data Warehousing and Business Intelligence
Introduction. Available at: http://infogoal.com/datawarehousing/overview.htm. Retrieved on the
15th of November 2011
[23] Rashed Salem, Omar Boussaid, and Jerome Darmont, “Conceptual Workflow for Complex Data
Integration Using AXML,” In the International Conference on Machine and Web Intelligence
ICMWI, pp. 380-385, 2010.
[24] Matteo Golfarelli, and Stefano Rizzi, Data Warehouse Design: Modern Principles and
Methodologies, McGraw-Hill Osborne Media, Italy, 2009.
[25] Anindya Datta, Bongki Moon, and Helen Thomas, “A Case for Parallelism in data Warehousing
and OLAP,” in Proceedings of the 9th International Workshop on Database and Expert Systems
Applications, pp. 226-231, 1998.
[26] Bill Inmon, “Building the Data Warehouse”, Wiley Publishing: Canada, 1994.
[27] Olivier Teste, “Towards Conceptual Multidimensional Design In Decision Support Systems,” In
Proceedings of the 5th East-European Conference on Advances in Databases and Information
Systems ADBIS, pp. 77-88, 2001.
[28] Inmon Data Systems. Data warehousing in the health care environment. Available at:
http://inmoncif.com/registration/whitepapers/DATA%20WAREHOUSING%20IN%20THE%20H
EALTHCARE%20ENVIRONMENTR1.pdf. Retrieved on the 14th of November 2011
[29] Christopher Chute, Scott Beck, Thomas Fisk, and David Mohr, “The Enterprise Data Trust at
Mayo Clinic: a semantically integrated warehouse of biomedical data,” in the Journal of the
American Medical Informatics Association, vol. 17, no. 2, pp. 131-135, 2010.
[30] Jamel Feki, Ahlem Nabli, Hanene Ben-Abdallah , and Faiez Gargouri, “An Automatic Data
Warehouse Conceptual Design Approach,” In the 2nd Encyclopedia of Data Warehousing and
Mining, John Wang, IGI Global: USA, 2008.
[31] Srikanth Rokkam, and Yijie Han, “Data Warehousing: A Functional Overview”, In the
International Conference on Data Mining DMIN, pp. 362-369, 2008.
[32] Information Management Software, IBM. Data warehouses: improve strategic decision making
with coherent views of data. Available at:
ftp://ftp.software.ibm.com/software/data/bi/smb-why-warehouse.pdf. Retrieved on the 16th of
November of 2011
[33] Christopher Dorobek, GCN. Experts: A good data warehouse improves decision-making.
Available at: http://gcn.com/articles/1999/05/experts-a-good-data-warehouse-improvesdecisionmaking.aspx. Retrieved on the 16th of November of 2011
[34] G. Sweety Peter. Data Warehouse-An Overview. Available on:
http://www.peterindia.net/DataWarehousingView.html. Retrieved on the 16th of November of
2011
[35] Jerome Darmont, Omar Boussaid, Fadila Bentayeb, "Warehousing Web Data", in the 4th
International Conference on Information Integration and Web-based Applications and Services
(iiWAS 02), Bandung, Indonesia, pp. 148-152, 2002.
[36] Thilini Ariyachandra, and Hugh Watson, “Which Data Warehouse Architecture is best?” in
Communications of the ACM, vol. 51, no. 10, pp. 146-147 , 2008.
[37] James Madison, ORACLE. Building a Hybrid Data Warehouse Model. Available at:
23
Data Warehousing Implementations: A Review
Abdulrahman R. Alazmi
http://www.oracle.com/technetwork/articles/madison-models-086845.html. Retrieved on the 14th
of November 2011
[38] William McKnight, McKnight Consulting Group. Hybrid Approaches to Business Intelligence.
Available at:
http://www.information-management.com/issues/20040101/7908-1.html. Retrieved on the 15th of
November 2011
[39] Nevena Stolba, Marko Banek, and A. Min Tjoa, “The Security Issue of Federated Data
Warehouses In the Area of Evidence-Based Medicine,” in the Proceedings of the 1st Conference on
Availability, Reliability and Security ARES, pp. 11, April 2006.
[40] Julia Vowler. ComputerWeekly. Data warehouse design from top to bottom. Available at:
http://www.computerweekly.com/feature/Data-warehouse-design-from-top-to-bottom. Retrieved
on the 15th of November 2011
[41] Microsoft TechNet. Chapter 12 - Data Warehousing Framework. Available at:
http://technet.microsoft.com/en-us/library/cc966470.aspx. Retrieved on the 15th of November of
2011
[42] Chuck Ballard, Dirk Herreman, Don Schau, Rhonda Bell, Eunasaeng Kim, and Ann Valencic,
IBM.
Data
Modeling
Techniques
for
Data
Warehousing.
Available
at:
www.redbooks.ibm.com/redbooks/pdfs/sg242238.pdf. Retrieved on the 15th of November of 2011
[43] Data Warehousing and Business Intelligence. Available at:
http://knol.google.com/k/data-warehousing-and-business-intelligence. Retrieved of 15th of
November of 2011
[44] The Hybrid Structure. Logimethods. Available at: http://www.logimethods.com/thoughtleadership-hybrid-data-warehouse.php. Retrieved on the 17th of November 2011
[45] Michel Schneider, “A general model for the design of data warehouses”, In the International
Journal of Production Economics, vol. 112, no. 1, pp. 309-325, 2008.
[46] Oscar Romero, and Alberto Abello, “A Survey of Multidimensional Modeling Methodologies”, In
the International Journal of Data Warehousing & Mining, vol. 5, no. 2, pp. 1-23, 2009.
[47] Sajjad Ahmad, and Rohiza Ahmad, “An Improved Security Framework for Data Warehouse: A
Hybrid Approach”, In Proceedings of the International Symposium in Information Technology
ISIT, pp. 1586-1590, 2010.
[48] Stefan Berger, and Michael Schrefl, “From Federated Databases to a Federated Data Warehouse
System”, In the 41st Annual Hawaii International Conference on System Science, pp. 394 - 394,
2008.
[49] Thilini Ariyachandra, and Hugh Watson, “Key organizational factors in data warehouse
architecture selection”, In Decision Support Systems, vol. 49, no. 2, pp. 200–212, 2010.
[50] Fahmi Bargui, Jamel Feki, Hanene Ben-Abdallah, “A natural language approach for data mart
schema design”, In the 9th International Arab Conference on Information Technology ACIT, 2008.
[51] K. Ram Ramamurthy, Arun Sen, and Atish Sinha, “An empirical investigation of the key
determinants of data warehouse adoption”, In Decision Support Systems, vol. 44, no. 4, pp. 817841, 2008.
[52] Mohammed Tafti, “IT Outsourcing: A Knowledge-Management Perspective”, In Issues in
Information Systems, vol. 8, no. 2, pp. 488-493, 2007.
[53] Sid Adelman, Information management. Should the Data Warehouse be Outsourced? Available at:
http://www.information-management.com/issues/20060501/1053405-1.html. Retrieved on the 5th
of December 2011
[54] Roger Llewellyn, Kognitio. ComputerWorldUK. Why IT Departments are Outsourcing Data
Warehousing. Available at:
http://www.computerworlduk.com/in-depth/outsourcing/2003/why-it-departments-areoutsourcing-data-warehousing/. Retrieved on the 5th of December 2011
[55] Marwa Farhan, Mohammed Marie, Yehia Helmi, and Laila El-Fangary, “Transforming
Conceptual Model into Logical Model for Temporal Data Warehouse Security: A Case Study”,
International Journal of Advanced Computer Science and Applications, vol. 3, no. 3, pp. 115-122,
2012.
[56] Arindam Paul, Varuni Ganesan, Jagat Challa, and Yashvardhan Sharma, “HADCLEAN: A hybrid
approach to data cleaning in data warehouses”, In the Proceedings of International Conference on
Information Retrieval & Knowledge Management CAMP, pp. 136-142, 2012.
24
Data Warehousing Implementations: A Review
Abdulrahman R. Alazmi
[57] Mahmoud Boufaida, and Louradi Bradji, “Open User Involvement in Data Cleaning for Data
Warehouse Quality”, International Journal of Digital Information and Wireless Communications,
vol. 1, no. 2, pp.536-544, 2011.
[58] Junwei Di, Zhanbao Gao, and Limei Zhang, “PHM framework design based on data warehouse”,
In the IEEE Conference on Prognostics and System Health Management PHM, pp. 1-5, 2012.
25
Download