ISSUES AFFECTING THE DATA WAREHOUSE EFFICIENCY
Jaee Ranavde
B.E., Mumbai University, India, 2004
PROJECT
Submitted in partial satisfaction of
the requirements for the degree of
MASTER OF SCIENCE
in
BUSINESS ADMINISTRATION
(Management Information Systems)
at
CALIFORNIA STATE UNIVERSITY, SACRAMENTO
FALL
2010
ISSUES AFFECTING THE DATA WAREHOUSE EFFICIENCY
A Project
by
Jaee Ranavde
Approved by:
__________________________________, Committee Chair
Dr. Monica Lam, Ph.D.
____________________________
Date
Student: Jaee Ranavde
I certify that this student has met the requirements for format contained in the University format
manual, and that this project is suitable for shelving in the Library and credit is to be awarded for
the Project.
__________________________, Dean
________________
Date
Sanjay Varshney, Ph.D., CFA
College of Business Administration
Abstract
of
ISSUES AFFECTING THE DATA WAREHOUSE EFFICIENCY
by
Jaee Ranavde
Data warehousing is gaining huge importance in businesses around the world.
Data warehousing is commonly used by companies to analyze trends over time. In other words,
companies use data warehousing to view day-to-day operations, but its primary function is
facilitating strategic planning based on long-term data overviews. From such overviews, business models, forecasts, and other reports and projections can be made. In a world where
data warehousing is gaining so much importance, it becomes necessary to study the factors that
hinder its efficiency. These problems have to be studied in detail to isolate the reasons for their
existence. The aim of this project is to understand the issues affecting data warehousing
efficiency. Understanding these issues will help us design better data warehouses in the future.
_______________________, Committee Chair
Dr. Monica Lam, Ph.D.
_______________________
Date
TABLE OF CONTENTS
List of Figures
Chapter
1. BACKGROUND AND OVERVIEW
2. THE GOAL OF THE PROJECT
   2.1 Objective 1
   2.2 Objective 2
   2.3 Objective 3
   2.4 Objective 4
3. LITERATURE REVIEW
   3.1 Data Warehouse Architecture Goals
   3.2 Data Warehouse Users
   3.3 How Users Query the Data Warehouse
   3.4 Principles of Data Warehouse Design and Implementation
      3.4.1 Organizational Consensus
      3.4.2 Data Integrity
      3.4.3 Implementation Efficiency
      3.4.4 User Friendliness
      3.4.5 Operational Efficiency
      3.4.6 Scalability
      3.4.7 Compliance with IT Standards
   3.5 Data Warehouse Issues
      3.5.1 Loading and Cleansing Data
      3.5.2 Cost and Budgeting Issues
      3.5.3 Data Warehousing Security Issues
      3.5.4 Maintenance Issues for Data Warehousing Systems
4. SIGNIFICANCE OF THE PROJECT
5. RESEARCH METHODOLOGY
6. CASE STUDIES
   6.1 John Deere – Interview with Project Manager
   6.2 Kyobo Life Insurance – Case Study One
   6.3 Scandinavian Airlines – Case Study Two
   6.4 Philips Consumer Electronics – Case Study Three
7. SURVEY
   7.1 Sample Population
   7.2 Sample Frame
   7.3 Survey Response
   7.4 Data Collection via Online Survey
8. CASE STUDIES AND SURVEY OBSERVATION
9. CONCLUSION
Appendix. Questionnaire
Bibliography
LIST OF FIGURES
Figure 1: Number of Respondents from Different Data Warehouse Sizes
Figure 2: Respondents' Role in the Data Warehouse
Figure 3: Importance of Each Feature to Respondents in Different Roles
Figure 4: Issues Faced by Respondents in Different Roles
Figure 5: Importance of Various Features in Different Sized Data Warehouses
Figure 6: Issues in Different Sized Data Warehouses
Chapter 1
BACKGROUND AND OVERVIEW
A data warehouse is a repository of an organization's electronically stored data. Data
warehouses are designed to facilitate reporting and analysis. They are commonly used
by companies to analyze trends over time. Companies use data warehousing to view day-to-day operations, but its primary function is facilitating strategic planning based on
long-term data overviews. From such overviews, business models, forecasts, and other
reports and projections can be made.
Data warehousing has become necessary for enterprises of any size to make
intelligent decisions. It enables competitive advantage. Data warehousing captures
data and their relationships, and it is the foundation for Business Intelligence (BI). It clearly
draws the distinction between data and information.
Data warehousing emphasizes organizing, standardizing, and formatting facts in such a
way that we can derive information from them. BI is then concerned with acting on that
information. The primary goal of any data warehouse is to integrate data from disparate
sources into a centralized store, where data can be used across the enterprise for decision
support.
Data warehousing is used not only as a method to organize data but also to protect
information from massive data loss in unforeseen incidents such as natural calamities, mass data
corruption, etc. Many multinational companies use data warehousing systems to protect
their important data from such events. The economy has to continue to grow even in
times of disaster. Therefore, it is very important that all useful data be backed up in the data
warehouse, checked, and maintained in safe places.
Chapter 2
THE GOAL OF THE PROJECT
It is clear that data warehouse design and management are extremely critical for the
successful functioning of the data warehouse system. There exist many theoretical
explanations of how to design an efficient and effective data warehouse, but many
companies still face hurdles when working in the data warehouse environment. They face
various issues which hinder its smooth functioning. Therefore, there is a strong need to
understand the issues affecting data warehouse efficiency. The goal of this project is
to identify these issues and verify whether they hold in practice by using an
interview, case studies, and an online survey.
2.1 Objective 1
The project first attempts to understand what the critical success factors are in
designing a data warehouse environment.
2.2 Objective 2
The project further tries to understand what the current major issues faced by
businesses are, through case studies and an online survey.
2.3 Objective 3
The issues affecting data warehouse efficiency, in theory and in practice, are
discussed.
2.4 Objective 4
Finally, a few ways to overcome or minimize these issues are discussed. The overall goal
of the project is to understand what the critical factors for a successful data warehouse
design are and how they should be taken into consideration while designing a data
warehouse environment.
Chapter 3
LITERATURE REVIEW
A data warehouse is a repository of an organization's electronically stored data. Data
warehouses are designed to facilitate reporting and analysis. This definition of the data
warehouse focuses on data storage. However, the means to retrieve and analyze data, to
extract, transform and load data, and to manage the data dictionary are also considered
essential components of a data warehousing system. Many references to data
warehousing use this broader context. Thus, an expanded definition for data warehousing
includes business intelligence tools, tools to extract, transform, and load data into the
repository, and tools to manage and retrieve metadata.
In contrast to data warehouses, operational databases support day-to-day
transaction processing. Data warehouses are expected to provide storage,
functionality, and responsiveness to queries beyond the capabilities of today's transaction-oriented databases. Data warehouses are also meant to improve the data access performance
of databases. Traditional databases balance the requirement of data access with the need
to ensure the integrity of data.
In present-day organizations, users of data are often completely removed from the data
sources. Many people only need read access to data, but still need very rapid access to a
larger volume of data than can conveniently be downloaded to the desktop. Often
such data comes from multiple databases. Because many of the analyses performed are
recurrent and predictable, software vendors and systems support staff have begun to
design systems to support these functions. There is now a need to
provide decision makers from middle management upward with information at the
correct level of detail to support decision-making. Data warehousing, online analytical
processing (OLAP), and data mining provide this functionality.
Before embarking on the design of a data warehouse, it is imperative that the
architectural goals of the data warehouse are clear and well understood. Because the
purpose of a data warehouse is to serve users, it is also critical to understand the various
types of users, their needs, and the characteristics of their interactions with the data
warehouse.
3.1 Data Warehouse Architecture Goals
A data warehouse exists to serve its users - analysts and decision makers. A data
warehouse must be designed to satisfy the following requirements [MSDN Microsoft
Library]:
• Deliver a great user experience - user acceptance is the measure of success.
• Function without interfering with OLTP systems.
• Provide a central repository of consistent data.
• Answer complex queries quickly.
• Provide a variety of powerful analytical tools, such as OLAP and data mining.
Most successful data warehouses that meet these requirements have these common
characteristics [MSDN Microsoft Library]:
• Are based on a dimensional model.
• Contain historical data.
• Include both detailed and summarized data.
• Consolidate disparate data from multiple sources while retaining consistency.
• Focus on a single subject, such as sales, inventory, or finance.
Data warehouses are often quite large. However, size is not an architectural goal - it is a
characteristic driven by the amount of data needed to serve the users.
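The first characteristic above, a dimensional model, can be made concrete with a small star schema: a central fact table of measures surrounded by dimension tables that carry descriptive attributes. The sketch below only illustrates that idea, using an in-memory SQLite database; the table and column names are assumptions made for the example and do not come from any system discussed in this project.

    # A minimal star-schema sketch: two dimension tables, one fact table, and a
    # summarized, subject-focused query of the kind a data warehouse is built for.
    import sqlite3

    conn = sqlite3.connect(":memory:")          # stand-in for the warehouse database
    cur = conn.cursor()

    # Dimension tables hold descriptive attributes.
    cur.execute("CREATE TABLE dim_date (date_key INTEGER PRIMARY KEY, year INTEGER, month INTEGER)")
    cur.execute("CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY, name TEXT, category TEXT)")

    # The fact table holds measures keyed by the dimensions.
    cur.execute("""CREATE TABLE fact_sales (
        date_key INTEGER REFERENCES dim_date(date_key),
        product_key INTEGER REFERENCES dim_product(product_key),
        units_sold INTEGER,
        revenue REAL)""")

    cur.executemany("INSERT INTO dim_date VALUES (?, ?, ?)",
                    [(20100101, 2010, 1), (20100201, 2010, 2)])
    cur.executemany("INSERT INTO dim_product VALUES (?, ?, ?)",
                    [(1, "Tractor", "Agriculture"), (2, "Mower", "Turf")])
    cur.executemany("INSERT INTO fact_sales VALUES (?, ?, ?, ?)",
                    [(20100101, 1, 3, 250000.0), (20100201, 2, 10, 45000.0)])

    # Historical, summarized, single-subject query: revenue by year and category.
    for row in cur.execute("""
        SELECT d.year, p.category, SUM(f.revenue)
        FROM fact_sales f
        JOIN dim_date d ON f.date_key = d.date_key
        JOIN dim_product p ON f.product_key = p.product_key
        GROUP BY d.year, p.category"""):
        print(row)

Queries like the final one touch only summarized, historical data, which is why the warehouse can answer them without interfering with OLTP systems.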
3.2 Data Warehouse Users
The success of a data warehouse is measured largely by its acceptance among users. Without
users, historical data might as well be archived to magnetic tape and stored away somewhere in the
organization. Successful data warehouse design starts with understanding the users and
their needs.
Data warehouse users can be divided into four categories: Statisticians, Knowledge
Workers, Information Consumers, and Executives.
Statisticians: There are typically only a handful of sophisticated analysts - Statisticians
and operations research types - in any organization. Though few in number, they are
some of the best users of the data warehouse, whose work can contribute to closed loop
systems that deeply influence the operations and profitability of the company. These
people are often very self-sufficient and need only to be pointed to the database and given
some simple instructions about how to get to the data and what times of the day are best
for performing large queries to retrieve data to analyze using their own sophisticated
tools [MSDN Microsoft Library].
Knowledge Workers: A relatively small number of analysts perform the bulk of new
queries and analyses against the data warehouse. These are the users who get the
"Designer" or "Analyst" versions of user access tools. They will figure out how to
quantify a subject area. After a few iterations, their queries and reports typically get
published for the benefit of the Information Consumers. Knowledge Workers are often
deeply engaged with the data warehouse design and place the greatest demands on the
ongoing data warehouse operations team for training and support [MSDN Microsoft
Library].
Information Consumers: Most users of the data warehouse are Information Consumers;
they do not compose a true ad hoc query. They use static or simple interactive reports that
others have developed. They usually interact with the data warehouse only through the
work product of others. This group includes a large number of people, and published
reports are highly visible. A great communication infrastructure must be set up for
distributing information widely and for gathering feedback from these users to improve the
information sites over time [MSDN Microsoft Library].
Executives: Executives are a special case of the Information Consumers group. Few
executives actually issue their own queries, but an executive's slightest musing can
generate a flurry of activity among the other types of users. A wise data warehouse
designer should develop a very sophisticated digital dashboard for executives assuming
that it is easy and economical to do so [MSDN Microsoft Library].
3.3 How Users Query the Data Warehouse
Information for users can be extracted from the data warehouse relational database or
from the output of analytical services such as OLAP or data mining. Direct queries to the
data warehouse relational database should be limited to those that cannot be
accomplished through existing tools, which are often more efficient than direct queries
and impose less load on the relational database [MSDN Microsoft Library].
Reporting tools and custom applications often access the database directly. Statisticians
frequently extract data for use by special analytical tools. Analysts may write complex
queries to extract and compile specific information not readily accessible through
existing tools. Information consumers do not interact directly with the relational database
but may receive e-mail reports or access web pages that expose data from the relational
database. Executives use standard reports or ask others to create specialized reports for
them.
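As a small illustration of the extraction pattern just described, the sketch below shows an analyst-style direct query against the warehouse's relational store whose result set is exported to a flat file for an external statistical tool. The in-memory database, table, and column names are assumptions standing in for the real warehouse, not part of any system cited above.

    # Direct query plus a flat-file extract for a statistician's own analytical tools.
    import csv
    import sqlite3

    conn = sqlite3.connect(":memory:")          # stand-in for the warehouse database
    conn.execute("CREATE TABLE fact_sales (customer_id INTEGER, revenue REAL)")
    conn.executemany("INSERT INTO fact_sales VALUES (?, ?)",
                     [(1, 120.0), (1, 80.0), (2, 45.5)])

    # A query an analyst might issue directly because no published report covers it.
    query = """
        SELECT customer_id, COUNT(*) AS orders, SUM(revenue) AS total_revenue
        FROM fact_sales
        GROUP BY customer_id
    """
    with open("extract_for_analysis.csv", "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["customer_id", "orders", "total_revenue"])
        writer.writerows(conn.execute(query))   # handed off to the analyst's own tools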
3.4 Principles of Data Warehouse Design and Implementation
A DW can quickly descend into chaos if it is not designed, implemented, and maintained
correctly. Following are the 'Seven Principles of Data Warehousing' that help keep a DW
design and implementation on the road to achieving the desired results
[http://www.academictutorials.com/data-warehousing/].
3.4.1 Organizational Consensus
From the outset of the data warehousing effort, there should be a consensus-building
process that helps guide the planning, design and implementation process. If the
knowledge workers and managers see the data warehouse as an unnecessary intrusion - or
worse, a threatening intrusion - into their jobs, they won't like it and won't use it. Every
effort must be made to gain acceptance for, and minimize resistance to, the data
warehouse. If the stakeholders are involved early in the process, they're much more likely
to embrace the data warehouse, use it and, hopefully, champion it to the rest of the
company [http://www.academictutorials.com/data-warehousing/].
3.4.2 Data Integrity
The critical function of data warehousing - of any business intelligence (BI) project - is to
provide a single version of the truth about organizational data. The path to this brass ring
begins with achieving data integrity in the data warehouse. Therefore, any design for the
data warehouse should begin by minimizing the chances for data replication and
inconsistency. It should also promote data integration and standardization. Any
reasonable methodology chosen to achieve data integrity should be implemented with the
end result in mind [http://www.academictutorials.com/data-warehousing/].
3.4.3 Implementation Efficiency
To help meet the needs of the company as early as possible and minimize project costs,
the data warehouse design should be straightforward and efficient to implement. This is
truly a fundamental design issue. A technically elegant data warehouse could be designed;
however, if that design is difficult to understand or implement, or does not meet
user needs, the data warehouse project will be mired in difficulty and cost overruns almost
from the start. It is wise to opt for simplicity in the design plans and choose (to
the most practical extent) function over beautiful form. This choice will help stay within
budgetary constraints, and it will go a long way toward meeting user needs
effectively [http://www.academictutorials.com/data-warehousing/].
3.4.4 User Friendliness
User friendliness and ease-of-use issues, though they are addressed by the technical
people, are really business issues. This is because if the end business users
do not like the data warehouse, or if they find it difficult to use, they will not use it, and all
the work will be for nothing. To help achieve a user-friendly design, the data warehouse
should leverage a common front end across the company, based on user roles and
security levels. It should also be intuitive enough to have a minimal learning curve for
most users. Of course, there will be exceptions. The rule of thumb is: the least technical
users should find the interface reasonably intuitive
[http://www.academictutorials.com/data-warehousing/].
3.4.5 Operational Efficiency
This principle is really a corollary to the principle of implementation efficiency. Once
implemented, the data warehouse should be easy to support and should facilitate rapid responses
to business change requests. Errors and exceptions should also be easy to remedy, and
support costs should be moderate over the life of the data warehouse. The reason this
principle is a corollary to the implementation efficiency principle is that operational
efficiency can be achieved only with a data warehouse design that is easy to implement
and maintain. Again, a technically elegant solution might be attractive, but a practical,
easy-to-maintain solution will yield better results in the long run
[http://www.academictutorials.com/data-warehousing/].
3.4.6 Scalability
Scalability is often a big problem with data warehouse design. The solution is to build in
scalability from the start. Choose toolsets and platforms that support future expansions of
data volumes and types as well as changing business requirements. It's also a good idea to
look at toolsets and platforms that support integration of, and reporting on, unstructured
content and document repositories [http://www.academictutorials.com/data-warehousing/].
3.4.7 Compliance with IT Standards
Perhaps the most important IT principle to keep in mind is not to reinvent the wheel when
building the data warehouse. That is, the toolsets and platforms chosen to implement the
data warehouse should conform to and leverage existing IT standards. The existing skill
sets of IT and business users must be leveraged. In a way, this is a corollary of the user
friendliness principle. The more the users know going in, the easier they'll find the data
warehouse to use once they see it [http://www.academictutorials.com/data-warehousing/].
3.5 Data Warehouse Issues
There are certain issues surrounding data warehouses that companies need to be prepared
for. A failure to prepare for these issues is one of the key reasons why many
data warehouse projects are unsuccessful.
3.5.1 Loading and Cleansing Data
One of the first issues companies need to confront is that they are going to spend a great
deal of time loading and cleansing data. Some experts have said that the typical
data warehouse project will require companies to spend 80% of their time on it. While
the percentage may or may not be as high as 80%, one thing to understand is that most
vendors will understate the amount of time that needs to be spent on it. While cleansing the
data can be complicated, extracting it can be even more challenging.
No matter how well a company prepares for the project management, it must face the
fact that the scope of the project will probably be broader than estimated. While most
projects will begin with specific requirements, they will conclude with data. Once the end
users see what they can do with the data warehouse after it is completed, it is very likely
that they will place high demands on it. While there is nothing wrong with this, it is best
to find out what the users of the data warehouse will need next rather than only what they want
right now.
Another issue that companies will have to face is problems with their systems
placing information in the data warehouse. When a company enters this stage for the first
time, it will find that problems that have been hidden for years will suddenly appear.
Once this happens, business managers will have to decide whether the
problem should be fixed in the transaction processing system or in the read-only data
warehouse. It should also be noted that a company will often be responsible for
storing data that has not been collected by its existing systems. This can be a
headache for the developers who run into the problem, and the only way to solve it is by
storing that data in the system. Many companies will also find that some of their data is not
being validated by the transaction processing programs.
In a situation like this, the data will need to be validated. When data is placed in
a warehouse, a number of inconsistencies will occur within fields, many of which hold
descriptive information. One of the most common issues arises
when controls are not placed on the names of customers. This causes headaches
for the warehouse user who wants the data warehouse to carry out an ad hoc query
selecting the name of a specific customer. The developers of the data warehouse may find
themselves having to alter the transaction processing systems. In addition, they
may also be required to purchase certain forms of technology.
One of the most critical problems a company may face is a transaction processing system
that feeds information into the data warehouse with little detail. This may occur frequently in a
data warehouse that is tailored towards products or customers. Some developers
refer to this as a granularity issue. Regardless, it is a problem to avoid
at all costs. It is important to make sure that the information placed in the
data warehouse is rich in detail [http://www.exforsys.com/tutorials].
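A cleansing step of the kind discussed in this section often amounts to standardizing free-text fields, such as customer names, before they are loaded, so that an ad hoc query on one customer returns a single consistent value. The sketch below is a deliberately small example of such a rule set; the rules themselves are assumptions, and real cleansing logic is usually far more extensive.

    # Standardize a free-text customer name before loading it into the warehouse.
    import re

    def standardize_customer_name(raw: str) -> str:
        name = raw.strip().upper()
        name = re.sub(r"\s+", " ", name)                   # collapse repeated whitespace
        name = re.sub(r"[.,]", "", name)                   # drop stray punctuation
        # Normalize a few common suffix variants (assumed rules).
        name = re.sub(r"\b(INCORPORATED|INC)\b", "INC", name)
        name = re.sub(r"\b(CORPORATION|CORP)\b", "CORP", name)
        return name

    assert standardize_customer_name(" Deere &  Company, Incorporated ") == "DEERE & COMPANY INC"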
3.5.2 Cost and Budgeting Issues
Many companies also make the mistake of not budgeting enough for the resources
that are connected to the feeder system structure. To deal with this, companies will want
to construct a portion of the cleansing logic on the feeder system platform.
This is especially important if the platform happens to be a mainframe. The
cleansing process requires a great deal of sorting, and the good news about
this is that mainframe utilities are often proficient in this area. Some users choose to
construct aggregates within the mainframe, since aggregation will also require a lot of
sorting. It should also be noted that many end users will not use the training that they
receive for using the data warehouse [Dain, Hansen, Cost Savings are the New Black for
Data Warehousing (March 19, 2009)]. However, it is important that they be taught the
fundamentals of using it, especially if the company wants them to use the
data warehouse frequently.
Typical data warehouse project costs derive from two sources: technology and staffing.
The balance between these costs varies by a project's environment, but every project
includes both. Technology costs often include system hardware, database-related
software, and reporting software. Staffing costs address the individuals who gather
requirements as well as those who model, develop, and maintain the data warehouse.
In reviewing the marketplace for data warehousing hardware and software, organizations
can easily "break the bank" when procuring these technologies. As tempting as it may be
to seek out a large SAN with clustered high-availability database servers and farms of
Web servers to support a data warehouse, it is not always appropriate or necessary. It is
quite possible, and sometimes far more effective, to build a data warehouse solution on
relatively conservative hardware [Bill Inmon, Controlling Warehouse Costs (October
1998)].
Further, although there are a number of outstanding software solutions that address each
step in the data flow, it is critical to first find the right toolset before selecting an
expensive or feature-rich product. A common organizational mistake in data warehousing
is to ignore the currently owned technology and software capabilities and move ahead
quickly, if imprudently, to purchase a different product [Nathan, Rawling, Data
Warehousing on a Shoestring Budget (May, 2008)]. An organization often has the toolset
in place to effectively meet its data warehousing demands but simply needs the right
partner to fully implement its current technology.
3.5.3 Data Warehousing Security Issues
Data warehousing systems present special security issues, which include:
• the degree of security appropriate to summaries and aggregates of data
• the security appropriate for the exploration data warehouse, specifically designed for browsing and ad hoc queries
• the uses and abuses of data encryption as a method of enhancing privacy
Many data structures in the data warehouse are completely devoid of sensitive individual
identities by design and, therefore, do not require protection appropriate for the most
private and sensitive data. For example, when data has been aggregated into summaries
by brand or region, as is often the case with data warehousing, the data no longer presents
the risk of compromising the private identities of individuals. However, the data can still
have value as competitive intelligence of market trends, and thus requires careful
handling to keep it out of the hands of rival firms. Relaxed security does not mean a lack
of commitment to security. The point is that differing levels of security requirements
ought to remind us that one-size-fits-all solutions are likely to create trouble. Another
special security problem presented by data warehousing is precisely the reason why such
systems exist. Data warehouses are frequently used for browsing and exploring vast
reams of data – undirected exploration and knowledge discovery is provided by an entire
class of data mining tools. The point is to find novel combinations of products and issues.
Whether authentic or mythical, the example of market basket analysis whereby diapers
are frequently purchased with beer is now a classic case. The father going to the
convenience store for "emergency" disposable diapers and picking up a six-pack on the
way out suggests a novel product placement. The point is that it is hard to say in advance
what restrictions would disable such an exploratory data warehouse; therefore, the
tendency is to define an unrestricted scope to the exploration. A similar consideration of
undirected knowledge discovery applies to simple ad hoc access to the data warehouse.
Examples where a business analyst uses end-user self-service tools such as those by
Business Objects, Information Builders, Cognos or Oracle to issue queries directly
against the data without intermediate application security give the end user access to all
the data in the data warehouse. Given privacy and security imperatives, it may be
necessary to render the data anonymous prior to unleashing such an exploratory, ad hoc
process. That will create complexity where the goal is (sanctioned) cross-selling and upselling. The identity must be removed in such a way that it can be recovered, as the
purpose is often to make an offer to an individual customer.
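One way to meet that requirement is reversible pseudonymization: identities are replaced with opaque tokens before data reaches the exploratory environment, while a separately secured lookup allows an authorized process to recover the identity when a sanctioned offer is to be made. The sketch below only illustrates the idea; the scheme, class, and field names are assumptions rather than a production design.

    # Replace identities with tokens for exploration; recover them only when authorized.
    import secrets

    class Pseudonymizer:
        def __init__(self):
            self._token_to_identity = {}   # would live in a separately secured store
            self._identity_to_token = {}

        def tokenize(self, identity: str) -> str:
            if identity not in self._identity_to_token:
                token = "CUST-" + secrets.token_hex(8)
                self._identity_to_token[identity] = token
                self._token_to_identity[token] = identity
            return self._identity_to_token[identity]

        def recover(self, token: str) -> str:
            return self._token_to_identity[token]

    p = Pseudonymizer()
    record = {"customer": "Jane Doe", "basket": ["diapers", "beer"]}
    safe_record = {**record, "customer": p.tokenize(record["customer"])}
    print(safe_record)                          # identity removed for ad hoc exploration
    print(p.recover(safe_record["customer"]))   # recoverable by an authorized process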
Encryption of data has its uses, especially if the data must be transmitted over an insecure
medium such as the Internet. An employee, his or her manager, and the human resources
clerk all require access to the employee's record. Therefore, encrypting the data will not
distinguish between their access levels. It is misguided to believe that if encrypting some
data improves security, encrypting all the data improves security even more. Blanket,
global encryption degrades performance, lessens availability and requires complex
encryption key administration. Encryption is a computationally intense operation. It may
not impact performance noticeably when performed for one or two data elements; but
when performed arbitrarily for an entire table, the result may very well be a noticeable
performance impact. It might make sense to encrypt all the data on a laptop PC that is
being taken off site if the data is extremely sensitive. If the PC is lost or stolen, only
encryption will guarantee that the data is not compromised. However, an even better
alternative would be selective encryption and organizational steps to make sure the
physical site is secure and the media containing the data is handled diligently.
As a general rule, proven security practices and solutions developed to secure networks
can appropriately be extended to protect the data warehouse. In other cases, data
warehouses present special challenges and situations because the data is likely to be the
target that encourages hackers to try to gain access to the system. These practices extend
from organizational practices to high-technology code. The requirement for
authentication implies certain behavior – on-site staff should wear their corporate
identification badges and be required to sign an agreement never to share a user ID or
password with anyone. Based on the enterprise's specific confidentiality rules, new areas
where technologies are still emerging may be selectively used. It is essential that database
administrators work together with their security colleagues to define policies and
implement them using the role-based access control provided with the standard relational
database data control language [http://www.academictutorials.com].
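The role-based access control mentioned above is expressed in the standard data control language with GRANT statements issued per role. The sketch below simply generates such statements for a few assumed roles and warehouse tables; the role names, privileges, and tables are illustrative, and a DBA would run the resulting statements against the actual warehouse database.

    # Generate per-role GRANT statements (data control language) for warehouse tables.
    ROLE_GRANTS = {
        "analyst_role":  ["SELECT"],                       # read-only ad hoc querying
        "etl_role":      ["SELECT", "INSERT", "UPDATE"],   # nightly load processes
        "report_viewer": ["SELECT"],                       # published-report consumers
    }
    WAREHOUSE_TABLES = ["fact_sales", "dim_customer", "dim_product"]

    for role, privileges in ROLE_GRANTS.items():
        for table in WAREHOUSE_TABLES:
            print(f"GRANT {', '.join(privileges)} ON {table} TO {role};")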
3.5.4 Maintenance Issues for Data Warehousing Systems
Another important aspect of data warehousing is the maintenance of these systems. Adequate
knowledge about business and feeder system changes that will affect the data warehouse
systems is of utmost concern for anyone doing systems maintenance. In a data warehousing
environment, data is fed from more sources than in a typical transaction processing system.
Though intelligent use of the data extraction, cleansing, and loading tools and the
information catalogs can greatly ease the burden here, many changes will require a fair
amount of effort. Keeping informed of and assessing the impact of technically driven
changes to the feeder systems may be more difficult than keeping track of the business-driven
changes. Maintenance issues are hard to handle. Some of the concerns are as
follows:
• Figuring out if, when, and how to purge data (a minimal sketch follows this list). There comes a point when it does not make business sense to hold certain data in the warehousing system. This usually happens because of some type of capacity limit or restructuring of data, when it is not worth the effort to restructure certain data. At this point, purging of data to less expensive, alternative means of storage is necessary.
• Knowledge to determine which queries and reports should be IS-written and which should be user-written.
• Storing data in the data warehouse "for data's sake".
• Balancing the need for building aggregate structures for processing efficiency with the desire not to build a maintenance nightmare.
• Uncertainty whether to create certain reports/queries in the data warehousing system or in the "feeder" transaction processing system.
• Pressure to implement a means to interactively correct data in the data warehouse (and perhaps send corrections back to the transaction processing system).
• Uncertainty about which tools are most appropriate for a certain task.
• Figuring out how to test the effect of structure changes on end-user-written queries and reports.
• Determining how problems with feeder system update processing affect data warehouse system update processing.
• Knowledge that the business changes the meanings of attributes over time and that these changes can be overlooked.
• Reworking the implemented security.
• The need to keep reconciling feeder systems with the data warehouse systems.
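As a minimal sketch of the purge concern in the first item above, the example below copies rows older than a retention cutoff to cheaper offline storage and then deletes them from the warehouse table. The in-memory table, column names, and cutoff date are assumptions made for the illustration.

    # Archive rows older than a retention cutoff, then purge them from the warehouse.
    import csv
    import sqlite3

    RETENTION_CUTOFF = "2005-01-01"                  # assumed business rule

    conn = sqlite3.connect(":memory:")               # stand-in for the warehouse database
    conn.execute("CREATE TABLE fact_sales (sale_date TEXT, customer_id INTEGER, revenue REAL)")
    conn.executemany("INSERT INTO fact_sales VALUES (?, ?, ?)",
                     [("2003-06-01", 1, 100.0), ("2009-02-15", 2, 250.0)])

    old_rows = conn.execute(
        "SELECT * FROM fact_sales WHERE sale_date < ?", (RETENTION_CUTOFF,)).fetchall()

    with open("fact_sales_archive.csv", "a", newline="") as f:
        csv.writer(f).writerows(old_rows)            # less expensive, alternative storage

    conn.execute("DELETE FROM fact_sales WHERE sale_date < ?", (RETENTION_CUTOFF,))
    conn.commit()
    print(f"archived and purged {len(old_rows)} row(s)")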
In short, maintaining data warehouse architecture may be much harder than establishing
the architecture. It is also far more expensive (and complex) to maintain a data warehouse
than to build one [http://www.academictutorials.com].
Through this literature review we can conclude that the three main issues to be handled to
keep a data warehouse working efficiently are maintaining data integrity, maintaining a
high-performance system, and doing it all in a cost-efficient manner.
Chapter 4
SIGNIFICANCE OF THE PROJECT
The project is descriptive and explanatory in nature. The descriptive part of the project
studies the various data warehousing issues and the solutions to these issues, whereas
the explanatory part attempts to understand these issues in the industry - the real world. The
literature review of the project includes information obtained from whitepapers,
journals, periodicals, and press releases. The understanding of the real issues of the
industry is obtained from interviews, a survey, and case studies published by companies.
This project will give a better understanding of the current issues and how they affect the
data warehouse environment - not just the theoretical implications, but how these issues
affect the real world, through the use of case studies. This will effectively help in
understanding a better way to design a data warehouse environment.
Chapter 5
RESEARCH METHODOLOGY
The approach used to gather information on the subject included extensive research
and a survey. The research included a thorough study of various books, white
papers, journals, periodicals, and library archives. Information was also
obtained by means of search engines. Industry professionals who had experience in the data
warehousing environment were interviewed about problem cases and how they worked out
solutions. A small survey of people working in the data warehousing environment was
conducted to understand more about the issues they face and how important overcoming
these issues is for them.
Chapter 6
CASE STUDIES
6.1 John Deere – Interview with Project Manager
Company background
Deere & Company was founded in 1837 and has grown from a one-man blacksmith
shop into a corporation that today does business around the world and employs more
than 50,000 people. The company is collectively called John Deere.
John Deere consists of three major business segments - agriculture and turf,
construction and forestry, and credit. These segments, along with the support
operations of parts and power systems, are focused on helping customers be more
productive as they help to improve the quality of life for people around the world.
The company's products and services are primarily sold and serviced through
John Deere's dealer network.
John Deere is the world's leading manufacturer of farm equipment. The company also
produces and markets North America's broadest line of lawn and garden tractors,
mowers, golf course equipment, and other outdoor power products. John Deere
Landscapes provides irrigation equipment and nursery supplies to landscape service
professionals across the United States.
John Deere is the world's leading manufacturer of forestry equipment and is a major
manufacturer of construction equipment in North America.
John Deere Credit is one of the largest equipment finance companies in the U.S. with
more than 2.4 million accounts and a managed portfolio of nearly $23 billion (U.S.).
In addition to providing retail, wholesale and lease financing to help facilitate the sale
of John Deere agricultural, construction and forestry, and commercial and consumer
equipment, John Deere Credit also offers revolving credit, operating loans to farmers,
crop insurance (as a Managing General Agent), and debt financing for wind energy.
Today, John Deere Credit has approximately 1,900 employees worldwide and has
operations in 19 countries.
Resource
For this case study, Mr. Paresh Deshpande, a project manager at John Deere was
interviewed on October 12, 2010 at 10:30 am. The case provided here focuses on the
issues they faced while improving DB2 UPDATE performance. The case discusses
how to design Informatica mappings that do a large number of UPDATEs to a DB2
table so that they can be partitioned to do multiple updates simultaneously. The goal
was to improve performance of the system.
Situation
Deere Credit was converting most of its loans from a system called M010 to a new
system called Profile. One of the new features of Profile is daily interest calculation.
This is something that M010 only did a few times a month, but Profile would do it
nightly. When M010 calculated interest, it caused an update in a table for every open
agreement. Typically, this meant updates to about 350,000 rows. Updating this
many rows one at a time would typically take 35-45 minutes. Since M010 only did
this about three times a month, this wasn’t considered to be a problem. However with
Profile, this many updates would happen every night. This would add approximately
20 minutes to their normal nightly processing, which they did not want. It was
important that the nightly processing completed in a timely manner because they
couldn’t bring Profile online until most of the batch processing was done.
Doing 350,000 DB2 UPDATEs became the bottleneck in the nightly processing and
hence there was a need to find a faster way of doing them.
Another problem at Deere Credit was that DB2 was generally configured to use page
level locking. This means that when a row in a table is updated, it not only locks that
particular row, but also locks any other rows that happen to be in the same page on
the disk. This improves DB2 performance but it can cause deadlock situations to
occur more frequently when multiple connections are updating the same table. Since
one UPDATE will cause all records on that page to be locked, simply creating
multiple UPDATE threads will almost certainly cause deadlocks.
Solution
The solution was to use Informatica’s partitioning feature to do multiple UPDATE
streams simultaneously. Informatica has the ability to thread mappings. When a
mapping is threaded, it creates several instances of the mapping running
simultaneously. Source data is sent to each thread depending on how the partitions
are configured. Each thread will get its own connection to the database and will
appear to be different users to the database.
When carefully done, this allows
27
multiple UPDATEs to occur simultaneously to the same table leading to performance
improvement Further to avoid the deadlocks caused by page level locking, they
developed a technique to minimize (but not completely eliminate) updates to the same
DB2 page in the different Informatica partitions. The attempt was to send all rows
that are on the same physical page to the same Informatica thread. This minimizes the
chances that two separate connections are trying to update data on the same DB2
page, but doesn’t totally eliminate the deadlock situation. This solution assumes that
automatic deadlock retry logic is in place for automatically restarting Informatica
maps that fail due to DB2 deadlocks. This solution does not completely avoid deadlocks,
but it minimizes the chance of their occurrence.
There were several other efforts taken to enhance the performance:
• To enhance performance, the DBAs try to keep the rows in a table in a particular physical order, usually in primary key order. This is called being 'in cluster', and it is important that the table being updated stay 'in cluster'. The table will probably never be 100% 'in cluster', but it should be at least 90% in cluster. If it falls below that level, the table should be reorganized to be put back 'in cluster'.
• The rows being updated need to be sorted in the same order that they are in the physical table. A sorter transformation should be used to guarantee this order. For the sorter transformation to work properly in a partitioned Informatica map, all Informatica data should be 'funneled' through a single partition point here so that all records will go through the same sorter.
• Once data is sorted, it needs to be assigned a group number (see the sketch after this list). Group numbers start at zero and are incremented when the record count exceeds GroupSize. For example, if the GroupSize is 100, records 1-100 will be assigned a group number of 0, records 101-200 a group number of 1, and so on. This number is then used to decide which update partition to use. The number of records in a single group should be about 10 times larger than the number of rows that fit on a single DB2 page. This size seems to be a good balance of performance and deadlock occurrence. The larger the group size, the more likely that nearby records will be sent to the same partition, but it will slow down the performance. For example, a group size of 1 would be very fast, but it will almost certainly deadlock. Likewise, a group size larger than the number of rows to be updated will send all rows to the same partition, so it would never deadlock, but it will effectively disable partitioned updates.
• Informatica normally decides internally when to commit DB changes, and this does not work well with partitioned updates. Informatica typically does several thousand updates before committing, and this is not frequent enough to avoid deadlocks. The Transaction Control transformation must be used to commit transactions frequently. Committing transactions frequently decreases both the chance of a deadlock and performance: the lower the commit size, the less likely a deadlock will occur, but performance will be impacted. When a group of records comes close to being finished, Informatica seems to 'cache' the last few records and not send them through. This leaves the possibility that some records will be left in an uncommitted state, and that can cause deadlocks. To prevent this, when a group of records is sent to a partition, the first few records and the last few records are committed after every update. This cuts the chances for a deadlock between the rows in this group and the rows in the groups preceding and following it.
• Because deadlocks can and do occasionally happen when partitioning updates, it is best to skip records that have already been updated in a previous run. Besides the performance benefit of skipping records that have already been updated, beginning the update in a different position helps to prevent deadlocks from occurring in the same spot during the update process.
• When a mapping is restarted because of a deadlock, they run with an alternate configuration called 'failsafe' mode. This is an alternate configuration of commit points and group sizes that is guaranteed to complete. Normally the commit point in the restart configuration is set to 1 to guarantee that no deadlocks will occur in the second attempt. The commit size, group size, and edge size should all be defined in the session parameter file. The alternate values to use when restarting should also be contained in the parameter file. The values for these parameters may need to be tweaked from time to time.
• In addition to the mapping changes, the workflow needs to be configured for partitioning. Deere found that six partitions worked well for them. This setting is something that can also be tuned; they observed increased deadlocks and some decrease in performance with more than 8 partitions. A partition point needs to be set on the sorter transformation so that all rows go through one partition when being sorted and when going through the expression transformation. This ensures that all records with similar primary keys are assigned to the same group. The transaction control transformation should also have a partition point based on a hash key on the assigned group number. This will ensure that all rows with the same group number will be sent to the same database connection.
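The grouping and routing idea in the list above (see the group-number item) can be sketched outside of Informatica as follows: rows already sorted in table order are assigned a group number every GroupSize records, and the group number is hashed to pick one of the update partitions, so physically adjacent rows land on the same database connection. The sizes below are assumptions for illustration; Informatica performs the equivalent logic inside its expression and transaction control transformations.

    # Assign group numbers to sorted rows and route each group to an update partition.
    GROUP_SIZE = 1000        # roughly 10x the rows per DB2 page (assumed value)
    NUM_PARTITIONS = 6       # the partition count Deere reported worked well

    def assign_groups(sorted_keys):
        """Yield (key, group_number, partition) for rows already sorted in table order."""
        for index, key in enumerate(sorted_keys):
            group = index // GROUP_SIZE                  # group numbers start at zero
            partition = hash(group) % NUM_PARTITIONS     # hash key on the group number
            yield key, group, partition

    # Example: 3,500 sorted primary keys spread over 4 groups and the 6 partitions.
    rows = list(assign_groups(range(3500)))
    print(rows[0], rows[-1])

Because every row in a group maps to the same partition, two connections rarely touch the same DB2 page at the same time, which is what reduces (without eliminating) the deadlocks described above.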
6.2 Kyobo Life Insurance – Case Study One
Company Background
Kyobo Life Insurance is one of South Korea's top life insurance firms. The company
provides life insurance and asset management products to more than 10 million
customers. Its offerings include traditional life, health and disability, and retirement
and pension products for individuals and businesses. Products are distributed through
financial planning agents. Kyobo Life also offers personal-use and mortgage loans,
and it has some international operations. The company was founded in 1958 by Shin
Yong-Ho, father of CEO Shin Chang-Jae. The Kyobo Group conglomerate operates
under a common management structure in a variety of sectors including real estate,
investment banking, and a bookstore.
Kyobo Life Insurance has quite a few accolades as an industry leader: ‘One of the
Top 30 Admired Companies in Korea', winning the Customer Satisfaction Grand
Prize awarded by KMAC for 5 years in a row, and the #1 insurance company in the
KCSI (Korean Customer Satisfaction Index) 2004 survey.
These are the results of Kyobo Life's paradigm shift from volume to value. Relying
on a solid customer base, Kyobo Life is striving to maximize customer value through
qualitative growth. With a firmly established open and high-performance corporate
culture, Kyobo Life will expand into foreign markets and it aims to become the
‘most-preferred brand in Northeast Asia' by the end of this decade.
Situation
Kyobo Life wanted to make an effort to establish transparent management and
efficiency. In addition to ensuring transparency in finance and administration, they
also wanted to manage four major outcomes including management planning and
profit management, and simplify the complicated interfaces between IT systems,
which had come to resemble spaghetti. The project was called the value innovation
project.
Implementation of this project would provide the business advantage of making
diverse management information available for quick decision-making. The users
could directly extract analytical and statistical data, hence raising user productivity
and satisfaction.
The goal was to reduce data extraction time from 5 days to 1 day,
improve data quality and unify data across all business units for single access.
Solution
In the former computing environment, limited statistical and analytical data was
available only in certain areas such as insurance, credit, and accounting. To get more
diverse data, users had to consult the Information System Office. Sybase IQ can
quickly process large amounts of data, has a unique way of storing data and is capable
of compressed storage. Sybase IQ was used to create the EDW (Enterprise Data
Warehouse) and datamarts were implemented for 14 subject areas in 4 groups.
Implementing Sybase IQ on IBM System allowed Kyobo Life to efficiently allocate
resources without any interruption in service.
The result of this implementation was that the quality of analysis data was improved,
and the availability of diverse information improved decision-making. As the users
could now perform analysis and extract statistical data on their own, average data
extraction time was reduced from 5 days to 1 day, improving user productivity and
satisfaction.
To realize these goals, Kyobo Life finished the value innovation project which had
three primary goals: to establish a responsible management system; to support
strategic decision-making; and to integrate and accelerate financial data. The
company wanted to become more competitive in the insurance industry through value
management, which was achieved by the value innovation project. The project
consisted of three parts: enterprise resource planning (ERP), enterprise data
warehouse (EDW) and enterprise application integration (EAI). Kyobo Life
employed a "big-bang" method in which both the EDW and ERP were built
concurrently. The scale of the effort was also noteworthy as the size of the EDW
gained attention across Korea as well as abroad.
Kyobo Life installed the EDW because the existing computing environment only
provided statistical and analysis data in a few areas such as insurance, credit and
accounting. To access more diverse data, users had to consult the Information
System Office leading sometimes to long waiting periods. Although many statistical
systems were available, they did not provide enough strategic information. To address
this issue, Kyobo Life installed an information infrastructure to make it easier to
create, manage, and use the information. The company also decided to integrate its
data and shift its computing focus from business processing to analysis. To keep all
the data in one place, Kyobo set up and applied data standards to integrate enterprise-wide information, provided speedy and correct information related to profitability and
the four major project focus areas, and developed an infrastructure that allowed the
direct use of data. In this process Kyobo Life adopted Sybase IQ, which was different
from other databases used in insurance, credit and stocks and bonds, and chose IBM
Websphere DataStage as the ETL tool. To build and deploy the EDW, Kyobo Life
worked closely with both Sybase and IBM. The combined software, hardware and
services technical teams contributed unique skill sets to design data architecture,
optimize user analysis, provide a data governance structure, and assure data integrity
for the system.
Kyobo Life combined insurance, credit (host), stocks and bonds (SAPCFM), special
accounting, personnel (SAP HR) into EAI to create a unified system that provides
office data such as account closing management, bill management, funds
management, tax management, budgeting, and financial accounting. The ETL tool
was used to load data nightly into the EDW and make it available to the entire
information system. EDW contains large amounts of data on customers, insurance
contracts, products, and insurance premiums. Yet, its greatest advantage is its ease of
use. Related data are divided into 4 major categories: customers/activities;
contracts/financial transactions; commissions; and investment/finance/managerial
accounting. The warehouse is set up with 14 subject areas including customers,
activities, communication, contracts, products, fund setting, financial transactions,
closing, organization, market, investment, and fixed assets/real estate. This data
design makes it easy for users to analyze data. As a result, 700 employees use the
EDW to obtain analysis data.
A spokesperson says [Lee, Hae Seok, Kyobo Life Insurance Case Study.
http://www.sybase.com/detail?id=1054785], "After the system was installed,
enterprise-wide standards were applied to business data from various channels and
data was integrated through the ETL tool to improve the quality of analysis data.
Diverse management information is available for quick decision-making. In addition,
as users can directly extract analytical and statistical data, data extraction time was
reduced from 5 days to less than a day, raising user productivity and satisfaction."
In addition to these benefits, Kyobo Life Insurance uses the enterprise-wide data
offered by the EDW to conduct analysis from diverse viewpoints, thereby improving
transactional processes and work efficiency. The EDW, a single channel that provides
the information system with data, improved the flexibility and scalability of the
system, and raised development productivity. The company shifted from providing
task-related information to providing analytical enterprise-wide information, and is
transforming into a "real-time corporation" by providing forecast data based on real-time analysis.
To this end, Kyobo Life now holds management performance reporting sessions
using the management information system based on EDW data and also maintains
data quality by reinforcing the sense of data ownership of each department. Kyobo
Life also uses the data to perform the information system's analysis and statistical
work. To expand the user base, increase use, and improve competencies, the company
provides convenient user tools, open training courses and a specialist license system.
Kyobo Life trains instructors at each branch office, who will then give training
courses for their respective branches [Lee, Hae Seok, Kyobo Life Insurance Case
Study].
6.3 Scandinavian Airlines – Case Study Two
Company Background
Scandinavian Airlines (SAS) is the flag carrier of Denmark, Norway and Sweden. It
is the largest airline in Scandinavia, flies to 150 destinations, employs 32,000 staff,
and carries 40 million passengers annually. It is a founding member of the Star
Alliance (Star Alliance is the world's first and largest airline alliance, headquartered
in Frankfurt, Germany. Founded in 1997, its name and emblem represent the five
founding airlines, Air Canada, Lufthansa, Scandinavian Airlines, Thai Airways
International and United Airlines. Star Alliance has since grown considerably and
now has 28 member airlines). The company has its head office in Solna,
near Stockholm, Sweden.
Situation
Scandinavian Airlines (SAS) is the largest airline in Scandinavia, but with major
changes reshaping the airline industry - particularly low-cost competition and rising
fuel prices - its executives face significant challenges.
To operate effectively, SAS employees deal with large volumes of operational and
customer data. Existing systems for processing this information depended on an aging
data warehouse powered by IBM DB2 mainframe technology. This had become
expensive to operate and maintain. The company’s IT team faced increasing
difficulties providing employees with timely access to data. In particular, the existing
system suffered from scalability issues and low performance due to priority conflicts.
This made it difficult for the team to increase analysis capabilities or meet business
demands for new reports and online queries.
SAS decided to investigate the advantages of improved data management and more
flexible support for business analysis and reporting. In particular, they wanted to
improve the availability of business intelligence relating to reservations, customer
behavior, and agent activities.
The IT Architect at SAS, Petterson, said that they needed a new information
management environment that could significantly reduce IT management and
maintenance costs, deliver enhanced analysis and business intelligence tools, and give
more employees across the airline reliable access to critical business data.
Solution
SAS executives realized this was the start of the ongoing strategic process required to
replace aging mainframe applications with new technology solutions that could meet
the dynamic needs of a leading airline. They conducted a thorough evaluation of the
issues and likely cost benefits, and turned to Microsoft to provide essential support.
Microsoft is a long-term partner of SAS. It participates in regular forums with the
airline’s managers to help identify critical business challenges and provide strategic
IT solutions to resolve them. The Microsoft team worked with SAS to consider the
business case. Using advice about best practices from Microsoft helped SAS senior
managers develop a comprehensive, dependable strategy to overcome their immediate
business data issues and meet future business analysis needs.
The strategy was also influenced by key technology partner Intel Corporation, which
worked closely with SAS and Microsoft to define a road map of increasing processor
capacity featuring Intel processors. Additionally, the Intel Lab played a central role in
benchmarking, while Intel Services assisted in testing the scalability of the proposed
solution.
A feasibility study followed, which led to the planning and preparation for a proof of
concept that would be the initial phase of the overall strategy. The proof of concept
focused on:
• The migration of data from 600 DB2 tables in the mainframe environment to a Microsoft SQL Server 2005 database running in a Microsoft Windows Server 2003 environment on a new HP Superdome server platform based on Intel processors.
• The testing of 25 COBOL pre-processing programs from the mainframe environment, using emulation technologies provided by partner Micro Focus International.
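A common first check in a table-by-table migration like the one above is to compare row counts between the source and target databases after each load. The sketch below assumes generic Python DB-API connections and a hypothetical list of table names; it is illustrative only and is not the tooling actually used in the SAS proof of concept.

    def compare_row_counts(source_conn, target_conn, table_names):
        """Return (table, source_count, target_count) for every table whose counts differ."""
        mismatches = []
        for table in table_names:
            src_cur = source_conn.cursor()
            tgt_cur = target_conn.cursor()
            src_cur.execute("SELECT COUNT(*) FROM " + table)
            tgt_cur.execute("SELECT COUNT(*) FROM " + table)
            src_count = src_cur.fetchone()[0]
            tgt_count = tgt_cur.fetchone()[0]
            if src_count != tgt_count:
                mismatches.append((table, src_count, tgt_count))
        return mismatches

    # Hypothetical usage: report any migrated tables that did not copy completely.
    # problems = compare_row_counts(db2_conn, sqlserver_conn, ["BOOKINGS", "CUSTOMERS"])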
Rather than retrain developers in the use of new technologies, the decision was taken
to introduce the Microsoft .NET Framework. This provides an up-to-date development
environment capable of supporting the company’s COBOL programmers. The .NET
Framework is an integral component of Windows Server 2003. It provides a
user-friendly programming model and runtime for Web services, Web applications,
and smart client applications.
SQL Server 2005 would help the SAS IT team resolve technical issues related to data
warehousing. In the long term, their intention was to use the advanced integration
capabilities within SQL Server 2005, in particular its bulk data transfer management
and extract, transform, and load (ETL) functionality.
SQL Server 2005 Analysis Services was another important element of the new
solution, providing a combination of transactional and online analytical processing
(OLAP) benefits. The Unified Dimensional Model (UDM) would make it possible to
combine various relational and OLAP data models into a single dimensional model
that gives users fast access to accurate business data.
With plans to extend access to business data to users across the business, security is a
key issue for the IT team. To meet these challenges, it decided to use Microsoft
Active Directory directory services as a replacement for the existing mainframe
security layer. Active Directory allows the team to efficiently manage all the security
policies and permissions needed to ensure authorized employees have reliable access
to the data warehouse [Microsoft Case Studies, Scandinavian Airlines].
Benefits
By deploying the latest Microsoft technologies, SAS reduced development times and
costs for creating new business intelligence functionality. In addition, it has increased
employee access to key business information, enhanced network security, and
maximized the value of existing technologies and IT skills.
SAS minimized these issues by investing in Microsoft technologies. At the same
time, the company has had the opportunity to extend business intelligence
functionality while reducing development costs. Excellent support is also protecting
its technology investments.
Improved Scalability Provides More Users with Faster Access to Business
Intelligence
In the DB2 mainframe environment, it was no longer cost effective or practical to
separate data analysis and transactional environments. As a result, the performance of
applications accessing the data warehouse was often impeded because priority was
given to transactional systems. Inevitably, business analysts suffered delays and were
hindered by inaccurate data.
The scalability of the new environment ensured that analytical and transactional
systems were now separate. Using SQL Server 2005, more users can be added as
required, while delivering improved response times.
Thanks to the ease of integration between Microsoft technologies, SAS will also be
able to offer low-end data analysis capabilities to the wider range of employees
working with existing Microsoft desktop environments [Microsoft Case Studies,
Scandinavian Airlines].
Helping Managers Reduce Infrastructure Costs
Low cost operators and rising fuel prices are increasing pressure on airline managers
to reduce costs. As a result, the cost-effectiveness of IT operations is a key
requirement for SAS. Following migration of its data warehouse to SQL Server 2005,
the company predicts substantial savings. The IT architect Petterson said that this
initiative would reduce data warehousing costs by 50 per cent [Microsoft Case
Studies, Scandinavian Airlines.
http://www.microsoft.com/casestudies/Case_Study_Detail.aspx?CaseStudyID=200330].
The Intel processors will also contribute significantly to operational savings.
Compared to a mainframe environment, this technology requires lower initial
investment. In addition, it requires less stringent environmental control, resulting in
reduced operational costs [Microsoft Case Studies, Scandinavian Airlines].
Improved Operational Efficiency
As well as helping SAS managers significantly reduce the cost of hardware and
software, the new Intel and Microsoft strategy is also achieving resource efficiencies
by maximizing productivity. Petterson says: “Now that we can implement standard
analysis tools that offer access to shared data across the business, we can also
eliminate the costs of duplicated and disparate departmental systems. Because it is
easier for our managers and analysts to understand the whole business, they can draw
conclusions across organizational units and data boundaries. Before, they only had a
restricted, departmentalized view of operations. As a result, we are making substantial
operational savings through increased productivity and efficiency. For example, the
commercial insights gained from improved access to business data are helping the
sales team target prospects more effectively, eliminate unproductive sales channels,
and save on advertising costs.” [Microsoft Case Studies, Scandinavian Airlines].
Providing Employees with More Accurate Business Intelligence
It is essential that SAS managers can respond quickly to challenges from their
competitors. Essential capabilities are rapid access to accurate business data, effective
analysis, and reporting that provides timely business intelligence. SQL Server 2005
will deliver these benefits, ensuring SAS managers have access to the information
they need to work effectively. The UDM capability provided by SQL Server 2005
Analysis Services will ensure the gap between users and business data is closed. In
doing so, it will deliver valuable operational and data management benefits
[Microsoft Case Studies, Scandinavian Airlines].
Improving Competitiveness
The new solution will provide self-help environments for SAS business analysts, who
will now be able to quickly design their own queries and reports. This means the
business will be more responsive to changing circumstances. It will be easier to take
full advantage of every new business opportunity and improve competitive advantage.
For example, by better understanding passenger loads per service, they could
accelerate pricing and scheduling decisions for capacity flights [Microsoft Case
Studies, Scandinavian Airlines].
Delivering Better Services to Customers
Using the advanced analysis capabilities of SQL Server 2005, SAS will be able to
enhance relationships with its customers and increase per-customer revenues. The
new Microsoft solution would allow SAS to perform detailed analysis of customer
behavior - such as Internet reservations and check-ins, and call center interactions.
It would also help reduce the impact of delays and cancellations on customers. The
benefits would be better sales services and products, and improved customer service
overall [Microsoft Case Studies, Scandinavian Airlines].
6.4 Philips Consumer Electronics – Case Study Three
Company Background
Philips Consumer Electronics (Philips CE) is a global leader across its
healthcare, lighting and lifestyle portfolio. It is one of the world’s top consumer
electronics companies delivering technology that enhances everyday lives. Philips CE
manufactures and markets a huge range of consumer goods, from flat screen
televisions to DVD recorders, and has a significant global presence.
Situation
Local operational units and Business Units at Philips CE had previously handled their
own data warehousing, using locally selected tools. This inevitably led to duplication
of data and of IT costs, and an increase in the interfaces between disparate systems.
The resulting framework was inefficient because it could not evolve swiftly or
cost-effectively in response to changing business needs.
Philips CE is never static: meeting the specific challenges of the consumer electronics
markets - fast-changing product lines and competitive landscape, numerous customers
and channels to market - requires constant evolution. Competition drives increasing
sophistication in business controls, expanding supply chains introduce new data
sources, and evolving business processes drive adaptations and integration of data
models and data definitions.
Han Slaats, Program Manager of Enterprise Data Warehousing at Philips CE states
that the process of data warehousing is key, because the whole point is to deliver the
right data at the right time in the right place. To do so, you have to treat data
warehousing as a service, which changes to meet the needs of the business.
Philips CE needed a strategic data warehousing tool to populate its best-practice
framework for information management, which draws on the concept of adaptive
enterprise data warehousing as a central service rather than a number of stand-alone
projects across the company. By focusing resources on repeatable solutions and a
single technological skill-set, Philips CE aimed to reduce operational costs and
increase the speed of ROI from data warehousing.
Solution
Philips CE moved into a richer information world through an advanced data
warehousing strategy, based on the Kalido Information Engine (Kalido). With a
board-level view of data warehousing as a process and a service, rather than as a one-off project, the company has built a best-practice intellectual framework for
information management that will allow it to reduce operational costs and deliver
greater value to the business.
The Philips CE vision of best practice and its focus on generating a core of data
warehousing expertise drove the development of a central service organization for
information management. The company partnered with systems integrator Atos
Origin and selected Kalido as its global strategic data warehousing tool. Through fast,
flexible and repeatable implementations, Kalido delivers a federated data
warehousing solution at high speed and with low implementation and management
costs.
They evolved a best-practice solution for data management that would enable them to
reduce IT expenditure and deliver value-added information to the business [Kalido
Case Study, Philips Builds Best-Practice Information Management Architecture].
Benefits
Ability to handle change
The new best-practice framework at Philips CE defined a central service organization
to manage all corporate data warehousing using a single strategic technological tool.
The criteria governing the selection of the tool were that it should be highly flexible,
driven by business metadata, quick to implement, and cost-effective to maintain.
Philips CE worked with systems integrator Atos Origin to select a tool to populate its
data warehousing concept. Following a successful proof of concept by Atos Origin,
Philips CE selected Kalido as its key strategic data warehousing tool [Kalido Case
Study, Philips Builds Best-Practice Information Management Architecture].
Fast ROI
Kalido is a non-intrusive adaptive enterprise data warehousing solution, so it fitted
with Philips CE’s existing technology and removed the need for costly and inflexible
standardization of source systems. Its revolutionary approach to database design
adapted to changes in data and data structures without altering database definitions, or
any of the code that accesses them. This meant that it could deliver integrated
information consistently even while changes in business structures were occurring.
By designing an architecture based on a central service model and a single
technological tool, Kalido, they ensured fast ROI on data warehousing projects,
excellent on-going support, and increased value to the business as a whole with
subsequent projects.
The central core of data warehousing competence is enriched with each new project.
Successive implementations could be planned and completed more quickly and cost-effectively. As new instances of Kalido are put in place, more enterprise data is put
into a comparable structure, and the total picture of data within the enterprise
becomes ever richer.
Their strategy was not the creation of a single, unified data warehouse for the whole
organization. However, bringing all data warehouse projects into a single service
domain with one supporting technology meant that if they chose to integrate
reporting across Business Units, they would be able to develop a common data model
quickly and cost-effectively.
Furthermore, a single skill-set for data warehousing reduced training and support costs,
and accelerated implementation of repeatable Kalido solutions. It led to significant
savings on the cost of deploying data warehouses and to significant reductions in data
warehouse implementation time [Kalido Case Study, Philips Builds Best-Practice
Information Management Architecture].
Chapter 7
SURVEY
At the outset of the study, it was planned to conduct a study of a few cases to emphasize
the current issues in the data warehousing environment. After gathering the cases, a need
was felt to support the results drawn with a small online survey. This online survey was
conducted to get a better understanding of what features are important to professionals
working in a data warehousing environment and to further validate the conclusions
drawn from the cases studied.
7.1 Sample Population
The population of subjects for the survey comprised all those people working in a data
warehousing environment to whom the investigator had access. This is a convenience
sample that provides data for an exploratory study, which is by no means a random,
scientific survey. Differences in gender, location, organization, and work experience
were not taken into consideration for the purpose of this study. The size of the data
warehouse and the work position of each subject were gathered to understand whether
they made any difference in the respondent's viewpoint towards the problem.
7.2 Sample Frame
The sample frame for the survey included those professionals who had experience
working in a data warehousing environment.
The survey was drafted using Google Documents. Google Documents lets you create a
form that can be emailed, and it creates a spreadsheet where all the responses are
recorded.
The survey was emailed to acquaintances, past colleagues, and friends working in a data
warehousing environment. The survey was also made available on a social networking
site for responses.
7.3 Survey Response
The emails were sent out to 48 prospective professional respondents, of whom only 25
participated in the survey.
The questions used in the survey were straightforward and not too detailed. The answers
were gathered in a manner that generated nominal data, which would be easy to analyze.
The survey is presented in Appendix A of this project report.
7.4 Data Collection via Online Survey
An online survey was chosen as the method of data collection because it was the
lowest-cost option. It covered a larger geographic area with no increase in cost. This
method also made it possible to get responses from all over the world and from
otherwise hard-to-reach respondents. It is also more convenient for the respondents, as
they can think about the questions and answer at their convenience. Finally, the online
survey helped in collecting the responses rapidly and accurately.
Chapter 8
CASE STUDIES AND SURVEY OBSERVATION
John Deere is a very large organization and the case illustrated by Mr. Deshpande
suggested that John Deere strives to keep up the performance of the system. In this case,
migration to a new system was going to affect the overall performance of the system.
Updating over 350,000 rows every night would add approximately 20 minutes to their
normal nightly processing, which they did not want. It was important that the
nightly processing completed in a timely manner because they couldn’t bring Profile
online until most of the batch processing was done. This increased the unavailability of
the system until all the updates were done. Another problem was the deadlock situations
occurring more frequently when multiple connections updated the same table due to the
page locking configuration of DB2.
As illustrated in the case, they took efforts to take care of the situation by using
Informatica’s partitioning feature and also made some additional efforts to improve
overall performance and keep the system up and running. Improving the performance of
the system was given higher priority in this case.
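As a rough illustration of why partitioning the nightly update helps, the sketch below splits the rows to be updated into independent key ranges and commits each range in small batches, so that no single transaction holds page locks on the whole table for long. The table and column names are hypothetical, and this is only a generic illustration of the idea, not the Informatica partitioning configuration actually used at John Deere.

    def partitioned_nightly_update(conn, key_ranges, batch_size=5000):
        """Apply staged profile updates one key range at a time, committing in batches."""
        cur = conn.cursor()
        for low_key, high_key in key_ranges:  # e.g. [(0, 99999), (100000, 199999), ...]
            cur.execute(
                "SELECT customer_id, new_balance FROM profile_staging "
                "WHERE customer_id BETWEEN ? AND ?",
                (low_key, high_key),
            )
            pending = cur.fetchall()
            for start in range(0, len(pending), batch_size):
                batch = pending[start:start + batch_size]
                cur.executemany(
                    "UPDATE profile SET balance = ? WHERE customer_id = ?",
                    [(balance, cust_id) for cust_id, balance in batch],
                )
                conn.commit()  # short transactions reduce lock contention and deadlock risk

In a real environment each key range could be processed by a separate session in parallel, which is essentially what a partitioned ETL load does.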
Other issues faced by Mr. Deshpande as a project manager relate to cost management
and are associated with software and hardware updates. Eventually, the technology
becomes obsolete and maintenance costs increase. They use Informatica as their ETL
tool and have to keep up with the changing technology; i.e., if the software is upgraded,
they need to provide the hardware required for the new software version to function.
Apart from this, he also has to deal with data management which includes policies related
to data retention, data storage, data mapping and data recovery.
In case of Kyobo Life Insurance, one of South Korea's top life insurance firms, the
objective was to maximize customer value through qualitative growth. This was achieved
through the ‘Innovation Project’ which would provide the business advantage of making
diverse management information available for quick decision-making.
The goal was
to reduce data extraction time from 5 days to 1 day, improve data quality and unify data
across all business units for single access. This was achieved by Sybase IQ, which can
quickly process large amounts of data, has a unique way of storing data and is capable of
compressed storage. Sybase IQ was used to create the Enterprise Data Warehouse at
Kyobo.
It is evident from this case that a company as large as Kyobo, which provides life
insurance and asset management products to more than 10 million customers, needs a
system to make diverse management information available to improve the
decision-making process. The data from all the channels was integrated through the ETL tool to
improve the quality of analysis data. This case goes to show that more data, properly
integrated, means better information - which leads to improved decisions.
Scandinavian Airlines (SAS), the largest airline in Scandinavia, was facing
significant challenges, as its existing systems for processing operational and customer
information depended on an aging data warehouse powered by IBM DB2 mainframe technology. As
infrastructure becomes obsolete, it becomes expensive to operate and maintain.
Moreover, low cost operators and rising fuel prices were increasing pressure on airline
managers to reduce costs.
Therefore, it became necessary for SAS to seek a solution which could help reduce the
costs and could also meet the dynamic needs of a leading airline. SAS, with the help of
Microsoft, came up with an IT solution to increase performance and reduce
implementation and maintenance costs.
This case illustrates that the solution for every organization is going to be unique. Based
on its requirements, abilities, and budget, every organization needs to come up with
a solution that fits its needs. For example, to reduce implementation costs, SAS
and Microsoft decided that instead of retraining developers they would use the Microsoft
.NET Framework, which would provide an up-to-date development environment capable
of supporting the company’s COBOL programmers.
At Philips Consumer Electronics, local operational units and business units had
previously handled their own data warehousing, using locally selected tools, which led to
duplication of data and of IT costs, and an increase in the interfaces between disparate
systems. An organization that changes as much as Philips CE requires constant evolution.
The framework they used was inefficient because it could not evolve swiftly or cost-effectively in response to changing business needs.
Similar to the Scandinavian Airlines case, they came up with a solution that fit their business
needs. Philips CE moved to the Kalido Information Engine (Kalido). With a board-level
view of data warehousing as a process and a service, the company built a best-practice
intellectual framework for information management that would allow it to reduce
operational costs and deliver greater value to the business.
By bringing all data warehouse projects into a single service domain with one supporting
technology, rather than creating a single, unified data warehouse for the whole
organization, Philips CE improved the efficiency of its systems. Furthermore, a single
skill-set for data warehousing reduced training and support costs and accelerated
implementation of repeatable Kalido solutions, which led to significant savings on the
cost of deploying data warehouses and to significant reductions in data warehouse
implementation time.
This case further validates the point that solutions to improve efficiency vary from
organization to organization based on their needs, abilities and budget.
From the Literature Review, the deduction was made that the three main issues to be
handled to keep data warehousing working efficiently are to maintain data integrity,
to maintain a high-performance system, and to do both in a cost-efficient manner. The four
case studies have provided sufficient evidence that these are the three key issues around
which the solutions are designed for any organization.
Later in the project it was decided to conduct a small survey to find out how
professionals in different positions and different-sized data warehouses viewed the
importance of these issues. The survey also looked at what issues arise across
different data warehouse sizes.
The survey asked the respondents their position in the organization and the size of the
data warehouse in their organization. They were also asked to rate the features based on
their importance to them. The features to be rated were easy access to data, efficient
performance, data currency (recency), data accuracy, and implementation and
maintenance costs. Finally, they were asked to rate which of the following issues they
faced at their organization - problem in accessing/restoring data, inefficiency in
performance, inaccurate data or obsolete infrastructure.
The distribution of the survey respondents is shown in Figure 1 and Figure 2.
Figure 1: Number of Respondents from Different Data Warehouse Sizes (bar chart; x-axis: data warehouse size, with categories <1000GB, <10 TB, and >10TB; y-axis: number of respondents)
Figure 2: Respondents' Role in the Data Warehouse (bar chart; x-axis: role, with categories Applications Programmer, Database administrator, End User, and Project Manager; y-axis: number of respondents)
Among the responses received, 7 worked in a small data warehouse smaller than 1000GB
in size, 11 worked in a medium sized data warehouse with a size between 1000GB and
10TB, and 7 worked in a large data warehouse sized greater than 10TB. With respect to
their roles in the data warehouse, 11 were Application Programmers, 4 Database
Administrators, 3 End Users and 7 Project Managers.
Observations from Survey:
Figure 3: Importance of Each Feature to Respondents in Different Roles (average rating, on a lowest-to-highest scale, of easy access to data, efficient performance, data currency (recency), data accuracy, and implementation and maintenance costs, for each role: Applications Programmer, Database administrator, End User, and Project Manager)
Figure 3 was plotted by calculating the average rating of each feature with respect to each
of the four roles. Similarly, Figure 4 was plotted by calculating the average rating of
each issue with respect to each of the four roles played in the data warehouse; a minimal
sketch of this averaging, using made-up sample data, is given after the role-based
observations below. The observations from Figure 3 and Figure 4 with respect to the
positions held by the respondents in their organization are as follows:
• Project Managers:
Project Managers (7 respondents) in all sizes of data warehouses rated efficient
performance as the most important of all. This was followed by data accuracy,
implementation and maintenance costs, and data recency, in that order. They
placed the least importance on easy access to data. From Figure 4, we also observe that,
on average, the project managers have more issues with performance
inefficiencies. The project managers also rated the issue of obsolete
infrastructure higher than respondents in any other role.
Figure 4: Issues Faced by Respondents in Different Roles (average rating, on a lowest-to-highest scale, of problems in accessing/restoring data, inefficiency in performance, inaccurate data, and obsolete infrastructure, for each role: Applications Programmer, Database administrator, End User, and Project Manager)
• Application Programmers:
Application Programmers (11 respondents) also valued efficient performance as
the most important of all. Their outlook differs from that of the Project Managers
in that they rate data accuracy, data recency, and ease of data access as the next
important features, in that order. The least important feature to them is
implementation and maintenance costs.
• Database Administrators:
The observations for Database Administrators (4 respondents) were similar to those
made for the Application Programmers.
• End Users:
The observations for End Users (3 respondents) were similar to those made for the
Application Programmers and Database Administrators.
These observations validate that Project Managers have to deal with more than
the operation of the system. High-performance systems are important, but project
managers have to consider the implementation costs for every project and keep the
maintenance costs to a minimum. They have to keep track of new software
updates and how they will affect the current hardware. The Project Manager respondents
in the survey have therefore valued implementation and maintenance costs over features
like data recency and ease of access.
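For reference, the averaging used to build Figures 3 through 6 can be expressed in a few lines of Python. The ratings below are made-up sample values (1 = most important, 5 = least important) rather than the actual survey responses, so the sketch only demonstrates the computation.

    from collections import defaultdict

    # Hypothetical responses: (role, {feature: rating}).
    responses = [
        ("Project Manager", {"easy_access": 5, "performance": 1, "currency": 4, "accuracy": 2, "cost": 3}),
        ("Applications Programmer", {"easy_access": 4, "performance": 1, "currency": 3, "accuracy": 2, "cost": 5}),
        ("End User", {"easy_access": 3, "performance": 1, "currency": 4, "accuracy": 2, "cost": 5}),
    ]

    def average_ratings_by_group(responses):
        """Average each feature's rating per group (role or warehouse size)."""
        totals = defaultdict(lambda: defaultdict(float))
        counts = defaultdict(int)
        for group, ratings in responses:
            counts[group] += 1
            for feature, rating in ratings.items():
                totals[group][feature] += rating
        return {g: {f: totals[g][f] / counts[g] for f in totals[g]} for g in totals}

    print(average_ratings_by_group(responses))

The same function, keyed on data warehouse size instead of role, produces the averages behind Figures 5 and 6.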
Figure 5: Importance of Various Features in Different Sized Data Warehouses (average rating, on a lowest-to-highest scale, of easy access to data, efficient performance, data currency (recency), data accuracy, and implementation and maintenance costs, grouped by data warehouse size: <1000GB, <10 TB, >10TB)
Figure 5 was plotted by calculating the average rating of each feature with respect to each
of the three data warehouse sizes. Similarly, Figure 6 was plotted by calculating the
average rating of each issue with respect to each of the three different-sized data
warehouses. The observations made from Figure 5 and Figure 6 with respect to the size of
the data warehouse in the respondents' organizations are as follows:
• <1000 GB
In smaller data warehouses, the respondents (7) valued performance the most. The
data-related features were next in importance, followed by implementation costs.
Another observation was that in smaller data warehouses, most of the
respondents rated obsolete technology as their main problem. The other
predominant problem observed is data inaccuracy, followed by inefficient
performance. As a result of these responses, the average ratings for obsolete
technology and inaccurate data appear to be the same. A few respondents mentioned
reasons such as a lack of resources to support larger data warehouses, which could
be the reason for the performance inefficiencies and data inaccuracies in their
environments.
Figure 6: Issues in Different Sized Data Warehouses (average rating, on a lowest-to-highest scale, of problems in accessing/restoring data, inefficiency in performance, inaccurate data, and obsolete infrastructure, grouped by data warehouse size: <1000GB, <10 TB, >10TB)
• <10 TB
In medium-sized data warehouses, the respondents (11) also valued performance
the most, followed by the data-related features and implementation costs.
The commonly faced problems in these data warehouses were performance- and
data-related. This could be attributed to the way the data warehouse is designed:
these data warehouses carry huge amounts of data (running into terabytes), and the
data stored should be clean and non-redundant to achieve efficient performance.
• >10 TB
The observations for the respondents (7) from data warehouses greater than 10 TB in size
were similar to those for data warehouses between 1 and 10 TB. Today, most
data warehouses fall into this category. The amount of data stored keeps
increasing, and hence it needs to be stored in a manner that improves
performance and reduces redundancy. Data warehouse design and architecture are
of prime importance where data is in such abundance.
In conclusion, the survey helps us understand the case studies better. As seen in all
four cases, which involve larger data warehouses, the organizations struggled to achieve
better performance and better data quality. The cost issue is the only one that varies from
business to business and organization to organization based on their requirements.
Chapter 9
CONCLUSION
The data warehouse is the key to making well-informed business decisions in any
business sector. It is subject to complex ad hoc as well as regular queries, mostly to assist
decision-support processes. Data warehousing poses various challenges, such as migration
of data from legacy systems, maintenance of data quality, upholding the performance of
the system, and managing the system cost-effectively. The main focus of this project was
to investigate the factors that affect the efficiency of the data warehouse through case
studies and a survey.
This study confirmed that the main problem any organization is trying to tackle involves
one of the three main issues discussed throughout the project report.
Maintaining a high-performance system
As data warehousing has become more and more important to businesses, increasing
data warehouse performance has become vital. With many people depending on the data
in the data warehouse to do their jobs, data warehouse performance can have a profound
impact on overall company performance. Many companies rely on numerous ways to
improve data warehouse performance, including clearing obsolete data, increasing
storage space, and improving overall data warehouse architecture and design, to keep the
data warehouse and the company functioning at their best [David J. DeWitt, Samuel
Madden, Michael Stonebraker, How to Build a High-Performance Data Warehouse].
Data warehouse performance tends to degrade as more data is collected over a period of
time. Increased data mining, while important to the business, increases the overall load on
the system. More people making use of the system also increases the load, as a larger
number of queries are made by various employees. Removing obsolete information
means that queries can be processed more quickly and return more relevant results,
making overall data warehouse maintenance an important part of improving data
warehouse performance.
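As a rough illustration of this kind of maintenance, a nightly job might move rows older than a retention window out of the active fact table before purging them, so that routine queries scan less data. The sketch below uses a generic Python DB-API connection with hypothetical table and column names and a hypothetical three-year retention window; it is illustrative only and is not drawn from the cited sources.

    import sqlite3  # stand-in for any DB-API 2.0 connection
    from datetime import date, timedelta

    RETENTION_DAYS = 3 * 365  # hypothetical retention window

    def archive_obsolete_rows(conn):
        """Move fact rows older than the retention window to an archive table, then delete them."""
        cutoff = (date.today() - timedelta(days=RETENTION_DAYS)).isoformat()
        cur = conn.cursor()
        # Copy old rows into an archive table assumed to have the same schema.
        cur.execute(
            "INSERT INTO sales_fact_archive SELECT * FROM sales_fact WHERE sale_date < ?",
            (cutoff,),
        )
        # Remove them from the active table so routine queries scan less data.
        cur.execute("DELETE FROM sales_fact WHERE sale_date < ?", (cutoff,))
        conn.commit()
        return cur.rowcount  # number of rows purged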
Infrastructure is another important factor in data warehousing [Gavin Powell, Oracle
Data Warehouse Tuning for 10g (2005)]. A data warehouse system can be functioning at
the highest possible level for the available technology, but three or even only two years
later, it can be considered obsolete. Improving the data warehouse architecture, both on a
hardware level and a programming level, also can greatly increase data warehouse
performance. Updating processors, adding additional storage space and using newer,
more streamlined query protocols can greatly improve performance. In addition, these
changes in overall data warehouse design can make a dramatic difference in the amount
of data that can be stored as well as the speed at which the system can process individual
queries.
Another approach that can help improve data warehouse performance is training.
Data warehousing originally was designed to support decision making at a high
executive level, but the overall usefulness of business intelligence has led to many other
people using the data for a variety of purposes. In some cases, these employees have not
received adequate training and do not know how to construct efficient queries to retrieve
the information they need. For these employees, training on the use of the system and
how to effectively query the data can lead to great improvement in data warehouse
performance.
Maintaining data quality
The main factors to consider when looking to maintain data quality are data integrity,
the data input source and methodology used, the frequency of data import, and the
audience.
Data integrity is a concept common to data warehouse quality as it relates to the rules
governing the relationships between the data, dates, definitions, and business rules that
shape the relevance of the data to the organization [Larry P. English, Improving Data
Warehouse and Business Information Quality: Methods for Reducing Costs and
Increasing Profits (1999)]. Keeping the data consistent and reconcilable is the foundation
of data integrity. Steps used to maintain data warehouse quality must include a
cohesive data architecture plan, regular inspection of the data, and the use of rules and
processes to keep the data consistent whenever possible.
The easiest way to maintain data warehouse quality is to implement rules and
checkpoints in the data import program itself. Data that does not follow the appropriate
pattern will otherwise not only be added to the data warehouse but will also require
frequent user intervention to correct, reconcile, or change it [Panos Vassiliadis, Data
Warehouse Modeling and Quality Issues (June 2000)]. In many organizations, these
types of changes can be implemented only by the data warehouse architect, which
greatly increases the data warehouse quality.
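To make the idea of import-time checkpoints concrete, the following minimal Python sketch separates rows that pass simple pattern and referential rules from rows that must be routed to a user (or the data warehouse architect) for review. The field names and rules are hypothetical and are not taken from any of the cited case studies.

    import re
    from datetime import datetime

    def validate_row(row, known_customer_ids):
        """Return a list of rule violations for one incoming row (empty list = clean)."""
        errors = []
        if not re.fullmatch(r"[A-Z]{2}\d{6}", row.get("policy_id", "")):
            errors.append("policy_id does not match the expected pattern")
        if row.get("customer_id") not in known_customer_ids:
            errors.append("customer_id not found in the customer dimension")
        try:
            datetime.strptime(row.get("effective_date", ""), "%Y-%m-%d")
        except ValueError:
            errors.append("effective_date is not a valid YYYY-MM-DD date")
        return errors

    def split_rows(rows, known_customer_ids):
        """Separate clean rows (loaded automatically) from rejected rows (held for review)."""
        clean, rejected = [], []
        for row in rows:
            errors = validate_row(row, known_customer_ids)
            if errors:
                rejected.append((row, errors))
            else:
                clean.append(row)
        return clean, rejected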
The accuracy and relevance of the data are essential to maintaining data warehouse quality.
The timing and frequency of the import have a large impact on the quality as well [Ang &
Teo, Management issues in data warehousing: insights from the Housing and
Development Board (December 6, 1999)]. Data warehouse quality is easiest to maintain
and support if the users are knowledgeable and have a solid understanding of the business
processes. Training the users not only on how to build queries but also on the underlying
data warehouse structure enables them to identify inconsistencies much faster and to
highlight potential issues early in the process [Ang & Teo, Management issues in data
warehousing: insights from the Housing and Development Board (December 6, 1999)].
Any changes to the data tables, structure, or linkages, and the addition of new data fields,
must be reviewed with the entire team of users and support staff members in order to
ensure a consistent understanding of the risks and challenges that might occur.
Managing Costs
Data warehouse projects need not be a multi-million dollar undertaking. Any
organization, regardless of size, can benefit from data warehousing technologies, even
when working with a limited budget. Although it may seem difficult, choices can be
made which allow for the beneficial realization of data warehousing while also
minimizing costs. By balancing technology and carefully positioning business, the
organization can quickly create cost-effective solutions using data warehousing
technologies.
A few simple rules that can help develop a data warehouse on a small budget include
using what you have, using your knowledge, using what is free (software or hardware),
buying only what you have to, thinking and building in phases, and using each phase to
finance or justify the remainder of the project [Brian Babineau, IBM Information
Infrastructure Initiative Tames the Information Explosion (April 2009)].
A common organizational mistake in data warehousing is to ignore the currently owned
technology and software capabilities and move ahead quickly to purchase a different
product [Nathan Rawling, Data Warehousing on a Shoestring Budget (May 2008)]. An
organization often has the toolset in place to effectively meet its data warehousing
demands but simply needs the right partner to fully implement its current technology.
APPENDIX
Questionnaire
Issues affecting the data warehouse efficiency
This questionnaire is for my MS project which deals with issues that affect the data
warehouse efficiency. The goal of this is to narrow down the major issues. Kindly fill in
this questionnaire based on the experience at your workplace. Thank you for
contributing!
1. What is your role in the data warehousing environment in your organization?
a) Manager/Director
b) Project Manager
c) Database Administrator
d) Data Architect
e) Applications Programmer
f) System Administrator
g) End User
2. What is the size of the data warehouse you work on?
a) <1000GB
b) <10 TB
c) >10TB
3. Please rate the following features based on their importance to you. (From 1 to 5,
1 being the highest)
a) Easy access to data
b) Efficient performance
c) Data currency (recency)
d) Data accuracy
e) Implementation and maintenance costs
f) Other (please specify)
4. Which of the following problems do you face in your data warehousing
environment? Briefly describe the issue.
a) Problem in accessing/restoring data
b) Inefficiency in performance
c) Inaccurate data
d) Obsolete Infrastructure
5. Please describe briefly what other problem areas you come across in your work
environment.
BIBLIOGRAPHY
Gavin Powell, Oracle Data Warehouse Tuning for 10g (2005). Published by Elsevier Inc.
Larry P. English, Improving Data Warehouse and Business Information Quality: Methods for Reducing Costs and Increasing Profits (1999). Wiley Computer Publishing.
Panos Vassiliadis, Data Warehouse Modeling and Quality Issues (June 2000). Retrieved from World Wide Web: http://www.cs.uoi.gr/~pvassil/publications/publications.html
James Ang, Thompson S.H. Teo, Management issues in data warehousing: insights from the Housing and Development Board (December 6, 1999). Retrieved from World Wide Web: http://www.cse.dmu.ac.uk/~ieb/Data%20warehouse%20tutorial.pdf
Brian Babineau, IBM Information Infrastructure Initiative Tames the Information Explosion (April 2009). Retrieved from World Wide Web: http://whitepapers.businessweek.com/detail/RES/1287588950_343.html
David J. DeWitt, Samuel Madden, Michael Stonebraker, How to Build a High-Performance Data Warehouse. Retrieved from World Wide Web: http://db.csail.mit.edu/madden/high_perf.pdf
Bill Inmon, The Data Warehouse Environment: Quantifying Cost Justification and Return on Investment (November 2000). Retrieved from World Wide Web: http://www.crmodyssey.com/Documentation/Documentation_PDF/Data%20Warehouse_Environment_Cost_Justification_and_ROI.pdf
Data Warehouse Design Considerations. Retrieved from World Wide Web: http://msdn.microsoft.com/en-us/library/aa902672(SQL.80).aspx
Nathan Rawling, Data Warehousing on a Shoestring Budget (May 2008). Retrieved from World Wide Web: http://esj.com/articles/2008/05/07/data-warehousing-on-ashoestring-budget--part-1-of-3.aspx
Dain Hansen, Cost Savings are the New Black for Data Warehousing (March 19, 2009). Retrieved from World Wide Web: http://blogs.oracle.com/dataintegration/2009/03/cost_savings_are_the_new_black_for_data_warehousing.html
Lee, Hae Seok, Kyobo Life Insurance Case Study. Retrieved from World Wide Web: http://www.sybase.com/detail?id=1054785
Microsoft Case Studies, Scandinavian Airlines. Retrieved from World Wide Web: http://www.microsoft.com/casestudies/Case_Study_Detail.aspx?CaseStudyID=200330
Kalido Case Study, Philips Builds Best-Practice Information Management Architecture. Retrieved from World Wide Web: http://www.kalido.com/Collateral/Documents/EnglishUS/CS-Phillips.pdf
Microsoft MSDN Library: http://msdn.microsoft.com/en-us/library/default.aspx
http://www.exforsys.com/tutorials/data-warehousing.html
http://www.deere.com/en_US/deerecom/usa_canada.html
http://www.usa.philips.com/about/company/index.page
http://www.dwinfocenter.org
http://www.academictutorials.com/data-warehousing