Fundamental concepts of software reuse

Technical University Hamburg-Harburg
Technische Informatik (TI 3)
Prof. Dr. Siegfried M. Rump
Applying Concepts of Software Reuse to the
Implementation of Data Warehouse ETL Systems
October 2001
Jiayang Zhou
Contents

STATEMENT ............................................................. 5
ACKNOWLEDGEMENTS ...................................................... 6
1. Introduction ....................................................... 7
   1.1. Description of the work ....................................... 7
   1.2. Scenarios ..................................................... 8
   1.3. The structure of this work ................................... 11
2. Fundamentals of software reuse .................................... 12
   2.1. What is software reuse? ...................................... 12
   2.2. Why is software reuse important? ............................. 12
   2.3. Economics of software reuse .................................. 14
   2.4. Where does software reuse pay off? ........................... 15
   2.5. Upon what concept is software reuse based? ................... 15
   2.6. Principles of object-oriented software reuse ................. 16
        2.6.1. Information hiding .................................... 17
        2.6.2. Modularity ............................................ 17
        2.6.3. Adaptability .......................................... 18
        2.6.4. Modification .......................................... 18
   2.7. State of the art ............................................. 18
3. Data Warehouse Loader: analysis example of software reuse ......... 21
   3.1. Introduction of data warehouse ETL systems ................... 21
        3.1.1. Definition of data warehouse ETL systems .............. 21
        3.1.2. Requirements of data warehouse ETL systems ............ 22
        3.1.3. Context requirement of data warehouse application ..... 23
        3.1.4. Other usage of Data Warehouse Loader .................. 24
   3.2. Architecture of a data warehousing system .................... 25
        3.2.1. Operational data source ............................... 27
        3.2.2. Data-transfer ......................................... 28
        3.2.3. Data warehouse ........................................ 28
               3.2.3.1. Subject-orientation .......................... 28
               3.2.3.2. Integration .................................. 30
               3.2.3.3. Time variance ................................ 31
               3.2.3.4. Non-volatility ............................... 31
               3.2.3.5. Difference between data warehouse and operational systems ... 32
        3.2.4. Analysis .............................................. 33
        3.2.5. Presentation .......................................... 37
        3.2.6. Metadata .............................................. 38
        3.2.7. Process management .................................... 38
        3.2.8. User administration ................................... 38
   3.3. The role of Data Warehouse Loader ............................ 39
        3.3.1. Functionality of Data Warehouse Loader ................ 39
        3.3.2. Software reusability consideration of Data Warehouse Loader ... 40
4. The Implementation of Data Warehouse Loader ....................... 42
   4.1. The description of Data Warehouse Loader ..................... 42
   4.2. The architecture of Data Warehouse Loader .................... 45
   4.3. Loader-engine: the operation mode of Data Warehouse Loader ... 46
        4.3.1. Advantages of workflow architecture ................... 47
        4.3.2. Major task of the workflow of Data Warehouse Loader ... 51
               4.3.2.1. Extraction ................................... 52
               4.3.2.2. Transformation ............................... 55
               4.3.2.3. Figuring out difference ...................... 59
               4.3.2.4. Importing .................................... 60
               4.3.2.5. Merging ...................................... 63
               4.3.2.6. Retrieving ................................... 64
   4.4. Loader-interface: interface concept of Data Warehouse Loader . 66
        4.4.1. Extraction-interface .................................. 68
        4.4.2. Transformation-interface .............................. 68
        4.4.3. Database-interface .................................... 69
        4.4.4. Record-interface ...................................... 70
   4.5. Format of intermediate files ................................. 72
   4.6. Sorting the linked list of record objects .................... 73
   4.7. Graphic user interface of Data Warehouse Loader .............. 75
5. Reuse Analysis of Data Warehouse Loader ........................... 77
   5.1. Reuse development process of Data Warehouse Loader ........... 78
   5.2. Applying concepts of software reuse .......................... 81
        5.2.1. Code reuse ............................................ 81
        5.2.2. Adaptability .......................................... 81
        5.2.3. Modularity ............................................ 82
        5.2.4. Interface ............................................. 83
   5.3. Reuse architecture analysis of Data Warehouse Loader ......... 84
6. Summary ........................................................... 87
   6.1. Summary of this work ......................................... 87
   6.2. Lesson learned ............................................... 89
Appendix ............................................................. 90
STATEMENT
I hereby declare that I have undertaken the present work by myself, using only the assistance referred to within this thesis.
Jiayang Zhou
Hamburg, October 2001
ACKNOWLEDGEMENTS
I would like to express my thanks to Prof. Dr. Siegfried M. Rump, Mr. Stefan Krey, and Mr. Lutz Russek for their essential help and advice throughout the development of this work; sometimes their help was not merely academic. Besides, since this work was undertaken at sd&m AG (software design & management), I would like to thank all my colleagues from sd&m AG, Hamburg. Their cooperation and help are appreciated as well.
INTRODUCTION
1. Introduction
1.1. Description of the work
Software reuse is a process of implementing or updating software systems using existing
software assets. Software assets can be defined as software components, objects, software
requirement analysis, design models, domain architecture, database schema, code,
documentation, manuals, standards, test scenarios, and plans. Software reuse may occur
within a software system, across similar systems, or in widely different systems. This process
provides ways to reduce costs, shorten schedules, and produce quality products.
The importance of software reuse lies in its benefits of providing quality and reliable software
in a relatively short time. The computer industry has demonstrated that software reuse
generates a significant return on investment by reducing cost, time and effort while increasing
the quality, productivity, and maintainability of software systems throughout the software life
cycle. In a word, software reuse is advantageous because it:
- Increases productivity
- Enhances quality
- Saves cost
- Reduces software development schedules
- Reduces maintenance
- Enhances standardization
- Increases portability
- Contributes to the evolution of a common component warehouse
- Increases performance
Software reuse is now considered an integral principle of the software engineering process, and reusable software can be developed in a manner similar to computer hardware products. [1]
In this work, the fundamental concepts of software reuse with an object-oriented approach are examined. It deals with object-oriented software reuse strategies, the reuse paradigm, and the reuse process. The mere use of a certain programming language obviously does not guarantee software reusability; the language must be accompanied by reuse technology, such as tools and methodologies.

Moreover, the general concept of software reuse is applied to the implementation of a data warehouse ETL (Extraction, Transformation and Loading) system. This ETL system, called Data Warehouse Loader, is implemented in Java and therefore serves as an analysis example of software reusability with an object-oriented approach. The standard architecture of a data warehouse application and the role of Data Warehouse Loader within it are explained. The detailed implementation of Data Warehouse Loader is illustrated, including its overall architecture, workflow, interface concept and so on. Finally, a reuse analysis is conducted in order to bring out the highly reusable design of this software, that is, to illustrate the relationship between the general reuse concept and the actual implementation scheme of Data Warehouse Loader. In a word, Data Warehouse Loader is implemented in a manner that especially emphasizes increasing software reusability, and its overall architecture is designed in favor of applying the software reuse concept. On the other hand, this Java implementation of Data Warehouse Loader suffers some performance degradation.
1.2. Scenarios
There have been problems in software development since its inception. The cost of software development is constantly increasing. Many projects are challenged rather than completed; a challenged project is one that is finished with cost overruns and schedule delays. The percentage of failed projects is greater than that of successfully completed ones. Please see Figure 1.1. The computer industry has long sought ways to reduce the costs and shorten the schedules required for software development, while producing quality software with fewer errors. [1]
Software reuse should be considered as one of the solutions. Software reuse means that ideas and code are developed once, and then used to solve many software problems in order to enhance productivity, reliability and quality. Reuse applies not only to source-code fragments, but to all the intermediate products generated during software development, including documentation, system specifications, design architecture and so on.

Figure 1.1: Success and failure percentages for software development projects
Reusability is a big issue these days. Pretested software should be used so that cost and time can be saved. The development of object-oriented software means modeling a problem as a set of types or classes from which the objects are created. This set is partitioned into a hierarchical categorization that emphasizes reuse by relegating common characteristics and behaviors to the highest possible level. Once this modeling has been done, coding (the translation of algorithms into a program) is easier, because it consists merely of creating the necessary objects from the defined classes and invoking their behavioral operations. Reusable software requires a planned, analyzed, and structured design that withstands thorough testing for functionality, reliability, and modularity. [1]
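The hierarchical partitioning described above can be sketched in Java. The names here (DataSource and its subclasses) are invented for illustration and are not taken from the actual Data Warehouse Loader:

```java
// Hypothetical sketch: common characteristics and behaviors are relegated
// to the highest possible level of the class hierarchy, so every subclass
// reuses them instead of reimplementing them.
abstract class DataSource {
    private final String name;               // common characteristic

    DataSource(String name) { this.name = name; }

    // common behavior, written once and inherited by every subclass
    String describe() { return "source: " + name; }

    // each concrete source supplies only its own specifics
    abstract String read();
}

class FileSource extends DataSource {
    FileSource() { super("file"); }
    @Override String read() { return "row from file"; }
}

class DatabaseSource extends DataSource {
    DatabaseSource() { super("database"); }
    @Override String read() { return "row from database"; }
}

public class HierarchyDemo {
    public static void main(String[] args) {
        DataSource[] sources = { new FileSource(), new DatabaseSource() };
        for (DataSource s : sources) {
            // coding reduces to creating objects and invoking their operations
            System.out.println(s.describe() + " -> " + s.read());
        }
    }
}
```

Adding a third kind of source only requires one more small subclass; the shared behavior in DataSource is reused unchanged.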
Here an object-oriented approach to software development is preferred because it leads to reusable classes. Objects are discrete software components that contain data and procedures together; systems are partitioned based on objects, so the data determine the structure of the software. By contrast, data-oriented or event-oriented analysis and design treat operations and data as distinct and loosely coupled: the operations determine the structure of the system and the data are of secondary importance, which is one reason the cost of software development is growing exponentially.
In 1998, sd&m AG (software design & management) completed a project called START-MDB (Management Database) to build a data warehouse application for START Holding in Frankfurt. One part of this data warehouse application is Data Warehouse Loader, which extracts data from different operational data sources, transforms the data into the required format, and loads the data into the target data warehouse. This sd&m Data Warehouse Loader is implemented in C, which is suitable in this case, since the operation mode of Data Warehouse Loader is non-object-oriented. Because ANSI C was chosen as the programming language, it is possible to migrate between different system platforms, for example from Windows NT to Unix. On account of the requirements of data warehouse applications, this sd&m Data Warehouse Loader was designed to be reusable from the beginning: it can be adapted to different data transformation schemes, different operational data sources and different target data warehouses. However, it is difficult to realize this reusability with C. The code is hard to read, and it needs to be recompiled whenever a new database is added or a new data transformation scheme is required.
Therefore, the task of this work is to implement Data Warehouse Loader in Java. From the programming language point of view, Java has several advantages:
- Java is object-oriented from the ground up, which means it was explicitly designed from the start to be object-oriented, whereas C is not an object-oriented language.
- Java has a facility called Interface, whose name indicates its primary use: specifying a set of methods that represent a particular class interface, which can be implemented individually in a number of different classes. All of those classes then share this common interface, and the methods in it can be called polymorphically. C has no comparable interface concept.
- Java is compiled to a machine-independent low-level code called byte code. This byte code is then interpreted by the Java Virtual Machine running on the particular machine. This gives Java code platform independence, which means that the same byte code can be run on a huge variety of machines with different operating systems. Porting a Java program to another machine does not even require recompilation. The cost is a slowdown in run-time speed of up to a factor of 5.
- The Java Virtual Machine can carry out a number of checks that a program is running properly, for example array-bounds checks, memory-access checks, checks for viruses in byte code and so on. Accordingly, Java programs are more robust and secure than C programs.
- When a C program requests memory to use as workspace, it must keep track of it and return it to the operating system when it ceases to use it, which requires extra programming and extra care. In Java, this task of garbage collection is carried out automatically: an object that is no longer used is automatically destroyed and its memory is released.
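As a small illustration of the interface facility mentioned above, the following sketch (with invented names, not the actual Loader interfaces) defines one set of method signatures that two classes implement individually and that a caller invokes polymorphically:

```java
// Hypothetical sketch of the Java "interface" facility: a common set of
// method signatures, implemented individually by several classes.
interface Transformer {
    String transform(String record);
}

class TrimTransformer implements Transformer {
    public String transform(String record) { return record.trim(); }
}

class UpperCaseTransformer implements Transformer {
    public String transform(String record) { return record.toUpperCase(); }
}

public class InterfaceDemo {
    public static void main(String[] args) {
        // both objects share the common interface Transformer
        Transformer[] steps = { new TrimTransformer(), new UpperCaseTransformer() };
        String record = "  raw record  ";
        for (Transformer t : steps) {
            record = t.transform(record);   // polymorphic call through the interface
        }
        System.out.println("[" + record + "]");   // prints "[RAW RECORD]"
    }
}
```

A new transformation step can be plugged in by writing one more class that implements Transformer, without touching the calling code.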
Additionally, Java is now used worldwide, and the management trend in most firms is to have Java programs in their organization. That is also true for the management database system. Therefore, implementing Data Warehouse Loader in Java will make this software more acceptable to the market.
1.3. The structure of this work
Chapter 1 gives the description of the work and far-ranging scenarios, including the current situation of software development, the importance of software reuse and the initiation of this project.

Chapter 2 introduces the fundamental concepts of software reuse and some state-of-the-art software reuse technology, which is important for a better understanding of software architecture design in favor of reusability. Regarding software reuse, it discusses its definition, importance, economics, conceptual basis and so on.

Chapter 3 introduces Data Warehouse Loader, the analysis example of software reuse in this work. First an introduction of data warehouse ETL systems is given. With the explanation of the standard architecture of a data warehouse application, where Data Warehouse Loader resides, the role of Data Warehouse Loader is introduced, namely its functionality and its reuse considerations.

Chapter 4 covers the detailed implementation of Data Warehouse Loader. It illustrates the practical part of this work, including the overall architecture, the workflow and the interface concept of Data Warehouse Loader. This chapter explains the relationships between the classes, how each stage of the Data Warehouse Loader workflow works, the format of the intermediate files, the sorting inside each linked list of record objects, and so on.

Chapter 5 is the link between the theoretical concept of software reuse and the practical implementation of Data Warehouse Loader. In this chapter, the reuse analysis shows how the abstract concepts are applied.

Chapter 6 offers the conclusion of the whole project. Some lessons learned during the general software reuse process and some drawbacks of this work are also shown.

The Appendix contains the references of this work.
FUNDAMENTALS OF SOFTWARE REUSE
2. Fundamentals of software reuse
2.1. What is software reuse?
Software reuse is defined as the process of implementing or updating software systems using existing software assets. Software reuse can occur within a system, across similar systems, or in widely different systems. The term "asset" was selected to express that software can have lasting value. Reusable software assets include more than just code: requirements, designs, models, algorithms, tests, documents, and many other products of the software process can be reused. [2]
Software reuse is a concept to acquire high-leverage software, which has the potential to be
reused across applications. However, as in many cases, taking a simple idea and making it
happen in reality often is not as easy as it sounds. Details have to be worked out before the
concept can be made to work in practice.
2.2. Why is software reuse important?
Systematic software reuse revolves around the planned development and exploitation of reusable software assets within or across applications and product lines. Its primary goal is to save money and/or time. It succeeds when the amount of resources required to deliver an acceptable product is reduced. It tries to take advantage of software that exists or can be purchased off the shelf, and it motivates organizations to address the many management, technical, and people issues that inhibit reuse. When getting down to basics, software reuse is motivated by the desire to get the job done cheaply and quickly.
At this point one might ask why software reuse is especially important. Are many firms doing it? Do most developers build their software to be reused? Has the underlying technology needed for software reuse been around for years? Are guidelines for systematic reuse practice available? Are there examples that illustrate successful reuse stories?

Unfortunately, the answers to the questions above were NO until recently. Most practitioners have not figured out how to do it in a repeatable and systematic manner, because the necessary technology just was not available until recently. The arrival of object-oriented approaches and languages, domain engineering methods, integrated software development environments and new process paradigms makes broad-spectrum software reuse possible. Advances in software architecture provide us with the foundation for software reuse, while a consensus on related standards provides us with the building codes.
Figure 2.1: Reuse maturity distributions
For the most part, software reuse tends to be done ad hoc in most firms. As illustrated in Figure 2.1, most of the firms whose software reuse processes have been evaluated using a reuse maturity model [21] are not using the state of the art. Reuse processes are not well defined and practices are not institutionalized in the majority of firms. This analysis assumes that the processes which organizations use to manage product lines, architectures, and software reuse should be part of their business practice framework. Reuse considerations need to be incorporated into each of the five levels of process maturity identified by the model: Level 1 (ad hoc), Level 2 (repeatable), Level 3 (defined), Level 4 (managed), and Level 5 (optimizing). Please see Table 2.1. [2]
Maturity level   Name                     Characteristics
1                Ad hoc reuse             Reuse occurs ad hoc; reuse is neither
                                          repeatable nor managed
2                Project-wide reuse       Reuse is a product of a project, not a
                                          process; reuse is repeatable on a
                                          project-by-project basis
3                Organization-wide reuse  Reuse assets are a product of the
                                          process; reuse is part of the way the
                                          organization does business
4                Product-line reuse       Reusable assets are a product of the
                                          process; reuse is viewed as a business
                                          in itself
5                Broad-spectrum reuse     Reuse is an integral part of the
                                          corporate culture; processes are
                                          optimized with reuse in mind

Table 2.1: Process Maturity Models
2.3. Economics of software reuse
With the recent push to downsize or outsource, software costs have to be cut. The majority of improvement strategies pursued today either reduce the inputs needed to finish the job (such as people, time, equipment, etc.) or increase the outputs generated per unit of input.

This dual nature of software productivity can be represented notionally using the following equation [2]:

Productivity = Outputs / Inputs used to generate the results

When focusing on the equation's input side, software engineers can be equipped with more workstations, CASE tools, mature processes, and the like; using this automation strategy, more output can be obtained from the same people. Just the reverse happens when the output side of the equation is the focus: instead of concentrating on improving staff efficiency, the emphasis is placed on reusing existing assets to get more output per unit of input. In either case, the strategies employed tend to be complementary. For example, increased automation can lead to increased reuse.
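As a toy illustration of the equation above, the following sketch (with invented numbers, not figures from the thesis) shows how reuse raises the productivity ratio by shrinking the input side while the output stays constant:

```java
// Illustrative only: Productivity = Outputs / Inputs. All numbers are
// made up for demonstration purposes.
public class ProductivityDemo {
    static double productivity(double outputUnits, double inputStaffMonths) {
        return outputUnits / inputStaffMonths;
    }

    public static void main(String[] args) {
        // without reuse: 50 output units delivered in 10 staff-months
        double withoutReuse = productivity(50, 10);   // 5.0 units per staff-month
        // with reuse: the same 50 units, but some are reused assets,
        // so only 6 staff-months of new work are needed
        double withReuse = productivity(50, 6);
        System.out.println(withoutReuse + " vs " + withReuse);
    }
}
```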
2.4. Where does software reuse pay off?
Industry has realized a significant payoff by instituting systematic software reuse practices. For example, Wayne Lim of Hewlett-Packard reported the following benefits attributable to their software reuse initiative in IEEE Software magazine [7]:
- Quality: the defect density for reused code was one quarter of that for new code.
- Productivity: systems developed with reuse yielded a 57 percent increase in productivity compared with those constructed using only new code.
- Time to market: when development efforts were compared, those exploiting reuse took 42 percent less time to bring the product to market.
2.5. Upon what concept is software reuse based?
It is necessary to cover some fundamental concepts upon which reuse is based. For reuse to occur in practice, reusable software assets must be acquired that are reused by other applications. This sentence is instructive: let us examine the concepts that surround the three forms of the term reuse within it: reuse, reusable and reused. [2]

Reuse implies a known process for all those activities related to finding, retrieving, and using software assets of known quality within a certain application. When talking about reuse, the following three types of processes are typically involved:
- Application engineering: the processes or practices which firms use to guide the disciplined development, test, and life-cycle support of their application software.
- Domain engineering: the reuse-based processes or practices which firms use to define the scope, specify the structure, and build reusable assets for a class of systems or applications. These activities are typically conducted to figure out what to build to be reusable.
- Asset management: the processes or practices which firms use to manage their assets and make them readily available in quality form. These are the processes software engineers use to search libraries to find the reusable assets of interest. The quality of the assets is maintained, along with the integrity of their configuration, using some online mechanism that is part of the software engineering environment.
Please see Figure 2.2, which illustrates how these processes are related. The software development approach depicted in this figure is called the dual-life-cycle model, because domain and application engineering activities are conducted in parallel. As shown, domain engineering uses the architecture that it develops to identify the reusable software assets, which application engineering develops and uses. Asset management links these activities and makes these assets available. [2]

Figure 2.2: Dual Life Cycle (Source: Reifer, "Software Reuse: Making It Work For You", 12/1991)
The term reusable refers to the product and its basic features. A reusable asset must have high reuse potential and be packaged using reuse techniques; if the asset is hard to understand or adapt, it will be abandoned. Product-line software architectures are now in favor because they let users identify and acquire the 20 percent of the assets responsible for 80 percent of the reuse across families of like systems. Ease of reuse should be a design consideration for each of the assets that is part of the product line.

The term reused has a value-added connotation. An asset built to be reusable does not take on value until it is reused to some advantage by someone else on another application. Typically, incentives must be provided to make this happen.
2.6. Principles of object-oriented software reuse
The objectives of object-oriented software reuse are to produce reusable assets for independent operating systems and plug-and-play applications. That means identifying common architectures, establishing a repository and integrating reuse into the software development process. Some principles of object-oriented software reuse follow:
2.6.1. Information hiding
Information hiding is the protection of implementation details within object-oriented software,
which is the deliberate hiding of information from those who might misuse it. This differentiates
the “what” from the “how”. The “what” information should be available to everyone. The “what”
information includes specifications and interface information. The “how” information should be
available only to a limited group. The “how” information includes implementation details, such
as data structure.
Information hiding supports and enforces abstraction by the suppression of details. It
increases quality and supports reusability, portability, and maintainability. It prevents confusion
for the user, promotes correct data input, and enhances reliability. Information hiding also
enhances localization and usually includes cohesive data; hence, good modularity is achieved
and the goal of modifiability is more easily approached.
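A minimal Java sketch of this "what"/"how" separation might look as follows; the class RecordBuffer and its methods are hypothetical examples, not part of Data Warehouse Loader:

```java
import java.util.ArrayList;
import java.util.List;

// The "what" (the public interface) is visible to everyone; the "how"
// (the internal data structure) is deliberately hidden, so it can be
// changed without affecting any caller.
public class RecordBuffer {
    // "how": an implementation detail that could be replaced by an
    // array or a linked list without changing the public interface
    private final List<String> records = new ArrayList<>();

    // "what": the specification available to every user of the class;
    // it also promotes correct data input by rejecting bad values
    public void add(String record) {
        if (record == null) throw new IllegalArgumentException("null record");
        records.add(record);
    }

    public int size() { return records.size(); }
}
```

Callers depend only on add and size, so the hidden data structure can evolve freely, which is exactly the maintainability benefit described above.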
2.6.2. Modularity
Modularity is defined as the breaking down of a program into small, manageable units. Modularizing object-oriented system software breaks the solution space into smaller units. The modules are grouped around a data type and the objects of that type: subprograms that contain operations for objects of a certain type are grouped together. For example, an array type may be placed in a package along with subprograms for calculating the average of its elements.
In well-modularized system software, the top modules are generally the “what” of the process,
while lower-level modules constitute the “how” of the process. This implies that the lower the
module is in the module group, the more implementation details it contains. In other words,
upper-level modules are the most abstract modules in the group, while lower-level modules
are the most detailed.
Good modularity also implies loose coupling between modules. Coupling is a measure of the dependence between modules. Global data shared by modules increases this inter-modular dependence; passing only required data via parameters, or localizing data within a module, decreases coupling. Loose coupling guarantees confirmability (independent module testing) and enhances the principle of modularity. Loose coupling also implies that changes in one module will not affect the others; thus, loose coupling also brings the goal of modifiability closer.
In addition to loose coupling, another factor required for good modularity is data localization.
Localization consists of placing highly related, cohesive data only in the modules that operate
on this data. Only the necessary data is passed from module to module, and only through
parameters. Only very highly related or cohesive data is localized in a module. In short, the
important factors for good modularity are data localization, loose coupling, no data passing
except via parameters, and information hiding.
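A minimal, hypothetical Java sketch of loose coupling via parameter passing (the module names are invented for illustration): the two modules share no global data, so each can be tested entirely on its own.

```java
import java.util.Locale;

final class BillingModule {
    // Computes a total purely from the data passed in; no global state is read.
    static double computeTotal(double[] items) {
        double sum = 0.0;
        for (double item : items) sum += item;
        return sum;
    }
}

final class ReportModule {
    // Receives the total via a parameter -- the only connection between
    // the two modules -- so it can be tested independently.
    static String formatTotal(double total) {
        return String.format(Locale.ROOT, "Total: %.2f", total);
    }
}
```

A change inside BillingModule (for example, a different summation strategy) cannot affect ReportModule, because the parameter is the entire interface between them.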
2.6.3. Adaptability
Adaptability means that a system is easily adapted to a diverse range of environments.
Object-oriented software can be easily adapted to new requirements because of its high level
of abstraction: it models a problem as a set of types or classes from which the objects are
created.
In particular, Java is compiled to a machine-independent low-level code, called byte code. This
byte code is then interpreted by the Java Virtual Machine, which runs on the particular
machine. This gives the Java code platform independence, which means that the same byte
code can adapt to any of a huge variety of machines with different operating systems. Porting
a Java program to another machine does not even require recompilation.
2.6.4. Modification
Modification allows changes to be made to a system without altering its original structure.
Requirements change during the life cycle of a system, and new versions of the system are
created to reflect new and changed requirements. These changes must be cost-effective.
Modifiability is achieved in a system by the design of small, meaningful modules; the use of
localized data in these modules; and very little use of global data or numeric literal values. The
object-oriented concept provides a facility that collects the data values and range constraints
for a given type of data in one place: the type declaration. A change made in the type
declaration is all that is necessary to modify this data throughout the whole software.
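A hypothetical Java sketch of this "one place" principle (the Grade type and its range are invented for illustration): the range constraint lives only in the type declaration, so changing the two constants modifies the data throughout the program.

```java
// Hypothetical sketch: the legal values and range constraint for a
// "grade" live in exactly one type declaration. Changing the range here
// changes it for every use of the type at once.
final class Grade {
    static final int MIN = 1;  // change these two constants and every
    static final int MAX = 6;  // use of Grade is modified in one step

    final int value;

    Grade(int value) {
        if (!isValid(value)) {
            throw new IllegalArgumentException("grade out of range");
        }
        this.value = value;
    }

    static boolean isValid(int value) {
        return value >= MIN && value <= MAX;
    }
}
```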
2.7. State of the art
A great deal of research is underway under the banner of software product lines, domain-specific software architectures, domain engineering, and software reuse. The reason for the
interest in software reuse is primarily economic. Industry is looking for ways to increase the
speed with which it brings products to market and to provide certain features and functions
quickly in order to maintain its competitive advantage. As a complementary strategy to
productivity improvement, software reuse is viewed as a reasonable way to accomplish these
goals with a minimum of disruption. [2]
Throughout the world, significant progress has been made in software reuse technology.
Many firms are pursuing software reuse. For example, AT&T has instituted a major reuse
program to leverage its investments in telecommunications software assets within and across
domain-specific product-line architectures. They have developed a set of best practices that
many of their business units have adopted. [8] Hewlett-Packard (HP) has focused on
developing and deploying software reuse concepts into its product divisions via a corporate
program. [9] It uses a software bus and glue language to interface reusable software building
blocks together in order to build application systems quickly.
Recent studies have been conducted to reflect today’s situation: [2]
• Many efforts are underway to build a technical infrastructure for reuse. With the
introduction of object-oriented methods and languages, the technology exists to package
software for reuse. International programs, such as the European REBOOT effort, are
starting to realize benefits as they transfer technology to industrial firms. [10] And industry
consortia, such as the Software Productivity Consortium (SPC) [11], have active reuse
programs that address architecture, product lines, and domain engineering methods in
addition to other reuse issues.
• The hot research topics in software reuse are product-line management, software
architectures, and domain engineering methods. Most current research populates
architectures designed for families of systems with like characteristics (product lines).
Many efforts are developing methods, notations, and languages used to model and
analyze domain experience in order to develop a responsive architecture.
• Object-oriented techniques are starting to be viewed as reuse enablers. Object-oriented
methods, languages, and tools help package software for reuse. Thanks to their class
abstractions, object-oriented methods (such as Booch [12], Coad/Yourdon [13], FODA [14],
etc.) are becoming more widely used. Frameworks and class libraries containing both
fine-grained (data structures such as queues, etc.) and coarse-grained (communication
handlers, subsystems, etc.) components are marketed to support developers.
• Several prototype software reuse libraries that serve as models of the future are
operational today. Several industrial strength software reuse libraries that provide their
users with the ability to search, browse, and retrieve assets of interest are in operational
use. Assets are becoming increasingly available to populate these libraries (class libraries,
etc.). Standards for interoperating these libraries across the Internet are being devised by
groups such as the Reuse Library Interoperability Group. [15] Users need to agree on the
architecture so that they know how to use the library to get to the parts that matter. Without
such a framework or architecture, few will use the library in a cost-effective manner.
• Reuse tools are being put into software engineering environments. Several efforts are
integrating software reuse tools and library capabilities into the next-generation software
engineering environment using Java-based servers as mechanisms for tool connectivity.
[19] The philosophy being pursued is to make software reuse a natural part of the way in
which firms do their business. Tools are the natural means to implement this philosophy, because they
automate tedious manual processes and often act as technology transfer agents. A lot of
attention is being placed on generative tools [20] for reuse, because they can generate the
assets needed directly from an architecture specification.
3. Data Warehouse Loader: analysis example of software reuse
3.1. Introduction of data warehouse ETL systems
3.1.1. Definition of data warehouse ETL systems
ETL stands for extraction (E), transformation (T) and loading (L). Data warehouse ETL
systems are the software that loads data from different operational data storage systems into
a data warehouse. The task of an ETL system is first to extract data from the data source,
then to transform the data into the required format according to a specified transformation
scheme, and finally to import the data into the target data warehouse. In other words, the
main operations of ETL systems are extraction, transformation and loading. In this work, an
ETL system named Data Warehouse Loader is implemented in Java.
Data Warehouse Loader is not a complete application product that can be used directly. It is
rather a key software component that can be made suitable for many applications through
supplementary programming and configuration.
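The extract-transform-load sequence described above can be sketched with three Java interfaces. This is a hypothetical illustration, not the actual Data Warehouse Loader API, whose design is discussed later in this work:

```java
import java.util.List;

// Hypothetical sketch of the three ETL stages as Java interfaces.
interface Extractor { List<String> extract(); }
interface Transformer { List<String> transform(List<String> records); }
interface Loader { void load(List<String> records); }

final class EtlPipeline {
    // Runs the three stages in order and returns what was loaded.
    static List<String> run(Extractor e, Transformer t, Loader l) {
        List<String> extracted = e.extract();               // E: read from the data source
        List<String> transformed = t.transform(extracted);  // T: convert to the target format
        l.load(transformed);                                // L: write into the warehouse
        return transformed;
    }
}
```

Because each stage is an interface, a concrete source, transformation scheme, or target warehouse can be swapped in without touching the pipeline itself.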
3.1.2. Requirements of data warehouse ETL systems
ETL systems are developed for data warehouse applications and must in particular satisfy the
following requirements:
• Adaptability to any data source
• Adaptability to any target data warehouse
• High operation speed
• Ability to deal with large amounts of data
• Flexibility of data transformation
• Portability
The requirements above are briefly explained as follows:
The adaptability to any data source is an important property for the input of ETL systems: it
means that data can be extracted from different data storage systems, as stated in the
requirements of data warehouse applications. For relational databases, different types of
database systems often have to be supported. Therefore, an ETL system should not be fixed
to one concrete type of source database to which it will later connect. It is recommended that
ETL systems be prepared for further application conditions in the future.
The adaptability to any target data warehouse does not result only from the viewpoint of
software reusability. A data warehouse application may also replace its original warehouse
database for efficiency reasons, when that database can no longer cope with the increasing
workload.
High operation speed is critical when ETL systems have to process large amounts of data
within a certain time interval and with limited resources. It is often the case that not only the
daily difference of the data is selectively extracted from the data source; instead, for example,
the whole set of data is extracted and the differences are determined subsequently. Therefore,
ETL systems must be able to deal with large volumes of data at high operation speed.
Flexible data transformation must be provided by ETL systems. Between the operational
data source and the target data warehouse there is always a data transformation stage, which
in most cases is especially complicated. It should not only transform one source record into
exactly one target record (1:1 transformation), but also transform a group of source records
into another group of target records (m:n transformation).
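A hypothetical Java sketch of the m:n idea: when a transformation's signature works on record groups rather than single records, the 1:1 case is simply m = n = 1 (the record format and names are invented for illustration).

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: an m:n transformation maps a group of m source
// records to a group of n target records, so its signature works on
// lists. A 1:1 transformation is just the special case m = n = 1.
interface GroupTransformation {
    List<String> apply(List<String> sourceGroup);
}

final class Transformations {
    // Example step: aggregate m daily records into 1 summary record
    // (an m:1 transformation).
    static final GroupTransformation SUM = group -> {
        long total = 0;
        for (String record : group) {
            total += Long.parseLong(record);
        }
        List<String> out = new ArrayList<>();
        out.add(Long.toString(total));
        return out;
    };
}
```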
Portability is of course important from the viewpoint of software reusability. This aspect is
interesting for a single project as well. For example, a data warehouse project may begin with
an NT server for cost reasons and then have to migrate to a more powerful UNIX system
because of the increasing system load. Moreover, this tactic corresponds to a particular
design approach in which the total system is built from smaller sub-systems.
In this work, Data Warehouse Loader fulfills most of the above requirements of ETL systems.
Java is chosen as the programming language in order to be portable and platform-independent.
The Java interface concept and the building-block principle are used to achieve flexible data
transformation and adaptability to any data source and target data warehouse. However, the
implementation still has some drawbacks.
On the one hand, it consumes quite a lot of main memory, especially when processing large
amounts of data. One reason is that Data Warehouse Loader is implemented in Java, and
Java naturally requires more memory and is slower. Another reason is that the workflow of
Data Warehouse Loader is based on an object-oriented approach. The most significant
characteristic of this approach is that each database record is represented as an object
instead of a stream of characters. Therefore, all operations work on an object or a set of
objects, which are concatenated in a linked-list object. Operations on objects consume more
main memory than operations on simple character streams. For example, sorting of records
is done internally, which means the whole set of objects needs to be read into main memory
during processing. Therefore, this Data Warehouse Loader is less suitable for huge volumes
of data when adequate hardware is not available.
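The record-as-object workflow can be sketched as follows (a hypothetical simplification in which a record is just a String): every record is materialized as an object in a linked list, so the internal sort requires the whole set to fit into main memory.

```java
import java.util.Collections;
import java.util.LinkedList;

// Hypothetical sketch of the record-as-object workflow described above:
// every database record becomes an object, the whole set is held in a
// linked list, and sorting therefore happens entirely in main memory.
final class RecordList {
    static LinkedList<String> sortInMemory(Iterable<String> source) {
        LinkedList<String> records = new LinkedList<>();
        for (String record : source) {
            records.add(record);  // every record is materialized as an object
        }
        Collections.sort(records);  // internal sort: all objects must fit in RAM
        return records;
    }
}
```

A character-stream implementation could instead sort externally, merging sorted chunks from disk, at the cost of the object-oriented uniformity described above.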
On the other hand, Data Warehouse Loader is a key software component rather than a
standalone program: it must be combined and complemented with other components when
used in a concrete application.
3.1.3. Context requirements of data warehouse applications
In the last few years, a large market for data warehouse applications has developed. The
whole package of products for a data warehouse application contains a database to store the data,
front-ends for data analysis and presentation, a system to transform the data, a system to
import the data into the data warehouse, and so on.
In this work, the framework of Data Warehouse Loader does not cover all aspects of a
complete product package for a data warehouse application. As defined above, Data
Warehouse Loader only provides the functionality to extract data from the source system,
transform the data and load it into the target system. However, since Data Warehouse Loader
cannot work alone and has to cooperate with the other components of the data warehouse
application, some aspects of the whole processing sequence have been considered as well,
in order to guarantee safe operation and to develop new standard software.
As far as the evaluation of suitability is concerned, the following questions should be
considered:
• How complex is the data transformation? Does it require simple 1:1 transformation or
  more complicated m:n transformation?
• How big is the volume of data that is processed?
• Are the intermediate results of concatenated transformation steps stored in the database?
  This has a big influence on the speed and performance of the whole process.
• Is it possible to process all the data within the required time interval?
• What is the source data store? Can all well-known and anticipated systems be added in?
• What is the target data warehouse?
• How extensive is the metadata model?
• Is the code generated or interpreted? Is a supplementary compiler needed?
• Is it possible to modify the source code?
3.1.4. Other usage of Data Warehouse Loader
Although Data Warehouse Loader was developed for one data warehouse application, it can
also be employed in other applications where data is transferred from system A to system B.
Of course, certain modifications are necessary.
For example, Data Warehouse Loader can be used as a bridge. During a step-by-step
migration from an old system to a new system, a bridge between the two always needs to
be built. In principle, Data Warehouse Loader provides exactly this function: when "data
source" is replaced with "old system" and "target data warehouse" with "new system",
Data Warehouse Loader works as a transformation bridge.
3.2. Architecture of a data warehousing system
In this work, the application of software reuse concepts is studied using the example of
implementing Data Warehouse Loader in Java. Since Data Warehouse Loader is tightly
associated with a data warehousing system, a few words about the general architecture of
data warehousing systems are necessary here.
What is a data warehouse? A simple answer could be that it manages data situated after and
outside the operational systems. From the conceptual origins of a single database serving all
purposes has evolved the notion of a data management architecture, where data is divided
into a data warehouse and an operational database. The separation of operational data
storing and analytical data storing results from the consideration of data storing requirements
for the purpose of analysis in the enterprise. The evolution is in response to many
technological, economic and organizational factors: the difference in the users of the two
environments, the difference in the technology supporting the two environments, the
difference in the amount of data found in the two environments, the difference between the
business usage of the two environments and so forth. [3]
                      Operational systems      Analysis systems
operation             updating                 analyzing, only read access
query                 fixed, simple            user-defined, complicated
data view             record-oriented          multi-dimensional
data per transaction  few                      many
structure of data     detailed data            aggregated data
time reference        current                  historic and current

Table 3.1: The difference between operational systems and analytical systems
It is not practical to keep data in the operational systems indefinitely. The fundamental
requirements of the operational and analysis systems are different: the operational systems
need performance, whereas the analysis systems need flexibility and broad scope. It has
rarely been acceptable to degrade the performance of the operational systems in order to
support a business analysis interface. The difference between operational systems and
analysis systems is shown in Table 3.1. [3]
The first objective of a data warehousing system is to serve the decision support system (DSS)
and executive information system (EIS) community, which makes high-level and long-term
managerial decisions. DSS tend to focus more on detail and are targeted toward lower- to
mid-level managers. EIS have generally provided a higher level of consolidation and a
multi-dimensional view of the data, as high-level executives need the ability to slice and dice
the same data more than to drill down to review the data detail. The following are some
characteristics associated with DSS or EIS: [4]
• These systems present data in descriptive, standard business terms rather than in cryptic
  computer field names. Data names and data structures in these systems are designed
  for use by non-technical users.
• The data is generally preprocessed with the application of standard business rules, such
  as how to allocate revenue to products, business units, and markets.
• Consolidated views of the data, such as products, customers and markets, are available.
  Although these systems have the ability to drill down to the detail data, they are rarely
  able to access all the detail data at the same time.
In short, DSS and EIS in the enterprise provide information about customers, markets,
turnover and so on. The operational data storage system, however, which processes the daily
work data, is organized according to business procedures and focused on current data.
Therefore, it is not suitable for analytical management information.
Many factors have influenced the quick evolution of the data warehousing discipline. The most
significant factor has been the enormous forward movement in the hardware and software
technologies. Sharply decreasing prices and the increasing power of computer hardware,
coupled with easy use of software, has made it possible to quick analyze hundreds of
gigabytes of information and business knowledge.
Another influence on the evolution has been the fundamental changes in the business
organization and structure. Firstly, the emergence of the global economy has profoundly changed
the information demands made by corporations. Phenomena such as “business process
reengineering” force businesses to reevaluate their practices. Secondly, flexible business
software suites adapted to the particular business have become a popular way to move to a
sophisticated multi-tier architecture. Lastly, information technology now is nearly universally
accepted as a key strategic business asset. Management is more information conscious. [5]
Figure 3.1: Architecture of data warehousing systems
Data warehousing systems provide a new information infrastructure besides the existing
systems. It is not only a conceptual but also a technical approach in the DSS fields. Data
warehousing systems provide the analytical tool for DSS or EIS, but their design is not only
derived from the specific requirements of analysts or executives, but also aligns with the
overall business structure. Figure 3.1 shows the standard architecture of data warehousing
systems, where Data Warehouse Loader resides. Each component of this architecture is
explained in the following.
3.2.1. Operational data source
Inside the operational data source there are many elementary sources, for example files,
database tables, external databases, suppliers of management information and so on. The
common property of these elementary sources is that they are record-oriented, while they
might be implemented in different ways or in a mixed way.
3.2.2. Data-transfer
Data-transfer is a bridge between the operational data source and the data warehouse. Its
functionality is as follows:
• Extraction of data from the data source at a fixed point in time: The time interval of the
  extraction operation depends on the source systems and the technical requirements for
  analysis. One major problem in extraction is to determine the difference between the
  previous extraction result and the current one. Therefore, a complete subtraction is
  executed in order to obtain the difference, namely the new records, the changed records
  and the deleted records. This procedure is possible only if the time stamps of the data
  source systems are known.
• Consolidation of data: For example, data inconsistencies are revised.
• Transformation of data in order to meet the needs of the data warehouse: The result of
  the extraction cannot be loaded directly into the data warehouse, because neither its
  content nor its structure is suitable for the data warehouse architecture.
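The "complete subtraction" step can be sketched in Java (a hypothetical illustration in which records are keyed by an id string, and a record is represented by its value):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of the "complete subtraction" described above:
// both extraction results are keyed by record id, and comparing them
// yields the new, changed and deleted records.
final class DeltaDetector {
    final List<String> added = new ArrayList<>();
    final List<String> changed = new ArrayList<>();
    final List<String> deleted = new ArrayList<>();

    DeltaDetector(Map<String, String> previous, Map<String, String> current) {
        for (Map.Entry<String, String> e : current.entrySet()) {
            String old = previous.get(e.getKey());
            if (old == null) {
                added.add(e.getKey());            // record did not exist before
            } else if (!old.equals(e.getValue())) {
                changed.add(e.getKey());          // record exists but differs
            }
        }
        for (String key : previous.keySet()) {
            if (!current.containsKey(key)) {
                deleted.add(key);                 // record vanished from the source
            }
        }
    }
}
```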
3.2.3. Data warehouse
The primary concept of data warehouse is that the data stored for business analysis can most
effectively be accessed by separating it from the data in the operational systems. A data
warehouse is a structured extensible environment designed for the analysis of non-volatile
data. This data is logically and physically transformed from multiple source applications to
align with business structure. This data is updated and maintained for a long time period. This
data is expressed in simple business terms and summarized for quick analysis. In short,
W. H. Inmon's definition is: A data warehouse is a subject-oriented, integrated, time-variant
and nonvolatile collection of data in support of management's decision-making process. [3]
The features of a data warehouse follow from this definition, namely subject-orientation,
integration, time variance and non-volatility, and from its differences from operational data
storage systems, as explained below.
3.2.3.1. Subject-orientation
A data warehouse is organized around the major subjects of the enterprise, i.e. its high-level
entities, which causes the data warehouse design to be data-driven, while an operational data
storage system is organized around processes and functions. The major subject
area affects the key structure of the data and the organization of non-key data in data
warehouse. The functional boundaries, however, are the major criteria for operational data
storage systems (see Figure 3.2).
Figure 3.2: Data warehouse entities align with the business structure
A data warehouse logical model aligns with the business structure rather than the data model
of any particular application. The entities defined and maintained in data warehouse are
parallel to the actual business entities, such as customers, products, orders, and distributors.
Different parts of an organization may have a narrow view of a business entity. In the example
of a customer, a loan service group in a bank may only know about a customer in the context
of one or more loans outstanding. Another group in the same bank may know about the same
customer in the context of a deposit account. However, the data warehouse view of the
customer would transcend the view from a particular part of the business.
The data warehouse would most likely build attributes of a business entity by collecting data
from multiple source applications. For example, considering the demographic data of a bank
customer, the retail operational system may provide the social security number, address, and
phone number, while a mortgage system may provide employment, income, and net worth
information.
A data warehouse does not include data that will not be used by DSS, while operational data
storage systems contain detailed data that may or may not have any relevance to the DSS
analysts. It is essential to understand the implications of not being able to maintain the state
information of the operational system when the data is moved into the data warehouse. Many
of the attributes of entities in the operational system are very dynamic and constantly modified.
Those dynamic attributes are not carried over to the data warehouse.
To understand the loss of operational state information, let us consider the example of an
order fulfillment system that tracks the inventory to fill orders. An order may go through many
different statuses before it is fulfilled and goes to the “closed” status. Other order statuses may
indicate that the order is ready to be filled, is being filled, back ordered, ready to be shipped,
etc. Those statuses capture all the business processes that have been applied to the order
entity. It is nearly impossible to carry all of these status attributes forward into the data
warehousing system, which will most likely hold just one final snapshot of the order.
As far as the relationship of data is concerned, operational system data relates to the
immediate needs and concerns of the business. Data warehouse data, by contrast, spans a
spectrum of time, and its business rules may relate data from two or more tables.
3.2.3.2. Integration
Data warehousing systems are most successful when data is combined from multiple
operational systems. When data is brought together, it is important that this integration is done
in a way that is independent of the source. The data warehouse can effectively and incrementally
combine data from multiple sources such as sales, marketing, and production.
The primary reason for combining data from multiple source applications is the ability to cross-reference data. Nearly all data in a typical data warehouse is built around the time dimension.
Time is the primary filtering criterion for most of the activities against the data warehouse. An
analyst may generate queries for a given week, month or year. Another popular query is the
review of year-on-year comparisons. The time dimension therefore serves as a fundamental cross-referencing attribute. The ability to establish correlations between the activities of different
organizational groups within a company is often cited as the most advanced feature of the
data warehouse. With integration, data warehouse data takes on a very corporate flavor,
which can also be shown in its naming convention, measurement of variables, encoding
structure and so forth.
3.2.3.3. Time variance
All data in the data warehouse is accurate as of some moment in time, while data in an
operational data storage system is accurate as of the moment of access. The time variance of
the data warehouse shows up in several ways. Firstly, it represents data over a long time
horizon, five to ten years, whereas the horizon in an operational system is much shorter,
namely sixty to ninety days. Secondly, every key structure in the data warehouse contains,
explicitly or implicitly, an element of time. Thirdly, data warehouse data cannot be updated
once it has been correctly recorded.
Data from most operational systems is archived after the data becomes inactive. For example,
a bank account may become inactive after it has been closed. The reason for archiving
inactive data is performance: large amounts of inactive data mixed with live operational data
can significantly degrade the performance of a transaction that processes only the active data.
Since the data warehouse is designed to archive the operational data, the data here is kept
for a very long period.
The cost of maintaining the data once it is loaded into the data warehouse is minimal. Most of
the significant costs are incurred in data transfer and data scrubbing. Storing data for more
than five years is very common for data warehousing systems. There are industry examples
in which the enterprise expands the time horizon of the stored data once the wealth of
business knowledge in the data warehouse is discovered.
3.2.3.4. Non-volatility
Updates, such as inserts, deletes and changes, are regularly applied to the operational data
storage system on a record-by-record basis. The basic manipulation of the data warehouse,
however, is of only two kinds: the initial loading of data and the access of data. This means
that after the data is in the data warehouse, no modifications are made to this information.
In an operational system, the data entities go through many attribute changes. For example,
an order may go through many statuses before it is completed. Or, a product moving through
the assembly line has many processes applied to it. Generally speaking, the data from an
operational system is triggered to go to the data warehouse, only when most of the activity on
the business entity data has been completed. This may mean the completion of an order or
the final assembly of a product. Once completed, the order is unlikely to go back to backorder
status and the product is unlikely to go back to the first assembly station. Another important
example is constantly changing data, which is transferred into the data warehouse one
snapshot at a time. Business logic might determine how frequently snapshots must be taken
to be adequate for the analysis. Such snapshot data is naturally non-volatile. [3]
3.2.3.5. Difference between data warehouse and operational systems
The structure of a data warehouse is different from that of an operational data storage system
(see Figure 3.3). The components of a data warehouse are current detail data, older detail
data, lightly summarized data, highly summarized data and metadata, which are introduced
as follows. Summary data, by contrast, is never found in the operational environment.
Figure 3.3: Structure of data inside of data warehouse
• Current detail data reflects the most recent happenings, which are always of great interest.
  It is voluminous because it is stored at the lowest level of granularity, and it is always
  stored on disk storage, which is fast to access but expensive and complex to manage.
• Older detail data is stored on some form of mass storage and is infrequently accessed. It
  is stored at a level of detail consistent with current detail data.
• Lightly summarized data is distilled from the low level of detail found at the current detail
  level. How lightly summarized data is built is a concern of the data warehouse architecture.
• Highly summarized data is compact and easily accessible. It is possible to store highly
  summarized data outside of the data warehouse.
• Metadata plays a special and important role in the data warehouse. It is a directory that
  helps the DSS analysts locate the contents of the data warehouse. It is a guide to the
  mapping of data between the operational environment and the data warehouse. It is also
  a guide to the summarization between current detail data and summarized data.
Data warehouse tables also differ from those of the operational environment. On the one
hand, there are historic tables, which record previous states: the difference between the
current state and the previous one is regularly detected and imported, and every record is
accompanied by a validity period. The growth of historic tables depends on the modification
frequency of the source data. On the other hand, there are accumulation tables, which
accumulate statistical data in a regular sequence: each record refers to a reference period,
and only the difference is imported. [4]
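A hypothetical Java sketch of a historic-table row: each record carries a validity period, and a state change closes the old period and opens a new row instead of overwriting history (field names are illustrative):

```java
// Hypothetical sketch of a historic-table row as described above: each
// record carries the period during which its state was valid. Closing
// the period and inserting a new row records a state change without
// overwriting history.
final class HistoricRow {
    final String key;
    final String state;
    final String validFrom;   // inclusive, e.g. "2001-01-01"
    String validTo;           // null while the row is the current state

    HistoricRow(String key, String state, String validFrom) {
        this.key = key;
        this.state = state;
        this.validFrom = validFrom;
        this.validTo = null;
    }

    // A state change closes the old validity period and opens a new row.
    HistoricRow supersede(String newState, String changeDate) {
        this.validTo = changeDate;
        return new HistoricRow(key, newState, changeDate);
    }
}
```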
3.2.4. Analysis
Online Analytical Processing (OLAP)
• Multidimensional conceptual view
• Transparency of the underlying technology
• Access to different source data
• No decrease in efficiency with more data, more users or more view dimensions
• Client-server architecture with integration of different client systems
• The same range of functions for each dimension
• Efficient administration of sparsely populated matrices
• Multi-user capability
• Unrestricted, automatic cross-dimensional operations
• Simple and intuitive data navigation
• Flexible report formats
• Unlimited dimension levels and aggregation levels

Figure 3.4: Online Analytical Processing formulations
The task of the analysis stage is to prepare the content of the data warehouse according to
the user's question formulation. This stage provides a multi-dimensional view of a specially
extracted part of the database, for example the turnover of a certain customer, a certain
product or a certain period of time. Online Analytical Processing (OLAP) plays a central role in
this stage. OLAP was first formulated by E. F. Codd (see Figure 3.4). From the technical point
of view, there are mainly two types of OLAP system: [3]
1.
Multi-dimensional OLAP (MOLAP)
DATA WAREHOUSE LOADER
The construction of the multi-dimensional data view is based on a so-called multi-dimensional database. This new database technology is optimized for OLAP requirements. In contrast to the relational approach, data in a multi-dimensional database is stored not in tables but in multi-dimensional arrays, also called "data cubes". This data organization suits the typical access pattern, making it possible to access large data volumes in a short time.
2. Relational OLAP (ROLAP)
Relational OLAP does not organize the data for analysis in a physically multi-dimensional way. The currently observed data is extracted directly from the relational data warehouse, and merely a virtual data cube is built in intermediate storage software, which can run on any user workstation or on a separate server.
This technology requires a special physical data model that is flexible and suitable for manipulating large data volumes in order to achieve acceptable response times. Therefore, a brief introduction to the dimensional data modeling techniques of data warehouses is given here.
One technique that is gaining popularity in data warehousing is dimensional data modeling. For many data analysis situations, it meets the requirements for organizing warehouse data, and it helps users ask the right questions and get answers to them. [6]
In data warehousing systems, queries tend to use more tables, return larger result sets and run longer. Since these queries are typically unplanned and may not be reused, there is little or no opportunity for comprehensive optimization. Answering a single question often involves several related tables, and performance suffers when many large tables make up the query and no indexes are defined to optimize its access paths.
The "cube" metaphor provides a new way to visualize how data is organized. A cube gives the impression of multiple dimensions; cubes can have two, three, four or more dimensions. Users can "slice and dice" the dimensions, choosing which dimension to use from query to query. Please see Figure 3.5.
Each dimension represents an identifying attribute. The cube shape indicates the following: [6]

• Several dimensions can be used simultaneously to categorize facts.
• The more dimensions are used, the greater the level of detail retrieved.
• Dimensions can also be used to constrain the data returned in a query, by restricting the returned rows to those matching a specific value or range of values of the constraining dimension.
For example, the data warehouse user may want to "drill down" and see the total sales for each region by sales quarter and product line. The aggregated fact associated with each combination of the Region, Time and Product Line dimensions is shown in Figure 3.6. Each additional dimension increases the level of detail. This "flattened" view of the dimensional data unfolds the cube along each dimension, and the facts are aggregated at the intersection of the chosen dimensions.

Figure 3.5: A new approach to visualizing how data is organized in the data warehouse

Figure 3.6: The flattened view of the dimensional data cube
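This aggregation at the intersection of chosen dimensions can be sketched in Java as a grouping of fact rows on a key built from the selected dimension values. This is a minimal in-memory illustration; all class, field and dimension names are invented for the example and do not come from the Data Warehouse Loader.

```java
import java.util.HashMap;
import java.util.Map;

public class CubeSketch {
    // One fact row: dimension values plus a numeric measure.
    record Fact(String region, String quarter, String productLine, double sales) {}

    // Aggregate total sales at the intersection of the chosen dimensions.
    static Map<String, Double> rollUp(Fact[] facts, String... dims) {
        Map<String, Double> totals = new HashMap<>();
        for (Fact f : facts) {
            StringBuilder key = new StringBuilder();
            for (String d : dims) {
                switch (d) {
                    case "Region" -> key.append(f.region());
                    case "Quarter" -> key.append(f.quarter());
                    case "ProductLine" -> key.append(f.productLine());
                }
                key.append('|');
            }
            totals.merge(key.toString(), f.sales(), Double::sum);
        }
        return totals;
    }

    public static void main(String[] args) {
        Fact[] facts = {
            new Fact("North", "Q1", "Food", 100.0),
            new Fact("North", "Q1", "Drink", 50.0),
            new Fact("South", "Q1", "Food", 70.0),
        };
        // Slice by Region only: facts collapse into one total per region.
        Map<String, Double> byRegion = rollUp(facts, "Region");
        if (byRegion.get("North|") != 150.0) throw new AssertionError();
        // Drill down: adding the ProductLine dimension increases the detail.
        Map<String, Double> detail = rollUp(facts, "Region", "ProductLine");
        if (detail.get("North|Food|") != 100.0) throw new AssertionError();
    }
}
```

Choosing more dimensions yields smaller groups, i.e. a greater level of detail, which is exactly the drill-down effect described above.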
Dimensional data modeling adds a set of new schemas to the logical modeling toolkit. The first is called a star schema, named for the star-like arrangement of its entities. A star schema uses many of the same components as any entity-relationship diagram, for example entities, attributes, relationship connections, cardinality, optionality and primary keys. A star schema works well when a consistent set of facts can be grouped together in a common structure, called the fact table, and descriptive attributes about the facts can be grouped into one or more common structures, called dimension tables. [6]
The center of a star schema is the fact table. It is the focus of dimensional queries and is where the real data, the facts, are stored. Facts are numerical attributes, such as counts and amounts, which can be summed, averaged, aggregated by a variety of statistical operations, or used to calculate maxima and minima. Fact attributes contain measurable numeric values about the subject matter. Dimensional attributes provide descriptive information about each row in the fact table; they provide the links between the fact table and the associated dimension tables. Dimension tables are used to guide the selection of rows from the fact table.
Figure 3.7: The star schema
Please see the example in Figure 3.7, where Grocery transaction is the fact table, while Time, Customer, Store and Product are the dimension tables. The Time dimension is a critical component of the data warehouse data model, since DSS environments analyze data and how it changes over time. The Store dimension allows categorizing transactions by store, including the location of the store and its relation to other geographically distributed stores. The Product dimension allows analyzing purchasing patterns by product. The Customer dimension allows analyzing purchases by customer, such as purchasing frequency or purchasing location.
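A dimensional query against such a star schema joins the fact table with the chosen dimension tables and groups at their intersection. The sketch below only assembles such a SQL statement as a Java string; the column names (Region, Quarter, ProductLine, Amount and the Id keys) are assumptions for illustration, since the figure names only the entities.

```java
public class StarQuerySketch {
    // Build a dimensional query: total sales per region, quarter and product line.
    static String totalSalesByRegionQuarterProduct() {
        return String.join(" ",
            "SELECT s.Region, t.Quarter, p.ProductLine, SUM(g.Amount) AS TotalSales",
            "FROM GroceryTransaction g",            // the fact table, centre of the star
            "JOIN Store s ON g.StoreId = s.Id",     // dimension: Store
            "JOIN Time t ON g.TimeId = t.Id",       // dimension: Time
            "JOIN Product p ON g.ProductId = p.Id", // dimension: Product
            "GROUP BY s.Region, t.Quarter, p.ProductLine");
    }

    public static void main(String[] args) {
        String sql = totalSalesByRegionQuarterProduct();
        // The one large fact table is joined with a few small dimension tables.
        if (!sql.contains("FROM GroceryTransaction")) throw new AssertionError();
        if (!sql.contains("GROUP BY")) throw new AssertionError();
    }
}
```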
Fact tables typically have large volumes of rows, while dimension tables tend to have a smaller number of rows. The key advantage of this approach is that join performance improves when one large table is joined with a few small tables; often the dimension tables are small enough to be fully cached in memory. There are some significant differences between a star schema and a fully normalized relational design: [6]

• The star schema makes heavy use of denormalization to optimize operation speed, at a potential cost in storage space. The normalized relational design, in contrast, minimizes data duplication and reduces the work the system must perform when data changes.
• The star schema restricts numerical measures to the fact tables, while the normalized relational design can store transactional and reference data in any table.
Dimensional data modeling is used to manage data at the lowest level of detail available, such as individual transactions. Aggregates are statistical summaries of these details; one or more levels of aggregates of the same facts can be defined. Aggregates improve perceived query performance, since query results are returned much faster, and maintaining aggregate results can bring significant performance improvements to analysis activities. However, there are also challenges in using dimensional data modeling techniques, such as denormalization, dimension table data volumes, managing aggregates and data sharing.
In this work, the example implementation of the Data Warehouse Loader makes use of these dimensional data modeling techniques. Warehouse data is organized in terms of a fact table and dimension tables. A detailed description is given in a later chapter.
3.2.5. Presentation
The task of the presentation stage is to visualize the data from the analysis system on the client. There are many ways to illustrate data in this stage, for example tables, various diagrams or maps for geographical information.
One important element of presentation is the set of controls for navigating the data warehousing system. This navigation should enable the user to find the required information as easily as possible. Many presentation systems also provide access via a Web browser; in this case, the existing company intranet can be used as an alternative to specialized and expensive analysis front-ends.
3.2.6. Metadata
Every sub-system of a data warehousing system, from the data-transfer stage to the presentation stage, needs information about the structure and relationships of the processed data. For highly condensed information, for example the turnover of the year 2000, the user needs to know its detailed derivation, from the storage of the elementary data in the operational data storing system up to its presentation in a graph; being able to display this derivation allows the user to interpret the illustrated information where necessary.
In addition to this rather static information about data structure and relationships, there is also information about the operation of the data warehousing system, for example the point in time or the interval of each extraction operation, the number of successfully imported records, the operation history, and so on.
3.2.7. Process management
Normally the user needs to schedule the workflow of the Data Warehouse Loader; for example, at the end of every business day the difference in the operational database is determined and integrated into the data warehouse. In this case many single processing stages are concatenated in order to transform, consolidate and condense the data and finally store it in the data warehouse or a multi-dimensional database.
The task of process management is therefore to coordinate this workflow as the user requires, spanning the data warehouse architecture and the overlapping processing components.
3.2.8. User administration
The administration of access rights to analysis data is an important aspect of a data warehousing system. Different information belongs to different security levels and should not be accessible to every user; for example, a regional business manager should only be able to access the customer information of his own region. For this reason, the access rights to data warehouse data should be clearly defined. In particular, a multi-dimensional database enables access rights to be assigned down to a particular cell of the data cube.
Another aspect of user administration is charging for the use of data warehouse data in a profit-center structure. As the operator of the data warehouse, the data administration department may charge users for the use of data warehouse data, based on the time and the manner of utilization.
3.3. The role of Data Warehouse Loader
3.3.1. Functionality of Data Warehouse Loader
In this work, the Data Warehouse Loader fulfills the data-transfer task of the data warehousing system described above. It works as a bridge between the operational data storing system and the data warehouse. Please see Figure 3.8. The process of developing data warehousing systems has highlighted the need to manage the extraction, cleaning, transformation and migration of data from source systems both effectively and efficiently. Efficiency is necessary whenever the data is of great value, which is usually the case; effectiveness is necessary because the long-term investment of resources in these activities can be high.

Figure 3.8: The role of the Data Warehouse Loader
The Data Warehouse Loader extracts data from any record-oriented operational data source and transforms it in an arbitrary way to meet the structural requirements of the target data warehouse. Afterwards, the difference between the current transformation result and the previous one is determined and imported into the data warehouse.
To figure out this difference, the current transformation result must be compared with the previous state of the target data warehouse. However, reading the data warehouse every time would be time-consuming. Instead, a so-called status file realizes this function: the status file is always kept consistent with the newly updated state of the data warehouse, so figuring out the difference becomes a comparison between the current transformation result and this status file, which is much faster. One additional function of the Data Warehouse Loader is therefore to maintain the status file: after importing the differences into the data warehouse, the differences also need to be merged into the status file.
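Assuming each record carries a unique key, the comparison against the status file can be sketched as follows; the map-based representation and all names here are illustrative, not taken from the implementation.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class DiffSketch {
    // Records that are new or changed relative to the status file must be
    // imported into the data warehouse; unchanged records are skipped.
    static List<String> difference(Map<String, String> current, Map<String, String> status) {
        List<String> changedKeys = new ArrayList<>();
        for (Map.Entry<String, String> e : current.entrySet()) {
            String previous = status.get(e.getKey());
            if (previous == null || !previous.equals(e.getValue())) {
                changedKeys.add(e.getKey());
            }
        }
        return changedKeys;
    }

    public static void main(String[] args) {
        Map<String, String> status = Map.of("1", "old", "2", "same");
        Map<String, String> current = Map.of("1", "new", "2", "same", "3", "added");
        List<String> diff = difference(current, status);
        // Record 1 changed, record 3 is new; record 2 stays out of the difference.
        if (diff.size() != 2 || diff.contains("2")) throw new AssertionError();
        // After importing, the difference is merged back so that the status file
        // again mirrors the warehouse (the merging stage described in the text).
    }
}
```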
Lastly, after certain system failures it may be necessary to fetch data directly from the data warehouse in order to reconstruct the status file. Another additional function of the Data Warehouse Loader is therefore to retrieve data directly from the data warehouse.
3.3.2. Software reuse consideration of Data Warehouse Loader
The crucial design requirement for the Data Warehouse Loader is that it be suitable for various data sources, various target data warehouses and various data transformation schemes.
On the one hand, there are numerous database vendors in the market providing different types of database, any of which might be the data source or the target data warehouse of the Data Warehouse Loader. On the other hand, the Data Warehouse Loader will be used by different enterprises, each of which has its own business logic for data analysis. For example, a supermarket will be interested in the turnover of a certain product during the last few months, while a marketing manager will be interested in the average income level of customers in a certain region. Consequently, the transformation scheme between the operational data storing system and the data warehouse differs completely from one application to another.
Apart from the data source, the target data warehouse and the transformation scheme, the basic workflow of the Data Warehouse Loader is always the same: extraction from the data source, data transformation, figuring out the difference, importing into the data warehouse, merging into the status file and retrieving from the data warehouse. This basic workflow can therefore be reused from one enterprise to another, or from one application to another, provided only that the functions related to the particular data source, target data warehouse and transformation scheme are modified. How this part of the functionality is modified is thus directly associated with the reusability of the software: modification with minimum reprogramming effort represents high reusability, and vice versa. It is extremely important for the architecture of the Data Warehouse Loader to decouple the constant workflow from the varying environmental factors. Please see Figure 3.9.
Figure 3.9: Goal of Data Warehouse Loader architecture
That is also the reason why software reusability is so important for the implementation of the Data Warehouse Loader in a data warehouse application, and why it was chosen as the investigation example for this software reusability research. The general concepts of software reuse and different reuse technologies are applied to the development process of the Data Warehouse Loader. In particular, one goal is to achieve a well-designed interface that significantly reduces the reprogramming effort in the case of changes and thereby improves reusability. For this the interface concept of the object-oriented language Java is explored. The implementation of the Data Warehouse Loader contains various interfaces, namely an interface for data representation, an interface for data access and an interface for business logic, which are explained in detail in a later chapter.
IMPLEMENTATION OF DATA WAREHOUSE LOADER
4. The Implementation of Data Warehouse Loader
4.1. The description of Data Warehouse Loader
In this work, in order to make the Data Warehouse Loader executable, the data warehousing system of a supermarket chain is chosen as the implementation example. Both the data source and the target data warehouse are Microsoft Access databases. Here is a brief description of this implementation example.
The source database has a normalized relational design and consists of several tables, namely Artikel, Artikeltyp, Positionen, Kauf, and Markt. It is an operational system that keeps a record of each transaction: which Artikel is sold on which day, how many of this Artikel are sold, where those Artikel are sold, what the price of this Artikel is, what the type of this Artikel is, and so on. Please see the fields defined in each table and the relationships between the tables in Figure 4.1.
Figure 4.1: Table relationship in source database
The Data Warehouse Loader reads records from this source database, i.e. it extracts data from the source system. Then it performs the data transformation in order to make the data fit the structure of the data warehouse for analysis purposes. The data in the data warehouse is organized in a multi-dimensional cube, with each dimension representing an identifying attribute. A user can drill down to detail data via the flattened view of the dimensional data, and, to allow an easy overall view of total sales, some facts are aggregated at the intersection of the chosen dimensions. The target data warehouse therefore uses a star schema, as introduced in the previous chapter. One reason for using dimensional data modeling and a star schema in the target data warehouse is to exploit the advantage of dimension tables: they are small enough to be fully cached in main memory, which partly compensates for the large main memory consumption of a Java program. Please see Figure 4.2.
Here FactUmsatz is the fact table, and DimMarkt, DimZeit and DimArtikel are the dimension tables, recognizable by their naming prefixes. The FactUmsatz table provides fact attributes about Markt, Artikel, Zeit and Umsatz; these fact attributes contain measurable, numeric values about the subject matter. The DimMarkt, DimZeit and DimArtikel tables, in turn, provide dimensional attributes about Markt, Zeit and Artikel; these dimensional attributes provide descriptive information about each row in the fact table, as well as the relationship links between the fact table and the associated dimension tables.
Figure 4.2: Star schema in the target data warehouse
Let us look at the data transformation scheme in detail. The records in the DimZeit table are relatively constant; they give each month a unique ID. Each record in the DimArtikel table gets its Id and Name from the Artikel table of the source database and its Warengruppe from the Artikeltyp table. Each record in the DimMarkt table gets its Id and Name from the Markt table. The FactUmsatz table then aggregates information from the Artikel, Artikeltyp, Positionen, Kauf and Markt tables of the source database. Umsatz is calculated by multiplying Preis from the Artikel table with Menge from the Positionen table; Zeit is obtained from Uhrzeit and Datum in the Kauf table; and Markt and Artikel are taken from the Markt and Artikel tables respectively. The FactUmsatz table thus lets the user see clearly what the turnover of a certain kind of item in a certain market during a certain period of time is, when those items were sold, in which market a certain kind of item was sold, and so on. The intersection of the chosen dimensions determines which rows are aggregated to answer a query.
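The core of this transformation can be sketched in Java. Only the rule that Umsatz is Preis multiplied by Menge is taken from the text; the record types below are simplified stand-ins for the real record classes.

```java
public class UmsatzSketch {
    // Simplified stand-ins for rows of the Positionen and Artikel tables.
    record Position(int artikelId, int menge) {}
    record Artikel(int id, double preis) {}

    // Umsatz is calculated by multiplying Preis from the Artikel table
    // with Menge from the Positionen table.
    static double umsatz(Position p, Artikel a) {
        return a.preis() * p.menge();
    }

    public static void main(String[] args) {
        Artikel a = new Artikel(7, 2.50);   // Preis = 2.50
        Position p = new Position(7, 4);    // Menge = 4
        if (umsatz(p, a) != 10.0) throw new AssertionError();
    }
}
```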
4.2. The architecture of Data Warehouse Loader
As stated before, one significant feature of a typical data warehouse loading environment in practice is that there exist different types of data source, different types of target data warehouse and different data transformation schemes. That means the user might use different database systems or file systems as data source or target, and the transformation rules from data source to data warehouse always depend on the particular, individual business logic.
A proper architecture for the Data Warehouse Loader should take this into account, so that the common elements of the software are separated from the data-specific elements. The common elements cover the basic operation flow, namely extraction from the data source, data transformation, figuring out the difference, importing into the data warehouse, merging into the status file and retrieving from the data warehouse. The data-specific elements, however, deal with the particular data, source database, transformation rules and target data warehouse. When a new source database is connected or a new transformation rule is required, the data-specific part is reprogrammed while the common part remains the same. In short, the data-specific part has nothing to do with the basic loading process.
Accordingly, decoupling the common part from the data-specific part is the goal of the architecture design of the Data Warehouse Loader. This decoupling is achieved by implementing a Loader-engine and a Loader-interface. The Loader-engine comprises the common elements for the basic operation flow; it knows "how to do it", for example how to extract, how to transform, how to import and so on. The Loader-interface comprises the data-specific elements for a particular application environment; it knows "what to do", for example what the record structure or data structure is, what the type of the data source is, what the transformation scheme is, what the type of the target data warehouse is, and so on. Please see Figure 4.3.
Figure 4.3: Architecture of Data Warehouse Loader: decoupling of Loader-engine and
Loader-interface
4.3. Loader-engine: the operation mode of Data Warehouse Loader
In contrast to the data-oriented structure of the Loader-interface, the Loader-engine has a workflow-oriented structure. The task of the Data Warehouse Loader is to extract data from the source system at a regular time interval, transform the data in an arbitrary way, and load it into the data warehouse in a predefined format. Consequently, the basic workflow of the Data Warehouse Loader is divided into the following sub-functions, each of which is one processing stage and is connected to the next via a so-called intermediate file:

• Extraction: extract raw data from the different source systems
• Transformation: transform the data to fit the structure of the data warehouse
• Figuring out the difference: determine the difference between the current transformation result and the previous one
• Importing: import the difference into the data warehouse
• Merging: merge the difference into the status file of the data warehouse
• Retrieving: retrieve data from the data warehouse in case of a system crash
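The stage chain above can be sketched as a sequence of functions, each consuming the previous stage's intermediate result. This is a simplification: the real stages exchange intermediate files rather than in-memory lists, and all names here are illustrative.

```java
import java.util.List;
import java.util.function.UnaryOperator;

public class PipelineSketch {
    // Each stage consumes the intermediate result of the previous one and
    // produces a new intermediate result, so stages can be recomposed freely.
    static List<String> run(List<String> input, List<UnaryOperator<List<String>>> stages) {
        List<String> intermediate = input;
        for (UnaryOperator<List<String>> stage : stages) {
            intermediate = stage.apply(intermediate);
        }
        return intermediate;
    }

    public static void main(String[] args) {
        UnaryOperator<List<String>> transform = rs -> rs.stream().map(String::toUpperCase).toList();
        UnaryOperator<List<String>> diff = rs -> rs.subList(1, rs.size()); // pretend the first record is unchanged
        // Full workflow: transform, then figure out the difference.
        List<String> out = run(List.of("a", "b"), List.of(transform, diff));
        if (!out.equals(List.of("B"))) throw new AssertionError();
        // Dropping the diff stage (the recomposition of Figure 4.5) just means
        // passing a shorter stage list; nothing else changes.
        List<String> direct = run(List.of("a", "b"), List.of(transform));
        if (!direct.equals(List.of("A", "B"))) throw new AssertionError();
    }
}
```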
The basic workflow of the Data Warehouse Loader is illustrated in Figure 4.4, which also shows the position of the Loader-interface and the internal data flow, represented by the intermediate files.
Figure 4.4: Workflow of Data Warehouse Loader: multistage filter-architecture
4.3.1. Advantages of workflow architecture
The workflow architecture of the Data Warehouse Loader is a filter architecture based on multiple stages, with text files and record object files as intermediate files. The whole complex functionality of the Data Warehouse Loader is split into manageable sub-functions with well-defined results. This workflow architecture has the following practical advantages:
1. The workflow architecture makes maintenance easy. Each single stage of the operation flow can be separated from the others in order to be extended or tested. Since each component is a complete unit, it is easy to control from a software engineering point of view.
2. Each single stage can be inserted and used in other programs or systems without significant reprogramming effort. This is not limited to the context of data warehouse applications, but extends to other applications with similar functionality. For example, the extraction program alone can be used to read data out of a certain database, and the importing program alone can be used to load data into a certain database. This is exactly the modularity approach of the software reuse concept.
3. With this workflow architecture, the whole complex functionality of the Data Warehouse Loader becomes manageable and controllable. Since the sub-functions are executed sequentially, the intermediate result files are linked together step by step and each of them is stored independently. In the case of an unpredictable system crash, the damage is therefore reduced to a minimum: all intermediate files up to the crashing stage are retained, and only the work from the crashing stage onwards needs to be done again.
4. The workflow architecture allows the operation flow of the Data Warehouse Loader to be decoupled in time. For example, if the data warehouse is not yet ready for use, operations such as "LdExtract" from the data source and "LdTransform" of the data can take place first and their intermediate results be saved; the "LdImport" into the data warehouse can then be made up later.
5. The intermediate files of the single stages are compatible with each other. The operation sequence can therefore be recomposed according to new requirements. This recomposability opens up further possibilities for reuse, which is again the modularity approach of software reuse.
A few more words on the recomposition of the operation sequence of the Data Warehouse Loader. In a certain application, the processed data may be highly dynamic, i.e. the data changes substantially between accesses. Figuring out the difference from the previous result then no longer makes sense, since the data has changed completely. In this case the user can simply remove the "LdDiff" stage and connect the transformation stage directly to the importing stage. Please see Figure 4.5. The output of the transformation stage directly becomes the input of the importing stage, which represents a new configuration of the Data Warehouse Loader workflow. This is a good example of the reusability of this software: it adapts to new application requirements very easily. The adaptation requires no great reprogramming effort, because apart from removing the "LdDiff" stage, the other operation stages simply remain the same.
Figure 4.5: A new configuration of Data Warehouse Loader
Each operation stage of the Data Warehouse Loader produces two kinds of file as results, so there are two kinds of intermediate file: the text file and the record object file.
On the one hand, the text file allows the user to read and check the results of each intermediate workflow stage. In the text file, the result of each stage is printed out as character streams. All text files have a uniform structure, consisting of an INFO-header with general information (source, reference data, creation time, target, etc.), a FORMAT-header describing the data structure of a record from the database table, and finally a DATA-body.
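As an illustration, such a text file might look as follows; the exact labels and layout are not specified in this chapter and are invented here:

```
INFO    source=Artikel   target=DimArtikel   created=...   reference=...
FORMAT  Id INTEGER; Name TEXT; Warengruppe TEXT
DATA
1; Milch; Molkerei
2; Brot; Backwaren
```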
On the other hand, the record object file is the real carrier that forwards information from one stage to the next. In the record object file, the result of each stage is stored as a linked list, i.e. a sequence of record objects. Each stage of the operation flow reads the record object file as its input and performs its individual processing on the record objects.
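Writing and reading such a record object file can be sketched with Java object serialization, assuming the record classes are serializable; here plain strings stand in for the record objects.

```java
import java.io.*;
import java.util.LinkedList;

public class RecordFileSketch {
    // A record object file stores the stage result as a linked list of records.
    static void write(File f, LinkedList<String> records) throws IOException {
        try (ObjectOutputStream out = new ObjectOutputStream(new FileOutputStream(f))) {
            out.writeObject(records);
        }
    }

    @SuppressWarnings("unchecked")
    static LinkedList<String> read(File f) throws IOException, ClassNotFoundException {
        try (ObjectInputStream in = new ObjectInputStream(new FileInputStream(f))) {
            return (LinkedList<String>) in.readObject();
        }
    }

    public static void main(String[] args) throws Exception {
        File f = File.createTempFile("stage", ".obj");
        f.deleteOnExit();
        LinkedList<String> records = new LinkedList<>(java.util.List.of("r1", "r2"));
        write(f, records);
        // The next stage reads the record object file back as its input.
        if (!read(f).equals(records)) throw new AssertionError();
    }
}
```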
Both the text file and the record object file are necessary; they serve different purposes. The text file makes the results of each intermediate stage visible and enables the user to decouple and recompose the operation flow of the Data Warehouse Loader. Otherwise, the results could only be checked in the target data warehouse after the final loading process, which is not practical at all. It is better to check the result of each stage in order to make sure that the stage has executed successfully. Since a result is available after every single stage, each component of the workflow can even be used separately, either separated in time (run at different working times) or separated in space (used in different applications). This decoupling technique contributes to the reusability of the Data Warehouse Loader. Another purpose of the text file is to make debugging easier: by examining the results of each single stage, the location of an error can be deduced more easily.
Figure 4.6: Mapping between the Artikel object and the Resultset from the database

The record object file makes it easier to pass information between the different stages. The crucial issue here is mapping relational data onto Java objects. Since Java is object-oriented, in many cases it is necessary not only to deal with individual data items, but also to work with objects that represent individual database records. The way information is handled at the object level usually differs from the way data is stored in a relational database. In the world of objects, the underlying principle is to make objects exhibit the same characteristics (information and behavior) as their real-world counterparts; in other words, objects function at the level of the conceptual model.
Relational databases, on the other hand, work at the data model level: they store information in normalized form, where a conceptual object may be decomposed into a number of tables. In some cases there is a straightforward relationship between the columns (fields) of a table and the member variables of an object, and the mapping task simply consists of matching the data types of the database table with those of Java. Please see Figure 4.6, which shows how the application constructs the Artikel object from the data of the Resultset returned by JDBC from the database. This mapping is a significant object-oriented feature of the otherwise non-object-oriented workflow of the Data Warehouse Loader.
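The mapping of Figure 4.6 can be sketched as follows. The column names Id, Name and Preis follow the Artikel table described earlier; the proxy-based ResultSet is only a stand-in so that the sketch runs without a database (a real application would obtain the ResultSet from a JDBC query).

```java
import java.lang.reflect.Proxy;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.Map;

public class MappingSketch {
    // Simplified stand-in for the real Artikel record class.
    record Artikel(int id, String name, double preis) {}

    // Construct an Artikel object from the current row of a JDBC ResultSet,
    // matching each column to the corresponding data member (Figure 4.6).
    static Artikel fromRow(ResultSet rs) throws SQLException {
        return new Artikel(rs.getInt("Id"), rs.getString("Name"), rs.getDouble("Preis"));
    }

    public static void main(String[] args) throws Exception {
        // Stand-in for one row returned by JDBC: the proxy answers every
        // getXxx("column") call by looking up the column name in a map.
        Map<String, Object> row = Map.of("Id", 7, "Name", "Milch", "Preis", 1.19);
        ResultSet rs = (ResultSet) Proxy.newProxyInstance(
            ResultSet.class.getClassLoader(), new Class<?>[]{ResultSet.class},
            (proxy, method, a) -> row.get(a[0]));
        Artikel artikel = fromRow(rs);
        if (artikel.id() != 7 || !artikel.name().equals("Milch")) throw new AssertionError();
    }
}
```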
Each record of a database table is represented by an object of a record class, whose data members correspond to the fields of the database record. All processing of database records is therefore done via record objects, i.e. the manipulation of records corresponds to the manipulation of the attributes of record objects. Each record object also has a data member representing the header information of its corresponding text file. Since accessing the attributes of an object is very easy in Java, this object-oriented structure results in easy manipulation of each record from the database table and makes the Java source code much more readable than a C program without object-oriented features. If, on the contrary, each database record were represented by a stream of characters, as in the text file, the record could only be accessed by reading the text file line by line; the position of each character would have to be calculated in order to fetch exactly the required information and discard the useless parts, which consumes a lot of programming effort.
4.3.2. Major tasks of the workflow of Data Warehouse Loader
As stated before, the basic workflow of the Data Warehouse Loader realizes the functionality of a data warehouse ETL system and is divided into several sub-functions in this implementation. Each sub-function fulfills a certain task, as explained below one by one.
4.3.2.1. Extraction
Figure 4.7: Program of LdExtract
The task of the extraction program, LdExtract, is to extract data from the source system in order to make the data available in the required form for further manipulation. Because of the heavy system workload involved, the extraction program needs direct, intensive access to the data source. Please see Figure 4.7.
The data source is passed to the extraction program as a parameter. Afterwards, LdExtract loads the corresponding extraction classes and record classes. With the help of the predefined functions in the extraction and record classes, a connection to the corresponding data source is created, its records are read out, and finally the connection is closed.
In this process, the Loader-engine program, LdExtract, only knows the basic operation sequence for accessing a data source: creating a connection, reading, and closing the connection. It does not know whether the data source is a database or a text file, or which data should be extracted. Those questions are answered by the extraction classes, which implement the extraction-interface and define the type of the data source, which data source connection to establish, which data to extract, and so on. The task of the record classes, which implement the record-interface, is to determine the data structure of the database record and the header information in the text file. Accordingly, each table obtains its individual operation in the extraction process. Therefore, LdExtract belongs to the common part of Data Warehouse Loader, while the extraction classes and record classes belong to the data-specific part. Please see Figure 4.8 for the analysis of the extraction-interface in the extraction program.
Figure 4.8: Analysis of extraction-interface in extraction program
For example, the extraction-interface declares the following methods: createConnection(), read(), closeConnection(), setExBasisFile(BasisFile aBasisFile), exSaveBasisFile(). Each extraction class, namely ExArtikel, ExMarkt and ExVerkauf, has its own implementation of these methods, which means that different database tables can have totally different ways of establishing the connection to the data source, reading data and closing the connection. When the class LdExtract calls these methods at run time, the implementation that is executed is determined by a command line argument, which is passed into LdExtract via the parameter myExtractor.
This dynamic class loading process makes use of a Java mechanism called polymorphism. The word polymorphism means the ability to assume several different forms or shapes. In programming terms it means the ability of a single variable of a given type to reference objects of different types, and to automatically call the methods that are specific to the type of object the variable references. This enables a
single method call to behave differently, depending on the type of the object to which the call applies.
When a method is called using a variable of a base type, polymorphism results in the method being selected based on the type of the object stored, not the type of the variable. Here the base type is ExInterface, and the specific object types are ExArtikel, ExMarkt and ExVerkauf. Because a variable of a base type can store a reference to an object of any derived type, the kind of object stored is not known until the program executes. Thus the choice of which method implementation to execute is made dynamically while the program is running and cannot be determined at compile time. The methods createConnection(), read() and closeConnection() that are called through the variable of type ExInterface in the earlier illustration may do different things depending on what kind of object the variable references.
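The dispatch described above can be illustrated with a minimal sketch. The interface and class names (ExInterface, ExArtikel, ExMarkt) follow the thesis; the method bodies and return values are invented placeholders, and LdExtractSketch only hints at how the class named by myExtractor could be loaded dynamically.

```java
// Minimal sketch of the extraction-interface dispatch; method bodies
// are placeholder assumptions, only the structure follows the thesis.
interface ExInterface {
    void createConnection();
    String read();
    void closeConnection();
}

class ExArtikel implements ExInterface {
    public void createConnection() { /* e.g. open a JDBC connection */ }
    public String read()           { return "Artikel record"; }
    public void closeConnection()  { /* release the connection */ }
}

class ExMarkt implements ExInterface {
    public void createConnection() { /* e.g. open a text file */ }
    public String read()           { return "Markt record"; }
    public void closeConnection()  { /* close the file */ }
}

class LdExtractSketch {
    // the concrete class name arrives as the command line argument
    // myExtractor; Class.forName loads it dynamically at run time
    static ExInterface load(String className) throws Exception {
        return (ExInterface) Class.forName(className)
                                  .getDeclaredConstructor()
                                  .newInstance();
    }
}
```

The variable of type ExInterface does not know at compile time which read() it will execute; the object it happens to reference at run time decides.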
This polymorphism mechanism of Java introduces a new level of capability in programs using objects. It implies that programs can adapt at run time to accommodate and process different kinds of data quite automatically. It is also a key point for software reuse. Once a well-designed interface, such as the extraction-interface, is established, it is easy to add a new data source or to introduce a new rule for reading data. The only modification needed is to add a new extraction class, which implements the extraction-interface, or to change an existing extraction class, respectively. The Loader-engine part, LdExtract, remains the same and does not even need to be recompiled. That is how the high software reusability of Data Warehouse Loader is achieved.
The same mechanism is used for the record-interface. Here the base type is RecordInterface and the specific object types are RecordArtikel, RecordMarkt and RecordVerkauf. The methods setFmBasisFile(argBasisFile), format() and getFmBasisFile(), which are called through the variable of type RecordInterface in LdExtract, may do different things depending on what kind of object the variable references. This is determined by a command line argument, which is passed into LdExtract via the parameter myRecord. Thus, when a new database table is added to Data Warehouse Loader and a new data structure for its records is required, the only modification is to add a new record class, which implements the record-interface, to represent it. The rest of the program remains the same. Please see Figure 4.9 for the analysis of the record-interface in the extraction program.
Figure 4.9: Analysis of record-interface in extraction program
4.3.2.2. Transformation
The transformation program, LdTransform, reads the extraction result and produces new data whose structure matches that of the target data warehouse table. The transformation result can be either a historical table, which represents the newly updated current state of the database, or an accumulative table, which represents the latest changes to the database. Please see Figure 4.10.
The data source and data target are passed to the transformation program as parameters. Afterwards, LdTransform loads the corresponding transformation classes and record classes. With the help of the predefined functions in the transformation and record classes, the data is transformed into the required format, and afterwards the transformation results are not only saved in the record object file, but also output to the text file.
Figure 4.10: Program of LdTransform
In this process, the Loader-engine program, LdTransform, only knows the basic operation sequence for transforming data: reading and saving the header information of the text file, transforming the data, and saving and outputting the transformation results. It does not know the specific transformation scheme for each database table. This question is answered by the transformation classes, which implement the transformation-interface and define how the data is transformed. The task of the record classes is the same as in the extraction program, namely determining the data structure of the database record and the header information in the text file. Thus, each table obtains its own individual transformation scheme in this process. Therefore, LdTransform belongs to the common part of Data Warehouse Loader, while the transformation classes and record classes belong to the data-specific part. Please see Figure 4.11 for the analysis of the transformation-interface in the transformation program.
Here the polymorphism mechanism is used in the same way as in the extraction program. The transformation-interface declares the following methods: setTrBasisFile(BasisFile aBasisFile), transform(), trSaveRecords(), trSaveBasisFile(), trOutput(). Each transformation class, namely TrDimArtikel, TrDimMarkt and TrFactUmsatz, has its own implementation of these methods. Therefore, the base type is TrInterface and the specific object types are TrDimArtikel, TrDimMarkt and TrFactUmsatz. When these methods, declared in the transformation-interface, are called through the variable of type TrInterface in the LdTransform program, they may do different things depending on what kind of object the variable references. This is determined by a command line argument, which is passed into LdTransform via the parameter myTransformer.
Figure 4.11: Analysis of transformation-interface in transformation program
Consequently, when a new transformation scheme is required or an old one is changed, the only modification is to add a new transformation class or to change the existing one, respectively. All transformation classes implement the transformation-interface. The rest of the program remains the same. In this way, the high software reusability of Data Warehouse Loader is achieved.
Normally one data source table corresponds to one target data warehouse table. But it is also possible that more than one target data warehouse table loads data from one data source table, or that one target data warehouse table loads data from more than one data source table. This kind of combination relationship is also determined by the transformation classes.
Another element supporting the configuration of the data transformation scheme is a transformation initial interface, called TrIniInterface. Here another property of Java is used: a class can implement multiple interfaces when necessary, and when a class implements an interface, any constants that were declared in the interface definition are available directly in the class, just as though they were inherited from a base class. Therefore, a transformation class can implement not only the transformation-interface but also this transformation initial interface, in which special rules for data transformation are defined. In particular, the transformation class can make use of any constants declared in the transformation initial interface.
In the demo example of the Data Warehouse Loader implementation, one of the transformation rules for the Artikel table is to replace certain Artikel types with a predefined term. That means that when those Artikel types are encountered in the extraction result, they are changed. Therefore, the names of those Artikel types are grouped as an array in the transformation initial interface. Each transformation class implements this interface and obtains the information about the array. Consequently, each Artikel type in the extraction result can be compared against this array in order to determine whether it belongs to the array, so that the proper operation can be applied.
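This use of constants inheritance can be sketched as follows. The interface name TrIniInterface follows the thesis, but the concrete Artikel type names, the replacement term and the helper method are invented examples; only the pattern of implementing the interface to obtain its constants is taken from the text.

```java
// Hypothetical sketch of the transformation initial interface; the
// array contents and the replacement term are invented examples.
interface TrIniInterface {
    // Artikel types that must be replaced by a predefined term
    String[] SPECIAL_TYPES = { "TypA", "TypB" };
    String REPLACEMENT = "STANDARD";
}

class TrDimArtikelSketch implements TrIniInterface {
    // the constants of the interface are directly visible here,
    // as though they were inherited from a base class
    String transformType(String artikelTyp) {
        for (String t : SPECIAL_TYPES) {
            if (t.equals(artikelTyp)) return REPLACEMENT;
        }
        return artikelTyp;
    }
}
```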
The Java property of multiple interface implementation and constants inheritance is also utilized for the initial configuration of Data Warehouse Loader, namely in the interfaces FileInterface, RCInterface and LoaderIniInterface. FileInterface defines all the file types of intermediate results for their header information in the text file. RCInterface defines the return codes in case of error. LoaderIniInterface defines the storage path of intermediate files, the postfix of intermediate files, and the URL strings for the data source system and the target data warehouse. This configuration information is available to every class in the whole package, provided that the class implements those interfaces and thus obtains the constants.
As far as the record-interface in the transformation program is concerned, it plays exactly the same role as in the extraction program. The only difference is that another three record classes are used here, namely RecordDimArtikel, RecordDimMarkt and RecordFactUmsatz, since the data structure has been changed to suit the target data warehouse. That means that after the transformation stage the construction of the records in the dimension tables is completed.
4.3.2.3. Figuring out difference
Figure 4.12: Program of LdDiff
In this stage, the difference between the current transformation result and the previous status of the data warehouse is determined. Please see Figure 4.12.
Here the previous status of the data warehouse is obtained from a so-called status file instead of being read directly from the data warehouse. This is much faster and easier than direct database access. However, the status file has to be updated whenever the data warehouse is updated, which is the task of the merging program, LdMerge, described later.
The name of the table being compared is passed into LdDiff as a command line argument. Then its current transformation result and previous status are read and compared record by record. First all the fields of two given records are compared, and then only the key fields of these records, in order to determine whether a record is new or changed. The result of the LdDiff program is three difference files: the new file, which contains new records; the chg file, which contains changed records; and the del file, which contains deleted records.
The record-interface is used in this process when reading data from the input files and saving the results. With the help of the record-interface, the LdDiff program can be used for any database table, provided that this database table has a corresponding record class which implements the record-interface. In the LdDiff program, the common methods are called through a variable of type RecordInterface and may do different things depending on the type of record the variable references. For example, the criteria for comparing records differ from record type to record type. Therefore, each record class has its own implementation of the methods compareFields(RecordInterface aRecordInterface) and compareKeyFields(RecordInterface aRecordInterface). The method compareFields compares all the fields of two records and returns true when they are the same and false when they are different. If the two records are different, the method compareKeyFields is called to compare their key fields. It also gives their ordering by returning –1, 0 or 1, where the sorting criterion is given by the key fields, as stated in the text file. In this way a given record can be classified as a new record or a changed record. Of course, if a record in the status file is no longer found in the current transformation result, it is a deleted record.
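The classification logic can be sketched as follows. For illustration a record is reduced to a single key field and a single data field, which is a simplification of the real record classes; the method names follow the thesis.

```java
// Simplified sketch of the comparison methods used by LdDiff; a
// record is reduced to one key field and one data field.
class DiffRecordSketch {
    String key;    // stands for the key fields
    String value;  // stands for all remaining fields
    DiffRecordSketch(String k, String v) { key = k; value = v; }

    // true when all fields of both records agree
    boolean compareFields(DiffRecordSketch other) {
        return key.equals(other.key) && value.equals(other.value);
    }

    // -1, 0 or 1 according to the ordering of the key fields
    int compareKeyFields(DiffRecordSketch other) {
        return Integer.signum(key.compareTo(other.key));
    }
}
```

If compareFields returns false but compareKeyFields returns 0, the record is a changed record; a key that appears only in the transformation result marks a new record, and a key that appears only in the status file marks a deleted record.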
4.3.2.4. Importing
Figure 4.13: Program of LdImport
The importing program, LdImport, imports the difference files resulting from the last stage into the data warehouse. Each run of the importing program imports one of these difference files. Therefore, the importing program is executed three times for each database table: for the new records, the changed records and the deleted records separately. Please see Figure 4.13.
The target data warehouse for importing is passed to the importing program as a parameter. Afterwards, LdImport loads the corresponding database classes, which implement the database-interface. With the help of the predefined functions in the database classes, the data from the difference files is imported: a connection to the target data warehouse is created, the new records are inserted, the changed records are updated, the deleted records are removed, and finally the connection is closed.
Figure 4.14: Analysis of database-interface in importing program and retrieving program
In this process, the Loader-engine program, LdImport, only knows the basic operation sequence for importing data, such as inserting, updating and removing. It does not know which data should be processed for each database table or what kind of connection to the data warehouse should be created. These questions are answered by the database classes, which define what kind of connection is established, what data to import, and so on. Thus, each table obtains its individual operation in the importing process. Therefore, LdImport belongs to the common part of Data Warehouse Loader, while the database classes belong to the data-specific part. Please see Figure 4.14 for the analysis of the database-interface in the importing program.
Figure 4.14 serves both the importing program and the retrieving program, since both of them make use of the database-interface in the same way.
Here the polymorphism mechanism is used in the same way as in the extraction program and the transformation program. The database-interface declares the following methods: createConnection(), setDbBasisFile(), dbNew(String aString), dbChg(), dbDel(), closeConnection(). Each database class, namely DbDimArtikel, DbDimMarkt and DbFactUmsatz, has its own implementation of these methods. Therefore, the base type is DbInterface and the specific object types are DbDimArtikel, DbDimMarkt and DbFactUmsatz. When these methods declared in the database-interface are called through the variable of type DbInterface in the LdImport program, they may do different things depending on what kind of object the variable references. This is determined by a command line argument, which is passed into LdImport via the parameter myImporter.
Consequently, when a new target data warehouse is added, the only modification is to add a new database class, which implements the database-interface and does all the updating work on the data warehouse. The rest of the program remains the same. In this way, the high software reusability of Data Warehouse Loader is achieved.
During the execution of the importing program, it may happen that a certain record cannot be imported into the target data warehouse, i.e. it cannot be inserted, changed or deleted. The reasons can lie in the integrity checks of the target data warehouse, such as constraints against violated foreign-key relationships or values exceeding the allowed range. In this case, the status of the data warehouse is not updated according to the corresponding current transformation result. In order to indicate this case, each record class contains a data member of type character, called cImportFlag. This importing flag is initialized to a dot '.'. If the record is successfully imported, namely successfully inserted, updated or removed, the importing flag is changed to 'Y'. Otherwise, the importing flag remains in its initial state. Afterwards, this importing flag information is read by the merging program in order to determine the proper merging operation, as described below.
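The flag handling amounts to the following sketch. The member name cImportFlag and the values '.' and 'Y' are taken from the thesis; the enclosing class and the method markImported are stand-ins invented for illustration.

```java
// Sketch of the importing flag; field name and values follow the
// thesis, the enclosing class stands in for a record class.
class ImportFlagSketch {
    char cImportFlag = '.';   // initial state: not (yet) imported

    // called after the insert/update/delete attempt on the warehouse
    void markImported(boolean succeeded) {
        if (succeeded) {
            cImportFlag = 'Y';
        }
        // on failure the flag keeps its initial state '.'
    }
}
```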
4.3.2.5. Merging
After the difference data has been imported into the target data warehouse, the loading process is not yet complete. For the next workflow execution of Data Warehouse Loader, a status file needs to be created, which represents the current status of the data warehouse and serves the purpose of execution speed. As stated before, because errors can occur during the importing process, the construction of the status file must also take the reaction of the data warehouse in the importing process into account. That is the task of the merging program, LdMerge. Please see Figure 4.15.
Figure 4.15: Program of LdMerge
The merging process handles new records, changed records and deleted records separately. The most important piece of information in the merging process is the importing flag of each record.
For the new records: if the importing flag has been changed to 'Y', which stands for a successful insert, the record is added to the status file. Otherwise, nothing is added.
For the changed records: if the importing flag has been changed to 'Y', which stands for a successful change, the corresponding record in the status file is first found and deleted, and then the record from the chg file is added to the status file. In this way the record is changed in the status file as well. Otherwise, nothing is done.
For the deleted records: if the importing flag has been changed to 'Y', which stands for a successful delete, the corresponding record in the status file is removed. Otherwise, nothing is done.
The record-interface is used in the merging process when reading data from the difference files and saving the results. With the help of the record-interface, the LdMerge program can be used for any database table, provided that this database table has a corresponding record class which implements the record-interface. In the LdMerge program, the common methods are called through a variable of type RecordInterface and may do different things depending on the type of record the variable references. For example, when merging changed records, if the importing flag is 'Y', the corresponding record in the status file must first be found and then deleted. For this purpose, the method indexOf() of an object of type LinkedList is called in order to find the argument object in the linked list of the status file. For this method to work properly, the compared object must implement the method equals(Object aObject), which is inherited from Object. Therefore, the record-interface declares this method and each record class has its own implementation of it, defining the concrete criteria for comparing records.
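The interplay of indexOf() and equals() can be sketched as follows. The key-only equality shown here is an assumption made for illustration (so that a changed record can still find its previous version); the thesis leaves the concrete comparison criteria to each record class.

```java
// Sketch of the merge step for changed records: LinkedList.indexOf
// locates the old version via equals(). Comparing only the key
// fields here is an assumption made for illustration.
class MergeRecordSketch {
    String key;
    String value;
    MergeRecordSketch(String k, String v) { key = k; value = v; }

    @Override public boolean equals(Object o) {
        return (o instanceof MergeRecordSketch)
                && key.equals(((MergeRecordSketch) o).key);
    }
    @Override public int hashCode() { return key.hashCode(); }
}
```

With this equals(), indexOf(changedRecord) finds the record with the same key in the status list, which is then removed before the new version from the chg file is appended.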
4.3.2.6. Retrieving
Figure 4.16: Program of LdRetrieve
The retrieving program is used to construct the status file directly from the target data warehouse. Please see Figure 4.16. Therefore, the result of the retrieving program is the same as that of the merging program. The retrieving program does not belong to the normal daily workflow of Data Warehouse Loader. However, it is executed for the following purposes:
• Controlling the result of the merging program
• Recovering from a system crash: when the workflow of Data Warehouse Loader fails because of a system error and the status file no longer exists, the status file can be recovered directly from the target data warehouse by means of the retrieving program.
The target data warehouse for retrieving is passed to the retrieving program as a parameter. Afterwards, LdRetrieve loads the corresponding database classes. With the help of the predefined functions in the database classes, the data is retrieved from the respective target data warehouse.
In this process, the Loader-engine program, LdRetrieve, only knows the basic operation sequence for retrieving data: creating the connection to the data warehouse, fetching data, saving records, outputting results and closing the connection. It does not know which data should be retrieved or what kind of connection to the data warehouse should be created. These questions are answered by the database classes, which implement the database-interface and define which data to retrieve, what kind of connection is established, and so on. Thus, each table obtains its individual operation in the retrieving process. Therefore, the LdRetrieve program belongs to the common part of Data Warehouse Loader, while the database classes belong to the data-specific part. Please see Figure 4.14 for the analysis of the database-interface in the retrieving program.
Here the polymorphism mechanism is used in the same way as in the extraction, transformation and importing programs. The database-interface declares the following methods: createConnection(), fetch(), dbSaveRecords(), setDbBasisFile(), dbOutput(), closeConnection(). Each database class, namely DbDimArtikel, DbDimMarkt and DbFactUmsatz, has its own implementation of these methods. Therefore, the base type is DbInterface and the specific object types are DbDimArtikel, DbDimMarkt and DbFactUmsatz. When these methods declared in the database-interface are called through the variable of type DbInterface in the LdRetrieve program, they may do different things depending on what kind of object the variable references. This is determined by a command line argument, which is passed into LdRetrieve via the parameter myImporter. The result of the retrieving process is both saved in the record object file and printed out in the text file for checking.
Consequently, when a new target data warehouse is added, the only modification is to add a new database class, which implements the database-interface and does all the retrieving work from the data warehouse. The rest of the program remains the same. In this way, the high software reusability of Data Warehouse Loader is achieved.
4.4. Loader-interface: interface concept of Data Warehouse Loader
According to the preceding analysis, one of the most important goals of the architecture design of Data Warehouse Loader is to decouple the common part for the basic workflow from the data-specific part for an individual application environment. Therefore, the interface concept plays a crucial role in the implementation of Data Warehouse Loader. The common part, the Loader-engine, has been explained above. The following describes the data-specific part: the Loader-interface.
The Loader-interface constitutes the data-oriented structure of Data Warehouse Loader, and each of its interfaces supplements the Loader-engine with the following functionality:
• Extraction-interface: access to the data source with sequential reading capability
• Transformation-interface: transformation of the source data into the target format
• Database-interface: access to the data warehouse for updating and retrieving data
• Record-interface: construction of an object structure for each database record
The Loader-interface resides in one package, named myPackage.loaderInterface. Each of the Loader-interfaces is combined with a particular Loader-engine program and fulfills a certain part of the whole processing task, from the data source to the target data warehouse. From the point of view of the Data Warehouse Loader system, it is the proper combination of Loader-interface and Loader-engine that makes the sequential extracting, transforming and loading process run, and moreover makes Data Warehouse Loader more reusable.
The Loader-interface is responsible for all functionality that is related to a specific data source, a specific data structure, a specific data transformation scheme, specific business logic and so on. The Loader-interface needs to be reprogrammed when Data Warehouse Loader is transferred from one application to another, since different customers obviously have different business logic. However, a well-designed Java interface reduces this reprogramming effort as much as possible. Therefore, the interface concept of Java is important for the implementation of the Loader-interface.
Normally in Java, a common method can be defined in a base class and then implemented individually in each of the subclasses. The method signature is the same in each class and the method can be called polymorphically. When all that is wanted is a set of methods implemented in a number of different classes such that they can be called polymorphically, the base class can be dispensed with altogether. The same polymorphic behavior can be achieved much more simply by using a Java facility called an interface.
The name of this facility indicates its primary use: specifying a set of methods that represents a particular class interface, which can then be implemented individually in a number of different classes. All of these classes then share this common interface, and the methods in it can be called polymorphically.
An interface is essentially a collection of constants and abstract methods. To make use of an interface, a class implements it, which means that the class provides code for each of the methods declared in the interface as part of its class definition. When a class implements an interface, any constants that were defined in the interface definition are available directly in the class, just as though they were inherited from a base class. This constants inheritance is used in particular to define some interfaces in Data Warehouse Loader that consist only of constants and serve the initial configuration of the system, for example FileInterface, RCInterface, LoaderIniInterface and TrIniInterface.
The interfaces of Data Warehouse Loader have the following functions:
4.4.1. Extraction-interface
The extraction-interface is the interface to the data source system, with functions to create a connection to the data source, read data, close the connection, and set and save the header information for the text file. It assumes that the data can be read sequentially, record by record, from the data source system. The reading function of the extraction-interface corresponds to reading a line of text from a text file or selecting a record from a database; therefore the current position in the text file or in the database is required. The extraction-interface is used in the extraction program. Please see Figure 4.17.
Figure 4.17: Extraction-interface of Data Warehouse Loader
4.4.2. Transformation-interface
The transformation-interface is the interface for the different data transformation schemes of a particular application. The functions of the transformation-interface are to transform data, output data to the text file, save data in the record object file, and set and save the header information for the text file. The transformation-interface applies the essential changes to the records from the data source so that they fit the data structure of the target data warehouse, for example the dimensional data model and a star schema. The transformation-interface is used in the transformation program. Please see Figure 4.18.
Figure 4.18: Transformation-interface of Data Warehouse Loader
4.4.3. Database-interface
The database-interface is the interface to the target data warehouse. The database-interface serves both the importing program and the retrieving program, since both of them need to communicate with the data warehouse. For the importing program, the database-interface creates a connection to the data warehouse, inserts the new records, updates the changed records, removes the deleted records, closes the connection, and sets and saves the header information for the text file. For the retrieving program, the database-interface creates the connection, fetches data, outputs data to the text file, saves data in the record object file, closes the connection, and sets and saves the header information for the text file. Please see Figure 4.19.
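The functions just listed can be collected into a single Java interface declaration. The method names follow those mentioned in the thesis; the parameter lists (beyond dbNew) and the stub class are assumptions for illustration.

```java
// Database-interface as one Java declaration; method names follow
// the thesis, parameter lists are partly assumed.
interface DbInterfaceSketch {
    void createConnection();
    void setDbBasisFile();
    // importing side
    void dbNew(String aString);
    void dbChg();
    void dbDel();
    // retrieving side
    void fetch();
    void dbSaveRecords();
    void dbOutput();
    void closeConnection();
}

// stub implementation standing in for e.g. DbDimArtikel; it only
// records which operation was called last
class DbStub implements DbInterfaceSketch {
    String last = "";
    public void createConnection() { last = "connect"; }
    public void setDbBasisFile()   { last = "basis"; }
    public void dbNew(String s)    { last = "new:" + s; }
    public void dbChg()            { last = "chg"; }
    public void dbDel()            { last = "del"; }
    public void fetch()            { last = "fetch"; }
    public void dbSaveRecords()    { last = "save"; }
    public void dbOutput()         { last = "output"; }
    public void closeConnection()  { last = "close"; }
}
```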
Figure 4.19: Database-interface of Data Warehouse Loader
4.4.4. Record-interface
The record-interface is the interface that provides an object structure for each database record, which gives an object-oriented character to this otherwise non-object-oriented procedure. The way information is organized at the object level is usually different from the way it is stored in a relational database. In the object-oriented world, objects operate at the level of the conceptual model, whereas relational databases work at the data model level. The record-interface is the bridge that maps relational data onto Java objects.
Each field of a database record corresponds to one of the data members of the record class. Additionally, each record class has a data member of type character, called cImportFlag, to indicate the importing status of each record. The record-interface also has rich functionality related to the configuration of the data structure, such as initializing the importing flag, getting the status of the importing flag, comparing all fields of two records, comparing the key fields of records, determining the equivalence of records, specifying the format in which records are shown in the text file, and setting the header information for the text file. In a word, the record-interface is responsible for the data representation in Data Warehouse Loader. Please see Figure 4.20.
Figure 4.20: Record-interface of Data Warehouse Loader
4.5. Format of intermediate files
As stated before, the intermediate text files have a uniform format, which consists of an Info-header, a Format-header and a Data-body, as explained in the following. Please see Figure 4.21 for an example of the extraction text file.
Figure 4.21: Example of the extraction text file
1. Info-header
The Info-header starts with the key word $INFO and ends with the key word §END. The
Info-header includes the following information about the text file.
• FILETYPE: the type of the intermediate file, produced by the respective loading workflow stage. Please see Table 4.1.
• SOURCE: the data source from which data is extracted by the extraction program
• TARGET: the target data warehouse into which the data is finally imported
• REFDATE: the reference date, which indicates the time state of the operational data source and can be given as a command line argument to the extraction program
• CREDATE: the creation date of this result file
• CRETIME: the creation time of this result file
• EXTFILE: the storage path and name of the extraction file
• SORT: the sorting criteria, which define according to which columns the records are sorted. The column names are consistent with the Format-header. If there is more than one column name, they are separated by commas and the order of the names is relevant. If the file does not need to be sorted, FILE_NO_SORT is given.
LD-EXT   Extraction data
LD-TFM   Transformation data
LD-NEW   Difference data with new records
LD-CHG   Difference data with changed records
LD-DEL   Difference data with deleted records
LD-MRG   Status data created by the merging program
LD-RET   Status data created by the retrieving program
Table 4.1: File type information in the text file
2. Format-header
The Format-header starts with the keyword $FORMAT and ends with the keyword §END. It defines the name, the starting position and the width of each column in the text file; the importing flag is not counted. Each column has format information of the following form:
<name of column>: <starting position>, <width>
The Format-header information is determined by the method format(), which is defined by each record class.
3. Data-body
The Data-body starts with the keyword §DATA and ends with the keyword §END. It contains the useful data of this file in a predefined column format, as described in the Format-header. Every line starts with the importing flag, which is initialized to a dot, followed by a vertical bar that separates the importing flag from the actual data.
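Taken together, the three parts might look roughly as follows for an extraction file of type LD-EXT. The keywords are the ones described above; the column names, values, paths and dates are invented for illustration only:

```
$INFO
FILETYPE: LD-EXT
SOURCE: OperationalDB
TARGET: TargetDWH
REFDATE: 30.09.2001
CREDATE: 01.10.2001
CRETIME: 14:25:03
EXTFILE: C:\loader\artikel.ext
SORT: ARTIKELNR
§END
$FORMAT
ARTIKELNR: 0, 10
NAME: 10, 30
§END
§DATA
.|0000000001Milk
.|0000000002Bread
§END
```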
4.6. Sorting the linked list of record objects
In the workflow of Data Warehouse Loader, all the record objects of an intermediate file are organized in a linked list. It is necessary to sort the record objects according to certain sorting criteria, such as Artikel number or Markt ID. Sorting is especially important when two linked lists are compared in order to find their difference, as required when figuring out the difference between the current transformation result and the previous one. Sorting is also required when transforming a large group of source records into another large group of target records (m:n transformation).
A specific sorting method could be written for the record objects, but it is far less trouble to take advantage of another feature of the java.util package, the Collections class. The Collections class defines a variety of handy static methods, one of which is the sort() method. The sort() method only sorts lists, that is, collections that implement the List interface.
Figure 4.22: Code fragment for sorting the linked list of record objects
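The pattern just described can be sketched as follows. The record class and its field are invented for illustration; the real record classes of Data Warehouse Loader implement Comparable in the same way but with their own fields:

```java
import java.util.Collections;
import java.util.LinkedList;

// Invented record class; its key field serves as the sorting criterion.
class MarktRecord implements Comparable<MarktRecord> {
    final int marktId;
    MarktRecord(int marktId) { this.marktId = marktId; }

    // Negative, zero or positive, depending on whether this record's key
    // is less than, equal to or greater than the other record's key.
    public int compareTo(MarktRecord other) {
        return Integer.compare(marktId, other.marktId);
    }
}

public class SortDemo {
    public static LinkedList<MarktRecord> sorted(int... ids) {
        LinkedList<MarktRecord> list = new LinkedList<>();
        for (int id : ids) list.add(new MarktRecord(id));
        Collections.sort(list);   // sorts the list in place; no return value
        return list;
    }
    public static void main(String[] args) {
        for (MarktRecord r : sorted(3, 1, 2)) System.out.println(r.marktId);
    }
}
```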
Obviously there also has to be some way for the sort() method to determine the order of the objects in the list being sorted, in this case record objects. The most suitable way to do this is to implement the Comparable interface in each record class. The Comparable interface declares only one method, compareTo(). It is the same method seen in the String class; it returns a negative integer, zero, or a positive integer depending on whether the current object is less than, equal to, or greater than the argument passed to the method. If each record class implements the Comparable interface, the linked list of record objects can be passed directly to the sort() method. The collection is then sorted in place, so there is no return value. This is the way all the record objects in a linked list are sorted in Data Warehouse Loader. Please see Figure 4.22 for the code fragment used for sorting in Data Warehouse Loader.
4.7. Graphical user interface of Data Warehouse Loader
In this work, all the functionality of Data Warehouse Loader discussed above is implemented in Java. Additionally, a graphical user interface is provided for convenience, to run the workflow of Data Warehouse Loader and to check the intermediate result of each processing stage. Please see Figure 4.23.
Figure 4.23: Graphic user interface of Data Warehouse Loader
The lower part of the GUI window serves as the output window, while the upper part provides the different options for running Data Warehouse Loader. “Loader Workflow” is the complete operation sequence of the extracting, transforming and loading process. “Target” indicates the database table with which Data Warehouse Loader is currently working. To run Data Warehouse Loader, normally first choose one of the radio buttons under “Loader Workflow” and one under “Target”, then press the “Execute” button. The resulting intermediate file of the current execution is shown in the lower part as output.
“Options” is used especially for the difference and importing stages. After figuring out the difference between the current transformation result and the previous one, there are three separate difference files, for new, changed and deleted records. It is better to read each of them separately; therefore, choose one of the radio buttons under “Options” in order to read the three difference files one after another. As far as the importing program is concerned, each execution imports one of these three difference files, that is, it inserts new records, updates changed records or removes deleted records. Therefore, one of the radio buttons under “Options” must also be chosen when importing. On the initial run of Data Warehouse Loader, however, there is neither an old status file nor any content in the target data warehouse, and consequently there are no difference files. In this case, the transformation result is imported directly into the target data warehouse, which means that all of the transformation results are new records to be inserted. The current transformation result is therefore also a possible input parameter for the importing program.
5. Reuse Analysis of Data Warehouse Loader
As stated before, it is important to make the software of Data Warehouse Loader reusable within a standard data warehouse application. The reason is simply that this software should fit different operational data sources, different data transformation schemes and different target data warehouses. Beyond these requirements of the data warehouse application, the software can also be employed in other applications in which data is transferred from system A to system B; for example, Data Warehouse Loader can work as a bridge during the step-by-step replacement of an old system by a new one. Of course, certain modifications are then necessary. Therefore, in this work, the software of Data Warehouse Loader is designed especially in favor of software reuse.
Chapter 2 presented the fundamental concepts of software reuse, which are essential for understanding the importance of this technology. Chapter 3 introduced the development context of Data Warehouse Loader, especially the standard architecture of the data warehouse application in which Data Warehouse Loader resides. Chapter 4 then explained the implementation of this software in detail: its overall architecture, its workflow, its intermediate results, Loader-engine, Loader-interface, and so on. In other words, chapter 2 is the theoretical part of this work, while chapters 3 and 4 are the practical part. The task of this chapter is therefore to combine the theoretical and practical parts, that is, to analyze the reuse architecture of Data Warehouse Loader explicitly.
5.1. Reuse development process of Data Warehouse Loader
In order to make software or software components reusable, the following has been done in the implementation of Data Warehouse Loader:
- The reusability of the software is taken into consideration from the early planning stage of software development onwards.
- Accurate and concrete reuse concepts are developed.
- Good documentation is provided, both system documentation and documentation in the source code.
- A universal programming convention is used.
- Specifications are concretely identified and separated from any particular project or customer.
- The software is built on the basis of existing frameworks, libraries and components.
- The application requirements are thoroughly examined. It is not enough to know only what to do; the requirements must be considered intensively in order to find and resolve even blurred and contradictory points. Moreover, it is best to arrive at a general component from which other developers can profit as well.
Data Warehouse Loader shows several characteristics that make it reusable:
- It is general.
- It masks the basic functions, such as database access, error handling and so on.
- It is programmed in a widely and popularly used, platform-independent programming language.
Accordingly, the requirement analysis and architecture design of Data Warehouse Loader are extremely important in this work. Because of the requirements of the data warehouse application, namely to fit different data sources, different target data warehouses and different data transformation schemes, the interface concept of Java is exploited. There are four interfaces in Data Warehouse Loader: the extraction interface is oriented to different data sources, the transformation interface to different transformation schemes, the database interface to different target data warehouses, and the record interface is used throughout the whole process, because records are the objects under consideration at all times.
Figure 5.1: Decoupling architecture of Loader-engine and Loader-interface in detail
With regard to the architecture design of Data Warehouse Loader, the important question is how to divide the whole functionality into sub-functions. It must be carefully considered that the workflow of Data Warehouse Loader is non-object-oriented, while an object-oriented language is used here. As stated before, the “what” information is essential information that should be available to everyone; it includes specifications and interface information. The “how” information should be available only to a limited group; it includes implementation details such as data structures. The separation of “what” and “how” information is achieved by means of an architecture containing both Loader-engine and Loader-interface.
Loader-engine contains the “how” information. It knows how to extract data from the data source, how to transform data, how to figure out the difference, how to import data into the data warehouse, how to merge data into status files and how to retrieve data from the data warehouse. Moreover, Loader-engine knows the sequence of the workflow of Data Warehouse Loader. Loader-interface, on the other hand, contains the “what” information. It knows what kind of data to extract from the data source, what transformation scheme to perform, which difference to deal with (new, changed or deleted records), and what kind of data to retrieve from the data warehouse.
Figure 5.2: Detail of Loader-engine
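The decoupling of “how” and “what” can be sketched in a few lines. ExInterface is named in the text; the LoaderEngine constructor, the extract() method and its return type are invented here for illustration:

```java
import java.util.List;

// "What": which data to extract; one implementation per data source.
interface ExInterface {
    List<String> extract();
}

// "How": the fixed workflow, which only depends on the interface.
class LoaderEngine {
    private final ExInterface ex;
    LoaderEngine(ExInterface ex) { this.ex = ex; }

    // The engine drives the extraction without knowing the concrete source.
    List<String> runExtraction() { return ex.extract(); }
}

// Swapping the data source only means providing another implementation.
class ExArtikel implements ExInterface {
    public List<String> extract() { return List.of("Artikel data"); }
}
```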
Since the implementation of Data Warehouse Loader has been given in detail in chapter 4, the overall decoupling architecture of Loader-engine and Loader-interface can now be seen more clearly in Figure 5.1. When Data Warehouse Loader is transferred from one application to another, the programs of Loader-engine and the interface definitions remain the same, as shown in the middle of Figure 5.1, which stands for a software part with high reusability. What needs to be modified are the classes that implement the corresponding interfaces, shown as the four surrounding groups of classes in Figure 5.1. The figure also indicates for which workflow stage of Data Warehouse Loader each surrounding group of classes works, namely extraction, transformation, differentiation, merging and retrieving. For example, the classes ExArtikel, ExMarkt and ExVerkauf implement the interface ExInterface and work for the extraction stage; the same holds for the other groups of classes. Furthermore, the details of Loader-engine in the middle are shown in Figure 5.2.
5.2. Applying concepts of software reuse
There are several reuse considerations inside Data Warehouse Loader, as follows:
5.2.1. Code reuse
Reusability can be achieved by writing code once and using it for another application, which may or may not require recompilation. There are different levels of code reuse: byte code reuse and source code reuse. Java is compiled to a machine-independent low-level code, which is then interpreted by the Java Virtual Machine. This gives Java code platform independence: the same byte code can be used on any machine, under any operating system. Therefore, Java realizes byte code reuse. A C program, on the other hand, can be reused in terms of source code: since C is widely used, its language standard and compilers are available for most systems.
5.2.2. Adaptability
Adaptability means that the system is easily adapted to a diversified environment. Reusability requires the software to be usable in a new context, which might involve new hardware, such as a different CPU, or new software, such as a different operating system or database system. Thanks to the Java Virtual Machine, Java provides platform independence in terms of both hardware and software; porting a Java program to another machine does not even require recompilation.
As far as the database access layer is concerned, JDBC and ODBC provide a reliable database connection.
JDBC (Java Database Connectivity) is an API (Application Programming Interface) that lets the user access virtually any tabular data source from the Java programming language. It provides cross-DBMS (Database Management System) connectivity to a wide range of SQL databases, and with the newer JDBC API it is also possible to access other tabular data sources, such as spreadsheets or flat files. The JDBC API makes it possible to take advantage of the Java platform’s “Write Once, Run Anywhere” capability for industrial-strength, cross-platform applications that require access to enterprise data. ODBC (Open Database Connectivity) is a widely accepted API for database access. It is based on the Call-Level Interface (CLI) specifications from X/Open and ISO/IEC for database APIs and uses Structured Query Language (SQL) as its database access language.
As far as access to the database is concerned, there are two levels of access: technical access and access with business logic. Technical access means technically creating a connection to the database with the help of the Java Virtual Machine, the JDBC API and the ODBC API, using SQL as the database access language. Database access with business logic, in contrast, means defining which data to extract, which data to retrieve and how the different data is imported. This part of the work is done by the database interface and the extraction interface of Data Warehouse Loader.
It is important to separate these two levels of database access functionality, since creating a database connection lies at a lower level than data extraction, data importing and data retrieval. It is a technical connection and has nothing to do with specific business logic.
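A minimal sketch of the technical access level with plain JDBC follows. The JDBC URL, credentials and table name are placeholders, not the ones used in this work; the business-logic level would decide what to read and how to map it:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class TechnicalAccess {
    // Builds the plain SQL for a full-table read; the table name is a placeholder.
    public static String selectAll(String table) {
        return "SELECT * FROM " + table;
    }

    // Purely technical access: open a connection, run a query, print column 1.
    // Which table to read and how to map its rows (the business logic) belongs
    // to the extraction and record interfaces, not here.
    public static void dump(String url, String user, String password) throws Exception {
        try (Connection con = DriverManager.getConnection(url, user, password);
             Statement st = con.createStatement();
             ResultSet rs = st.executeQuery(selectAll("ARTIKEL"))) {
            while (rs.next()) {
                System.out.println(rs.getString(1));
            }
        }
    }
}
```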
5.2.3. Modularity
Modularity is defined as the breaking down of a program into small, manageable units. In Data Warehouse Loader, the workflow of the whole extracting, transforming and loading process is broken down into several modules: extraction from the operational data source, data transformation, figuring out the differences, importing data into the data warehouse, merging data into the status file and retrieving data from the data warehouse. This brings several advantages:
- Easy testing: Since the whole workflow is broken down into separate modules, each module can be tested separately. This makes troubleshooting easier, since in case of error the fault can be attributed directly to a certain module instead of checking all modules. It also saves testing time when errors exist only in a certain module, because only that module needs to be tested again instead of running the whole process again.
- Easy use: Each module can be used separately, which means less context limitation and easier use.
- Module reuse: Since each module can be separated from the others, each of them can be used on its own in another context with similar functionality. For example, the extraction module can be used as a tool to read data from a certain database, and the transformation module as a tool to change data in a certain manner.
5.2.4. Interface
The interface concept is one of the distinctive features of Java that C does not have. It makes it easy for Data Warehouse Loader to adapt to different data sources, different data transformation schemes and different target data warehouses, as the requirements of the data warehouse application demand. When these conditions or requirements change, all that needs to be done is to provide a specific class that implements the corresponding interface. The modification therefore requires little effort, and high reusability of the software is achieved.
There are altogether four interfaces in Data Warehouse Loader. The extraction interface and the database interface are aimed at data access and provide both technical database access and database access with business logic. The transformation interface represents business logic in terms of a particular data transformation scheme in a certain application. Finally, the record interface is aimed at data representation and provides a way to map a record in a database table onto an object. The mapping introduces an object-oriented character into the otherwise non-object-oriented process of extracting, transforming and loading, so that the advantages of an object-oriented language can be exploited. In fact, since each record is represented by one object, it is much easier to process records, for example to change, search or insert them. It is also easier to work with a new database table, since all that has to be done is to create a new class that implements the record interface and represents the records of the new table. Please see Figure 5.3.
Figure 5.3: Interfaces of Data Warehouse Loader
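The claim that changing a requirement only means supplying another implementing class can be illustrated with the transformation interface. All names below are invented for illustration; the thesis’ actual transformation interface may differ in detail:

```java
// Invented names; the surrounding workflow stays unchanged whichever
// transformation scheme is plugged in.
interface TransformInterface {
    String transform(String record);
}

class UpperCaseScheme implements TransformInterface {
    public String transform(String record) { return record.toUpperCase(); }
}

class TrimScheme implements TransformInterface {
    public String transform(String record) { return record.trim(); }
}

public class TransformDemo {
    // The caller depends only on the interface, never on a concrete scheme.
    public static String run(TransformInterface scheme, String record) {
        return scheme.transform(record);
    }
}
```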
5.3. Reuse architecture analysis of Data Warehouse Loader
In a word, the architecture of Data Warehouse Loader favors software reuse, as can be seen in Figure 5.4. The whole software development context of Data Warehouse Loader can be divided into several layers, one above the other; the higher the layer, the more abstract it is.
As stated before, the Java Virtual Machine, the JDBC API and the ODBC API are utilized to provide adaptability, that is, to fit different hardware, different operating systems and different database systems. The hardware layer and the operating system layer remain untouched. With regard to the database, it can be said that the hardware layer deals with the database in terms of the physical disk, the operating system layer in terms of the file storage system, and the JDBC API in terms of database tables.
Figure 5.4: Reuse architecture analysis of Data Warehouse Loader
The record interface, the database interface and the extraction interface belong to the business objects layer, which uses objects to model the business. By relegating common characteristics and behaviors to the highest possible level and modeling a problem as a set of types, this technique gives rise to the high reusability of object-oriented software. The business objects layer lies one layer below the business functions layer, which represents specific business logic and defines the requirements for data processing; the transformation interface and every module of the whole workflow belong to this layer. Finally, the business process layer is the most general layer and is connected to the external data presentation.
The main motivation for implementing Data Warehouse Loader in Java is to make use of the interface concept, the platform independence and the object-oriented features of Java, which C does not offer. And of course, it is fashionable to use Java nowadays, since it is so widely accepted and used. However, the original sd&m Data Warehouse Loader, although written in C, achieves high reusability as well.
Firstly, since ANSI C is a widely and popularly used programming language, there is no problem migrating this software between different systems, such as Windows NT and Unix; in the Java Data Warehouse Loader of this work, the same migration is achieved through the platform independence of Java. Secondly, the original sd&m Data Warehouse Loader also meets the requirements of fitting different data sources, different data warehouses and different data transformation schemes. However, realizing this in C requires cumbersome mechanisms, such as look-up tables. The interface concept of Java, on the contrary, makes this functionality much easier to provide, since the reprogramming effort in case of changes is reduced. But Java also brings some drawbacks. Therefore, from the programming language’s point of view, both C and Java are reusable, but with different effort. The lesson learned is that the mere use of a certain programming language does not guarantee software reusability; the language must be accompanied by reuse technology, such as tools and methodologies.
Let us come back to the Java program for an overview of the packages in Data Warehouse Loader; please see Figure 5.5. There are three packages: myPackage.loader for the Loader-engine classes, myPackage.loaderInterface for the Loader-interface classes, and myPackage.basis for the common basic classes of this work.
Figure 5.5: Package overview of Data Warehouse Loader
6. Summary
6.1. Summary of this work
Since the advent of the computer industry, the cost of software development has been constantly increasing. The high cost of producing and maintaining software is a compelling reason to devise new approaches, for example software reuse. Software reuse is receiving much attention, since enough software has been written and is available to be reused at a fraction of the cost of developing new software from scratch.
The objectives of object-oriented software reuse are to produce software better, faster and more cheaply by reusing existing well-tested assets. These assets include domain architectures, requirement analysis documents, prototype models, designs, algorithms, coding components, test scenarios, standards and any other related documents. If software is reused, the following advantages result:
- Shorter software development time
- Software with higher quality and fewer errors
- Consequently, lower cost
In this work, the general concepts of software reuse were introduced. These concepts were then applied in the implementation of a data warehouse ETL system, called Data Warehouse Loader. This software is implemented in Java and fulfills most of the requirements of data warehouse ETL systems. Java was chosen as the programming language in order to be portable and platform independent. The Java interface concept and the building-block principle are used for the sake of flexibility in the data transformation scheme and adaptability to any data source and target data warehouse, so that the advantages of the object-oriented programming language are exploited. In case of a change of the data transformation scheme or a differently connected database system, the reprogramming effort is reduced, which means that high software reusability is achieved. However, the system still has some drawbacks.
Compared to the former sd&m Data Warehouse Loader written in C, this software achieves high reusability at the expense of lower run-time speed and larger main memory consumption. It would therefore be problematic when processing large volumes of data. One reason is that this Data Warehouse Loader is implemented in Java, and Java naturally requires more memory and is slower, because a Java program is first compiled to low-level byte code and then interpreted by the Java Virtual Machine on the particular machine. Another reason is that the workflow of this Data Warehouse Loader is based on an object-oriented approach. The most significant difference between an object-oriented and a non-object-oriented approach here is that each database record is organized as an object instead of a stream of characters. All operations therefore work on an object or a set of objects concatenated in a linked-list object, and operations with objects consume more main memory than operations with simple characters. Further work on this data warehouse ETL system would therefore be a modification of the data processing workflow in order to reduce the main memory consumption, especially when processing large amounts of data.
In a word, the concept of software reuse should be an integral principle of the software engineering process. Software reuse techniques should be applied to the whole software development process, so that high-quality software with fewer errors can be produced at lower cost and on shorter schedules.
6.2. Lessons learned
Generally, what lessons should be learned from the software reuse practice of the last few years?
- Reuse is not a silver bullet. It should be apparent that reuse is only part of the solution to the software crisis; it cannot solve all problems.
- Product-line architectures should be emphasized. Probably the most important lesson learned during the past few years is that product-line architecture is where the action should be. The architecture creates the foundation for systematic reuse to occur; without an architecture to serve as a decision framework, it becomes very difficult to figure out how to be reusable.
- While technical issues exist, software reuse remains a management problem. Most of the barriers facing software reuse adopters are cultural, managerial, psychological and political.
- For reuse concepts to work, new processes and paradigms must be introduced. Changes in the management infrastructure, such as organization, policies and processes, are needed to support the introduction of reuse. Structure and tools need to be provided to get products out the door quickly and expertly.
- CASE tools need to be reoriented to model the solution space in addition to the problem domain. Most CASE tools define what should be built without considering available reusable assets; as a result, reuse opportunities are not fully considered as systems are synthesized.
In a word, software must in the future be viewed by management more as an asset than as an expense. Viewing software as an asset would encourage management to capitalize its software research and productize its software development efforts.