Technical University Hamburg-Harburg
Technische Informatik (TI 3)
Prof. Dr. Siegfried M. Rump

Applying Concepts of Software Reuse to the Implementation of Data Warehouse ETL Systems

October 2001
Jiayang Zhou

Content

STATEMENT .......................................................... 5
ACKNOWLEDGEMENTS ................................................... 6
1. Introduction .................................................... 7
1.1. Description of the work ....................................... 7
1.2. Scenarios ..................................................... 8
1.3. The structure of this work .................................... 11
2. Fundamentals of software reuse .................................. 12
2.1. What is software reuse? ....................................... 12
2.2. Why is software reuse important? .............................. 12
2.3. Economics of software reuse ................................... 14
2.4. Where does software reuse pay off? ............................ 15
2.5. Upon what concept is software reuse based? .................... 15
2.6. Principles of object-oriented software reuse .................. 16
2.6.1. Information hiding .......................................... 17
2.6.2. Modularity .................................................. 17
2.6.3. Adaptability ................................................ 18
2.6.4. Modification ................................................ 18
2.7. State of the art .............................................. 18
3. Data Warehouse Loader: analysis example of software reuse ....... 21
3.1. Introduction of data warehouse ETL systems .................... 21
3.1.1. Definition of data warehouse ETL systems .................... 21
3.1.2. Requirements of data warehouse ETL systems .................. 22
3.1.3. Context requirement of data warehouse application ........... 23
3.1.4. Other usage of Data Warehouse Loader ........................ 24
3.2. Architecture of a data warehousing system ..................... 25
3.2.1. Operational data source ..................................... 27
3.2.2. Data-transfer ............................................... 28
3.2.3. Data warehouse .............................................. 28
3.2.3.1. Subject-orientation ....................................... 28
3.2.3.2. Integration ............................................... 30
3.2.3.3. Time variance ............................................. 31
3.2.3.4. Non-volatility ............................................ 31
3.2.3.5. Difference between data warehouse and operational systems . 32
3.2.4. Analysis .................................................... 33
3.2.5. Presentation ................................................ 37
3.2.6. Metadata .................................................... 38
3.2.7. Process management .......................................... 38
3.2.8. User administration ......................................... 38
3.3. The role of Data Warehouse Loader ............................. 39
3.3.1. Functionality of Data Warehouse Loader ...................... 39
3.3.2. Software reusability consideration of Data Warehouse Loader . 40
4. The Implementation of Data Warehouse Loader ..................... 42
4.1. The description of Data Warehouse Loader ...................... 42
4.2. The architecture of Data Warehouse Loader ..................... 45
4.3. Loader-engine: the operation mode of Data Warehouse Loader .... 46
4.3.1. Advantages of workflow architecture ......................... 47
4.3.2. Major tasks of the workflow of Data Warehouse Loader ........ 51
4.3.2.1. Extraction ................................................ 52
4.3.2.2. Transformation ............................................ 55
4.3.2.3. Figuring out differences .................................. 59
4.3.2.4. Importing ................................................. 60
4.3.2.5. Merging ................................................... 63
4.3.2.6. Retrieving ................................................ 64
4.4. Loader-interface: interface concept of Data Warehouse Loader .. 66
4.4.1. Extraction-interface ........................................ 68
4.4.2. Transformation-interface .................................... 68
4.4.3. Database-interface .......................................... 69
4.4.4. Record-interface ............................................ 70
4.5. Format of intermediate files .................................. 72
4.6. Sorting the linked list of record objects ..................... 73
4.7. Graphic user interface of Data Warehouse Loader ............... 75
5. Reuse Analysis of Data Warehouse Loader ......................... 77
5.1. Reuse development process of Data Warehouse Loader ............ 78
5.2. Applying concepts of software reuse ........................... 81
5.2.1. Code reuse .................................................. 81
5.2.2. Adaptability ................................................ 81
5.2.3. Modularity .................................................. 82
5.2.4. Interface ................................................... 83
5.3. Reuse architecture analysis of Data Warehouse Loader .......... 84
6. Summary ......................................................... 87
6.1. Summary of this work .......................................... 87
6.2. Lessons learned ............................................... 89
Appendix ........................................................... 90

STATEMENT

Hereby I state that the present work has been undertaken by myself, with only the help that is referred to within this thesis.

Jiayang Zhou
Hamburg, October 2001

ACKNOWLEDGEMENTS

I would like to express my thanks to Prof. Dr. Siegfried M. Rump, Mr. Stefan Krey, and Mr. Lutz Russek for their essential help and advice throughout the development of this work. Sometimes their help was not merely academic. Since this work was undertaken at sd&m AG (software design & management), I would also like to thank all my colleagues at sd&m AG, Hamburg. Their cooperation and help are appreciated as well.

INTRODUCTION

1. Introduction

1.1. Description of the work

Software reuse is the process of implementing or updating software systems using existing software assets.
Software assets can be defined as software components, objects, software requirement analyses, design models, domain architectures, database schemas, code, documentation, manuals, standards, test scenarios, and plans. Software reuse may occur within a software system, across similar systems, or in widely different systems. This process provides ways to reduce costs, shorten schedules, and produce quality products. The importance of software reuse lies in its benefits of providing quality and reliable software in a relatively short time. The computer industry has demonstrated that software reuse generates a significant return on investment by reducing cost, time, and effort while increasing the quality, productivity, and maintainability of software systems throughout the software life cycle. In a word, software reuse is advantageous because it:

- Increases productivity
- Enhances quality
- Saves cost
- Reduces software development schedules
- Reduces maintenance
- Enhances standardization
- Increases portability
- Contributes to the evolution of a common component warehouse
- Increases performance

Software reuse is now considered an integral principle of the software engineering process, and reusable software can be developed in a manner similar to computer hardware products. [1]

In this work, the fundamental concept of software reuse with an object-oriented approach is examined. It deals with object-oriented software reuse strategies, the reuse paradigm, and the reuse process. It is obvious that the mere use of a certain programming language does not guarantee software reusability. The language must be accompanied by reuse technology, such as tools and methodologies. Moreover, the general concept of software reuse is applied to the implementation of data warehouse ETL (Extraction, Transformation and Loading) systems. This data warehouse ETL system, called Data Warehouse Loader, is implemented in Java.
It is therefore an analysis example of software reusability with an object-oriented approach. The standard architecture of a data warehouse application and the role of Data Warehouse Loader within it are explained. The detailed implementation of Data Warehouse Loader is illustrated, including its overall architecture, workflow, interface concept, and so on. Finally, a reuse analysis is conducted to point out the highly reusable features of this software, that is, to illustrate the relationship between the general reuse concept and the actual implementation scheme of Data Warehouse Loader. In a word, Data Warehouse Loader is implemented in a manner where increasing software reusability is especially emphasized, and the overall architecture is designed in favor of applying the software reuse concept. On the other hand, this implementation of Data Warehouse Loader in Java has the drawback of some performance degradation.

1.2. Scenarios

There have been problems in software development since its inception. The cost of software development is constantly increasing. Many projects are challenged or not completed at all; a challenged project is one that is completed, but with cost overruns and schedule delays. The percentage of failed projects is greater than that of successfully completed ones. Please see Figure 1.1. The computer industry has tried to find ways to reduce the costs and shorten the schedules required for software development, while producing quality software with fewer errors. [1]

Software reuse is one of the solutions. Software reuse means that ideas and code are developed once, and then used to solve many software problems in order to enhance productivity, reliability, and quality.
Reuse applies not only to source-code fragments, but to all the intermediate products generated during software development, including documentation, system specifications, design architecture, and so on.

Figure 1.1: Success and failure percentages for software development projects

Reusability is a big issue these days. Pretested software should be used so that cost and time can be saved. The development of object-oriented software means modeling a problem as a set of types or classes from which the objects are created. This set is partitioned into a hierarchical categorization that emphasizes reuse by relegating common characteristics and behaviors to the highest possible level. Once this modeling has been done, coding (the translation of algorithms into a program) is easier, because it consists merely of creating the necessary objects from the defined classes and invoking the behavioral operations of the objects. Reusable software requires a planned, analyzed, and structured design that withstands thorough testing for functionality, reliability, and modularity. [1]

Here an object-oriented approach to software development is preferred because it leads to reusable classes. Objects are discrete software components that contain data and procedures together. Systems are partitioned based on objects, and data determine the structure of the software. On the contrary, data-oriented or event-oriented analysis and design treat operations and data as distinct and loosely coupled: operations determine the structure of the system, and data are of secondary importance. With such approaches, the cost of software development grows exponentially.

In 1998, sd&m AG (software design & management) completed a project, called STARTMDB (Management Database), to build a data warehouse application for START Holding in Frankfurt.
One part of this data warehouse application is Data Warehouse Loader, which extracts data from different operational data sources, transforms the data into the required format, and loads the data into the target data warehouse. This sd&m Data Warehouse Loader is implemented in C, which is suitable in this case, since the operation mode of Data Warehouse Loader is non-object-oriented. Because ANSI C was chosen as the programming language, it is possible to migrate between different system platforms, for example, from Windows NT to Unix. On account of the requirements of data warehouse applications, this sd&m Data Warehouse Loader was designed to be reusable from the beginning. That means it can be fitted to different data transformation schemes, different operational data sources, and different target data warehouses. However, it is difficult to realize this reusability with C. The code is hard to read, and it needs to be recompiled whenever a new database is inserted or a new data transformation scheme is required. Therefore, the task of this work is to implement Data Warehouse Loader in Java. From the programming language point of view, Java has several advantages, as follows:

- Java is object-oriented from the ground up, which means it was explicitly designed from the start to be object-oriented, whereas C is not an object-oriented language.

- Java has a facility called Interface, whose name indicates its primary use: specifying a set of methods that represent a particular class interface, which can be implemented individually in a number of different classes. All of these classes then share this common interface, and the methods in it can be called polymorphically. C does not have the interface concept.

- Java is compiled to a machine-independent low-level code called byte code. This byte code is then interpreted by the Java Virtual Machine running on the particular machine. This gives Java code platform independence, which means that the same byte code can be run on any of a huge variety of machines with different operating systems. Porting a Java program to another machine does not even require recompilation. The cost is a slowdown of run-time speed by up to a factor of 5.

- The Java Virtual Machine can carry out a number of checks to ensure that a program is running properly, for example, array bounds, memory access, viruses in byte code, and so on. Accordingly, Java programs are more robust and secure compared with C.

- When a C program requests memory to use as workspace, it must keep track of it and return it to the operating system when it ceases to use it. This requires extra programming and extra care. In Java, this task of garbage collection is carried out automatically: an object that is no longer used is automatically destroyed and its memory is released.

Additionally, Java is now used worldwide. The management trend of most firms is to have Java programs in their organization. That is also true for the management database system. Therefore, implementing Data Warehouse Loader in Java will make this software more acceptable to the market.

1.3. The structure of this work

Chapter 1 gives the description of the work and the wider scenario, which includes the current situation of software development, the importance of software reuse, and the initiation of this project.

Chapter 2 introduces the fundamental concepts of software reuse and some state-of-the-art software reuse technology, which is important for a better understanding of software architecture design in favor of reusability. Concerning software reuse, it discusses its definition, importance, economics, basis, and so on.

Chapter 3 presents the introduction of Data Warehouse Loader, which is the analysis example of software reuse in this work. First, an introduction to ETL systems, here called Data Warehouse Loader, is given.
With the explanation of the standard architecture of a data warehouse application, where Data Warehouse Loader resides, the role of Data Warehouse Loader is introduced, namely its functionality and its reuse considerations.

Chapter 4 covers the detailed implementation of Data Warehouse Loader. It illustrates the practical part of this work, including the overall architecture, the workflow, and the interface concept of Data Warehouse Loader. This chapter explains the relationships between the classes, how each stage of the Data Warehouse Loader workflow works, the format of the intermediate files, the sorting inside each linked list of record objects, and so on.

Chapter 5 is the link between the theoretical concept of software reuse and the practical implementation of Data Warehouse Loader. In this chapter, the reuse analysis shows how the abstract concept is applied.

Chapter 6 offers the conclusion of the whole project. Some lessons learned during the general software reuse process and some drawbacks of this work are also shown.

The Appendix contains the references of this work.

FUNDAMENTALS OF SOFTWARE REUSE

2. Fundamentals of software reuse

2.1. What is software reuse?

Software reuse is defined as the process of implementing or updating software systems using existing software assets. Software reuse can occur within a system, across similar systems, or in widely different systems. The term "asset" was selected to express that software can have lasting value. Reusable software assets include more than just code. Requirements, designs, models, algorithms, tests, documents, and many other products of the software process can be reused. [2]

Software reuse is a concept for acquiring high-leverage software, which has the potential to be reused across applications. However, as in many cases, taking a simple idea and making it happen in reality often is not as easy as it sounds. Details have to be worked out before the concept can be made to work in practice.

2.2. Why is software reuse important?

Systematic software reuse revolves around the planned development and exploitation of reusable software assets within or across applications and product lines. Its primary goal is to save money and/or time. It succeeds when the amount of resources required to deliver an acceptable product is reduced. It tries to take advantage of software that exists or can be purchased off the shelf, and it motivates addressing the many management, technical, and people issues that inhibit reuse. When getting down to basics, software reuse is motivated by the desire to get the job done cheaply and quickly.

At this point, the question might arise why software reuse is especially important. Are there many firms doing it? Do most developers build their software to be reused? Has the underlying technology needed for software reuse been around for years? Are guidelines for this systematic reuse practice available? Are there examples that illustrate successful reuse stories? Unfortunately, the answers to these questions have been NO until recently. Most practitioners have not figured out how to do it in a repeatable and systematic manner. The reason is that the technology needed simply was not available until recently. The arrival of object-oriented approaches and languages, domain engineering methods, integrated software development environments, and new process paradigms makes broad-spectrum software reuse possible. Advances in software architecture provide us with the foundation for software reuse, while a consensus on related standards provides us with the building codes.

Figure 2.1: Reuse maturity distributions

For the most part, software reuse tends to be done ad hoc in most firms. As illustrated in Figure 2.1, most of the firms whose software reuse processes have been evaluated using a reuse maturity model [21] are not using the state of the art.
Reuse processes are not well defined, and practices are not institutionalized in the majority of the firms. This analysis assumes that the processes which organizations use to manage product lines, architectures, and software reuse should be part of their business practice framework. Reuse considerations need to be incorporated into each of the five levels of process maturity identified by the model: Level 1 (ad hoc), Level 2 (repeatable), Level 3 (defined), Level 4 (managed), and Level 5 (optimizing). Please see Table 2.1. [2]

Table 2.1: Process Maturity Models

Level 1 (Ad hoc reuse): Reuse occurs ad hoc; reuse is neither repeatable nor managed.
Level 2 (Project-wide reuse): Reuse is a product of a project, not a process; reuse is repeatable on a project-by-project basis.
Level 3 (Organization-wide reuse): Reuse assets are a product of the process; reuse is part of the way the organization does business.
Level 4 (Product-line reuse): Reusable assets are a product of the process; reuse is viewed as a business in itself.
Level 5 (Broad-spectrum reuse): Reuse is an integral part of the corporate culture; processes are optimized with reuse in mind.

2.3. Economics of software reuse

With the recent push to downsize or outsource, software costs have to be cut down. The majority of improvement strategies being pursued today either reduce the inputs needed to finish the job (such as people, time, equipment, etc.) or increase the outputs generated per unit of input. This dual nature of software productivity can be represented notionally using the following equation [2]:

Productivity = Outputs / Inputs used to generate the results

When focusing on the equation's input side, software engineers can be equipped with more workstations, CASE tools, mature processes, and the like. Using this approach, more output can be obtained from the people through an automation strategy. Just the reverse happens when the output side of the equation is focused on.
Instead of concentrating on improving staff efficiency, reusing existing assets is emphasized to get more output per unit of input. In either case, the strategies employed tend to be complementary. For example, increased automation can lead to increased reuse.

2.4. Where does software reuse pay off?

Industry has realized a significant payoff by instituting systematic software reuse practices. For example, Wayne Lim of Hewlett-Packard reported the following benefits attributable to their software reuse initiative in IEEE Software magazine [7]:

- Quality: the defect density for reused code was one quarter of that for new code.
- Productivity: systems developed with reuse yielded a 57 percent increase in productivity compared with those constructed using only new code.
- Time to market: when development efforts were compared, those exploiting reuse took 42 percent less time to bring the product to market.

2.5. Upon what concept is software reuse based?

It is necessary to cover some fundamental concepts upon which reuse is based. For reuse to occur in practice, reusable software assets must be acquired that are reused by other applications. This sentence is instructive. Let us examine the concepts that surround the three forms of the term reuse in the sentence: reuse, reusable, and reused. [2]

Reuse implies a known process for all those activities related to finding, retrieving, and using software assets of known quality within a certain application. When talking about reuse, the following three types of processes are typically involved:

- Application engineering: the processes or practices which firms use to guide the disciplined development, test, and life cycle support of their application software.

- Domain engineering: the reuse-based processes or practices which firms use to define the scope, specify the structure, and build reusable assets for a class of systems or applications.
These activities are typically conducted to figure out what to build to be reusable.

- Asset management: the processes or practices which firms use to manage their assets and make them readily available in quality form. These are the processes software engineers use to search libraries to find the reusable assets of interest. The quality of the assets is maintained, along with the integrity of their configuration, using some online mechanism that is part of the software engineering environment.

Please see Figure 2.2, which illustrates how these processes are related. The software development approach depicted in this figure is called the dual-life-cycle model, because domain and application engineering activities are conducted in parallel. As shown, domain engineering uses the architecture that it develops to identify the reusable software assets, which application engineering develops and uses. Asset management links these activities and makes these assets available. [2]

Figure 2.2: Dual Life Cycle (Source: Reifer, "Software Reuse: Making It Work For You", 12/1991)

The term reusable refers to the product and its basic features. The reusable asset must have high reuse potential and be packaged using reuse techniques. If the asset is hard to understand or adapt, it will be abandoned. Product-line software architectures are now in favor because they let users identify and acquire the 20 percent of the assets responsible for 80 percent of the reuse across families of like systems. Ease of reuse should be a design consideration for each of the assets that is part of the product line.

The term reused has a value-added connotation. An asset built to be reusable does not take on value until it is reused with some advantage by someone else on another application. Typically, incentives must be provided to make this happen.

2.6. Principles of object-oriented software reuse

The objectives of object-oriented software reuse are to produce reusable assets for independent operating systems and plug-and-play applications. That means to identify a common architecture, establish a repository, and integrate reuse into the software development process. Some principles of object-oriented software reuse are as follows:

2.6.1. Information hiding

Information hiding is the protection of implementation details within object-oriented software, that is, the deliberate hiding of information from those who might misuse it. This differentiates the "what" from the "how". The "what" information should be available to everyone; it includes specifications and interface information. The "how" information should be available only to a limited group; it includes implementation details, such as data structures. Information hiding supports and enforces abstraction by the suppression of details. It increases quality and supports reusability, portability, and maintainability. It prevents confusion for the user, promotes correct data input, and enhances reliability. Information hiding also enhances localization and usually includes cohesive data; hence, good modularity is achieved and the goal of modifiability is more easily approached.

2.6.2. Modularity

Modularity is defined as the breaking down of a program into small, manageable units. Modularizing object-oriented system software breaks the solution space into smaller units. The modules are grouped around a data type and objects of that type. Only subprograms which contain operations for objects of a certain type are grouped together. For example, an array type may be in a package along with subprograms for calculating the average of the array elements.
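The array example just mentioned can be sketched in Java; the class and member names below are invented for illustration and do not appear elsewhere in this work. The array itself is the hidden "how", while the public method is the visible "what":

```java
// Illustrative module: an array type grouped with the subprograms that
// operate on it. The internal array is an implementation detail and is
// therefore hidden; clients see only the public interface.
public class NumberSeries {
    private final double[] elements;   // hidden "how": the data structure

    public NumberSeries(double[] elements) {
        // A defensive copy keeps the data localized in this module.
        this.elements = elements.clone();
    }

    // Visible "what": the average of the array elements.
    public double average() {
        double sum = 0.0;
        for (double e : elements) {
            sum += e;
        }
        return sum / elements.length;
    }
}
```

Because clients depend only on average(), the internal representation could later be changed, for example to a linked list, without affecting them. This is exactly the modifiability that information hiding and good modularity aim for.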
In well-modularized system software, the top modules generally represent the "what" of the process, while lower-level modules constitute the "how" of the process. This implies that the lower a module is in the module group, the more implementation details it contains. In other words, upper-level modules are the most abstract modules in the group, while lower-level modules are the most detailed.

Good modularity also implies loose coupling between modules. Coupling is a measure of the dependence between modules. Global data shared by modules increases this inter-modular dependence. Passing only the required data via parameters, or localizing data within a module, decreases coupling. Loose coupling guarantees confirmability (independent module testing) and enhances the principle of modularity. Loose coupling also implies that changes in one module will not affect the others. Thus, loose coupling also brings the goal of modifiability closer.

In addition to loose coupling, another factor required for good modularity is data localization. Localization consists of placing highly related, cohesive data only in modules that operate on this data. Only necessary data are passed from module to module, and only through parameters. Only very highly related or cohesive data are localized in a module. In a word, the important factors for good modularity are data localization, loose coupling, no data passing except via parameters, and information hiding.

2.6.3. Adaptability

Adaptability in a system means that the system is easily adapted to a diversified environment. Object-oriented software can be easily adapted to new requirements because of its high level of abstraction: it models a problem as a set of types or classes from which the objects are created. In particular, Java is compiled to a machine-independent low-level code called byte code. This byte code is then interpreted by the Java Virtual Machine, which runs on the particular machine.
This gives Java code platform independence, which means that the same byte code can run on a huge variety of machines with different operating systems. Porting a Java program to another machine does not even require recompilation.

2.6.4. Modification

Modification allows changes to be made to a system without altering its original structure. Requirements change during the life cycle of a system, and new versions of the system are created to reflect new and modified requirements. These changes must be cost effective. Modifiability is achieved in a system by the design of small, meaningful modules, the use of localized data in these modules, and very sparing use of global data or numeric literal values. The object-oriented concept incorporates a facility that gathers the data values and range constraints for a given type of data in one place: the type declaration. A change made in a type declaration is all that is necessary to modify this data throughout the whole software.

2.7. State of the art

A great deal of research is underway under the banner of software product lines, domain-specific software architectures, domain engineering and software reuse. The reason for the interest in software reuse is primarily economic. Industry is looking for ways to increase the speed with which it brings products to market and to provide certain features and functions quickly in order to maintain its competitive advantage. As a complementary strategy to productivity improvement, software reuse is viewed as a reasonable way to accomplish these goals with a minimum of disruption. [2] Throughout the world, significant progress has been made in software reuse technology, and many firms are pursuing software reuse. For example, AT&T has instituted a major reuse program to leverage its investments in telecommunications software assets within and across domain-specific product-line architectures.
They have developed a set of best practices that many of their business units have adopted. [8] Hewlett-Packard (HP) has focused on developing and deploying software reuse concepts in its product divisions via a corporate program. [9] It uses a software bus and a glue language to interface reusable software building blocks in order to build application systems quickly. Recent studies reflect today’s situation: [2] Many efforts are underway to build a technical infrastructure for reuse. With the introduction of object-oriented methods and languages, the technology exists to package software for reuse. International programs, such as the European REBOOT effort, are starting to realize benefits as they transfer technology to industrial firms. [10] And industry consortia, such as the Software Productivity Consortium (SPC) [11], have active reuse programs that address architecture, product lines, and domain engineering methods in addition to other reuse issues. The hot research topics in software reuse are product-line management, software architectures, and domain engineering methods. Most current research populates architectures designed for families of systems with like characteristics (product lines). Many efforts are developing methods, notations, and languages used to model and analyze domain experience in order to develop a responsive architecture. Object-oriented techniques are increasingly viewed as reuse enablers: object-oriented methods, languages, and tools help package software for reuse. Thanks to their class abstractions, object-oriented methods (such as Booch [12], Coad/Yourdon [13], FODA [14], etc.) are becoming more widely used. Frameworks and class libraries containing both fine-grain (data structures such as queues, etc.) and coarse-grain (communications handlers, subsystems, etc.) components are marketed to support developers.
Several prototype software reuse libraries that serve as models of the future are operational today. Several industrial-strength software reuse libraries that provide their users with the ability to search, browse, and retrieve assets of interest are in operational use. Assets are becoming increasingly available to populate these libraries (class libraries, etc.). Standards for interoperating these libraries across the Internet are being devised by groups such as the Reuse Library Interoperability Group. [15] Users need to agree on the architecture so that they know how to use a library to get to the parts that matter; without such a framework or architecture, few will use the library in a cost-effective manner. Reuse tools are being put into software engineering environments. Several efforts are integrating software reuse tools and library capabilities into next-generation software engineering environments, using Java-based servers as mechanisms for tool connectivity. [19] The philosophy being pursued is to make software reuse a natural part of the way firms do business. Tools are the natural means to implement this philosophy, because they automate tedious manual processes and often act as technology transfer agents. A lot of attention is being given to generative tools [20] for reuse, because they can generate the needed assets directly from an architecture specification.

3. Data Warehouse Loader: analysis example of software reuse

3.1. Introduction of data warehouse ETL systems

3.1.1. Definition of data warehouse ETL systems

ETL stands for extraction (E), transformation (T) and loading (L). Data warehouse ETL systems are software that loads data from different operational data-storing systems into a data warehouse.
The task of ETL systems is first to extract data from a data source, then to transform the data into the required format according to a specified transformation scheme, and finally to import the data into the target data warehouse. In other words, the main operations of ETL systems are extraction, transformation and loading. In this work, an ETL system named Data Warehouse Loader is implemented in Java. Data Warehouse Loader is not a complete application product that can be used directly. It is rather a key software component that can be adapted to many applications through supplementary programming and configuration.

3.1.2. Requirements of data warehouse ETL systems

ETL systems developed for data warehouse applications must in particular meet the following requirements:

- Adaptability to any data source
- Adaptability to any target data warehouse
- High operation speed
- Ability to deal with large amounts of data
- Flexibility for data transformation
- Portability

These requirements are briefly explained below. The adaptability to any data source is an important property for the input of ETL systems: data must be extractable from different data-storing systems, as stated in the requirements of the data warehouse application. For relational databases, different types of database systems are often involved. Therefore, ETL systems should not be fixed to one concrete type of source database to which they connect later; it is recommended that ETL systems be prepared for further application conditions in the future. The adaptability to any target data warehouse does not result from the viewpoint of software reusability alone. It is also possible for a data warehouse application to replace the originally used warehouse database for reasons of efficiency, when that database can no longer meet the increasing workload.
High operation speed is critical when ETL systems have to process large amounts of data within a certain time interval with limited resources. It is often the case that not just the daily difference of the data is selectively extracted from the data source but, for example, the whole set of data, from which the differences are subsequently determined. Therefore, ETL systems must be able to deal with large volumes of data and require high operation speed. Flexibility of data transformation must be provided by ETL systems. Between the operational data source and the target data warehouse there is always a stage of data transformation, which is in most cases especially complicated. It should not only transform one source record into exactly one target record (1:1 transformation), but also transform any large group of source records into another large group of target records (m:n transformation). Portability is of course important for the requirements of software reusability. This aspect is interesting for a single project as well. For example, a data warehouse project may begin with an NT server for cost reasons and then have to migrate to a more powerful UNIX system because of the increasing system load. Moreover, this tactic also corresponds to a particular design approach in which the total system is built from smaller subsystems. In this work, Data Warehouse Loader fulfills most of the requirements of ETL systems listed above. Java is chosen as the programming language in order to be portable and platform-independent. The Java interface concept and the building-block principle are used to achieve flexibility of data transformation and adaptability to any data source and target data warehouse. However, it still has some drawbacks. On the one hand, it consumes quite a lot of main memory, especially when processing large amounts of data. One reason is that Data Warehouse Loader is implemented in Java, and Java inherently requires more memory and is slower.
Another reason is that the workflow of Data Warehouse Loader is based on an object-oriented approach. The most significant characteristic of this approach is that each database record is represented as an object instead of a stream of characters. Therefore, all operations concern an object or a set of objects, which are concatenated in a linked-list object. Operations on objects consume more main memory than operations on simple characters. For example, this leads to an internal sorting of records, which means the whole set of objects needs to be read into main memory during processing. Therefore, Data Warehouse Loader is not well suited to working with huge volumes of data when adequate hardware is lacking. On the other hand, Data Warehouse Loader is called a key software component because it cannot work alone and must be matched and complemented with other components when used in certain applications.

3.1.3. Context requirement of data warehouse application

In the last few years, a large market for data warehouse applications has developed. The whole package of products for a data warehouse application contains a database to store the data, front ends for data analysis and presentation, a system to transform data, a system to import data into the data warehouse, and so on. In this work, the framework of Data Warehouse Loader does not cover all aspects of a complete package of products for a data warehouse application. As defined above, Data Warehouse Loader only fulfills the functionality of extracting data from the source system, transforming the data and loading it into the target system. However, since Data Warehouse Loader cannot work alone and has to cooperate with other components of the data warehouse application, some aspects of the whole processing sequence have been considered as well, in order to guarantee safe operation and to develop new standard software.
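The building-block principle described above can be sketched in Java as follows. The interface and class names here are illustrative assumptions, not the actual API of Data Warehouse Loader: each ETL stage is a replaceable component behind a Java interface, so sources, transformations and targets can be exchanged independently.

```java
import java.util.List;

// Sketch of the building-block principle (names are illustrative, not the
// thesis's actual API): each ETL stage is a component behind an interface.
interface Extractor   { List<String[]> extract(); }                      // E: read source records
interface Transformer { List<String[]> transform(List<String[]> rows); } // T: reshape records (1:1 or m:n)
interface Loader      { void load(List<String[]> rows); }                // L: write to the warehouse

class EtlJob {
    private final Extractor extractor;
    private final Transformer transformer;
    private final Loader loader;

    EtlJob(Extractor e, Transformer t, Loader l) {
        this.extractor = e; this.transformer = t; this.loader = l;
    }

    // The workflow itself is fixed; adaptability comes from plugging in
    // different component implementations at configuration time.
    void run() {
        loader.load(transformer.transform(extractor.extract()));
    }
}
```

A concrete job would then be assembled by configuration, e.g. `new EtlJob(new JdbcExtractor(...), new MappingTransformer(...), new WarehouseLoader(...)).run()`, where each (hypothetical) class implements the corresponding interface for one data source, transformation scheme or target warehouse.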
As far as the evaluation of suitability is concerned, the following questions should be considered:

- How complex is the data transformation? Does it require a simple 1:1 transformation or a more complicated m:n transformation?
- How big is the volume of data to be processed?
- Are the intermediate results of concatenated transformation steps stored in the database, which has a big influence on the speed and performance of the whole process?
- Is it possible to process all the data within the required time interval?
- What is the source data store? Can all well-known and anticipated systems be added?
- What is the target data warehouse?
- How extensive is the metadata model?
- Is the code generated or interpreted? Is a supplementary compiler needed?
- Is it possible to modify the source code?

3.1.4. Other usage of Data Warehouse Loader

Although Data Warehouse Loader was developed for one data warehouse application, it can also be employed in other applications in which data is transferred from a system A to a system B. Of course, certain modifications are necessary. For example, Data Warehouse Loader can be used as a bridge. During a step-by-step migration from an old system to a new system, a bridge between the two always needs to be built. In principle, Data Warehouse Loader provides exactly this function: replacing “data source” with “old system” and “target data warehouse” with “new system”, Data Warehouse Loader works as a transformation bridge.

3.2. Architecture of a data warehousing system

In this work, the research on applying the concepts of software reuse is conducted on the example of implementing Data Warehouse Loader in Java. Since Data Warehouse Loader is tightly associated with a data warehousing system, it is necessary to say a few words about the general architecture of data warehousing systems. What is a data warehouse?
A simple answer could be that it manages data situated after and outside the operational systems. From the conceptual origin of a single database serving all purposes, the notion of a data management architecture has evolved in which data is divided into a data warehouse and an operational database. The separation of operational and analytical data storage results from the different data storage requirements for analysis purposes in the enterprise. The evolution is a response to many technological, economic and organizational factors: the difference in the users of the two environments, the difference in the technology supporting the two environments, the difference in the amount of data found in the two environments, the difference between the business usage of the two environments, and so forth. [3]

                        Operational systems    Analysis systems
  operation             updating               analyzing, read-only access
  query                 fixed, simple          user-defined, complicated
  data view             record-oriented        multi-dimensional
  data per transaction  few                    many
  structure of data     detailed data          aggregated data
  time reference        current                historic and current

Table 3.1: The difference between operational systems and analytical systems

It is not practical to keep data in the operational systems indefinitely. The fundamental requirements of the operational and analysis systems are different: the operational systems need performance, whereas the analysis systems need flexibility and broad scope. It has rarely been acceptable to degrade the performance of the operational systems in order to provide a business analysis interface. The differences between operational systems and analysis systems are shown in Table 3.1. [3] The first objective of a data warehousing system is to serve the decision support system (DSS) and executive information system (EIS) community, which makes high-level and long-term managerial decisions. DSS tend to focus more on detail and are targeted toward lower- to mid-level managers.
EIS have generally provided a higher level of consolidation and a multidimensional view of the data, as high-level executives need the ability to slice and dice the same data more than to drill down to review the data details. The following are some characteristics associated with DSS and EIS: [4]

- These systems present data in descriptive standard business terms, rather than in cryptic computer field names. Data names and data structures in these systems are designed for use by non-technical users.
- The data is generally preprocessed with the application of standard business rules, such as how to allocate revenue to products, business units, and markets.
- Consolidated views of the data, such as products, customers and markets, are available. Although these systems can drill down to the detail data, they are rarely able to access all the detail data at the same time.

In short, DSS and EIS in the enterprise provide information about customers, markets, turnover and so on. The operational data-storing system, by contrast, which processes daily work data, is organized according to business procedures and focused on current data. Therefore, it is not suitable for analytical management information. Many factors have influenced the quick evolution of the data warehousing discipline. The most significant factor has been the enormous progress in hardware and software technologies. Sharply decreasing prices and the increasing power of computer hardware, coupled with ease of software use, have made it possible to quickly analyze hundreds of gigabytes of information and business knowledge. Another influence on the evolution has been the fundamental changes in business organization and structure. Firstly, the emergence of the global economy has profoundly changed the information demands made by corporations. Phenomena such as “business process reengineering” force businesses to reevaluate their practices.
Secondly, flexible business software suites adapted to the particular business have become a popular way to move to a sophisticated multi-tier architecture. Lastly, information technology is now nearly universally accepted as a key strategic business asset, and management is more information conscious. [5]

Figure 3.1: Architecture of data warehousing systems

Data warehousing systems provide a new information infrastructure alongside the existing systems. They are not only a conceptual but also a technical approach in the DSS field. Data warehousing systems provide the analytical tools for DSS or EIS, but their design is not only derived from the specific requirements of analysts or executives; it also aligns with the overall business structure. Figure 3.1 shows the standard architecture of data warehousing systems, in which Data Warehouse Loader resides. Each component of this architecture is explained in the following.

3.2.1. Operational data source

Inside the operational data source there are many elementary sources, for example files, database tables, external databases, management information systems and so on. The common property of these elementary sources is that they are record-oriented, although they might be implemented in different or mixed ways.

3.2.2. Data-transfer

Data-transfer is a bridge between the operational data source and the data warehouse, with the following functionality:

- Extraction of data from the data source at a fixed point in time: The time interval of the extraction operation depends on the source systems and the technical requirements of the analysis. One major problem in extraction is determining the difference between the previous extraction result and the current one. A complete subtraction is therefore executed in order to obtain the difference, namely the new records, changed records and deleted records. This procedure is possible only if the time template of the data source systems is known.
- Consolidation of data: For example, data inconsistencies are corrected.
- Transformation of data to meet the needs of the data warehouse: The result of the extraction cannot be loaded directly into the data warehouse, because neither its content nor its structure is suitable for the data warehouse architecture.

3.2.3. Data warehouse

The primary concept of the data warehouse is that the data stored for business analysis can be accessed most effectively by separating it from the data in the operational systems. A data warehouse is a structured, extensible environment designed for the analysis of non-volatile data. This data is logically and physically transformed from multiple source applications to align with the business structure, is updated and maintained for a long time period, and is expressed in simple business terms and summarized for quick analysis. In a word, the definition of a data warehouse from W. H. Inmon is: A data warehouse is a subject-oriented, integrated, time-variant and nonvolatile collection of data in support of management’s decision making process. [3] The features of a data warehouse can be explored through its definition, namely subject-orientation, integration, time variance and non-volatility, and through its difference from operational data-storing systems, as explained below.

3.2.3.1. Subject-orientation

A data warehouse is organized around the major subjects of the enterprise, i.e. the high-level entities of the enterprise, which makes the data warehouse design data-driven, whereas an operational data-storing system is organized around processes and functions. The major subject areas affect the key structure of the data and the organization of the non-key data in the data warehouse, while the functional boundaries are the major criteria for an operational data-storing system. See Figure 3.2.
Figure 3.2: Data warehouse entities align with the business structure

A data warehouse logical model aligns with the business structure rather than with the data model of any particular application. The entities defined and maintained in the data warehouse parallel the actual business entities, such as customers, products, orders, and distributors. Different parts of an organization may have a narrow view of a business entity. In the example of a customer, a loan service group in a bank may only know about a customer in the context of one or more outstanding loans. Another group in the same bank may know about the same customer in the context of a deposit account. The data warehouse view of the customer, however, transcends the view from any particular part of the business. The data warehouse would most likely build the attributes of a business entity by collecting data from multiple source applications. For example, considering the demographic data of a bank customer, the retail operational system may provide the social security number, address, and phone number, while a mortgage system may provide employment, income, and net worth information. The data warehouse does not include data that will not be used by DSS, while operational data-storing systems contain detailed data that may or may not have any relevance for the DSS analysts. It is essential to understand the implications of not being able to maintain the state information of the operational system when the data is moved into the data warehouse. Many of the attributes of entities in the operational system are very dynamic and constantly modified; these dynamic attributes are not carried over to the data warehouse. To understand the loss of operational state information, consider the example of an order fulfillment system that tracks the inventory to fill orders. An order may go through many different statuses before it is fulfilled and reaches the “closed” status.
Other order statuses may indicate that the order is ready to be filled, is being filled, is back-ordered, is ready to be shipped, etc. These statuses capture all the business processes that have been applied to the order entity. It is nearly impossible to carry all of these attributes forward to the data warehousing system, which is most likely to keep just one final snapshot of the order. As far as the relationship of data is concerned, operational system data relates to the immediate needs and concerns of the business, whereas data warehouse data spans a spectrum of time and business rules, typically relating two or more tables.

3.2.3.2. Integration

Data warehousing systems are most successful when data is combined from multiple operational systems. When data is brought together, it is important that this integration is done in a place independent of the source. The data warehouse can effectively and incrementally combine data from multiple sources such as sales, marketing, and production. The primary reason for combining data from multiple source applications is the ability to cross-reference data. Nearly all data in a typical data warehouse is built around the time dimension. Time is the primary filtering criterion for most of the activities against the data warehouse: an analyst may generate queries for a given week, month or year, and year-on-year reviews are another popular kind of query. The time dimension therefore serves as a fundamental cross-referencing attribute. The ability to establish correlations between the activities of different organizational groups within a company is often cited as the most advanced feature of the data warehouse. With integration, data warehouse data takes on a very corporate flavor, which also shows in its naming conventions, measurement of variables, encoding structures and so forth.

3.2.3.3. Time variance
All data in the data warehouse is accurate as of some moment in time, while the data of an operational data-storing system is accurate as of the moment of access. The time variance of the data warehouse shows up in several ways. Firstly, it represents data over a long time horizon, from five to ten years, whereas the horizon of an operational system is much shorter, namely from sixty to ninety days. Secondly, every key structure in the data warehouse contains, explicitly or implicitly, an element of time. Thirdly, data warehouse data cannot be updated once it is correctly recorded. Data from most operational systems is archived after the data becomes inactive. For example, a bank account may become inactive after it has been closed. The reason for archiving the inactive data is performance: large amounts of inactive data mixed with operational live data can significantly degrade the performance of a transaction that processes only the active data. Since the data warehouse is designed to archive the operational data, the data here is kept for a very long period. The cost of maintaining the data once it is loaded into the data warehouse is minimal; most of the significant costs are incurred in data transfer and data scrubbing. Storing data for more than five years is very common for data warehousing systems. There are industry examples in which the enterprise expands the time horizon of the stored data once the wealth of business knowledge in the data warehouse is discovered.

3.2.3.4. Non-volatility

Updates, such as inserts, deletes and changes, are regularly applied to the operational data-storing system on a record-by-record basis. The basic manipulation of the data warehouse, however, is of only two kinds: the initial loading of data and the access of data. This means that after the data is in the data warehouse, no modifications are made to this information. In an operational system, the data entities go through many attribute changes.
For example, an order may go through many statuses before it is completed, or a product moving through the assembly line has many processes applied to it. Generally speaking, the data from an operational system is triggered to go to the data warehouse only when most of the activity on the business entity data has been completed. This may mean the completion of an order or the final assembly of a product. Once completed, the order is unlikely to go back to back-order status, and the product is unlikely to go back to the first assembly station. Another important example is constantly changing data, which is transferred into the data warehouse one snapshot at a time. Business logic determines how often a snapshot must be taken to be adequate for the analysis. Such snapshot data is naturally non-volatile. [3]

3.2.3.5. Difference between data warehouse and operational systems

The structure of the data warehouse is different from that of an operational data-storing system; see Figure 3.3. The components of the data warehouse are current detail data, older detail data, lightly summarized data, highly summarized data and metadata, which are introduced in the following. Summary data, in contrast, is never found in the operational environment.

Figure 3.3: Structure of data inside the data warehouse

Current detail data reflects the most recent happenings, which are always of great interest. It is voluminous because it is stored at the lowest level of granularity, and it is always stored on disk storage, which is fast to access but expensive and complex to manage. Older detail data is stored on some form of mass storage and is infrequently accessed. It is stored at a level of detail consistent with the current detail data. Lightly summarized data is distilled from the low level of detail found at the current detailed level. Building lightly summarized data is a concern of the data warehouse architecture.
Highly summarized data is compact and easily accessible. It is possible to store highly summarized data outside of the data warehouse. Metadata plays a special and important role in the data warehouse. It is a directory that helps the DSS analysts locate the contents of the data warehouse, a guide to the mapping of data between the operational environment and the data warehouse, and a guide to the summarization between the current detail data and the summarized data. Data warehouse tables also differ from those of the operational environment. On the one hand, there are historic tables, which record previous states. The difference between the current state and the previous one is regularly detected and imported, and every record is accompanied by a validity period. The growth of the historic tables depends on the modification frequency of the source data. On the other hand, there are accumulating tables, which accumulate statistical data in a regular sequence. Each record contains a reference time period, and only the difference is imported. [4]

3.2.4. Analysis

Online Analytical Processing (OLAP):
- Multidimensional conceptual view
- Transparent underlying technology
- Access possibility for different source data
- No efficiency decrease with more data, more users or more view dimensions
- Client-server architecture with integration of different client systems
- The same range of functions for each dimension
- Efficient administration of sparsely used matrices
- Multi-user capability
- Unlimited, automatic cross-dimensional operations
- Simple and intuitive data navigation
- Flexible report formats
- Unlimited dimension and aggregation levels

Figure 3.4: Online Analytical Processing formulations

The task of the analysis stage is to prepare the content of the data warehouse according to the user’s question formulation.
This stage provides a multidimensional view of a specially extracted part of the database, for example the turnover of a certain customer, of a certain product or of a certain period of time. Online Analytical Processing (OLAP), first formulated by E. F. Codd, plays a central role in this stage; see Figure 3.4. From the technical point of view, there are mainly two types of OLAP systems: [3]

1. Multi-dimensional OLAP (MOLAP)

The construction of the multi-dimensional data view is based on a so-called multi-dimensional database. This newer database technology is optimized for OLAP requirements. In contrast to the relational approach, data in a multi-dimensional database is stored not in tables but in multi-dimensional arrays, also called “data cubes”. This data organization suits an access model with which large data volumes can be accessed in a short time.

2. Relational OLAP (ROLAP)

Relational OLAP does not organize the data for analysis in a physically multi-dimensional way. The currently observed data is extracted directly from the relational data warehouse, and merely a “visual data cube” is built in intermediate storage software. This intermediate storage software can run on any user workstation or on a separate server. This technology requires a special physical data model, which must be flexible and suitable for manipulating large data volumes in order to achieve acceptable response times. Therefore, a brief introduction to the dimensional data modeling techniques of data warehouses is given here. One technique that is gaining popularity in data warehousing is dimensional data modeling. For some data analysis situations, it can meet the requirements for organizing warehouse data. This technique enables the data warehouse to facilitate users’ ability to ask the right questions and get answers to them. [6] In data warehousing systems, queries tend to use more tables, return larger result sets and run longer.
Since these queries are typically unplanned and may not be reused, there is little or no opportunity for comprehensive optimization. To answer a certain question, more than one table is involved; however, performance suffers when many large tables make up the query and no indexes are defined to optimize the query's access paths.

The "cube" metaphor provides a new approach to visualizing how data is organized. A cube gives the impression of multiple dimensions; cubes can have two, three, four, or more dimensions. Users can "slice and dice" the dimensions, choosing which dimension to use from query to query. Please see Figure 3.5. Each dimension represents an identifying attribute. The cube shape indicates the following: [6] Several dimensions can be used simultaneously to categorize facts. The more dimensions are used, the greater the level of detail retrieved. Dimensions can also be used to constrain the data returned by a query, by restricting the returned rows to match a specific value or range of values of the constraining dimension.

For example, the data warehouse user may want to "drill down" and see the total sales for each region by sales quarter and product line. The aggregated fact associated with each combination of the Region, Time and Product Line dimensions is shown in Figure 3.6. Each additional dimension increases the level of detail. This "flattened" view of the dimensional data unfolds the cube by each dimension, and the facts are aggregated at the intersection of the chosen dimensions.

Figure 3.5: New approach to visualizing how data is organized in a data warehouse
Figure 3.6: The flattened view of the dimensional data cube

Dimensional data modeling adds a set of new schemas to the logical modeling toolkit. The first is called a star schema, named for the star-like arrangement of the entities.
A star schema uses many of the same components as any entity-relationship diagram, for example entities, attributes, relationship connections, cardinality, optionality, primary keys and so on. A star schema works properly when a consistent set of facts can be grouped together in a common structure, called the fact table, and descriptive attributes about the facts can be grouped into one or more common structures, called dimension tables. [6]

The center of a star schema is the fact table. It is the focus of dimensional queries and is where the real data, named facts, is stored. Facts are numerical attributes, such as counts and amounts, which can be summed, averaged and aggregated by a variety of statistical operations; calculating maximal and minimal values is also common. Fact attributes contain measurable numeric values about the subject matter. Dimensional attributes provide descriptive information about each row in the fact table. These attributes provide the links between the fact table and the associated dimension tables, and the dimension tables are used to guide the selection of rows from the fact table.

Figure 3.7: The star schema

Please see the example in Figure 3.7, where Grocery transaction is the fact table, while Time, Customer, Store and Product are the dimension tables. The Time dimension is a critical component of the data warehouse data model: DSS environments allow analyzing data and how it changes over time. The Store dimension allows categorizing transactions by store, including the location of the store and its relation to other geographically distributed stores. The Product dimension allows analyzing purchasing patterns by product. The Customer dimension allows analyzing purchases by customer, such as purchasing frequency or purchasing location. Fact tables typically contain large numbers of rows, while dimension tables tend to contain far fewer.
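The drill-down described above, aggregating a fact at the intersection of chosen dimensions, can be sketched in a few lines of Java. This is a minimal in-memory illustration, not the Loader's implementation; all names and the sample data are invented:

```java
import java.util.*;

// Hypothetical sketch: aggregating facts at the intersection of chosen
// dimensions, as a star-schema drill-down would. All names are invented.
public class SliceAndDice {

    // One fact row: dimension attributes plus a numeric measure (sales).
    record Fact(String region, String quarter, String productLine, double sales) {}

    // "Drill down": sum the sales fact for every combination of the chosen
    // dimension attributes (here Region and Quarter).
    static Map<String, Double> totalByRegionAndQuarter(List<Fact> facts) {
        Map<String, Double> totals = new TreeMap<>();
        for (Fact f : facts) {
            String key = f.region() + "/" + f.quarter();
            totals.merge(key, f.sales(), Double::sum);
        }
        return totals;
    }

    public static void main(String[] args) {
        List<Fact> facts = List.of(
            new Fact("North", "Q1", "Food", 100.0),
            new Fact("North", "Q1", "Drinks", 50.0),
            new Fact("South", "Q1", "Food", 70.0));
        // North/Q1 aggregates two rows; each key is one cell of the
        // flattened view of the cube.
        System.out.println(totalByRegionAndQuarter(facts));
    }
}
```

Adding a further dimension to the key would refine the result, exactly as adding a dimension to the flattened view increases the level of detail.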
The key advantage of this approach is that table join performance improves when one large table is joined with a few small tables. Often the dimension tables are small enough to be fully cached in memory.

There are some significant differences between a star schema and a fully normalized relational design: [6] The star schema makes heavy use of denormalization to optimize operation speed, at a potential cost in storage space, while the normalized relational design minimizes data duplication and reduces the work the system must perform when data changes. The star schema restricts numerical measures to the fact tables, while the normalized relational design can store transactional and reference data in any table.

Dimensional data modeling is used to manage data at the lowest level of detail available, such as individual transactions. Aggregates are statistical summaries of these details; one or more levels of aggregates of the same facts can be defined. Aggregates give the perception of improved query performance, since query results are returned much faster, and maintaining aggregate results can bring significant performance improvements in analysis activities. However, there are also some challenges when using dimensional data modeling techniques, such as denormalization, dimension table data volumes, managing aggregates, data sharing and so on.

In this work, the example implementation of the Data Warehouse Loader makes use of these dimensional data modeling techniques. Warehousing data is organized in terms of fact tables and dimension tables. A detailed description will be given in a later chapter.

3.2.5. Presentation

The task of the presentation stage is to visualize the data from the analysis system on the client. There are many possibilities for illustrating data in this stage, for example tables, various diagrams or maps for geographical information.
One important element of presentation is the control element for navigation in data warehousing systems. This navigation should enable the user to find the required information as easily as possible. Many presentation systems also provide access via a Web browser. In this case, the available company intranet can be used as an alternative to special and expensive analysis front-ends.

3.2.6. Metadata

Every sub-system in a data warehousing system, from the data-transfer stage to the presentation stage, needs information about the structure and the relationships of the processed data. For highly condensed information, for example the turnover of 2000, the user needs to know its detailed derivation, from a particular item of elemental data in the operational data storing system to its presentation in a graph. The user must be able to display this derivation in order to interpret the illustrated information if necessary. In addition to this rather static information about data structure and relationships, there is also information about the operation of the data warehousing system, for example the time point or time interval of the extraction operation, the number of successfully imported records, the history of operations, and so on.

3.2.7. Process management

Normally the user needs to schedule the workflow of the Data Warehouse Loader; for example, at the end of every business day, the difference in the operational database is determined and integrated into the data warehouse. In this case, many single processing stages are concatenated in order to transform, consolidate and condense the data and finally store it in the data warehouse or a multidimensional database. The task of process management is therefore to coordinate the workflow as the user requires, which involves the data warehouse architecture and the overlapping processing components.

3.2.8.
User administration

The administration of access rights to analysis data is an important aspect of a data warehousing system. Different information belongs to different security levels and should not be accessible to every user. For example, a regional business manager should only be able to access the customer information of his own region. For this reason, the access rights to data warehouse data should be clearly defined. In particular, a multidimensional database enables the assignment of access rights down to a particular cell of the data cube. Another consideration of user administration is charging for the utilization of data warehouse data in a profit-center structure. As the operator of the data warehouse, the data administration department may charge users for the use of data warehouse data, based on the time and the manner of utilization.

3.3. The role of the Data Warehouse Loader

3.3.1. Functionality of the Data Warehouse Loader

In this work, the Data Warehouse Loader fulfills the data-transfer task of the data warehousing system described above. It works as a bridge between the operational data storing system and the data warehouse. Please see Figure 3.8. The process of developing data warehousing systems has highlighted the need to effectively and efficiently manage the extraction, cleaning, transformation and migration of data from source systems. Efficiency is necessary whenever the data is of great value, which is usually the case. Effectiveness is necessary because the long-term investment of resources in these activities can be high.

Figure 3.8: The role of the Data Warehouse Loader

The Data Warehouse Loader extracts data from any record-oriented operational data source and then transforms the data in an arbitrary way in order to meet the structural requirements of the target data warehouse.
Afterwards, the difference between the current transformation result and the previous one is determined and finally imported into the data warehouse. In order to determine the difference between transformation results, the current transformation result is compared with the previous state of the target data warehouse. However, reading the data warehouse every time would be time-consuming. Instead, a so-called status file realizes this function; this status file is always kept consistent with the newly updated state of the data warehouse. Determining the difference thus becomes a comparison between the current transformation result and this status file, which is much faster. Therefore, one additional function of the Data Warehouse Loader is to maintain the status file: after importing the differences into the data warehouse, the differences also need to be merged into the status file. Lastly, in case of system damage, it might be necessary to fetch data directly from the data warehouse in order to reconstruct the status file. So another additional function of the Data Warehouse Loader is to retrieve data directly from the data warehouse.

3.3.2. Software reuse considerations for the Data Warehouse Loader

The crucial design requirement of the Data Warehouse Loader is to be suitable for various data sources, various target data warehouses and various data transformation schemes. On the one hand, there are numerous database vendors on the market, providing different types of databases; any of them might be the data source or the target data warehouse of the Data Warehouse Loader. On the other hand, the Data Warehouse Loader will be used by different enterprises, each of which has its own business logic for data analysis purposes. For example, a supermarket will be interested in the turnover of a certain product during the last few months, while a marketing manager will be interested in the average income level of customers in a certain region.
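The status-file comparison just described can be sketched as follows. This is a simplified model, assuming records can be keyed and compared as strings; the real Loader works on record objects and intermediate files:

```java
import java.util.*;

// Hypothetical sketch of the difference step: the current transformation
// result is compared against the status file (both modeled here as maps
// from a record key to the record's content) instead of re-reading the
// target data warehouse.
public class StatusDiff {

    // Returns the records that are new or changed with respect to the
    // status file; only these need to be imported into the warehouse.
    static Map<String, String> diff(Map<String, String> current,
                                    Map<String, String> statusFile) {
        Map<String, String> changes = new TreeMap<>();
        for (Map.Entry<String, String> e : current.entrySet()) {
            if (!e.getValue().equals(statusFile.get(e.getKey()))) {
                changes.put(e.getKey(), e.getValue());
            }
        }
        return changes;
    }

    // After importing, the same changes are merged into the status file so
    // that it stays consistent with the warehouse.
    static void merge(Map<String, String> statusFile,
                      Map<String, String> changes) {
        statusFile.putAll(changes);
    }
}
```

Comparing against the in-memory (or on-disk) status map is what makes the difference step cheap compared with querying the warehouse itself.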
This indicates that the transformation scheme between the operational data storing system and the data warehouse differs completely from case to case. Apart from the data source, the target data warehouse and the transformation scheme, the basic workflow of the Data Warehouse Loader is the same: extraction from the data source, data transformation, determining the difference, importing into the data warehouse, merging into the status file and retrieving from the data warehouse. Therefore, this basic workflow can be reused from one enterprise to another, or from one application to another, provided only the functions related to the particular data source, target data warehouse and transformation scheme are modified. Consequently, how this part of the functionality is modified is directly associated with the reusability of the software: modification with minimum reprogramming effort represents high reusability, and vice versa. It is therefore extremely important for the Data Warehouse Loader architecture to decouple the constant workflow from the varying environmental factors. Please see Figure 3.9.

Figure 3.9: Goal of the Data Warehouse Loader architecture

That is also the reason why software reusability considerations are extremely important for the implementation of the Data Warehouse Loader in a data warehouse application, and why it is chosen as the investigation example of this software reusability research. The general concepts of software reuse and the different reuse technologies are applied to the development process of the Data Warehouse Loader. In particular, one goal is to achieve a well-designed interface, which significantly reduces the reprogramming effort in case of changes and thereby improves software reusability. For this purpose, the interface concept of the object-oriented language Java can be exploited.
In the implementation of the Data Warehouse Loader there are various interfaces, namely an interface for data representation, an interface for data access and an interface for business logic, which will be explained in detail in a later chapter.

4. The Implementation of the Data Warehouse Loader

4.1. The description of the Data Warehouse Loader

In this work, in order to make the Data Warehouse Loader executable, the data warehousing system of a supermarket chain is chosen as the implementation example. Both the data source and the target data warehouse are Microsoft Access databases. Here is a brief description of this implementation example.

The source database follows a normalized relational design and consists of several tables, namely Artikel, Artikeltyp, Positionen, Kauf and Markt. It is an operational system which keeps the records of each transaction, for example: which Artikel is sold each day, how many of this Artikel are sold, where those Artikel are sold, what the price of this Artikel is, what the type of this Artikel is, and so on. Please see the fields defined in each table and the relationships between the tables in Figure 4.1.

Figure 4.1: Table relationships in the source database

The Data Warehouse Loader reads records from this source database, which means it extracts data from the source system. Then it performs the data transformation in order to make the data fit the architecture of the data warehouse for analysis purposes. The data in the data warehouse is organized in a multidimensional cube, and each dimension represents an identifying attribute. A user can drill down to see the detail data via the flattened view of the dimensional data. For the purpose of easily obtaining an overall view of total sales, some facts are aggregated at the intersection of the chosen dimensions. Therefore, a star schema is used in the target data warehouse, as introduced in the former chapter.
The reason for using dimensional data modeling and a star schema in the target data warehouse is to exploit the advantage of dimension tables: dimension tables are small enough to be fully cached in main memory, which can compensate for the large main memory consumption of a Java program. Please see Figure 4.2. Here FactUmsatz is the fact table and DimMarkt, DimZeit and DimArtikel are the dimension tables, which can also be recognized by their naming prefix. The FactUmsatz table provides fact attributes about Markt, Artikel, Zeit and Umsatz; those fact attributes contain measurable, numeric values about the subject matter. The DimMarkt, DimZeit and DimArtikel tables provide dimensional attributes about Markt, Zeit and Artikel; those dimensional attributes provide descriptive information about each row in the fact table as well as the relationship links between the fact table and the associated dimension tables.

Figure 4.2: Star schema in the target data warehouse

Let us have a look at the data transformation scheme in detail. The records in the DimZeit table are relatively constant, giving each month a unique ID. A record in the DimArtikel table gets Id and Name from the Artikel table of the source database and gets Warengruppe from the Artikeltyp table of the source database. A record in the DimMarkt table gets Id and Name from the Markt table of the source database. The FactUmsatz table is then the aggregation of all the information from the Artikel, Artikeltyp, Positionen, Kauf and Markt tables of the source database. Umsatz is calculated by multiplying Preis from the Artikel table by Menge from the Positionen table. Zeit is obtained from Uhrzeit and Datum in the Kauf table. Markt and Artikel are extracted from the Markt table and the Artikel table respectively.
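The two computed attributes of this transformation scheme can be sketched directly. This is a minimal illustration of the rules stated above; the month-ID encoding is an assumption, since the thesis only states that DimZeit gives each month a unique ID:

```java
// Hypothetical sketch of the FactUmsatz transformation rules described
// above: Umsatz = Preis * Menge, and a Zeit ID derived from the purchase
// date. Method names and the ID encoding are invented for illustration.
public class FactTransform {

    // Umsatz is calculated by multiplying Preis (from the Artikel table)
    // by Menge (from the Positionen table).
    static double umsatz(double preis, int menge) {
        return preis * menge;
    }

    // DimZeit gives each month a unique ID; here we simply encode year
    // and month of the Datum field from the Kauf table.
    static int zeitId(int jahr, int monat) {
        return jahr * 100 + monat;   // e.g. 200110 for October 2001
    }

    public static void main(String[] args) {
        System.out.println(umsatz(2.50, 4));   // 10.0
        System.out.println(zeitId(2001, 10));  // 200110
    }
}
```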
Therefore, the FactUmsatz table enables the user to clearly determine, for example, how high the turnover of a certain kind of item in a certain market during a certain period of time is, when those items are sold, in which market a certain kind of item is sold, and so on. The intersection of these dimensions determines which rows are aggregated to answer the query.

4.2. The architecture of the Data Warehouse Loader

As stated before, one significant feature of a typical data warehouse loading environment in practice is that there are different types of data sources, different types of target data warehouses and different data transformation schemes. That means the user might use different database systems or file systems as data source or target, and the transformation rules from the data source to the data warehouse always depend on the particular, individual business logic. Therefore, a proper architecture for the Data Warehouse Loader should take this factor into account and be able to separate the common elements of the software from the data-specific elements. The common elements are oriented to the basic operation flow, namely extraction from the data source, data transformation, determining the difference, importing into the data warehouse, merging into the status file and retrieving from the data warehouse. The data-specific elements, however, are oriented to the particular data, source database, transformation rules and target data warehouse. When a new source database is connected or a new transformation rule is required, the data-specific part is reprogrammed while the common part remains the same. In short, the data-specific part has nothing to do with the basic loading process. Accordingly, decoupling the common part and the data-specific part is the goal of the architecture design of the Data Warehouse Loader.
This decoupling is achieved by implementing the Loader-engine and the Loader-interface respectively. The Loader-engine comprises the common elements for the basic operation flow; it knows "how to do", for example how to extract, how to transform, how to import and so on. The Loader-interface comprises the data-specific elements for a particular application environment; it knows "what to do", for example what the record structure or data structure is, what the type of the data source is, what the transformation scheme is, what the type of the target data warehouse is and so on. Please see Figure 4.3.

Figure 4.3: Architecture of the Data Warehouse Loader: decoupling of Loader-engine and Loader-interface

4.3. Loader-engine: the operation mode of the Data Warehouse Loader

Complementing the data-oriented structure of the Loader-interface, the Loader-engine is a workflow-oriented structure in the Data Warehouse Loader. The task of the Data Warehouse Loader is to extract data from the source system at a regular time interval, transform the data in an arbitrary way, and load it into the data warehouse in a predefined format. Consequently, the basic workflow of the Data Warehouse Loader is divided into the following sub-functions, each of which is one processing stage and is connected to the others via so-called intermediate files.
Extraction: extract raw data from the various source systems.
Transformation: transform the data to fit the structure of the data warehouse.
Determining the difference: determine the difference between the current transformation result and the previous one.
Importing: import the difference into the data warehouse.
Merging: merge the difference into the status file of the data warehouse.
Retrieving: retrieve data from the data warehouse in case of a system crash.

The basic workflow of the Data Warehouse Loader is illustrated in Figure 4.4, which also shows the position of the Loader-interface and the internal data flow, represented by the intermediate files.

Figure 4.4: Workflow of the Data Warehouse Loader: multistage filter architecture

4.3.1. Advantages of the workflow architecture

The workflow architecture of the Data Warehouse Loader is a filter architecture based on multiple stages, with text files and record object files as intermediate files. The whole complex functionality of the Data Warehouse Loader is split into manageable sub-functions with well-defined results. This workflow architecture has the following advantages in practice:

1. This workflow architecture makes maintenance easy. Every program, i.e. each single stage of the operation flow, can be separated from the others in order to be extended or tested. Since each single component is a complete unit, it is easily controlled from a software engineering point of view.

2. Each single stage can be inserted and utilized in other programs or other systems without significant reprogramming effort. This is not limited to the context of data warehouse applications, but also applies to other applications with similar functionality. For example, the single extraction program can be used wherever data must be read out of a certain database, and the single importing program can be used wherever data must be loaded into a certain database.
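The filter architecture behind these stages can be sketched as a chain of stages over intermediate results. This is a deliberately simplified, hypothetical model: the real Loader passes record object files between separate programs, while here each stage is just a function over a list of records:

```java
import java.util.*;
import java.util.function.UnaryOperator;

// Simplified sketch of the multistage filter architecture: each stage
// consumes the intermediate result of the previous one, so stages can be
// tested, reordered or omitted independently. All names are invented.
public class Pipeline {

    // Runs the stages in sequence; in the real Loader each intermediate
    // result would additionally be written to an intermediate file.
    static List<String> run(List<String> input,
                            List<UnaryOperator<List<String>>> stages) {
        List<String> data = input;
        for (UnaryOperator<List<String>> stage : stages) {
            data = stage.apply(data);
        }
        return data;
    }

    public static void main(String[] args) {
        // Two toy stages standing in for LdTransform and LdDiff.
        UnaryOperator<List<String>> transform =
            rs -> rs.stream().map(String::toUpperCase).toList();
        UnaryOperator<List<String>> dropEmpty =
            rs -> rs.stream().filter(r -> !r.isEmpty()).toList();
        System.out.println(run(List.of("a", "", "b"),
                               List.of(transform, dropEmpty)));
    }
}
```

Because each stage only depends on the shape of the intermediate result, removing a stage (as with "LdDiff" later in this chapter) is a matter of leaving it out of the chain.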
This is exactly the modularity approach of the concept of software reuse.

3. With this workflow architecture, the whole complex functionality of the Data Warehouse Loader becomes manageable and controllable. Since each sub-function is executed sequentially, the intermediate result files are linked together step by step and each of them is stored independently. Therefore, in case of an unpredictable system crash, the damage is reduced to a minimum, because all the intermediate files up to the crashing stage are retained and only the work starting from the crashing stage needs to be done again.

4. Because of this workflow architecture, the whole operation flow of the Data Warehouse Loader can be temporally decoupled. For example, in case the data warehouse is not yet ready for use, operations such as "LdExtract" from the data source and "LdTransform" of the data can take place first and their intermediate results are saved; "LdImport" into the data warehouse can then be made up later.

5. The intermediate files of the single stages are compatible with each other. Therefore, the operation sequence can be recomposed according to new requirements. This recomposability opens up further possibilities for reuse, which is again the modularity approach of software reuse.

A few more words about the recomposition of the operation sequence of the Data Warehouse Loader: in a certain application, the processed data may be highly dynamic, which means the data changes substantially between accesses. In that case, determining the difference from the previous result does not make sense any more, since the data has totally changed. The user can then simply remove the "LdDiff" stage and connect the transformation stage and the importing stage directly together. Please see Figure 4.5.
The output of the transformation stage then directly becomes the input of the importing stage, which represents a new configuration of the Data Warehouse Loader workflow. Obviously, this is a good example of the reusability of this software: it is able to adapt to new application requirements very easily. This adaptability does not require great reprogramming effort, because apart from removing the "LdDiff" stage, the other operation stages simply remain the same.

Figure 4.5: A new configuration of the Data Warehouse Loader

Each operation stage of the Data Warehouse Loader produces two kinds of files as results. Therefore, there are two kinds of intermediate files: one is called the text file and the other is called the record object file. On the one hand, the text file is for the user, to read and check the results of each intermediate workflow stage. In the text file, the result of each stage is printed out as character streams. All text files have a uniform structure, which consists of an INFO-header with general information (source, reference data, creation time, target etc.), a FORMAT-header with a description of the data structure of the records from the database table, and finally a DATA-body. On the other hand, the record object file is the real carrier that forwards information from the previous stage to the next one. In the record object file, the result of each stage is stored as a linked list, which is a set of record objects. In other words, each stage of the operation flow always reads the record object file as input, and the individual processing is done on the record objects. Both the text file and the record object file are necessary; they serve different purposes. The text file makes the results of each intermediate stage visible and enables the user to decouple and recompose the operation flow of the Data Warehouse Loader.
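The uniform text-file layout just described might be rendered as in the following sketch. The section markers, key names and separators are assumptions; the thesis only specifies the three parts INFO-header, FORMAT-header and DATA-body:

```java
// Hypothetical sketch of the uniform text-file structure: an INFO-header
// with general information, a FORMAT-header describing the record
// structure, and a DATA-body. All concrete keys and separators are
// invented for illustration.
public class TextFileLayout {

    static String render(String source, String target, String created,
                         String[] fieldFormats, String[] dataLines) {
        StringBuilder sb = new StringBuilder();
        sb.append("[INFO]\n");                       // general information
        sb.append("source=").append(source).append('\n');
        sb.append("target=").append(target).append('\n');
        sb.append("created=").append(created).append('\n');
        sb.append("[FORMAT]\n");                     // record structure
        for (String f : fieldFormats) sb.append(f).append('\n');
        sb.append("[DATA]\n");                       // the records themselves
        for (String d : dataLines) sb.append(d).append('\n');
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.print(render("Artikel", "warehouse.mdb", "2001-10-01",
            new String[]{"Id:int", "Name:string"},
            new String[]{"1;Milch", "2;Brot"}));
    }
}
```

A layout like this keeps every intermediate result human-readable, which is exactly what allows the user to inspect each stage and recompose the workflow.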
Without the text file, the results could only be checked in the target data warehouse after the final loading process, which is not practical at all. It is better to check the result of each stage in order to make sure that the current stage has been executed successfully. Since the result is available after each single stage, each component of the workflow can even be used separately, which means either temporally separated at different working times or spatially separated in different applications. This kind of decoupling technique contributes to the reusability of the Data Warehouse Loader. Another purpose of the text file is to make debugging easier: by examining the results of each single stage, the location of an error can be deduced more easily.

The record object file makes it easier to pass information between different stages. The crucial term here is mapping relational data onto Java objects. Since Java is object-oriented, in many cases it is not only necessary to deal with individual data items, but also to work with objects that represent individual database records. The way information is handled at the object level is usually different from the way data is stored in a relational database. In the world of objects, the underlying principle is to make the objects exhibit the same characteristics (information and behavior) as their real-world counterparts; in other words, objects function at the level of the conceptual model. Relational databases, on the other hand, work at the data model level, storing information in normalized form, where conceptual objects can be decomposed into a number of tables. In some cases, there is a straightforward relationship between the columns (fields) of a table and the member variables of an object.

Figure 4.6: Mapping between the Artikel object and the Resultset from the database
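Such a mapping might look like the following sketch. The Artikel object and the JDBC Resultset are taken from Figure 4.6; the concrete column names and types are assumptions, since they are not spelled out in the text:

```java
import java.sql.ResultSet;
import java.sql.SQLException;

// Hypothetical sketch of mapping a relational record onto a Java object:
// each column of the Artikel table corresponds to one member variable.
// The column names and types are assumptions for illustration.
public class Artikel {
    final int id;
    final String name;
    final double preis;

    Artikel(int id, String name, double preis) {
        this.id = id;
        this.name = name;
        this.preis = preis;
    }

    // Builds one Artikel object from the current row of a JDBC ResultSet,
    // as the application does with the Resultset shown in Figure 4.6.
    static Artikel fromResultSet(ResultSet rs) throws SQLException {
        return new Artikel(rs.getInt("Id"),
                           rs.getString("Name"),
                           rs.getDouble("Preis"));
    }
}
```

Once the record is an object, every further stage manipulates attributes instead of parsing character positions.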
The mapping task then simply consists of matching the data types of the database table with those of Java. Please see Figure 4.6, which shows that the application constructs the Artikel object from the data of the Resultset, which is returned by JDBC from the database. This mapping is a significant object-oriented feature of the otherwise workflow-oriented Data Warehouse Loader. Each record of a database table is represented by an object of a record class, whose data members correspond to the fields of the database record. Therefore, all processing of database records is done via record objects, which means the manipulation of records corresponds to the manipulation of the attributes of record objects. Each record object also has a data member representing all the header information of its corresponding text file. Since it is very easy to access the attributes of an object in Java, this object-oriented structure results in easy manipulation of each record from the database table, which makes the Java source code much more readable than a C program without object-oriented features. If, on the contrary, each database record were represented by a stream of characters, as in the text file, the record could only be accessed as a line of characters by reading the text file line by line. The position of each character would have to be calculated in order to fetch exactly the required information and discard the rest, which consumes a lot of programming effort.

4.3.2. Major tasks of the workflow of the Data Warehouse Loader

As stated before, the basic workflow of the Data Warehouse Loader realizes the functionality of a data warehouse ETL system and is divided into several sub-functions in this implementation. Each of these sub-functions fulfills a certain task, as explained below one by one.

4.3.2.1.
Extraction

Figure 4.7: Program of LdExtract

The task of the extraction program, LdExtract, is to extract data from the source system in order to make the data available for further manipulation in a required form. The extraction program has to access the data source directly and intensively, which implies a heavy system workload. Please see Figure 4.7. The data source is delivered to the extraction program as a parameter. Afterwards, LdExtract loads the corresponding extraction classes and record classes. With the help of the predefined functions of the extraction classes and record classes, a connection to the corresponding data source is created, its records are read out, and finally the connection is closed. In this process, the Loader-engine program, LdExtract, only knows the basic operation sequence for accessing a data source: creating the connection, reading, and closing the connection. It does not know whether the data source is a database or a text file, or which data should be extracted. Those questions are answered by the extraction classes, which implement the extraction-interface and define what the data source type is, which data source connection to establish, which data to extract and so on. The task of the record classes, which implement the record-interface, is to determine the data structure of the database records and the header information of the text file. Accordingly, each table obtains its individual treatment in the extraction process. Therefore, LdExtract belongs to the common part of the Data Warehouse Loader, while the extraction classes and record classes belong to the data-specific part. Please see Figure 4.8 for the analysis of the extraction-interface in the extraction program.

Figure 4.8: Analysis of the extraction-interface in the extraction program

For example, the extraction-interface declares the following methods: createConnection(); read(); closeConnection(); setExBasisFile(BasisFile aBasisFile); exSaveBasisFile().
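In Java, the declared methods correspond to an interface like the following sketch. The method names are taken from the text; BasisFile is reduced to a stand-in class and the implementation bodies are placeholders, since the real ones are not shown here:

```java
// A sketch of the extraction-interface as declared above; BasisFile is a
// stand-in class, since its real definition is not shown in this chapter.
class BasisFile {
    String info;
    BasisFile(String info) { this.info = info; }
}

interface ExInterface {
    void createConnection();
    void read();
    void closeConnection();
    void setExBasisFile(BasisFile aBasisFile);
    void exSaveBasisFile();
}

// A minimal extraction class implementing the interface; the real
// ExArtikel would open a connection to the Artikel table and read it.
class ExArtikelSketch implements ExInterface {
    BasisFile basisFile;
    boolean connected;

    public void createConnection() { connected = true; }
    public void read()             { /* read records from the source */ }
    public void closeConnection()  { connected = false; }
    public void setExBasisFile(BasisFile aBasisFile) { basisFile = aBasisFile; }
    public void exSaveBasisFile()  { /* write the text-file headers */ }
}
```

LdExtract only ever sees the ExInterface type, so any class implementing these five methods can be plugged in.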
Each extraction class, namely ExArtikel, ExMarkt and ExVerkauf, has its own implementation of those methods, which means that different database tables can have completely different ways of establishing the connection to the data source, reading the data and closing the connection. When the class LdExtract calls these methods at run time, which implementation of these methods is executed is determined by a command line argument, which is passed into LdExtract via the parameter myExtractor. This dynamic class loading makes use of a Java mechanism called polymorphism. The word polymorphism means the ability to assume several different forms or shapes. In programming terms it means the ability of a single variable of a given type to reference objects of different types, and to automatically call the method that is specific to the type of object the variable currently references. This enables a single method call to behave differently depending on the type of the object to which the call applies. When a method is called through a variable of a base type, polymorphism results in the method being selected based on the type of the object stored, not the type of the variable. Here the base type is ExInterface, and the specific object types are ExArtikel, ExMarkt and ExVerkauf. Because a variable of a base type can store a reference to an object of any derived type, the kind of object stored is not known until the program executes. Thus the choice of which method implementation to execute is made dynamically while the program is running; it cannot be determined when the program is compiled. The methods createConnection ( ), read ( ) and closeConnection ( ), called through the variable of type ExInterface in the earlier illustration, may do different things depending on what kind of object the variable references. This polymorphism mechanism of Java introduces a new level of capability into programs using objects.
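The mechanism just described can be sketched in a few lines. The interface and class names below follow the text (with a Demo suffix), but the method bodies and the way the class name is supplied are simplified assumptions; in the real program the name arrives as a command line argument.

```java
// Sketch of the polymorphism described above: the engine holds only a
// variable of the interface type and picks the concrete extraction class
// at run time from a name. Method bodies are illustrative only.
interface ExInterfaceDemo {
    String createConnection();
    String read();
    String closeConnection();
}

class ExArtikelDemo implements ExInterfaceDemo {
    public String createConnection() { return "JDBC connection to ARTIKEL table"; }
    public String read()             { return "one ARTIKEL record"; }
    public String closeConnection()  { return "ARTIKEL connection closed"; }
}

class ExMarktDemo implements ExInterfaceDemo {
    public String createConnection() { return "text-file reader for MARKT"; }
    public String read()             { return "one MARKT line"; }
    public String closeConnection()  { return "MARKT file closed"; }
}

public class LdExtractDemo {
    // dynamic loading: the engine never names the concrete class in its source
    static ExInterfaceDemo load(String className) {
        try {
            return (ExInterfaceDemo) Class.forName(className)
                                          .getDeclaredConstructor().newInstance();
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        // in the real program this name comes from the command line
        ExInterfaceDemo myExtractor = load("ExArtikelDemo");
        System.out.println(myExtractor.createConnection());
        System.out.println(myExtractor.read());            // which read() runs is
        System.out.println(myExtractor.closeConnection()); // decided at run time
    }
}
```

Because the engine only ever refers to `ExInterfaceDemo`, adding a new data source means adding one new implementing class; the engine itself stays unchanged.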
This polymorphism implies that programs can adapt at run time to accommodate and process different kinds of data quite automatically. It is also a key point for software reuse. Once a well-designed interface such as the extraction-interface is established, it is easy to add a new data source or to introduce a new rule for reading data: the only modification is to add a new extraction class that implements the extraction-interface, or to change the existing extraction class, respectively. The loader-engine part, LdExtract, remains the same; the original program does not even need to be recompiled. That is how the high software reusability of Data Warehouse Loader is achieved. The same mechanism is used for the record-interface. Here the base type is RecordInterface and the specific object types are RecordArtikel, RecordMarkt and RecordVerkauf. The methods setFmBasisFile (argBasisFile), format ( ) and getFmBasisFile ( ), which are called through the variable of type RecordInterface in LdExtract, may do different things depending on what kind of object the variable references. This is determined by a command line argument, which is passed into LdExtract via the parameter myRecord. Thus, when a new database table is added to Data Warehouse Loader and a new data structure for its records is required, the only modification is to add a new record class, implementing the record-interface, to represent it. The rest of the program remains the same. See Figure 4.9 for the analysis of the record-interface in the extraction program.

Figure 4.9: Analysis of record-interface in extraction program

4.3.2.2. Transformation

The transformation program, LdTransform, reads the extraction result and produces new data whose structure suits the target data warehouse table.
The transformation result can be either a historical table, which represents the newly updated current state of the database, or an accumulative table, which represents the recent changes of the database. See Figure 4.10. The data source and the data target are passed to the transformation program as parameters. LdTransform then loads the corresponding transformation classes and record classes. With the help of the predefined functions in the transformation classes and the record classes, the data is transformed into the required format, and afterwards the transformation results are not only saved in the record object file but also output to the text file.

Figure 4.10: Program of LdTransform

In this process the loader-engine program, LdTransform, only knows the basic operation sequence for transforming data: reading and saving the header information in the text file, transforming the data, and saving and outputting the transformation results. It does not know the specific transformation scheme for each database table. This question is answered by the transformation classes, which implement the transformation-interface and define how the data is transformed. The task of the record classes is the same as in the extraction program, namely determining the data structure of the database records and the header information in the text file. In this way, each table obtains its own individual transformation scheme in this process. LdTransform therefore belongs to the common part of Data Warehouse Loader, while the transformation classes and record classes belong to the data-specific part. See Figure 4.11 for the analysis of the transformation-interface in the transformation program. The polymorphism mechanism is used here in the same way as in the extraction program. The transformation-interface declares the following methods: setTrBasisFile (BasisFile aBasisfile), transform ( ), trSaveRecords ( ), trSaveBasisFile ( ), trOutput ( ).
Each transformation class, namely TrDimArtikel, TrDimMarkt and TrFactUmsatz, has its own implementation of these methods. The base type is therefore TrInterface, and the specific object types are TrDimArtikel, TrDimMarkt and TrFactUmsatz. When these methods, declared in the transformation-interface, are called through the variable of type TrInterface in the LdTransform program, they may do different things depending on what kind of object the variable references. This is determined by a command line argument, which is passed into LdTransform via the parameter myTransformer.

Figure 4.11: Analysis of transformation-interface in transformation program

Consequently, when a new transformation scheme is required or an old one changes, the only modification is to add a new transformation class or to change the existing one, respectively. All transformation classes must implement the transformation-interface; the rest of the program remains the same. In this way the high software reusability of Data Warehouse Loader is achieved. Normally one data source table corresponds to one target data warehouse table. It is also possible, however, that several target data warehouse tables load data from one data source table, or that one target data warehouse table loads data from several data source tables. Such combination relationships are likewise determined by the transformation classes. Another supporter of the configuration of the data transformation scheme is a transformation initial interface, called TrIniInterface. Here another property of Java is used: a class can implement multiple interfaces when necessary, and when a class implements an interface, any constants declared in the interface definition are available directly in the class, just as though they were inherited from a base class.
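The constant-inheritance property just described can be illustrated with a small sketch. The interface and class names echo the thesis's TrIniInterface and TrDimArtikel (with a Demo suffix), but the concrete Artikel types and the replacement term are invented for demonstration.

```java
// Sketch of a constants-only configuration interface: any class that
// implements it can use the constants directly, as if inherited.
interface TrIniInterfaceDemo {
    // interface fields are implicitly public static final
    String[] SPECIAL_ARTIKEL_TYPES = { "Limo", "Brause" };  // invented values
    String REPLACEMENT = "Softdrink";                        // invented value
}

public class TrDimArtikelDemo implements TrIniInterfaceDemo {
    // transformation rule: replace listed Artikel types, keep all others
    String transformType(String artTyp) {
        for (String s : SPECIAL_ARTIKEL_TYPES) {  // constant used directly
            if (s.equals(artTyp)) return REPLACEMENT;
        }
        return artTyp;
    }

    public static void main(String[] args) {
        TrDimArtikelDemo t = new TrDimArtikelDemo();
        System.out.println(t.transformType("Limo"));  // listed type: replaced
        System.out.println(t.transformType("Bier"));  // not listed: unchanged
    }
}
```

A transformation class can implement such a configuration interface in addition to the transformation-interface, since Java allows a class to implement several interfaces at once.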
Therefore a transformation class can implement not only the transformation-interface but also this transformation initial interface, in which special rules for data transformation are defined. In particular, the transformation class can make use of any constants declared in the transformation initial interface. In the demo example of the Data Warehouse Loader implementation, one of the transformation rules for the Artikel table is to replace certain Artikel types with a predefined term: whenever one of these Artikel types is encountered in the extraction result, it is changed. The names of these Artikel types are therefore grouped into an array in the transformation initial interface. Each transformation class implements this interface and thereby obtains the array. Each Artikel type in the extraction result can then be compared against this array in order to determine whether it belongs to the array and requires the corresponding operation. The Java properties of multiple interface implementation and constant inheritance are also utilized for the initial configuration of Data Warehouse Loader, namely in the interfaces FileInterface, RCInterface and LoaderIniInterface. FileInterface defines all the file types of the intermediate results for their header information in the text file. RCInterface defines the return codes in case of error. LoaderIniInterface defines the storage path and postfix of the intermediate files as well as the URL strings for the data source system and the target data warehouse. This configuration information is available to every class in the whole package, provided the class implements these interfaces and thus obtains the constants. The record-interface in the transformation program plays exactly the same role as in the extraction program.
The only difference is that three other record classes are used here, namely RecordDimArtikel, RecordDimMarkt and RecordFactUmsatz, since the data structure has already been changed to suit the target data warehouse. This means that after the transformation stage the construction of the records in the dimension tables is complete.

4.3.2.3. Figuring out the difference

Figure 4.12: Program of LdDiff

In this stage the difference between the current transformation result and the previous status of the data warehouse is determined. See Figure 4.12. The previous status of the data warehouse is obtained from a so-called status file instead of being read directly from the data warehouse; this is much faster and easier than direct database access. The status file must, however, be updated along with the data warehouse, which is the task of the merging program, LdMerge, described later. The name of the table being compared is passed into LdDiff as a command line argument. Its current transformation result and previous status are then read and compared record by record. First all fields of two given records are compared, and then only their key fields, in order to determine whether a record is new or changed. The result of the LdDiff program is three difference files: the new file, which contains new records; the chg file, which contains changed records; and the del file, which contains deleted records. The record-interface is used in this process when reading data from the input files and saving the results. With its help, the LdDiff program can be used for any database table, provided the table has a corresponding record class that implements the record-interface.
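The record-by-record comparison at the heart of LdDiff can be sketched as follows. The tiny record class with an integer key is an invented stand-in for the thesis's record classes; only the roles of compareFields and compareKeyFields are taken from the text, and the real program compares whole sorted lists rather than single pairs.

```java
// Sketch of the LdDiff comparison logic: compareFields() tests full
// equality, compareKeyFields() compares only the key, so that a
// differing record can be classified as changed (same key) or new.
public class LdDiffDemo {
    static class Rec {
        int key; String payload;
        Rec(int k, String p) { key = k; payload = p; }
        boolean compareFields(Rec o) { return key == o.key && payload.equals(o.payload); }
        int compareKeyFields(Rec o)  { return Integer.compare(key, o.key); }
    }

    // classifies one current record against one previous record
    static String classify(Rec current, Rec previous) {
        if (current.compareFields(previous)) return "same";
        // fields differ: same key means the record was changed,
        // a different key means it does not match this old record at all
        return current.compareKeyFields(previous) == 0 ? "chg" : "new";
    }

    public static void main(String[] args) {
        Rec oldRec = new Rec(1, "Cola");
        System.out.println(classify(new Rec(1, "Cola"),  oldRec)); // same
        System.out.println(classify(new Rec(1, "Fanta"), oldRec)); // chg
        System.out.println(classify(new Rec(2, "Cola"),  oldRec)); // new
    }
}
```

A previous-status record that matches nothing in the current result would, by the same logic, end up in the del file.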
In the LdDiff program, the common methods are called through a variable of type RecordInterface, and they may do different things depending on the type of record the variable references. For example, the criteria for comparing records differ from record class to record class. Each record class therefore has its own implementation of the methods compareFields (RecordInterface aRecordInterface) and compareKeyFields (RecordInterface aRecordInterface). The method compareFields compares all fields of two records and returns true when they are the same and false when they differ. If the two records are different, the method compareKeyFields is then called to compare their key fields. It also gives their order by returning –1, 0 or 1, where the sorting criteria are given by the key fields, as stated in the text file. In this way a given record can be determined to be a new record or a changed record. Naturally, if a record in the status file is no longer found in the current transformation result, it is a deleted record.

4.3.2.4. Importing

Figure 4.13: Program of LdImport

The importing program, LdImport, imports the difference files resulting from the last stage into the data warehouse. Every run of the importing program imports one of these difference files; the program is therefore executed three times for each database table, for the new records, the changed records and the deleted records separately. See Figure 4.13. The target data warehouse for importing is passed to the importing program as a parameter. LdImport then loads the corresponding database classes, which implement the database-interface. With the help of the predefined functions in the database classes, the data from the difference files is imported: a connection to the target data warehouse is created, the new records are inserted, the changed records are updated, the deleted records are removed, and finally the connection is closed.
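As an illustration of the operations a database class has to supply, the following sketch merely builds the three SQL statements for one record instead of opening a real JDBC connection. The table and column names are assumptions; a real database class would execute such statements through java.sql.PreparedStatement over an open connection.

```java
// Sketch of the three importing operations (insert, update, delete)
// for one record of a hypothetical DIM_ARTIKEL table. Only the SQL
// strings are built here; no database is touched.
public class DbImportDemo {
    static String insertSql(int artNr, String artName) {
        return "INSERT INTO DIM_ARTIKEL (ARTNR, ARTNAME) VALUES ("
               + artNr + ", '" + artName + "')";
    }
    static String updateSql(int artNr, String artName) {
        return "UPDATE DIM_ARTIKEL SET ARTNAME = '" + artName
               + "' WHERE ARTNR = " + artNr;
    }
    static String deleteSql(int artNr) {
        return "DELETE FROM DIM_ARTIKEL WHERE ARTNR = " + artNr;
    }

    public static void main(String[] args) {
        // one kind of statement per difference file: new, chg, del
        System.out.println(insertSql(4711, "Cola 0.5l"));
        System.out.println(updateSql(4711, "Cola 1.0l"));
        System.out.println(deleteSql(4711));
    }
}
```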
Figure 4.14: Analysis of database-interface in importing program and retrieving program

In this process the loader-engine program, LdImport, only knows the basic operation sequence for importing data, such as inserting, updating and removing. It does not know which data should be processed for each database table or what kind of connection to the data warehouse should be created. These questions are answered by the database classes, which define what kind of connection is established, which data to import, and so on. In this way, each table obtains its individual treatment in the importing process. LdImport therefore belongs to the common part of Data Warehouse Loader, while the database classes belong to the data-specific part. See Figure 4.14 for the analysis of the database-interface in the importing program. Figure 4.14 serves both the importing program and the retrieving program, since both of them make use of the database-interface in the same way. The polymorphism mechanism is used here in the same way as in the extraction and transformation programs. The database-interface declares the following methods: createConnection ( ), setDbBasisFile ( ), dbNew (String aString), dbChg ( ), dbDel ( ), closeConnection ( ). Each database class, namely DbDimArtikel, DbDimMarkt and DbFactUmsatz, has its own implementation of these methods. The base type is therefore DbInterface, and the specific object types are DbDimArtikel, DbDimMarkt and DbFactUmsatz. When these methods, declared in the database-interface, are called through the variable of type DbInterface in the LdImport program, they may do different things depending on what kind of object the variable references. This is determined by a command line argument, which is passed into LdImport via the parameter myImporter. Consequently, when a new target data warehouse is added, the only modification is to add a new database class, which implements the database-interface and does all the updating work on the data warehouse.
The rest of the program remains the same. In this way the high software reusability of Data Warehouse Loader is achieved. During the execution of the importing program it may happen that a certain record cannot be imported into the target data warehouse, i.e. it cannot be inserted, changed or deleted. Possible reasons are the integrity checks of the target data warehouse, such as constraints against broken foreign-key relationships or values exceeding the allowed range. In such a case the status of the data warehouse is not updated according to the corresponding current transformation result. To indicate this, each record class contains one data member of type character, called cImportFlag. This importing flag is initialized to a dot '.'. If the record is successfully imported, that is, successfully inserted, updated or removed, the importing flag is changed to 'Y'; otherwise it keeps its initial state. This importing-flag information is later read by the merging program in order to determine the proper merging operation, as described below.

4.3.2.5. Merging

After importing the difference data into the target data warehouse, the loading process is not yet complete. For the next workflow execution of Data Warehouse Loader, a status file needs to be created, which represents the current status of the data warehouse, for the sake of execution speed. As stated before, because errors may occur during the importing process, the construction of the status file must take the reaction of the data warehouse in the importing process into account as well. That is the task of the merging program, LdMerge. See Figure 4.15.

Figure 4.15: Program of LdMerge

The merging process handles new records, changed records and deleted records separately. The most important information in the merging process is the importing flag of each record.
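How the importing flag drives the merging of a changed record can be sketched as follows. The miniature record class is invented for demonstration; only the flag convention ('.' initially, 'Y' on success) and the equals()-based lookup via LinkedList.indexOf are taken from the text.

```java
import java.util.LinkedList;
import java.util.List;

// Sketch of the merging step for changed records: only records whose
// importing flag was set to 'Y' by LdImport are applied to the status
// list; indexOf() locates the old version of the record via equals().
public class LdMergeDemo {
    static class Rec {
        int key; String payload; char cImportFlag = '.';
        Rec(int k, String p) { key = k; payload = p; }
        // equals() defines the comparison criteria so LinkedList.indexOf works
        @Override public boolean equals(Object o) {
            return (o instanceof Rec) && ((Rec) o).key == key;
        }
        @Override public int hashCode() { return key; }
    }

    static void mergeChanged(List<Rec> status, Rec chg) {
        if (chg.cImportFlag != 'Y') return;  // import failed: do nothing
        int i = status.indexOf(chg);         // find the old record by key
        if (i >= 0) status.remove(i);        // delete the old version
        status.add(chg);                     // add the changed version
    }

    public static void main(String[] args) {
        LinkedList<Rec> status = new LinkedList<>();
        status.add(new Rec(1, "Cola"));
        Rec chg = new Rec(1, "Fanta");
        chg.cImportFlag = 'Y';               // LdImport succeeded for this record
        mergeChanged(status, chg);
        System.out.println(status.get(0).payload);
    }
}
```

New and deleted records follow the same pattern: add to the status list on a 'Y' flag, or remove the matching record on a 'Y' flag, respectively.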
For the new records: if the importing flag has been changed to 'Y', which stands for successful insertion, the record is added to the status file; otherwise nothing is added. For the changed records: if the importing flag has been changed to 'Y', which stands for a successful change, the corresponding record in the status file is first found and deleted, and then the record from the chg file is added to the status file; this is how the record is changed in the status file as well. Otherwise nothing is done. For the deleted records: if the importing flag has been changed to 'Y', which stands for successful deletion, the corresponding record in the status file is removed; otherwise nothing is done. The record-interface is used in this merging process when reading data from the difference files and saving the results. With its help, the LdMerge program can be used for any database table, provided the table has a corresponding record class that implements the record-interface. In the LdMerge program, the common methods are called through a variable of type RecordInterface and may do different things depending on the type of record the variable references. For example, when merging changed records, if the importing flag is 'Y', the corresponding record in the status file must first be found and then deleted. For this purpose the method indexOf ( ) of an object of type LinkedList is called in order to find the argument object in the linked list of the status file. For this method to work properly, the compared objects must implement the method equals (Object aObject), which is inherited from Object. The record-interface therefore declares this method, and each record class has its own implementation of it to define the concrete criteria for comparing records.

4.3.2.6. Retrieving

Figure 4.16: Program of LdRetrieve

The retrieving program is used to construct the status file directly from the target data warehouse.
See Figure 4.16. The result of the retrieving program is therefore the same as that of the merging program. The retrieving program does not belong to the normal daily workflow of Data Warehouse Loader. However, it is executed for the following purposes:
Controlling the result of the merging program.
Recovering from a system crash: when the workflow of Data Warehouse Loader breaks down because of a system error and the status file no longer exists, the status file can be recovered directly from the target data warehouse by means of the retrieving program.
The target data warehouse for retrieving is passed to the retrieving program as a parameter. LdRetrieve then loads the corresponding database classes. With the help of the predefined functions in the database classes, the data is retrieved from the respective target data warehouse. In this process the loader-engine program, LdRetrieve, only knows the basic operation sequence for retrieving data: creating the connection to the data warehouse, fetching data, saving records, outputting results and closing the connection. It does not know which data should be retrieved or what kind of connection to the data warehouse should be created. These questions are answered by the database classes, which implement the database-interface and define which data to retrieve, what kind of connection is established, and so on. In this way, each table obtains its individual treatment in the retrieving process. LdRetrieve therefore belongs to the common part of Data Warehouse Loader, while the database classes belong to the data-specific part. See Figure 4.14 for the analysis of the database-interface in the retrieving program. The polymorphism mechanism is used here in the same way as in the extraction, transformation and importing programs. The database-interface declares the following methods: createConnection ( ), fetch ( ), dbSaveRecords ( ), setDbBasisFile ( ), dbOutput ( ), closeConnection ( ).
Each database class, namely DbDimArtikel, DbDimMarkt and DbFactUmsatz, has its own implementation of these methods. The base type is therefore DbInterface, and the specific object types are DbDimArtikel, DbDimMarkt and DbFactUmsatz. When these methods, declared in the database-interface, are called through the variable of type DbInterface in the LdRetrieve program, they may do different things depending on what kind of object the variable references. This is determined by a command line argument, which is passed into LdRetrieve via the parameter myImporter. The result of retrieving is both saved in the record object file and printed to the text file for checking. Consequently, when a new target data warehouse is added, the only modification is to add a new database class, which implements the database-interface and does all the retrieving work from the data warehouse. The rest of the program remains the same. In this way the high software reusability of Data Warehouse Loader is achieved.

4.4. Loader-interface: interface concept of Data Warehouse Loader

According to the preceding analysis, one of the most important goals of the architecture design of Data Warehouse Loader is to decouple the common part for the basic workflow from the data-specific part for an individual application environment. The interface concept therefore plays a crucial role in the implementation of Data Warehouse Loader. The common part, the loader-engine, has been explained above. The following describes the data-specific part: the loader-interface.
The loader-interface is the data-oriented structure of Data Warehouse Loader, and each of its interfaces supplements the loader-engine with the following functionality:
Extraction-interface: access to the data source with sequential reading capability
Transformation-interface: transformation of the source data into the target format
Database-interface: access to the data warehouse for updating and retrieving data
Record-interface: construction of an object structure for each database record
The loader-interface resides in one package, named myPackage.loaderInterface. Each of the loader-interfaces is combined with a particular loader-engine program and fulfills a certain part of the whole processing task, from data source to target data warehouse. From the point of view of the Data Warehouse Loader system, it is the proper combination of loader-interface and loader-engine that makes the sequential extracting, transforming and loading process run, and moreover makes Data Warehouse Loader more reusable. The loader-interface is responsible for all functionality related to a specific data source, a specific data structure, a specific data transformation scheme, specific business logic and so on. The loader-interface needs to be reprogrammed when Data Warehouse Loader is transferred from one application to another, since different customers obviously have different business logic. A well-designed Java interface, however, reduces this reprogramming effort as much as possible. The interface concept of Java is therefore important in the implementation of the loader-interface. Normally in Java, a common method can be defined in a base class and then implemented individually in each of the subclasses. The method signature is the same in each class, and the method can be called polymorphically.
When all that is wanted is a set of methods that are implemented in a number of different classes and can be called polymorphically, the base class can be dispensed with altogether. The same polymorphic behavior can be achieved much more simply by using a Java facility called an interface. The name of this facility indicates its primary use: specifying a set of methods that represents a particular class interface, which can then be implemented individually in a number of different classes. All of these classes then share this common interface, and its methods can be called polymorphically. An interface is essentially a collection of constants and abstract methods. To make use of an interface, a class implements it, which means the class provides code for each of the methods declared in the interface as part of its class definition. When a class implements an interface, any constants defined in the interface definition are available directly in the class, just as though they were inherited from a base class. This constant inheritance is used in Data Warehouse Loader to define some interfaces that consist only of constants, for the purpose of initial system configuration, for example FileInterface, RCInterface, LoaderIniInterface and TrIniInterface. The interfaces of Data Warehouse Loader have the following functions:

4.4.1. Extraction-interface

The extraction-interface is the interface to the data source system, with functions to create a connection to the data source, read data, close the connection, and set and save the header information for the text file. It assumes that the data can be read sequentially, record by record, from the data source system.
The reading function of the extraction-interface corresponds to reading a line of text from the text file or selecting a record from the database; the current position in the text file or in the database is therefore required. The extraction-interface is used in the extraction program. See Figure 4.17.

Figure 4.17: Extraction-interface of Data Warehouse Loader

4.4.2. Transformation-interface

The transformation-interface is the interface for the different data transformation schemes of a particular application. Its functions are to transform data, output data to the text file, save data in the record object file, and set and save the header information for the text file. The transformation-interface applies the essential changes to the records from the data source so that they fit the data structure of the target data warehouse, for example dimensional data modeling and a star schema. The transformation-interface is used in the transformation program. See Figure 4.18.

Figure 4.18: Transformation-interface of Data Warehouse Loader

4.4.3. Database-interface

The database-interface is the interface to the target data warehouse. It serves both the importing program and the retrieving program, since both of them need to communicate with the data warehouse. For the importing program, the database-interface creates a connection to the data warehouse, inserts the new records, updates the changed records, removes the deleted records, closes the connection, and sets and saves the header information for the text file. For the retrieving program, it creates the connection, fetches data, outputs data to the text file, saves data in the record object file, closes the connection, and sets and saves the header information for the text file. See Figure 4.19.

Figure 4.19: Database-interface of Data Warehouse Loader

4.4.4.
Record-interface

The record-interface is the interface that provides an object structure for each database record, giving an object-oriented character to this otherwise non-object-oriented procedure. The way information is organized at the object level usually differs from the way it is stored in a relational database: in the object-oriented world, objects function at the level of the conceptual model, whereas relational databases work at the data model level. The record-interface is the bridge that maps relational data onto Java objects. Each field of a database record corresponds to one data member of the record class. Additionally, each record class has a data member of type character, called cImportFlag, to indicate the importing status of each record. The record-interface also has rich functionality related to the configuration of the data structure, such as initializing the importing flag, getting the status of the importing flag, comparing all fields of two records, comparing the key fields of two records, determining the equivalence of records, specifying the format of records shown in the text file, and setting the header information for the text file. In a word, the record-interface is responsible for the data representation in Data Warehouse Loader. See Figure 4.20.

Figure 4.20: Record-interface of Data Warehouse Loader

4.5. Format of intermediate files

As stated before, the intermediate text files have a uniform format, which consists of an Info-header, a Format-header and a Data-body, as explained in the following. See Figure 4.21 for an example of an extraction text file.

Figure 4.21: Example of the extraction text file

1. Info-header

The Info-header starts with the keyword $INFO and ends with the keyword §END. It includes the following information about the text file:
FILETYPE: type of intermediate file, produced by the different loading workflow stages.
See Table 4.1.
SOURCE: data source from which the data is extracted by the extraction program
TARGET: target data warehouse into which the data is finally imported
REFDATE: reference date, which indicates the time state of the operational data source and can be given as a command line argument to the extraction program
CREDATE: creation date of this result file
CRETIME: creation time of this result file
EXTFILE: storage path and name of the extraction file
SORT: sorting criteria, which define according to which columns the records are sorted. The column names are consistent with the Format-header. If there is more than one column name, they are separated by commas, and the order of the names is relevant. If the file does not need to be sorted, FILE_NO_SORT is given.

LD-EXT  Extraction data
LD-TFM  Transformation data
LD-NEW  Difference data with new records
LD-CHG  Difference data with changed records
LD-DEL  Difference data with deleted records
LD-MRG  Status data created by merging program
LD-RET  Status data created by retrieving program
Table 4.1: File type information in the text file

2. Format-header

The Format-header starts with the keyword $FORMAT and ends with the keyword §END. It defines the name, the starting position and the width of each column in the text file. The importing flag is not counted. Each column has format information of the following form:
<name of column>: <starting position>, <width>
The Format-header information is determined by the method format ( ) defined in each record class.

3. Data-body

The Data-body starts with the keyword §DATA and ends with the keyword §END. It contains the actual data of the file in the predefined column format described in the Format-header. Every line starts with the importing flag, which is initialized to a dot. A vertical bar then separates the importing flag from the real data part.

4.6.
Sorting the linked list of record objects

In the workflow of Data Warehouse Loader, all the record objects of the intermediate files are organized in a linked list. It is necessary to sort these record objects according to certain sorting criteria, such as Artikel number, Markt ID and so on. Sorting is especially important when two linked lists are compared in order to find their difference, as is required when figuring out the difference between the current transformation result and the previous one. Sorting is also required when transforming large groups of source records into large groups of target records (m:n-transformation).

A specific sorting method could be written for the record objects, but it is far less trouble to take advantage of another feature of the java.util package, the Collections class. The Collections class defines a variety of handy static methods, one of which is the sort() method. The sort() method only sorts lists, that is, collections that implement the List interface.

Figure 4.22: Code fragment for sorting the linked list of record objects

Obviously there also has to be some way for the sort() method to determine the order of the objects in the list being sorted, in this case record objects. The most suitable way to do this for record objects is to implement the Comparable interface in each record class. The Comparable interface declares only one method, called compareTo(). It is the same method seen in the String class: it returns a negative integer, zero, or a positive integer depending on whether the current object is less than, equal to, or greater than the argument passed to the method. If each record class implements the Comparable interface, the linked list of record objects can be passed directly as an argument to the sort() method. The collection is then sorted in place as required, so there is no return value.
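A record class prepared for Collections.sort() might look as follows. This is only an illustrative sketch: apart from cImportFlag, the member names and signatures are assumptions, and the modern generic form of the Comparable interface is used for brevity.

```java
import java.util.Collections;
import java.util.LinkedList;

// Hypothetical record class combining the Record-interface ideas
// (import flag, key comparison, fixed-width formatting) with Comparable.
class ArtikelRecord implements Comparable<ArtikelRecord> {
    int artikelNr;            // key field, the sorting criterion here
    String bezeichnung;       // description field
    char cImportFlag = '.';   // importing status, initialized to a dot

    ArtikelRecord(int artikelNr, String bezeichnung) {
        this.artikelNr = artikelNr;
        this.bezeichnung = bezeichnung;
    }

    // compare key fields only, e.g. to tell a changed record from a new one
    boolean equalKey(ArtikelRecord other) {
        return artikelNr == other.artikelNr;
    }

    // fixed-width layout matching a Format-header entry such as "ARTIKELNR: 0, 8"
    String format() {
        return String.format("%-8d%-20s", artikelNr, bezeichnung);
    }

    // negative, zero or positive as this record is less than, equal to
    // or greater than the argument -- the contract Collections.sort() relies on
    public int compareTo(ArtikelRecord other) {
        return Integer.compare(artikelNr, other.artikelNr);
    }
}

public class SortDemo {
    public static void main(String[] args) {
        LinkedList<ArtikelRecord> records = new LinkedList<>();
        records.add(new ArtikelRecord(3, "Saft"));
        records.add(new ArtikelRecord(1, "Apfel"));
        records.add(new ArtikelRecord(2, "Brot"));
        Collections.sort(records);   // sorted in place, no return value
        for (ArtikelRecord r : records) System.out.println(r.format());
    }
}
```

Because compareTo() orders by the Artikel number, the list comes out sorted by that key, which is exactly the precondition for the list-comparison in the differentiation stage.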
This is the way all the record objects in a linked list are sorted in Data Warehouse Loader. Please see Figure 4.22 for the code fragment used for sorting in Data Warehouse Loader.

4.7. Graphical user interface of Data Warehouse Loader

In this work, all the functionality of Data Warehouse Loader discussed above is implemented in Java. Additionally, a graphical user interface is provided for the convenience of running the workflow of Data Warehouse Loader and checking the intermediate result of each processing stage. Please see Figure 4.23.

Figure 4.23: Graphical user interface of Data Warehouse Loader

The lower part of the GUI window serves as the output window, while the upper part provides different options for running Data Warehouse Loader. “Loader Workflow” is the complete operation sequence of the data warehouse extracting, transforming and loading process. “Target” indicates the database table with which Data Warehouse Loader is currently working. To run Data Warehouse Loader, one normally first chooses one of the radio buttons under “Loader Workflow” and one of the radio buttons under “Target”, then presses the “Execute” button. The resulting intermediate file of the current execution is shown in the lower part as output.

“Options” is used especially when executing the differentiation and importing stages. After figuring out the difference between the current transformation result and the previous one, there are three separate difference files, for new records, changed records and deleted records. It is better to read each of them separately; therefore, one of the radio buttons under “Options” is chosen in order to read the three difference files one after another. As far as the importing program is concerned, every execution of the importing program imports one of these three difference files, which means inserting new records, updating changed records or removing deleted records.
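The differentiation that produces these three difference files can be sketched compactly. The thesis compares two sorted linked lists of record objects; for brevity this hypothetical sketch keys the previous and current transformation results by record number in sorted maps, and all names are assumptions made for the example.

```java
import java.util.List;
import java.util.Map;
import java.util.SortedMap;

// Illustrative split of the current transformation result into
// new, changed and deleted records relative to the previous result.
public class DiffDemo {
    static void diff(SortedMap<Integer, String> previous,
                     SortedMap<Integer, String> current,
                     List<Integer> added, List<Integer> changed,
                     List<Integer> deleted) {
        // keys present now: either new or (if the payload differs) changed
        for (Map.Entry<Integer, String> e : current.entrySet()) {
            String before = previous.get(e.getKey());
            if (before == null) added.add(e.getKey());
            else if (!before.equals(e.getValue())) changed.add(e.getKey());
        }
        // keys that vanished since the previous run: deleted
        for (Integer key : previous.keySet())
            if (!current.containsKey(key)) deleted.add(key);
    }
}
```

The three result lists correspond to the three importing modes: insert the added records, update the changed ones, remove the deleted ones.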
Therefore, one of the radio buttons under “Options” needs to be chosen when importing. In the special case of the initial run of Data Warehouse Loader, there is neither the content of an old status file nor any content in the target data warehouse, and consequently there are no difference files. In this case, the transformation result is imported directly into the target data warehouse, which means all of the transformation results are new records and are inserted. Hence the current transformation result is also a possible parameter for the importing program.

76 REUSE ANALYSIS OF DATA WAREHOUSE LOADER

5. Reuse Analysis of Data Warehouse Loader

As stated before, it is important to make the software of Data Warehouse Loader reusable in a standard data warehouse application. The reason is simple: this software should fit different operational data sources, different data transformation schemes, and different target data warehouses. Beyond these requirements of the data warehouse application, this software can also be employed in other applications where data is transformed from system A to system B; for example, Data Warehouse Loader can work as a bridge during the step-by-step displacement of an old system by a new one. Of course, certain modifications are then necessary. Therefore, in this work, the software of Data Warehouse Loader is designed especially in favor of software reuse.

Chapter 2 gave some fundamental concepts of software reuse, which are essential for understanding the importance of this technology. Chapter 3 introduced the development context of Data Warehouse Loader, especially the standard architecture of a data warehouse application, where Data Warehouse Loader resides. Afterwards, chapter 4 explained the detailed implementation of this software: its overall architecture, its workflow, its intermediate results, Loader-engine, Loader-interface, and so on.
In other words, chapter 2 is the theoretical part of this work, while chapters 3 and 4 are the practical part. Therefore, the task in this chapter is to combine the theoretical and practical parts of the work, which means analyzing the reuse architecture of Data Warehouse Loader explicitly.

5.1. Reuse development process of Data Warehouse Loader

In order to make the software or its components reusable, the following has been done in the implementation of Data Warehouse Loader:

- The reusability of the software is taken into consideration even in the early planning stage of software development.
- Accurate and concrete reuse concepts are developed.
- Good documentation is provided, both system documentation and documentation in the source code.
- Universal programming conventions are used.
- Specifications are concretely identified and separated from any particular project or customer.
- The software is built on the basis of existing frameworks, libraries and components.
- The application requirements are thoroughly examined. It is not enough to know only what to do; the requirements must be considered intensively in order to find and define even the blurry and contradictory points. Moreover, it is best to come up with a general component from which other developers can profit.

Data Warehouse Loader shows several characteristics that make it reusable:

- It is general.
- It masks the basic functions, such as database access, error handling and so on.
- It is programmed in a widely and popularly used, platform-independent programming language.

Accordingly, the requirement analysis and architecture design of Data Warehouse Loader are extremely important in this work.
Because of the requirements of the data warehouse application, namely fitting different data sources, different target data warehouses, and different data transformation schemes, the interface concept of Java is exploited. There are four interfaces in Data Warehouse Loader: the extraction-interface is oriented to different data sources, the transform-interface to different transformation schemes, and the database-interface to different target data warehouses, while the record-interface is used during the whole process, because records are the objects dealt with throughout.

Figure 5.1: Decoupling architecture of Loader-engine and Loader-interface in detail

With regard to the architecture design of Data Warehouse Loader, it is important how the whole functionality is divided into sub-functions. It should be carefully considered that the workflow of Data Warehouse Loader is non-object-oriented, while an object-oriented language is used here. As stated before, the “what” information is essential information that should be available to everyone; it includes specifications and interface information. The “how” information should be available only to a limited group; it includes implementation details such as data structures. The separation of “what” information and “how” information is achieved by means of an architecture containing both Loader-engine and Loader-interface.

Loader-engine contains the “how” information. Loader-engine knows how to extract data from the data source, how to transform data, how to figure out the difference, how to import data into the data warehouse, how to merge data into status files and how to retrieve data from the data warehouse. Moreover, Loader-engine knows the sequence of the workflow of Data Warehouse Loader. Loader-interface, on the other hand, contains the “what” information.
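This engine/interface split can be sketched in a few lines. The names ExInterface and ExArtikel are taken from the thesis; the method signatures and the engine logic are assumptions made for this illustration.

```java
import java.util.List;

// The "what": a Loader-interface definition declaring which data to extract
// and how one raw row maps to a record (signatures are hypothetical).
interface ExInterface {
    String selectStatement();       // what data to extract
    String mapRow(String rawRow);   // what a record looks like
}

// One concrete class per data source; swapping sources means
// swapping classes like this one, nothing else.
class ExArtikel implements ExInterface {
    public String selectStatement() { return "SELECT * FROM ARTIKEL"; }
    public String mapRow(String rawRow) { return rawRow.trim(); }
}

// The "how": the Loader-engine drives the extraction stage and depends
// only on the interface, never on a concrete class.
class LoaderEngine {
    static int extract(ExInterface ex, List<String> rawRows, List<String> out) {
        // (a real engine would run ex.selectStatement() against the source)
        for (String row : rawRows) out.add(ex.mapRow(row));
        return out.size();
    }
}
```

Transferring the loader to another application leaves LoaderEngine and ExInterface untouched; only implementing classes such as ExArtikel are rewritten.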
Loader-interface knows what kind of data to extract from the data source, what kind of transformation scheme to perform, what the difference is when dealing with new records, changed records or deleted records, and what kind of data to retrieve from the data warehouse.

Figure 5.2: Detail of Loader-engine

Since the detailed implementation of Data Warehouse Loader has been given in chapter 4, the overall decoupling architecture of Loader-engine and Loader-interface can be seen more clearly in Figure 5.1. When Data Warehouse Loader is transferred from one application to another, the programs of Loader-engine and the interface definitions remain the same, as shown in the middle of Figure 5.1, which represents a software part with high reusability. What needs to be modified are the classes that implement the corresponding interfaces, shown as the four surrounding groups of classes in Figure 5.1. The figure also indicates for which workflow stage of Data Warehouse Loader each surrounding group of classes works, namely extraction, transformation, differentiation, merging and retrieving. For example, the classes ExArtikel, ExMarkt and ExVerkauf implement the interface ExInterface and work for the extraction stage. The same holds for the other groups of classes. Furthermore, the details of Loader-engine in the middle are shown in Figure 5.2.

5.2. Applying concepts of software reuse

There are several reuse considerations inside Data Warehouse Loader, as follows:

5.2.1. Code reuse

Reusability can be achieved by writing code once and using it for another application, which may or may not require recompilation. There are different levels of code reuse: byte code reuse and source code reuse. Java is compiled to machine-independent low-level code, which is then interpreted by the Java Virtual Machine. This gives Java code platform independence, which means the same byte code can be used on any machine with any operating system.
Therefore, Java realizes byte code reuse. A C program, however, can be reused in terms of source code: since C is widely used, its language standard and compilers are available for most systems.

5.2.2. Adaptability

Adaptability means that the system is easily adapted to a diversified environment. Reusability of software requires the software to be usable in a new context, which might involve new hardware, such as a different CPU, or new software, such as a different operating system or database system. On account of the Java Virtual Machine, platform independence is provided by Java in terms of both hardware and software. Porting a Java program to another machine does not even require recompilation.

As far as the database access layer is concerned, JDBC and ODBC provide a reliable database connection. JDBC (Java Database Connectivity) is an API (Application Programming Interface) that lets users access virtually any tabular data source from the Java programming language. It provides cross-DBMS (Database Management System) connectivity to a wide range of SQL databases. With the new JDBC API, it is also possible to access other tabular data sources, such as spreadsheets or flat files. The JDBC API makes it possible to take advantage of the Java platform's “Write Once, Run Anywhere” capability for industrial-strength, cross-platform applications that require access to enterprise data. ODBC (Open Database Connectivity) is a widely accepted API for database access. It is based on the Call-Level Interface (CLI) specifications from X/Open and ISO/IEC for database APIs and uses Structured Query Language (SQL) as its database access language.

As far as access to the database is concerned, there are two levels of access: technical access and access with business logic.
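The two levels can be kept apart by giving each its own abstraction. In the sketch below all class and method names are assumptions made for the example; only the JDBC DriverManager call is real API. The technical level knows only how to open a connection, while the business-logic level decides which data to extract.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;

// Technical access: only knows how to reach the database.
interface ConnectionProvider {
    Connection open() throws SQLException;
}

// One concrete provider per environment; swapping database systems
// means replacing this class only, the business logic stays untouched.
class JdbcConnectionProvider implements ConnectionProvider {
    private final String url, user, password;
    JdbcConnectionProvider(String url, String user, String password) {
        this.url = url; this.user = user; this.password = password;
    }
    public Connection open() throws SQLException {
        return DriverManager.getConnection(url, user, password);
    }
}

// Business-logic access: decides WHICH data to extract, not HOW to connect.
interface ExtractionLogic {
    String extractionQuery();
}

class ArtikelExtraction implements ExtractionLogic {
    public String extractionQuery() {
        return "SELECT ARTIKELNR, BEZEICHNUNG FROM ARTIKEL";
    }
}
```

Keeping the two abstractions separate mirrors the point made here: the technical connection has nothing to do with specific business logic.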
Technical access means technically creating a connection to the database with the help of the Java Virtual Machine, the JDBC API and the ODBC API, using SQL as the database access language. Database access with business logic, on the other hand, means defining which data to extract, which data to retrieve and how the different data is imported. This part of the work is done by the database interface and the extraction interface of Data Warehouse Loader. It is important to separate these two levels of database access functionality, since creating a database connection is at a lower level than extracting, importing and retrieving data; it is a technical connection and has nothing to do with specific business logic.

5.2.3. Modularity

Modularity is defined as the breaking down of a program into small, manageable units. In the Data Warehouse Loader of this work, the workflow of the whole extracting, transforming and loading process is broken down into several modules, namely extraction from the operational data source, data transformation, figuring out the differences, importing data into the data warehouse, merging data into the status file and retrieving data from the data warehouse. This brings several advantages:

- Easy testing: Since the whole workflow is broken down into separate modules, each module can be tested separately. This makes troubleshooting easier, since in case of error the fault can be attributed directly to a certain module instead of checking all of the modules. It also saves testing time when errors exist only in a certain module, because only this module needs to be tested again instead of running the whole process again.
- Easy use: Each module can be utilized separately, which means less context limitation and easier use.
- Module reuse: Since each module can be separated from the others, each of them can be used separately in another context with similar functionality.
For example, the module for extracting data can be used as a tool to read data from a certain database, and the module for transforming data as a tool to change data in a certain manner.

5.2.4. Interface

The interface concept is one of the features of Java that C does not have. It makes it easy for Data Warehouse Loader to adapt to different data sources, different data transformation schemes, and different target data warehouses, as the requirements of the data warehouse application demand. When those conditions or requirements change, all that needs to be done is to provide a specific class that implements the corresponding interface. Therefore, the modification requires little effort, and high reusability of the software is achieved.

There are altogether four interfaces in Data Warehouse Loader. The extraction interface and the database interface are aimed at data access, providing both technical database access and database access with business logic. The transformation interface represents business logic in terms of a particular data transformation scheme in a certain application. Finally, the record interface is aimed at data representation, providing a way to map a record in a database table onto an object. The mapping introduces an object-oriented feature into this non-object-oriented process of extracting, transforming and loading, so that the advantages of an object-oriented language can be exploited. In fact, since each record is represented by one object, it is much easier to process these records, for example by changing, searching or inserting them. It is also easier to work with a new database table, since all that has to be done is to create a new class that implements the record interface and represents the records in the new table. Please see Figure 5.3.

Figure 5.3: Interfaces of Data Warehouse Loader

5.3.
Reuse architecture analysis of Data Warehouse Loader

In a word, the architecture of Data Warehouse Loader favors software reuse, as can be seen in Figure 5.4. The whole software development context of Data Warehouse Loader can be divided into several layers, one above the other; the higher the layer, the more abstract it is. As stated before, the Java Virtual Machine, the JDBC API and the ODBC API are utilized to provide adaptability, which means fitting different hardware, different operating systems and different database systems. The hardware layer and the operating system layer are untouched. With regard to the database, it can be said that the hardware layer deals with the database in terms of the physical disk, the operating system layer in terms of the file storage system, and the JDBC API in terms of database tables.

Figure 5.4: Reuse architecture analysis of Data Warehouse Loader

Record interface, database interface and extraction interface belong to the business objects layer, which uses objects to model the business. By relegating common characteristics and behaviors to the highest possible level and modeling a problem as a set of types, this technique gives rise to the high reusability of object-oriented software. The business objects layer is one layer below the business functions layer, which represents specific business logic and defines the requirements for data processing. Therefore, the transformation interface and every module of the whole workflow belong to this layer. Finally, the business process layer is the most general layer and is connected to external data presentation.

The main motivation for implementing Data Warehouse Loader in Java is to make use of the interface concept of Java, the platform independence of Java and the object-oriented features of Java, which C does not offer. And of course, it is fashionable to use Java nowadays, since it is so widely accepted and used.
However, the original sd&m Data Warehouse Loader, although written in C, achieves high reusability as well.

Firstly, since ANSI C is a widely and popularly used programming language, there is no problem for this software to migrate between different systems, such as Windows NT and Unix. In the Java Data Warehouse Loader of this work, this migration is achieved through the platform independence of Java. Secondly, the original sd&m Data Warehouse Loader also meets the requirements of fitting different data sources, different data warehouses and different data transformation schemes. However, realizing this in C is difficult and requires mechanisms such as look-up tables. In contrast, the interface concept of Java makes this functionality much easier to realize, since the reprogramming effort in case of changes is reduced. But Java also brings some drawbacks. Therefore, from the programming language's point of view, both C and Java are reusable, but with different effort. The lesson learned is that the mere use of a certain programming language does not guarantee software reusability; the language must be accompanied by reuse technology, such as tools and methodologies.

Let us come back to the Java program for an overview of the packages in Data Warehouse Loader. Please see Figure 5.5. There are three packages: myPackage.loader for the Loader-engine classes, myPackage.loaderInterface for the Loader-interface classes, and myPackage.basis for the common basic classes of this work.

Figure 5.5: Package overview of Data Warehouse Loader

86 SUMMARY

6. Summary

6.1. Summary of this work

Since the advent of the computer industry, the cost of software development has been constantly increasing. The high cost of producing and maintaining software is the compelling reason to devise new ways, for example software reuse.
Software reuse is receiving much attention, since enough software has been written and is available to be reused at a fraction of the cost of developing new software from scratch. The objectives of object-oriented software reuse are to produce software better, faster and more cheaply by reusing existing, well-tested assets. These assets include domain architectures, requirement analysis documents, prototype models, designs, algorithms, code components, test scenarios, standards, and any other related documents. If software is reused, the following advantages result:

- shorter software development time
- software with higher quality and fewer errors
- consequently, lower cost

In this work, the general concepts of software reuse are introduced. Then these concepts are applied in the implementation of a data warehouse ETL system, called Data Warehouse Loader. This software is implemented in Java. It fulfills most of the requirements of data warehouse ETL systems. Java is chosen as the programming language in order to be portable and platform-independent. The Java interface concept and the building-block principle are used for the purpose of flexibility in the data transformation scheme and adaptability to any data source and target data warehouse. In this way, the advantages of the object-oriented programming language are exploited. In case of a changing data transformation scheme or of connecting different database systems, the reprogramming effort is reduced, which means high software reusability is achieved.

However, the software still has some drawbacks. Compared to the former sd&m Data Warehouse Loader written in C, this software achieves high reusability at the expense of slower run-time speed and larger main memory consumption. Therefore, problems may arise when processing large volumes of data.
One reason for this is that this Data Warehouse Loader is implemented in Java, and Java naturally requires more memory and is slower, because a Java program is first compiled to low-level byte code and then interpreted by the Java Virtual Machine on the particular machine. Another reason is that the workflow of this Data Warehouse Loader is based on an object-oriented approach. The most significant difference between an object-oriented and a non-object-oriented approach is that each database record is organized as an object instead of a stream of characters. Therefore, all operations are concerned with an object or a set of objects, which are concatenated in a linked list object. Operations with objects need more main memory than operations with simple characters. Therefore, further work on this data warehouse ETL system would be a modification of the data processing workflow in order to reduce main memory consumption, especially when processing large amounts of data.

In a word, the concept of software reuse should be an integral principle of the software engineering process. Software reuse techniques should be applied to the whole software development process. In this way, high-quality software with fewer errors can be produced at lower cost and on shorter schedules.

6.2. Lessons learned

Generally, what lessons should be learned from the software reuse practice of the last few years?

- Reuse is not a silver bullet. It should be apparent that reuse is only part of the solution to the software crisis; it cannot solve all the problems.
- Product-line architectures should be emphasized. Probably the most important lesson learned during the past few years is that product-line architecture is where the action should be. The architecture creates the foundation for systematic reuse. Without an architecture to serve as a decision framework, it becomes very difficult to figure out how to be reusable.
- While technical issues exist, software reuse remains a management problem. Most of the barriers facing software reuse adopters are cultural, managerial, psychological and political. For reuse concepts to work, new processes and paradigms must be introduced. Changes in the management infrastructure, such as organization, policies and processes, are needed to support the introduction of reuse. Structure and tools need to be provided to get products out the door quickly and expertly.
- CASE tools need to be reoriented to model the solution space in addition to the problem domain. Most CASE tools define what should be built without considering available reusable assets. As a result, reuse opportunities are not fully considered as systems are synthesized.

In a word, software must in future be viewed by management more as an asset than as an expense. Viewing software as an asset would encourage management to capitalize its software research and productize its software development efforts.