A Novel and Optimized Parallel Approach for ETL Using Pipelining

Deepshikha Aggarwal#, V. B. Aggarwal*
# Department of IT, Jagan Institute of Management Studies, Delhi
deepshikha.aggarwal@jimsindia.org
vbaggarwal@jimsindia.org

Abstract - Companies have huge amounts of precious data lying throughout their servers and networks that needs to be moved from one location to another, such as from one business center to another or to a data warehouse for analysis. The problem is that this data resides in heterogeneous systems and therefore in different formats. To accumulate data in one place and make it suitable for strategic decisions, we need a data warehouse system. Such a system consists of extract, transform and load (ETL) software, which extracts data from various sources, transforms it into new data formats according to the required information needs, and then loads it into the desired data structure(s), such as a data warehouse. Such software takes an enormous amount of time, which makes the process very slow. To deal with the problem of the time taken by the ETL process, this paper presents a new technique for ETL using pipelining. We have divided the ETL process into several segments which work simultaneously using pipelining and thereby reduce the time for ETL considerably.

Keywords— Data Warehouse, Data Mining, Data Mart, pipelining, nested pipelining, parallel computing.

I. INTRODUCTION

In today's world, businesses demand software that can process large data volumes at lightning speed. Everybody wants information and reports in a single click. In this fast-changing competitive environment, data volumes are increasing exponentially, and there is therefore an increasing demand on data warehouses to deliver the available information instantly. Additionally, data warehouses should generate consistent and accurate results [1]. In a data warehouse, data is extracted from operational databases, ERP applications, CRM tools, Excel spreadsheets, flat files, etc. It is then transformed to match the data warehouse schema and business requirements and loaded into the data warehouse database in the desired structures. Data in the data warehouse is added periodically (monthly, daily or hourly), depending on the purpose of the data warehouse and the type of business it serves [2].

Because ETL is a continuous and frequent part of a data warehouse process, ETL processes must be automated and well-designed. An ETL system is necessary for the success of a data warehousing project. ETL is the process that involves Extraction, Transformation and Loading.

II. EXTRACTION

The first part of an ETL process is to extract the data from the various source systems. Most data warehousing projects consolidate data from different sources, and these systems may use different data organizations and formats. Common source formats are relational databases and flat files, but sources may also include non-relational database structures such as IMS, or other data structures such as VSAM or ISAM. To improve the quality of the data during the extraction process, two operations are performed: data profiling and data cleaning.

A. Data Profiling

Data profiling is the process of examining the data available in an existing data source (e.g. a database or a file) and collecting statistics and information about that data [11]. The purpose of data profiling is to identify the relationships between columns, check for integrity constraints, and check for ambiguities and for incomplete or incorrect values in the data. Data profiling therefore makes it possible to eliminate errors arising from these causes.
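As an illustration only (the paper does not prescribe a specific tool), the following minimal sketch shows how such per-column statistics could be collected with the pandas library in Python; the customer table and its column names are hypothetical.

```python
import pandas as pd

def profile(df: pd.DataFrame) -> pd.DataFrame:
    """Collect simple per-column statistics used to spot
    incomplete or ambiguous values before cleaning."""
    stats = pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "non_null": df.notna().sum(),
        "missing": df.isna().sum(),
        "distinct": df.nunique(),
    })
    stats["missing_pct"] = 100 * stats["missing"] / len(df)
    return stats

# Hypothetical source extract: a customer table from an operational system,
# containing a duplicate key, a missing value and an impossible birth date.
customers = pd.DataFrame({
    "customer_id": [1, 2, 2, 4],
    "birth_date": ["1980-02-29", None, "1975-13-01", "1990-05-17"],
    "zip_code": ["110001", "99999", None, "560001"],
})
print(profile(customers))
```

The missing-value and distinct-value counts flag exactly the kinds of incomplete or ambiguous entries that the subsequent cleaning step then has to resolve.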
After profiling, the process of data cleaning is executed, in which we perform the following tasks:
- Filling in missing values.
- Smoothing noisy data.
- Identifying or removing outliers.
- Resolving inconsistency.
- Resolving redundancy caused by integration.

B. Data Cleaning

Scrubbing or data cleansing means removing or correcting the errors and inconsistencies present in operational data values (e.g. inconsistently spelled names, impossible birth dates, out-of-date zip codes, missing data, etc.). It may use pattern recognition and other artificial intelligence techniques to improve data quality [7]. Formal programs in total quality management (TQM), which focus on defect prevention rather than defect correction, can also be applied to data cleaning. The quality of the data improves by fixing errors such as misspellings, erroneous dates, incorrect field usage, mismatched addresses, missing data, duplicate data and inconsistencies.

Extraction also involves decoding, reformatting, time stamping, conversion, key generation, merging, error detection/logging and locating missing data. The extraction process also includes parsing the extracted data to check whether it meets the desired structure or pattern; if it does not, the data may be rejected. Designing and creating the extraction process is often one of the most time-consuming tasks in the ETL process and, indeed, in the entire data warehousing process. The source systems might be very complex and poorly documented, so determining which data needs to be extracted can be difficult. The data normally has to be extracted not just once but several times, in a periodic manner, to supply all changed data to the warehouse and keep it up to date. Moreover, the source system typically cannot be modified, nor can its performance or availability be adjusted, to accommodate the needs of the data warehouse extraction process.

Extraction is also known as capturing, which means extracting from the source files the relevant data used to fill the enterprise data warehouse (EDW). The relevant data is typically a subset of all the data contained in the operational systems. The two types of capture are:
- Static: capturing a snapshot of the required source data at a point in time, used for the initial EDW load.
- Incremental: used for ongoing updates of an existing EDW; it captures only the changes that have occurred in the source data since the last capture.

III. TRANSFORMATION

The transformation phase applies a set of instructions or functions to the extracted data so as to convert the different data formats into the single format that can be loaded. The following types of transformation may be required [8]:
- Selecting only certain columns to load (those needed in the decision-making process, while excluding null columns).
- Translating coded values (e.g. if the source system stores M for male and F for female, but the warehouse stores 1 for male and 2 for female).
- Encoding free-form values (e.g. mapping "Male", "M" and "Mr" onto 1).
- Deriving a new calculated value (e.g. Age = CurrentDate - DOB).
- Joining together data from multiple sources (e.g. lookup, merge).
- Summarizing multiple rows of data (e.g. total sales for each month).
- Generating surrogate key values.
- Transposing or pivoting (turning multiple columns into multiple rows or vice versa).
- Splitting a column into multiple columns (e.g. turning a comma-separated list stored as a string in one column into individual values in separate columns).

The transformation phase is divided into the following tasks:
- Smoothing (removing noisy data)
- Aggregation (summarization, data cube construction)
- Generalization (concept hierarchy climbing)
- Attribute construction (new attributes constructed from the given ones)
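A minimal sketch of a few of these transformations is given below, assuming a pandas-based implementation with hypothetical column names; the 1/2 gender codes and the Age = CurrentDate - DOB rule are taken from the examples above.

```python
import pandas as pd

# Hypothetical extracted rows; the column names are illustrative only.
rows = pd.DataFrame({
    "name": ["Asha", "Rahul", "Meena"],
    "gender": ["F", "M", "Female"],          # coded and free-form values
    "dob": ["1980-04-12", "1975-09-30", "1992-01-05"],
    "sale_amount": [1200.0, 800.0, 450.0],
})

# Translate/encode gender values onto the warehouse codes (1 = male, 2 = female).
gender_map = {"M": 1, "Male": 1, "Mr": 1, "F": 2, "Female": 2}
rows["gender_code"] = rows["gender"].map(gender_map)

# Derive a new calculated value: approximate age in years from date of birth.
rows["dob"] = pd.to_datetime(rows["dob"])
rows["age"] = (pd.Timestamp.today() - rows["dob"]).dt.days // 365

# Generate surrogate key values for the warehouse dimension.
rows["customer_key"] = range(1, len(rows) + 1)

# Summarize multiple rows of data (e.g. total sales) before loading.
total_sales = rows["sale_amount"].sum()
print(rows)
print("Total sales:", total_sales)
```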
IV. LOADING

This phase loads the transformed data into the data warehouse. The execution of this phase varies depending on the needs of the organization. For example, if we maintain records of sales on a yearly basis, then the data in the data warehouse will be reloaded after a year. The frequency of reloading or updating the data therefore depends on the type of requirements and the type of decisions to be made. This phase interacts directly with the database, including its constraints and triggers, which helps to improve the overall quality of the data and the ETL performance.

V. ETL TOOLS

The ETL process can be implemented on various software platforms, and a number of tools have been developed to make the process more reliable and advanced. Some of the tools available for ETL are [9]:
- Informatica (Informatica Corporation)
- Oracle Warehouse Builder (Oracle Corporation)
- Transformation Manager (ETL Solutions)
- DB2 Universal Enterprise Edition (IBM)
- Cognos DecisionStream (IBM Cognos)
- SQL Server Integration Services (Microsoft)
- Ab Initio (Ab Initio Software Corporation)
- DataStage (IBM) [5]

VI. CHALLENGES FOR ETL

The ETL process can involve a number of challenges and operational problems. Since a data warehouse is assembled by integrating data from a number of heterogeneous sources and formats, the ETL process has to bring all the data together into a standard, homogeneous environment. Some of the challenges are:
- Dealing with diversity and disparities in the source systems.
- The need to deal with source systems on multiple platforms and different operating systems.
- Many source systems are older legacy applications running obsolete database technologies.
- Historical data change values are not preserved in the source systems.
- The process takes an enormous amount of time.

VII. PARALLEL PROCESSING IN ETL

Parallel computing is the simultaneous use of more than one CPU or processor core to execute a program or multiple computational threads. Parallel processing makes programs run faster because more engines (CPUs or cores) are running them. To improve the performance of ETL software, parallel processing may be implemented; this has enabled the evolution of a number of methods for improving the overall performance of ETL processes when dealing with large volumes of data.

There are three main types of parallelism as implemented in ETL applications; a small sketch of the second type follows this list:
- Data: splitting a single sequential file into smaller data files to provide parallel access.
- Pipeline: allowing the simultaneous running of several components on the same data stream.
- Component: the simultaneous running of multiple processes on different data streams in the same job.
All three types of parallelism are usually combined in a single job [2], [3].
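The following is a minimal sketch of pipeline parallelism only, using Python threads and queues; the two components (a trimming step feeding a printing step) are placeholders for real ETL components operating on the same data stream.

```python
import threading
import queue

def stage(worker, in_q, out_q):
    """Run one pipeline component: read records from in_q,
    apply the worker function, pass results downstream via out_q."""
    while True:
        record = in_q.get()
        if record is None:              # sentinel marking end of the stream
            if out_q is not None:
                out_q.put(None)
            break
        result = worker(record)
        if out_q is not None:
            out_q.put(result)

# Hypothetical components standing in for transform/load steps.
extracted, cleaned = queue.Queue(), queue.Queue()
threads = [
    threading.Thread(target=stage, args=(str.strip, extracted, cleaned)),
    threading.Thread(target=stage, args=(print, cleaned, None)),
]
for t in threads:
    t.start()

# Upstream extraction feeds the shared data stream record by record.
for rec in ["  row-1 ", "  row-2 ", " row-3  "]:
    extracted.put(rec)
extracted.put(None)
for t in threads:
    t.join()
```

Because both stages run at the same time, a record can be printed while the next one is still being trimmed, which is exactly the overlap that pipeline parallelism exploits.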
VIII. A NOVEL TOOL FOR ETL USING PIPELINING

A. Problem with the existing ETL design

The major problem with the existing ETL models is the time taken by the process. Several models have been introduced to reduce this time span. In the previous sections we have discussed the ETL process and the challenges it faces. Every new design or model has been developed with the aim of reducing the time taken to extract, transform and load data into the data warehouse.

B. Solution: a new design based on the concept of nested pipelining

To perform extraction, transformation and loading for a data warehouse, we use pipelining. Pipelining is a technique in which the complete process is divided into several segments, all of which can work simultaneously. In our model we have divided the ETL process into 9 segments, as shown in Table I.

TABLE I
SEGMENTATION OF ETL PROCESS

Segment1   "Fmv"   Fill missing values
Segment2   "Smd"   Smooth noisy data
Segment3   "Ro"    Remove outliers
Segment4   "Rc"    Remove inconsistency
Segment5   "Sm"    Smoothing
Segment6   "Agg"   Aggregation
Segment7   "AC"    Attribute construction
Segment8   "Gen"   Generalization
Segment9   Load    Load the data into the warehouse

The data is divided into modules. During the first clock cycle, Module1 enters Segment1. During the second clock cycle, Module1 moves to Segment2 and Module2 enters Segment1. During the third clock cycle, Module1 enters Segment3, Module2 enters Segment2 and Module3 enters Segment1. In this way all the segments work together on a number of data modules simultaneously, as shown in Table II.

TABLE II
DATA MODULES IN SEGMENTATION

Clock cycle   Segment1   Segment2   Segment3   ...
1             Module1
2             Module2    Module1
3             Module3    Module2    Module1
...           ...        ...        ...

If we perform the ETL process in this way, we save a large number of clock cycles and speed up the process. The following calculation shows the speedup of the ETL process:

S = (n * tn) / ((k + n - 1) * tp)

where
n  = number of tasks,
tn = time to execute a task in the non-pipelined case,
k  = number of segments,
tp = time to execute a sub-operation in one segment of the pipeline.

As the number of tasks increases, n becomes much larger than k - 1, and k + n - 1 approaches the value of n. Under this condition the speedup becomes S = tn / tp. If we assume that the time taken to process a task is the same in the pipelined and non-pipelined circuits, we have tn = k * tp. With this assumption, the speedup reduces to S = k * tp / tp = k.

We can demonstrate this with an example. Let the time taken to process a sub-operation in each segment be tp = 20 ns, and assume that the pipeline has k = 9 segments and executes n = 100 tasks in sequence. The pipelined system takes (k + n - 1) * tp = (9 + 99) * 20 = 2160 ns to complete the tasks. Assuming tn = k * tp = 9 * 20 = 180 ns, a non-pipelined system requires n * k * tp = 100 * 180 = 18000 ns to complete the 100 tasks. The speedup ratio is 18000 / 2160, which is approximately 8.33, so we have sped up the process by approximately 9 times.
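As a quick check of the figures above, the speedup formula can be evaluated directly; this small Python snippet simply restates the calculation and is not part of the proposed tool.

```python
def pipeline_speedup(k: int, n: int, tp: float) -> float:
    """Speedup S = (n * tn) / ((k + n - 1) * tp), with tn = k * tp."""
    tn = k * tp                      # non-pipelined time per task
    t_pipe = (k + n - 1) * tp        # total pipelined time for n tasks
    t_nonpipe = n * tn               # total non-pipelined time for n tasks
    return t_nonpipe / t_pipe

# Parameters from the example above: 9 segments, 100 tasks, 20 ns per segment.
print(pipeline_speedup(k=9, n=100, tp=20))   # ~8.33, approaching k = 9 as n grows
```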
and Liu, X., "Data mining from 1994 to 2004: an application-oriented review", International Journal of Business Intelligence and Data Mining, Vol.1, No.1, pp.4-11, 2005. [5] [6] [7] [8] [9] [10] [11] [12] [13] [14] [15] Singh, D. R. Y. and Chauhan, A. S., "Neural Networks In Data Mining", Journal of Theoretical and Applied Information Technology, Vol.5, No.6, pp.36-42 (2009). C.M. Roberts, "Radio frequency identification (RFID)", Computers & Security, Vol.25, pp. 18 – 26 (2006). Aboalsamh, H. A., "A novel Boolean algebraic framework for association and pattern mining", WSEAS Transactions on Computers, Vol.7, No.8, pp.1352-1361 (2008). Sathiyamoorthi, V. and Bhaskaran, V.M., "Data Mining for Intelligent Enterprise Resource Planning System", International Journal of Recent Trends in Engineering, Vol. 2, No. 3,pp.1-5 (2009). Floriana Esposito,” A Comparative analysis of methods for pruning Decision trees,”IEEE Transaction on pattern analysis and pattern matching vol 19, 1997. http://en.wikipedia.org/wiki/Etl#Transform http://en.wikipedia.org/wiki/Data_profiling J.A. Blakeley, N. Coburn, and P.-A. Larson. Up-dating derived relations: etecting irrelevant and autonomously computable updates. ACM Transactions on Database Systems, 14(3):369{400, September 1989 P. Vassiliadis, C. Quix, Y. Vassiliou, M. Jarke. Data Warehouse Process Management.cInformation Systems, vol. 26, no.3, pp. 205-236, June 2001 T. Stöhr, R. Müller, E. Rahm. An Integrative and Uniform Model for Metadata Management in Data Warehousing Environments. In Proceedings of the International Workshop on Design and Management of Data Warehouses (DMDW’99), pp. 12.1- 12.16, Heidelberg, Germany, 1999. M. Jarke, M. Lenzerini, Y. Vassiliou, P. Vassiliadis (eds.). Fundamentals of Data Warehouses. 2nd Edition, Springer-Verlag, Germany,.