A Framework for Revising the Emerging Data Warehouses

N. T. Radha #1, Vempada Sarada *2
1 Associate Professor, 2 Final M.Tech Student
1,2 Dept. of CSE, Pydah College of Engineering and Technology, Boyapalem, Visakhapatnam, AP, India
Abstract: In stream warehousing, many problems arise while manipulating resources and applying updates, because resources are used at scale, i.e., each resource is utilized many times. This leads to the problem of keeping the resource systems refreshed. In this paper we introduce a framework that handles different types of update jobs according to their priority and efficiency, with consistency over time intervals; the proposed system schedules jobs based on priority and thereby improves their efficiency and performance.
The most common mechanism is to temporally partition each view according to a record timestamp, so that only those partitions affected by new data need to be updated. We explain the propagation of new data from a raw source through two levels of derived views, where the root and derived tables are all temporally partitioned by some criterion. We also trace the flow of data from a source partition to a dependent partition and discuss how to infer these data flows. In our example, two updated partitions in the base table trigger updates to only two partitions in the first materialized view and two partitions in the second materialized view.
I. INTRODUCTION
A data stream management system (DSMS) is a computer program that manages continuous data streams. It is analogous to a database management system (DBMS), which is designed for static data in conventional databases and provides flexible query processing so that an information need can be expressed using queries. In contrast to a DBMS, however, a DSMS executes continuous queries: a query is not performed only once but is installed permanently and executed continuously until it is explicitly uninstalled. In most DSMSs, a continuous query produces new results as long as new data arrive. This basic concept is similar to complex event processing, so the two technologies partially overlap.
One of the biggest challenges for a DSMS is to handle potentially unbounded data streams using a fixed amount of memory and no random access to the data. There are many approaches to limiting the amount of data handled in one pass, and they can be divided into two types: compression techniques, which try to summarize the data, and window techniques, which try to partition the data into (finite) parts.
The data flow through a stream warehouse begins with raw data and application summaries, which are loaded into a first level of materialized views, often called base tables. Updates to the base tables then propagate through the remaining materialized views, called derived (child) tables. Views can store terabytes of data spanning years, so an efficient mechanism for updating them is a critical component of a stream warehouse.

[Figure: a base table feeding two levels of derived tables, loaded by an Extraction-Transformation-Loading (ETL) process.]

The most common conventional approach is an Extraction-Transformation-Loading (ETL) process run when the warehouse is quiescent. This clean separation between querying and updating is a fundamental assumption of conventional data warehousing applications, and it clearly simplifies several aspects of the implementation.
Immediate view maintenance may appear to be a reasonable solution for a streaming warehouse (deferred maintenance increases query response times, especially if high volumes of data arrive between queries, and periodic maintenance delays updates that arrive in the middle of the update period). Whenever new data arrive, we immediately update the corresponding table; once a table T has been updated, we trigger the updates of all materialized views sourced from T, followed by all views defined over those views. The disadvantage of this approach is that new data may arrive on multiple streams and there is no mechanism for limiting the number of tables that can be updated concurrently; running too many parallel updates can degrade performance due to memory thrashing, CPU-cache thrashing, and disk-arm thrashing. This motivates the need for a scheduler that limits the number of concurrent updates and determines which job to schedule next.
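To make the concurrency concern concrete, here is a minimal sketch (our own illustration, not the paper's system; the table names, dependency map, and worker limit are assumed) of propagating an update through a view hierarchy with a bounded number of parallel update jobs:

# A minimal sketch of immediate view maintenance with bounded concurrency.
# The dependency map, table names, and worker limit are illustrative.
from concurrent.futures import ThreadPoolExecutor

# children[t] lists the materialized views sourced directly from table t
children = {"base": ["view1", "view2"], "view1": ["view3"],
            "view2": [], "view3": []}

def update_table(name):
    print(f"updating {name}")   # stand-in for the real refresh work

def propagate(root, max_workers=2):
    """Update `root`, then its dependents level by level, never running
    more than `max_workers` updates at once."""
    frontier = [root]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        while frontier:
            list(pool.map(update_table, frontier))   # bounded parallelism
            frontier = sorted({c for t in frontier for c in children[t]})

propagate("base")

Without the max_workers bound, a burst of arrivals on many streams would launch arbitrarily many parallel updates, which is exactly the thrashing problem described above.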
A common characteristic of many real-time systems is that their requirements specification includes timing information in the form of deadlines. The completion time of an event is mapped to the "value" this event has to the system, which is loosely defined to mean the contribution the event makes to the system's objectives. For a computational event, this value is zero before the start time and returns to zero once the deadline has passed; the mapping of time to value between start time and deadline is application dependent (and may be as simple as a constant).
A system consists of one or more nodes, each node being an autonomous computing system; this is termed the internal environment. The environment of a node can contain other nodes (connected via an arbitrary network), interfaces, and actuators. The nodes in a system interact to achieve a common objective; for a real-time application, this objective is typically the control and monitoring of the system's external environment.
Processes or tasks (the terms are used interchangeably in this paper) form the logical units of computation in a processor, and every process has a single thread of control.
In the development of application programs it is usual to map system timing requirements onto process deadlines. The problem of meeting deadlines therefore becomes one of process scheduling, with two distinct forms of process structure being immediately isolated:
(a) periodic processes;
(b) aperiodic processes.
Periodic processes, as their name implies, execute on a regular basis and are characterized by
(i) their period;
(ii) their required execution time (per period).
The execution time may be given in terms of an average measurement and/or a worst-case execution time. For example, a periodic process may need to execute every second using, on average, 100 ms of CPU time; this may rise to 300 ms in extreme instances. Aperiodic process activation, in contrast, is usually triggered by an action external to the system. Often these processes deal with critical events in the system's environment, and hence their deadlines are particularly important.
In general, aperiodic processes are viewed as being activated randomly. It is not possible to do worst-case analysis for them (there is a finite probability of any number of aperiodic events occurring), and as a result aperiodic processes cannot have hard deadlines. To allow worst-case calculations to be made, a minimum period between any two aperiodic events (from the same source) is often defined; if this is the case, the process involved is said to be sporadic.
The maximum response time of a process is termed its deadline. Between the invocation of a process and its deadline, the process requires a certain amount of computation time to complete its execution. Computation times may be static, or may vary within a given maximum and minimum. The computation time must not be greater than the time between the invocation and the deadline of a process. Hence, the following relationship should hold for all processes:
C ≤ D
where
C - computation time
D - deadline
For each periodic process, its period must be at least equal to its deadline. The requirement that a process complete before its successive invocation is termed a runnability constraint. Hence, for periodic processes, the following relation holds:
C ≤ D ≤ T
where
T - period of the process.
Non-periodic processes can be invoked at any time.
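As a simple illustration, the two constraints can be checked mechanically; the process names and timing numbers below are hypothetical:

# Trivial check of the runnability constraint C <= D <= T for a set of
# periodic processes; all numbers are hypothetical (times in seconds).
processes = [
    {"name": "ctrl", "C": 0.1, "D": 0.5, "T": 1.0},  # 100 ms every second
    {"name": "log",  "C": 0.3, "D": 1.0, "T": 1.0},
]
for p in processes:
    assert p["C"] <= p["D"] <= p["T"], f'{p["name"]} violates C <= D <= T'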
A process can block itself by requesting a resource that is unavailable, for example when it performs a "wait" on a semaphore guarding a critical region that is currently occupied, or when it performs a "delay" or other such operation. Tasks whose progress is not dependent upon the progress of other processes are termed independent; this definition discounts competition for processor time between processes as a dependency. Interrelated processes can interact in many ways, including communication and precedence relationships.
II. RELATED WORK

A. Scheduling Model
In the warehouse model, every table i receives updates from an external source at times ri,1 < ri,2 < · · · < ri,ki, with 0 < ri,1; define ri,0 = 0. These times are unknown in advance to the algorithm. The update arriving at time ri,j contains data for the time period (ri,j−1, ri,j], and the data do not become available to be processed until time ri,j. We define the release time (or arrival time) of the update to be ri,j and the start time of the update to be ri,j−1; that is, the update contains data for the time period (start time, release time]. We define the length Li,j of the jth update to table i to be ri,j − ri,j−1.

Let t be the number of tables and p ≤ t the number of available identical processors. At any time, an idle processor may choose to update any table that has at least one pending update and is not currently being updated by another processor. Suppose that we are at time τ and table i, currently up to time ri,j, is picked. Because it is cheaper to execute one update of length L than l updates of length L/l each, all the pending updates for table i with release times in the interval (ri,j, τ] are processed non-preemptively as a single batch. Any update for table i that is released after the start of the processing of a batch is not included in that batch; it waits its turn to be processed in the next batch.
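As a hedged illustration of the batching rule (the function and variable names are our own, not the paper's), the following sketch collects the pending updates for one table into a single non-preemptive batch:

# Illustrative sketch of the batching rule: at time tau, an idle processor
# picks table i and processes, as one non-preemptive batch, every pending
# update whose release time lies in (current, tau].
def form_batch(release_times, current, tau):
    """release_times: sorted r_{i,1} < r_{i,2} < ...; current: the time
    the table is currently up to date to; returns (batch, batch length)."""
    batch = [r for r in release_times if current < r <= tau]
    if not batch:
        return [], 0
    length = batch[-1] - current   # batch covers data for (current, batch[-1]]
    return batch, length

# Example: table current to time 3; updates released at 1, 3, 5, 7, 9; tau = 8.
print(form_batch([1, 3, 5, 7, 9], current=3, tau=8))   # ([5, 7], 4)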
The wait interval of a batch is the time between the arrival of its first update and the time at which it starts being processed; additional updates may arrive during the wait interval. Its processing time is the length of its processing (execution) interval, i.e., the time during which it is actually being processed, and its wait time is the duration between the release of its first update and the time when processing of that batch begins. We assume that the processing time of a batch is at most proportional to the quantity of new data, with a constant of proportionality that depends on the table (since an hour's worth of updates to table i might be far more voluminous than an hour's worth for table i′); for each table i we define a real αi ≤ 1 such that processing a batch of length L takes time at most αiL. (That processing an update containing an hour's worth of data should take significantly less than one hour implies that αi should be at most 1.)

In order to keep up with the inputs, we assume that there is an α ≤ p/t such that each αi ≤ α. To see why, consider the t tables from time 0 to some time T: together they receive data for tT time units. In order that the processors not fall increasingly far behind, it is necessary that the p processors be able to process tT units of data in T time units. Since T time units of data on table i can be processed in Tαi time, we need Σi=1..t (Tαi) ≤ pT, i.e., Σi αi ≤ p. If one wants a bound in terms of the maximum αi, one needs to impose tα ≤ p, where α = max αi; hence α ≤ p/t is required. We do not believe α ≤ p/t to be a real threshold for obtaining good algorithms; we think it is merely a weakness of our analysis. (Note: even when α > p/t, one conceivably could give an on-line algorithm which is constant competitive against the best off-line algorithm, since the adversary would also fall behind badly.)
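The keep-up argument can be checked numerically; the sketch below (our own illustration, with hypothetical αi values) contrasts the exact condition Σi αi ≤ p with the coarser max-based condition α ≤ p/t:

# Sketch of the keep-up condition: with t tables and p processors, the
# system can keep up only if sum(alpha_i) <= p; equivalently, in terms of
# alpha = max(alpha_i), one imposes t * alpha <= p, i.e. alpha <= p / t.
def can_keep_up(alphas, p):
    return sum(alphas) <= p

alphas = [0.4, 0.3, 0.6, 0.5]          # hypothetical per-table constants
p = 2                                   # processors
print(can_keep_up(alphas, p))           # True: 1.8 <= 2
print(max(alphas) <= p / len(alphas))   # False: the max-based test is stricter

Here the sum-based test passes while the max-based test fails, illustrating that α ≤ p/t is the more conservative requirement.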
B. Staleness and Stretch

Recall from Section I that at any time τ, the staleness Si(τ) of table i is defined to be τ − r, where the table is current up to time r. A warehouse continuously receives new data and must keep its tables “fresh” at all times. Suppose that table i is initialized at time ri,0 = 0 and that a processor becomes available for table i at time s, when the first update, which contains data for the interval (ri,0, ri,1], is available to be processed. This update takes time at most αi(ri,1 − ri,0) and finishes at some time f ≤ s + αi(ri,1 − ri,0). From the initial time until the processing of the first update finishes at time f, the staleness increases linearly with slope 1, starting at 0 and ending at time f with value f. When the processing of that update finishes at time f, the staleness drops to f − ri,1, since at time f the table is current up to time ri,1.

Immediately after time f the staleness increases linearly, with slope 1, once again. Note that total staleness is simply the area under the staleness curve. Now suppose no processor is available again until some time s′, by which point two more updates have arrived: one containing data for (ri,1, ri,2] and one containing data for (ri,2, ri,3]. The processor processes both updates as a single batch, finishing at some time f′ ≤ s′ + αi(ri,3 − ri,1). At time f′ the staleness drops to f′ − ri,3, since the table is then current up to time ri,3. It is important to note that the staleness function would not change if, instead of these two updates, a single update containing data for (ri,1, ri,3] had been released at time ri,3.
Traditionally, the flow time of a job is defined as the difference between its completion time and its release time, and its stretch is the flow time divided by its length. In our setting, updates start accumulating data before they are released, which affects the staleness of the corresponding table. We therefore define the flow time of the update released at time ri,j to be f − ri,j−1, where f is the time at which processing of the batch containing this update finishes (the completion time of a batch being the time when its processing finishes); that is, its completion time minus its start time, not its completion time minus its release time. We then define the stretch to be the maximum, over all updates, of the flow time of the update divided by its length. Stretch captures how much additional staleness is accrued while an update is waiting and being processed. In the example above, the stretch of the first update is (f − ri,0)/(ri,1 − ri,0), the stretch of the second update is (f′ − ri,1)/(ri,2 − ri,1), and the stretch of the third update is (f′ − ri,2)/(ri,3 − ri,2).
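The staleness and stretch arithmetic of this example can be reproduced directly; the following sketch uses hypothetical release and finish times consistent with the definitions above:

# Sketch reproducing the example's staleness and stretch arithmetic.
# All concrete numbers are hypothetical; r[j] plays the role of r_{i,j}.
r = [0.0, 2.0, 4.0, 5.0]    # r_{i,0} .. r_{i,3}
f, f_prime = 3.0, 7.0       # finish times of the two batches

# Staleness grows with slope 1, then drops to
# (finish time - newest release covered by the batch).
print(f - r[1])             # 1.0: staleness just after the first batch
print(f_prime - r[3])       # 2.0: staleness just after the second batch

# Stretch of each update = flow time / length, flow time = finish - start time.
stretches = [(f - r[0]) / (r[1] - r[0]),         # first update:  1.5
             (f_prime - r[1]) / (r[2] - r[1]),   # second update: 2.5
             (f_prime - r[2]) / (r[3] - r[2])]   # third update:  3.0
print(max(stretches))       # the stretch of the schedule is the maximum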
C. Comparison with Scheduling Results

Previous scheduling results focus on individual job penalties, such as job deadlines, rather than table penalties, and employ no notion of batching, which is crucial to our result. Perhaps the most similar problem studied in the literature is that of minimizing the sum of squares of flow times, where flow time measures the total time a job spends in the system, including wait time; for this objective function no constant-competitive algorithm exists. The proof of nonexistence of a competitive algorithm for that problem relies on the fact that N jobs have to pay N penalties. In particular, consider a sequence of N consecutive unit-time jobs arriving starting at time 0 and ending at time N. Even if all N jobs could be (started and) completed instantaneously at time N, the penalty would be N² (for the first job) plus (N − 1)² (for the second), and so on, for a total penalty of Ω(N³). In our model, by contrast, the staleness of a table depends only on the time of the last update, and it increases linearly over time until the next batch of updates has been processed. Regardless of whether one long update or N unit-length updates (which will be batched together) are processed at time N, the total staleness is O(N²). Our model thus prevents an adversary from injecting a long stream of identical short jobs that would hugely penalize the algorithm.
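For completeness, the Ω(N³) total in the flow-time-squared setting follows from the standard sum-of-squares identity:

N² + (N − 1)² + · · · + 1² = N(N + 1)(2N + 1)/6 = Ω(N³)

whereas in our model the staleness accumulated by the same arrival sequence is bounded by the area of a triangle of height N, i.e., O(N²).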
III. PROPOSED MODEL
This section presents our scheduling framework. The key idea is to partition the update jobs by their expected processing time and to partition the available resources into tracks, where each track represents a fraction of the computing resources required by our complex jobs, including CPU and disk I/Os. Each job is placed in the queue corresponding to its assigned partition (track), where scheduling decisions are made by a local scheduler running a basic algorithm. Each job is executed on exactly one track, so tracks become a mechanism for limiting concurrency and for separating long jobs from short jobs (with the number of tracks being the limit on the number of concurrent jobs). We assume that the same type of basic scheduling algorithm is used for each track.
At this point, one may ask why we do not precisely measure resource utilization and adjust the level of parallelism accordingly. The answer is that it is difficult to determine performance bottlenecks in a complex server: performance may deteriorate even if resources are far from fully utilized. The difficulty of cleanly correlating resource use with performance leads us to schedule in terms of abstract tracks instead of carefully calibrated CPU and disk usage. Below we discuss basic algorithms, job partitioning strategies, and techniques for dealing with view hierarchies and transient overload.
The basic scheduling algorithms prioritize jobs to be executed on individual tracks and will serve as building blocks. The Earliest-Deadline-First (EDF) algorithm orders released jobs by proximity to their deadlines; it is known to be an optimal hard real-time scheduling algorithm for a single track if the jobs are preemptible [7]. Since our jobs are prioritized, applying EDF directly does not result in the best performance.
A. EDF-Partitioned Strategy

The EDF-partitioned algorithm assigns jobs to tracks in a way that ensures that each track has a feasible non-preemptive EDF schedule. Feasible means that if the local scheduler were to use the EDF algorithm to decide which job to schedule next, all jobs would meet their deadlines. We take the deadline of an update job to be its release time plus its period; that is, for each table we want to load every batch of new data before the next batch arrives. We need to introduce some additional terminology (Pi denotes the period of job Ji and Ei(Pi) its processing time):
 ui = Ei(Pi)/Pi: the utilization of job Ji. We assume that each job is completed before the next one arrives, and therefore that the amount of new data to load is proportional to the length of the period.
 Ur = Σi ui: the utilization of track r (summed over all jobs assigned to r).
 Emaxr = max{Ei(Pi) | Ji in track r}: the processing time of the longest job in track r.
 Pminr = min{Pi | Ji in track r}: the smallest period of all the jobs assigned to track r.
A recent result gives sufficient conditions for EDF scheduling of a set of non-preemptive jobs on multiple processors [4]. We use the simplified single-track conditions to obtain the EDF schedulability condition for track r: Ur ≤ 1 − Emaxr/Pminr.
Finding an optimal allocation of jobs to tracks is NP-hard, so we use a modification of the standard greedy heuristic: sort the jobs in order of increasing period, then allocate them to tracks, creating a new track whenever the schedulability condition would otherwise be violated. If job Ji is allocated to track r, then r is said to be its home track. If more processing tracks are available than are required by the update jobs, the remaining tracks are referred to as free tracks. Note that the EDF-partitioned strategy is compatible with any local algorithm for scheduling individual tracks, but the feasibility guarantee applies only if EDF is used as the local algorithm.
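A minimal sketch of the greedy allocation under the schedulability condition above (our own illustration; jobs are hypothetical (execution time, period) pairs):

# Greedy allocation of jobs to tracks under U_r <= 1 - Emax_r / Pmin_r.
# Jobs are (execution_time, period) pairs; all names here are our own.
def fits(track, job):
    jobs = track + [job]
    U = sum(e / p for e, p in jobs)
    Emax = max(e for e, _ in jobs)
    Pmin = min(p for _, p in jobs)
    return U <= 1 - Emax / Pmin

def allocate(jobs):
    tracks = []
    for job in sorted(jobs, key=lambda j: j[1]):   # increasing period
        for track in tracks:
            if fits(track, job):
                track.append(job)
                break
        else:
            tracks.append([job])                   # open a new track
    return tracks

print(allocate([(1, 5), (2, 10), (1, 4), (3, 20)]))
# [[(1, 4), (1, 5)], [(2, 10), (3, 20)]]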
A track is available if no job is executing in it (or has been allocated for execution); otherwise the track is unavailable. The dispatching algorithm is then the following (a sketch in code follows the list).
1. Sort the released jobs by the local algorithm.
2. For each job Ji in sorted order:
a. If Ji's home track is available, schedule Ji on its home track.
b. Else, if there is an available free track, schedule Ji on that free track.
c. Else, scan the tracks: if there is an available home track r such that no released job remaining in the sorted list has home track r, promote Ji to track r and schedule it there.
d. Else, delay the execution of Ji.
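As promised above, here is a sketch of the dispatching rules; the data structures (the home map, free_tracks, busy) are our own illustrative assumptions:

# Sketch of dispatching rules (a)-(d) above; names are illustrative.
def dispatch(sorted_jobs, home, free_tracks, busy):
    """Return {job: track} assignments for this scheduling round."""
    placed = {}
    free = set(free_tracks)
    for j, job in enumerate(sorted_jobs):
        h = home[job]
        if h not in busy and h not in placed.values():
            placed[job] = h                        # rule (a): home track
        elif free:
            placed[job] = free.pop()               # rule (b): a free track
        else:
            # rule (c): promote to an idle home track that no remaining
            # released job in the sorted list still needs
            remaining = {home[k] for k in sorted_jobs[j + 1:]}
            candidates = (set(home.values()) - set(busy)
                          - set(placed.values()) - remaining - {h})
            if candidates:
                placed[job] = candidates.pop()
            # else rule (d): the job is simply delayed this round
    return placed

print(dispatch(["J1", "J2"], {"J1": "t1", "J2": "t1"},
               free_tracks=[], busy=set()))
# {'J1': 't1'} -- J2 is delayed: t1 is taken and no free/promotable track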
After ordering the jobs, we group similar jobs together. Grouping similar jobs reduces the streaming time for updating a job and the overall processing time. The k-medoids algorithm is used for grouping the jobs; it is described below.
The k-medoids algorithm is a clustering algorithm related to the k-means algorithm and the medoid-shift algorithm. Both k-means and k-medoids are partitional (they break the dataset up into groups), and both attempt to minimize the distance between points labelled to be in a cluster and a point designated as the center of that cluster. In contrast to the k-means algorithm, k-medoids chooses data points as centres (medoids or exemplars) and works with an arbitrary matrix of distances between data points rather than the l2 norm. k-medoids is an efficient clustering technique that clusters a data set of n objects into k clusters known a priori, and it is more robust to noise and outliers than k-means because it minimizes a sum of pairwise dissimilarities instead of a sum of squared Euclidean distances.
A medoid can be defined as the object of a cluster whose average dissimilarity to all the objects in the cluster is minimal, i.e., the most centrally located point in the cluster. The most common realisation of k-medoids clustering is the Partitioning Around Medoids (PAM) algorithm, which is as follows (a sketch in code follows the list):
1. Initialize: randomly select k of the n data points as the medoids.
2. Associate each data point with the closest medoid.
3. For each medoid m:
   For each non-medoid data point o:
      Swap m and o and compute the total cost of the configuration.
4. Select the configuration with the lowest cost.
5. Repeat steps 2 to 4 until there is no change in the medoids.
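A minimal PAM sketch over job execution times, following the steps above (the data, dissimilarity measure, and initialization are illustrative assumptions):

# Minimal PAM (k-medoids) sketch: random initialization, absolute
# difference of execution times as the dissimilarity.
import random

def total_cost(points, medoids):
    return sum(min(abs(p - m) for m in medoids) for p in points)

def pam(points, k, seed=0):
    random.seed(seed)
    medoids = random.sample(points, k)                 # step 1
    while True:
        best = (total_cost(points, medoids), medoids)  # steps 2-3 via cost
        for m in medoids:                              # try every swap
            for o in points:
                if o in medoids:
                    continue
                cand = [o if x == m else x for x in medoids]
                cost = total_cost(points, cand)
                if cost < best[0]:
                    best = (cost, cand)                # step 4: lowest cost
        if best[1] == medoids:                         # step 5: no change
            return medoids
        medoids = best[1]

exec_times = [1.0, 1.2, 0.9, 5.0, 5.5, 4.8]            # hypothetical job times
print(sorted(pam(exec_times, k=2)))                    # e.g. one medoid per group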
We apply this clustering algorithm to the execution times of the jobs: jobs with similar execution times are clustered based on their utilization times, the jobs in a track are kept in sorted order, and the jobs are then executed priority-wise. We selected this clustering algorithm because we only need the similarity between jobs. Placing similar jobs in the same track means that the manipulations we have to perform on the tables can be done in less time, and the derived tables of the root table can also be updated in less time. This improves processing efficiency and the cost of performance, and it also saves CPU memory.
Materialized view hierarchies can make proper prioritization of jobs difficult. For example, if a high-priority view is sourced from a low-priority view, it cannot be updated until the source view is, which might take a long time since the source view has low priority. Source views therefore need to inherit the priority of the views defined over them. Let IPi be the inherited priority of table Ti. We explore three ways of inheriting priority (a sketch in code follows the list):
 Sum: IPi is the sum of the priorities of its dependent views (including itself).
 Max: IPi is the maximum of the priorities of its dependent views (including itself).
 Max-plus: IPi is K times the maximum of the priorities of its dependent views (including itself), for some K > 1.0. A large value of K increases the priority of base tables relative to derived tables, especially those derived tables that have a long chain of ancestors.
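The three inheritance rules can be stated compactly; in the sketch below, the view hierarchy, the priorities, and K are all hypothetical:

# Sketch of the three priority-inheritance rules over a view hierarchy.
# deps[t] lists the views defined directly over table t.
deps = {"base": ["v1", "v2"], "v1": ["v3"], "v2": [], "v3": []}
prio = {"base": 1, "v1": 2, "v2": 8, "v3": 9}

def dependents(t):
    """All views transitively sourced from t, including t itself."""
    out = {t}
    for c in deps[t]:
        out |= dependents(c)
    return out

def inherited(t, rule, K=1.5):
    ps = [prio[x] for x in dependents(t)]
    if rule == "sum":
        return sum(ps)
    if rule == "max":
        return max(ps)
    return K * max(ps)                   # "max-plus"

print(inherited("base", "sum"))          # 20
print(inherited("base", "max"))          # 9
print(inherited("base", "max-plus"))     # 13.5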
IV. CONCLUSION
In our framework, the available resources are used effectively for the short jobs, and we introduce a clustering algorithm for easy grouping of the jobs. For each job we analyse the execution time and the utilization, and the partitioning works efficiently for ordering the jobs. The main theme of this framework is maintaining the freshness of tables and their derived tables, and it works efficiently in complex environments. The complexity of the calculations is also low, and the framework has been tested manually.
REFERENCES
[1] B. Adelberg, H. Garcia-Molina, and B. Kao, “Applying
Update Streams in a Soft Real-Time Database System,”
Proc. ACM SIGMOD Int’l Conf. Management of Data, pp.
245-256, 1995.
[2] B. Babcock, S. Babu, M. Datar, and R. Motwani,
“Chain: Operator Scheduling for Memory Minimization in
Data Stream Systems,” Proc. ACM SIGMOD Int’l Conf.
Management of Data, pp. 253-264, 2003.
[3] S. Babu, U. Srivastava, and J. Widom, “Exploiting K-constraints to Reduce Memory Overhead in Continuous Queries over Data Streams,” ACM Trans. Database Systems, vol. 29, no. 3, pp. 545-580, 2004.
[4] S. Baruah, “The Non-preemptive Scheduling of Periodic Tasks upon Multiprocessors,” Real Time Systems, vol. 32, nos. 1/2, pp. 9-20, 2006.
[5] S. Baruah, N. Cohen, C. Plaxton, and D. Varvel,
“Proportionate Progress: A Notion of Fairness in Resource
Allocation,” Algorithmica, vol. 15, pp. 600-625, 1996.
[6] M.H. Bateni, L. Golab, M.T. Hajiaghayi, and H.
Karloff, “Scheduling to Minimize Staleness and Stretch in
Real-time Data Warehouses,” Proc. 21st Ann. Symp.
Parallelism in Algorithms and Architectures (SPAA), pp.
29-38, 2009.
[7] A. Burns, “Scheduling Hard Real-Time Systems: A
Review,” Software Eng. J., vol. 6, no. 3, pp. 116-128,
1991.
[8] D. Carney, U. Cetintemel, A. Rasin, S. Zdonik, M. Cherniack, and M. Stonebraker, “Operator Scheduling in a Data Stream Manager,” Proc. 29th Int’l Conf. Very Large Data Bases (VLDB), pp. 838-849, 2003.
[9] J. Cho and H. Garcia-Molina, “Synchronizing a
Database to Improve Freshness,” Proc. ACM SIGMOD
Int’l Conf. Management of Data, pp. 117-128, 2000.
[10] L. Colby, A. Kawaguchi, D. Lieuwen, I. Mumick, and
K. Ross, “Supporting Multiple View Maintenance
Policies,” Proc. ACM SIGMOD Int’l Conf. Management of
Data, pp. 405-416, 1997.
[11] M. Dertouzos and A. Mok, “Multiprocessor On-Line Scheduling of Hard-Real-Time Tasks,” IEEE Trans. Software Eng., vol. 15, no. 12, pp. 1497-1506, Dec. 1989.
[12] U. Devi and J. Anderson, “Tardiness Bounds under
Global EDF Scheduling,” Real-Time Systems, vol. 38, no.
2, pp. 133-189, 2008.
[13] N. Folkert, A. Gupta, A. Witkowski, S. Subramanian, S. Bellamkonda, S. Shankar, T. Bozkaya, and L. Sheng, “Optimizing Refresh of a Set of Materialized Views,” Proc. 31st Int’l Conf. Very Large Data Bases (VLDB), pp. 1043-1054, 2005.
[14] M. Garey and D. Johnson, Computers and
Intractability: A Guide to the Theory of NP-Completeness.
W.H. Freeman, 1979.
[15] L. Golab, T. Johnson, J.S. Seidel, and V. Shkapenyuk, “Stream Warehousing with DataDepot,” Proc. 35th ACM SIGMOD Int’l Conf. Management of Data, pp. 847-854, 2009.
[16] L. Golab, T. Johnson, and V. Shkapenyuk, “Scheduling Updates in a Real-Time Stream Warehouse,” Proc. IEEE 25th Int’l Conf. Data Eng. (ICDE), pp. 1207-1210, 2009.
[17] H. Guo, P.A. Larson, R. Ramakrishnan, and J.
Goldstein, “Relaxed Currency and Consistency: How to
Say ‘Good Enough’ in SQL,” Proc. ACM SIGMOD Int’l
Conf. Management of Data, pp. 815-826, 2004.
BIOGRAPHIES
Vempada Sarada completed her B.Tech in Information Technology at Avanthi Institute of Engineering and Technology, Visakhapatnam, and is currently pursuing an M.Tech in Computer Science at Pydah College of Engineering and Technology, Visakhapatnam, Andhra Pradesh. Her research areas are data mining and network security.

N. Tulasi Radha is an Associate Professor and HOD in the Department of CSE and IT, Pydah College of Engineering and Technology, Visakhapatnam, AP. She received her B.Tech from GITAM College of Engineering and her M.Tech from JNTU, Kakinada. Her areas of interest are network security and Human-Computer Interaction.