A Framework for Revising the Emerging Data Warehouses

N. T. Radha #1, Vempada Sarada *2
1 Associate Professor, 2 Final M.Tech Student
1,2 Dept. of CSE, Pydah College of Engineering and Technology, Boyapalem, Visakhapatnam, AP, India
Abstract: In stream warehousing, many problems arise while manipulating resources and applying updates, because resources are used at scale, i.e., each resource is utilized many times. This leads to the problem of keeping the resource systems refreshed. In this paper we introduce a framework that handles different types of update jobs according to their priority and efficiency, with consistency over time intervals; the proposed system schedules jobs based on priority and thereby improves their efficiency and performance.
The most common mechanism is to temporally partition each view according to a record timestamp, so that only those partitions affected by new data need to be updated. We explain the propagation of new data from a raw source through two levels of derived views, where the root and derived tables are all temporally partitioned by some criterion. We also trace the flow of data from a source partition to a dependent partition and discuss how to infer these data flows. In our example, two updated partitions in the base table trigger updates to only two partitions in the first materialized view and two partitions in the second materialized view.
I. INTRODUCTION
A data stream management system (DSMS) is a computer program that manages continuous data streams. It is analogous to a database management system (DBMS), which is designed for static data in conventional databases and provides flexible query processing so that an information need can be expressed using queries. In contrast to a DBMS, however, a DSMS executes continuous queries: a query is not performed only once but is installed permanently and executed continuously until it is explicitly uninstalled. In most DSMSs, a continuous query produces new results as long as new data arrive. This basic concept is similar to complex event processing, so the two technologies partially overlap.
One of the biggest challenges for a DSMS is to handle potentially unbounded data streams using a fixed amount of memory and no random access to the data. There are many approaches to limiting the amount of data handled in one pass, and they can be divided into two types: compression techniques, which try to summarize the data, and window techniques, which try to partition the data into (finite) parts.
The data flow through a stream warehouse begins with raw data and application summaries, which are loaded into a first level of materialized views, often called base tables. Updates to the base tables then propagate through the remaining materialized views, called derived (child) tables. Views can store terabytes of data spanning years, so an efficient mechanism for updating them is a critical component of a stream warehouse.

[Figure: a base table feeding two levels of derived tables, loaded by an Extraction-Transformation-Loading (ETL) process.]

The most common conventional approach is an Extraction-Transformation-Loading (ETL) process run when the warehouse is quiescent. This clean separation between querying and updating is a fundamental assumption of conventional data warehousing applications, and it clearly simplifies several aspects of the implementation.
Immediate view maintenance may appear to be a reasonable solution for a streaming warehouse (deferred maintenance increases query response times, especially if high volumes of data arrive between queries, and periodic maintenance delays updates that arrive in the middle of the update period). Whenever new data arrive, we immediately update the corresponding table; once a table T has been updated, we trigger the updates of all materialized views sourced from T, followed by all views defined over those views. The disadvantage of this approach is that new data may arrive on multiple streams and there is no mechanism for limiting the number of tables that can be updated concurrently; running too many parallel updates can degrade performance due to memory thrashing, CPU-cache thrashing, and disk-arm thrashing. This motivates the need for a scheduler that limits the number of concurrent updates and determines which job to schedule next.
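To make the concurrency concern concrete, here is a minimal sketch (our own illustration, not the paper's system; the table names, dependency map, and worker limit are assumed) of propagating an update through a view hierarchy with a bounded number of parallel update jobs:

# A minimal sketch of immediate view maintenance with bounded concurrency.
# The dependency map, table names, and worker limit are illustrative.
from concurrent.futures import ThreadPoolExecutor

# children[t] lists the materialized views sourced directly from table t
children = {"base": ["view1", "view2"], "view1": ["view3"],
            "view2": [], "view3": []}

def update_table(name):
    print(f"updating {name}")   # stand-in for the real refresh work

def propagate(root, max_workers=2):
    """Update `root`, then its dependents level by level, never running
    more than `max_workers` updates at once."""
    frontier = [root]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        while frontier:
            list(pool.map(update_table, frontier))   # bounded parallelism
            frontier = sorted({c for t in frontier for c in children[t]})

propagate("base")

Without the max_workers bound, a burst of arrivals on many streams would launch arbitrarily many parallel updates, which is exactly the thrashing problem described above.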
A common characteristic of many real-time systems is that their requirements specification includes timing information in the form of deadlines. The completion time of an event is mapped to the "value" this event has to the system, which is loosely defined to mean the contribution the event makes to the system's objectives. For a computational event, this value is zero before the start time and returns to zero once the deadline has passed; the mapping of time to value between start time and deadline is application dependent (and may be as simple as a constant).
A system consists of one or more nodes, each node being an autonomous computing system; this is termed the internal environment. The environment of a node can contain other nodes (connected via an arbitrary network), interfaces, and actuators. The nodes in a system interact to achieve a common objective; for a real-time application, this objective is typically the control and monitoring of the system's external environment.
Processes or tasks (the terms are used interchangeably in this paper) form the logical units of computation in a processor, and every process has a single thread of control.
In the development of application programs it is usual to map system timing requirements onto process deadlines. The problem of meeting deadlines therefore becomes one of process scheduling, with two distinct forms of process structure being immediately isolated:
(a) periodic processes;
(b) aperiodic processes.
Periodic processes, as their name implies, execute on a regular basis and are characterized by
(i) their period;
(ii) their required execution time (per period).
The execution time may be given in terms of an average measurement and/or a worst-case execution time. For example, a periodic process may need to execute every second using, on average, 100 ms of CPU time; this may rise to 300 ms in extreme instances. Aperiodic process activation, in contrast, is usually triggered by an action external to the system. Often these processes deal with critical events in the system's environment, and hence their deadlines are particularly important.
In general, aperiodic processes are viewed as being activated randomly. It is not possible to do worst-case analysis for them (there is a finite probability of any number of aperiodic events occurring), and as a result aperiodic processes cannot have hard deadlines. To allow worst-case calculations to be made, a minimum period between any two aperiodic events (from the same source) is often defined; if this is the case, the process involved is said to be sporadic.
The maximum response time of a process is termed its deadline. Between the invocation of a process and its deadline, the process requires a certain amount of computation time to complete its execution. Computation times may be static, or may vary within a given maximum and minimum. The computation time must not be greater than the time between the invocation and the deadline of a process. Hence, the following relationship should hold for all processes:
C ≤ D
where
C - computation time
D - deadline
For each periodic process, its period must be at least equal to its deadline. The requirement that a process complete before its successive invocation is termed a runnability constraint. Hence, for periodic processes, the following relation holds:
C ≤ D ≤ T
where
T - period of the process.
Non-periodic processes can be invoked at any time.
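As a simple illustration, the two constraints can be checked mechanically; the process names and timing numbers below are hypothetical:

# Trivial check of the runnability constraint C <= D <= T for a set of
# periodic processes; all numbers are hypothetical (times in seconds).
processes = [
    {"name": "ctrl", "C": 0.1, "D": 0.5, "T": 1.0},  # 100 ms every second
    {"name": "log",  "C": 0.3, "D": 1.0, "T": 1.0},
]
for p in processes:
    assert p["C"] <= p["D"] <= p["T"], f'{p["name"]} violates C <= D <= T'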
A process can block itself by requesting a resource that is unavailable, for example when it performs a "wait" on a semaphore guarding a critical region that is currently occupied, or when it performs a "delay" or other such operation. Tasks whose progress is not dependent upon the progress of other processes are termed independent; this definition discounts competition for processor time between processes as a dependency. Interrelated processes can interact in many ways, including communication and precedence relationships.
II. RELATED WORK

A. Scheduling Model
In the warehouse model, every table i receives updates from an external source at times ri,1 < ri,2 < · · · < ri,ki, with 0 < ri,1; define ri,0 = 0. These times are unknown in advance to the algorithm. The update arriving at time ri,j contains data for the time period (ri,j−1, ri,j], and the data do not become available to be processed until time ri,j. We define the release time (or arrival time) of the update to be ri,j and the start time of the update to be ri,j−1; that is, the update contains data for the time period (start time, release time]. We define the length Li,j of the jth update to table i to be ri,j − ri,j−1.

Let t be the number of tables and p ≤ t the number of available identical processors. At any time, an idle processor may choose to update any table that has at least one pending update and is not currently being updated by another processor. Suppose that we are at time τ and table i, currently up to time ri,j, is picked. Because it is cheaper to execute one update of length L than l updates of length L/l each, all the pending updates for table i with release times in the interval (ri,j, τ] are processed non-preemptively as a single batch. Any update for table i that is released after the start of the processing of a batch is not included in that batch; it waits its turn to be processed in the next batch.
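As a hedged illustration of the batching rule (the function and variable names are our own, not the paper's), the following sketch collects the pending updates for one table into a single non-preemptive batch:

# Illustrative sketch of the batching rule: at time tau, an idle processor
# picks table i and processes, as one non-preemptive batch, every pending
# update whose release time lies in (current, tau].
def form_batch(release_times, current, tau):
    """release_times: sorted r_{i,1} < r_{i,2} < ...; current: the time
    the table is currently up to date to; returns (batch, batch length)."""
    batch = [r for r in release_times if current < r <= tau]
    if not batch:
        return [], 0
    length = batch[-1] - current   # batch covers data for (current, batch[-1]]
    return batch, length

# Example: table current to time 3; updates released at 1, 3, 5, 7, 9; tau = 8.
print(form_batch([1, 3, 5, 7, 9], current=3, tau=8))   # ([5, 7], 4)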
The wait interval of a batch is the time between the arrival of its first update and the time at which it starts being processed; additional updates may arrive during the wait interval. Its processing time is the length of its processing (execution) interval, i.e., the time during which it is actually being processed, and its wait time is the duration between the release of its first update and the time when processing of that batch begins. We assume that the processing time of a batch is at most proportional to the quantity of new data, with a constant of proportionality that depends on the table (since an hour's worth of updates to table i might be far more voluminous than an hour's worth for table i′); for each table i we define a real αi ≤ 1 such that processing a batch of length L takes time at most αiL. (That processing an update containing an hour's worth of data should take significantly less than one hour implies that αi should be at most 1.)

In order to keep up with the inputs, we assume that there is an α ≤ p/t such that each αi ≤ α. To see why, consider the t tables from time 0 to some time T: together they receive data for tT time units. In order that the processors not fall increasingly far behind, it is necessary that the p processors be able to process tT units of data in T time units. Since T time units of data on table i can be processed in Tαi time, we need Σi=1..t (Tαi) ≤ pT, i.e., Σi αi ≤ p. If one wants a bound in terms of the maximum αi, one needs to impose tα ≤ p, where α = max αi; hence α ≤ p/t is required. We do not believe α ≤ p/t to be a real threshold for obtaining good algorithms; we think it is merely a weakness of our analysis. (Note: even when α > p/t, one conceivably could give an on-line algorithm which is constant competitive against the best off-line algorithm, since the adversary would also fall behind badly.)
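The keep-up argument can be checked numerically; the sketch below (our own illustration, with hypothetical αi values) contrasts the exact condition Σi αi ≤ p with the coarser max-based condition α ≤ p/t:

# Sketch of the keep-up condition: with t tables and p processors, the
# system can keep up only if sum(alpha_i) <= p; equivalently, in terms of
# alpha = max(alpha_i), one imposes t * alpha <= p, i.e. alpha <= p / t.
def can_keep_up(alphas, p):
    return sum(alphas) <= p

alphas = [0.4, 0.3, 0.6, 0.5]          # hypothetical per-table constants
p = 2                                   # processors
print(can_keep_up(alphas, p))           # True: 1.8 <= 2
print(max(alphas) <= p / len(alphas))   # False: the max-based test is stricter

Here the sum-based test passes while the max-based test fails, illustrating that α ≤ p/t is the more conservative requirement.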
B. Staleness and Stretch

Recall from Section I that at any time τ, the staleness Si(τ) of table i is defined to be τ − r, where the table is current up to time r. A warehouse continuously receives new data and must keep its tables “fresh” at all times. Suppose that table i is initialized at time ri,0 = 0 and that a processor becomes available for table i at time s, when the first update, which contains data for the interval (ri,0, ri,1], is available to be processed. This update takes time at most αi(ri,1 − ri,0) and finishes at some time f ≤ s + αi(ri,1 − ri,0). From the initial time until the processing of the first update finishes at time f, the staleness increases linearly with slope 1, starting at 0 and ending at time f with value f. When the processing of that update finishes at time f, the staleness drops to f − ri,1, since at time f the table is current up to time ri,1.

Immediately after time f the staleness increases linearly, with slope 1, once again. Note that total staleness is simply the area under the staleness curve. Now suppose no processor is available again until some time s′, by which point two more updates have arrived: one containing data for (ri,1, ri,2] and one containing data for (ri,2, ri,3]. The processor processes both updates as a single batch, finishing at some time f′ ≤ s′ + αi(ri,3 − ri,1). At time f′ the staleness drops to f′ − ri,3, since the table is then current up to time ri,3. It is important to note that the staleness function would not change if, instead of these two updates, a single update containing data for (ri,1, ri,3] had been released at time ri,3.
Traditionally, the flow time of a job is defined as the difference between its completion time and its release time, and its stretch is the flow time divided by its length. In our setting, updates start accumulating data before they are released, which affects the staleness of the corresponding table. We therefore define the flow time of the update released at time ri,j to be f − ri,j−1, where f is the time at which processing of the batch containing this update finishes (the completion time of a batch being the time when its processing finishes); that is, its completion time minus its start time, not its completion time minus its release time. We then define the stretch to be the maximum, over all updates, of the flow time of the update divided by its length. Stretch captures how much additional staleness is accrued while an update is waiting and being processed. In the example above, the stretch of the first update is (f − ri,0)/(ri,1 − ri,0), the stretch of the second update is (f′ − ri,1)/(ri,2 − ri,1), and the stretch of the third update is (f′ − ri,2)/(ri,3 − ri,2).
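The staleness and stretch arithmetic of this example can be reproduced directly; the following sketch uses hypothetical release and finish times consistent with the definitions above:

# Sketch reproducing the example's staleness and stretch arithmetic.
# All concrete numbers are hypothetical; r[j] plays the role of r_{i,j}.
r = [0.0, 2.0, 4.0, 5.0]    # r_{i,0} .. r_{i,3}
f, f_prime = 3.0, 7.0       # finish times of the two batches

# Staleness grows with slope 1, then drops to
# (finish time - newest release covered by the batch).
print(f - r[1])             # 1.0: staleness just after the first batch
print(f_prime - r[3])       # 2.0: staleness just after the second batch

# Stretch of each update = flow time / length, flow time = finish - start time.
stretches = [(f - r[0]) / (r[1] - r[0]),         # first update:  1.5
             (f_prime - r[1]) / (r[2] - r[1]),   # second update: 2.5
             (f_prime - r[2]) / (r[3] - r[2])]   # third update:  3.0
print(max(stretches))       # the stretch of the schedule is the maximum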
C. Comparison with Scheduling Results

Previous scheduling results focus on individual job penalties, such as job deadlines, rather than table penalties, and employ no notion of batching, which is crucial to our result. Perhaps the most similar problem studied in the literature is that of minimizing the sum of squares of flow times, where flow time measures the total time a job spends in the system, including wait time; for this objective function no constant-competitive algorithm exists. The proof of nonexistence of a competitive algorithm for that problem relies on the fact that N jobs have to pay N penalties. In particular, consider a sequence of N consecutive unit-time jobs arriving starting at time 0 and ending at time N. Even if all N jobs could be (started and) completed instantaneously at time N, the penalty would be N² (for the first job) plus (N − 1)² (for the second), and so on, for a total penalty of Ω(N³). In our model, by contrast, the staleness of a table depends only on the time of the last update, and it increases linearly over time until the next batch of updates has been processed. Regardless of whether one long update or N unit-length updates (which will be batched together) are processed at time N, the total staleness is O(N²). Our model thus prevents an adversary from injecting a long stream of identical short jobs that would hugely penalize the algorithm.
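For completeness, the Ω(N³) total in the flow-time-squared setting follows from the standard sum-of-squares identity:

N² + (N − 1)² + · · · + 1² = N(N + 1)(2N + 1)/6 = Ω(N³)

whereas in our model the staleness accumulated by the same arrival sequence is bounded by the area of a triangle of height N, i.e., O(N²).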
III. PROPOSED MODEL
This section presents our scheduling framework. The key idea is to partition the update jobs by their expected processing time and to partition the available resources into tracks, where each track represents a fraction of the computing resources required by our complex jobs, including CPU and disk I/Os. Each job is placed in the queue corresponding to its assigned partition (track), where scheduling decisions are made by a local scheduler running a basic algorithm. Each job is executed on exactly one track, so tracks become a mechanism for limiting concurrency and for separating long jobs from short jobs (with the number of tracks being the limit on the number of concurrent jobs). We assume that the same type of basic scheduling algorithm is used for each track.
At this point, one may ask why we do not precisely measure resource utilization and adjust the level of parallelism accordingly. The answer is that it is difficult to determine performance bottlenecks in a complex server: performance may deteriorate even if resources are far from fully utilized. The difficulty of cleanly correlating resource use with performance leads us to schedule in terms of abstract tracks instead of carefully calibrated CPU and disk usage. Below we discuss basic algorithms, job partitioning strategies, and techniques for dealing with view hierarchies and transient overload.
The basic scheduling algorithms prioritize jobs to be executed on individual tracks and will serve as building blocks. The Earliest-Deadline-First (EDF) algorithm orders released jobs by proximity to their deadlines; it is known to be an optimal hard real-time scheduling algorithm for a single track if the jobs are preemptible [7]. Since our jobs are prioritized, applying EDF directly does not result in the best performance.
A. EDF-Partitioned Strategy

The EDF-partitioned algorithm assigns jobs to tracks in a way that ensures that each track has a feasible non-preemptive EDF schedule. Feasible means that if the local scheduler were to use the EDF algorithm to decide which job to schedule next, all jobs would meet their deadlines. We take the deadline of an update job to be its release time plus its period; that is, for each table we want to load every batch of new data before the next batch arrives. We need to introduce some additional terminology (Pi denotes the period of job Ji and Ei(Pi) its processing time):
 ui = Ei(Pi)/Pi: the utilization of job Ji. We assume that each job is completed before the next one arrives, and therefore that the amount of new data to load is proportional to the length of the period.
 Ur = Σi ui: the utilization of track r (summed over all jobs assigned to r).
 Emaxr = max{Ei(Pi) | Ji in track r}: the processing time of the longest job in track r.
 Pminr = min{Pi | Ji in track r}: the smallest period of all the jobs assigned to track r.
A recent result gives sufficient conditions for EDF scheduling of a set of non-preemptive jobs on multiple processors [4]. We use the simplified single-track conditions to obtain the EDF schedulability condition for track r: Ur ≤ 1 − Emaxr/Pminr.
Finding an optimal allocation of jobs to tracks is NP-hard, so we use a modification of the standard greedy heuristic: sort the jobs in order of increasing period, then allocate them to tracks, creating a new track whenever the schedulability condition would otherwise be violated. If job Ji is allocated to track r, then r is said to be its home track. If more processing tracks are available than are required by the update jobs, the remaining tracks are referred to as free tracks. Note that the EDF-partitioned strategy is compatible with any local algorithm for scheduling individual tracks, but the feasibility guarantee applies only if EDF is used as the local algorithm.
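A minimal sketch of the greedy allocation under the schedulability condition above (our own illustration; jobs are hypothetical (execution time, period) pairs):

# Greedy allocation of jobs to tracks under U_r <= 1 - Emax_r / Pmin_r.
# Jobs are (execution_time, period) pairs; all names here are our own.
def fits(track, job):
    jobs = track + [job]
    U = sum(e / p for e, p in jobs)
    Emax = max(e for e, _ in jobs)
    Pmin = min(p for _, p in jobs)
    return U <= 1 - Emax / Pmin

def allocate(jobs):
    tracks = []
    for job in sorted(jobs, key=lambda j: j[1]):   # increasing period
        for track in tracks:
            if fits(track, job):
                track.append(job)
                break
        else:
            tracks.append([job])                   # open a new track
    return tracks

print(allocate([(1, 5), (2, 10), (1, 4), (3, 20)]))
# [[(1, 4), (1, 5)], [(2, 10), (3, 20)]]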
A track is available if no job is executing in it (or has been allocated for execution); otherwise the track is unavailable. The dispatching algorithm is then the following (a sketch in code follows the list).
1. Sort the released jobs by the local algorithm.
2. For each job Ji in sorted order:
a. If Ji's home track is available, schedule Ji on its home track.
b. Else, if there is an available free track, schedule Ji on that free track.
c. Else, scan the tracks: if there is an available home track r such that no released job remaining in the sorted list has home track r, promote Ji to track r and schedule it there.
d. Else, delay the execution of Ji.
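As promised above, here is a sketch of the dispatching rules; the data structures (the home map, free_tracks, busy) are our own illustrative assumptions:

# Sketch of dispatching rules (a)-(d) above; names are illustrative.
def dispatch(sorted_jobs, home, free_tracks, busy):
    """Return {job: track} assignments for this scheduling round."""
    placed = {}
    free = set(free_tracks)
    for j, job in enumerate(sorted_jobs):
        h = home[job]
        if h not in busy and h not in placed.values():
            placed[job] = h                        # rule (a): home track
        elif free:
            placed[job] = free.pop()               # rule (b): a free track
        else:
            # rule (c): promote to an idle home track that no remaining
            # released job in the sorted list still needs
            remaining = {home[k] for k in sorted_jobs[j + 1:]}
            candidates = (set(home.values()) - set(busy)
                          - set(placed.values()) - remaining - {h})
            if candidates:
                placed[job] = candidates.pop()
            # else rule (d): the job is simply delayed this round
    return placed

print(dispatch(["J1", "J2"], {"J1": "t1", "J2": "t1"},
               free_tracks=[], busy=set()))
# {'J1': 't1'} -- J2 is delayed: t1 is taken and no free/promotable track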
After ordering the jobs, we group similar jobs together. Grouping similar jobs reduces the streaming time for updating a job and the overall processing time. The k-medoids algorithm is used for grouping the jobs; it is described below.
The k-medoids algorithm is a clustering algorithm related to the k-means algorithm and the medoid-shift algorithm. Both k-means and k-medoids are partitional (they break the dataset up into groups), and both attempt to minimize the distance between points labelled to be in a cluster and a point designated as the center of that cluster. In contrast to the k-means algorithm, k-medoids chooses data points as centres (medoids or exemplars) and works with an arbitrary matrix of distances between data points rather than the l2 norm. k-medoids is an efficient clustering technique that clusters a data set of n objects into k clusters known a priori, and it is more robust to noise and outliers than k-means because it minimizes a sum of pairwise dissimilarities instead of a sum of squared Euclidean distances.
A medoid can be defined as the object of a cluster whose average dissimilarity to all the objects in the cluster is minimal, i.e., the most centrally located point in the cluster. The most common realisation of k-medoids clustering is the Partitioning Around Medoids (PAM) algorithm, which is as follows (a sketch in code follows the list):
1. Initialize: randomly select k of the n data points as the medoids.
2. Associate each data point with the closest medoid.
3. For each medoid m:
   For each non-medoid data point o:
      Swap m and o and compute the total cost of the configuration.
4. Select the configuration with the lowest cost.
5. Repeat steps 2 to 4 until there is no change in the medoids.
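A minimal PAM sketch over job execution times, following the steps above (the data, dissimilarity measure, and initialization are illustrative assumptions):

# Minimal PAM (k-medoids) sketch: random initialization, absolute
# difference of execution times as the dissimilarity.
import random

def total_cost(points, medoids):
    return sum(min(abs(p - m) for m in medoids) for p in points)

def pam(points, k, seed=0):
    random.seed(seed)
    medoids = random.sample(points, k)                 # step 1
    while True:
        best = (total_cost(points, medoids), medoids)  # steps 2-3 via cost
        for m in medoids:                              # try every swap
            for o in points:
                if o in medoids:
                    continue
                cand = [o if x == m else x for x in medoids]
                cost = total_cost(points, cand)
                if cost < best[0]:
                    best = (cost, cand)                # step 4: lowest cost
        if best[1] == medoids:                         # step 5: no change
            return medoids
        medoids = best[1]

exec_times = [1.0, 1.2, 0.9, 5.0, 5.5, 4.8]            # hypothetical job times
print(sorted(pam(exec_times, k=2)))                    # e.g. one medoid per group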
We apply this clustering algorithm to the execution times of the jobs: jobs with similar execution times are clustered based on their utilization times, the jobs in a track are kept in sorted order, and the jobs are then executed priority-wise. We selected this clustering algorithm because we only need the similarity between jobs. Placing similar jobs in the same track means that the manipulations we have to perform on the tables can be done in less time, and the derived tables of the root table can also be updated in less time. This improves processing efficiency and the cost of performance, and it also saves CPU memory.
Materialized view hierarchies can make proper prioritization of jobs difficult. For example, if a high-priority view is sourced from a low-priority view, it cannot be updated until the source view is, which might take a long time since the source view has low priority. Source views therefore need to inherit the priority of the views defined over them. Let IPi be the inherited priority of table Ti. We explore three ways of inheriting priority (a sketch in code follows the list):
 Sum: IPi is the sum of the priorities of its dependent views (including itself).
 Max: IPi is the maximum of the priorities of its dependent views (including itself).
 Max-plus: IPi is K times the maximum of the priorities of its dependent views (including itself), for some K > 1.0. A large value of K increases the priority of base tables relative to derived tables, especially those derived tables that have a long chain of ancestors.
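The three inheritance rules can be stated compactly; in the sketch below, the view hierarchy, the priorities, and K are all hypothetical:

# Sketch of the three priority-inheritance rules over a view hierarchy.
# deps[t] lists the views defined directly over table t.
deps = {"base": ["v1", "v2"], "v1": ["v3"], "v2": [], "v3": []}
prio = {"base": 1, "v1": 2, "v2": 8, "v3": 9}

def dependents(t):
    """All views transitively sourced from t, including t itself."""
    out = {t}
    for c in deps[t]:
        out |= dependents(c)
    return out

def inherited(t, rule, K=1.5):
    ps = [prio[x] for x in dependents(t)]
    if rule == "sum":
        return sum(ps)
    if rule == "max":
        return max(ps)
    return K * max(ps)                   # "max-plus"

print(inherited("base", "sum"))          # 20
print(inherited("base", "max"))          # 9
print(inherited("base", "max-plus"))     # 13.5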
IV. CONCLUSION
In our framework, the available resources are used effectively for the short jobs, and we introduce a clustering algorithm for easy grouping of the jobs. For each job we analyse the execution time and the utilization, and the partitioning works efficiently for ordering the jobs. The main theme of this framework is maintaining the freshness of tables and their derived tables, and it works efficiently in complex environments. The complexity of the calculations is also low, and the framework has been tested manually.
REFERENCES
[1] B. Adelberg, H. Garcia-Molina, and B. Kao, “Applying
Update Streams in a Soft Real-Time Database System,”
Proc. ACM SIGMOD Int’l Conf. Management of Data, pp.
245-256, 1995.
[2] B. Babcock, S. Babu, M. Datar, and R. Motwani,
“Chain: Operator Scheduling for Memory Minimization in
Data Stream Systems,” Proc. ACM SIGMOD Int’l Conf.
Management of Data, pp. 253-264, 2003.
[3] S. Babu, U. Srivastava, and J. Widom, “Exploiting K-constraints to Reduce Memory Overhead in Continuous Queries over Data Streams,” ACM Trans. Database Systems, vol. 29, no. 3, pp. 545-580, 2004.
[4] S. Baruah, “The Non-preemptive Scheduling of Periodic Tasks upon Multiprocessors,” Real Time Systems, vol. 32, nos. 1/2, pp. 9-20, 2006.
[5] S. Baruah, N. Cohen, C. Plaxton, and D. Varvel,
“Proportionate Progress: A Notion of Fairness in Resource
Allocation,” Algorithmica, vol. 15, pp. 600-625, 1996.
[6] M.H. Bateni, L. Golab, M.T. Hajiaghayi, and H.
Karloff, “Scheduling to Minimize Staleness and Stretch in
Real-time Data Warehouses,” Proc. 21st Ann. Symp.
Parallelism in Algorithms and Architectures (SPAA), pp.
29-38, 2009.
[7] A. Burns, “Scheduling Hard Real-Time Systems: A
Review,” Software Eng. J., vol. 6, no. 3, pp. 116-128,
1991.
[8] D. Carney, U. Cetintemel, A. Rasin, S. Zdonik, M. Cherniack, and M. Stonebraker, “Operator Scheduling in a Data Stream Manager,” Proc. 29th Int’l Conf. Very Large Data Bases (VLDB), pp. 838-849, 2003.
[9] J. Cho and H. Garcia-Molina, “Synchronizing a
Database to Improve Freshness,” Proc. ACM SIGMOD
Int’l Conf. Management of Data, pp. 117-128, 2000.
[10] L. Colby, A. Kawaguchi, D. Lieuwen, I. Mumick, and
K. Ross, “Supporting Multiple View Maintenance
Policies,” Proc. ACM SIGMOD Int’l Conf. Management of
Data, pp. 405-416, 1997.
[11] M. Dertouzos and A. Mok, “Multiprocessor On-Line Scheduling of Hard-Real-Time Tasks,” IEEE Trans. Software Eng., vol. 15, no. 12, pp. 1497-1506, Dec. 1989.
[12] U. Devi and J. Anderson, “Tardiness Bounds under
Global EDF Scheduling,” Real-Time Systems, vol. 38, no.
2, pp. 133-189, 2008.
[13] N. Folkert, A. Gupta, A. Witkowski, S. Subramanian, S. Bellamkonda, S. Shankar, T. Bozkaya, and L. Sheng, “Optimizing Refresh of a Set of Materialized Views,” Proc. 31st Int’l Conf. Very Large Data Bases (VLDB), pp. 1043-1054, 2005.
[14] M. Garey and D. Johnson, Computers and
Intractability: A Guide to the Theory of NP-Completeness.
W.H. Freeman, 1979.
[15] L. Golab, T. Johnson, J.S. Seidel, and V. Shkapenyuk, “Stream Warehousing with DataDepot,” Proc. 35th ACM SIGMOD Int’l Conf. Management of Data, pp. 847-854, 2009.
[16] L. Golab, T. Johnson, and V. Shkapenyuk, “Scheduling Updates in a Real-Time Stream Warehouse,” Proc. IEEE 25th Int’l Conf. Data Eng. (ICDE), pp. 1207-1210, 2009.
[17] H. Guo, P.A. Larson, R. Ramakrishnan, and J.
Goldstein, “Relaxed Currency and Consistency: How to
Say ‘Good Enough’ in SQL,” Proc. ACM SIGMOD Int’l
Conf. Management of Data, pp. 815-826, 2004.
BIOGRAPHIES
Vempada Sarada completed her B.Tech in Information Technology at Avanthi Institute of Engineering and Technology, Visakhapatnam, and is currently pursuing an M.Tech in Computer Science at Pydah College of Engineering and Technology, Visakhapatnam, Andhra Pradesh. Her research areas are data mining and network security.

N. Tulasi Radha is an Associate Professor and HOD in the Department of CSE and IT, Pydah College of Engineering and Technology, Visakhapatnam, AP. She received her B.Tech from GITAM College of Engineering and her M.Tech from JNTU, Kakinada. Her areas of interest are network security and Human-Computer Interaction.