Applied Mathematics & Information Sciences

Efficient Data Placement Algorithm for Cloud-based Workflow
Peng Zhang 1,2,3, Guiling Wang 3, Yanbo Han 3, Jing Wang 3
1 Institute of Computing Technology, Chinese Academy of Sciences, 100049, China
2 Graduate School of the Chinese Academy of Sciences, 100190, China
3 North China University of Technology, 100041, China
Abstract
While cloud-based workflow shows potential for inherent scalability and expenditure reduction, issues such as data transfer and its efficiency have emerged as major concerns. When one task needs several pieces of data distributed in the cloud, the workflow engine must intelligently select the locations in which the data will reside to avoid redundant data transfer. In this paper, an efficient data placement algorithm for cloud-based workflow is proposed. The algorithm uses an affinity graph to group datasets so as to minimize data transfers while keeping a polynomial time complexity. By integrating the algorithm into a cloud-based workflow engine, we can expect efficiency improvements in data transfer. Experiments and analysis support our approach.
Keywords: data placement, affinity graph, cloud computing, workflow, data transfer
1. INTRODUCTION
Recently, cloud computing has been gaining popularity as a technology that promises to provide the next generation of Information Technology (IT) platforms based on the concept of utility computing [1-3]. Key features of cloud computing such as high performance, massive storage and the relatively low cost of infrastructure construction have attracted much attention.
By taking advantage of cloud computing, a Workflow Management System (WfMS) shows potential for inherent scalability and expenditure reduction, and can gain wider utilization in cross-organizational e-business applications. However, it also faces some new challenges, of which data transfer is one.
• On the one hand, in a cloud computing system the datasets may be distributed and the communication is based on the Internet. When one task needs to process data from different locations, data transfer on the Internet is inevitable [4].
• On the other hand, the infrastructure of the cloud computing system is hidden from its users, who do not know the exact physical locations where their data are placed. This kind of model is very convenient for users, but random data placement could lead to unnecessary data transfers [5].
So we need an efficient data placement strategy to respond to this challenge. Concerning the data placement problem, Kosar and colleagues [6] propose a data placement scheduler for distributed computing systems. It guarantees reliable and efficient data transfer with different protocols. Cope et al. [7] propose a data placement strategy for urgent computing environments to guarantee the data's robustness. At the infrastructure level, NUCA [8] is a data placement and replication strategy for distributed caches that can reduce data access latency. These works mainly focus on how to transfer the application data, and they cannot minimize the total data transfers. As cloud computing has become more and more popular, new data management systems have also appeared, such as the Google File System [9] and Hadoop [10]. But they are designed mainly for Web search applications, which are different from workflow applications.
The data placement research closest to ours is the workflow system that proposes data placement strategies based on the k-means clustering algorithm [5], but that algorithm requires an objective function and shows exponential cost growth.
This paper treats the cloud as an umbrella term referring to Internet-based interconnected computing and storage nodes, which partition the Internet-based cyberspace for controlled sharing and collaboration, and reports a practice of a data placement algorithm for cloud-based workflow with the following specific features:
• Efficient data placement. The data placement works in tight combination with task scheduling to place data in a graphical partition manner, upon which the data transfers can be minimized. Moreover, the data placement algorithm is polynomial and involves no arbitrary empirical objective functions to evaluate candidate partitions.
• We evaluate the performance of our proposed algorithm and demonstrate that it achieves significant performance improvements.
The work was partially supported by the National Science Foundation of China under Grant Nos. 61033006 and 60970131.
Peng Zhang is a PhD candidate; his research interests include service composition and workflow management. Email: zhangpeng@software.ict.ac.cn
2. MOTIVATING EXAMPLE
We provide an example scenario to explain a typical problem that can be addressed by the reported work. Emergency Materials Management (EMM) is an entire lifecycle management that includes key tasks like demand analysis, raising, storage, transport and consumption. EMM is also a typical cross-organizational e-business application and represents the kind of system that can be used as the motivating example for the reported work. Since an EMM system is usually quite complicated, we have decided to describe only a simplified episode of an EMM to illustrate the problem that we address. As shown in Figure 1, the episode involves an investigation office, a materials management office and a transport office, which are required to act on the following scenario.
Figure 1. The random data placement (control flow and data flow of the EMM workflow across the Investigation Office, the Materials Management Office and the Transport Office, with data such as the Order Form and the Shipping Order placed on different cloud nodes A, B, C, ..., D, causing data transfers)
To support the abovementioned relief management workflow scenario, several kinds of data are required: an analysis task produces an "Order Form" reporting the emergency materials demand; the materials management office receives the "Order Form" and outputs the "Shipping Order"; and the local transport office delivers the emergency materials as soon as the "Order Form" and "Shipping Order" are received.
All the data required to support this workflow are placed on different nodes in the cloud. Figure 1 shows that the "Order Form" in the WfMS is placed on node A and the "Shipping Order" is placed on node C; then there is at least one data transfer in each workflow instance, regardless of whether the delivery task is scheduled on node A or node C. If a large number of concurrent instances is needed to support the workflow, we can anticipate a very high cost of data transfer even if the nodes are connected by Ethernet. Suppose that there are 10 transport offices, a 1 Mbit order form and a 100 Mbit/s Ethernet bandwidth, and that each transport office runs 1000 delivery instances; the total data transfer time is roughly 1x10^2 seconds, which can be a significant amount of time in disaster relief scenarios.
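As a back-of-the-envelope check of this estimate (our own restatement, assuming every delivery instance fetches the 1 Mbit order form over the shared 100 Mbit/s links):

  total transfer time = (10 offices x 1000 instances x 1 Mbit) / (100 Mbit/s) = 10000 Mbit / 100 Mbit/s = 100 s = 1x10^2 s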
It is noteworthy that it is usually impossible to place all data on the same node as a result of storage limits. In addition, placing all data on the same node would also lead to that node becoming a super node with a high workload.
3. DATA PLACEMENT MODEL
For e-business applications, processes need to be executed a large number of times sequentially within a very short time period or concurrently with a large number of instances [11], so when one task needs data located on different nodes, these concurrent instances make the data transfer even more critical. To place these data effectively, the workflow engine must intelligently select the nodes on which the data will reside. In this paper, a data scheduling engine is proposed, as shown in Figure 3. The data scheduling engine mainly does two things: it analyzes the dependencies between tasks and data, and it places the data based on the analysis results. After that, the data scheduling engine notifies the task scheduling engine to schedule the tasks. As is well known, in workflows both tasks and data can be numerous and make up a complicated many-to-many relationship: one task might need many data and one data item might be used by many tasks. So the data placement should be based on these dependencies between tasks and data. To facilitate understanding, some related definitions are given as follows.
DEFINITION 1 Workflow Specification: A workflow specification can be represented by a directed acyclic graph WSpc = (T, E), where T is a finite set of tasks ti (1<=i<=n, n=|T|) and E is a finite set of directed edges eij (1<=i<=n, 1<=j<=n). The tasks T = (ActivityTask, ControlTask) are the basic building blocks of a workflow. ActivityTask defines the abstract functional requirements, and ControlTask defines the control logic. ActivityTask = (Name, In, Out), where the three elements represent the task's name, input parameters and output parameters. ControlTask = {Start, AndSplit, AndJoin, OrSplit, OrJoin, End}.
DEFINITION 2 Data Set: The data set D is a finite set of data, and every data item has a size, so every data item di∈D has attributes denoted as <i, si>, where i denotes the identifier and si denotes the size of di. The data are used to represent the parameters of ActivityTask, so each ti∈ActivityTask (1<=i<=|ActivityTask|) has attributes denoted as <j, Dj>, where j is the identifier and Dj⊆D.
DEFINITION 3 Data Affinity: The data affinity between data di and dj is defined as affij = Σ_{k∈ActivityTask} acckij, where acckij is the number of times task k references both data di and dj. The summation occurs over the ActivityTask set in WSpc. This definition of data affinity measures the strength of an imaginary bond between the two data, predicated on the fact that the data are used together by activity tasks.
DEFINITION 4 Data Affinity Matrix: Based on this definition of data affinity, the data affinity matrix DA is defined as follows: it is an n x n matrix for the n-data problem whose (i, j) element equals affij.
DEFINITION 5 Data Block Set: The data block set B is a finite set of data blocks; each data block has a set of data and is denoted as <m, Dm>, where m is the identifier and Dm⊆D.
DEFINITION 6 Node Set: The node set R is a finite set of nodes; each node rn has a storage limit and a set of data blocks, and is denoted as <n, sn, Bn>, where n is the identifier, sn denotes the storage limit of rn, and Bn⊆B.
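As an illustration of Definitions 3 and 4, the following sketch (ours, not part of the proposed system; the function name and the task-to-data encoding are assumptions) builds the data affinity matrix from a task-to-data mapping:

from itertools import combinations

def build_affinity_matrix(tasks, data_ids):
    # tasks    : mapping task id -> set of data ids the task uses (Definition 2)
    # data_ids : ordered list of all data identifiers
    # Returns an n x n list of lists DA with DA[i][j] = aff_ij (Definition 3);
    # the diagonal DA[i][i] counts how many tasks use data i (its "strength").
    index = {d: k for k, d in enumerate(data_ids)}
    n = len(data_ids)
    DA = [[0] * n for _ in range(n)]
    for used in tasks.values():
        for d in used:                                   # diagonal: total usage
            DA[index[d]][index[d]] += 1
        for di, dj in combinations(sorted(used), 2):     # off-diagonal: co-usage
            DA[index[di]][index[dj]] += 1
            DA[index[dj]][index[di]] += 1
    return DA

# A hypothetical task-to-data assignment (the actual assignment of S1-S8 is only
# shown graphically in Figure 2(a)); it reproduces the matrix of Figure 2(b).
tasks = {"S1": {"d1", "d3"}, "S2": {"d1", "d2", "d3"}, "S3": {"d2", "d4"},
         "S4": {"d2", "d4", "d5"}, "S5": {"d5"}, "S6": {"d5"}}
print(build_affinity_matrix(tasks, ["d1", "d2", "d3", "d4", "d5"]))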
From workflow deployment to workflow execution, this paper mainly discusses two data placement cases. The first case is initial data placement: when workflow specifications are deployed in the cloud, the data used by activity tasks need to be placed. The second case is runtime data placement: when workflow specifications are executed, the data generated by activity tasks need to be placed. Let us introduce the two cases in detail.
3.1 Initial Placement
Vertical partitioning is the process that divides a global object, which may be a single relation or a universal relation, into groups of its attributes, called vertical fragments. It is used during the design of a distributed database to enhance the performance of transactions. Vertical partitioning has a variety of applications wherever the match between data and transactions can affect performance [12]. In this paper, the match between data and tasks directly affects the data placement, which in turn affects the data transfer performance, so we can apply vertical partitioning to the data placement, where tasks are seen as transactions and data are seen as attributes in a relation. We shall use the following notation and terminology in the description of our initial data placement.
• Primitive cycle denotes any cycle in the affinity graph.
• Affinity cycle denotes a primitive cycle that contains a cycle node.
• Cycle completing edge denotes a "to be selected" edge that would complete a cycle.
• Cycle node is the node of the cycle completing edge that was selected earlier.
• Former edge denotes an edge that was selected between the last cut and the cycle node.
• Cycle edge is any of the edges forming a cycle.
• Extension of a cycle refers to a cycle being extended by pivoting at the cycle node.
The above definitions are used in the proposed data placement to process the affinity graph and to generate possible cycles from the graph. The intuitive explanations can be found in [12].
Let us take the materials management and transport process as an example. Firstly, the workflow specification shown in Figure 2(a) is transformed into a data affinity matrix, shown in the left of Figure 2(b). A diagonal element DA(i,i) equals the total usage of the data di. This is reasonable since it shows the "strength" of that data in terms of its use by all activity tasks. The procedure for generating the data blocks from the affinity graph is described below. Each partition of the graph generates a data block.
Figure 2. The data placement process: (a) the workflow specification of the materials management and transport process, with activity tasks S1-S8, control tasks AndSplit and AndJoin, and data d1-d5 spanning the Materials Management Office and the Transport Office; (b) the data affinity matrix (left) and the corresponding affinity graph (right); (c) the graph partitions, each delimited by a broken line and forming a data block; (d) the resulting data blocks placed on nodes (B1->R1, B2->R2, B3->R3) together with the tasks S1-S5 that use them.

The data affinity matrix of Figure 2(b) is:

        d1  d2  d3  d4  d5
   d1    2   1   2   0   0
   d2    1   3   1   2   1
   d3    2   1   2   0   0
   d4    0   2   0   2   1
   d5    0   1   0   1   3
1 Construct the affinity graph from the workflow specification being considered. Note that the data affinity matrix is itself an adequate data structure to represent this graph; no additional physical storage of data is necessary. Figure 2(b) shows the resulting affinity graph.
2 Start from any node. In Figure 2(b), we start from the node d1.
3 Select an edge which satisfies the following conditions:
• It should be linearly connected to the tree already constructed.
• It should have the largest value among the possible choices of edges at each end of the tree.
In Figure 2(b), we first select the edge (d1, d3) and then the edge (d3, d2). Next, we select the edge (d2, d4) and the edge (d4, d5). Finally, we select the edge (d5, d2). This iteration ends when all nodes are used for tree construction.
4 When the next selected edge forms a primitive cycle:
• If a cycle node does not exist, check for the "possibility of a cycle", and if the possibility exists, mark the cycle as an affinity cycle. The possibility of a cycle results from the condition that no former edge exists, or p(former edge) <= p(all the cycle edges). Consider this cycle as a candidate partition. Go to step 3.
In Figure 2(b), when we select the edge (d5, d2), there is no cycle node and no former edge exists, so the primitive cycle d2-d4-d5 constitutes a candidate partition, where the cycle node is d2.
• If a cycle node already exists, discard this edge and go to step 3.
5 When the next selected edge does not form a cycle and a candidate partition exists:
• If no former edge exists, check for the possibility of extension of the cycle by this new edge. The possibility of extension results from the condition p(edge being considered or cycle completing edge) >= p(any one of the cycle edges). If there is no possibility, cut this edge and consider the cycle as a partition. Go to step 3.
In Figure 2(b), because there is no possibility of extension of the cycle d2-d4-d5, we cut the edge (d3, d2) and consider d2-d4-d5 as a partition.
• If a former edge exists, change the cycle node and check for the possibility of extension of the cycle by the former edge. If there is no possibility, cut the former edge and consider the cycle as a partition. Go to step 3.
In our approach, we consider the data affinity matrix as a complete graph, called the affinity graph, in which an edge value represents the affinity between the two data. Then, forming a linearly connected spanning tree, the procedure generates all meaningful data blocks in one iteration by considering a cycle as a data block. A linear, connected tree has only two ends. The right of Figure 2(b) shows the affinity graph corresponding to the data affinity matrix after excluding zero-valued edges. Note that the data affinity matrix serves as a data structure for the affinity graph. The major advantages of the proposed method are that:
• There is no need for iterative binary partitioning. The major weakness of iterative binary partitioning is that at each step two new problems are generated, increasing the complexity; furthermore, termination of the algorithm depends on the discriminating power of the objective function.
• The method requires no objective function. Empirical objective functions were selected after some trial-and-error experimentation to find out whether they possess a good discriminating power. Although reasonable, they constitute an arbitrary choice. This arbitrariness has been eliminated in the proposed methodology.
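To make steps 2 and 3 concrete, the following sketch (ours; it covers only the greedy growth of the linearly connected spanning tree, while the cycle bookkeeping of steps 4 and 5 that actually delimits the blocks follows [12] and is omitted) reproduces the edge selection order described above for the affinity graph of Figure 2(b):

def build_linear_tree(DA, data_ids, start):
    # DA       : affinity matrix as a dict of dicts, DA[di][dj] = aff_ij
    # data_ids : the data identifiers, e.g. ["d1", ..., "d5"]
    # start    : the node to start from (step 2)
    # Returns the selected edges in selection order (step 3).
    path, used, edges = [start], {start}, []
    while len(used) < len(data_ids):
        # consider both ends of the tree; on ties prefer the most recently grown end
        candidates = []
        for priority, end in enumerate((path[0], path[-1])):
            for node in data_ids:
                if node not in used:
                    candidates.append((DA[end][node], priority, end, node))
        _, _, end, node = max(candidates)
        edges.append((end, node))
        used.add(node)
        if end == path[-1]:
            path.append(node)
        else:
            path.insert(0, node)
    return edges

ids = ["d1", "d2", "d3", "d4", "d5"]
rows = [[2, 1, 2, 0, 0], [1, 3, 1, 2, 1], [2, 1, 2, 0, 0],
        [0, 2, 0, 2, 1], [0, 1, 0, 1, 3]]
DA = {di: {dj: rows[i][j] for j, dj in enumerate(ids)} for i, di in enumerate(ids)}
print(build_linear_tree(DA, ids, "d1"))
# -> [('d1', 'd3'), ('d3', 'd2'), ('d2', 'd4'), ('d4', 'd5')], matching the text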
Now we consider the computational complexity. Step 1 does not affect the computational complexity because the data affinity matrix can be used as a symmetric matrix. The repeat loop in the detailed description is executed n-1 times, where n denotes the number of data. At each iteration, selection of the next edge takes time O(n). Also, whether a cycle exists or not can be checked in time O(n). Thus, the algorithm takes time O(n^2). The partition results are shown in Figure 2(c), where each broken line delimits a data block and each data block includes a group of data. For example, the data d1 and d3 constitute a data block <1, {d1, d3}>, and the data d2, d4 and d5 constitute a data block <2, {d2, d4, d5}>. Next, these data blocks are placed on different nodes. The procedure is introduced as follows:
6 Get the ActivityTask set from WSpc;
7 Calculate the estimated data transfer time if the data block Bk is scheduled to the node Rj according to formula (1). The bandwidth set is denoted as BW, and BWij denotes the bandwidth between node Ri and node Rj. Here the sum of the storage limits of all nodes is greater than the sum of the sizes of all data.

   min_{Rj} Σ_{dk∈Bk, dk∈Ri, dk∉Rj} (sk / BWij)      (1)

8 Sort the estimated data transfer times and then select the node Rj with the least data transfer time;
9 If the storage limit of Rj is greater than the size of the data block Bk, then place Bk on the node Rj; else try other nodes and go to step 9;
10 If the size of the data block Bk is greater than the storage limit of every node, then the data block is partitioned into two data blocks with the maximum sum of affinity weight until the size constraint is satisfied;
As shown in Figure 2(d), the data block B1 is placed on the node R1, and the data block B2 is partitioned into two data blocks because the total size of d2, d4 and d5 is greater than the storage limit of every node. Since the partition {d2, d4} and the partition {d5} have the maximum sum of affinity weight, d2 and d4 constitute one data block, which is placed on the node R2, and d5 constitutes another data block, which is placed on the node R3. There are only two data transfers after S1, S2, S3, S4 and S5 are executed. Suppose that the numbers of data and nodes are both n; then the time complexity from step 6 to step 10 is O(n^2) + O(n log n).
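The block placement of steps 6-10 can be sketched as follows (our own simplification; the helper names, the placement bookkeeping and the omission of the block splitting of step 10 are assumptions, not part of the original algorithm):

def estimated_transfer_time(block, node, placement, sizes, BW):
    # Formula (1): estimated transfer time if the data block is scheduled to `node`.
    # placement : dict data id -> node Ri currently holding that data (if any)
    # sizes     : dict data id -> size s_k;  BW[i][j] : bandwidth between nodes i and j
    total = 0.0
    for dk in block:
        src = placement.get(dk)
        if src is not None and src != node:      # data residing on another node Ri
            total += sizes[dk] / BW[src][node]
    return total

def schedule_block(block, nodes, capacities, placement, sizes, BW):
    # Steps 8-9: try nodes in order of increasing estimated transfer time and place
    # the block on the first node with enough remaining storage.
    block_size = sum(sizes[dk] for dk in block)
    ranked = sorted(nodes,
                    key=lambda r: estimated_transfer_time(block, r, placement, sizes, BW))
    for r in ranked:
        if capacities[r] >= block_size:
            capacities[r] -= block_size
            for dk in block:
                placement[dk] = r
            return r
    return None   # no single node can hold the block; step 10 would split it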
3.2 Runtime Placement
During workflow execution, there are two situations that change the data placement. One situation is that new workflows are deployed to the cloud computing environment, so new data and activity tasks are added to the system; the other situation is that a ready activity task has been executed and has produced output data, which might be used by several later activity tasks or later workflows. For the former situation, we simply repeat the initial placement. For the latter situation, our idea includes three steps:
11 Calculate the data affinity weights between the newly generated data and the existing nodes: the data affinity weight between data di and node Rj is denoted as dRij, which is the sum of the data affinity weights of di with all the data on Rj. Based on the data affinity weights and the storage limits of the nodes, we select one node on which to place the newly generated data.
12 Place the newly generated data: the newly generated data is placed on the node which has the greatest data affinity weight and enough storage; otherwise try other nodes as in step 9 and step 10.
13 The task scheduling engine periodically monitors the states of all the workflow's tasks and schedules each ready task to the node that holds most of its required data.
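Steps 11 and 12 can be sketched as follows (ours; it assumes the affinity row of the newly generated data with the existing data has already been computed, and the names are hypothetical):

def runtime_place(new_data_id, DA, node_contents, capacities, sizes):
    # Place a newly generated data item on the node with the greatest data affinity
    # weight dR_ij (step 11) that still has enough storage (step 12).
    # DA            : affinity matrix as dict of dicts, DA[di][dj] = aff_ij
    # node_contents : dict node -> set of data ids already stored on that node
    # capacities    : dict node -> remaining storage;  sizes : dict data id -> size
    def affinity_to_node(node):
        # dR_ij: sum of the affinities between the new data and all data on the node
        return sum(DA[new_data_id].get(dj, 0) for dj in node_contents[node])

    for node in sorted(node_contents, key=affinity_to_node, reverse=True):
        if capacities[node] >= sizes[new_data_id]:
            node_contents[node].add(new_data_id)
            capacities[node] -= sizes[new_data_id]
            return node
    return None   # no node has enough storage; fall back as in steps 9 and 10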
Figure 3. The data placement architecture (the workflow engine consists of a task scheduling engine and a data scheduling engine, which coordinate the placement of data such as the Order Form and the Shipping Order on Nodes A, B and C)

Let us review our example. After the runtime data placement, the newly generated data "Shipping Order" and the newly generated data "Order Form" will be placed on the same node, as shown in Figure 3.
To our knowledge, little effort has been devoted to investigating data placement for workflow using a graphical algorithm. This graphical algorithm can be used effectively for data partitioning because it overcomes the shortcomings of binary partitioning and does not need any complementary algorithm such as the BEA procedure described in the work reported in [12]. Furthermore, the algorithm involves no arbitrary empirical objective functions to evaluate candidate partitions.
4. DATA PLACEMENT ALGORITHM
To generalize the above-described steps, the algorithm DP4WF is given below. According to our analysis of the abovementioned steps, the time complexity of the algorithm is no more than O(n^2) + O(n^2) + O(n log n), and it yields a near-optimal solution while keeping a polynomial time complexity.
Function DP4WF(Msg, WSpc, R, DA)
BEGIN
  if (Msg == "1")                             // initial stage
    Set dataSet                               // collect the data used by activity tasks
    Set taskSet = WSpc.T.ActivityTask
    for i = 0 to taskSet.size()
      ActivityTask t = taskSet.get(i)
      dataSet.add(t.D)
    DA = Affinity(dataSet, taskSet)           // build the data affinity matrix (Definition 4)
    Graph g = AffinityGraph(DA)               // treat DA as the complete affinity graph
    Set blockSet = GraphPartition(g)          // partition the graph into data blocks (steps 1-5)
    for j = 0 to blockSet.size()
      B block = blockSet.get(j)
      result += scheduleBlock(block, R)       // place each block on a node (steps 6-10)
  if (Msg == "2")                             // runtime stage
    dataSet = engine.fetchNewData()           // newly generated data
    for each dm in dataSet
      for each node Ri in R
        calculate dRmi                        // affinity weight between dm and Ri (step 11)
      Rn = argmax(dRmi)                       // node with the greatest affinity and enough storage
      result += scheduleData(dm, Rn)          // place dm on Rn (steps 12-13)
  return result
END
5. EXPERIMENTAL RESULTS
We have adopted the pub/sub message-oriented middleware ActiveMQ (http://activemq.apache.org/) to enhance our prior tool named VINCA Personal Workflow [13] and developed a cloud-based workflow platform, which has been used for the experimental evaluation of our approach in this section. To evaluate the performance, we run workflow instances under three simulation strategies:
Random: In this simulation, we randomly place the existing data during the initial stage and store the generated data on the local node (i.e., where they were generated) at runtime. This simulation represents the traditional data placement strategies in older distributed computing systems. At that time, data were usually stored naturally on the local node or on the nodes that had available storage. The temporary intermediate data, i.e. generated data, were also naturally stored where they were generated, waiting for the tasks to retrieve them.
K-Means: This simulation shows the overall performance of the k-means algorithm. The strategy needs an objective function to cluster the generated data and place them on the appropriate nodes.
DP4WF: This simulation shows the overall performance of our algorithms. Our algorithms are specifically designed for cloud workflows. The strategy is based on data dependencies and can automatically partition the existing data. Comparisons with the other strategies are made from different aspects to show the performance of our algorithms.
The traditional way to evaluate the performance of a
workflow system is to record and compare the execution
time. However, in our work, we will count the total data
transfer instead. The execution time could be influenced
by other factors, such as bandwidth, scheduling strategy
and I/O speed. Our data placement strategy aims to
reduce the data transfers between nodes on the Internet.
So we directly take the number of datasets that are
actually transferred during the workflow’s execution as
the measurement to evaluate the performance of the
algorithms. In a cloud computing environment with a
limited bandwidth based on the Internet, if the total data
transfers have been reduced, the execution time will be
reduced accordingly. Furthermore, the cost of data
transfers will also decrease.
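As a small illustration of this measurement (our own sketch; the names are hypothetical), a dataset counts as transferred whenever a task runs on a node that does not already hold one of its input datasets:

def count_transfers(schedule, task_inputs, placement):
    # schedule    : dict task -> node the task is scheduled on
    # task_inputs : dict task -> set of data ids the task needs
    # placement   : dict data id -> node currently holding that data
    return sum(1 for t, node in schedule.items()
                 for d in task_inputs[t] if placement[d] != node)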
To make the evaluation as objective as possible, we generate test workflows randomly to run on our platform. This makes the evaluation results independent of any specific application. As we need to run the Random, K-Means and DP4WF algorithms separately, we set the number of existing datasets and generated datasets to be the same for every test workflow. That means we have the same number of existing datasets and tasks for every test workflow, and we assume that each task will only generate one dataset. We can control the complexity of the test workflow by changing the number of datasets. Every dataset will be used by a random number of tasks, and the tasks that use the generated datasets must be executed after the task that generates their input. We can control the complexity of the relationships between the datasets and tasks by changing the range of this random number. Another factor that would have an impact on the algorithms is the storage limit of the nodes. We can randomly set the storage limit for the nodes, and we will run new simulations to show their impact on the performance.
Here, we have only included graphs of the simulation results. In Figure 4(a), we ran the test workflows with different complexities on 15 nodes, using 4 types of test workflows with different numbers of datasets. In Figure 4(b), we fixed the test workflows' dataset count to 50 and ran them on different numbers of nodes. From the results, we can conclude that the K-Means and DP4WF algorithms can effectively reduce the total data transfers of the workflow's execution.

Figure 4. The data transfers without storage limit: (a) data transfers of the Random, K-Means and DP4WF strategies for 30, 50, 80 and 120 datasets on 15 nodes; (b) data transfers of the three strategies for 50 datasets on 5, 10, 15 and 20 nodes
However, in the simulations described above, we did not limit the amount of storage that the nodes had available during runtime. In a cloud computing environment, nodes normally have limited storage, especially in storage-constrained systems. When one node is overloaded, we need to reallocate the data to other nodes. The reallocation will not only cause extra data transfers, but will also delay the execution of the workflow. To count the reallocated datasets, we ran the same test workflows as in Figure 4, but with a storage limit on every node.
From Figure 5, we can see that as the number of nodes and datasets increases, the performance of the random strategy decreases. This is because the datasets and tasks gather on one node. This triggers the adjustment process more frequently, which costs extra data transfers.
Figure 5. The data transfers with storage limit: (a) data transfers of the Random, K-Means and DP4WF strategies for 30, 50, 80 and 120 datasets; (b) data transfers of the three strategies for 50 datasets on 5, 10, 15 and 20 nodes
During the execution of every test workflow instance, we recorded the number of datasets transferred to each node, as well as the tasks scheduled to that node. The objective was to see how the tasks and datasets were distributed, which could indicate the workload balance among nodes. We also calculated the standard deviation of the nodes' usage.
Figure 6 shows the average standard deviation of running 1000 test workflows, each having 80 existing datasets and 80 tasks, on 15 nodes. From Figure 6, we can see relatively high deviations in the nodes' usage in the two simulations without the runtime algorithm. This means that the tasks and the datasets are allocated to one node more frequently, which leads to a node becoming a super node with a high workload. By contrast, in the other two simulations that use the runtime algorithm to pre-allocate the generated data to other nodes, the deviation of the node usage is low. This demonstrates that the runtime algorithm can produce a more balanced distribution of the workload among nodes.
Figure 6. The standard deviation of workload (standard deviation of the data transfer and task scheduling distributions over the nodes for the Random, K-Means and DP4WF strategies)
The results in Figure 7 indicate that varying the number of datasets has a great effect on the running cost. When the number of datasets is less than 40, the running cost of the K-means is less than ours. However, as the number of datasets grows, the running cost of the K-means shows an exponential increase tendency, while ours increases only linearly.
Figure 7. The running cost comparison (running cost in seconds of the K-Means and DP4WF algorithms for 30, 50, 80 and 120 datasets)
We thus conclude that our data placement algorithm
achieves significant performance improvements while
keeping a polynomial time complexity.
6. CONCLUSIONS
With cloud-based workflow, users can reduce their IT expenditure and enjoy full-fledged business automation based upon the cloud fabric, with high availability and scalability. However, to our knowledge, only a few works give preliminary research results on data placement. In this paper, we propose an efficient data placement algorithm for cloud-based workflow, which works in tight combination with task scheduling to place data in a graphical partition manner, upon which the data transfers can be minimized. Moreover, the data placement algorithm is polynomial and involves no objective function.
REFERENCES
[1] Weiss, A (2007). Computing in the Cloud. ACM Networker, 11:18-25.
[2] Han, Y, Sun, JY, Wang, GL, Li, HF (2010). A Cloud-Based BPM Architecture with User-End Distribution of Non-Compute-Intensive Activities and Sensitive Data. Journal of Computer Science and Technology, 25(6):1157-1167.
[3] Armbrust, M, Fox, A, Griffith, R, et al. (2009). Above the Clouds: A Berkeley View of Cloud Computing. University of California, Berkeley. http://www.eecs.berkeley.edu/Pubs/TechRpts/2009/EECS-2009-28.html
[4] Deelman, E, Chervenak, A (2008). Data management challenges of data-intensive scientific workflows. IEEE International Symposium on Cluster Computing and the Grid, 687-692.
[5] Yuan, D, Yang, Y, Liu, X, Chen, JJ (2010). A data placement strategy in scientific cloud workflows. Future Generation Computer Systems, 26(8):1200-1214.
[6] Kosar, T, Livny, M (2005). A framework for reliable and efficient data placement in distributed computing systems. Journal of Parallel and Distributed Computing, 65(10):1146-1157.
[7] Cope, JM, Trebon, N, Tufo, HM, Beckman, P (2009). Robust data placement in urgent computing environments. IEEE International Symposium on Parallel and Distributed Processing (IPDPS'09), Rome, Italy, IEEE Computer Society.
[8] Hardavellas, N, Ferdman, M, Falsafi, B, Ailamaki, A (2009). Reactive NUCA: Near-optimal block placement and replication in distributed caches. 36th Annual International Symposium on Computer Architecture (ISCA'09), Austin, Texas, USA, IEEE Computer Society.
[9] Ghemawat, S, Gobioff, H, Leung, ST (2003). The Google File System. SIGOPS Operating Systems Review, 37:29-43.
[10] Borthakur, D (2007). The Hadoop Distributed File System: Architecture and Design. Hadoop Project Website, 11, 21.
[11] Liu, K, Jin, H, Chen, J, Liu, X, Yuan, D, Yang, Y (2010). A Compromised-Time-Cost Scheduling Algorithm in SwinDeW-C for Instance-Intensive Cost-Constrained Workflows on Cloud Computing Platform. International Journal of High Performance Computing Applications, 24(4):445.
[12] Navathe, SB, Ra, M (1989). Vertical partitioning for database design: a graphical algorithm. SIGMOD Record, 18(2):440-450.
[13] Wang, J, Zhang, LY, Han, YB (2006). Client-Centric Adaptive Scheduling of Service-Oriented Applications. Journal of Computer Science and Technology, 21(4):537-546.