A Local-Optimization based Strategy for
Cost-Effective Datasets Storage of
Scientific Applications in the Cloud
Many slides from the authors’ presentation at CLOUD 2011
Presenter: Guangdong Liu
Mar 13th, 2012
Outline
• Introduction
• A Motivating Example
• Problem Analysis
• Important Concepts and Cost Model of Datasets Storage in the Cloud
• A Local-Optimization based Strategy for Cost-Effective Datasets Storage of Scientific Applications in the Cloud
• Evaluation and Simulation
Introduction
• Scientific applications
– Computation and data intensive
• Generated datasets: terabytes or even petabytes in size
• Huge computation: e.g. scientific workflows
– Intermediate data: important!
• Reuse or reanalyze
• For sharing between institutions
• Regeneration vs. storing
Introduction
• Cloud computing
– A new way for deploying scientific applications
– Pay-as-you-go model
• Storing strategy
– Which generated datasets should be stored?
– Tradeoff between cost and user preference
– Cost-effective strategy
A Motivating Example
• Parkes radio telescope and pulsar survey
• Pulsar searching workflow
[Figure: pulsar searching workflow — Record Raw Data (per beam) → Extract Beam → Compress Beam → De-disperse (Trial Measures 1–1200) → Accelerate → Seek (FFT Seek / FFA Seek / Pulse Seek) → Get Candidates → Eliminate candidates → Fold to XML → Make decision]
A Motivating Example
• Sizes and generation times along the workflow (starting from raw beam data):
– Extracted & compressed beam: 20 GB, 27 mins
– Dedispersion files: 90 GB, 790 mins
– Accelerated dedispersion files: 90 GB, 300 mins
– Seek results files: 16 MB, 80 mins
– Candidate list: 1 KB, 1 min
– XML files: 25 KB, 245 mins
• Current storage strategy
– Delete all the intermediate data, due to storage limitations
• Some intermediate data should be stored
• Some need not be
Problem Analysis
• Which datasets should be stored?
– Data challenge: data volumes double every year over the next decade and beyond [Szalay et al., Nature, 2006]
– Different strategies correspond to different costs
– Scientific workflows are very complex and there are dependencies among datasets
– Furthermore, a single scientist can no longer decide the storage status of a dataset alone
– Data accessing delay
– Datasets should be stored based on the trade-off between computation cost and storage cost
• A cost-effective datasets storage strategy is needed
Important Concepts
• Data Dependency Graph (DDG)
– A classification of the application data
• Original data and generated data
– Data provenance
• A kind of meta-data that records how data are generated
– DDG
[Figure: example DDG with datasets d1–d8 and their dependency edges]
Important Concepts
• Attributes of a Dataset in DDG
– A dataset di in DDG has the attributes <xi, yi, fi, vi, provSeti, CostRi>
• xi ($) denotes the generation cost of dataset di from its direct predecessors.
• yi ($/t) denotes the cost of storing dataset di in the system per time unit.
• fi (Boolean) is a flag denoting whether dataset di is stored or deleted in the system.
• vi (Hz) denotes the usage frequency, which indicates how often di is used.
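As a minimal sketch of how these attributes could be carried in code (our own illustrative Python model, not from the paper; field names are assumptions):

```python
from dataclasses import dataclass, field

@dataclass
class Dataset:
    """One dataset d_i in the DDG, carrying the slide's attribute tuple
    <x_i, y_i, f_i, v_i, provSet_i, CostR_i> (CostR is derived, not stored)."""
    name: str
    x: float                  # generation cost from direct predecessors ($)
    y: float                  # storage cost per time unit ($/t)
    v: float                  # usage frequency (uses per time unit)
    stored: bool = False      # flag f_i: stored (True) or deleted (False)
    preds: list = field(default_factory=list)  # direct predecessors in the DDG
```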
Important Concepts
• Attributes of a Dataset in DDG
– provSeti denotes the set of stored provenances that are needed when regenerating dataset di.

$$genCost(d_i) = x_i + \sum_{\{d_k \,\mid\, d_j \in provSet_i \,\wedge\, d_j \to d_k \to d_i\}} x_k$$

– CostRi ($/t) is di's cost rate, i.e. the average cost per time unit of di in the system:

$$CostR_i = \begin{cases} y_i, & f_i = \text{stored} \\ genCost(d_i) \cdot v_i, & f_i = \text{deleted} \end{cases}$$
• Cost = C + S
– C: total cost of computation resources
– S: total cost of storage resources
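Under the Dataset sketch above, both formulas can be computed directly. Note the recursion in gen_cost: accumulating a deleted predecessor's own generation cost until stored provenances are reached is equivalent to summing xk over the deleted datasets between provSeti and di:

```python
def gen_cost(d: Dataset) -> float:
    # genCost(d_i): x_i plus the x_k of every deleted dataset that must be
    # regenerated first, recursing back to the stored provenances provSet_i.
    return d.x + sum(gen_cost(p) for p in d.preds if not p.stored)

def cost_rate(d: Dataset) -> float:
    # CostR_i = y_i if d_i is stored, else genCost(d_i) * v_i.
    return d.y if d.stored else gen_cost(d) * d.v
```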
Cost Model of Datasets Storage in the Cloud
• Total cost rate of a DDG under storage strategy S:

$$TCR_S = \sum_{d_i \in DDG} CostR_i$$

• Example: a linear DDG d1 → d2 → d3 with attributes (x1, y1, v1), (x2, y2, v2), (x3, y3, v3):
– S1: f1 = 1, f2 = 0, f3 = 0  ⇒  TCR_S1 = y1 + x2·v2 + (x2 + x3)·v3
– S2: f1 = 0, f2 = 0, f3 = 1  ⇒  TCR_S2 = x1·v1 + (x1 + x2)·v2 + y3
• For a DDG with n datasets, there are 2^n different storage strategies (checked by brute force in the sketch below)
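Since every dataset is either stored or deleted, a tiny DDG can be solved exhaustively. This illustrative sketch (our own, not the paper's method) enumerates all 2^n strategies, reproduces TCR_S1 and TCR_S2 above, and shows why exhaustive search cannot scale:

```python
from itertools import product

def total_cost_rate(ddg: list) -> float:
    # TCR_S = sum of CostR_i over all datasets under the current flags S.
    return sum(cost_rate(d) for d in ddg)

def min_cost_brute_force(ddg: list):
    # Enumerate all 2^n storage strategies; feasible only for tiny DDGs.
    best, best_flags = float("inf"), None
    for flags in product([False, True], repeat=len(ddg)):
        for d, f in zip(ddg, flags):
            d.stored = f
        tcr = total_cost_rate(ddg)
        if tcr < best:
            best, best_flags = tcr, flags
    # Leave the DDG in its minimum-cost configuration before returning.
    for d, f in zip(ddg, best_flags):
        d.stored = f
    return best, best_flags
```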
CTT-SP Algorithm
• To find the minimum cost storage strategy for a DDG
• Philosophy of the algorithm:
– Construct a Cost Transitive Tournament (CTT) based on the DDG.
• In the CTT, the paths (from the start to the end dataset) have a one-to-one mapping to the storage strategies of the DDG
• The length of each path equals the total cost rate of the corresponding storage strategy
– The Shortest Path (SP) represents the minimum cost storage strategy
CTT-SP Algorithm
• Example: linear DDG d1 → d2 → d3 with attributes (x1, y1, v1), (x2, y2, v2), (x3, y3, v3)
[Figure: CTT built over the DDG, with virtual start node ds and end node de. Cost edges include: e(ds, d1) = y1; e(d1, d2) = y2; e(d2, d3) = y3; e(d3, de) = 0; e(ds, d2) = x1v1 + y2; e(d1, d3) = x2v2 + y3; e(d2, de) = x3v3; e(d1, de) = x2v2 + (x2 + x3)v3; e(ds, d3) = x1v1 + (x1 + x2)v2 + y3; e(ds, de) = x1v1 + (x1 + x2)v2 + (x1 + x2 + x3)v3]
The weights of cost edges:

$$\omega\langle d_i, d_j \rangle = y_j + \sum_{\{d_k \,\mid\, d_k \in DDG \,\wedge\, d_i \to d_k \to d_j\}} genCost(d_k) \cdot v_k$$
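A sketch of the linear-DDG case (our own simplified reading of CTT-SP, assuming only the edge-weight formula above): build the cost edges over virtual nodes ds and de, then take the shortest path through the resulting DAG:

```python
def ctt_sp_linear(ddg: list) -> float:
    # CTT-SP sketch for a linear DDG d1 -> ... -> dn.
    # Node 0 is the virtual start ds; node n+1 is the virtual end de.
    n = len(ddg)
    x = [0.0] + [d.x for d in ddg]
    y = [0.0] + [d.y for d in ddg] + [0.0]   # y[de] = 0: de stores nothing
    v = [0.0] + [d.v for d in ddg]

    def weight(i: int, j: int) -> float:
        # omega<d_i, d_j> = y_j + sum over the deleted d_k strictly between
        # d_i and d_j of genCost(d_k) * v_k, d_i being the nearest stored dataset.
        w, gen = y[j], 0.0
        for k in range(i + 1, min(j, n + 1)):
            gen += x[k]          # genCost(d_k) accumulates the deleted x's
            w += gen * v[k]
        return w

    # Shortest path ds -> de over the forward-only cost edges.
    INF = float("inf")
    dist = [0.0] + [INF] * (n + 1)
    for j in range(1, n + 2):
        for i in range(j):
            dist[j] = min(dist[j], dist[i] + weight(i, j))
    return dist[n + 1]   # minimum total cost rate of the linear DDG
```

For the 3-dataset example, the path ds → d1 → de has length y1 + x2v2 + (x2 + x3)v3, matching TCR_S1 above.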
A Local-Optimization based Datasets Storage Strategy
• Requirements of the Storage Strategy
– Efficiency and scalability
• The strategy is used at runtime in the cloud and the DDG may be large
• The strategy itself consumes computation resources
– Reflect users' preferences and data accessing delay
• Users may want to store some datasets
• Users may have a certain tolerance of data accessing delay
A Local-Optimization based Datasets
Storage Strategy
• Introduce two new attributes of the datasets in DDG to represent users' accessing delay tolerance: <Ti, λi>
• Ti is a duration of time that denotes users' tolerance of dataset di's accessing delay; every deleted dataset dk must be regenerable within its tolerance:

$$\forall d_k \in DDG \,(d_i \to d_k \to d_j): \; T_k \geq \frac{genCost(d_k)}{CostCPU}$$
• λi is a parameter denoting users' cost-related tolerance of dataset di's accessing delay; it is a value between 0 and 1
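A hypothetical check for the Ti constraint, assuming the Dataset sketch is extended with a tolerance field T (in time units) and that dividing regeneration cost by the CPU price converts it back into time (λi would additionally discount the cost edges and is omitted here):

```python
COST_CPU = 0.1 / 3600   # assumed CPU price in $ per second ($0.1 per CPU hour)

def within_tolerance(deleted: list) -> bool:
    # Every deleted dataset d_k must be regenerable within its tolerance T_k:
    #     genCost(d_k) / CostCPU  <=  T_k    (cost / price = regeneration time)
    return all(gen_cost(d) / COST_CPU <= d.T for d in deleted)
```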
A Local-Optimization based Datasets Storage Strategy
• Efficiency and Scalability
– A general DDG is very complex. The computational complexity of the CTT-SP algorithm is O(n^9), which is not efficient or scalable enough for large DDGs
• Partition the large DDG into small linear segments
[Figure: a DDG partitioned into linear segments (Linear DDG1–DDG4) at partitioning point datasets]
• Utilize the CTT-SP algorithm on the linear DDG segments to guarantee a localized optimum (see the sketch below)
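The strategy then reduces to running the linear algorithm per segment; a sketch, assuming the partitioning-point datasets have already been identified and the segments extracted:

```python
def local_optimization(segments: list) -> float:
    # Run linear CTT-SP on each linear segment of the partitioned DDG and
    # sum the locally optimal cost rates (a local, not global, optimum).
    return sum(ctt_sp_linear(seg) for seg in segments)
```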
Evaluation
• Use randomly generated DDGs for simulation
– Size: randomly distributed from 100 GB to 1 TB
– Generation time: randomly distributed from 1 hour to 10 hours
– Usage frequency: randomly distributed from 1 day to 10 days (time between every two usages)
– Users' delay tolerance (Ti): randomly distributed from 10 hours to one day
– Cost parameter (λi): randomly distributed from 0.7 to 1 for every dataset in the DDG
• Adopt the Amazon cloud services' price model (EC2 + S3):
– $0.15 per GB per month for storage resources
– $0.10 per CPU hour for computation resources
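A sketch of a generator for this setup (our own; the conversions from sizes and times into $-rates via the EC2/S3 prices are assumptions, and Ti/λi are omitted for brevity):

```python
import random

GB_MONTH_PRICE = 0.15   # $ per GB per month (S3 price from the slide)
CPU_HOUR_PRICE = 0.10   # $ per CPU hour (EC2 price from the slide)

def random_linear_ddg(n: int) -> list:
    # Build a random linear DDG following the stated distributions.
    ddg, prev = [], None
    for i in range(n):
        size_gb = random.uniform(100, 1000)        # 100 GB - 1 TB
        gen_hours = random.uniform(1, 10)          # 1 - 10 hours
        usage_gap_days = random.uniform(1, 10)     # 1 - 10 days between usages
        d = Dataset(
            name=f"d{i + 1}",
            x=gen_hours * CPU_HOUR_PRICE,          # generation cost ($)
            y=size_gb * GB_MONTH_PRICE / 30,       # storage rate ($/day)
            v=1.0 / usage_gap_days,                # usages per day
            preds=[prev] if prev else [],
        )
        ddg.append(d)
        prev = d
    return ddg
```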
Evaluation
• Compare different storage strategies with the proposed strategy:
– Usage based strategy
– Generation cost based strategy
– Cost rate based strategy
Evaluation
[Figure: change of daily cost rate (USD/day) vs. number of datasets in the DDG (0–1100), with 4% of datasets stored by users; compares store-all-datasets, store-none, usage based, generation cost based, cost rate based, and the local-optimisation based strategies]
Evaluation
[Figure: CPU time (s) of the strategies vs. number of datasets in the DDG (50–1000); compares the CTT-SP algorithm, the cost rate based strategy, and the local-optimisation based strategy (including variants with n_i = 10 and m = 5)]
Thanks