A Local-Optimization based Strategy for Cost-Effective Datasets Storage of Scientific Applications in the Cloud

Many slides from the authors' presentation at CLOUD 2011
Presenter: Guangdong Liu
Mar 13th, 2012

Outline
• Introduction
• A Motivating Example
• Problem Analysis
• Important Concepts and Cost Model of Datasets Storage in the Cloud
• A Local-Optimization based Strategy for Cost-Effective Datasets Storage of Scientific Applications in the Cloud
• Evaluation and Simulation

Introduction
• Scientific applications
– Computation and data intensive
• Generated datasets: terabytes or even petabytes in size
• Huge computation: e.g. scientific workflows
– Intermediate data: important!
• Reused or reanalyzed
• Shared between institutions
• Regeneration vs. storing

Introduction
• Cloud computing
– A new way to deploy scientific applications
– Pay-as-you-go model
• Storage strategy
– Which generated datasets should be stored?
– Trade-off between cost and user preference
– A cost-effective strategy is needed

A Motivating Example
• Parkes radio telescope and pulsar survey
• Pulsar searching workflow
[Figure: pulsar searching workflow — Record Raw Data → Extract Beam → Compress Beam → De-disperse (trial measures 1–1200) → Accelerate → FFT Seek / FFA Seek / Pulse Seek → Get Candidates → Eliminate candidates → Fold to XML → Make decision]

A Motivating Example
• Sizes and generation times of the intermediate datasets (derived from the raw beam data):

Dataset                          Size    Generation time
Extracted & compressed beam      20 GB   27 min
De-dispersion files              90 GB   790 min
Accelerated de-dispersion files  90 GB   300 min
Seek results files               16 MB   80 min
Candidate list                   1 KB    1 min
XML files                        25 KB   245 min

• Current storage strategy
– Delete all the intermediate data, due to storage limitations
• Some intermediate data should be stored
• Some need not be

Problem Analysis
• Which datasets should be stored?
– Data challenge: data volumes double every year over the next decade and beyond [Szalay et al., Nature, 2006]
– Different strategies lead to different costs
– Scientific workflows are very complex, and there are dependencies among datasets
– Furthermore, no single scientist can decide the storage status of a dataset anymore
– Data accessing delay matters
– Datasets should be stored based on the trade-off between computation cost and storage cost
• A cost-effective datasets storage strategy is needed

Important Concepts
• Data Dependency Graph (DDG)
– A classification of the application data
• Original data and generated data
– Data provenance
• A kind of meta-data that records how data are generated
– DDG
[Figure: example DDG with eight datasets d1–d8 linked by generation dependencies]

Important Concepts
• Attributes of a Dataset in DDG
– A dataset d_i in DDG has the attributes <x_i, y_i, f_i, v_i, provSet_i, CostR_i>
• x_i ($) denotes the generation cost of dataset d_i from its direct predecessors
• y_i ($/t) denotes the cost of storing dataset d_i in the system per time unit
• f_i (Boolean) is a flag denoting whether dataset d_i is stored or deleted in the system
• v_i (Hz) denotes the usage frequency, which indicates how often d_i is used

Important Concepts
• Attributes of a Dataset in DDG
– provSet_i denotes the set of stored provenances that are needed when regenerating dataset d_i:

genCost(d_i) = x_i + Σ { x_k | d_j ∈ provSet_i ∧ d_j → d_k → d_i }

– CostR_i ($/t) is d_i's cost rate: the average cost per time unit of d_i in the system:

CostR_i = y_i,                 if f_i = "stored"
CostR_i = genCost(d_i) · v_i,  if f_i = "deleted"
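To make these definitions concrete, below is a minimal Python sketch of genCost and CostR for a linear DDG (a chain d_0 → d_1 → …). The class and function names are illustrative, not from the paper, and the sketch assumes the original raw data are always available, so regeneration only walks back to the nearest stored dataset.

```python
from dataclasses import dataclass

@dataclass
class Dataset:
    x: float       # x_i: generation cost from direct predecessors ($)
    y: float       # y_i: storage cost per time unit ($/t)
    v: float       # v_i: usage frequency (uses per time unit)
    stored: bool   # f_i: True = stored, False = deleted

def gen_cost(ddg, i):
    """genCost(d_i) in a linear DDG: x_i plus the generation costs of
    all deleted datasets between d_i and its nearest stored
    predecessor (its provenance)."""
    cost = ddg[i].x
    k = i - 1
    while k >= 0 and not ddg[k].stored:  # walk back to the provenance
        cost += ddg[k].x
        k -= 1
    return cost                          # k < 0 means the raw input data

def cost_rate(ddg, i):
    """CostR_i: y_i if d_i is stored, genCost(d_i) * v_i if deleted."""
    return ddg[i].y if ddg[i].stored else gen_cost(ddg, i) * ddg[i].v
```

Summing cost_rate over all datasets gives the total cost rate of a storage strategy, which the cost model below formalizes.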
• Cost = C + S
– C: total cost of computation resources
– S: total cost of storage resources

Cost Model of Datasets Storage in the Cloud
• Total cost rate of a DDG:

TCR_S = Σ_{d_i ∈ DDG} CostR_i

– S is the storage strategy of the DDG
• Example: a linear DDG d1 → d2 → d3 with attributes (x_i, y_i, v_i)
– S1: f1 = 1, f2 = 0, f3 = 0: TCR_S1 = y_1 + x_2·v_2 + (x_2 + x_3)·v_3
– S2: f1 = 0, f2 = 0, f3 = 1: TCR_S2 = x_1·v_1 + (x_1 + x_2)·v_2 + y_3
– …
• For a DDG with n datasets, there are 2^n different storage strategies

CTT-SP Algorithm
• To find the minimum cost storage strategy for a DDG
• Philosophy of the algorithm:
– Construct a Cost Transitive Tournament (CTT) based on the DDG
• In the CTT, the paths from the start dataset to the end dataset map one-to-one to the storage strategies of the DDG
• The length of each path equals the total cost rate of the corresponding storage strategy
– The Shortest Path (SP) therefore represents the minimum cost storage strategy

CTT-SP Algorithm
• Example: a linear DDG d1 → d2 → d3
[Figure: CTT for the linear DDG, with virtual start node d_s and end node d_e; edge labels include y_1, x_1v_1 + y_2, x_1v_1 + (x_1+x_2)v_2 + y_3, x_1v_1 + (x_1+x_2)v_2 + (x_1+x_2+x_3)v_3, x_2v_2 + y_3, x_2v_2 + (x_2+x_3)v_3, x_3v_3, and 0 on the final edge into d_e]
• The weights of the cost edges:

ω⟨d_i, d_j⟩ = y_j + Σ { genCost(d_k) · v_k | d_k ∈ DDG ∧ d_i → d_k → d_j }
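The shortest path over the CTT can be found with ordinary shortest-path machinery. Below is a compact Python sketch for the linear-DDG case, reusing the Dataset chain encoding from the previous sketch; edge_weight and min_cost_strategy are illustrative names, and the sketch is a simplification of the paper's CTT-SP algorithm, not a transcription of it.

```python
def edge_weight(ddg, i, j):
    """Weight of cost edge <d_i, d_j>: y_j plus genCost(d_k) * v_k for
    every d_k strictly between d_i and d_j, with d_i as the nearest
    stored predecessor. i == -1 is the virtual start node; j == len(ddg)
    is the virtual end node (which stores nothing, so y = 0)."""
    w = ddg[j].y if j < len(ddg) else 0.0
    gen = 0.0
    for k in range(i + 1, j):
        gen += ddg[k].x          # genCost(d_k) when d_i is stored
        w += gen * ddg[k].v
    return w

def min_cost_strategy(ddg):
    """Shortest path from the virtual start to the virtual end of the
    CTT; the datasets on the path are the ones to store."""
    n = len(ddg)
    # node 0 = virtual start, node i+1 = d_i, node n+1 = virtual end
    dist = [float("inf")] * (n + 2)
    prev = [None] * (n + 2)
    dist[0] = 0.0
    for j in range(1, n + 2):
        for i in range(j):
            d = dist[i] + edge_weight(ddg, i - 1, j - 1)
            if d < dist[j]:
                dist[j], prev[j] = d, i
    stored, node = [], n + 1
    while prev[node] is not None:       # walk the path backwards
        node = prev[node]
        if 1 <= node <= n:
            stored.append(node - 1)
    return dist[n + 1], sorted(stored)
```

On the three-dataset example above, the returned cost equals the smallest TCR among all 2^3 strategies; the CTT turns an exponential enumeration into a polynomial shortest-path search.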
A Local-Optimization based Datasets Storage Strategy
• Requirements of the Storage Strategy
– Efficiency and scalability
• The strategy is used at runtime in the cloud, and the DDG may be large
• The strategy itself consumes computation resources
– Reflect users' preferences and data accessing delay
• Users may want certain datasets stored
• Users may have a certain tolerance of data accessing delay

A Local-Optimization based Datasets Storage Strategy
• Introduce two new attributes of the datasets in DDG to represent users' accessing delay tolerance: <T_i, λ_i>
• T_i is a duration of time that denotes users' tolerance of dataset d_i's accessing delay; a cost edge ⟨d_i, d_j⟩ is only admissible if every dataset it leaves deleted can be regenerated within its tolerance:

∀ d_k ∈ DDG ∧ (d_i → d_k → d_j): genCost(d_k) / CostCPU ≤ T_k

• λ_i is a parameter that denotes users' cost-related tolerance of dataset d_i's accessing delay; it is a value between 0 and 1

A Local-Optimization based Datasets Storage Strategy
• Efficiency and Scalability
– A general DDG is very complex. The computation complexity of the general CTT-SP algorithm is O(n^9), which is neither efficient nor scalable enough for large DDGs
– Partition the large DDG into small linear segments at the partitioning point datasets (see the partitioning sketch after the evaluation section)
[Figure: a DDG partitioned into Linear DDG1–DDG4 at two partitioning point datasets]
– Apply the CTT-SP algorithm to the linear DDG segments, which guarantees a localized optimum

Evaluation
• Use randomly generated DDGs for simulation
– Size: randomly distributed from 100 GB to 1 TB
– Generation time: randomly distributed from 1 hour to 10 hours
– Usage frequency: randomly distributed from 1 day to 10 days (time between usages)
– Users' delay tolerance (T_i): randomly distributed from 10 hours to one day
– Cost parameter (λ_i): randomly distributed from 0.7 to 1 for every dataset in the DDG
• Adopt the Amazon cloud services' price model (EC2 + S3):
– $0.15 per GB per month for storage resources
– $0.10 per CPU hour for computation resources

Evaluation
• Compare the proposed strategy with other storage strategies
– Usage based strategy
– Generation cost based strategy
– Cost rate based strategy

Evaluation
[Figure: change of daily cost rate (USD/day) against the number of datasets in the DDG (100–1100), with 4% of the datasets stored by users; compares store-all, store-none, the usage based, generation cost based, and cost rate based strategies, and the local-optimisation based strategy]

Evaluation
[Figure: CPU time (s) of the strategies against the number of datasets in the DDG (50–1000); compares the cost rate based strategy, the CTT-SP algorithm, and the local-optimisation based strategy, including variants with n_i = 10 and m = 5]
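Returning to the DDG partitioning described above: below is a minimal sketch of how a DDG might be split into linear segments at its branching datasets. The representation (a successor dict) and the rule for picking partitioning points are assumptions for illustration; the paper's own partitioning procedure may differ in detail.

```python
def partition_linear_segments(succ):
    """Split a DDG, given as {dataset: [successors]}, into linear
    segments. A partitioning-point dataset is one with more than one
    predecessor or successor; it separates segments and is assumed to
    be handled separately (e.g. treated as stored). Each returned
    segment is a chain the linear CTT-SP sketch above can process."""
    pred = {u: [] for u in succ}
    for u, vs in succ.items():
        for v in vs:
            pred[v].append(u)

    def is_cut(u):
        return len(succ[u]) > 1 or len(pred[u]) > 1

    segments, seen = [], set()
    for u in succ:
        # a segment starts at a non-cut dataset whose predecessor
        # is absent or is itself a partitioning point
        if u in seen or is_cut(u):
            continue
        if len(pred[u]) == 1 and not is_cut(pred[u][0]):
            continue
        seg = []
        while True:                     # walk the chain forward
            seg.append(u)
            seen.add(u)
            nxt = succ[u]
            if len(nxt) != 1 or is_cut(nxt[0]):
                break
            u = nxt[0]
        segments.append(seg)
    return segments

# Example: a diamond-shaped DDG splits into two one-dataset segments
print(partition_linear_segments({"a": ["b", "c"], "b": ["d"],
                                 "c": ["d"], "d": []}))
```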
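Finally, a sketch of the simulation setup, under the distributions and Amazon-style prices listed in the evaluation section. The store-all and store-none baselines bracket the other strategies in the daily-cost figure; the function names and fixed seed are illustrative.

```python
import random

STORAGE_PER_GB_MONTH = 0.15  # USD, S3-style storage price
CPU_PER_HOUR = 0.10          # USD, EC2-style computation price

def random_ddg(n, seed=42):
    """A random linear DDG as (x, y, v) triples: generation cost in $,
    storage cost in $/day, and usage frequency in uses/day."""
    rng = random.Random(seed)
    ddg = []
    for _ in range(n):
        size_gb = rng.uniform(100, 1024)        # 100 GB - 1 TB
        gen_hours = rng.uniform(1, 10)          # 1 - 10 hours
        days_between_uses = rng.uniform(1, 10)  # used every 1 - 10 days
        ddg.append((gen_hours * CPU_PER_HOUR,
                    size_gb * STORAGE_PER_GB_MONTH / 30,
                    1.0 / days_between_uses))
    return ddg

def daily_cost_store_all(ddg):
    return sum(y for _, y, _ in ddg)

def daily_cost_store_none(ddg):
    """With everything deleted, each use of d_i regenerates d_1..d_i."""
    total = cum_gen = 0.0
    for x, _, v in ddg:
        cum_gen += x
        total += cum_gen * v
    return total

ddg = random_ddg(500)
print(f"store all:  {daily_cost_store_all(ddg):.0f} USD/day")
print(f"store none: {daily_cost_store_none(ddg):.0f} USD/day")
```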