File-Access Patterns of Data-Intensive Workflow Applications and their Implications to Distributed Filesystems

Takeshi Shibata (shibata@logos.ic.i.u-tokyo.ac.jp), SungJun Choi (demyan@logos.ic.i.u-tokyo.ac.jp), Kenjiro Taura (tau@logos.ic.i.u-tokyo.ac.jp)
Department of Information and Communication Engineering, Graduate School of Information Science and Technology, University of Tokyo

ABSTRACT
This paper studies five real-world data-intensive workflow applications in the fields of natural language processing, astronomy image analysis, and web data analysis. Data-intensive workflows are becoming increasingly important applications for cluster and Grid environments. They pose new challenges to various components of workflow execution environments, including job dispatchers, schedulers, file systems, and file staging tools. The keys to achieving high performance are efficient data sharing among executing hosts and locality-aware scheduling that reduces the amount of data transfer. While much work has been done on scheduling workflows, much of it uses synthetic or random workloads. As such, its impact on real workloads is largely unknown. Understanding the characteristics of real-world workflow applications is a necessary step to promote research in this area. To this end, we analyse real-world workflow applications focusing on their file access patterns, and summarize their implications for schedulers and file system/staging designs.

Keywords
workflow applications, distributed filesystems

1. INTRODUCTION
Workflow facilitates the integration of individually developed executables, making parallel processing readily accessible to domain experts. It has thus become an important discipline in various fields including natural science, engineering, and information processing. Many systems for executing workflows have been developed [1, 2, 3]. More recently, programming paradigms and systems specifically designed for large-scale data processing, such as MapReduce [4], Hadoop (http://hadoop.apache.org/), and Dryad [5] (http://research.microsoft.com/en-us/projects/Dryad/), have made it popular to utilize parallel processing without the involved effort of parallel programming.

An obvious common goal of these systems is efficient execution of workflows. To this end, there have been efforts on various components of workflow engines, including scheduling algorithms [6, 7, 8, 9, 10, 11, 12], data transfers [13, 14], and fast dispatchers [15]. There are also efforts focusing on scheduling with data transfer costs taken into account [16, 17, 18, 19]. A good survey on scheduling algorithms is given in [20].

Despite their importance, practical evaluations of workflow systems have been rare and remain difficult, mainly due to the lack of commonly accessible benchmarks. Even with a real application, translating the result on a particular platform into a generally acceptable observation on workflow systems is difficult, because the performance of a workflow depends on many components of the environment such as nodes, networks, file systems, and so on. This is particularly so because workflows typically consist of many sequential executables, each of which may have unknown sensitivities to its environment. Most existing studies on scheduling algorithms have therefore been based on simulation with synthetic workloads such as randomly generated task graphs.
Bharathi et al. [21] is one of the few studies that systematically characterizes several real workflow applications. The present paper shares the same spirit as theirs, but pays special attention to the IO (file access) behaviors of applications.

A single workflow generally consists of a set of tasks, each of which may communicate with (i.e. depend on or be depended upon by) another task in the workflow. Since a task is typically a sequential (single node) application, data transfer among tasks is generally handled by the workflow system. Data may be implicitly transferred via a shared file system or explicitly moved by a staging subsystem. In either case, data generally goes through secondary storage to ensure a certain degree of fault tolerance, namely that a workflow can restart from the point where a job failed. As such, understanding the IO behaviours of real workflow applications is very important for implementing efficient workflow systems. It is particularly so for data-intensive workflows, which are becoming major targets of workflow systems.

To this end, this paper shows the results of our study on five real workflow applications in the fields of natural language processing, web data analysis, and astronomy image analysis. With this study, we would like to provide data points by which we can investigate important design questions for implementing workflows and related subsystems such as file systems. Relevant questions include:

• How effective is it to place jobs according to file affinity, and what is the tradeoff between the complexity and benefit of scheduling algorithms?
• What are the advantages and disadvantages of using shared file systems (such as NFS [22], Lustre (http://www.sun.com/software/products/lustre/), GPFS [23] (http://www-03.ibm.com/systems/clusters/software/gpfs/), and Ceph [24]) over using explicit staging?
• How much can we gain from staging systems that transfer files directly between compute hosts rather than indirectly through a central home node?
• Likewise, how much can we gain from user-level file systems such as Gfarm [25] or GMount [26] that can use any compute host as a storage server?

2. RELATED WORK
2.1 Workflow Studies
Bharathi et al. [21] is the closest to the present work. It studies five real workflow applications, analyses the dependency graph of each workflow, and identifies common communication and dependency patterns such as pipeline, data distribution, aggregation, and redistribution. It also shows statistics on job granularities. In addition to dependency patterns, we also study file access behaviors and discuss their implications for file system and staging designs.

2.2 Executable Workflows
Montage [27] is one of the few workflow applications that can be readily downloaded and executed by others. Both Bharathi et al. [21] and our work include this application as a target of their studies. We will provide four more executable applications with annotations that can be shared by researchers in this community.

2.3 Workflow Collections
As a publicly accessible collection of workflows, Bharathi et al. also provide the WorkflowGenerator web site (http://vtcpc.isi.edu/pegasus/index.php/WorkflowGenerator), which generates DAGs whose nodes are tasks annotated with runtimes and the files they use. Files are annotated with their sizes, so dependencies and communication volumes can be reconstructed from the DAGs. We are preparing a web site that provides similar data for the applications studied in this paper. In addition, we will provide more comprehensive traces of file access behaviors, as well as the executables themselves.
myExperiment (http://www.myexperiment.org/) is a repository of workflow descriptions, currently providing more than 750 workflows. Each workflow is given as a description for a particular system (mostly for Taverna [3]) and is not annotated with profile information such as runtimes and file sizes. The number of workflows we are going to provide is much smaller, but we are trying to provide them in a form executable by others.

2.4 Workflow Scheduling Techniques and Distributed File Systems
There have been efforts on various components of workflow engines, including scheduling algorithms [6, 7, 8, 9, 10, 11, 12]. However, most of those methods require the complete form of the DAG of a workflow, as well as the completion time and the input/output file sizes of each job, etc. It is difficult to apply scheduling techniques such as HEFT without this information. In particular, the complete form of a DAG requires the workflow description to declare the input and output files of each job explicitly. Because the workflow descriptions we examined in this paper have no explicit declaration of input/output files, some system that helps dispatched jobs find required files and register output files is needed. Distributed file systems can be used for that purpose.

3. WORKFLOW EXECUTION AND FILE ACCESS LOGGING
3.1 Workflow Engine
The workflows described in this paper are written as Makefiles, and a parallel and distributed make engine called GXP Make (http://www.logos.ic.i.u-tokyo.ac.jp/gxp/) was used to execute them. Make is a convenient tool for describing jobs with dependencies and can naturally be used for describing and executing workflows. Except for Montage, all workflows were written as Makefiles by their original authors. For Montage, one of its tools generates an abstract workflow in an XML format called DAX, which is used by Pegasus. The output DAX files are converted to Makefiles: every job description given by a <job> tag is converted to a "rule" of the Makefile; the commands of a rule are extracted from the <job> tags, whereas the dependency (prerequisite) and target filenames are taken from <uses> tags with the attributes link="input" and link="output", respectively.

GXP Make runs on top of an existing Make implementation, GNU Make, which supports parallel execution on a single multiprocessor node. GXP Make extends this function to parallel execution in distributed environments by intercepting its child process invocations. Processes forked by the underlying GNU Make enter a queue managed by GXP Make; it then dispatches jobs to available workers that have been acquired by an underlying resource manager such as TORQUE, SGE, SSH, etc. Input and output files of tasks are stored in a single network file system, such as NFS, Lustre, Gfarm, etc.

The choice of this particular workflow system is largely orthogonal to the results presented in this paper. It is not difficult to automatically generate descriptions for other workflow systems such as DAGMan, given the file access logging methods presented in the next section.
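To make the DAX-to-Makefile mapping described above concrete, the following Python sketch converts <job> and <uses> entries of a DAX file into Make rules. It is only an illustration under simplifying assumptions, not the converter actually used: the element names follow the Pegasus DAX convention, but the namespace URI, attribute names, and function names here are our own assumptions.

    # Illustrative sketch (not the paper's converter): map DAX <job>/<uses> entries
    # to Make rules, with output files as targets and input files as prerequisites.
    import xml.etree.ElementTree as ET

    def dax_to_makefile(dax_path, ns="{http://pegasus.isi.edu/schema/DAX}"):
        root = ET.parse(dax_path).getroot()
        rules = []
        for job in root.iter(ns + "job"):
            uses = job.findall(ns + "uses")
            inputs = [u.get("file") for u in uses if u.get("link") == "input"]
            outputs = [u.get("file") for u in uses if u.get("link") == "output"]
            arg = job.find(ns + "argument")
            args = " ".join("".join(arg.itertext()).split()) if arg is not None else ""
            cmd = (job.get("name", "") + " " + args).strip()
            # one Make rule per <job>: "outputs: inputs" followed by the command line
            rules.append("{}: {}\n\t{}\n".format(" ".join(outputs), " ".join(inputs), cmd))
        return "\n".join(rules)

    if __name__ == "__main__":
        print(dax_to_makefile("montage.dax"))  # hypothetical DAX file name

Once such a Makefile is generated, GXP Make executes it unchanged, so the same logging and analysis pipeline applies to Montage as to the hand-written workflows.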
3.2 File Access Logging
In data-intensive applications, it is important to examine how much data is transferred over the network, when and where these transfers happen, and where the bottlenecks are. When data is transferred mainly as files through a distributed file system, detailed file access information is required to extract such knowledge. In order to record file access operations, we developed a logging tool specialized for GXP Make using Filesystem in Userspace (FUSE). Table 1 shows the data recorded for each access.

Table 1: logged data per file access operation

  job id    : hostname:processId
  operation : read, write, open, create, getattr, mkdir, etc.
  data id   : filename
  date      : execution time of each operation, in nsec. from the epoch
  time      : blocking time during each operation
  size      : size of each read and write

A file system using our logging tool provides a mount point under which the target directory is mirrored. When a process accesses a file or a directory under the mount point, one of our registered callback routines is called by FUSE, in which the access is logged along with the client process id.

Although another FUSE-based tool, ParaTrac [28], offers results similar to ours, our tool was implemented independently and for a different purpose. While ParaTrac is a general-purpose profiling tool, ours is specialized for the GXP Make workflow system and has several functions that ParaTrac lacks. For example, while our tool cannot produce fine-grained analyses such as statistics on when 'read' operations occur within processes, it is able to divide jobs into categories for a given Makefile and show their statistics, extract the user CPU time of each job and show the critical path of the workflow, produce workflow DAGs as HTML files annotated with information about the workflow execution, and so on.

We need to translate the process id into the identity of a job (note that a job may consist of multiple processes). GXP Make gives a unique number to each job through an environment variable (GXP_WORK_IDX). When our logging tool receives a callback, it records the GXP_WORK_IDX of the client process if it is the first callback from that process.

When the file system is unmounted after all file accesses of a workflow have been logged, the access logs of all hosts are collected to form a database. Dependencies among dispatched jobs and accessed files are inferred from those logs as follows: (1) if a job X accessed a file A with any file operation, X is related to A; (2) if X created A, then A depends on X (X → A), otherwise X depends on A (A → X). After the database is created, the workflow DAG is generated as a clickable HTML file with information about the workflow execution. Statistics on the file-access patterns of each job and of each job category are also extracted from the database.

A category of jobs is a set of jobs derived from the same line in the Makefile, or from the same command without its input file. All dispatched command lines, or jobs, are divided into categories by the following steps:

1. Strings called normal commands are extracted from the given Makefile by shallow parsing. Each normal command basically corresponds to a single line of the Makefile. For instance, if the following block is in a Makefile:

   DIR := somedir
   $(DIR)/%.output : $(DIR)/%.input
           cmd $< > $@

   then the corresponding normal command is

   cmd somedir/%.input > somedir/%.output

2. The distance of each dispatched command line from the normal commands is measured by edit distance. Each dispatched command line is assigned to the category represented by the nearest normal command.
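As a concrete illustration of the dependency inference and categorization just described, the following Python sketch infers job-file edges from log records shaped like Table 1 and assigns command lines to categories. It is a minimal sketch under our own assumptions about the record layout; the field names, the use of difflib as a stand-in for an edit-distance routine, and the function names are not part of the paper's tool.

    # Minimal sketch (assumed record layout, not the tool's actual API):
    # - infer_edges(): rules (1)/(2) above, i.e. "create" => job -> file, otherwise file -> job
    # - categorize(): nearest normal command; difflib's ratio is a proxy for edit distance
    from difflib import SequenceMatcher

    def infer_edges(records):
        """records: iterable of dicts with keys 'job_id', 'operation', 'filename'."""
        edges = set()
        for r in records:
            if r["operation"] == "create":
                edges.add((r["job_id"], r["filename"]))   # X created A: X -> A
            else:
                edges.add((r["filename"], r["job_id"]))   # X accessed A: A -> X
        return edges

    def categorize(cmdline, normal_commands):
        # higher similarity ratio = nearer normal command
        return max(normal_commands,
                   key=lambda nc: SequenceMatcher(None, cmdline, nc).ratio())

    # Example: jobs dispatched from the same Makefile line fall into one category.
    cats = {c: categorize(c, ["cmd somedir/%.input > somedir/%.output"])
            for c in ["cmd somedir/a.input > somedir/a.output",
                      "cmd somedir/b.input > somedir/b.output"]}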
4. WORKFLOW APPLICATIONS
In this section, details and profiling results of the five workflow applications (https://www.intrigger.jp/wiki/index.php/Applications) are given. They are:

• Medline to Medie Indexing
• Case Frames Construction
• Similar Pages Extraction
• Supernovae Explosion Detection
• Montage

The first two are applications in natural language processing; Similar Pages Extraction is a data mining application applied to a large collection of web pages; the last two are image analysis applications in astronomy.

The experimental environment is a single cluster composed of 15 nodes (120 CPU cores) with a single NFS server. The nodes are connected by Gigabit Ethernet with bonding (2 Gbps/node). We use GXP Make as the workflow engine. Details of the file-access patterns are logged and analysed by the means described in the previous section. The following subsections describe each application in more detail.

4.1 MEDLINE to MEDIE
The MEDLINE to MEDIE Indexing (hereafter M2M) workflow creates the indexing database for a running search engine called MEDIE (http://www-tsujii.is.s.u-tokyo.ac.jp/medie/). It indexes all registered abstracts in MEDLINE, the world's largest biomedical text database. It indexes and retrieves sentences with their semantic subjects, verbs, and objects, enabling queries such as "what does p53 activate?" To this end, its indexing involves deep natural language processing including parsing, requiring a large amount of CPU time and intermediate data storage.

M2M takes compressed XML text files as its input, each containing a number of biomedical papers in MEDLINE. Most tasks in M2M input and output compressed XML files. Figure 1 shows a sub-workflow of M2M for a single input file. A circle is a single task, and a square represents a file created or read by tasks. An arrow from a task to a file indicates that the task created the file, and one from a file to a task that the task opened the file for reading. The most time-consuming part is parsing, done by a parser called Enju. This is parallelized within each input file by splitting the intermediate file right before the parsing into small chunks. This can be seen as a number of horizontally aligned boxes around the middle of the graph.

Figure 1: MEDLINE to MEDIE (sub-workflow DAG for a single input file)
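Graphs in the convention of Figure 1 can be rendered directly from the job-file edges inferred in Section 3.2. The short Python sketch below is only an illustration of that convention (circles for tasks, boxes for files, edges in the access direction); it emits Graphviz DOT text and is not the clickable-HTML generator described earlier. All names are ours.

    def to_dot(edges, jobs):
        """edges: set of (src, dst) node names; jobs: set of node names that are tasks."""
        lines = ["digraph workflow {"]
        for n in sorted({x for e in edges for x in e}):
            shape = "circle" if n in jobs else "box"   # tasks as circles, files as boxes
            lines.append('  "{}" [shape={}];'.format(n, shape))
        for src, dst in sorted(edges):
            lines.append('  "{}" -> "{}";'.format(src, dst))
        lines.append("}")
        return "\n".join(lines)

    # Example in the convention of Figure 1: a task reads one file and creates another.
    print(to_dot({("in.xml.gz", "enju"), ("enju", "parsed.xml.gz")}, jobs={"enju"}))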
Each input file results in 68 independent jobs, including natural language processing, database creation, and the management and conversion of files described in XML, etc. The resource-consuming tasks are listed below in decreasing order of execution time:

1. Enju: HPSG parser, which constructs predicate-argument structures.
2. Akane: protein-protein interaction inference from sentences.
3. Genia-tagger: part-of-speech tagging and shallow parsing for biomedical text.
4. Medie-ner-disamb: named entity recognition for biomedical terms, such as protein names, and removing ambiguity from them.
5. Medie-ner-filter: preprocessing for Medie-ner-disamb.
6. MakeDB: creation of the index database for MEDIE.
7. Som-merge: merging of intermediate XML files.

Figure 2(a) shows the timing details of the fifteen most time-consuming tasks for 99 input files (491 MB). In the experiment, Enju is executed 3584 times and each of the other tasks is executed 99 times (once for each input file). While the most time-consuming task, Enju, is clearly CPU-intensive, the second one, Akane, is very IO-intensive. Moreover, it spends a significant amount of time on metadata accesses. Overall, 46% of the whole time is consumed by file IO and metadata access (Figure 2(b)). Details of the metadata accesses are shown in Table 2.

Figure 2: MEDLINE to MEDIE: time detail of each job (a) and of the whole workflow (b); time is broken down into metadata, read, write, stime, and utime.

Table 2: detail of file metadata access for M2M

  operation | count     | aggregate time [sec] | average time [msec] | time ratio
  getattr   | 1,749,482 |  84,266              |  48                 | 0.39
  readdir   |        99 |       4              |  40                 | 0
  open      | 3,418,394 | 128,671              |  38                 | 0.59
  create    |    21,642 |   1,199              |  55                 | 0.01
  mkdir     |       496 |      33              |  67                 | 0
  unlink    |     4,871 |   1,069              | 219                 | 0.01

4.2 Similar Pages Extraction
Similar Pages is a workflow that takes a large number of texts collected from the web as input and analyses them. It outputs all pairs of 'similar' sentences, i.e. those whose Hamming distance is smaller than a specified threshold. The key algorithm, called Sachica, is able to enumerate all such pairs rapidly [29]. The dependency DAG of the Similar Pages Extraction workflow is shown in Figure 3. It has three kinds of tasks:

• Mkbins: gathers all input text files and splits them into a given number (N) of files.
• Sachica1: finds pairs of similar subsequences in a single file.
• Sachica2: finds pairs of similar subsequences in two files, one from each of them.

Figure 3: Similar Pages Extraction workflow

A characteristic of the dependency DAG of Similar Pages is that Sachica1 and Sachica2 have N and N(N − 1)/2 jobs respectively, where N is the number of files Mkbins outputs. Each split file is therefore read N times by Sachica1 and Sachica2 jobs. Figure 4 shows an experimental result for about 1 million input files (5 GB total) with N = 30 (Mkbins: 1, Sachica1: 30, Sachica2: 435). Mkbins is very file-access intensive; its cost consists mainly of the IO operation 'read' and the metadata operations 'getattr' and 'open'.

Figure 4: Similar Pages: time detail of each job (a) and of the whole workflow (b)
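The all-pairs structure of this workflow is simple enough to write down. The following Python sketch (illustrative only; names are ours, not the workflow's code) enumerates the job set for N split files and reproduces the job counts of the N = 30 experiment.

    from itertools import combinations

    def similar_pages_jobs(n):
        """1 Mkbins job, N Sachica1 jobs (one per file),
        and N(N-1)/2 Sachica2 jobs (one per unordered file pair)."""
        jobs = [("mkbins", ())]
        jobs += [("sachica1", (i,)) for i in range(n)]
        jobs += [("sachica2", pair) for pair in combinations(range(n), 2)]
        return jobs

    counts = {}
    for name, _ in similar_pages_jobs(30):
        counts[name] = counts.get(name, 0) + 1
    print(counts)  # {'mkbins': 1, 'sachica1': 30, 'sachica2': 435}, as in the experiment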
Overall, file access accounts for half of its execution time, and 'read' is the main contributor to the file-access cost.

4.3 Case Frames Construction
Case Frames Construction is a workflow that creates data structures used in natural language processing called case frames [30, 31]. Case frames, which are frequent patterns of predicate-argument structures, are generated by analysing a large amount of text collected from the web. The workflow mainly consists of the following jobs, listed in increasing order of consumed time (Figure 7(a)):

1. parse-comp: a parser for Japanese texts based on KNP [30], including named entity recognition.
2. make-caseframe: a job that creates case frames by extracting reliable predicate-argument examples from the parsing result.
3. cf-distribute-rest: a job that adds all case frames to the clusters made by cf-clustering.
4. cf-clustering: clustering highly frequent case frames and merging them by similarity.
5. calc-of-similarity: computing similarities for each pair of case frames.

The DAG of the workflow with 4 input files is shown in Figure 5. As the figure shows, all files output by the upstream jobs are gathered into one large file, which is then split again into files of a different size, and finally these pieces are combined into a result file. The parallelism of the workflow therefore has two peaks. Figure 6 shows the number of concurrent jobs with respect to the elapsed time (in seconds) of an execution. In the experiment, 59% of the whole time is spent on file access, while the ratio of time for each job has a large variance (Figure 7).

Figure 5: Case Frames Construction workflow

Figure 6: Parallelism of an execution of Case Frames Construction

Figure 7: Details for Case Frames Construction: time detail of jobs (a) and of the whole workflow (b)

4.4 Finding Supernovae
Finding Supernovae is a workflow that finds, from astronomical data, the coordinates where supernovae appear.
The workflow uses the same software and data as the Data Analysis Challenge at IEEE Cluster/Grid 2008 (http://www.cluster2008.org/challenge/). Each pair of astronomical images, taken at the same position but at different times, is compared, and the positions of supernova candidates are listed. Finally, the candidate positions are gathered into one file. Figure 8 shows the DAG of Finding Supernovae for 2 pairs of input images. In this workflow, the heaviest job (imbus3vp3) is relatively compute-intensive while the other jobs are not.

Figure 8: Finding Supernovae workflow

Figure 9: Details for Finding Supernovae: time detail of jobs (a) and of the whole workflow (b)

4.5 Montage
We execute the Montage DAX, take logs, and generate the dependency graph (Figure 10). Each input file is an astronomical image file, and the workflow combines them into one picture. In the experiment we use 3156 2MASS (J) images as input, each of which is about 2 MB. As Figure 11(b) shows, almost all of the aggregate time is consumed by file access. Concentrating all files on a single file server becomes the bottleneck and degrades the performance of the workflow. The percentage of CPU usage time relative to the aggregate time becomes larger as the input size grows (Figure 11(c)), which means that concentrating data accesses degrades the throughput of the file server.

Figure 10: Montage workflow

Figure 11: Montage: time detail of each job (a), of the whole workflow (b), and the ratio of time of each operation with respect to the number of input files (28, 308, 3474) (c)
5. COMPARISON OF DESIGNS OF WORKFLOW SOLVERS
In this section, we investigate the costs of various designs of workflow solvers, such as staging, centralized file systems, and scheduling, based on the logged data of the workflows in the previous section.

5.1 Centralization Costs
Staging is a process that transfers required files to nodes in advance. Many staging methods send all updated and created files to central servers when jobs finish. Since intermediate files are not sent directly from the node that produced them to other nodes in those methods, called centralized staging, the bandwidth of and the access latency to the central servers frequently become a bottleneck, especially for data-intensive workflows.
Many centralized file systems, such as NFS and Lustre, have the same problem, whereas distributed file systems, such as GMount and Gfarm, can be more effective since they locate files on the local disks of nodes and send them only when necessary.

While the input and output files of a workflow usually have to be located on the file system, intermediate files, which are both read and written by jobs of the workflow, do not. We assume that files which are only read during the workflow execution are the input files, and that files which are only written are the output files. Table 3 shows the aggregate input, intermediate, and output file sizes, and the intermediate rate, intermediate/(input + intermediate + output), for each workflow in the previous section. If intermediate files are not sent to the central servers, 40-88% of the file transfers to the central servers of centralized file systems and staging systems, which can become a bottleneck for whole-workflow performance, can be avoided.

Table 3: ratio of the aggregate size of intermediate files

               input [GB]  interm. [GB]  output [GB]  interm. rate
  m2m              0.51        97.43        13.25         0.88
  similarpages     2.49         2.49         0.97         0.42
  caseframes       0.73        46.74        58.51         0.44
  supernovae      77.91       103.48        78.26         0.40
  montage          7.26        72.84        21.44         0.72

5.2 Locality-aware Scheduling
When a workflow scheduler decides the compute node to which each job is dispatched, it is important to consider the cost of accessing files located on remote nodes. Jobs should preferably be launched on a compute node on whose local disk or memory as many of their input files as possible are located. In many scheduling methods like HEFT [7], the cost of accessing files on remote nodes is considered when jobs are assigned to compute nodes, along with the predicted computation time of jobs, the length of the critical path, etc.

Let us define that a job Y depends on a job X if some output file of X is an input file of Y. If a job X has some jobs that depend only on X, it is very likely that at least one of the jobs depending on X is assigned to the same node as X, under any reasonable job-assignment criterion. In order to measure how much file data can be read locally under locality-aware scheduling methods, for each job X a job Y that depends on X is randomly selected among the jobs depending on X, and X and Y are assigned to the same node. Table 4 shows the result of this assignment for each workflow.

Table 4: local read for locality-aware scheduling (read sizes in GB)

               input     local     remote    local read rate
  m2m            0.51     69.56    119.35         0.36
  similarpages   2.49      0.17     74.61         0.002
  caseframes     0.73     46.35      0.99         0.96
  supernovae   233.73     37.50     67.54         0.11
  montage        9.13     56.93    235.86         0.18

While Case Frames Construction has a high local read rate, local/(local + input + remote), Similar Pages Extraction does not. The difference mainly arises from the difference in the average number of branches in the DAGs of the workflows. If a workflow has a small average number of branches per job in its DAG, locality awareness becomes more important. As Figures 3 and 5 show, most jobs of Case Frames Construction have one branch, while those of Similar Pages Extraction have 30 branches.
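The two estimates above (Tables 3 and 4) can be written down compactly. The following Python sketch is our own illustration, not the authors' analysis code: the data layout (per-file reader/writer sets and file sizes derived from the access log) and all names are assumptions, and the co-location step simply picks one random dependent per producer, as described in the text.

    import random
    from collections import defaultdict

    def intermediate_rate(readers, writers, sizes):
        """readers/writers: filename -> set of job ids; sizes: filename -> bytes."""
        inp = sum(sizes[f] for f in sizes if f in readers and f not in writers)   # read-only
        out = sum(sizes[f] for f in sizes if f in writers and f not in readers)   # write-only
        mid = sum(sizes[f] for f in sizes if f in readers and f in writers)       # read and written
        return mid / (inp + mid + out)

    def local_read_rate(readers, writers, sizes, seed=0):
        rng = random.Random(seed)
        node = {}                                     # job -> node; default is the job's own node
        place = lambda j: node.setdefault(j, j)
        dependents = defaultdict(set)                 # producer job -> jobs reading its output
        for f, ws in writers.items():
            for w in ws:
                dependents[w] |= readers.get(f, set()) - {w}
        for x in sorted(dependents):                  # co-locate one random dependent with x
            if dependents[x]:
                node[rng.choice(sorted(dependents[x]))] = place(x)
        inp = local = remote = 0.0
        for f, rs in readers.items():
            for r in rs:
                if f not in writers:
                    inp += sizes[f]                                   # input file
                elif any(place(w) == place(r) for w in writers[f]):
                    local += sizes[f]                                 # a producer shares r's node
                else:
                    remote += sizes[f]
        return local / (inp + local + remote)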
5.3 Costs for Metadata Operations
Metadata operations are often a non-negligible cost on a single file system; Table 5 shows this for the experiments in the previous section. The cost of metadata operations hardly depends on bandwidth limitations, but rather on the access latency to the metadata server of the file system while keeping consistency. Thus the metadata operation cost should become much smaller if staging is used instead of a single file system, since metadata accesses then happen only on local nodes and keeping consistency with other nodes is not required with staging methods.

Table 5: the ratio of metadata access costs

               ratio for whole  ratio for file access
  m2m               0.12              0.21
  similarpages      0.13              0.18
  caseframes        0.05              0.09
  supernovae        0.04              0.10
  montage           0.32              0.32

6. CONCLUSION
In Section 5, the advantages of modifying file systems, staging methods, and scheduling methods are estimated by analysing the dependency graphs generated from logs. Since those workflows have no explicit declaration of the input and output files of each job, the dependencies between files and jobs are derived by analysing the file access logs. Three key insights on performance are drawn from the experiments. First, the total load on the central file server decreases by 40-88% if intermediate files are sent directly from the producer to the consumer instead of going through the central file servers. Second, if locality-aware scheduling methods are applied, their effectiveness highly depends on the shapes of the DAGs (the local read rates range from 0.002 to 0.96); in particular, it largely depends on the number of children per job. Lastly, 9-32% of the whole file access cost can be eliminated when staging is used instead of a file system, because a large part of the file access cost comes from the metadata operation latency between clients and servers.

Our future directions include using the profiled information to statistically infer various properties of jobs, such as the names of input files, the names and sizes of output files, and the execution time of the job, from the command line. If many of them can be estimated successfully, it becomes possible to build a framework in which jobs are described without explicit staging directives and files are transparently shared across nodes, yet executed very efficiently.

Acknowledgment
This work was supported by a Grant-in-Aid for Specially Promoted Research (MEXT, Japan) and the MEXT Grant-in-Aid for Scientific Research on Priority Areas project "New IT Infrastructure for the Information-explosion Era".

7. REFERENCES
[1] D. Thain, T. Tannenbaum, and M. Livny, "Distributed computing in practice: the Condor experience," Concurrency - Practice and Experience, vol. 17, no. 2-4, pp. 323-356, 2005.
[2] E. Deelman, G. Singh, M.-H. Su, J. Blythe, Y. Gil, C. Kesselman, G. Mehta, K. Vahi, G. B. Berriman, J. Good, A. Laity, J. C. Jacob, and D. S. Katz, "Pegasus: a framework for mapping complex scientific workflows onto distributed systems," Scientific Programming Journal, vol. 13, no. 3, pp. 219-237, 2005.
[3] T. Oinn, M. Addis, J. Ferris, D. Marvin, M. Senger, M. Greenwood, T. Carver, K. Glover, M. R. Pocock, A. Wipat, and P. Li, "Taverna: a tool for the composition and enactment of bioinformatics workflows," Bioinformatics, vol. 20, no. 17, pp. 3045-3054, 2004.
[4] J. Dean and S. Ghemawat, "MapReduce: simplified data processing on large clusters," in Proceedings of the 6th USENIX Symposium on Operating Systems Design and Implementation (OSDI), 2004, pp. 137-150.
[5] M. Isard, M. Budiu, Y. Yu, A. Birrell, and D. Fetterly, "Dryad: distributed data-parallel programs from sequential building blocks," in Proceedings of the 2007 EuroSys Conference, 2007.
[6] A. Mandal, K. Kennedy, C. Koelbel, G. Marin, J. Mellor-Crummey, B. Liu, and L. Johnsson, "Scheduling strategies for mapping application workflows onto the grid," in High Performance Distributed Computing, 2005. HPDC-14. Proceedings. 14th IEEE International Symposium on, July 2005, pp. 125-134.
[7] H. Topcuoglu, S. Hariri, and M.-Y. Wu, "Performance-effective and low-complexity task scheduling for heterogeneous computing," Parallel and Distributed Systems, IEEE Transactions on, vol. 13, no. 3, pp. 260-274, Mar 2002.
[8] Z. Yu and W. Shi, "An adaptive rescheduling strategy for grid workflow applications," in Parallel and Distributed Processing Symposium, 2007. IPDPS 2007. IEEE International, March 2007, pp. 1-8.
[9] E. Ilavarasan and P. Thambidurai, "Low complexity performance effective task scheduling algorithm for heterogeneous computing environments," Journal of Computer Science, vol. 3, no. 2, pp. 94-103, 2007.
[10] R. Sakellariou and H. Zhao, "A hybrid heuristic for DAG scheduling on heterogeneous systems," in Parallel and Distributed Processing Symposium, 2004. Proceedings. 18th International, April 2004, pp. 111-.
[11] W. Guo, W. Sun, W. Hu, and Y. Jin, "Resource allocation strategies for data-intensive workflow-based applications in optical grids," in Communication Systems, 2006. ICCS 2006. 10th IEEE Singapore International Conference on, Oct. 2006, pp. 1-5.
[12] J. Blythe, S. Jain, E. Deelman, Y. Gil, K. Vahi, A. Mandal, and K. Kennedy, "Task scheduling strategies for workflow-based applications in grids," Cluster Computing and the Grid, IEEE International Symposium on, vol. 2, pp. 759-767, 2005.
[13] H. Abbasi, M. Wolf, G. Eisenhauer, S. Klasky, K. Schwan, and F. Zheng, "DataStager: scalable data staging services for petascale applications," in HPDC '09: Proceedings of the 18th ACM International Symposium on High Performance Distributed Computing. New York, NY, USA: ACM, 2009, pp. 39-48.
[14] T. Kosar and M. Livny, "Stork: making data placement a first class citizen in the grid," in Distributed Computing Systems, 2004. Proceedings. 24th International Conference on, 2004, pp. 342-349.
[15] I. Raicu, Y. Zhao, C. Dumitrescu, I. Foster, and M. Wilde, "Falkon: a fast and light-weight task execution framework," in SC '07: Proceedings of the 2007 ACM/IEEE Conference on Supercomputing. New York, NY, USA: ACM, 2007, pp. 1-12.
[16] G. Khanna, N. Vydyanathan, U. Catalyurek, T. Kurc, S. Krishnamoorthy, P. Sadayappan, and J. Saltz, "Task scheduling and file replication for data-intensive jobs with batch-shared I/O," in High Performance Distributed Computing, 2006 15th IEEE International Symposium on, 2006, pp. 241-252.
[17] I. Raicu, I. T. Foster, Y. Zhao, P. Little, C. M. Moretti, A. Chaudhary, and D. Thain, "The quest for scalable support of data-intensive workloads in distributed systems," in HPDC '09: Proceedings of the 18th ACM International Symposium on High Performance Distributed Computing. New York, NY, USA: ACM, 2009, pp. 207-216.
[18] K. Ranganathan and I. Foster, "Decoupling computation and data scheduling in distributed data-intensive applications," in High Performance Distributed Computing, 2002. HPDC-11 2002. Proceedings. 11th IEEE International Symposium on, 2002, pp. 352-358.
[19] T. Kosar, "A new paradigm in data intensive computing: Stork and the data-aware schedulers," in Challenges of Large Applications in Distributed Environments, 2006 IEEE, 2006, pp. 5-12.
[20] F. Dong and S. G. Akl, "Scheduling algorithms for grid computing: state of the art and open problems," School of Computing, Queen's University, Tech. Rep., January 2006.
[21] S. Bharathi, A. Chervenak, E. Deelman, G. Mehta, M.-H. Su, and K. Vahi, "Characterization of scientific workflows," in Workflows in Support of Large-Scale Science, 2008. WORKS 2008. Third Workshop on, Nov. 2008, pp. 1-10.
[22] B. Pawlowski, C. Juszczak, P. Staubach, C. Smith, D. Lebel, and D. Hitz, "NFS version 3 - design and implementation," in Proceedings of the Summer USENIX Conference, 1994, pp. 137-152.
[23] F. Schmuck and R. Haskin, "GPFS: a shared-disk file system for large computing clusters," in Proceedings of the 2002 Conference on File and Storage Technologies (FAST), 2002, pp. 231-244.
[24] S. A. Weil, S. A. Brandt, E. L. Miller, D. D. E. Long, and C. Maltzahn, "Ceph: a scalable, high-performance distributed file system," in Proceedings of the 7th USENIX Symposium on Operating Systems Design and Implementation (OSDI), 2006, pp. 307-320.
[25] O. Tatebe, Y. Morita, S. Matsuoka, N. Soda, and S. Sekiguchi, "Grid Datafarm architecture for petascale data intensive computing," in Cluster Computing and the Grid, 2002. 2nd IEEE/ACM International Symposium on, May 2002, pp. 102-102.
[26] N. Dun, K. Taura, and A. Yonezawa, "GMount: an ad hoc and locality-aware distributed file system by using SSH and FUSE," in 9th IEEE/ACM International Symposium on Cluster Computing and the Grid, 2009, pp. 188-195.
[27] "Montage: an astronomical image mosaic engine," http://montage.ipac.caltech.edu/.
[28] N. Dun, K. Taura, and A. Yonezawa, "ParaTrac: a fine-grained profiler for data-intensive workflows," in 19th IEEE Symposium on High Performance Distributed Computing (to appear), 2010.
[29] T. Uno, "A new approach for speeding up enumeration algorithms," in Proceedings of ISAAC '98 (Lecture Notes in Computer Science 1533, Springer-Verlag, Algorithms and Computation), 1998, pp. 287-296.
[30] D. Kawahara and S. Kurohashi, "Japanese case structure analysis by unsupervised construction of a case frame dictionary," in International Conference on Computational Linguistics (COLING 2000), 2000, pp. 432-438.
[31] R. Sasano, D. Kawahara, and S. Kurohashi, "The effect of corpus size on case frame acquisition for discourse analysis," in the North American Chapter of the Association for Computational Linguistics - Human Language Technologies (NAACL-HLT 2009), 2009, pp. 521-529.