File-Access Patterns of Data-Intensive Workflow Applications and their Implications to Distributed Filesystems

Takeshi Shibata, SungJun Choi, and Kenjiro Taura

University of Tokyo
Department of Information and Communication Engineering
Graduate School of Information Science and Technology

shibata@logos.ic.i.u-tokyo.ac.jp, demyan@logos.ic.i.u-tokyo.ac.jp, tau@logos.ic.i.u-tokyo.ac.jp
ABSTRACT
This paper studies five real-world data-intensive workflow applications in the fields of natural language processing, astronomy image analysis, and web data analysis. Data-intensive workflows are increasingly becoming important applications for cluster and Grid environments. They pose new challenges to various components of workflow execution environments, including job dispatchers, schedulers, file systems, and file staging tools. The keys to achieving high performance are efficient data sharing among executing hosts and locality-aware scheduling that reduces the amount of data transfer. While much work has been done on scheduling workflows, most of it uses synthetic or random workloads. As such, its impact on real workloads is largely unknown. Understanding the characteristics of real-world workflow applications is a necessary step to promote research in this area. To this end, we analyse real-world workflow applications, focusing on their file access patterns, and summarize their implications for schedulers and file system/staging designs.
Keywords
workflow applications, distributed filesystems
1. INTRODUCTION
Workflow facilitates the integration of individually developed executables, making parallel processing readily accessible to domain experts. Thus it has become an important discipline in various fields including natural science, engineering, and information processing. Many systems for executing workflows have been developed [1, 2, 3]. More recently, programming paradigms and systems specifically designed for large-scale data processing, such as MapReduce [4], Hadoop (http://hadoop.apache.org/), and Dryad [5] (http://research.microsoft.com/en-us/projects/Dryad/), have made it popular to utilize parallel processing without the involved effort of parallel programming. An obvious common goal of these systems is efficient execution of workflows. To this end, there have been efforts on various components of workflow engines, including scheduling algorithms [6, 7, 8, 9, 10, 11, 12], data transfers [13, 14], and fast dispatchers [15]. There are also efforts focusing on scheduling with data transfer costs taken into account [16, 17, 18, 19]. A good survey on scheduling algorithms is found in [20].
Despite their importance, practical evaluations of workflow systems have been rare and remain difficult, mainly due to the lack of commonly accessible benchmarks. Even with a real application, translating a result obtained on a particular platform into a generally acceptable observation about workflow systems is difficult, because the performance of a workflow depends on many components of the environment, such as nodes, networks, file systems, and so on. This is particularly so because workflows typically consist of many sequential executables, each of which may have unknown sensitivities to its environment. Most existing studies on scheduling algorithms have therefore been based on simulation with synthetic workloads such as randomly generated task graphs.
Bharathi et al. [21] is one of the few studies that systematically characterize several real workflow applications. The present paper shares the same spirit as theirs, but pays special attention to the IO (file access) behaviors of the applications.
A single workflow generally consists of a set of tasks, each of which may communicate with (i.e., depend on or be depended upon by) another task in the workflow. Since a task is typically a sequential (single-node) application, data transfers among tasks are generally handled by the workflow system. Data may be implicitly transferred via a shared file system or explicitly moved by a staging subsystem. In either case, data generally go through secondary storage to ensure a certain degree of fault tolerance, namely that a workflow can restart from the point where a job failed. As such, understanding the IO behaviors of real workflow applications is very important for implementing efficient workflow systems. It is particularly so for data-intensive workflows, which are becoming major targets of workflow systems.
To this end, this paper shows the results of our study on five real workflow applications in the fields of natural language processing, web data analysis, and astronomy image analysis. With this study, we would like to provide data points by which we can investigate important design questions for implementing workflows and related subsystems such as file systems. Relevant questions include:

• How effective is it to place jobs according to file affinity, and what is the tradeoff between the complexity and benefit of scheduling algorithms?

• What are the advantages and disadvantages of using shared file systems (such as NFS [22], Lustre (http://www.sun.com/software/products/lustre/), GPFS [23] (http://www-03.ibm.com/systems/clusters/software/gpfs/), and Ceph [24]) over using explicit staging?

• How much can we gain from staging systems that transfer files directly between compute hosts rather than indirectly through a central home node?

• Likewise, how much can we gain from user-level file systems such as Gfarm [25] or GMount [26] that can use any compute host as a storage server?
2. RELATED WORK
2.1 Workflow Studies
Bharathi et al. [21] is the closest to the present work. It
studies five real workflow applications. It analyses dependency graphs of each workflow and identifies common communication and dependency patterns such as pipeline, data
distribution, aggregation, and redistribution. It also shows
statistics on job granularities. In addition to dependency
patterns, we also study file access behaviors and discuss their
implications to file system or staging designs.
2.2 Executable Workflows
Montage [27] is one of the few workflow applications that
can be readily downloaded and executed by others. Both
Bharathi et al. [21] and our work include this application
as a target of their studies. We will provide four more executable applications with annotations that can be shared by
researchers in this community.
2.3 Workflow Collections
For a publicly accessible collection of workflows, Bharathi et al. also provide the WorkflowGenerator web site (http://vtcpc.isi.edu/pegasus/index.php/WorkflowGenerator), which generates DAGs whose nodes are tasks annotated with runtimes and the files they use. Files are annotated with their sizes, so dependencies and communication volumes can be reconstructed from the DAGs. We are preparing a web site that provides similar data for the applications studied in this paper. In addition, we will provide more comprehensive traces of file access behaviors, as well as the executables themselves.
myExperiment (http://www.myexperiment.org/) is a repository of workflow descriptions, currently providing more than 750 workflows. Each workflow is given as a description for a particular system (mostly for Taverna [3]) and is not annotated with profile information such as runtimes and file sizes. The number of workflows we are going to provide is much smaller, but we are trying to provide them in a form executable by others.
2.4 Workflow Scheduling Techniques and Distributed File Systems
There have been efforts on various components of workflow engines, including scheduling algorithms [6, 7, 8, 9, 10, 11, 12]. However, most of those methods require the complete form of the DAG of a workflow, as well as the completion time and input/output file sizes of each job, etc. It is difficult to apply scheduling techniques such as HEFT without that information. In particular, the complete form of a DAG requires workflow descriptions to have explicit declarations of the input/output files of each job. Because the workflow descriptions we examined in this paper have no explicit declarations of input/output files, some system that helps dispatched jobs find required files and register output files is needed. Distributed file systems can be used for that purpose.
3. WORKFLOW EXECUTION AND FILE ACCESS LOGGING
3.1 Workflow Engine
The workflows described in this paper are written as Makefiles, and a parallel and distributed make engine called GXP Make (http://www.logos.ic.i.u-tokyo.ac.jp/gxp/) was used to execute them. Make is a convenient tool for describing jobs with dependencies and can naturally be used for describing and executing workflows.
Except for Montage, all workflows were written as Makefiles by their original authors. For Montage, one of its tools generates an abstract workflow in an XML format called DAX, which is used by Pegasus. The output DAX files are converted to Makefiles: every job description given by a <job> tag is converted to a rule of the Makefile; the commands of a rule are extracted from the <job> tag, whereas the prerequisite and target filenames are taken from <uses> tags with the attributes link="input" and link="output", respectively.
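A rough sketch of such a conversion is shown below. The element and attribute names (<job>, <uses>, link, file, name, <argument>) follow the DAX convention as we understand it, and the namespace URI and the "all outputs depend on all inputs" simplification are assumptions for illustration; this is not the converter actually used.

```python
import xml.etree.ElementTree as ET

def dax_to_makefile(dax_path: str) -> str:
    """Turn a Pegasus DAX file into Makefile rules (simplified sketch)."""
    ns = {"d": "http://pegasus.isi.edu/schema/DAX"}   # assumed DAX namespace
    rules = []
    for job in ET.parse(dax_path).getroot().findall("d:job", ns):
        inputs = [u.get("file") for u in job.findall("d:uses", ns)
                  if u.get("link") == "input"]
        outputs = [u.get("file") for u in job.findall("d:uses", ns)
                   if u.get("link") == "output"]
        # Command name plus flattened argument text (child elements ignored here).
        cmd = job.get("name", "") + " " + " ".join(
            a.text or "" for a in job.findall("d:argument", ns))
        # One rule per job: every output file depends on every input file.
        rules.append("%s: %s\n\t%s" % (" ".join(outputs), " ".join(inputs), cmd.strip()))
    return "\n\n".join(rules)
```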
GXP Make runs on top of an existing Make implementation, GNU Make, which supports parallel execution on a single multiprocessor node. GXP Make extends this capability to parallel execution in distributed environments by intercepting GNU Make's child process invocations. Processes forked by the underlying GNU Make enter a queue managed by GXP Make; it then dispatches jobs to available workers that have been acquired by an underlying resource manager such as TORQUE, SGE, SSH, etc. Input and output files of tasks are stored in a single network file system, such as NFS, Lustre, Gfarm, etc.
The choice of this particular workflow system is largely orthogonal to the results presented in this paper. It is not difficult to automatically generate descriptions for other workflow systems such as DAGMan, given the file access logging methods presented in the next section.
3.2 File Access Logging
In data-intensive applications, it is important to examine how much data is transferred over the network, when and where these transfers happen, and where the bottlenecks are. When data are transferred mainly as files through a distributed file system, details of the file accesses are required to extract such information. In order to record file access operations, we developed a logging tool specialized for GXP Make, using Filesystem in Userspace (FUSE). Table 1 shows the data recorded for each access.
Table 1: logged data per file access operation

job id       hostname:processId
operation    read, write, open, create, getattr, mkdir, etc.
data id      filename
date         time of each operation, in nsec from the epoch
time         blocking time during each operation
size         size for each read and write
A file system using our logging tool provides a mount point
under which the target directory is mirrored. When a process accesses a file or a directory under the mount point,
one of our registered callback routines is called by FUSE, in
which the access is logged along with the client process id.
Although another FUSE-based tool, ParaTrac [28], offers similar results to ours, our tool was implemented independently and for a different purpose. While ParaTrac is a general-purpose profiling tool, our tool is specialized for the GXP Make workflow system and has several functions that differ from ParaTrac's. For example, while our tool cannot produce fine-grained analyses such as statistics on when 'read' operations occur within processes, it is able to divide jobs into categories for a given makefile and show per-category statistics, extract the user CPU time of each job and show the critical path of the workflow, produce workflow DAGs as HTML files annotated with information about the workflow execution, and so on.
We need to translate the process id to the identity of a job
(note that a job may consist of multiple processes). GXP
Make gives a unique number to each job through an environment variable (GXP_WORK_IDX). When our logging tool
receives a callback, it records GXP_WORK_IDX of the client
process if it is the first callback from that process.
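A rough sketch of how the callback logging and the GXP_WORK_IDX lookup described above can be combined is shown below. It uses the fusepy bindings; only two callbacks are shown, and the log record format and environment lookup details are our assumptions rather than the actual implementation of the tool.

```python
import os, time
from fuse import FUSE, Operations, fuse_get_context  # fusepy bindings

class LoggingFS(Operations):
    """Mirror `root` under the mount point and append one record per access (sketch)."""

    def __init__(self, root, log):
        self.root, self.log = root, log
        self.jobs = {}   # pid -> GXP_WORK_IDX, recorded at the first callback

    def _log(self, op, path, size, t0):
        _uid, _gid, pid = fuse_get_context()      # client process of this callback
        if pid not in self.jobs:
            # First callback from this process: read GXP_WORK_IDX from its environment.
            try:
                env = open("/proc/%d/environ" % pid, "rb").read().split(b"\0")
                idx = next((e.split(b"=", 1)[1].decode() for e in env
                            if e.startswith(b"GXP_WORK_IDX=")), "?")
            except OSError:
                idx = "?"
            self.jobs[pid] = idx
        self.log.write("%s:%d %s %s %s %.0f %.0f %d\n" % (
            os.uname()[1], pid, self.jobs[pid], op, path,
            t0 * 1e9, (time.time() - t0) * 1e9, size))

    def getattr(self, path, fh=None):
        t0 = time.time()
        st = os.lstat(self.root + path)
        self._log("getattr", path, 0, t0)
        return dict((k, getattr(st, k)) for k in (
            "st_mode", "st_nlink", "st_size", "st_uid", "st_gid",
            "st_atime", "st_mtime", "st_ctime"))

    def read(self, path, size, offset, fh):
        t0 = time.time()
        with open(self.root + path, "rb") as f:
            f.seek(offset)
            data = f.read(size)
        self._log("read", path, len(data), t0)
        return data

if __name__ == "__main__":
    import sys   # usage: python loggingfs.py <target_dir> <mount_point>
    with open("access.log", "a", buffering=1) as log:
        FUSE(LoggingFS(sys.argv[1], log), sys.argv[2], foreground=True)
```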
When the file system is unmounted after logging all file accesses of a workflow, the access logs from all hosts are collected to form a database. Dependencies among dispatched jobs and accessed files are inferred from these logs as follows. (1) If a job X accessed a file A with any file operation, X is related to A. (2) If X created A, then A depends on X (X → A); otherwise X depends on A (A → X).
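A minimal sketch of rules (1) and (2) is given below; the record fields are our assumption of how the log database might be laid out, not its actual schema.

```python
def infer_edges(records):
    """records: iterable of (job, operation, filename) tuples from the access logs.
    Returns the directed edges of the inferred DAG."""
    created = {(job, f) for job, op, f in records if op == "create"}
    edges = set()
    for job, op, f in records:
        if (job, f) in created:
            edges.add((job, f))   # X created A:           X -> A
        else:
            edges.add((f, job))   # X accessed A otherwise: A -> X
    return edges
```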
After the database is created, the workflow DAG is generated as a clickable HTML file with information about the workflow execution. Statistics on the file-access patterns of each job and each category of jobs are also extracted from the database. A category of jobs is a set of jobs derived from the same line in the makefile, i.e., the same command apart from its input files.
All dispatched command lines, or jobs, are divided into categories by the following steps.

1. Strings called normal commands are extracted from the given makefile by shallow parsing. Each normal command basically corresponds to a single line of the makefile.

2. The distance of each dispatched command line from the normal commands is measured by edit distance. Each dispatched command line is assigned to the category represented by the nearest normal command.

For instance, if the following block is in a makefile:

DIR := somedir
$(DIR)/%.output : $(DIR)/%.input
	cmd $< > $@

then the corresponding normal command is

cmd somedir/%.input > somedir/%.output
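As a rough illustration of this categorization step (a sketch of how such a matcher could look, not the tool's actual implementation), each dispatched command line can be mapped to the nearest normal command by Levenshtein distance:

```python
def edit_distance(a: str, b: str) -> int:
    # Classic dynamic-programming Levenshtein distance.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

def categorize(dispatched, normal_commands):
    # Assign each dispatched command line to the category represented
    # by the nearest normal command extracted from the makefile.
    return {cmd: min(normal_commands, key=lambda n: edit_distance(cmd, n))
            for cmd in dispatched}
```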
4. WORKFLOW APPLICATIONS

In this section, details and profiling results for the five workflow applications (https://www.intrigger.jp/wiki/index.php/Applications) are given. They are:

• MEDLINE to MEDIE Indexing

• Case Frames Construction

• Similar Pages Extraction

• Supernovae Explosion Detection

• Montage
The first two are applications in natural language processing; Similar Pages Extraction is a data mining application applied to a large collection of web pages; the last two are image analysis applications in astronomy.

The experimental environment is a single cluster composed of 15 nodes (120 CPU cores) with a single NFS server. The nodes are connected by Gigabit Ethernet with bonding (2 Gbps/node). We use GXP Make as the workflow engine. Details of the file-access patterns are logged and analysed by the means described in the previous section.

The following subsections describe each application in more detail.
4.1 MEDLINE to MEDIE
The MEDLINE to MEDIE Indexing (hereafter M2M) workflow creates the indexing database for a running search engine called MEDIE (http://www-tsujii.is.s.u-tokyo.ac.jp/medie/). It indexes all registered abstracts in MEDLINE, the world's largest biomedical text database. It indexes and retrieves sentences together with their semantic subjects, verbs, and objects, enabling queries such as "what does p53 activate?" To this end, its indexing involves deep natural language processing including parsing, requiring a large amount of CPU time and intermediate data storage. M2M takes compressed XML text files as its input, each containing a number of biomedical papers from MEDLINE. Most tasks in M2M input and output compressed XML files.
Figure 1 shows a sub-workflow of M2M for a single input
file. A circle is a single task, and a square represents a file
created or read by tasks. An arrow from a task to a file
indicates that the task created the file, and one from a file
to a task that the task opened the file for reading. The most
time consuming part is parsing done by a parser called Enju.
This is parallelized within each input file by splitting the
intermediate file right before the parsing into small chunks.
This can be seen as a number of horizontally aligned boxes
around the middle of the graph.
Figure 1: MEDLINE to MEDIE

Figure 2: MEDLINE to MEDIE: time detail of each job (a) and of the whole workflow (b)
Each input file results in 68 independent jobs, including natural language processing, database creation, and management and conversion of XML files. The resource-consuming tasks are listed below in decreasing order of execution time.

1. Enju: an HPSG parser, which constructs predicate-argument structures.

2. Akane: protein-protein interaction inference from sentences.

3. Genia-tagger: part-of-speech tagging and shallow parsing for biomedical text.

4. Medie-ner-disamb: named entity recognition for biomedical terms, such as protein names, and removal of ambiguity from them.

5. Medie-ner-filter: a preprocessing step for Medie-ner-disamb.

6. MakeDB: creation of the index database for MEDIE.

7. Som-merge: merging of intermediate XML files.
Figure 2(a) shows the timing details of the fifteen most time-consuming tasks for 99 input files (491 MB). In the experiment, Enju is executed 3,584 times and each of the other tasks is executed 99 times (once for each input file). While the most time-consuming task, Enju, is clearly CPU-intensive, the second one, Akane, is very IO-intensive. Moreover, it spends a significant amount of time on metadata accesses. Overall, 46% of the whole time is consumed by file IO and metadata access (Figure 2(b)). Details of the metadata accesses are shown in Table 2.

Table 2: detail of file metadata access for M2M

                        getattr      readdir   open         create    mkdir    unlink
count                   1,749,482    99        3,418,394    21,642    496      4,871
aggregate time [sec.]   84,266       4         128,671      1,199     33       1,069
average time [msec.]    48           40        38           55        67       219
time ratio              0.39         0         0.59         0.01      0        0.01
4.2 Similar Pages Extraction
Similar Pages is a workflow that takes a large number of texts collected from the web as input and analyses them. It outputs all pairs of 'similar' sentences, i.e., those whose Hamming distance is smaller than a specified threshold. The key algorithm, called Sachica, is able to enumerate all such pairs rapidly [29].
The dependency DAG of the Similar Pages Extraction workflow is shown in Figure 3. It has three kinds of tasks:
• Mkbins: gathers all input text files and splits them into a given number (N) of files.
• Sachica1: finds pairs of similar subsequences within a single file.
• Sachica2: finds pairs of similar subsequences between two files, one from each.
Figure 3: Similar Pages Extraction workflow

Figure 4: Similar Pages: time detail of each job (a) and of the whole workflow (b)
A characteristic of the dependency DAG of Similar Pages is that Sachica1 and Sachica2 have N and N(N − 1)/2 jobs respectively, where N is the number of files Mkbins outputs. Each split file is read N times, by Sachica1 and Sachica2. Figure 4 shows an experimental result for about 1 million input files (5 GB in total) with N = 30 (Mkbins: 1 job, Sachica1: 30 jobs, Sachica2: 435 jobs). Mkbins is very file-access intensive, and its cost is dominated by the IO operation 'read' and the metadata operations 'getattr' and 'open'. Overall, file-access cost occupies half of the execution time, and 'read' is the main contributor to the file-access costs.
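As a concrete illustration of these job counts (a hypothetical generator, not the authors' script):

```python
def sachica_jobs(n):
    # One Sachica1 job per split file, one Sachica2 job per unordered pair of files.
    jobs = [("sachica1", i) for i in range(n)]
    jobs += [("sachica2", i, j) for i in range(n) for j in range(i + 1, n)]
    return jobs   # len(jobs) == n + n * (n - 1) // 2, e.g. 30 + 435 for n == 30
```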
4.3 Case Frames Construction
Case Frames Construction is a workflow to create data structures in natural language processing called case frames [30,
31]. Case frames, which are frequent patterns of predictargument structures, are generated by analysing large mount
of text collected from the web. The workflow is mainly constructed from the following jobs in the increasing order with
consuming time(Figure 7(a)):
1. parse-comp: a parser for Japanese texts using KNP [30], including named entity recognition.

2. make-caseframe: a job that creates case frames by extracting reliable predicate-argument examples from the parsing result.

3. cf-distribute-rest: a job that adds all case frames to the clusters made by cf-clustering.

4. cf-clustering: clustering of highly frequent case frames and merging of similar ones.

5. calc-of-similarity: computing the similarity of each pair of case frames.
The DAG of the workflow for 4 input files is shown in Figure 5. As the figure shows, all files output by the upstream jobs are gathered into one large file, which is then split into files of a different size, and finally these are combined into a result file. The parallelism of the workflow therefore has two peaks. Figure 6 shows the number of concurrent jobs with respect to the elapsed time (in seconds) of an execution.

In the experiment, 59% of the whole time is spent on file access, while the ratio of time for each job has a large variance (Figure 7).
4.4 Finding Supernovae
Finding Supernovae is a workflow that finds, from astronomical data, the coordinates where supernovae appear to exist. The workflow uses the same software and data as the Data Analysis Challenge of IEEE Cluster/Grid 2008 (http://www.cluster2008.org/challenge/). Each pair of astronomical images, taken of the same position at different times, is compared, and the positions of supernova candidates are listed. Finally, the candidate positions are gathered into one file. Figure 8 shows the DAG of Finding Supernovae for 2 pairs of input images. In this workflow, the heaviest job (imbus3vp3) is relatively compute-intensive while the other jobs are not.
Figure 5: Case Frames Construction

Figure 6: Parallelism of an execution for Case Frames Construction

Figure 7: Details for Case Frames Construction: (a) time detail of jobs; (b) time detail for whole workflow

Figure 8: Finding Supernovae
4.5 Montage
We executed the Montage DAX, took file access logs, and generated the dependency graph (Figure 10). Each input file is an astronomical image, and the workflow combines them into one picture. In the experiment we used 3156 2MASS (J) images as input, each of which is about 2 MB.
As Figure 11(b) shows, almost all of the aggregate time is consumed by file access. Concentrating all files on a single file server becomes the bottleneck and degrades the performance of the workflow.
The percentage of CPU time in the aggregate time becomes smaller as the input size grows (Figure 11(c)), which means that concentrating data access degrades the throughput of the file server.
5. COMPARISON OF DESIGNS OF WORKFLOW SOLVERS

In this section, we investigate the costs of various designs of workflow solvers, such as staging, centralized file systems, and scheduling, based on the logged data for the workflows in the previous section.
Figure 9: Details for Finding Supernovae: (a) time detail of jobs; (b) time detail for whole workflow

Figure 10: Montage workflow

Figure 11: Montage: time detail of each job and of the whole workflow: (a) time detail of each job; (b) time detail for whole workflow; (c) ratio of time for each operation w.r.t. the number of input files
5.1 Centralization Costs
Staging is a process that transfers required files to nodes in advance. Many staging methods send all updated and created files to central servers when jobs finish. Since intermediate files are not sent directly from the node that produced them to other nodes in these methods, called centralized staging, the bandwidth and latency of access to the central servers frequently become a bottleneck, especially for data-intensive workflows.

Many centralized file systems, such as NFS and Lustre, have the same problem, whereas distributed file systems, such as GMount and Gfarm, can be more effective since they locate files on the local disks of nodes and send them only when necessary.
While the input and output files of a workflow usually have to be located on a file system, intermediate files, which are both read and written by jobs of the workflow, do not. We assume that files which are only read during the workflow execution are the input files and that files which are only written are the output files. Table 3 shows the aggregate input, intermediate, and output file sizes and the intermediate rate, intermediate / (input + intermediate + output), for each workflow in the previous section. It is possible to avoid 40-88% of the file transfer to central servers, which can become a bottleneck for the whole workflow performance on centralized file systems and staging systems, if intermediate files are not sent to the central servers.
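This classification can be computed directly from the access logs. The following sketch (a minimal illustration under our reading of the rule, with hypothetical record fields and file sizes) classifies files and computes the intermediate rate:

```python
def intermediate_rate(records, file_size):
    """records: iterable of (operation, filename) tuples from the access logs.
    file_size: filename -> aggregate size in bytes.
    Files only read are inputs, only written are outputs, both are intermediate."""
    read, written = set(), set()
    for op, path in records:
        if op == "read":
            read.add(path)
        elif op in ("write", "create"):
            written.add(path)

    def total(paths):
        return sum(file_size[p] for p in paths)

    inputs = total(read - written)
    outputs = total(written - read)
    interm = total(read & written)
    return interm / (inputs + interm + outputs)
```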
Table 3: ratio of the aggregate size of intermediate files

               input [GB]   interm. [GB]   output [GB]   interm. rate
m2m            0.51         97.43          13.25         0.88
similarpages   2.49         2.49           0.97          0.42
caseframes     0.73         46.74          58.51         0.44
supernovae     77.91        103.48         78.26         0.40
montage        7.26         72.84          21.44         0.72
Table 4: local read for locality-aware scheduling

               read size [GB]                    local read rate
               input      local      remote
m2m            0.51       69.56      119.35      0.36
similarpages   2.49       0.17       74.61       0.002
caseframes     0.73       46.35      0.99        0.96
supernovae     233.73     37.50      67.54       0.11
montage        9.13       56.93      235.86      0.18
5.2 Locality-aware Scheduling
When a workflow scheduler decides the compute node to which each job is dispatched, it is important to consider the cost of accessing files located on remote nodes. Jobs should preferably be launched on a compute node where as many of their input files as possible are located on the local disk or in memory.

In many scheduling methods such as HEFT [7], the cost of accessing files on remote nodes is considered when jobs are assigned to compute nodes, as well as the predicted computation time of jobs, the length of the critical path, etc. Let us define that a job Y depends on a job X if some output file of X is an input file of Y. If a job X has some jobs that depend only on X, it is very likely that at least one of the jobs which depend on X is assigned to the same node as X under any reasonable assignment criterion. In order to estimate how much data can be read locally under locality-aware scheduling methods, for each job X, a job Y which depends on X is randomly selected among the jobs depending on X, and X and Y are assigned to the same node. Table 4 shows the result of this assignment for each workflow. While Case Frames Construction has a high local read rate, local / (local + input + remote), Similar Pages Extraction does not. The difference mainly arises from the difference in the average number of branches in the DAGs of the workflows. If a workflow has a small average number of branches per job in the DAG, locality awareness becomes more important. As Figures 3 and 5 show, most jobs of Case Frames Construction have one branch while those of Similar Pages Extraction have 30 branches.
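A minimal sketch of this estimation is shown below; the data structures and the function name are ours, not from the paper's tooling, and the DAG is assumed to be given as job-to-file mappings with per-file sizes.

```python
import random

def local_read_rate(outputs, inputs, size):
    """outputs/inputs: job -> set of file names written/read; size: file -> bytes.
    For each job X, co-locate one randomly chosen dependent job with X, then
    measure how many bytes are read locally, from workflow inputs, or remotely."""
    producer = {f: x for x, files in outputs.items() for f in files}
    dependents = {}
    for y, files in inputs.items():
        for f in files:
            if f in producer and producer[f] != y:
                dependents.setdefault(producer[f], []).append(y)
    colocated = {x: random.choice(ys) for x, ys in dependents.items()}

    local = remote = external = 0
    for y, files in inputs.items():
        for f in files:
            x = producer.get(f)
            if x is None:
                external += size[f]      # workflow input file
            elif colocated.get(x) == y:
                local += size[f]         # read from the local disk of X's node
            else:
                remote += size[f]        # read over the network
    return local / (local + external + remote)
```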
5.3 Costs for Metadata Operations

Metadata operations often incur non-negligible costs on a single shared file system. Table 5 shows this for the experiments in the previous section. The cost of metadata operations hardly depends on bandwidth limitations; rather, it depends on the access latency to the metadata server of the file system, which must maintain consistency. Thus the metadata operation cost could be reduced considerably if staging were used instead of a single file system, since metadata accesses would then be local to each node and consistency with other nodes would not have to be maintained.

Table 5: the ratio of metadata access costs

               ratio for whole   ratio for file-access
m2m            0.12              0.21
similarpages   0.13              0.18
caseframes     0.05              0.09
supernovae     0.04              0.10
montage        0.32              0.32

6. CONCLUSION
In Section 5, the advantages of modifying file systems, staging methods, and scheduling methods were estimated by analysing the dependency graphs generated from logs. Since those workflows have no explicit declaration of input and output files for each job, the dependencies between files and jobs were derived by analysing file access logs. Three key insights for performance are drawn from the experiments. First, the total load on the central file server decreases by 40-88% if intermediate files are sent directly from the producer to the consumer instead of going through the central file servers. Second, if locality-aware scheduling methods are applied, their effectiveness highly depends on the shape of the DAG (the local read rates range from 0.002 to 0.96); in particular, it largely depends on the number of children per job. Lastly, 9-32% of the whole file access cost can be eliminated when staging is used instead of a shared file system, because a large part of the file access cost comes from the metadata operation latency between clients and servers.
Our future directions include using profiled information to statistically infer various properties of jobs, such as the names of input files, the names and sizes of output files, and the execution time of a job, from its command line. If many of them can be successfully estimated, it is possible to build a framework in which jobs can be described without explicit staging directives and files are transparently shared across nodes, yet workflows are executed very efficiently.
Acknowledgment
This work was supported by a Grant-in-Aid for Specially Promoted Research (MEXT, Japan) and the MEXT Grant-in-Aid for Scientific Research on Priority Areas project "New IT Infrastructure for the Information-explosion Era".
7. REFERENCES
[1] D. Thain, T. Tannenbaum, and M. Livny,
“Distributed computing in practice: the condor
experience.” Concurrency - Practice and Experience,
vol. 17, no. 2-4, pp. 323–356, 2005.
[2] E. Deelman, G. Singh, M.-H. Su, J. Blythe, Y. Gil,
C. Kesselman, G. Mehta, K. Vahi, G. B. Berriman,
J. Good, A. Laity, J. C. Jacob, and D. S. Katz.,
“Pegasus: a framework for mapping complex scientific
workflows onto distributed systems,” Scientific
Programming Journal, vol. 13, no. 3, pp. 219–237,
2005.
[3] T. Oinn, M. Addis, J. Ferris, D. Marvin, M. Senger,
M. Greenwood, T. Carver, K. Glover, M. R. Pocock,
A. Wipat, and P. Li, “Taverna: a tool for the
composition and enactment of bioinformatics
workflows,” Bioinformatics, vol. 20, no. 17, pp.
3045–3054, 2004.
[4] J. Dean and S. Ghemawat, “Mapreduce: Simplified
data processing on large clusters,” in Proceedings of the
6th USENIX Symposium on Operating Systems Design
and Implementation (OSDI), 2004, pp. 137–150.
[5] M. Isard, M. Budiu, Y. Yu, A. Birrell, and D. Fetterly,
“Dryad: Distributed data-parallel programs from
sequential building blocks,” in Proceedings of the 2007
Eurosys Conference, 2007.
[6] A. Mandal, K. Kennedy, C. Koelbel, G. Marin,
J. Mellor-Crummey, B. Liu, and L. Johnsson,
“Scheduling strategies for mapping application
workflows onto the grid,” in High Performance
Distributed Computing, 2005. HPDC-14. Proceedings.
14th IEEE International Symposium on, July 2005,
pp. 125–134.
[7] H. Topcuoglu, S. Hariri, and M.-Y. Wu,
“Performance-effective and low-complexity task
scheduling for heterogeneous computing,” Parallel and
Distributed Systems, IEEE Transactions on, vol. 13,
no. 3, pp. 260–274, Mar 2002.
[8] Z. Yu and W. Shi, “An adaptive rescheduling strategy
for grid workflow applications,” in Parallel and
Distributed Processing Symposium, 2007. IPDPS
2007. IEEE International, March 2007, pp. 1–8.
[9] E. Ilavarasan and P. Thambidurai, “Low complexity
performance effective task scheduling algorithm for
heterogeneous computing environments,” Journal of
Computer Science, vol. 3, no. 2, pp. 94–103, 2007.
[10] R. Sakellariou and H. Zhao, “A hybrid heuristic for
dag scheduling on heterogeneous systems,” in Parallel
and Distributed Processing Symposium, 2004.
Proceedings. 18th International, April 2004, pp. 111–.
[11] W. Guo, W. Sun, W. Hu, and Y. Jin, “Resource
allocation strategies for data-intensive workflow-based
applications in optical grids,” in Communication
systems, 2006. ICCS 2006. 10th IEEE Singapore
International Conference on, Oct. 2006, pp. 1–5.
[12] J. Blythe, S. Jain, E. Deelman, Y. Gil, K. Vahi,
A. Mandal, and K. Kennedy, “Task scheduling
strategies for workflow-based applications in grids,”
Cluster Computing and the Grid, IEEE International
Symposium on, vol. 2, pp. 759–767, 2005.
[13] H. Abbasi, M. Wolf, G. Eisenhauer, S. Klasky,
K. Schwan, and F. Zheng, “Datastager: scalable data
staging services for petascale applications,” in HPDC
’09: Proceedings of the 18th ACM international
symposium on High performance distributed
computing. New York, NY, USA: ACM, 2009, pp.
39–48.
[14] T. Kosar and M. Livny, “Stork: making data
placement a first class citizen in the grid,” in
Distributed Computing Systems, 2004. Proceedings.
24th International Conference on, 2004, pp. 342–349.
[15] I. Raicu, Y. Zhao, C. Dumitrescu, I. Foster, and
M. Wilde, “Falkon: a fast and light-weight task
execution framework,” in SC ’07: Proceedings of the
2007 ACM/IEEE conference on Supercomputing.
New York, NY, USA: ACM, 2007, pp. 1–12.
[16] G. Khanna, N. Vydyanathan, U. Catalyurek, T. Kurc,
S. Krishnamoorthy, P. Sadayappan, and J. Saltz,
“Task scheduling and file replication for data-intensive
jobs with batch-shared i/o,” in High Performance
Distributed Computing, 2006 15th IEEE International
Symposium on, 0-0 2006, pp. 241–252.
[17] I. Raicu, I. T. Foster, Y. Zhao, P. Little, C. M.
Moretti, A. Chaudhary, and D. Thain, “The quest for
scalable support of data-intensive workloads in
distributed systems,” in HPDC ’09: Proceedings of the
18th ACM international symposium on High
performance distributed computing. New York, NY,
USA: ACM, 2009, pp. 207–216.
[18] K. Ranganathan and I. Foster, “Decoupling
computation and data scheduling in distributed
data-intensive applications,” in High Performance
Distributed Computing, 2002. HPDC-11 2002.
Proceedings. 11th IEEE International Symposium on,
2002, pp. 352–358.
[19] T. Kosar, “A new paradigm in data intensive
computing: Stork and the data-aware schedulers,” in
Challenges of Large Applications in Distributed
Environments, 2006 IEEE, 0-0 2006, pp. 5–12.
[20] F. Dong and S. G. Akl, “Scheduling algorithms for
grid computing: State of the art and open problems,”
School of Computing, Queen’s University, Tech. Rep.,
January 2006.
[21] S. Bharathi, A. Chervenak, E. Deelman, G. Mehta,
M.-H. Su, and K. Vahi, “Characterization of scientific
workflows,” in Workflows in Support of Large-Scale
Science, 2008. WORKS 2008. Third Workshop on,
Nov. 2008, pp. 1–10.
[22] B. Pawlowski, C. Juszczak, P. Staubach, C. Smith,
D. Lebel, and D. Hitz, “Nfs version 3 - design and
implementation,” in In Proceedings of the Summer
USENIX Conference, 1994, pp. 137–152.
[23] F. Schmuck and R. Haskin, “Gpfs: A shared-disk file
system for large computing clusters,” in In Proceedings
of the 2002 Conference on File and Storage
Technologies (FAST), 2002, pp. 231–244.
[24] S. A. Weil, S. A. Brandt, E. L. Miller, D. D. E. Long,
and C. Maltzahn, “Ceph: A scalable, high-performance
distributed file system,” in Proceedings of the 7th
USENIX Symposium on Operating Systems Design
and Implementation (OSDI), 2006, pp. 307–320.
[25] O. Tatebe, Y. Morita, S. Matsuoka, N. Soda, and
S. Sekiguchi, “Grid datafarm architecture for petascale
data intensive computing,” in Cluster Computing and
the Grid, 2002. 2nd IEEE/ACM International
Symposium on, May 2002, pp. 102–102.
[26] N. Dun, K. Taura, and A. Yonezawa, “Gmount: An ad
hoc and locality-aware distributed file system by using
ssh and fuse,” in 9th IEEE/ACM International
Symposium on Cluster Computing and the Grid, 2009,
pp. 188–195.
[27] “Montage: An astronomical image mosaic engine,”
http://montage.ipac.caltech.edu/.
[28] N. Dun, K. Taura, and A. Yonezawa, “Paratrac: A fine-grained profiler for data-intensive workflows,” to appear in 19th IEEE Symposium on High Performance Distributed Computing, 2010.
[29] T. Uno, “A new approach for speeding up enumeration algorithms,” in Proceedings of ISAAC ’98 (Lecture Notes in Computer Science 1533, Springer-Verlag, Algorithms and Computation), 1998, pp. 287–296.
[30] D. Kawahara and S. Kurohashi, “Japanese
case structure analysis by unsupervised construction
of a case frame dictionary,” in International
Conference on Computational Linguistics
(COLING2000), 2000, pp. 432–438.
[31] R. Sasano, D. Kawahara, and S. Kurohashi, “The
effect of corpus size on case frame acquisition for
discourse analysis,” in the North American Chapter of
the Association for Computational Linguistics Human Language Technologies (NAACL-HLT2009),
2009, pp. 521–529.