Techniques for Distributed Query Planning and Execution

Project No: FP7-318338
Project Acronym: Optique
Project Title: Scalable End-user Access to Big Data
Instrument: Integrated Project
Scheme: Information & Communication Technologies

Deliverable D7.1
Techniques for Distributed Query Planning and Execution: One-Time Queries

Due date of deliverable: (T0+12)
Actual submission date: October 31, 2013
Start date of the project: 1st November 2012
Duration: 48 months
Lead contractor for this deliverable: UoA
Dissemination level: PU – Public
Final version
Executive Summary:
Techniques for Distributed Query Planning and Execution: One-Time Queries
This document summarises deliverable D7.1 of project FP7-318338 (Optique), an Integrated Project supported by the 7th Framework Programme of the EC. Full information on this project, including the contents
of this deliverable, is available online at http://www.optique-project.eu/.
We first introduce the concept of elasticity and the main contributions of this work. We then present the ADP system, on which we build the distributed query execution module of Optique. After that, we present the main research results on the elastic execution of dataflow
graphs in a cloud environment. We explore the trade-offs between query completion time and monetary
cost of resource usage and we provide an efficient methodology to find good solutions. Next, we present the
current status of integration with the Optique platform and the communication with other components of
the system. Finally, we conclude by presenting the work that will be carried out during the second year of
the project.
List of Authors
Herald Kllapi (UoA)
Dimitris Bilidas (UoA)
Yannis Ioannidis (UoA)
Manolis Koubarakis (UoA)
Contents

1 Introduction

2 The ADP System
  2.1 Related Work
  2.2 Overview
  2.3 Language Abstractions
    2.3.1 Dataflow Language
    2.3.2 Dataflow Language Semantics
  2.4 Extensions
  2.5 SQL Processing Engine
  2.6 Conclusions

3 Elastic Dataflow Processing in the Cloud
  3.1 Related Work
  3.2 Preliminaries
    3.2.1 Dataflow
    3.2.2 Schedule Estimation
    3.2.3 Skyline Distance Set
  3.3 Dataflow Language
  3.4 Operator Ranking
  3.5 Scheduling Algorithm
    3.5.1 Dynamic Programming
    3.5.2 Simulated Annealing
    3.5.3 Parallel Wave
    3.5.4 Exhaustive
  3.6 Complexity Analysis
  3.7 Experimental Evaluation
    3.7.1 Model and Algorithms
    3.7.2 Systems
    3.7.3 Conclusions of Experiments
  3.8 Conclusions and Future Work

4 Integration of ADP With the Optique Platform
  4.1 Progress
  4.2 JDBC Interface
  4.3 Queries Provided by the Use Cases

5 Conclusions

Bibliography

Glossary

A Test Queries from the Use Cases
  A.1 NPD Queries
  A.2 Siemens Queries
Chapter 1
Introduction
Work package WP7 of Optique deals with the distributed query execution module and is divided into the following tasks:
• Task 7.1: Query planning and execution techniques for one-time OBDA queries
• Task 7.2: Query planning and execution techniques for continuous and temporal OBDA queries
• Task 7.3: Optimization techniques
• Task 7.4: Implementation, testing and evaluation
This deliverable describes the work done on the first task, which corresponds to the effort put into WP7 during the first year of the project. In the context of an Ontology-Based Data Access (OBDA) system, the efficient execution of the queries produced by the query transformation components relies heavily on the proper and effective use of the resources of a cloud environment. To achieve the objectives of this task, we have therefore focused on this concept: the elastic execution of queries through dynamic resource management.
In Chapter 2, we present ADP, a system for distributed query processing in cloud environments. ADP has been developed at the University of Athens during several European funded research projects. It is designed for the efficient execution of data-intensive flows on the cloud. These dataflows are expressed as relational queries with user defined functions (UDFs). We present the architecture of the system with its main components and the query languages that it supports.
In Chapter 3, we present the research contributions regarding the elastic execution of dataflows. Finding
trade-offs between completion time and monetary cost is essential in a cloud environment. Cloud-enabled data processing platforms should offer the ability to select the best trade-off. The obvious questions are: 1. Does such a trade-off exist? 2. Can it be obtained at an overhead that makes it worthwhile? Our first contribution in this
chapter is to demonstrate that very significant elasticity exists in a number of common tasks, even when the
abstraction for the cloud-computation is modeled at a very high level, such as MapReduce. Moreover, we
show that elasticity can be discovered in practice using highly scalable and efficient algorithms and that there
appear to be certain simple “rules of thumb” for when elasticity is present. It is natural to expect that more
refined models of the cloud-computation would allow further optimizations and extraction of elasticity. At
the same time, it is reasonable to be concerned whether the resulting complexity of the refined model would
allow for these optimizations/extraction to be performed. Our second contribution is to demonstrate that
there exists a very fertile middle-ground in terms of abstraction which enables the extraction of much more
elasticity than what is possible under MapReduce while remaining computationally tractable. The content
of this chapter has been submitted for publication and is currently under review.
In Chapter 4, we describe the current status of integration of ADP into the Optique platform. We present
several examples of using the JDBC driver and executing the queries of the Optique use cases.
Finally, in Chapter 5, we discuss the work plan of WP7 for the second year of the project and conclude.
Chapter 2
The ADP System
ADP is a system designed for the efficient execution of complex dataflows on the Cloud [50]. Requests take
the form of queries in SQL with user defined functions (UDFs). The SQL query is transformed into two
intermediate levels before execution. The query of the first level is again SQL but has additional notations
about the distribution of the tables. We enhance SQL by adding the table partition as a first-class citizen of the language. A table partition is defined as a set of tuples having a particular property that we can exploit in query optimization. An example is a partition in which a hash function applied to one column yields the same value for all of its tuples; this property is used for distributed hash joins.
ADP acts as a middleware placed between the infrastructure and the other components of the system,
simplifying their “view” of the underlying infrastructure. One can benefit from ADP by detaching part of the
application logic and writing it using the ADP abstractions. This allows it to scale transparently as needed.
The development of ADP began in 2004 in the European project Diligent (http://diligent.ercim.eu/), and in particular in the distributed query engine deployed for that project. The query engine of Diligent was subsequently used and extended in the European project Health-e-Child (http://www.health-e-child.org). Finally, after Health-e-Child ended, ADP became an internal project of the MaDgIK group (http://www.madgik.di.uoa.gr/).
2.1 Related Work
The most popular platforms for data processing flows on the cloud are based on MapReduce [10], introduced by Google. On top of MapReduce, Google has built systems like FlumeJava [8], Sawzall [41], and Tenzing [9]. FlumeJava is a library used to write data pipelines that are transformed into MapReduce jobs. Sawzall is a scripting language that can express simple data processing over huge datasets. Tenzing [9] is an analytical query engine that uses a pool of pre-allocated machines to minimize latency.
The main open source implementation of MapReduce is Hadoop [5] by Yahoo!. Hive [48] is a data warehouse solution from Facebook. HiveQL (the query language of Hive) is a subset of SQL and its optimization techniques are limited to simple transformation rules. The optimization goal is to minimize the number of MapReduce jobs while maximizing parallelism, and, as a consequence, to minimize the execution time of the query. HadoopDB [2] is a recent hybrid system that combines MapReduce with databases. It uses multiple single-node databases and relies on Hadoop to schedule the jobs to each database. The optimization goal is to create as much parallelism as possible by assigning sub-queries to the single-node databases. The U.S. startup Hadapt [20] is currently commercializing HadoopDB. Finally, Amazon offers Hadoop Elastic MapReduce as a service to its customers [3].
Several high-level query languages and applications have been developed on top of Hadoop, such as PigLatin [36] and Mahout [4], a platform for large-scale machine learning. The dataflow graphs used in
MapReduce are relatively restricted, and this reduces the opportunities for optimization. All of the above systems have the single optimization goal of executing queries as fast as possible.
The Condor/DAGMan/Stork [33] stack is the state-of-the-art technology in High Performance Computing. However, Condor was designed to harvest CPU cycles on idle machines, and running data-intensive workflows with DAGMan is very inefficient [43]. Many systems, such as Pegasus [11] and GridDB [34], use DAGMan as middleware. Proposals for extending Condor to deal with data-intensive scientific workflows do exist [43], but to the best of our knowledge they have not been materialized yet. A case study of executing the Montage dataflow on the cloud, examining the trade-offs of different dataflow execution modes and provisioning plans for cloud resources, is presented in [14].
Dryad [24] is a commercial middleware by Microsoft that has a more general architecture than MapReduce, since it can parallelize any dataflow. Its schedule optimization, however, relies heavily on hints requiring knowledge of node proximity, which is generally not available in a cloud environment. It also deals with job migration by instantiating another copy of a job rather than moving the job to another machine. This might be acceptable when optimizing solely for time, but not when the financial cost of allocating additional containers matters. DryadLINQ [53] is built on top of Dryad and uses LINQ [49], a set of .NET constructs for manipulating data. LINQ queries are transformed into Dryad graphs and executed in a distributed fashion.
Stubby [21, 32] is a cost-based optimizer for MapReduce dataflows. Our work has some similarities with its modeling, but we target a broader range of dataflow graphs.
Mariposa [47] was one of the first distributed database systems to take into consideration the monetary cost of answering queries. The user provides a budget function and the system optimizes the cost of accessing the individual databases using auctioning.
Dremel [35] is a system for real-time analysis of very large datasets. It is designed for a subclass of SQL queries that return relatively small results. The optimization techniques in this work target a broader class of SQL queries.
2.2 Overview
Figure 2.1 shows the current architecture of ADP. The queries are optimized and transformed into execution
plans that are executed in ART, the ADP Run Time. The resources needed to execute the queries (machines,
network, etc.) are reserved or allocated by ARM, the ADP Resource Mediator. Those resources are wrapped
into containers. Containers are used to abstract from the details of a physical machine in a cluster or a
virtual machine in a cloud. The information about the operators and the state of the system is stored in
the Registry. ADP uses state-of-the-art technology and well-proven solutions inspired by years of research in parallel and distributed databases (e.g., parallelism, partitioning, and various optimizations).
The core query evaluation engine of ADP is built on top of SQLite (this part of ADP is available as an open source library at https://code.google.com/p/madis). The system allows rapid development of specialized data analysis tasks that are directly integrated into the system. Queries are expressed in SQL extended with UDFs. Although UDFs have been supported by database management systems (DBMSs) for a long time, their use is limited due to their complexity and the limitations imposed by the DBMSs. One of the goals of the system is to eliminate the effort of creating and using UDFs by making them first-class citizens in the query language itself. SQLite natively supports UDFs implemented in C. The UDFs are categorized into row, aggregate, and virtual table functions.
2.3 Language Abstractions
The query language of ADP is based on SQL enhanced with two extensions. The first one is the automatic creation of virtual tables, which permits their direct usage within a query without explicitly creating them beforehand. The second extension is an inverted syntax that uses UDFs as statements. We present some examples to illustrate. The following query reads the data from the file 'in.tsv'.
Figure 2.1: The system architecture
create virtual table input_file using file(’in.tsv’);
select * from input_file;
The query can be written as:
select * from file(’in.tsv’);
and the system will automatically translate it into the above syntax. Using the second extension, the inverted syntax that uses UDFs as statements, the query can be further simplified as:
file ’in.tsv’;
The inverted syntax provides a natural way to compose virtual table functions. In the following example, the query that uses the file operator is provided as a parameter to countrows:
select * from countrows("select * from file(’in.tsv’)");
The above syntax is error-prone because it uses nested quote levels. By using inversion, the query is written as:
countrows file ’in.tsv’;
The ordering is from left to right, i.e., x y z is translated to x(y(z)). Notice that this syntax is very close to the natural language sentence "count the rows of file 'in.tsv'".
2.3.1 Dataflow Language
This language is internal to the system, but it can also be used to bypass the optimizer and enforce a specific
dataflow. The best way to describe the syntax and semantics of the language is by presenting some examples.
We use a subset of the TPC-H schema described below:
1. orders(o_orderkey, o_orderstatus, ...)
2. lineitem(l_orderkey, l_partkey, l_quantity, ...)
3. part(p_partkey, p_name, ...)
Assume that tables are horizontally partitioned as follows.
1. orders to 2 parts on hash(o_orderkey)
2. lineitem to 3 parts on hash(l_orderkey)
3. part to 2 parts on hash(p_partkey)
with hash(∗) being a known hash function with good properties (e.g., MD5). The queries have two semantically different parts: distributed and local. The distributed part defines how the input table partitions are combined and how the output will be partitioned. The local part defines the SQL query that will be evaluated against all the combinations of the input (possibly in parallel). Consider the following example query:

distributed create table lineitem_large as
select * from lineitem where l_quantity > 20;

In this example, the system will
run the SQL query on every partition of table lineitem. Consequently, table lineitem_large will be created
and partitioned into the same number of partitions as lineitem. The output table can be partitioned based
on specific columns. All the records with the same values on the specified columns will be on the same
partition. For example, consider the following query.
distributed create table lineitem_large on l_orderkey as
select * from lineitem where l_quantity > 20;
The output of each query is partitioned on column l_orderkey. Notice that all records with the same l_orderkey value must be unioned in order to produce the partitions of table lineitem_large; the latter is therefore created only after the queries have executed. The user can specify the number of partitions (e.g., 10) by writing the query as follows:
distributed create table lineitem_large to 10 on l_orderkey as
select * from lineitem where l_quantity > 20;
For the time being, if the degree of parallelism is not specified, it is set to a predefined value. An interesting optimization problem is to find the optimal degree of parallelism to use; we are currently working on this problem. We want to stress that even when this feature becomes available, the current functionality will be kept as an option because it is extremely useful in practice.
If more than one input table is used, the table partitions are combined and the query is executed on each combination. The combination is either a direct or a cartesian product, with the latter being the default behavior. An example is the following query.
distributed create table lineitem_part as
select * from lineitem, part where l_partkey = p_partkey;
The system evaluates the query by combining all the partitions of lineitem with all the partitions of part.
As a result, table lineitem_part will have 6 partitions (3 x 2). If tables lineitem and part have the same
number of partitions, the combination can be a direct product. This is shown in the following query.
distributed create table lineitem_part as direct
select * from lineitem, part where l_partkey = p_partkey;
Notice that, in order for the query to be a correct join, tables lineitem and part must be partitioned on columns l_partkey and p_partkey respectively. The local part of the query can be as complex as needed, using the full expressivity of SQL enhanced with UDFs. Queries can be combined in order to express complex dataflows. For example, a distributed hash join can be expressed as follows:
distributed create temporary table lineitem_p on l_partkey as select * from lineitem;
distributed create temporary table part_p on p_partkey as select * from part;
distributed create table lineitem_part as direct
select * from lineitem_p, part_p where l_partkey = p_partkey;
Tables lineitem_p and part_p are temporary and are destroyed after execution. Notice that in this example, the system must choose the same parallelism for tables lineitem_p and part_p in order to combine them as a direct product. A MapReduce flow can be expressed as follows:
distributed create temporary table map on key as
select keyFunc(c1, c2, ...) as key, valueFunc(c1, c2, ...) as value from input;
distributed create table reduce as
select reduceFunc(value) from map group by key;
with keyFunc(∗) being a row function that returns the key of the row and valueFunc(∗) being a row function that produces the value. In the second query, reduceFunc(∗) is an aggregate function that is applied on each group.
The system also supports indexes. An index is not global; instead, it is created on each partition of the table on the specified columns. An example is shown below:
distributed create index l_index on lineitem(l_partkey);
Another useful feature of the language is the ability to specify the partitions against which the query will be
evaluated. For example, if only the first partition of table lineitem has all the records with l_quantity more
than 20, we can write the following query:
distributed on [0] create table lineitem_large as
select * from lineitem where l_quantity > 20;
2.3.2 Dataflow Language Semantics
A dTable consists of its schema, the list of its partitions, and the partition columns, i.e., dTable(schema, parts, pcols) with |parts| ≥ 1 and |pcols| ≥ 0. A partition of a dTable is a relational table with the same schema. The partitions of the tables are horizontal. A partitioning is a transformation of the dTable into a new dTable with the specified number of partitions on the specified columns, i.e., output = p(input, parts, columns), with parts ≥ 1 and |columns| ≥ 1. A dQuery is a transformation of its input dTables to a single output dTable, i.e., output = query(SQL + UDFs, inputs, combine) with |inputs| ≥ 1 and combine ∈ {cartesian product, direct product}. The following holds for the cartesian product:

• |output.parts| = ∏_{i=0}^{|inputs|−1} |inputs[i].parts|
• output.parts[i] = query(input[j].parts[k]), ∀i, j, k

The following holds for the direct product:

• |inputs[i].parts| = S, ∀i, with S > 0
• |output.parts| = S
• output.parts[i] = query(input[j].parts[i]), ∀i, j

A p(dQuery) is a dQuery with partitioning, i.e., output = p(dQuery(def, inputs, combine), parts, columns). A dQueryScript is a set of p(dQuery) connected with tables. By definition, the output of the last p(dQuery) is the result of the script.
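To make the partition-combination semantics concrete, the following is a minimal Python sketch (illustrative only, not ADP code; the function name and the partition labels are made up) that computes the set of input-partition combinations evaluated by a dQuery under the cartesian and direct product modes defined above.

from itertools import product

def combine_partitions(inputs, combine):
    """Return the list of input-partition combinations that a dQuery evaluates.

    inputs  : list of dTables, each represented as a list of partition labels.
    combine : "cartesian" or "direct".
    """
    if combine == "cartesian":
        # |output.parts| is the product of |inputs[i].parts|; every combination
        # of one partition per input is evaluated.
        return [list(combo) for combo in product(*inputs)]
    elif combine == "direct":
        # All inputs must have the same number of partitions S;
        # the i-th partitions of all inputs are combined together.
        sizes = {len(t) for t in inputs}
        if len(sizes) != 1:
            raise ValueError("direct product requires equally partitioned inputs")
        return [[t[i] for t in inputs] for i in range(sizes.pop())]
    raise ValueError("unknown combine mode: " + combine)

# Example: lineitem has 3 partitions, part has 2 partitions.
lineitem = ["lineitem.p0", "lineitem.p1", "lineitem.p2"]
part = ["part.p0", "part.p1"]
print(len(combine_partitions([lineitem, part], "cartesian")))  # 6 output partitions

For the distributed hash join example of Section 2.3.1, the two temporary tables are first repartitioned on the join columns so that the direct branch applies.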
2.4 Extensions
The system makes extensive use of the UDF extension APIs of SQLite. SQLite supports the following UDF categories. Row functions take as input one or more columns from a row and produce one value; an example is the UPPER() function. Aggregate functions can be used to capture arbitrary aggregation functionality beyond the one predefined in SQL (i.e., SUM(), AVG(), etc.). Virtual table functions (also known as table functions in PostgreSQL and Oracle) are used to create virtual tables that can be used in a similar way to regular tables. The API offers both serial access via a cursor and random access via an index. The SQLite engine is not aware of the table size, allowing the input and output to be arbitrarily large. All UDFs are implemented in Python. Neither Python nor SQLite is strictly typed. This enables the implementation of UDFs that have dynamic schemas based on their input data.
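ADP registers its Python UDFs through APSW and the madIS layer; as a rough illustration of the row and aggregate categories only, the sketch below uses the standard library sqlite3 module instead (the UDF names reverse_text and geomean are made up for the example). Virtual table functions have no counterpart in the standard sqlite3 module; in ADP they are provided through the madIS layer.

import math
import sqlite3

# Row function: one value in, one value out.
def reverse_text(value):
    return value[::-1] if value is not None else None

# Aggregate function: step() consumes rows, finalize() returns the result.
class GeometricMean:
    def __init__(self):
        self.log_sum = 0.0
        self.count = 0
    def step(self, value):
        if value is not None and value > 0:
            self.log_sum += math.log(value)
            self.count += 1
    def finalize(self):
        return math.exp(self.log_sum / self.count) if self.count else None

conn = sqlite3.connect(":memory:")
conn.create_function("reverse_text", 1, reverse_text)
conn.create_aggregate("geomean", 1, GeometricMean)

conn.execute("create table t(x text, v real)")
conn.executemany("insert into t values (?, ?)", [("abc", 2.0), ("def", 8.0)])
print(conn.execute("select reverse_text(x), v from t").fetchall())
print(conn.execute("select geomean(v) from t").fetchone())  # (4.0,)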
2.5 SQL Processing Engine
The first layer is APSW, a wrapper of SQLite that makes it possible to control the DB engine from Python. APSW also makes it possible to implement UDFs in Python, enabling SQLite to use them in the same way as its native UDFs. Both Python and SQLite are executed in the same process, greatly reducing the communication cost between them. The Connection and Function Manager (CFM) is the external interface. It receives SQL queries, transforms them into SQL92, and passes them to SQLite for execution. This component also supports query execution tracing and monitoring. Finally, it automatically finds and loads all the available UDFs.
The query is parsed, optimized, and transformed into the intermediate dataflow language described earlier. The Parser and Optimizer of the system use information stored in the Catalog. The Catalog contains all the information about the tables. The dataflow language is optimized and transformed into a dataflow graph. Each node of the graph is an SQL query and each link is either an input or an output table. The graph produced contains two types of operators: SQLExec and unionReplicator. Operator SQLExec takes as a parameter an SQL query and executes it, reading the partitions from its input and producing the output partitions. Operator unionReplicator performs a union all over the partitions of its input (all the partitions must be from the same dTable) and replicates the result to all of its outputs.
The dataflow graph is given to ADP for execution. The system schedules the dataflow graph and monitors
its progress. When the execution completes successfully, the table produced is added to the Catalog of the
system. ADP finds the best schedule for the dataflow graph and executes it on the containers reserved on
the cloud.
A particular database is located in a particular catalog on the file system of each machine. Each partition is in a different file. A table is attached to the database without needing to import the data, and as a result, the startup and shutdown time of the database engine is eliminated. The limit on the number of attached databases in SQLite is 62. For operators that have more than 62 input partitions, we create a tree of unions. The total number of iterations is bounded by ⌈log_62 |inputs|⌉ − 1.
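As a quick illustration of the union tree (this is not ADP code; the function and constant names are made up), the sketch below groups an arbitrary number of input partitions into union nodes of at most 62 inputs each and reports how many union rounds are needed; the extra rounds beyond the first stay within the bound stated above.

import math

MAX_ATTACHED = 62  # SQLite limit on attached databases

def union_tree_levels(num_partitions):
    """Number of union rounds needed when each union node can read at
    most MAX_ATTACHED inputs. Returns (levels, extra_iterations)."""
    levels = 0
    remaining = num_partitions
    while remaining > 1:
        remaining = math.ceil(remaining / MAX_ATTACHED)
        levels += 1
    return levels, max(levels - 1, 0)

for n in (10, 62, 63, 62 * 62, 62 * 62 + 1):
    levels, extra = union_tree_levels(n)
    bound = max(math.ceil(math.log(n, MAX_ATTACHED)) - 1, 0)
    print(n, levels, extra, bound)  # extra iterations stay within the stated bound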
2.6 Conclusions
In this chapter we described the architecture and the components of ADP. We also presented example queries that illustrate the use of UDFs in the language, and we gave a detailed characterization of the dataflow language. In the next chapter we present results regarding the optimization of dataflow processing, which includes the task of optimizing the execution of queries expressed in this language in a system like ADP.
Chapter 3
Elastic Dataflow Processing in the Cloud
Query processing has been studied for a long time by the database community in the context of parallel,
distributed, and federated databases [15, 28, 37]. Recently, cloud computing has attracted much attention in
the research community and software industry and fundamental database problems are being revisited [17].
Thanks to virtualization, cloud computing has evolved over the years from a paradigm of basic IT infrastructures used for a specific purpose (clusters), to grid computing, and recently, to several paradigms of
resource provisioning services: depending on the particular needs, infrastructures (IaaS — Infrastructure as
a Service), platforms (PaaS — Platform as a Service), and software (SaaS — Software as a Service) can be
provided as services [18]. One of the important advantages of these incarnations of cloud computing is the
cost model of resources. Clusters represent a fixed capital investment made up-front and a relatively small
operational cost paid over time. In contrast, clouds are characterized by elasticity, and offer their users the
ability to lease resources only for as long as needed, based on a per-quantum pricing scheme, e.g., one hour on Amazon EC2 (http://aws.amazon.com/ec2). Together with the lack of any up-front cost, this represents a major benefit of clouds
over earlier approaches. The elasticity, i.e., the ability to use computational resources that are available on
demand, challenges the way we implement algorithms, systems, and applications. Execution of dataflows
can be elastic, providing several choices of price-to-performance ratio and making the optimization problem
two-dimensional [27].
Modern applications combine the notions of querying & search, information filtering & retrieval, data
transformation & analysis, and other data manipulations. Such rich tasks are typically expressed in a high
level language (SQL enhanced with UDFs), optimized [29], and transformed into an execution plan. The
latter is represented as a data processing graph that has arbitrary data operators as nodes and producer-consumer interactions as edges. In this chapter we focus on the common denominator of all distributed data
processing systems: scheduling, i.e., where each node of the dataflow graph will be executed. Scheduling is
a well-known NP-complete problem [19, 42]. Traditionally, the main criterion to optimize is the completion
time of the dataflow, and many algorithms have been proposed for that problem [30, 52]. Scheduling dataflows
on the cloud is a challenging task, since the space of alternative schedules becomes very rich once the monetary cost of using the resources is taken into account.
A major research problem is the development of new distributed computing paradigms that fit closely the
elastic computation model of cloud computing. The most successful of these computational models today is
MapReduce. Our first contribution is to demonstrate that very significant elasticity exists in tasks modeled
with the MapReduce abstraction. Moreover, we show that elasticity can be discovered in practice using
highly scalable and efficient algorithms.
It is natural to expect that more refined models would allow further optimizations and extraction of
elasticity. At the same time, it is also reasonable to be concerned as to whether the resulting complexity of
the refined model would allow these optimizations to be performed. Our second contribution is to demonstrate
that there exists a very fertile middle-ground in terms of abstraction which enables the extraction of much
more elasticity than what is possible under MapReduce while remaining computationally tractable.
In this work, we propose a two-step approach to explore the space of alternative schedules on the cloud with respect to both completion time and monetary cost. Initially, we compute a global ranking of the operators of the dataflow based on their influence. Given the ranking, a dynamic programming algorithm finds the initial skyline of solutions. This skyline is further refined by a 2D simulated annealing algorithm. To illustrate the intuition behind our approach, we use the example dataflow graph in Fig. 3.1.
Figure 3.1: A simple dataflow. Heavy operators are bigger. A thicker arrow means a larger volume of data.
Given that we can measure the influence of each operator in the graph, we can assign the most influential
operators first. In our example the best solution is: (i) put A, B, and C in different containers, (ii) put e together with C, and (iii) put d together with A. The challenge is how to measure the influence of each operator in
the dataflow. The number, nature, and temporal & monetary costs of the schedules on the skyline depend
on many parameters, such as the dataflow characteristics (execution time of operators, amount of data
generated, etc.), the cloud pricing scheme (quantum length and price), the network bandwidth, and more.
We have incorporated these scheduling algorithms into ADP.
To the best of our knowledge, our system is the first attempt to address the problem of dataflow processing
on the cloud with respect to both completion time and monetary cost. Our techniques can successfully find
trade-offs between time and money, and offer the user the choice of selecting the best trade-off, or can automatically choose an appropriate one based on user preferences and constraints.
In this chapter, we make the following contributions:
• We show that elasticity can be embedded into processing systems that use the MapReduce abstraction.
• We propose a model of jobs and resources that can capture the special characteristics of the Cloud.
• We propose a two step solution to the scheduling problem on the cloud. We compute the ranking of
the operators in the dataflow. Given that, a dynamic programming algorithm finds the initial skyline
of solutions. The skyline is then refined by a 2D simulated annealing algorithm.
• We show that our approach is able to successfully find trade-offs between completion time and monetary
cost, and thus, exploit the elasticity of clouds.
• We show that using our model, we achieve manageable complexity and a significant gain producing
schedules that dominate plans produced by the MapReduce abstraction on every dimension.
The remainder of this chapter is organized as follows. In Section 3.1 we present the related work. In
Section 3.2 we introduce the notations we use and the problem definition. In Section 3.3 we describe the
dataflow language abstractions we use. In Section 3.4 we present the ranking algorithms that we used. In
Section 3.5 we present the algorithms that we propose and in Section 3.6 we compute their complexity. In
Section 3.7 we present our experimental effort and in Section 3.8 we conclude.
3.1 Related Work
Dataflow processing presents a major challenge and has received a lot of attention in recent years. In [45], a methodology to design ETL dataflows based on multiple criteria is presented. This is complementary to our work; however, to the best of our knowledge, the optimization is not automatic for the time being. Newer approaches [46] focus on optimizing the execution time of dataflows executed over multiple execution engines (DBMSs, MapReduce, etc.).
There are also several efforts that move in the same direction as our work but try to solve simpler
versions of the problem. Examples include a particle swarm optimization of general dataflows having a
single-dimensional weighted average parameter of several metrics as the optimization criterion [38], a heuristic
optimization of independent tasks (no dependencies) having the number of machines that should be allocated
to maximize speedup as the optimization criterion given a predefined budget [44], and approaches that focus on energy efficiency [51].
In summary, we capitalize on the elasticity of clouds and produce multiple schedules, enabling the user
to select the desired trade-off. To the best of our knowledge, no dataflow processing system deals with the concept of elasticity or two-dimensional time/money optimization, which constitute our key novelties.
3.2 Preliminaries
3.2.1 Dataflow
We use the same modeling as in [27] and, for completeness, we summarize it here. A dataflow is represented as a directed acyclic graph graph(ops, flows). Nodes (ops) correspond to arbitrary concrete operators and edges (flows) correspond to data transferred between them. An operator in ops is modeled as op(time, cpu, memory, behavior), where time is the execution time of the operator, cpu is its average CPU utilization measured as a percentage of the host CPU power when executed in isolation (without the presence of other operators), memory is the maximum memory required for the effective execution of the operator, and behavior is a flag that is equal to either pipeline (PL) or store-and-forward (SnF). If behavior is equal to SnF, all inputs to the operator must be available before execution; if it is equal to PL, execution can start as soon as some input is available. Two typical examples from databases are the sort and select operators: sort is SnF and select is PL. These metrics are either computed or collected by the system [31]. We model an operator as having a uniform resource consumption during its execution (cpu, memory, and behavior do not change). A flow between two operators, producer and consumer, is modeled as flow(producer, consumer, data), where data is the size of the data transferred.
The container is the abstraction of the host, virtual or physical, encapsulating the resources provided
by the underlying infrastructure. Containers are responsible for supervising operators and providing the
necessary context for executing them. A container is described by its CPU, its available memory, and its
network bandwidth: cont(cpu, memory, network).
A schedule S_G of a dataflow graph G is an assignment of its operators into containers, schedule(assigns). An individual operator assignment is modeled as assign(op, cont, start, end), where start and end are the start and end times of the operator, respectively, when executed in the presence of other operators.
Time t(S_G) and money m(S_G) are the completion time and the monetary cost of a schedule S_G of a dataflow graph G. Cloud providers lease computing resources that are typically charged based on a per-time-quantum pricing scheme. For this reason we measure t(S_G) and m(S_G) in quanta. Q_t and Q_m are the quantum length and the price of leasing a container for Q_t, respectively.
The cloud is a provider of virtual hosts (containers). We model only the compute service of the cloud and not the storage service. Assuming that the storage service is not used to store temporary results, a particular dataflow G will read and write the same amount of data for any schedule S_G of its operators into containers, so the cost of reading the input and writing the output is the same for all schedules.
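A minimal Python sketch of these notions (illustrative data structures only, not the ADP implementation; the field names follow the definitions above, while the example numbers and topology are made up, loosely following Figure 3.1):

from dataclasses import dataclass, field
from typing import List

@dataclass
class Operator:
    name: str
    time: float        # execution time in isolation
    cpu: float         # average CPU utilization (0..1)
    memory: float      # maximum memory required
    behavior: str      # "PL" (pipeline) or "SnF" (store-and-forward)

@dataclass
class Flow:
    producer: str
    consumer: str
    data: float        # size of the data transferred

@dataclass
class Container:
    cpu: float
    memory: float
    network: float     # network bandwidth

@dataclass
class Assignment:
    op: str
    cont: int
    start: float = 0.0
    end: float = 0.0

@dataclass
class Schedule:
    assigns: List[Assignment] = field(default_factory=list)

# A toy five-operator dataflow with made-up numbers and an assumed topology.
ops = [Operator("A", 10, 0.9, 1.0, "SnF"), Operator("B", 8, 0.8, 1.0, "SnF"),
       Operator("C", 9, 0.8, 1.0, "SnF"), Operator("d", 1, 0.2, 0.5, "PL"),
       Operator("e", 1, 0.2, 0.5, "PL")]
flows = [Flow("A", "d", 5.0), Flow("d", "B", 5.0),
         Flow("A", "e", 20.0), Flow("e", "C", 20.0)]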
3.2.2 Schedule Estimation
The algorithm estimates the time that every operator starts and its execution time given an assignment of
operators to containers. We use a geometric approach. The time-shared resources are cpu, network, and
disk. Those resources form boxes in a multi-dimensional space. Operators are boxes with resources and
limited time. Containers are also modeled as boxes with infinite time. Intuitively, the problem becomes how
to fit the boxes of the operators into the boxes of containers.
Operators are modeled with three boxes I, P , and O with dimensions (time, cpu, inRate, outRate) as
follows: I are the resources needed to read the input from the network or from the disk, P are resources
needed to process the data, and O are resources needed to write the output to the network or to the disk.
For SnF operators I and O are only disk resources. For data transfer operators, P is always zero. The data
read from the network is always written to the disk. Notice that even for PL operators, this modeling is
acceptable because at some level, PL operators are a sequence of SnF operators that read a small fraction
of the input, process it, and produce a small fraction of the output.
Definitions

Formally, let A be an operator that belongs to dataflow G. The operator is defined as A(time, cpu, −, −) with assign(A, X, −, −). Without loss of generality, assume that A is a PL operator; for SnF operators we use the disk instead of the network for I/O. The total data that A reads from the network is

D_{∗→A} = Σ_{B∈G} { D : flow(B, A, D), assign(B, Y, −, −), X ≠ Y }

Similarly, the total data that A writes to the network is

D_{A→∗} = Σ_{B∈G} { D : flow(A, B, D), assign(B, Y, −, −), X ≠ Y }

The three boxes of the operator are defined as follows:

I( D_{∗→A} / X.network, DT_CPU, X.network, 0 )
P( A.time, A.cpu, 0, 0 )
O( D_{A→∗} / X.network, DT_CPU, 0, X.network )

with DT_CPU being a system parameter. In isolation, the box A_B of operator A has the following properties:

A_B.time = I.time + P.time + O.time
A_B.cpu = (I.cpu · I.time + P.cpu · P.time + O.cpu · O.time) / A_B.time
A_B.inRate = (I.inRate · I.time) / A_B.time
A_B.outRate = (O.outRate · O.time) / A_B.time

In the presence of other operators, operators are scaled in the time dimension due to the time-shared resources. However, assuming uniform behavior for all operators, the following measures do not change at any scale:

instructions = A_B.time · A_B.cpu
indata = A_B.time · A_B.inRate
outdata = A_B.time · A_B.outRate

These measures are used to calculate the cpu, inRate, and outRate at any time scale. A group is a subset of connected operators that can be executed concurrently. Groups can contain either connected PL operators or a single SnF operator. The execution time of the operators belonging to the same group is the same; as a consequence, all operators inside the same group are scaled to reach the most time-consuming one. Formally, a group P is defined as P(o, s), with o being the vector of the operators that are members of the group and s being the vector of the time scales of the operators inside the group.
Figure 3.2: Two skylines A (circle) and B (square). Their distance is ((d1 + d2)/2, d3).
Estimation Algorithm
The estimation algorithm estimates when and for how long every operator will run given any schedule. Thus, the completion time and the monetary cost are estimated for the whole schedule. The graph is examined from producers to consumers, starting from the operators that have no inputs. The set of operators that can be executed in parallel is divided into groups. Given a vector of groups P, we find the vector S of time scales. An operator o_i that belongs to group P_j is scaled by S_j · P.s_i. The next time event t is defined by the group that will terminate first. The remaining time and in/out data of each operator are reduced by the percentage of the time that the operator has executed until t. The terminated operators are removed from the ready set, and the operators from the queued set whose memory constraints are satisfied are inserted into the ready set. The algorithm terminates when all the operators have been examined.
The completion time t(S_G) is defined as the time of termination of the last operator of the schedule. To compute the money m(S_G), we slice the time on each container into windows of length Q_t, starting from the first operator of the schedule. The monetary cost is then a simple count of the time windows that have at least one operator running, multiplied by Q_m:

m(S_G) = Q_m · Σ_{i=1}^{|C|} Σ_{j=1}^{|W|} u(c_i, w_j)

with C = {c_i} being the set of containers, W = {w_j} being the set of time windows, and u(c_i, w_j) = 1 if at least one operator is active in window w_j on container c_i, and 0 otherwise.
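A small Python sketch of this monetary cost computation (illustrative; it assumes a schedule is given as per-container lists of (start, end) intervals, which is what assign(op, cont, start, end) provides, and for simplicity it starts the windows at time 0 on each container):

import math

def money(schedule_per_container, Qt, Qm):
    """Monetary cost m(S_G): count, per container, the time windows of
    length Qt that have at least one operator running, and multiply by Qm."""
    total_windows = 0
    for intervals in schedule_per_container:          # one list per container c_i
        if not intervals:
            continue
        horizon = max(end for _, end in intervals)
        num_windows = math.ceil(horizon / Qt)
        for j in range(num_windows):                   # window w_j = [j*Qt, (j+1)*Qt)
            w_start, w_end = j * Qt, (j + 1) * Qt
            if any(start < w_end and end > w_start for start, end in intervals):
                total_windows += 1                     # u(c_i, w_j) = 1
    return Qm * total_windows

# Two containers; quantum of 60 time units costing 1 monetary unit.
containers = [[(0, 50), (55, 130)],   # container 0 is active in windows 0, 1, 2
              [(10, 20)]]             # container 1 is active in window 0 only
print(money(containers, Qt=60, Qm=1))  # -> 4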
3.2.3 Skyline Distance Set
To compare solutions from different algorithms, we define a distance between two skylines. Let A and B be two skylines produced by different algorithms. The units of both time and money are quanta. Assume that A is dominated by B, i.e., skyline(A ∪ B) = B. Let |A| be the number of points in skyline A. The distance of A from B is defined as:

D_p(A, B) = ( Σ_{i=1}^{|A|} dist(A_i, B) ) / |A|

with dist(A_i, B) being the distance of schedule A_i ∈ A from skyline B. Intuitively, this distance shows the average distance in quanta between the schedules of the two skylines. Several works have studied the problem of finding the distance between a point and a skyline [26, 22]. In general, the problem is to minimize the total cost of making A_i part of B, i.e., making it a point of the skyline. In this work we define this distance as the minimum L2 distance needed for A_i to become part of B. We can compute it as shown in Figure 3.2. This distance is smooth at the point where A_i becomes part of the skyline.
We define the distance between two arbitrary skylines as follows. Let A and B be two skylines. We compute C = skyline(A ∪ B). The distance D(A, B) is defined as the pair (D_p(A, C), D_p(B, C)). Figure 3.2 shows an example. This distance has several good properties. Let (D_p(A, C), D_p(B, C)) = (a, b). The following hold:
(i) a, b ≥ 0;
(ii) if a = 0 then A dominates B;
(iii) if a, b > 0, both A and B have schedules with trade-offs;
(iv) the distance is smooth where a schedule becomes part of the skyline.
Figure 3.2 shows an example where both a and b are positive. This distance can be generalized to n skylines (S_1, ..., S_n) as follows:

D(S_1, ..., S_n) = (D_p(S_1, C), ..., D_p(S_n, C))

with C = skyline(∪_i S_i). In our experiments, we compute the generalized skyline distance over the results of all algorithms in order to compare them.
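The following Python sketch illustrates the skyline and the D_p distance on (time, money) points (illustrative only; as a simplification of the point-to-skyline distance of [26, 22], dist(A_i, B) is taken here as the L2 distance to the nearest point of B, and 0 if A_i already lies on the combined skyline):

import math

def skyline(points):
    """Keep the (time, money) points not dominated by any other point
    (smaller is better in both dimensions)."""
    result = []
    for p in points:
        if not any(q != p and q[0] <= p[0] and q[1] <= p[1] for q in points):
            result.append(p)
    return sorted(set(result))

def dp_distance(A, B):
    """Average distance of skyline A from skyline B (D_p in the text),
    using the simplified per-point distance described above."""
    combined = skyline(A + B)
    total = 0.0
    for a in A:
        if a in combined:
            continue
        total += min(math.dist(a, b) for b in B)
    return total / len(A)

A = [(10, 6), (12, 5), (16, 4)]   # dominated skyline, (time, money) in quanta
B = [(8, 5), (11, 3), (15, 2)]    # dominating skyline
print(dp_distance(A, B), dp_distance(B, A))  # second value is 0.0: B dominates A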
3.3 Dataflow Language
In our system, requests take the form of queries in SQL with user defined functions (UDFs). The SQL query
is transformed into the intermediate level that we described in Section 2.3.1. This intermediate level SQL
script is then transformed into a dataflow graph using the modeling described in Section 3.2. Figure 3.3
shows the dataflow graph produced from query 8 of the TPC-H benchmark.
Figure 3.3: The dataflow graph produced from TPC-H query 8.
3.4 Operator Ranking
In this section we present the methodology we use to rank the operators of a dataflow graph. A score is computed for each operator according to the influence it has on the schedule. Figure 3.4 shows the results of ranking the Montage dataflow with 50 SnF operators. Influential operators have darker colors.
Figure 3.4: The Montage dataflow ranking information. Darker colors mean higher scores.
A simple way to rank the operators is by computing a relative score based on their properties and their input and output data, F(op, in, out). We call this ranking Structure Ranking because it takes into account only the immediate neighborhood of each operator in the graph. The scoring function we use is defined as follows:

score(o_i) = a · o_i.proc_time + (1 − a) · o_i.io_time

with o_i being operator i, proc_time = o_i.time, and

io_time = ( Σ_{i=1}^{|in|} in[i].data + Σ_{i=1}^{|out|} out[i].data ) / net_speed

Parameter a expresses the relative importance of the execution time and the I/O time. In our experiments we used a = 0.5. The ranking of the operators is defined as

SR(G) = (score(o_1), score(o_2), ..., score(o_n))
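A short Python sketch of the Structure Ranking score (illustrative only; the operator and flow tuples follow the model of Section 3.2.1, and net_speed, a, and the example numbers are assumptions):

from collections import namedtuple

Op = namedtuple("Op", "name time")
Flow = namedtuple("Flow", "producer consumer data")

def structure_ranking(ops, flows, net_speed, a=0.5):
    """Score each operator by a weighted sum of its processing time and the
    time needed to transfer its input and output data (Structure Ranking)."""
    scores = {}
    for op in ops:
        in_data = sum(f.data for f in flows if f.consumer == op.name)
        out_data = sum(f.data for f in flows if f.producer == op.name)
        io_time = (in_data + out_data) / net_speed
        scores[op.name] = a * op.time + (1 - a) * io_time
    # SR(G): operators ordered from the highest score to the lowest.
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

ops = [Op("A", 10), Op("B", 2), Op("C", 2)]
flows = [Flow("A", "B", 100.0), Flow("A", "C", 5.0)]
print(structure_ranking(ops, flows, net_speed=10.0))  # A, then B, then C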
Assuming there are no long-range correlations between operators in the graph, this is a good ranking
function. A long-range correlation between two operators exists when the minimum length of the path
between them is more than 2, and the assignment of one operator in the schedule affects the assignment of
the other. If the graph has this property, Structure Ranking will not find it.
To overcome this problem, we compute the score of each operator by measuring its influence on the schedule directly. We can measure that by finding the partial derivatives of the operators on the space of schedules produced by our model. Given a particular schedule S_G of a dataflow graph G, we compute:

∇SE(S_G) = ( ∂SE/∂o_1 (S_G), ..., ∂SE/∂o_n (S_G) )

with o_i being operator i in the dataflow and SE being the schedule estimation algorithm. Essentially, the derivative shows how sensitive the dataflow is to the different assignments of the operators. We call this ranking Derivative Ranking. The derivative of SE is hard to compute analytically, so instead we use an iterative process to approximate it. Assuming the graph has n operators, the space of schedules is an n-dimensional hypercube. Each axis of that cube is of length C (the number of containers). To measure the derivative, we assign each operator to all possible positions in its axis, without changing the positions of the others, and measure the difference in time and money of all the produced schedules.

Algorithm 1 shows the ranking process. The algorithm creates a schedule by randomly assigning the operators into containers. Then, each operator is assigned to every possible container without changing the positions of the others. At each step, it measures the difference of the cost function provided. This is repeated until the ordering of the operators does not change. The cost function we used is defined as follows:

F(S(G)) = a · S(G).time + (1 − a) · S(G).money

In our experiments we set a = 0.5.
Intuitively, the Structure Ranking and the Derivative Ranking should produce similar results if there are
no long-range correlations between the operators in the dataflow graph. Indeed, this is the case for graphs
with SnF operators. Below we present the results of ranking the Ligo [13] dataflow with 50 operators. We
show the structure (S) and derivative (D) ranking. Each operator is a different character. Each list is ranked
from the highest score to the lowest. We measure the difference of the two rankings using the Kendall tau
rank distance [16].
S: WnoXmMTpiOkRfVNjPQghlUSqYrBAEHKCGJFLID1c2a3b45Zd6e
D: WMXTORNPnomkVpfjQiUShglrqYJDBFGHELKCIA1a42b3Zce6d5
Kendall tau distance: 0.088
We observe that the two rankings are not significantly different and the most influential operators are
the same. However, for PL operators, the long-range correlations do exist. Below we show the results of
Ligo with PL operators.
S: WnoXmMTpiOkRfVNjPQghlUSqYrBAEHKCGJFLID1c2a3b45Zd6e
D: YWNTQXJDRIVCBOLUbGZ4FS3HMEAacPgj1fndKkhi5m6le2oprq
Kendall tau distance: 0.475
We observe that the rankings are significantly different. Pipeline operators run in parallel, and two connected operators run for the same amount of time regardless of whether one of them is faster than the other. The structure ranking would give the fast operator a smaller score than the slow one, which is not correct, because they are both influential. An illustrative example is operator D: using the structure ranking its rank is 37, while using the derivative ranking it is 8.
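For reference, the normalized Kendall tau rank distance between two rankings is the fraction of item pairs that appear in opposite relative order; a plain Python sketch (illustrative, assuming both rankings contain exactly the same items):

from itertools import combinations

def kendall_tau_distance(rank_a, rank_b):
    """Fraction of discordant pairs between two rankings of the same items."""
    pos_a = {item: i for i, item in enumerate(rank_a)}
    pos_b = {item: i for i, item in enumerate(rank_b)}
    pairs = list(combinations(rank_a, 2))
    discordant = sum(1 for x, y in pairs
                     if (pos_a[x] - pos_a[y]) * (pos_b[x] - pos_b[y]) < 0)
    return discordant / len(pairs)

print(kendall_tau_distance("ABCDE", "ABCED"))  # 0.1: one swapped pair out of ten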
Algorithm 1 Ranking Algorithm
Input:
  G: the dataflow graph
  N: the number of containers
  F: F(S(G)) → ℝ: the cost function
  M: the maximum number of iterations
Output:
  scores: the scores of the operators

 1: scores[|G.ops|] ← 0
 2: for m ∈ [1, M] do
 3:   S ← RandomScheduler(G, N)
 4:   all[|G.ops|][N] ← 0
 5:   for o ∈ G do
 6:     for c ∈ [1, N] do
 7:       assign(o, c, _, _)
 8:       all[o][c] ← F(S)
 9:     end for
10:   end for
11:   curr[|G.ops|] ← 0
12:   for o ∈ G do
13:     curr[o] ← max_i(all[o][i]) − min_i(all[o][i])
14:   end for
15:   if the ranking of the operators is the same in scores and curr then
16:     break
17:   end if
18:   for o ∈ G do
19:     scores[o] ← ((m − 1) · scores[o] + curr[o]) / m
20:   end for
21: end for
22: return scores
3.5 Scheduling Algorithm
The scheduling algorithm that we propose has two phases. The first phase computes the skyline of schedules
based on a dynamic programming algorithm using the ranking of the operators. This skyline is further
refined using a 2D simulated annealing. In the following sections we present the two algorithms.
3.5.1 Dynamic Programming
The dynamic programming algorithm that we use is shown in Algorithm 2. The operators are considered from producers to consumers. Each operator with no inputs is a candidate for assignment. An SnF operator becomes a candidate as soon as all of its inputs are available. A PL operator becomes a candidate as soon as all of its inputs come from PL operators or from completed SnF operators. The algorithm chooses the operator with the maximum rank from the list of available operators. That operator is added to all the schedules in the skyline at every possible position. At every step, only the schedules on the skyline are kept.
The skyline may contain too many points [40]. For the experiments, we keep all the points in the
skyline. In practice this is infeasible. Several approaches can be followed. One approach would be to keep
k representative schedules (k is a system parameter) from the skyline: the fastest, the cheapest, and k − 2
equally distributed in between.
Algorithm 2 Dynamic Programming
Input:
  G: a dataflow graph
  C: the maximum number of parallel containers to use
Output:
  skyline: the solutions

 1: ready ← {operators in G that have no dependencies}
 2: firstOperator ← maxRank(ready)
 3: firstSchedule ← {assign(firstOperator, 1, −, −)}
 4: skyline ← {firstSchedule}
 5: while ready ≠ ∅ do
 6:   next ← maxRank(ready)
 7:   S ← ∅
 8:   for all schedules s in skyline do
 9:     for all containers c (c ≤ C) do
10:       S ← S ∪ {s + assign(next, c, −, −)}
11:     end for
12:   end for
13:   skyline ← skyline of S
14:   ready ← ready − {next}
15:   ready ← ready ∪ {operators in G whose dependency constraints no longer exist}
16: end while
17: return skyline
3.5.2 Simulated Annealing
The 2D simulated annealing is shown in Algorithm 3. The initial skyline is computed by Algorithm 2.
At each step, all the schedules of the skyline are considered. A neighbor of each schedule is computed by
assigning an operator to another container. If the newly produced schedule dominates the old one, only that
schedule is kept. Both schedules are kept if they do not dominate each other. If the old one dominates the
new, the new one is kept with a probability that depends on the Euclidean distance between them and the temperature. The lower the temperature, the smaller the probability of keeping a dominated schedule.
As RandNeighbor we use two functions: (i) purely random and (ii) random based on ranking. The latter chooses the operators with probability proportional to their scores.
3.5.3 Parallel Wave
Parallel Wave is a generalization of the scheduling algorithm for MapReduce graphs. In the beginning, the algorithm finds the depth of each operator in the graph. The operators with no inputs have depth zero. The depth of every other operator is the maximum depth of the operators connected to its inputs plus one. In a MapReduce pipeline, all the operators in the map phase will be at the same depth; the same holds for the operators in the reduce phase. We used this algorithm to add elasticity to MapReduce dataflow pipelines, similar to the ones produced by Hive [48] or Flume [8].
Let W be the number of different depths in the graph and p_i be the maximum parallelism of depth i. p_i is defined as min(C, |W_i|), with C being the maximum number of containers and |W_i| being the number of operators at depth i. The algorithm considers all the combinations of different p_i. For the scheduling of operators at the same level, we use a simple load balancing mechanism.
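A short Python sketch of the depth computation used by Parallel Wave (illustrative only; the graph is assumed to be given as an acyclic mapping from each operator to the list of operators feeding its inputs):

from collections import Counter

def operator_depths(inputs_of):
    """Depth of each operator: 0 for operators with no inputs, otherwise one
    more than the maximum depth of the operators feeding it."""
    depth = {}
    def visit(op):
        if op not in depth:
            preds = inputs_of.get(op, [])
            depth[op] = 0 if not preds else 1 + max(visit(p) for p in preds)
        return depth[op]
    for op in inputs_of:
        visit(op)
    return depth

# A small MapReduce-shaped graph: three map operators feed two reduce operators.
graph = {"m1": [], "m2": [], "m3": [],
         "r1": ["m1", "m2", "m3"], "r2": ["m1", "m2", "m3"]}
depths = operator_depths(graph)
print(depths)  # maps at depth 0, reduces at depth 1

# Maximum parallelism per depth, with an assumed container limit C = 4.
C = 4
p = {d: min(C, n) for d, n in Counter(depths.values()).items()}
print(p)  # {0: 3, 1: 2}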
3.5.4 Exhaustive
The exhaustive algorithm enumerates all the different schedules and keeps the ones in the skyline. We use
this algorithm for two purposes: (i) to compare the results of the proposed algorithms with the optimal and
(ii) as an initialization step for simulated annealing to find the optimal assignment for the most influential
operators. Let N be the number of operators and C be the number of containers.
Algorithm 3 Simulated Annealing
Input: G: A dataflow graph.
       K: Maximum number of iterations.
       C: Maximum number of containers.
Output: skyline: The solutions.
 1: skyline ← DynamicProgramming(G, C)
 2: S ← skyline
 3: k ← 0
 4: while k < K do
 5:   for all schedules s in S do
 6:     n ← RandNeighbor(s)
 7:     if n.time < s.time and n.money < s.money then
 8:       // n dominates s
 9:       s ← n
10:     else
11:       if n.time < s.time or n.money < s.money then
12:         // Tradeoff
13:         S ← S ∪ {n}
14:       else
15:         if e^(−L2Distance(s, n) / T(k)) > rand[0, 1] then
16:           s ← n
17:         end if
18:       end if
19:     end if
20:   end for
21:   skyline ← skyline of (skyline ∪ S)
22:   k ← k + 1
23: end while
24: return skyline
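The acceptance test of lines 15-17 can be sketched in Java as follows; the geometric cooling schedule T(k) shown here is an illustrative assumption and not necessarily the temperature function used in the system.

import java.util.Random;

class Acceptance {
    private final Random rng = new Random();

    // Geometric cooling: T(k) = t0 * alpha^k (illustrative choice of schedule).
    static double temperature(int k, double t0, double alpha) {
        return t0 * Math.pow(alpha, k);
    }

    // A dominated neighbor at Euclidean distance d from the current schedule
    // is accepted with probability exp(-d / T(k)); the lower the temperature,
    // the smaller the probability of keeping a dominated schedule.
    boolean acceptDominated(double distance, int k, double t0, double alpha) {
        double t = temperature(k, t0, alpha);
        return Math.exp(-distance / t) > rng.nextDouble();
    }
}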
For N = 2 and C = 2 there are 2^2 = 4 assignments. However, the solution space has many symmetries.
In the example, only two solutions are distinct: (i) assign both operators to the same container, or (ii) assign
them to different containers. We break these symmetries² by generating schedules as shown in Figure 3.5,
which shows all the distinct solutions when scheduling three operators (A, B, and C).
Figure 3.5: All the solutions for three operators A, B, and C.
We can compute the number of distinct schedules as follows. The number of leaves of the sub-tree starting
with k containers already used and n remaining operators to assign is
ss(n, k) = k · ss(n − 1, k) + ss(n − 1, k + 1), with ss(0, k) = 1 for all k.
The root of the tree is ss(N, 0). A simple dynamic programming technique is used to solve the recurrence.
A generalization is to use at most C containers:
ssg(n, k, C) = k · ssg(n − 1, k, C) + ssg(n − 1, k + 1, C), with ssg(0, k, C) = 1 if k ≤ C, and 0 otherwise.
²Finding symmetries in the graph itself is a much harder problem.
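For illustration, the generalized recurrence can be evaluated directly with memoization; the sketch below is a straightforward transcription in Java, where the total number of distinct schedules for N operators and at most C containers is obtained as new ScheduleCounter(C).count(N, 0).

import java.util.HashMap;
import java.util.Map;

class ScheduleCounter {
    private final Map<Long, Long> memo = new HashMap<>();
    private final int maxContainers;

    ScheduleCounter(int maxContainers) { this.maxContainers = maxContainers; }

    // ssg(n, k, C): number of distinct schedules for n remaining operators
    // when k containers have already been used, with at most C containers.
    long count(int n, int k) {
        if (k > maxContainers) return 0;          // base case: too many containers
        if (n == 0) return 1;                     // base case: nothing left to assign
        long key = ((long) n << 32) | k;
        Long cached = memo.get(key);
        if (cached != null) return cached;
        long result = (long) k * count(n - 1, k) + count(n - 1, k + 1);
        memo.put(key, result);
        return result;
    }
}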
Figure 3.6: Total and unique number of schedules.
Figure 3.6 shows the number of schedules before and after breaking the symmetries. For 20 operators,
the number of unique schedules is approximately 10 orders of magnitude smaller. We leave breaking the
symmetries of the graph itself for future work.
3.6 Complexity Analysis
We assume as given a dataflow graph G with n operators, l links, and c containers. We also assume that the
number of schedules in the skyline is no more than s.
Schedule Estimation: The worst case scenario is when all operators are ready in the beginning and only one
operator terminates each time. At each phase, all operators and all links are considered. The complexity is
SE_C = O(∑_{i=1}^{n} (i + l)) = O(nl + n(n+1)/2).
Given that in most graphs l > n, SE_C = O(nl).
Ranking: Ranking takes at most r iterations to complete. At each iteration, nc invocations of SE are
performed. Thus, the complexity is R_C = O(rn²lc).
Dynamic Programming: This algorithm takes n steps. At each step, at most sc invocations of SE are
performed, one for each schedule and container. Therefore, the complexity is DYN_C = O(sn²lc).
Simulated Annealing: Simulated annealing takes at most k steps and in every step at most s
invocations of SE are performed for the new neighbors. The complexity is SA_C = O(ksnl).
The overall algorithm calls the ranking algorithm, then dynamic programming, and finally simulated
annealing. Therefore, its complexity is O(n²lc(r + s + ks/(nc))).
3.7 Experimental Evaluation
This section presents the results of our experimental effort divided into two groups. The first group contains
experiments with our model and algorithms. The second group contains experiments with our system and
the comparison with Hive [48].
3.7.1 Model and Algorithms
The parameters of the experiment are shown in Table 3.1. We begin by presenting the experimental setup.
Table 3.1: Algorithm Experiment Properties
Property        Values
Dataflow        Montage, Ligo, Cybershake, TPC-H, MapReduce
Output Size     10x – 10000x
Operator type   SnF, PL
Ranking         Derivative, Structure
Search          NL, Dyn, SA, Exh, PW
Data transfer   DT_CPU = 0.1, DT_MEM = 0.05
Table 3.2: Operator Properties
Property   Values
time       0.2, 0.4, 0.6, 0.8, 1.0
cpu        0.4, 0.45, 0.5, 0.55, 0.6
memory     0.05, 0.1, 0.15, 0.2, 0.25
data       0.2, 0.4, 0.6, 0.8, 1.0
Experimental Setup
Dataflow Graphs: We examine five families of dataflow graphs: Montage [25] (Fig. 3.7A), Ligo [13] (Figure 3.7B), Cybershake [12] (Figure 3.7C), MapReduce (Figure 3.8), and the first 10 queries of TPC-H [6]
(Figure 3.3 shows query 8). The first three are abstractions of dataflows that are used in scientific applications: Montage is used by NASA to generate mosaics of the sky, Ligo is used by the Laser Interferometer
Gravitational-wave Observatory to analyze galactic binary systems, and Cybershake is used by the Southern
California Earthquake Center to characterize earthquakes. The MapReduce and TPC-H graphs are generated
by our dataflow language as presented in Section 3.3.
Figure 3.7: The scientific graphs Montage(A), Ligo(B), and Cybershake(C).
Operator Types: We have indicated the values of operator properties as percentages of the corresponding parameters of container resources (Table 3.2). For example, an operator having memory needs equal to
0.4 uses 40% of a container’s memory. Furthermore, execution times are given as percentages of the cloud’s
time quantum and so are data sizes (inputs/outputs of operators), taking into account the network speed.
For example, an execution time of 0.5 indicates that the operator requires half of a time quantum to complete
its execution (say, 30 minutes according to the predominant Amazon cost model). Likewise, an output of
size 0.2 requires one fifth of a time quantum to be transferred through the network if needed. This way, the
output data size is in inverse correlation with network speed. Money is measured as described in Section 3.2.
We have used synthetic workloads based on Montage, Ligo, and Cybershake dataflows as defined in [7].
Figure 3.8: A MapReduce pipeline consisting of two jobs. Each job has three phases: map, combine, and reduce.
We scaled up the specified run times and operator output sizes by factors of 50 and 1000 respectively; the run
time to output size ratio was increased by 20. We also set operator memory to 10% of the container capacity.
We set the data transfer CPU utilization to DT_CPU = 0.1 and the memory needs to DT_MEM = 0.05. We
experimented with changing the output by a factor of 10, 100, 1000, and 10000. The properties of operators
in MapReduce and TPC-H are chosen with uniform probability from the corresponding sets of values shown
in Table 3.2. We have used dataflows with both SnF and PL operators.
Optimization Algorithms: We used the Nested Loops Optimizer (NL) as presented in [27], Exhaustive
(Exh), Parallel Wave (PW), Dynamic Programming (DP), and Simulated Annealing (SA). For the NL
algorithm, we used the greedy algorithm in the inner loop and 10 different numbers of containers in the range
[1, N], with N being the number of operators in the graph. We used both the structure and derivative methods
for ranking. For the initialization of the SA algorithm, we used DP, Exh using the 10 most influential
operators, and random. We used two RandNeighbor functions (Algorithm 3, line 6): purely random, and random
based on ranking (each operator is selected with probability proportional to its score).
Measurements: For each experiment we generated 10 different graphs with different seeds. For Montage, Ligo, and Cybershake, we used the generator in [7]. TPC-H and MapReduce graphs are generated from
the language described in Section 3.3. We ran all the algorithms for each graph and computed the
generalized skyline distance from the results. As defined in Section 3.2, the generalized skyline distance is
the distance of the skyline produced by each algorithm from the combined skyline. In the results we show
the average value over the 10 different seeds. Time and money are measured in quanta.
Model Validation
We begin by presenting the experiments we performed to validate our model. We used the TPC-H queries.
The properties of each operator, as presented in Section 3.2, can be either computed or collected by the
system during execution [31]. In this work they are assumed to be given. To acquire the properties, we ran each
operator in isolation using a single container and collected the statistics. The cloud price quantum is set to
10 seconds.
Figure 3.9 shows the real and estimated execution time and money for all queries of TPC-H using 8
containers and 8 GB of data in total. We show the fastest plan in the skyline. We observe that our model is
able to successfully predict both the time and money of the dataflows. Furthermore, query 9 clearly shows
that our model overestimates the actual running time and money, thus the estimate can be used as an upper
bound of the real cost.
Compare with Optimal
In this set of experiments, we compare the skylines produced by our algorithms with the optimal. We generate
the optimal result using the Exh algorithm. Since computing the skyline optimally is very expensive for large
dataflows, we used TPC-H graphs with 10 operators. To have a better understanding of the quality of the
results, we also computed the worst solutions. Those solutions are in the Max-Max skyline. Here we show the
results of queries 3, 6, and 9 (the results are similar for the other queries). The characteristics of the operators
are chosen with uniform probability from the values in Table 3.2. The results are shown in Figure 3.10. All
operators are SnF and the output data replication factor is 100. We observe that the DP algorithm produces
solutions that are not far from optimal. SA significantly improves the results produced by DP, especially for
query 6.
Figure 3.9: Real and estimated execution time and money for queries of the TPC-H benchmark.
We are aware of the fact that these results cannot be generalized to larger graphs. We are currently
working on techniques to reduce the search space by breaking symmetries on the graph itself and compute
exact solutions for very large graphs. One way to pursue this is by analyzing the SQL scripts presented in
Section 3.3. The dataflow graphs are generated by a relatively small number of SQL queries and they are
very symmetric due to the nature of SQL.
Elastic MapReduce
We created several MapReduce dataflows with different numbers of jobs. The properties of each operator
are chosen from the values in Table 3.2 with uniform probability. Figure 3.11 shows the results of dynamic
programming and parallel wave using the dataflow of Figure 3.8 with SnF and PL operators. First, we
observe that we can have significant elasticity on MapReduce dataflows. Second, we see that using the
algorithm we propose, we gain significant improvement. We are able to find plans that are as fast as the
ones of MapReduce but a lot cheaper. Finally, we observe that the gain for PL is much bigger than the case
for SnF. This was expected since the MapReduce graphs are SnF.
We also experimented with varying the size of data. Figure 3.12 shows the results. We observe that when
the amount of data increases, we are able to find better schedules for both SnF and PL dataflows. Furthermore,
we observe that for SnF dataflows, the MapReduce scheduler performs better. We expected this since it is
designed for this type of dataflows.
Effect of Ranking
In this set of experiments we measured the effect of ranking the operators. We used both structure and
derivative ranking with Dyn and SA algorithms. The dataflows have SnF and PL operators. The output
data replication factor is set to 100. As a baseline for Dyn, we compare with the FIFO algorithm, i.e., the
operators are assigned in the same order in which they become available.
Figure 3.10: Comparison of different algorithms with the optimal using TPC-H queries with 10 operators.
Figure 3.11: Dynamic algorithm compared with Parallel Wave on MapReduce graphs.
Dynamic Programming: Figure 3.13 shows the results for Dyn using the Ligo dataflow with 100 PL and
SnF operators. We observe that ranking, in general, dramatically improves the solutions compared to FIFO
for both PL and SnF dataflows. As expected, for SnF operators the results are similar because the rankings
do not differ much. For PL operators, however, we see that the ranking based on the derivative is much more
beneficial than the ranking based on structure.
Simulated Annealing: For simulated annealing, we experimented with the initialization algorithm
and the neighbor selection. Figure 3.14 shows the results for the Cybershake dataflow with 100 SnF and PL
operators.
In the left plot of Figure 3.14, we observe that the initialization algorithm is essential. The best choice is
the Dyn algorithm: it obtains results that are not much worse than Exh for SnF operators and is the best for
PL operators. Using Exh on the 10 most influential operators is the best for SnF but not beneficial at all
for PL operators. This is because the influential operators in PL dataflows are both more numerous and more
sensitive than in SnF dataflows. We are currently working on ways to make SA multi-phase: each phase
considers only a subset of operators. The operators can be partitioned based on their ranking. This could
be done using histogram-based partitioning [23] like equi-width or max-diff.
The middle plot of Figure 3.14 shows the results of SA using different algorithms to choose a neighbor.
Interestingly, the ranking-based neighbor selection hurts SA: it prevents it from navigating the search space
by restricting it to move only the most influential operators. The purely random neighbor
selection is the best choice.
In the right part of Figure 3.14 we show the effect of using different ranking algorithms. We observe that
the derivative ranking is significantly better than structure for PL operators and does not differ much for
SnF operators.
Compare Algorithms
In this set of experiments, we compare all algorithms using the scientific dataflows Montage, Ligo, and
Cybershake. All dataflows have 50 operators and the data replication is 100. We compare the algorithm
proposed in this work with the ones in [27] using the generalized skyline distance.
Figure 3.12: Dynamic algorithm compared with Parallel Wave on MapReduce graphs, varying the data replication factor.
Figure 3.13: Dynamic programming with different ranking using Ligo dataflow with 100 PL and SnF operators.
Figure 3.15 shows the results. We observe that the combination of the Dyn algorithm to find the initial
skyline and the refinement of that skyline by SA with a random neighbor is the best choice. In some cases, the
solutions produced are one order of magnitude better than those of the NL algorithm. As observed earlier, using
ranking with SA is not beneficial. Furthermore, the Exh algorithm does not give very good results compared
to Dyn+SA. We recall that Exh uses only the 10 most influential operators.
We also measured the number of schedules in the skyline produced by each algorithm. In general, Dyn
finds more schedules in the skyline than NL, and SA improves that even more. That, combined with the
fact that Dyn+SA produces the best skyline of schedules, makes it an obvious choice.
We also experimented with varying the size of the output data generated by the operators for Montage,
Ligo, and Cybershake dataflows with 50 operators. Figure 3.16 shows the results. We observe that Dyn+SA
proposed in this work produces plans that in some cases are one order of magnitude better than NL.
Finally, Figure 3.17 shows the running time of the algorithms. We observe that the Dyn algorithm has the
fastest running time. Exh and Exh+SA have very long running times compared to the others because they
examine a very large number of schedules, and each call of the Schedule Estimation sub-routine is expensive.
3.7.2 Systems
The parameters of the experiment are shown in Table 3.3.
Experimental Setup
Execution Environment: In our experiments, all containers have the same resources (cpu, memory, disk,
and network). The resources were kindly provided by Okeanos³, the cloud service of GRNet⁴. We used 32 virtual
machines (containers), each with 1 CPU, 8 GB of memory, and 60 GB of disk. We measured the average
network speed at 15 MB/sec. We used Hadoop 1.1.2 and Hive 0.11. The time and money are measured in
quanta.
³okeanos.grnet.gr
⁴www.grnet.gr
Figure 3.14: Initialization algorithm, neighbor selection, and ranking for the Cybershake dataflow with 100 PL and SnF operators.
Figure 3.15: Compare distance and number of schedules in the skyline of different algorithms from the combined skyline on various scientific dataflows with 50 operators.
Dataset: We generated a total of 512 GB of data (approximately 2.2 billion tuples) using the generator
provided for TPC-H. The benchmark has eight tables:
region(1), partsupp(32, ps_partkey), orders(32, o_orderkey),
lineitem(32, l_orderkey), customer(32, c_custkey),
part(32, p_partkey), nation(1), and supplier(1).
In parentheses, we show the number of partitions we created for each table and the key on which
the partitioning was performed. In Hive, we used the CLUSTERED BY clause when creating the tables. The
replication factor of Hadoop is set to 1 in order to have a clear comparison with our system. After loading
the data to Hadoop we used the balancer to evenly distribute the data across the cluster. In ADP, the tables
are horizontally partitioned and distributed to the cluster. The partitions of the same table are placed on
different virtual machines.
Table 3.3: System Experiment Properties
Property          Values
Dataflow          TPC-H, MapReduce
Operator type     SnF
Ranking           Derivative, Structure
Search            Dyn, PW
Num of VMs        8, 16, 24, 32
TPC-H Data (GB)   64, 128, 256, 512
Quantum size      10 seconds
Figure 3.16: Varying data for Montage, Ligo, and Cybershake dataflows with 50 operators.
Figure 3.17: Running time of the algorithms for Montage, Ligo, and Cybershake dataflows with 50 PL operators.
Elastic Execution
In our first set of experiments, we examine the elasticity of TPC-H graphs. Figure 3.18 shows the results.
We observe that most of the queries are elastic, i.e., there is a tradeoff between time and money.
Three different examples are queries 2, 3, and 4. Query 2 is inelastic, meaning that the more machines we
add, the more both time and money increase. This is typical behavior for a distributed system when the data size
is small; query 2 does not use the two large tables lineitem and orders. Query 3 is a typical example of an elastic
dataflow: money can be traded for time. Finally, query 4 runs faster and cheaper with more machines.
Scalability Execution
In this set of experiments we show the scalability of our system. Figure 3.19 shows the execution time of
the TPC-H queries varying the size of the data. We observe that our system scales well and can handle up to
half a TB of data. Query 9 is a difficult query since it produces a large amount of intermediate results, and
we ran out of disk space for the 256 GB and 0.5 TB datasets.
Figure 3.20 shows the execution time of the TPC-H queries varying the number of virtual machines. We
used the 64 GB dataset in order to be able to run with 8 virtual machines. We observe that our system can
effectively exploit the available resources and run queries faster as the number of machines increases.
Compare with Hive
In our final set of experiments, we compared the schedules generated by our algorithms with Hive. Since
the goal of Hive is to produce plans that run as fast as possible, we chose to execute the fastest plan in the
skyline. Figure 3.21 shows the results. We observe that our system is able to run the queries faster.
Figure 3.18: TPC-H queries on ADP varying number of machines.
Figure 3.22 shows the same queries in the time-money 2D space. In order to find the monetary cost of
queries in Hive we used the total MapReduce CPU time spent, as reported by Hive. This measure does not
include the time spent for network IO, so it is a lower bound of the actual cost. We observe that we are able
to run not only faster but also cheaper than MapReduce.
Analyzing in depth the performance factors that make our system run faster is out of the scope of this
deliverable. Here we present a summary.
MapReduce graphs: We support complex dataflows that better fit the model of SQL enhanced with UDFs.
MapReduce dataflows are restricted to 3 phases (map, combine, reduce). An example is the group-by
operator. The best way to execute it is with a tree of partial group-by operators. It is relatively easy to express
a tree dataflow using our system. In MapReduce, however, we need several jobs to do that when the height
Figure 3.19: Execution time of TPC-H on ADP varying data size with 32 VMs.
Figure 3.20: Execution time of TPC-H on ADP with 64GB of data varying number of VMs.
of the tree is more than 2, introducing substantial accidental complexity. We also have to be careful about how
we assign keys to pairs in the intermediate jobs (all except the last job). Assigning the group-by column as key
will not work for trees with height > 2.
MapReduce initialization cost: A small factor influencing the performance of Hive is the initialization cost
of each MapReduce job. This becomes a problem especially when the number of jobs is large. Tenzing [9]
solved this problem by having a pool of Map and Reduce jobs already initialized and used on demand. The
initialization in ADP is minimal and is performed only once per job.
Synchronization of each MapReduce phase: Another factor that affects the performance is the synchronization
of each MR job. Delays are common in clusters due to network or disk latencies. Delays cause machines to
waste valuable time waiting for a small number of Map or Reduce jobs to finish. ADP creates a single job
per query, so the different phases of the execution are blended together and any possible delays do not affect
the whole query.
Partitioning exploitation: We exploit the partitioning of the tables. If the data is partitioned on the same
column as the join key, we perform a hash join without re-partitioning (the equivalent in Hive is a join in the
reduce phase). This reduces the data transferred through the network.
Hand-Optimized Scripts: In this work, we focus on scheduling the dataflow graphs, i.e., the final stage of
optimization in ADP. Finding the best sequence of joins to execute is considered as a given. The query
scripts are given as input to the system. For future work, this will serve as a baseline to evaluate different
rewriting techniques applied on the first level of optimization of ADP.
MapReduce
We converted the TPC-H queries into MapReduce dataflows using Hive's explain capability. With our language
abstractions we can easily express MapReduce dataflows. The following example shows how we express
Figure 3.21: Execution time of TPC-H queries on Hive and ADP with 64GB data and 32 VMs.
Figure 3.22: 2D space of TPC-H on Hive and ADP with 64GB data and 32 VMs.
distributed hash joins using the abstraction of MapReduce (join on the reduce side).
distributed create temp table kv_A partition on cA as
select cA, ...
from input_A;
distributed create temp table kv_B partition on cB as
select cB, ...
from input_B;
distributed create table output as
select reduce(value)
from kv_A, kv_B
where cA = cB;
Figure 3.23 shows the results of the experiment. We observe that the difference between Hive and our system
is not significant for MapReduce dataflows.
3.7.3 Conclusions of Experiments
We emphasize that this work focuses on the elasticity of dataflow graphs. The experiments with Hive
presented in this section serve as evidence that our system is comparable with state-of-the-art database
management systems. The results presented here are encouraging and we leave for future work the comparison
using larger datasets and more complex queries running on a cluster with more machines. The conclusions
of the experimental study are summarized as follows.
• A very significant elasticity exists in common MapReduce tasks.
• Using our model and the proposed algorithms, we are able to extract more elasticity than MapReduce.
Figure 3.23: Execution time of TPC-H expressed on MapReduce on Hive and ADP with 64GB data and 32 VMs.
• The proposed solution of dynamic programming refined by simulated annealing is a promising approach
to solve the problem of elastic dataflow processing on the Cloud.
• A good ranking of the operators has a great impact on the quality of the results produced by our
algorithms.
• Our algorithms are able to efficiently explore the 2D space and find tradeoffs between time and money.
3.8 Conclusions and Future Work
In this chapter we presented an efficient algorithm for the scheduling of dataflow graphs on the cloud based on
both completion time and monetary cost. The algorithm ranks the operators of the dataflow based on their
influence. We implemented the algorithm in our platform for data processing flows on the cloud. Through
several experiments we showed that our methods can efficiently explore the 2D space of time and money and
are able to find good trade-offs. Our future work moves in several directions.
Theoretical bounds: We want to provide theoretical bounds on the error of the skylines produced by our
algorithm. Works such as [39, 40] prove that the approximation of the skyline is a tractable problem.
Elasticity Prediction: The ranking of the operators could be used to predict the elasticity of a dataflow.
Preliminary results show that this is feasible. Dataflows with low elasticity tend to have some operators that
dominate the others, while dataflows with high elasticity do not have such heavy backbone operators.
This observation leads to an enhancement of the algorithm: if the dataflow is not elastic, we can run a much
simpler algorithm that minimizes completion time; since the dataflow is not elastic, the monetary cost is then
also minimized.
Belief Propagation: We also plan to use belief propagation techniques to rank the operators along with
new scheduling algorithms that exploit the rich information that belief propagation algorithms produce, such
as the probability distribution of the solutions.
In the next chapter we present the integration of ADP into the Optique platform and the progress
that has been made in the communication with other components.
Chapter 4
Integration of ADP With the Optique Platform
In this chapter we present the integration of the distributed query processing and optimization module in the
Optique platform during the first year of the project. We describe the communication with other modules
and the current state of the process.
4.1 Progress
First, we collaborated closely with other work packages (especially WP2 and WP6) to clearly define the role
of ADP in the general Optique architecture. The results of this work are described in more detail
in deliverable D2.1 and also in [1]. Figure 4.1 is borrowed from the aforementioned deliverable and shows
the architecture of the distributed query processing component, including the communication with the query
transformation module and with external data sources.
Then, we focused on the integration of ADP with the other components of Optique. More specifically, we
have been implementing the JDBC interface that other Optique Platform components can use. With respect
to the integration with the platform, we have created some instances of ADP running on virtual machines
on the Optique shared infrastructure. We have imported data from relational databases for both use cases
of the project and we have run queries issued by the Query Transformation module.
4.2 JDBC Interface
We have developed a JDBC interface to integrate with the other components that use ADP. We chose JDBC
because it is well defined and the integration is very simple (e.g., Quest uses JDBC already). The design is
shown in Figure 4.2. The front-end issues queries in the form of HTTP POST requests. The results are returned
to the client as a JSON stream. The ADP gateway is an HTTP server that listens for requests. An example of a
response to a request is shown below.
{schema: [[id, integer],[name, text]], time:5.2, error:[]}
[0, Germany]
[1, Greece]
[2, Italy]
...
The first line contains the metadata of the result. It is a JSON dictionary that includes the schema, the
execution time, and the possible errors of the execution. More fields may be added in the future. After the
metadata, each line contains one record. Each record is a JSON array with the values of the columns. This
allows the data to be sent as streams.
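For illustration, the sketch below reads such a stream in plain Java. The way the query is encoded in the HTTP POST body, as well as the URL used, are assumptions made only for this example and do not reflect the definitive gateway request format.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

class GatewayClientSketch {
    public static void main(String[] args) throws Exception {
        // Hypothetical gateway URL; the real deployment address and request
        // encoding are not specified here.
        URL gateway = new URL("http://localhost:9090/home/adp/databases/example");
        HttpURLConnection conn = (HttpURLConnection) gateway.openConnection();
        conn.setRequestMethod("POST");
        conn.setDoOutput(true);
        try (OutputStream out = conn.getOutputStream()) {
            out.write("select * from nation;".getBytes(StandardCharsets.UTF_8));
        }
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(conn.getInputStream(), StandardCharsets.UTF_8))) {
            String metadata = in.readLine();      // first line: schema, time, errors
            System.out.println("metadata: " + metadata);
            String record;
            while ((record = in.readLine()) != null) {
                System.out.println("record: " + record);  // one JSON array per line
            }
        }
    }
}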
Figure 4.1: General architecture of the ADP component within the Optique System
Figure 4.2: The JDBC front-end and the ADP gateway are connected using HTTP protocols transferring
JSON encoded streams.
Listing 4.1 presents some sample code for using ADP through the JDBC driver.
Listing 4.1: Using the ADP JDBC driver
Class.forName("madgik.adp.jdbc.AdpDriver");
String siemensDatabase = "http://10.254.11.18:9090/home/adp/databases/siemens";
String siemensQuery = "select * from ...;";
Connection conn = DriverManager.getConnection(
    "jdbc:adp:" + siemensDatabase,
    "adp",
    "siemens");
Statement stmt = conn.createStatement();
ResultSet rs = stmt.executeQuery(siemensQuery);
while (rs.next()) {
    ...
}
rs.close();
stmt.close();
The database is specified by the address of the ADP gateway and the path of the database. In our
example, the address is 10.254.11.18:9090 and the database path is /home/adp/databases/siemens. The rest
of the code is standard JDBC.
4.3 Queries Provided by the Use Cases
We have imported data from the NPD¹ and Siemens databases that are set up in Optique's
development infrastructure and we have successfully executed example queries provided by the partners. All
the SQL queries can be found in Appendix A.
Table 4.1 shows the execution times for the queries of the NPD dataset of the Statoil use case. We observe
that ADP has a constant overhead. This is due to optimization, scheduling, monitoring, reporting, etc., which
are essential for distributed processing. The data is not large enough to show the benefits of using a
distributed processing engine. However, we observe that even in this setting, for the first query of NPD, we
are able to significantly reduce the running time. We anticipate large benefits with larger amounts of data.
Table 4.1: NPD Queries
Query No   MySQL Running Time (seconds)   ADP Running Time (seconds)
01         104.637404                     8.313
02         1.420361                       8.291
03         0.091786                       8.340
04         0.026232                       8.248
05         0.600685                       8.331
06         0.002712                       8.462
07         0.001985                       8.421
08         0.010112                       8.474
09         0.042385                       8.428
10         0.040352                       8.377
12         0.40598                        8.328
13         0.156954                       8.491
14         0.088088                       8.478
15         4.309496                       12.236
16         0.008964                       8.482
17         0.00743                        8.436
18         0.002539                       8.430
We measured the overhead using very simple queries that do very little processing and found that it is
around 8 seconds. We are working on methods to reduce this pre-processing time, although, for large datasets,
this overhead is insignificant. Figure 4.3 shows the Statoil queries after removing the overhead.
Table 4.2 shows the execution times for the Siemens use case. Figure 4.4 shows the same data in a plot.
We observe that ADP is slower than PostgreSQL. This is because we use only one machine and are therefore
not able to exploit the parallelism of the queries and offset the overhead of the distributed engine. When we
integrate more tightly with the platform, and especially with the eCloudManager, we will be able to execute
the queries in a distributed fashion and accelerate the execution.
¹One of the data sources used in the Statoil use case.
Figure 4.3: The queries of the Statoil use case on MySQL and ADP.
Table 4.2: Siemens Queries
Query No   PostgreSQL Running Time (seconds)   ADP Running Time (seconds)
01         4.473                               28.905
02         1.599                               9.230
03         0.405                               8.001
04         0.446                               7.606
05         0.468                               8.082
06         0.486                               7.593
07         0.447                               7.354
08         0.325                               68.300
Figure 4.4: The queries of the Siemens use case on PostgreSQL and ADP.
Chapter 5
Conclusions
In this deliverable we presented the work that has been carried out in the context of work package WP7
during the first year of the Optique project. We started by describing the ADP system for distributed
query processing. We gave emphasis to the language abstractions of the system. We then proceeded to
the main research results regarding the optimization of dataflow execution on the cloud. We also reported
on the current status of the implementation and the communication with other components of the Optique
platform.
In the second year of the project we will turn our attention to the efficient execution of continuous and temporal OBDA queries, by taking into consideration the work done for tasks 5.2 and 5.3 of WP5. Furthermore,
we will have tighter integration with the query transformation module by considering possible feedback that
will be useful for the optimization process of query rewriting, such as execution statistics. In addition, we
will integrate tightly with the platform, and especially with the eCloudManager, in order to run ADP in
distributed mode. Also, during the first year of the project, it became evident that the execution of queries
over different data sources will be crucial for successfully tackling the project's use cases, especially
the Statoil use case. Work related to federation and query execution over distributed data sources should
therefore be brought forward, and during the second year we will address problems related to this aspect. We
plan to work closely with the query transformation module to achieve efficient execution of federated queries
by taking into account factors like the constraints and QoS of different components.
Bibliography
[1] Azza Abouzeid, Kamil Bajda-Pawlikowski, Daniel J. Abadi, Alexander Rasin, and Avi Silberschatz.
HadoopDB: An architectural hybrid of MapReduce and DBMS technologies for analytical workloads.
PVLDB, 2(1):922–933, 2009.
[2] Amazon. "amazon elastic map reduce, http://aws.amazon.com/elasticmapreduce/", 2011.
[3] Apache. "Mahout : Scalable machine-learning and data-mining library, http://mahout.apache.org/",
2010.
[4] Apache. Apache Hadoop, http://hadoop.apache.org/, 2011.
[5] Apache. TPC-H Benchmark, http://www.tpc.org/tpch/, 2012.
[6] S. Bharathi, A. Chervenak, E. Deelman, G. Mehta, Mei-Hui Su, and K. Vahi. Characterization of
scientific workflows. pages 1 –10, nov. 2008.
[7] Craig Chambers, Ashish Raniwala, Frances Perry, Stephen Adams, Robert R. Henry, Robert Bradshaw,
and Nathan Weizenbaum. FlumeJava: easy, efficient data-parallel pipelines. SIGPLAN Not., 45(6):363–
375, 2010.
[8] Biswapesh Chattopadhyay, Liang Lin, Weiran Liu, Sagar Mittal, Prathyusha Aragonda, Vera Lychagina,
Younghee Kwon, and Michael Wong. Tenzing a sql implementation on the mapreduce framework.
PVLDB, 4(12):1318–1327, 2011.
[9] Jeffrey Dean and Sanjay Ghemawat. "MapReduce: Simplified Data Processing on Large Clusters". In
6th Symposium on Operating System Design and Implementation, pages 137–150, 2004.
[10] E. Deelman and et. al. "Pegasus: Mapping Large Scale Workflows to Distributed Resources in Workflows
in e-Science". Springer, 2006.
[11] Ewa Deelman et al. Managing large-scale workflow execution from resource provisioning to provenance
tracking: The cybershake example. In IEEE e-Science, page 14, 2006.
[12] Ewa Deelman, Carl Kesselman, et al. "GriPhyN and LIGO, Building a Virtual Data Grid for
Gravitational Wave Scientists". In IEEE HPDC, pages 225–, 2002.
[13] Ewa Deelman, Gurmeet Singh, Miron Livny, G. Bruce Berriman, and John Good. The cost of doing
science on the cloud: the montage example. In IEEE/ACM SC, page 50, 2008.
[14] David J. DeWitt and Jim Gray. Parallel database systems: The future of high performance database
systems. Communications of the ACM, 35(6):85–98, 1992.
[15] Ronald Fagin, Ravi Kumar, and D. Sivakumar. Comparing top k lists. In SODA, pages 28–36, 2003.
[16] Daniela Florescu and Donald Kossmann. Rethinking cost and performance of database systems. SIGMOD Record, 38(1):43–48, 2009.
[17] Luis Miguel Vaquero Gonzalez, Luis Rodero Merino, Juan Caceres, and Maik Lindner. "A break in the
clouds: towards a cloud definition". Computer Communication Review, 39(1):50–55, 2009.
[18] Ronald L. Graham. "Bounds on Multiprocessing Timing Anomalies". SIAM Journal of Applied Mathematics, 17(2):416–429, 1969.
[19] Hadapt. "hadapt analytical platform, http://www.hadapt.com/", 2011.
[20] Herodotos Herodotou and Shivnath Babu. Profiling, what-if analysis, and cost-based optimization of
mapreduce programs. PVLDB, 4(11):1111–1122, 2011.
[21] Jin Huang, Bin Jiang, Jian Pei, Jian Chen, and Yong Tang. Skyline distance: a measure of multidimensional competence. Knowl. Inf. Syst., 34(2):373–396, 2013.
[22] Yannis E. Ioannidis. The history of histograms (abridged). In VLDB, pages 19–30, 2003.
[23] Michael Isard, Mihai Budiu, Yuan Yu, Andrew Birrell, and Dennis Fetterly. "Dryad: distributed data-parallel programs from sequential building blocks". In EuroSys, pages 59–72, 2007.
[24] Joseph C. Jacob et al. "Montage: a grid portal and software toolkit for science-grade astronomical
image mosaicking". Int. J. Comput. Sci. Eng., 4(2):73–87, 2009.
[25] Youngdae Kim, Gae won You, and Seung won Hwang. Escaping a dominance region at minimum cost.
In DEXA, pages 800–807, 2008.
[26] Herald Kllapi, Dimitris Bilidas, Ian Horrocks, Yannis Ioannidis, Ernesto Jiménez-Ruiz, Evgeny Kharlamov, Manolis Koubarakis, and Dmitiy Zheleznyakov. Distributed query processing on the cloud: the
optique point of view (short paper). In 10th OWL: Experiences and Directions Workshop (OWLED),
2013.
[27] Herald Kllapi, Eva Sitaridi, Manolis M. Tsangaris, and Yannis E. Ioannidis. Schedule optimization for
data processing flows on the cloud. In Proc. of SIGMOD, pages 289–300, 2011.
[28] Donald Kossmann. The state of the art in distributed query processing. ACM Computing Surveys,
32(4):422–469, 2000.
[29] Donald Kossmann. "The state of the art in distributed query processing". ACM Comput. Surv.,
32(4):422–469, 2000.
[30] Yu-Kwong Kwok and Ishfaq Ahmad. "Benchmarking and Comparison of the Task Graph Scheduling
Algorithms". J. Parallel Distrib. Comput., 59(3):381–422, 1999.
[31] Jiexing Li, Arnd Christian König, Vivek R. Narasayya, and Surajit Chaudhuri. Robust estimation of
resource consumption for sql queries using statistical techniques. PVLDB, 5(11):1555–1566, 2012.
[32] Harold Lim, Herodotos Herodotou, and Shivnath Babu. Stubby: A transformation-based optimizer for
mapreduce workflows. PVLDB, 5(11):1196–1207, 2012.
[33] Michael J. Litzkow, Miron Livny, and Matt W. Mutka. "Condor - A Hunter of Idle Workstations". In
ICDCS, pages 104–111, 1988.
[34] David T. Liu and Michael J. Franklin. "The Design of GridDB: A Data-Centric Overlay for the Scientific
Grid". In VLDB, pages 600–611, 2004.
[35] Sergey Melnik, Andrey Gubarev, Jing Jing Long, Geoffrey Romer, Shiva Shivakumar, Matt Tolton, and
Theo Vassilakis. Dremel: Interactive analysis of web-scale datasets. PVLDB, 3(1):330–339, 2010.
[36] Christopher Olston et al. "Pig latin: a not-so-foreign language for data processing". In SIGMOD
Conference, pages 1099–1110, 2008.
[37] M. Tamer Özsu and Patrick Valduriez. Principles of Distributed Database Systems. Prentice-Hall, 2
edition, 1999.
[38] Suraj Pandey, Linlin Wu, Siddeswara Mayura Guru, and Rajkumar Buyya. A particle swarm
optimization-based heuristic for scheduling workflow applications in cloud computing environments.
In IEEE AINA, pages 400–407, 2010.
[39] Christos H. Papadimitriou and Mihalis Yannakakis. On the approximability of trade-offs and optimal
access of web sources. In FOCS, pages 86–92, 2000.
[40] Christos H. Papadimitriou and Mihalis Yannakakis. "Multiobjective Query Optimization". In PODS,
2001.
[41] Rob Pike, Sean Dorward, Robert Griesemer, and Sean Quinlan. Interpreting the data: Parallel analysis
with Sawzall. Scientific Programming, 13(4):277–298, 2005.
[42] Garey M. R., Johnson D. S., and Sethi Ravi. "The Complexity of Flowshop and Jobshop Scheduling".
Mathematics of operations research, 1(2):117–129, 1976.
[43] Srinath Shankar and David J. DeWitt. "Data driven workflow planning in cluster management systems".
In HPDC, pages 127–136, 2007.
[44] João Nuno Silva, Luís Veiga, and Paulo Ferreira. Heuristic for resources allocation on utility computing
infrastructures. In Bruno Schulze and Geoffrey Fox, editors, MGC, page 9. ACM, 2008.
[45] Alkis Simitsis, Kevin Wilkinson, Malú Castellanos, and Umeshwar Dayal. Qox-driven etl design: reducing the cost of etl consulting engagements. In SIGMOD Conference, pages 953–960, 2009.
[46] Alkis Simitsis, Kevin Wilkinson, Malú Castellanos, and Umeshwar Dayal. Optimizing analytic data
flows for multiple execution engines. In SIGMOD Conference, pages 829–840, 2012.
[47] Michael Stonebraker, Paul M. Aoki, Witold Litwin, Avi Pfeffer, Adam Sah, Jeff Sidell, Carl Staelin, and
Andrew Yu. Mariposa: A wide-area distributed database system. VLDB J., 5(1):48–63, 1996.
[48] Ashish Thusoo et al. "Hive - a petabyte scale data warehouse using Hadoop". In ICDE, pages 996–1005,
2010.
[49] Mads Torgersen. Querying in C#: how language integrated query (LINQ) works. In OOPSLA Companion, pages 852–853, 2007.
[50] Manolis M. Tsangaris et al. Dataflow processing and optimization on grid and cloud infrastructures.
IEEE Data Eng. Bull., 32(1):67–74, 2009.
[51] Xiaoli Wang, Yuping Wang, and Hai Zhu. Energy-efficient task scheduling model based on mapreduce
for cloud computing using genetic algorithm. JCP, 7(12):2962–2970, 2012.
[52] Fatos Xhafa and Ajith Abraham, editors. Metaheuristics for Scheduling in Distributed Computing
Environments. Studies in Computational Intelligence. Springer, 2008.
[53] Yuan Yu, Michael Isard, Dennis Fetterly, Mihai Budiu, Úlfar Erlingsson, Pradeep Kumar Gunda, and
Jon Currey. DryadLINQ: A system for general-purpose distributed data-parallel computing using a
high-level language. In OSDI, pages 1–14, 2008.
Glossary
ADP     Athena Distributed Processing
API     Application Programming Interface
ART     ADP Run Time
CFM     Connection and Function Manager
DBMS    Data Base Management System
DP      Dynamic Programming
ETL     Extract Transform Load
FIFO    First In First Out
HTTP    Hypertext Transfer Protocol
JDBC    Java Database Connectivity
JSON    JavaScript Object Notation
IaaS    Infrastructure as a Service
NL      Nested Loops Optimizer
NPD     Norwegian Petroleum Directorate
OBDA    Ontology-based Data Access
PaaS    Platform as a Service
PL      Pipeline
PW      Parallel Wave
SaaS    Software as a Service
QoS     Quality of Service
RDBMS   Relational Data Base Management System
RMI     Remote Method Invocation
SA      Simulated Annealing
SnF     Store and Forward
SQL     Structured Query Language
UDF     User Defined Function
VM      Virtual Machine
W3C     World Wide Web Consortium
WP      Work Package
Appendix A
Test Queries from the Use Cases
This appendix contains the test queries that have been provided by the use case partners and have been
executed on the ADP instances on the fluidOps infrastructure.
A.1 NPD Queries
Query No 1
SELECT DISTINCT a.prlName, a.cmpLongName, a.prlLicenseeInterest, a.prlLicenseeDateValidTo
FROM licence_licensee_hst a
WHERE a.prlLicenseeDateValidTo IN
  (SELECT MAX(b.prlLicenseeDateValidTo)
   FROM licence_licensee_hst b
   WHERE a.prlName = b.prlName
   GROUP BY b.prlName)
ORDER BY a.prlName

Query No 2
SELECT a.prlName, a.cmpLongName, a.prlOperDateValidFrom
FROM licence_oper_hst a
WHERE a.prlOperDateValidFrom IN
  (SELECT MAX(b.prlOperDateValidFrom)
   FROM licence_oper_hst b
   WHERE a.prlName = b.prlName)
ORDER BY a.prlName

Query No 3
SELECT prlName, prlDateGranted, prlDateValidTo
FROM licence
ORDER BY prlName

Query No 4
SELECT wlbProductionLicence, MIN(wlbEntryDate)
FROM wellbore_exploration_all
GROUP BY wlbProductionLicence
ORDER BY wlbProductionLicence

Query No 5
SELECT prlName, cmpLongName, prlLicenseeDateValidFrom AS fromDate, prlLicenseeDateValidTo AS toDate
FROM licence_licensee_hst
ORDER BY prlName, fromDate
Query No 6
SELECT fldName, SUBSTRING_INDEX(wlbName, '-', 1)
FROM field
ORDER BY fldName

Query No 7
SELECT fldName, fldRemainingOE, fldRemainingOil, fldRemainingGas, fldRemainingCondensate
FROM field_reserves
ORDER BY fldRemainingOE DESC

Query No 8
SELECT fclBelongsToName AS field, fclName AS facility, fclKind AS type
FROM facility_fixed
WHERE fclBelongsToKind = 'FIELD'
ORDER BY field

Query No 9
SELECT wlbProductionLicence, COUNT(DISTINCT wlbWell)
FROM
(
  SELECT wlbProductionLicence, wlbWell
  FROM wellbore_development_all
  UNION
  SELECT wlbProductionLicence, wlbWell
  FROM wellbore_exploration_all
  UNION
  SELECT wlbProductionLicence, wlbWell
  FROM wellbore_shallow_all
) AS t
GROUP BY wlbProductionLicence

Query No 10
SELECT DISTINCT *
FROM
( (SELECT wlbName, (wlbTotalCoreLength * 0.3048) AS lenghtM
   FROM wellbore_core
   WHERE wlbCoreIntervalUom = '[ft ]'
   AND (wlbTotalCoreLength * 0.3048) > 30
  )
  UNION
  (SELECT wlbName, wlbTotalCoreLength AS lenghtM
   FROM wellbore_core
   WHERE wlbCoreIntervalUom = '[m ]'
   AND wlbTotalCoreLength > 30
  )
) AS t

Query No 11
SELECT DISTINCT strat.lsuName, cores.wlbName AS wlbName
FROM wellbore_core AS cores, strat_litho_wellbore AS strat
WHERE cores.wlbNpdidWellbore = strat.wlbNpdidWellbore
AND ( (cores.wlbCoreIntervalUom = '[ft ]'
       AND (GREATEST(cores.wlbCoreIntervalTop * 0.3048, strat.lsuTopDepth)
            < LEAST(cores.wlbCoreIntervalBottom * 0.3048, strat.lsuBottomDepth))
      )
      OR
      (cores.wlbCoreIntervalUom = '[m ]'
       AND (GREATEST(cores.wlbCoreIntervalTop, strat.lsuTopDepth)
            < LEAST(cores.wlbCoreIntervalBottom, strat.lsuBottomDepth))
      )
    )
ORDER BY strat.lsuName, wlbName
Query No 12
SELECT DISTINCT cores.wlbName, cores.lenghtM, wellbore.wlbDrillingOperator, wellbore.wlbCompletionYear
FROM
( (SELECT wlbName, wlbNpdidWellbore, (wlbTotalCoreLength * 0.3048) AS lenghtM
   FROM wellbore_core
   WHERE wlbCoreIntervalUom = '[ft ]'
  )
  UNION
  (SELECT wlbName, wlbNpdidWellbore, wlbTotalCoreLength AS lenghtM
   FROM wellbore_core
   WHERE wlbCoreIntervalUom = '[m ]'
  )
) AS cores,
( (SELECT wlbNpdidWellbore, wlbDrillingOperator, wlbCompletionYear
   FROM wellbore_development_all
  )
  UNION
  (SELECT wlbNpdidWellbore, wlbDrillingOperator, wlbCompletionYear
   FROM wellbore_exploration_all
  )
  UNION
  (SELECT wlbNpdidWellbore, wlbDrillingOperator, wlbCompletionYear
   FROM wellbore_shallow_all
  )
) AS wellbore
WHERE wellbore.wlbNpdidWellbore = cores.wlbNpdidWellbore
AND wellbore.wlbDrillingOperator LIKE '%STATOIL%'
AND wlbCompletionYear >= 2008
AND lenghtM > 50
ORDER BY cores.wlbName
Query No 13
SELECT DISTINCT cores.wlbName, cores.lenghtM, wellbore.wlbDrillingOperator, wellbore.wlbCompletionYear
FROM
( (SELECT wlbName, wlbNpdidWellbore, (wlbTotalCoreLength * 0.3048) AS lenghtM
   FROM wellbore_core
   WHERE wlbCoreIntervalUom = '[ft ]'
  )
  UNION
  (SELECT wlbName, wlbNpdidWellbore, wlbTotalCoreLength AS lenghtM
   FROM wellbore_core
   WHERE wlbCoreIntervalUom = '[m ]'
  )
) AS cores,
( (SELECT wlbNpdidWellbore, wlbDrillingOperator, wlbCompletionYear
   FROM wellbore_development_all
  )
  UNION
  (SELECT wlbNpdidWellbore, wlbDrillingOperator, wlbCompletionYear
   FROM wellbore_exploration_all
  )
  UNION
  (SELECT wlbNpdidWellbore, wlbDrillingOperator, wlbCompletionYear
   FROM wellbore_shallow_all
  )
) AS wellbore
WHERE wellbore.wlbNpdidWellbore = cores.wlbNpdidWellbore
AND wellbore.wlbDrillingOperator LIKE '%STATOIL%'
AND wlbCompletionYear < 2008
AND lenghtM < 10
ORDER BY cores.wlbName
Query No 14
SELECT field.fldName,
  prfPrdOeNetMillSm3,
  prfPrdOilNetMillSm3,
  prfPrdGasNetBillSm3,
  prfPrdNGLNetMillSm3,
  prfPrdCondensateNetMillSm3
FROM field_production_monthly a,
  field
WHERE prfNpdidInformationCarrier = fldNpdidField
AND prfPrdOeNetMillSm3 IN
  (SELECT MAX(prfPrdOeNetMillSm3)
   FROM field_production_monthly b
   WHERE a.prfNpdidInformationCarrier = b.prfNpdidInformationCarrier
   GROUP BY b.prfInformationCarrier)
ORDER BY prfPrdOeNetMillSm3 DESC

Query No 15
SELECT field.fldName,
  AVG(prfPrdOeNetMillSm3),
  AVG(prfPrdOilNetMillSm3),
  AVG(prfPrdGasNetBillSm3),
  AVG(prfPrdNGLNetMillSm3),
  AVG(prfPrdCondensateNetMillSm3)
FROM field_production_yearly,
  field
WHERE prfNpdidInformationCarrier = fldNpdidField
AND prfYear < 2013 -- exclude current, and incomplete, year
GROUP BY prfInformationCarrier
ORDER BY AVG(prfPrdOeNetMillSm3) DESC

Query No 16
SELECT field.fldName,
  SUM(prfPrdOeNetMillSm3),
  SUM(prfPrdOilNetMillSm3),
  SUM(prfPrdGasNetBillSm3),
  SUM(prfPrdNGLNetMillSm3),
  SUM(prfPrdCondensateNetMillSm3)
FROM field_production_yearly,
  field
WHERE prfNpdidInformationCarrier = fldNpdidField
-- AND prfYear < 2013 -- exclude current, and incomplete, year
GROUP BY prfInformationCarrier
ORDER BY SUM(prfPrdOeNetMillSm3) DESC
Query No 17
SELECT prfInformationCarrier,
  SUM(prfPrdOilNetMillSm3) AS sumOil,
  SUM(prfPrdGasNetBillSm3) AS sumGas
FROM field_production_monthly AS prod,
  field
WHERE prod.prfNpdidInformationCarrier = field.fldNpdidField
AND prod.prfYear = 2010
AND prod.prfMonth >= 1
AND prod.prfMonth <= 6
AND field.cmpLongName = 'Statoil Petroleum AS'
GROUP BY prfInformationCarrier
ORDER BY prfInformationCarrier
A.2 Siemens Queries
Query No 1
SELECT value FROM measurement WHERE sensor=57 and "Timestamp" > '2009-01-01 00:00:00' and "Timestamp" < '2009-12-31 23:00:00'

Query No 2
SELECT "eventtext", COUNT("eventtext") AS c FROM message WHERE "Timestamp" >= '2009-01-01 00:00:00' AND "Timestamp" <= '2009-01-03 23:00:00' GROUP BY "eventtext" ORDER BY c DESC

Query No 3
SELECT "eventtext", COUNT("eventtext") AS c FROM message WHERE "Timestamp" >= '2009-01-01 00:00:00' AND "Timestamp" <= '2009-01-03 23:00:00' GROUP BY "eventtext" ORDER BY c DESC LIMIT 10

Query No 4
SELECT "assembly", COUNT("message") AS eventFrequency FROM message WHERE "category"=6 AND "Timestamp" >= '2005-01-01 00:00:00' AND "Timestamp" <= '2005-12-31 23:00:00' GROUP BY "assembly" ORDER BY eventFrequency DESC LIMIT 10

Query No 5
SELECT "category", COUNT("category") AS categoryFrequency FROM message WHERE "Timestamp" > '2005-03-25 00:00:00' AND "Timestamp" < '2006-04-04 00:00:00' GROUP BY "category" ORDER BY categoryFrequency DESC LIMIT 5

Query No 6
SELECT "Timestamp" FROM message WHERE "eventtext"='Controller fault' AND "Timestamp" > '2007-11-15 00:00:00' AND "Timestamp" < '2007-12-31 23:59:59' ORDER BY "Timestamp" ASC LIMIT 1

Query No 7
SELECT eventtext, COUNT(eventtext) AS freq FROM message WHERE "assembly"=6 AND "Timestamp" >= '2005-01-01 00:00:00' AND "Timestamp" <= '2005-01-01 01:44:47' GROUP BY eventtext ORDER BY freq DESC LIMIT 10

Query No 8
SELECT COUNT(sensor) FROM measurement, message WHERE measurement.sensor=35 AND measurement."Timestamp"=message."Timestamp" AND measurement."Timestamp" >= '2005-01-01 00:00:00' AND measurement."Timestamp" <= '2005-01-10 00:00:00' AND measurement.value > 745