Project No: FP7-318338
Project Acronym: Optique
Project Title: Scalable End-user Access to Big Data
Instrument: Integrated Project
Scheme: Information & Communication Technologies

Deliverable D7.1
Techniques for Distributed Query Planning and Execution: One-Time Queries

Due date of deliverable: (T0+12)
Actual submission date: October 31, 2013
Start date of the project: 1st November 2012
Duration: 48 months
Lead contractor for this deliverable: UoA
Dissemination level: PU – Public
Final version

Executive Summary: Techniques for Distributed Query Planning and Execution: One-Time Queries

This document summarises deliverable D7.1 of project FP7-318338 (Optique), an Integrated Project supported by the 7th Framework Programme of the EC. Full information on this project, including the contents of this deliverable, is available online at http://www.optique-project.eu/.

We first introduce the concept of elasticity and the main contributions of this work. We continue by presenting the ADP system, on which we build the distributed query execution module of Optique. After that, we present the main research results related to the elastic execution of dataflow graphs in a cloud environment. We explore the trade-offs between query completion time and the monetary cost of resource usage, and we provide an efficient methodology to find good solutions. Next, we present the current status of the integration with the Optique platform and the communication with the other components of the system. Finally, we conclude by presenting the work that will be carried out during the second year of the project.

List of Authors
Herald Kllapi (UoA)
Dimitris Bilidas (UoA)
Yannis Ioannidis (UoA)
Manolis Koubarakis (UoA)

Contents

1 Introduction
2 The ADP System
  2.1 Related Work
  2.2 Overview
  2.3 Language Abstractions
    2.3.1 Dataflow Language
    2.3.2 Dataflow Language Semantics
  2.4 Extensions
  2.5 SQL Processing Engine
  2.6 Conclusions
3 Elastic Dataflow Processing in the Cloud
  3.1 Related Work
  3.2 Preliminaries
    3.2.1 Dataflow
    3.2.2 Schedule Estimation
    3.2.3 Skyline Distance Set
  3.3 Dataflow Language
  3.4 Operator Ranking
  3.5 Scheduling Algorithm
    3.5.1 Dynamic Programming
    3.5.2 Simulated Annealing
    3.5.3 Parallel Wave
    3.5.4 Exhaustive
  3.6 Complexity Analysis
  3.7 Experimental Evaluation
    3.7.1 Model and Algorithms
    3.7.2 Systems
    3.7.3 Conclusions of Experiments
  3.8 Conclusions and Future Work
4 Integration of ADP With the Optique Platform
  4.1 Progress
  4.2 JDBC Interface
  4.3 Queries Provided by the Use Cases
5 Conclusions
Bibliography
Glossary
A Test Queries from the Use Cases
  A.1 NPD Queries
  A.2 Siemens Queries

Chapter 1 Introduction

Work package WP7 of Optique deals with the distributed query execution module and is divided into the following tasks:

• Task 7.1: Query planning and execution techniques for one-time OBDA queries
• Task 7.2: Query planning and execution techniques for continuous and temporal OBDA queries
• Task 7.3: Optimization techniques
• Task 7.4: Implementation, testing and evaluation

This deliverable describes the work done on the first task, which corresponds to the effort put into WP7 during the first year of the project. In the context of an Ontology-Based Data Access system, the efficient execution of the queries produced by the query transformation components relies heavily on the proper and effective use of the resources of a cloud environment. Our main attention has therefore been on this concept, the elastic execution of queries through dynamic resource management, in order to achieve the objectives of this task.

In Chapter 2, we present ADP, a system for distributed query processing in cloud environments. ADP has been developed at the University of Athens during several European funded research projects. It is designed for the efficient execution of data-intensive flows on the cloud. These dataflows are expressed as relational queries with user defined functions (UDFs). We present the architecture of the system with its main components and the query languages that it supports.

In Chapter 3, we present the research contributions regarding the elastic execution of dataflows. Finding trade-offs between completion time and monetary cost is essential in a cloud environment, and cloud-enabled data processing platforms should offer the ability to select the best trade-off.
The obvious questions are: (1) does such a trade-off exist, and (2) can it be obtained at an overhead that makes it worthwhile? Our first contribution in this chapter is to demonstrate that very significant elasticity exists in a number of common tasks, even when the cloud computation is modeled at a very high level of abstraction, such as MapReduce. Moreover, we show that elasticity can be discovered in practice using highly scalable and efficient algorithms, and that there appear to be certain simple "rules of thumb" for when elasticity is present. It is natural to expect that more refined models of the cloud computation would allow further optimizations and extraction of elasticity. At the same time, it is reasonable to be concerned whether the resulting complexity of the refined model would still allow these optimizations to be performed. Our second contribution is to demonstrate that there exists a very fertile middle ground in terms of abstraction which enables the extraction of much more elasticity than what is possible under MapReduce, while remaining computationally tractable. The content of this chapter has been submitted for publication and is currently under review.

In Chapter 4, we describe the current status of the integration of ADP into the Optique platform. We present several examples of using the JDBC driver and executing the queries of the Optique use cases. Finally, in Chapter 5, we discuss the work plan of WP7 for the second year of the project and conclude.

Chapter 2 The ADP System

ADP is a system designed for the efficient execution of complex dataflows on the Cloud [50]. Requests take the form of queries in SQL with user defined functions (UDFs). The SQL query is transformed into two intermediate levels before execution. The query of the first level is again SQL, but has additional notations about the distribution of the tables. We enhance SQL by adding the table partition as a first class citizen of the language. A table partition is defined as a set of tuples having a particular property that we can exploit in query optimization. An example is that the value of a hash function applied to one column is the same for all the tuples in the same partition. This property is used for distributed hash joins.

ADP acts as middleware placed between the infrastructure and the other components of the system, simplifying their "view" of the underlying infrastructure. One can benefit from ADP by detaching part of the application logic and writing it using the ADP abstractions. This allows it to scale transparently as needed. The inception of ADP dates back to 2004 in the European project Diligent (http://diligent.ercim.eu/), and in particular, in the distributed query engine deployed for that project. The query engine of Diligent was subsequently used and extended in the European project Health-e-Child (http://www.health-e-child.org). Finally, after Health-e-Child ended, ADP became an internal project of the MaDgIK group (http://www.madgik.di.uoa.gr/).

2.1 Related Work

The most popular platforms for data processing flows on the cloud are based on MapReduce [10], presented by Google. On top of MapReduce, Google has built systems like FlumeJava [8], Sawzall [41], and Tenzing [9]. FlumeJava is a library used to write data pipelines that are transformed into MapReduce jobs. Sawzall is a scripting language that can express simple data processing over huge datasets. Tenzing [9] is an analytical query engine that uses a pool of pre-allocated machines to minimize latency. The main open source implementation of MapReduce is Hadoop [5] by Yahoo!.
Hive [48] is a data warehouse solution from Facebook. HiveQL (the query language of Hive) is a subset of SQL, and the optimization techniques are limited to simple transformation rules. The optimization goal is to minimize the number of MapReduce jobs while at the same time maximizing the parallelism, and as a consequence, minimizing the execution time of the query. HadoopDB [2] is a recent hybrid system that combines MapReduce with databases. It uses multiple single node databases and relies on Hadoop to schedule the jobs to each database. The optimization goal is to create as much parallelism as possible by assigning sub-queries to the single node databases. The U.S. startup Hadapt [20] is currently commercializing HadoopDB. Finally, Amazon offers Hadoop Elastic MapReduce as a service to its customers [3]. Several high-level query languages and applications have been developed on top of Hadoop, such as PigLatin [36] and Mahout [4], a platform for large-scale machine learning. The dataflow graphs used in MapReduce are relatively restricted, and this reduces the opportunities for optimization. All of the above systems have as their optimization goal to execute queries as fast as possible.

The Condor/DAGMan/Stork [33] suite is the state-of-the-art technology in High Performance Computing. However, Condor was designed to harvest CPU cycles on idle machines, and running data intensive workflows with DAGMan is very inefficient [43]. Many systems use DAGMan as middleware, like Pegasus [11] and GridDB [34]. Proposals for extensions of Condor to deal with data intensive scientific workflows do exist [43], but to the best of our knowledge, they have not been materialized yet. A case study of executing the Montage dataflow on the cloud is presented in [14], examining the trade-offs of different dataflow execution modes and provisioning plans for cloud resources.

Dryad [24] is a commercial middleware by Microsoft that has a more general architecture than MapReduce, since it can parallelize any dataflow. Its schedule optimization, however, relies heavily on hints requiring knowledge of node proximity, which are generally not available in a cloud environment. It also deals with job migration by instantiating another copy of a job and not by moving the job to another machine. This might be acceptable when optimizing solely for time, but not when the financial cost of allocating additional containers matters. DryadLINQ [53] is built on top of Dryad and uses LINQ [49], a set of .NET constructs for manipulating data. LINQ queries are transformed into Dryad graphs and executed in a distributed fashion. Stubby [21, 32] is a cost-based optimizer for MapReduce dataflows. Our work has some similarities in the modeling, but we target a broader range of dataflow graphs.

Mariposa [47] was one of the first distributed database systems that took into consideration the monetary cost of answering queries. The user provides a budget function and the system optimizes the cost of accessing the individual databases using auctioning. Dremel [35] is a system for real time analysis of very large datasets. That system is designed for a subclass of SQL queries that return relatively small results. The optimization techniques in this work target a broader class of SQL queries.

2.2 Overview

Figure 2.1 shows the current architecture of ADP.
Figure 2.1: The system architecture.

The queries are optimized and transformed into execution plans that are executed in ART, the ADP Run Time. The resources needed to execute the queries (machines, network, etc.) are reserved or allocated by ARM, the ADP Resource Mediator. Those resources are wrapped into containers. Containers are used to abstract from the details of a physical machine in a cluster or a virtual machine in a cloud. The information about the operators and the state of the system is stored in the Registry. ADP uses state of the art technology and well proven solutions inspired by years of research in parallel and distributed databases (e.g., parallelism, partitioning, and various optimizations).

The core query evaluation engine of ADP is built on top of SQLite (this part of ADP is available as an open source library at https://code.google.com/p/madis). The system allows rapid development of specialized data analysis tasks that are directly integrated into the system. Queries are expressed in SQL extended with UDFs. Although UDFs have been supported by database management systems (DBMSs) for a long time, their use is limited due to their complexity and the limitations imposed by the DBMSs. One of the goals of the system is to eliminate the effort of creating and using UDFs by making them first class citizens in the query language itself. SQLite natively supports UDFs implemented in C. The UDFs are categorized into row, aggregate, and virtual table functions.

2.3 Language Abstractions

The query language of ADP is based on SQL enhanced with two extensions. The first one is the automatic creation of virtual tables, which permits their direct use within a query without explicitly creating them beforehand. The second extension is an inverted syntax which uses UDFs as statements. We present some examples to illustrate. The following query reads the data from the file 'in.tsv':

create virtual table input_file using file('in.tsv');
select * from input_file;

The query can be written as:

select * from file('in.tsv');

and the system will automatically translate it into the above syntax. Using the inverted syntax, the query can be further simplified as:

file 'in.tsv';

The inverted syntax provides a natural way to compose virtual table functions. In the following example, the query that uses the file operator is provided as a parameter to countrows:

select * from countrows("select * from file('in.tsv')");

The above syntax is very error prone because it uses nested quote levels. By using inversion, the query is written as:

countrows file 'in.tsv';

The ordering is from left to right, i.e., x y z is translated to x(y(z)). Notice that this syntax is very close to the natural language sentence "count the rows of file 'in.tsv'".

2.3.1 Dataflow Language

This language is internal to the system, but it can also be used to bypass the optimizer and enforce a specific dataflow. The best way to describe the syntax and semantics of the language is by presenting some examples. We use a subset of the TPC-H schema described below:

1. orders(o_orderkey, o_orderstatus, ...)
2. lineitem(l_orderkey, l_partkey, l_quantity, ...)
3. part(p_partkey, p_name, ...)

Assume that the tables are horizontally partitioned as follows:

1. orders to 2 parts on hash(o_orderkey)
2. lineitem to 3 parts on hash(l_orderkey)
3. part to 2 parts on hash(p_partkey)

with hash(*) being a known hash function with good properties (e.g., MD5). The queries of the dataflow language have two semantically different parts: distributed and local. The distributed part defines how the input table partitions are combined and how the output will be partitioned. The local part defines the SQL query that will be evaluated against all the combinations of the input (possibly in parallel). Consider the following example:

distributed create table lineitem_large as
select * from lineitem where l_quantity > 20;

Here, the system will run the SQL query on every partition of table lineitem. Consequently, table lineitem_large will be created and partitioned into the same number of partitions as lineitem.

The output table can be partitioned based on specific columns. All the records with the same values on the specified columns will be on the same partition. For example, consider the following query:

distributed create table lineitem_large on l_orderkey as
select * from lineitem where l_quantity > 20;

The output of each query is partitioned on column l_orderkey. Notice that all records with the same l_orderkey value must be unioned in order to produce the partitions of table lineitem_large, which is created after the queries are executed. The user can specify the number of partitions (e.g., 10) by writing the query as follows:

distributed create table lineitem_large to 10 on l_orderkey as
select * from lineitem where l_quantity > 20;

For the time being, if the degree of parallelism is not specified, it is set to a predefined value. An interesting optimization problem is to find the optimal parallelism to use, and we are currently working on this problem. We want to stress that even when this feature becomes available, the current functionality will remain available as an option because it is extremely useful in practice.

If more than one input table is used, the table partitions are combined and the query is executed on each combination. The combination is either direct or a cartesian product, with the latter being the default behavior. An example is the following query:

distributed create table lineitem_part as
select * from lineitem, part where l_partkey = p_partkey;

The system evaluates the query by combining all the partitions of lineitem with all the partitions of part. As a result, table lineitem_part will have 6 partitions (3 x 2). If tables lineitem and part have the same number of partitions, the combination can be a direct product. This is shown in the following query:

distributed create table lineitem_part as direct
select * from lineitem, part where l_partkey = p_partkey;

Notice that, in order for the query to be a correct join, the tables lineitem and part must be partitioned on columns l_partkey and p_partkey respectively. The local part of the query can be as complex as needed, using the full expressivity of SQL enhanced with UDFs.

Queries can be combined in order to express complex dataflows. For example, a distributed hash join can be expressed as follows:

distributed create temporary table lineitem_p on l_partkey as select * from lineitem;
distributed create temporary table part_p on p_partkey as select * from part;
distributed create table lineitem_part as direct
select * from lineitem_p, part_p where l_partkey = p_partkey;

Tables lineitem_p and part_p are temporary and are destroyed after execution.
Notice that in this example, the system must choose the same parallelism for tables lineitem_p and part_p in order for them to be combined as a direct product.

A MapReduce flow can be expressed as follows:

distributed create temporary table map on key as
select keyFunc(c1, c2, ...) as key, valueFunc(c1, c2, ...) as value from input;
distributed create table reduce as
select reduceFunc(value) from map group by key;

with keyFunc(*) being a row function that returns the key of the row and valueFunc(*) being a row function that produces the value. In the second query, reduceFunc(*) is an aggregate function that is applied to each group.

The system also supports indexes. The index is not a global index; instead, it is created on each partition of the table on the specified columns. An example is shown below:

distributed create index l_index on lineitem(l_partkey);

Another useful feature of the language is the ability to specify the partitions against which the query will be evaluated. For example, if only the first partition of table lineitem has all the records with l_quantity greater than 20, we can write the following query:

distributed on [0] create table lineitem_large as
select * from lineitem where l_quantity > 20;

2.3.2 Dataflow Language Semantics

A dTable consists of its schema, the list of its partitions, and the partition columns, i.e., dTable(schema, parts, pcols) with |parts| ≥ 1 and |pcols| ≥ 0. A partition of a dTable is a relational table with the same schema. The partitions of the tables are horizontal. A partitioning is a transformation of a dTable into a new dTable with the specified number of partitions on the specified columns, i.e., output = p(input, parts, columns), with parts ≥ 1 and |columns| ≥ 1. A dQuery is a transformation of its input dTables to a single output dTable, i.e., output = query(SQL+UDFs, inputs, combine) with |inputs| ≥ 1 and combine ∈ {cartesian product, direct product}.

The following hold for the cartesian product:
• |output.parts| = ∏_{i=0}^{|inputs|−1} |inputs[i].parts|
• output.parts[i] = query(input[j].parts[k]), ∀i, j, k

The following hold for the direct product:
• |inputs[i].parts| = S, ∀i, S > 0
• |output.parts| = S
• output.parts[i] = query(input[j].parts[i]), ∀i, j

A p(dQuery) is a dQuery with partitioning, i.e., output = p(dQuery(def, inputs, combine), parts, columns). A dQueryScript is a set of p(dQuery) connected with tables. By definition, the output of the last p(dQuery) is the result of the script.

2.4 Extensions

The system makes extensive use of the UDF extension APIs of SQLite. SQLite supports the following UDF categories. Row functions take as input one or more columns from a row and produce one value; an example is the UPPER() function. Aggregate functions can be used to capture arbitrary aggregation functionality beyond the one predefined in SQL (i.e., SUM(), AVG(), etc.). Virtual table functions (also known as table functions in PostgreSQL and Oracle) are used to create virtual tables that can be used in a similar way to regular tables. The API offers both serial access via a cursor and random access via an index. The SQLite engine is not aware of the table size, allowing the input and output to be arbitrarily large. All UDFs are implemented in Python. Neither Python nor SQLite is strictly typed. This enables the implementation of UDFs that have dynamic schemas based on their input data.
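To make the UDF categories concrete, the following is a minimal, illustrative Python sketch of a row function and an aggregate function registered with SQLite. It uses the standard sqlite3 module rather than the APSW wrapper and the madis library that ADP actually uses, and the function names (reverse_text, geo_mean) are ours; virtual table functions require the more advanced APIs and are omitted here.

# Illustration only: a row UDF and an aggregate UDF registered with SQLite.
# ADP registers its UDFs through APSW/madis; the standard sqlite3 module
# shown here exposes the same basic extension mechanism.
import math
import sqlite3

def reverse_text(value):
    # Row function: returns the reversed string of a single column value.
    return None if value is None else str(value)[::-1]

class GeometricMean:
    # Aggregate function: geometric mean of a positive numeric column.
    def __init__(self):
        self.log_sum = 0.0
        self.count = 0

    def step(self, value):
        if value is not None and value > 0:
            self.log_sum += math.log(value)
            self.count += 1

    def finalize(self):
        return math.exp(self.log_sum / self.count) if self.count else None

conn = sqlite3.connect(":memory:")
conn.execute("create table t(name text, price real)")
conn.executemany("insert into t values (?, ?)", [("abc", 2.0), ("def", 8.0)])

# Register the UDFs so they can be used directly inside SQL.
conn.create_function("reverse_text", 1, reverse_text)
conn.create_aggregate("geo_mean", 1, GeometricMean)

print(conn.execute("select reverse_text(name) from t").fetchall())
# [('cba',), ('fed',)]
print(conn.execute("select geo_mean(price) from t").fetchone())
# geometric mean of 2.0 and 8.0, i.e. approximately 4.0

Because Python is dynamically typed, the same mechanism allows UDFs whose output schema depends on the input data, which is the property exploited by ADP.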
2.5 SQL Processing Engine

The first layer is APSW, a wrapper of SQLite that makes it possible to control the DB engine from Python. APSW also makes it possible to implement UDFs in Python, enabling SQLite to use them in the same way as its native UDFs. Both Python and SQLite are executed in the same process, greatly reducing the communication cost between them. The Connection and Function Manager (CFM) is the external interface. It receives SQL queries, transforms them into SQL92, and passes them to SQLite for execution. This component also supports query execution tracing and monitoring. Finally, it automatically finds and loads all the available UDFs.

The query is parsed, optimized, and transformed into the intermediate dataflow language described earlier. The Parser and Optimizer of the system use information stored in the Catalog, which contains all the information about the tables. The dataflow language is optimized and transformed into a dataflow graph. Each node of the graph is an SQL query and each link is either an input or an output table. The graph produced contains two types of operators: SQLExec and unionReplicator. Operator SQLExec takes as a parameter an SQL query and executes it, reading the partitions from its input and producing the output partitions. Operator unionReplicator performs a union all on the partitions of its input (all the partitions must be from the same dTable) and replicates the result to all of its outputs. The dataflow graph is given to ADP for execution. The system schedules the dataflow graph and monitors its progress. When the execution completes successfully, the table produced is added to the Catalog of the system. ADP finds the best schedule for the dataflow graph and executes it on the containers reserved on the cloud.

A particular database is located in a particular catalog on the file system of each machine. Each partition is in a different file. A table is attached to the database without needing to import the data, thus eliminating the startup and shutdown time of the database engine. The limit on the number of attached databases in SQLite is 62. For operators that have more than 62 input partitions, we create a tree of unions. The total number of iterations is bounded by ⌈log_62 |inputs|⌉ − 1.

2.6 Conclusions

In this chapter we described the architecture and the components of ADP. We also presented example queries that illustrate the use of UDFs in the language, and we gave a detailed characterization of the dataflow language. In the next chapter we present results regarding the optimization of dataflow processing, which includes the task of optimizing the execution of queries expressed in this language in a system like ADP.

Chapter 3 Elastic Dataflow Processing in the Cloud

Query processing has been studied for a long time by the database community in the context of parallel, distributed, and federated databases [15, 28, 37]. Recently, cloud computing has attracted much attention in the research community and the software industry, and fundamental database problems are being revisited [17].
Thanks to virtualization, cloud computing has evolved over the years from a paradigm of basic IT infrastructures used for a specific purpose (clusters), to grid computing, and recently, to several paradigms of resource provisioning services: depending on the particular needs, infrastructures (IaaS — Infrastructure as a Service), platforms (PaaS — Platform as a Service), and software (SaaS — Software as a Service) can be provided as services [18]. One of the important advantages of these incarnations of cloud computing is the cost model of resources. Clusters represent a fixed capital investment made up-front and a relatively small operational cost paid over time. In contrast, clouds are characterized by elasticity, and offer their users the ability to lease resources only for as long as needed, based on a per-quantum pricing scheme, e.g., one hour on Amazon EC2 (http://aws.amazon.com/ec2). Together with the lack of any up-front cost, this represents a major benefit of clouds over earlier approaches. Elasticity, i.e., the ability to use computational resources that are available on demand, challenges the way we implement algorithms, systems, and applications. The execution of dataflows can be elastic, providing several choices of price-to-performance ratio and making the optimization problem two-dimensional [27].

Modern applications combine the notions of querying & search, information filtering & retrieval, data transformation & analysis, and other data manipulations. Such rich tasks are typically expressed in a high level language (SQL enhanced with UDFs), optimized [29], and transformed into an execution plan. The latter is represented as a data processing graph that has arbitrary data operators as nodes and producer-consumer interactions as edges. In this chapter we focus on the common denominator of all distributed data processing systems: scheduling, i.e., deciding where each node of the dataflow graph will be executed. Scheduling is a well-known NP-complete problem [19, 42]. Traditionally, the main criterion to optimize is the completion time of the dataflow, and many algorithms have been proposed for that problem [30, 52]. Scheduling dataflows on the cloud is a challenging task, since the space of alternative schedules becomes very rich once the monetary cost of using the resources is taken into account.

A major research problem is the development of new distributed computing paradigms that fit closely the elastic computation model of cloud computing. The most successful of these computational models today is MapReduce. Our first contribution is to demonstrate that very significant elasticity exists in tasks modeled with the MapReduce abstraction. Moreover, we show that elasticity can be discovered in practice using highly scalable and efficient algorithms. It is natural to expect that more refined models would allow further optimizations and extraction of elasticity. At the same time, it is also reasonable to be concerned as to whether the resulting complexity of the refined model would allow these optimizations to be performed. Our second contribution is to demonstrate that there exists a very fertile middle ground in terms of abstraction which enables the extraction of much more elasticity than what is possible under MapReduce, while remaining computationally tractable.
In this work, we propose a two step approach to explore the space of alternative schedules on the Cloud with respect to both completion time and monetary cost. Initially, we compute a global ranking of the operators of the dataflow based on their influence. Given the ranking, a dynamic programming algorithm finds the initial skyline of solutions. This skyline is further refined by a 2D simulated annealing algorithm. To illustrate the intuition behind our approach, we use the example dataflow graph in Fig. 3.1.

Figure 3.1: A simple dataflow. Heavy operators are bigger. Thicker arrows mean larger volumes of data.

Given that we can measure the influence of each operator in the graph, we can assign the most influential operators first. In our example the best solution is: (i) put A, B, and C in different containers, (ii) put e together with C, and (iii) put d together with A. The challenge is how to measure the influence of each operator in the dataflow. The number, nature, and temporal & monetary costs of the schedules on the skyline depend on many parameters, such as the dataflow characteristics (execution time of operators, amount of data generated, etc.), the cloud pricing scheme (quantum length and price), the network bandwidth, and more.

We incorporated these scheduling algorithms into ADP. To the best of our knowledge, our system is the first attempt to address the problem of dataflow processing on the cloud with respect to both completion time and monetary cost. Our techniques can successfully find trade-offs between time and money, and offer the user the choice of selecting the best trade-off manually or having the system choose the appropriate one automatically, based on the user's preferences and constraints. In this chapter, we make the following contributions:

• We show that elasticity can be embedded into processing systems that use the MapReduce abstraction.
• We propose a model of jobs and resources that can capture the special characteristics of the Cloud.
• We propose a two step solution to the scheduling problem on the cloud. We compute the ranking of the operators in the dataflow. Given that, a dynamic programming algorithm finds the initial skyline of solutions. The skyline is then refined by a 2D simulated annealing algorithm.
• We show that our approach is able to successfully find trade-offs between completion time and monetary cost, and thus exploit the elasticity of clouds.
• We show that, using our model, we achieve manageable complexity and a significant gain, producing schedules that dominate the plans produced by the MapReduce abstraction in every dimension.

The remainder of this chapter is organized as follows. In Section 3.1 we present the related work. In Section 3.2 we introduce the notation we use and the problem definition. In Section 3.3 we describe the dataflow language abstractions we use. In Section 3.4 we present the ranking algorithms that we use. In Section 3.5 we present the algorithms that we propose, and in Section 3.6 we compute their complexity. In Section 3.7 we present our experimental effort, and in Section 3.8 we conclude.

3.1 Related Work

Dataflow processing presents a major challenge and has received a lot of attention in recent years. A methodology to design ETL dataflows based on multiple criteria is presented in [45]; this is complementary to our work.
However, to the best of our knowledge, the optimization is not automatic for the time being. A newer approach [46] focuses on optimizing the execution time of data flows executed over multiple execution engines (DBMSs, MapReduce, etc.).

There are also several efforts that move in the same direction as our work but try to solve simpler versions of the problem. Examples include a particle swarm optimization of general dataflows having a single-dimensional weighted average of several metrics as the optimization criterion [38], a heuristic optimization of independent tasks (no dependencies) having the number of machines that should be allocated to maximize speedup under a predefined budget as the optimization criterion [44], and work focusing on energy efficiency [51]. In summary, we capitalize on the elasticity of clouds and produce multiple schedules, enabling the user to select the desired trade-off. To the best of our knowledge, no dataflow processing system deals with the concept of elasticity or two-dimensional time/money optimization, which constitute our key novelties.

3.2 Preliminaries

3.2.1 Dataflow

We use the same modeling as in [27] and, for completeness, we summarize it here. A dataflow is represented as a directed acyclic graph graph(ops, flows). Nodes (ops) correspond to arbitrary concrete operators and edges (flows) correspond to data transferred between them. An operator in ops is modeled as op(time, cpu, memory, behavior), where time is the execution time of the operator, cpu is its average CPU utilization measured as a percentage of the host CPU power when executed in isolation (without the presence of other operators), memory is the maximum memory required for the effective execution of the operator, and behavior is a flag that is equal to either pipeline (PL) or store-and-forward (SnF). If behavior is equal to SnF, all inputs to the operator must be available before execution; if it is equal to PL, execution can start as soon as some input is available. Two typical examples from databases are the sort and select operators: sort is SnF and select is PL. These metrics are either computed or collected by the system [31]. We model an operator as having a uniform resource consumption during its execution (cpu, memory, and behavior do not change). A flow between two operators, producer and consumer, is modeled as flow(producer, consumer, data), where data is the size of the data transferred.

The container is the abstraction of the host, virtual or physical, encapsulating the resources provided by the underlying infrastructure. Containers are responsible for supervising operators and providing the necessary context for executing them. A container is described by its CPU, its available memory, and its network bandwidth: cont(cpu, memory, network). A schedule S_G of a dataflow graph G is an assignment of its operators into containers, schedule(assigns). An individual operator assignment is modeled as assign(op, cont, start, end), where start and end are the start and end times of the operator, respectively, when executed in the presence of other operators. The time t(S_G) and money m(S_G) costs are the completion time and the monetary cost of a schedule S_G of a dataflow graph G. Cloud providers lease computing resources that are typically charged based on a per time quantum pricing scheme. For this reason we measure t(S_G) and m(S_G) in quanta.
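The entities introduced so far can be captured with a few plain data structures. The following Python sketch is purely illustrative (the class and field names are ours, not ADP's) and only records the model of this section; the estimation and scheduling logic is described in the sections that follow.

# Illustrative sketch of the dataflow model of Section 3.2.1.
from dataclasses import dataclass, field
from enum import Enum
from typing import List

class Behavior(Enum):
    PL = "pipeline"            # can start as soon as some input is available
    SNF = "store-and-forward"  # needs all of its inputs before it starts

@dataclass
class Operator:
    name: str
    time: float       # execution time in isolation, in quanta
    cpu: float        # average CPU utilization in isolation (0..1)
    memory: float     # maximum memory needed (fraction of a container)
    behavior: Behavior

@dataclass
class Flow:
    producer: str     # name of the producing operator
    consumer: str     # name of the consuming operator
    data: float       # size of the data transferred

@dataclass
class Container:
    cpu: float
    memory: float
    network: float

@dataclass
class Assignment:
    op: str
    container: int
    start: float = 0.0   # filled in by the schedule estimation
    end: float = 0.0

@dataclass
class Dataflow:
    ops: List[Operator] = field(default_factory=list)
    flows: List[Flow] = field(default_factory=list)

# A schedule of a dataflow is a list of assignments; its completion time
# t(S_G) and monetary cost m(S_G) are both measured in quanta.
Schedule = List[Assignment]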
Qt and cost Qm are the quantum time and price of leasing a container for Qt respectively. The cloud is a provider of virtual hosts (containers). We model only the compute service of the cloud and not the storage service. Assuming that the storage service is not used to store temporary results, a particular dataflow G will read and write the same amount of data for any schedule SG of operators into containers. So the cost of reading the input and writing the output is the same. 3.2.2 Schedule Estimation The algorithm estimates the time that every operator starts and its execution time given an assignment of operators to containers. We use a geometric approach. The time-shared resources are cpu, network, and disk. Those resources form boxes in a multi-dimensional space. Operators are boxes with resources and limited time. Containers are also modeled as boxes with infinite time. Intuitively, the problem becomes how to fit the boxes of the operators into the boxes of containers. 14 Optique Deliverable D7.1 Techniques for Distributed Query Planning and Execution: One-Time Queries Operators are modeled with three boxes I, P , and O with dimensions (time, cpu, inRate, outRate) as follows: I are the resources needed to read the input from the network or from the disk, P are resources needed to process the data, and O are resources needed to write the output to the network or to the disk. For SnF operators I and O are only disk resources. For data transfer operators, P is always zero. The data read from the network is always written to the disk. Notice that even for PL operators, this modeling is acceptable because at some level, PL operators are a sequence of SnF operators that read a small fraction of the input, process it, and produce a small fraction of the output. Definitions Formally, let A be an operator that belongs to dataflow G. The operator is defined as A(time, cpu, −, −) with assign(A, X, −, −). Without loss of generality, assume that A is a LP operator. For SnF operators we use the disk instead of the network for I/O. The total data that A reads from the network is: D∗→A = X {D : f low(B, A, D), assign(B, Y, −, −), X 6= Y } B∈G Similarly, the total data that A writes to the network is: X DA→∗ = {D : f low(A, B, D), assign(B, Y, −, −), X 6= Y } B∈G The three boxes of the operator are defined as follows: I( D∗→A , DTCP U , X.network, 0) X.network P (A.time, A.cpu, 0, 0) DA→∗ , DTCP U , 0, X.network) X.network being a system parameter. In isolation, the box AB of operator A will have the following O( with DTCP U properties: AB .time = I.time + P.time + O.time AB .cpu = I.cpu · I.time + P.cpu · P.time + O.cpu · O.time AB .time AB .inRate = AAB .outRate = I.inRate · I.time AB .time O.outRate · O.time AB .time In the context of others, operators are scaled in the time dimension due to time-shared resources. However, assuming uniform behavior for all operators, the following measures do not change at any scale. instructions = AB .time · AB .cpu indata = AB .time · AB .inRate outdata = AB .time · AB .outRate Those measures are used to calculate the cpu, inRate and outRate at any time scale. A group S is a subset of connected operators that can be executed concurrently. Groups can have either connected PL operators or only one SnF operator. The execution time of the operators belonging to the same group is the same. As a consequence, that all operators inside the same group are scaled to reach the most time − − − consuming operator. 
Formally, a group P is defined as P (→ o ,→ s ) with → o being the vector with the operators → − that are members of the group and s being the vector with the time scale of operators inside the group. 15 Optique Deliverable D7.1 Techniques for Distributed Query Planning and Execution: One-Time Queries Money d1 d2 d3 Time Figure 3.2: Two skylines A (circle) and B (square). Their distance is ( d1+d2 2 , d3). Estimation Algorithm The estimation algorithm estimates when and for how long every operator will run given any schedule. Thus, the completion time and monetary cost is estimated for the whole schedule. The graph is examined from producers to consumers starting from the operators that have no inputs. The set of operators that can be → − → − executed in parallel is divided into groups. Given a vector of groups P , we find the vector of S with the time scales. An operator oi that belong to group Pj is scaled by Sj · P.si . The next time event t is defined by the group that will terminate first. The remaining time and in/out data for each operator is reduced by the percentage of the time that the operator is executed till t. The terminated operators are removed from the ready set and the operators from the queued set that their memory constraint are satisfied, are inserted in the ready set. The algorithm terminated when all the operators are examined. The completion time t(SG ) is defined as the time of the termination of the last operator of the schedule. To compute the money m(SG ), we slice time in each container into windows of length Qt starting from the first operator of the schedule. The monetary cost is then a simple count of the time-windows that have at least one operator running, multiplied by Qm . |C| |W | X X m(SG ) = Qm ∗ ( (ci , wj )) i=1 j=1 with C = {ci } being the set of containers, W = {wj } being the set of time-windows, and (ci , wj ) = 3.2.3 1, if at least one operator is active in wj in ci 0, otherwise Skyline Distance Set To compare solutions from different algorithms we define the distance between two skylines. Let A and B be two skylines produced by different algorithms. The units of both time and money are in quanta. Assume that A is dominated by B, i.e., skyline(A ∪ B) = B. Let |A| be the number of points in skyline A. The distance of A from B is defined as: P|A| dist(Ai , B) Dp (A, B) = i=1 |A| with dist(Ai , B) being the distance of schedule Ai ∈ A from skyline B. Intuitively, this distance shows the average distance in quanta between the schedules of the two skylines. Several works has studied the problem of finding the distance between a point and a skyline [26, 22]. In general, the problem is to minimize the total cost of making Ai part of B, i.e., make it a point in the skyline. In this work we define this distance as being the minimum L2 distance needed for Ai to be part of B. We can compute it as shown in Figure 3.2. This distance is smooth at the point where Ai becomes part of the skyline. 16 Optique Deliverable D7.1 Techniques for Distributed Query Planning and Execution: One-Time Queries We define the distance between two arbitrary skylines as follows. Let A and B be two skylines. We compute C = skyline(A ∪ B). The distance D(A, B) is defined as the pair (Dp (A, C), Dp (B, C)). Figure 3.2 shows an example. This distance has several good properties. Let (Dp (A, C), Dp (B, C)) = (a, b). 
The following hold: (i) a, b ≥ 0 (ii) if a = 0 then A dominates B (iii) if a, b > 0, both A and B have schedules with tradeoffs (iv) is smooth where a schedule becomes part of the skyline Figure 3.2 shows an example where both a and b are positive. This distance can be generalized to n skylines (S1 , ..., Sn ) as follows: D(S1 , ..., Sn ) = (Dp (Si , C), ..., Dp (Sn , C)) with C = skyline(∪(Si )). In our experiments, we compute the generalized skyline distance from the results of all algorithms to compare the results. 3.3 Dataflow Language In our system, requests take the form of queries in SQL with user defined functions (UDFs). The SQL query is transformed into the intermediate level that we described in Section 2.3.1. This intermediate level SQL script is then transformed to a dataflow graph using the modeling that described in Section 3.2. Figure 3.3 shows the dataflow graph produced from query 8 of the TPC-H benchmark. 61 18 63 19 64 74 75 76 77 13 78 14 79 15 80 81 102 8 73 48 103 9 16 49 10 32 50 104 11 33 51 105 12 34 52 106 20 35 45 99 36 46 100 29 30 31 47 101 90 91 92 93 94 95 96 97 108 110 Figure 3.3: The dataflow graph produced from TPC-H query 8. 3.4 Operator Ranking In this section we present the methodology we use to rank the operators of a dataflow graph. A score is computed for each operator according to the influence they have on the schedule. Figure 3.4 shows the results of ranking the Montage dataflow with 50 SnF operators. Influential operators have darker colors. 70 60 50 40 30 20 10 0 0 5 10 15 20 25 30 35 40 45 50 Figure 3.4: The Montage dataflow ranking information. Darker colors means higher scores. A simple way to rank the operators is by computing a relative score based on their properties and the input & output data F (op, in, out). We call this ranking Structure Ranking because it takes into account only the immediate neighborhood for each operator in the graph. The scoring function we use is defined as follows: score(oi ) = a · oi .proc_time + (1 − a)(oi .io_time) 17 Optique Deliverable D7.1 Techniques for Distributed Query Planning and Execution: One-Time Queries with oi the operator i, proc_time = oi .time and P|in| P|out| i=1 (in[i].data) + i=1 (out[i].data) io_time = net_speed Parameter a expresses the importance of the execution time and I/O time. In our experiments we used a = 0.5. The ranking of operators is defined as SR(G) = (score(o1 ), score(o2 ), ..., score(on )) Assuming there are no long-range correlations between operators in the graph, this is a good ranking function. A long-range correlation between two operators exists when the minimum length of the path between them is more than 2, and the assignment of one operator in the schedule affects the assignment of the other. If the graph has this property, Structure Ranking will not find it. To overcome the problem, we compute the score for each operator by measuring their influence on the schedule directly. We can measure that by finding the partial derivative of the operators on the space of schedules produced by our model. Given a particular schedule SG of a dataflow graph G, we compute: θSE θSE (SG ), ..., (SG )) θo1 θon with oi being the operator i in the dataflow and SE being the schedule estimation algorithm. Essentially, the derivative shows how sensitive is the dataflow to the different assignments of the operators. We call this ranking Derivative Ranking. The derivative of SE is hard to compute analytically, so instead we use an iterative process to approximate it. 
Assuming the graph has n operators, the space of schedules is a n-dimensional hypercube. Each axis of that cube is of length C (the number of containers). To measure the derivative, we assign each operator to all possible positions in its axis, without changing the positions of the others, and measure the difference in time and money of all the produced schedules. Algorithm 1 shows the ranking process. The algorithm creates a schedule by randomly assigning the operators into containers. Then, each operator is assigned to every possible container without changing the positions of the others. At each step, it measures the difference of the cost function provided. This is repeated until the ordering of the operators does not change. The cost function we used is defined as follows: ∇SE(SG ) = ( F (S(G)) = a ∗ S(G).time + (1 − a) ∗ S(G).money In our experiments we set a = 0.5. Intuitively, the Structure Ranking and the Derivative Ranking should produce similar results if there are no long-range correlations between the operators in the dataflow graph. Indeed, this is the case for graphs with SnF operators. Below we present the results of ranking the Ligo [13] dataflow with 50 operators. We show the structure (S) and derivative (D) ranking. Each operator is a different character. Each list is ranked from the highest score to the lowest. We measure the difference of the two rankings using the Kendall tau rank distance [16]. S: W noXmM T piOkRf V N jP QghlU SqY rBAEHKCGJF LID1c2a3b45Zd6e D: W M XT ORN P nomkV pf jQiU ShglrqY JDBF GHELKCIA1a42b3Zce6d5 Kendall tau Distance : 0.088 We observe that the two rankings are not significantly different and the most influential operators are the same. However, for PL operators, the long-range correlations do exist. Below we show the results of Ligo with PL operators. S: W noXmM T piOkRf V N jP QghlU SqY rBAEHKCGJF LID1c2a3b45Zd6e D: Y W N T QXJDRIV CBOLU bGZ4F S3HM EAacP gj1f ndKkhi5m6le2oprq Kendall tau Distance : 0.475 We observe that the ranking is significantly different. Pipeline operators run in parallel and two connected operators run the same amount of time, regardless if one of them is faster than the other. The structure ranking, would give the fast operator a smaller score than the slow one, which is not correct, because they are both influential. An illustrative example is operator D. Using the structure ranking, its rank is 37 and using derivative is 8. 18 Optique Deliverable D7.1 Techniques for Distributed Query Planning and Execution: One-Time Queries Algorithm 1 Ranking Algorithm Input: G: The dataflow graph N: The number of containers F: F (S(G)) → <: The cost function M: The maximum number of iterations Output: scores: The scores of the operators 1: scores[|G.ops|] ← 0 2: for m ∈ [1, M ] do 3: S ← RandomScheduler(G, N ) 4: all[|G.ops|][N ] ← 0 5: for o ∈ G do 6: for c ∈ [1, N ] do 7: assign(o, c, _, _) 8: all[o][m] ← F (S) 9: end for 10: end for 11: curr[|G.ops|] ← 0 12: for o ∈ G do 13: curr[o] ← maxi (all[o][i]) − mini (all[o][i]) 14: end for 15: if ranking of ops is the same in scores and curr then 16: break 17: end if 18: for o ∈ G do 19: scores[o] ← ((m − 1) · scores[o] + curr[o])/m 20: end for 21: end for 22: return scores 3.5 Scheduling Algorithm The scheduling algorithm that we propose has two phases. The first phase computes the skyline of schedules based on a dynamic programming algorithm using the ranking of the operators. This skyline is further refined using a 2D simulated annealing. 
In the following sections we present the two algorithms. 3.5.1 Dynamic Programming The dynamic programming algorithm that we use is shown in Algorithm 2. The operators are considered from producer to consumer. Each operator with no inputs, is a candidate for assignment. An SnF operator is a candidate, as soon as all of its inputs are available. A PL operator is a candidate, as soon as all of its inputs come from PL or from completed SnF operators. The algorithm chooses the operator with the maximum rank from the list of available operators. That operator is added to all the schedules in the skyline at every possible position. At every step, only the schedules in the skyline are kept. The skyline may contain too many points [40]. For the experiments, we keep all the points in the skyline. In practice this is infeasible. Several approaches can be followed. One approach would be to keep k representative schedules (k is a system parameter) from the skyline: the fastest, the cheapest, and k − 2 equally distributed in between. 19 Optique Deliverable D7.1 Techniques for Distributed Query Planning and Execution: One-Time Queries Algorithm 2 Dynamic Programming Input: G: A dataflow graph. C: The maximum number of parallel containers to use. Output: skyline: The solutions. 1: 2: 3: 4: 5: 6: 7: 8: 9: 10: 11: 12: 13: 14: 15: 16: 17: skyline ← ready ←{operators in G that have no dependencies} f irstOperator ← maxRank(ready) f irstSchedule ← {assign(f irstOperator, 1, −, −)} skyline ←{f irstSchedule} while ready 6= do next ← maxRank(ready) S← for all schedules s in space do for all containers c (c ≤ C) do S ← S ∪ {s + assign(next, c, −, −)} end for end for space ← skyline of S ready ← ready − {next} ready ← ready ∪ {operators in G that dependency constraints no longer exist} end while 18: return skyline 3.5.2 Simulated Annealing The 2D simulated annealing is shown in Algorithm 3. The initial skyline is computed by Algorithm 2. At each step, all the schedules of the skyline are considered. A neighbor of each schedule is computed by assigning an operator to another container. If the newly produced schedule dominates the old one, only that schedule is kept. Both schedules are kept if they do not dominate each other. If the old one dominates the new, the new one is kept with probability that depends on the euclidean distance between them and the temperature. The lower the temperature, the smaller the probability of keeping a dominated schedule. As RandN eighbor we use two functions: (i) purely random and (ii) random based on ranking. The later, chooses the operators with probability proportional to their scores. 3.5.3 Parallel Wave The Parallel Wave is a generalization of the scheduling algorithm for MapReduce graphs. In the beginning, the algorithm finds the depth of each operator in the graph. The operators with no inputs, have depth zero. The depth of every other operator is the maximum depth of the operators connected to its inputs plus one. In a MapReduce pipeline, all the operators in the map phase will be at the same depth. The same holds for operators in the reduce phase. We used this algorithm to add elasticity into MapReduce dataflow pipelines, similar to the ones produced by Hive [48] or Flume [8]. Let W be the number of different depths in the graph and pi be the maximum parallelism of depth i. pi is defined as min(C, |Wops |), with C being the maximum number of containers and |Wi | being the number of operators at depth i. This algorithm considers all the combinations of different pi . 
For the scheduling of operators in the same level, we use a simple load balancing mechanism. 3.5.4 Exhaustive The exhaustive algorithm enumerates all the different schedules and keeps the ones in the skyline. We use this algorithm for two purposes: (i) to compare the results of the proposed algorithms with the optimal and (ii) as a initialization step for simulated annealing to find the optimal assignment for the most influential operators. Let N be the number of operators and C be the number of containers. The space of solutions is 20 Optique Deliverable D7.1 Techniques for Distributed Query Planning and Execution: One-Time Queries Algorithm 3 Simulated Annealing Input: G: A dataflow graph. K: Maximum number of iterations. C: Maximum number of containers Output: skyline: The solutions. 1: 2: 3: 4: 5: 6: 7: 8: 9: 10: 11: 12: 13: 14: 15: 16: 17: 18: 19: 20: 21: 22: 23: skyline ← DynamicP rogramming(G, C) S ← skyline k←0 while k < K do for all schedules s in S do n ← RandN eighbor(s) if n.time < s.time and n.money < s.money then // Dominate s ← next else if n.time < s.time or n.money < s.money then // Tradeoff S ← S ∪ {next} else L2Distance(s,next) − T (k) if e > rand[0,1] then s ← next end if end if end if end for skyline ← skyline of (skyline ∪ S) k ←k+1 end while 24: return skyline C N . For N = 2 and C = 2 the assignments are 22 . However, the solution space has a lot of symmetries. In the example, only two solutions are different: i) assign both operators to the same container or ii) assign them to different containers. We break the symmetries2 generating schedules as shown in Figure 3.5. The figure shows all different solutions when scheduling three operators (A, B, and C). We can compute the A A AB B C ABC A AB C AC B A B BC A B C Figure 3.5: All the solutions for three operators A, B, and C. number of different schedules as follows. The number of leafs at the sub-tree starting with k containers and n remaining operators to assign is: ss (n, k) = k · ss (n − 1, k) + ss (n − 1, k + 1) With ss (0, k) = 1 ∀k. The root of the tree is ss (N, 0). A simple dynamic programming technique is used to solve the equation. A generalization is to use at most C containers. ssg (n, k, C) = k · ssg (n − 1, k, C) + ssg (n − 1, k + 1, C) 2 Finding symmetries in the graph is a much harder problem. 21 Optique Deliverable D7.1 Techniques for Distributed Query Planning and Execution: One-Time Queries ssg (0, k, C) = 1, ∀k ≤ C 0, otherwise Number of Schedules 1e+14 # Schedules 1e+12 1e+10 1e+08 1e+06 10000 100 1 0 2 4 6 8 10 12 14 16 18 20 # Operators Figure 3.6: Total and unique number of schedules. Figure 3.6 shows the number of schedules before and after breaking the symmetries. For 20 operators, the unique number of schedules are approximately 10 orders of magnitudes less. We leave for future work the breaking the symmetries on the graph. 3.6 Complexity Analysis We assume as given a dataflow graph G with n operators, l links, and c containers. We also assume that the number of schedules in the skyline is no more than s. Schedule Estimation: The worst case scenario is all operators to be ready in the beginning and only one operator to terminate each time. At each phase, all operators and all links are considered. The complexity is n X n(n + 1) SEC = O( (i + l)) = O(nl + ) 2 i=1 Given that in most graphs l > n, SEC = O(nl). Ranking: Ranking takes at most r iterations to complete. At each iteration, nc invocations of SE are performed. Thus, the complexity is RC = O(rn2 lc). 
3.6 Complexity Analysis

We assume as given a dataflow graph G with n operators, l links, and c containers. We also assume that the number of schedules in the skyline is no more than s.

Schedule Estimation: The worst case is that all operators are ready in the beginning and only one operator terminates at a time. At each phase, all operators and all links are considered. The complexity is

  SE_C = O(Σ_{i=1..n} (i + l)) = O(nl + n(n + 1)/2).

Given that in most graphs l > n, SE_C = O(nl).

Ranking: Ranking takes at most r iterations to complete. At each iteration, nc invocations of SE are performed. Thus, the complexity is R_C = O(r n^2 l c).

Dynamic Programming: This algorithm takes n steps. At each step at most sc invocations of SE are performed, one for each schedule and container. Therefore, the complexity is DYN_C = O(s n^2 l c).

Simulated Annealing: Simulated annealing takes at most k steps to run, and in every step at most s invocations of SE are performed for the new neighbors. The complexity is SA_C = O(k s n l).

The overall algorithm calls the ranking algorithm, then the dynamic programming, and at the end the simulated annealing. Therefore, its complexity is O(n^2 l c (r + s + ks/(nc))).

3.7 Experimental Evaluation

This section presents the results of our experimental effort, divided into two groups. The first group contains experiments with our model and algorithms. The second group contains experiments with our system and a comparison with Hive [48].

3.7.1 Model and Algorithms

The parameters of the experiments are shown in Table 3.1. We begin by presenting the experimental setup.

Table 3.1: Algorithm Experiment Properties
  Property        Values
  Dataflow        Montage, Ligo, Cybershake, TPC-H, MapReduce
  Output Size     10x − 10000x
  Operator type   SnF, PL
  Ranking         Derivative, Structure
  Search          NL, Dyn, SA, Exh, PW
  Data transfer   DT_CPU = 0.1, DT_MEM = 0.05

Table 3.2: Operator Properties
  Property   Values
  time       0.2, 0.4, 0.6, 0.8, 1.0
  cpu        0.4, 0.45, 0.5, 0.55, 0.6
  memory     0.05, 0.1, 0.15, 0.2, 0.25
  data       0.2, 0.4, 0.6, 0.8, 1.0

Experimental Setup

Dataflow Graphs: We examine five families of dataflow graphs: Montage [25] (Figure 3.7A), Ligo [13] (Figure 3.7B), Cybershake [12] (Figure 3.7C), MapReduce (Figure 3.8), and the first 10 queries of TPC-H [6] (Figure 3.3 shows query 8). The first three are abstractions of dataflows that are used in scientific applications: Montage is used by NASA to generate mosaics of the sky, Ligo is used by the Laser Interferometer Gravitational-wave Observatory to analyze galactic binary systems, and Cybershake is used by the Southern California Earthquake Center to characterize earthquakes. The MapReduce and TPC-H graphs are generated by our dataflow language, as presented in Section 3.3.

Figure 3.7: The scientific graphs Montage (A), Ligo (B), and Cybershake (C).

Operator Types: We have indicated the values of operator properties as percentages of the corresponding parameters of container resources (Table 3.2). For example, an operator having memory needs equal to 0.4 uses 40% of a container's memory. Furthermore, execution times are given as percentages of the cloud's time quantum, and so are data sizes (inputs/outputs of operators), taking into account the network speed. For example, an execution time of 0.5 indicates that the operator requires half of a time quantum to complete its execution (say, 30 minutes according to the predominant Amazon cost model). Likewise, an output of size 0.2 requires one fifth of a time quantum to be transferred through the network if needed. This way, the output data size is in inverse correlation with the network speed. Money is measured as described in Section 3.2.
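As a worked illustration of this normalization (ours, not ADP code), the following sketch converts the normalized values of Table 3.2 into absolute quantities; the quantum length, container memory, and network speed are assumptions supplied by the caller, and the numbers in main simply reproduce the 30-minute example above.

// Illustrative sketch only: turning the normalized operator properties of Table 3.2
// into absolute values. The quantum length, container memory, and network speed are
// assumed parameters; they are not fixed by the model itself.
class NormalizedOperator {
    final double time, cpu, memory, data;              // normalized values, as in Table 3.2

    NormalizedOperator(double time, double cpu, double memory, double data) {
        this.time = time; this.cpu = cpu; this.memory = memory; this.data = data;
    }

    double runSeconds(double quantumSeconds)      { return time * quantumSeconds; }
    double transferSeconds(double quantumSeconds) { return data * quantumSeconds; }
    double outputMB(double quantumSeconds, double networkMBperSec) {
        // absolute output size implied by the transfer time and the network speed
        return transferSeconds(quantumSeconds) * networkMBperSec;
    }
    double memoryGB(double containerMemoryGB)     { return memory * containerMemoryGB; }

    public static void main(String[] args) {
        NormalizedOperator op = new NormalizedOperator(0.5, 0.5, 0.1, 0.2);
        double quantum = 3600;                          // assume a one-hour quantum, as in the Amazon example
        System.out.println("run time (s):      " + op.runSeconds(quantum));       // 1800 s = 30 minutes
        System.out.println("transfer time (s): " + op.transferSeconds(quantum));  // one fifth of a quantum
    }
}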
Figure 3.8: A MapReduce pipeline consisting of two jobs. Each job has three phases: map, combine, and reduce.

We have used synthetic workloads based on the Montage, Ligo, and Cybershake dataflows as defined in [7]. We scaled up the specified run times and operator output sizes by a factor of 50 and 1000 respectively; the ratio of output size to run time was thus increased by a factor of 20. We also set the operator memory to 10% of the container capacity. We set the data transfer CPU utilization to DT_CPU = 0.1 and its memory needs to DT_MEM = 0.05. We experimented with changing the output by a factor of 10, 100, 1000, and 10000. The properties of the operators in MapReduce and TPC-H are chosen with uniform probability from the corresponding sets of values shown in Table 3.2. We have used dataflows with both SnF and PL operators.

Optimization Algorithms: We used the Nested Loops optimizer (NL) as presented in [27], Exhaustive (Exh), Parallel Wave (PW), Dynamic Programming (DP), and Simulated Annealing (SA). For the NL algorithm, we used the greedy algorithm in the inner loop and 10 different numbers of containers in the range [1, N], with N being the number of operators in the graph. We used both the structure and the derivative methods for ranking. For the initialization of the SA algorithm, we used DP, Exh on the 10 most influential operators, and random initialization. We used two RandNeighbor functions (Algorithm 3, line 6): purely random, and random based on ranking (each operator is selected with probability proportional to its score).

Measurements: For each experiment we generated 10 different graphs with different seeds. For Montage, Ligo, and Cybershake, we used the generator in [7]. The TPC-H and MapReduce graphs are generated from the language described in Section 3.3. We ran all the algorithms on each graph and computed the generalized skyline distance from the results. As defined in Section 3.2, the generalized skyline distance is the distance of the skyline produced by each algorithm from the combined skyline. In the results we show the average value over the 10 different seeds. Time and money are measured in quanta.
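For concreteness, the following sketch (ours, purely illustrative and not the ADP implementation) shows the two measurement primitives used throughout this section: keeping only the non-dominated (time, money) schedules, and measuring how far a produced skyline lies from the combined skyline. The distance shown is one plausible formalization (the average minimum Euclidean distance from each combined-skyline point to the evaluated skyline); the definition actually used in the experiments is the one given in Section 3.2.

import java.util.ArrayList;
import java.util.List;

// Illustrative only: 2D skyline (Pareto) filtering over (time, money) schedules
// and a simple skyline-distance measure. Class and field names are ours, not ADP's.
class SkylineUtil {
    static class Point { double time, money; Point(double t, double m) { time = t; money = m; } }

    // A point p dominates q if it is no worse in both dimensions and better in at least one.
    static boolean dominates(Point p, Point q) {
        return p.time <= q.time && p.money <= q.money && (p.time < q.time || p.money < q.money);
    }

    // Keep only the non-dominated schedules.
    static List<Point> skyline(List<Point> points) {
        List<Point> result = new ArrayList<>();
        for (Point p : points) {
            boolean dominated = false;
            for (Point q : points)
                if (q != p && dominates(q, p)) { dominated = true; break; }
            if (!dominated) result.add(p);
        }
        return result;
    }

    // One plausible distance of an algorithm's skyline from the combined skyline:
    // average, over the combined-skyline points, of the distance to the closest produced point.
    static double distanceFromCombined(List<Point> produced, List<Point> combined) {
        double sum = 0;
        for (Point c : combined) {
            double best = Double.MAX_VALUE;
            for (Point p : produced)
                best = Math.min(best, Math.hypot(p.time - c.time, p.money - c.money));
            sum += best;
        }
        return sum / combined.size();
    }
}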
Model Validation

We begin by presenting the experiments we performed to validate our model. We used the TPC-H queries. The properties of each operator, as presented in Section 3.2, can be either computed or collected by the system during execution [31]; in this work they are assumed to be given. To acquire the properties, we ran each operator in isolation using only one container and collected the statistics. The cloud price quantum is set to 10 seconds. Figure 3.9 shows the real and estimated execution time and money for all queries of TPC-H using 8 containers and 8 GB of data in total. We show the fastest plan in the skyline. We observe that our model is able to successfully predict both the time and the money of the dataflows. Furthermore, query 9 clearly shows that our model overestimates the actual running time and money; it can thus be used as an upper bound of the real values.

Figure 3.9: Real and estimated execution time and money for queries of the TPC-H benchmark.

Compare with Optimal

In this set of experiments, we compare the skylines produced by our algorithms with the optimal. We generate the optimal result using the Exh algorithm. Since computing the skyline optimally is very expensive for large dataflows, we used TPC-H graphs with 10 operators. To have a better understanding of the quality of the results, we also computed the worst solutions; those solutions are in the Max-Max skyline. Here we show the results for queries 3, 6, and 9 (the results are similar for the other queries). The characteristics of the operators are chosen with uniform probability from the values in Table 3.2. The results are shown in Figure 3.10. All operators are SnF and the output data replication is 100. We observe that the DP algorithm produces solutions that are not far from the optimal. SA significantly improves the results produced by DP, especially for query 6.

Figure 3.10: Comparison of different algorithms with the optimal using TPC-H queries with 10 operators.

We are aware of the fact that these results cannot be generalized to larger graphs. We are currently working on techniques to reduce the search space by breaking symmetries on the graph itself and computing exact solutions for very large graphs. One way to pursue this is by analyzing the SQL scripts presented in Section 3.3. The dataflow graphs are generated by a relatively small number of SQL queries and they are very symmetric due to the nature of SQL.

Elastic MapReduce

We created several MapReduce dataflows with different numbers of jobs. The properties of each operator are chosen from the values in Table 3.2 with uniform probability. Figure 3.11 shows the results of dynamic programming and parallel wave using the dataflow of Figure 3.8 with SnF and PL operators. First, we observe that there can be significant elasticity in MapReduce dataflows. Second, we see that using the algorithm we propose, we gain a significant improvement: we are able to find plans that are as fast as the MapReduce ones but a lot cheaper. Finally, we observe that the gain for PL operators is much bigger than for SnF operators. This was expected, since the MapReduce graphs are SnF.

Figure 3.11: Dynamic algorithm compared with Parallel Wave on MapReduce graphs.

We also experimented with varying the size of the data. Figure 3.12 shows the results. We observe that when the amount of data increases, we are able to find better schedules for both SnF and PL dataflows. Furthermore, we observe that for SnF dataflows the MapReduce scheduler performs better. We expected this, since it is designed for this type of dataflows.

Figure 3.12: Dynamic algorithm compared with Parallel Wave on MapReduce graphs (varying data replication).

Effect of Ranking

In this set of experiments we measured the effect of ranking the operators. We used both structure and derivative ranking with the Dyn and SA algorithms. The dataflows have SnF and PL operators. The output data replication factor is set to 100. As a baseline for Dyn, we compare with the FIFO algorithm, i.e., the operators are assigned in the same order in which they become available.
Dynamic Programming: Figure 3.13 shows the results for Dyn using the Ligo dataflow with 100 PL and SnF operators. We observe that ranking in general improves the solutions dramatically compared to FIFO, for both PL and SnF dataflows. As expected, for SnF operators the results are similar, because the rankings do not differ much. For PL operators, however, we see that the ranking based on the derivative is much more beneficial than the ranking based on structure.

Figure 3.13: Dynamic programming with different ranking using the Ligo dataflow with 100 PL and SnF operators.

Simulated Annealing: For simulated annealing, we experimented with the initialization algorithm and the neighbor selection. Figure 3.14 shows the results for the Cybershake dataflow with 100 SnF and PL operators. In the left plot of Figure 3.14, we observe that the initialization algorithm is essential. The best choice is the Dyn algorithm: its results are not much worse than those of Exh for SnF operators, and it is the best choice for PL operators. Using Exh on the 10 most influential operators is the best choice for SnF but not beneficial at all for PL operators. This is because PL dataflows have many more influential operators, and these are much more sensitive than in the SnF case. We are currently working on ways to make SA multi-phase, with each phase considering only a subset of the operators. The operators can be partitioned based on their ranking; this could be done using histogram-based partitioning [23], like equi-width or max-diff.

The middle plot of Figure 3.14 shows the results of SA using different algorithms to choose a neighbor. Interestingly, the ranking-based neighbor selection hurts SA: it prevents it from navigating the search space by effectively restricting it to move only the most influential operators. The purely random neighbor selection is the best choice. In the right part of Figure 3.14 we show the effect of using different ranking algorithms. We observe that the derivative ranking is significantly better than the structure ranking for PL operators and does not differ much from it for SnF operators.

Figure 3.14: Varying data for the Ligo dataflow with 100 PL and SnF operators. (Panels: initialization algorithm, neighbor selection, and ranking, for Cybershake with 100 operators.)
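For reference, here is a minimal sketch (ours, not ADP code) of the ranking-based RandNeighbor variant that was compared above against the purely random one: the operator to move is chosen with probability proportional to its score and is then placed on a randomly chosen container; the purely random variant simply picks the operator uniformly.

import java.util.Random;

// Illustrative only: roulette-wheel selection of the operator to move,
// with probability proportional to its ranking score.
class RankedNeighborChooser {
    private final Random rnd = new Random();

    int pickOperator(double[] score) {             // scores assumed non-negative
        double total = 0;
        for (double s : score) total += s;
        double r = rnd.nextDouble() * total;
        for (int op = 0; op < score.length; op++) {
            r -= score[op];
            if (r <= 0) return op;
        }
        return score.length - 1;                   // numerical fallback
    }

    int pickContainer(int maxContainers) {         // the new container is chosen uniformly
        return rnd.nextInt(maxContainers) + 1;     // containers numbered 1..C, as in assign(op, c)
    }
}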
Compare Algorithms

In this set of experiments, we compare all the algorithms using the scientific dataflows Montage, Ligo, and Cybershake. All dataflows have 50 operators and the data replication is 100. We compare the algorithms proposed in this work with the ones in [27], using the generalized skyline distance.

Figure 3.15 shows the results. We observe that the combination of the Dyn algorithm to find the initial skyline and the refinement of that skyline by SA with a random neighbor is the best choice. In some cases, the solutions produced are one order of magnitude better than those of the NL algorithm. As observed earlier, using ranking with SA is not beneficial. Furthermore, the Exh algorithm does not give very good results compared to Dyn+SA; we remind the reader that Exh uses only the 10 most influential operators. We also measured the number of schedules in the skyline produced by each algorithm. In general, Dyn finds more schedules in the skyline than NL, and SA improves that even further. That, combined with the fact that Dyn+SA produces the best skyline of schedules, makes it an obvious choice.

Figure 3.15: Comparison of the distance from the combined skyline and of the number of schedules in the skyline for different algorithms on various scientific dataflows with 50 operators.

We also experimented with varying the size of the output data generated by the operators for the Montage, Ligo, and Cybershake dataflows with 50 operators. Figure 3.16 shows the results. We observe that the Dyn+SA combination proposed in this work produces plans that are in some cases one order of magnitude better than those of NL. Finally, Figure 3.17 shows the running time of the algorithms. We observe that the Dyn algorithm has the fastest running time. Exh and Exh+SA have very long running times compared to the others, because they examine a very large number of schedules and each call to the Schedule Estimation sub-routine is expensive.

3.7.2 Systems

The parameters of the experiments are shown in Table 3.3.

Experimental Setup

Execution Environment: In our experiments, all containers have the same resources (CPU, memory, disk, and network). The resources were kindly provided by Okeanos (okeanos.grnet.gr), the cloud service of GRNet (www.grnet.gr). We used 32 virtual machines (containers), each with 1 CPU, 8 GB of memory, and 60 GB of disk. We measured the average network speed at 15 MB/sec. We used Hadoop 1.1.2 and Hive 0.11. Time and money are measured in quanta.

Dataset: We generated a total of 512 GB of data (or approximately 2.2 billion tuples) using the generator provided for TPC-H. The benchmark has eight tables: region(1), partsupp(32, ps_partkey), orders(32, o_orderkey), lineitem(32, l_orderkey), customer(32, c_custkey), part(32, p_partkey), nation(1), and supplier(1). In parentheses, we show the number of partitions we created for each table and the key on which the partitioning was performed. In Hive, we used the CLUSTERED BY clause when we created the tables. The replication factor of Hadoop is set to 1 in order to have a clear comparison with our system. After loading the data into Hadoop we used the balancer to evenly distribute the data over the cluster. In ADP, the tables are horizontally partitioned and distributed to the cluster. The partitions of the same table are placed on different virtual machines.
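As an illustration of the ADP side of this setup, the sketch below shows how a partitioned TPC-H table could be declared through the JDBC interface presented in Chapter 4, reusing the "distributed create ... partition on" form shown later in this section for the MapReduce joins. The host, database path, credentials, source table name, and the exact statement are assumptions made for the example; how the number of partitions (32 in the experiments) is configured is not shown in this deliverable.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

// Illustrative only: declaring a horizontally partitioned TPC-H table in ADP's
// dataflow language, issued through the ADP JDBC driver of Chapter 4.
class LoadLineitem {
    public static void main(String[] args) throws Exception {
        Class.forName("madgik.adp.jdbc.AdpDriver");
        String db = "http://10.254.11.18:9090/home/adp/databases/tpch";   // hypothetical database path
        try (Connection conn = DriverManager.getConnection("jdbc:adp:" + db, "adp", "tpch"); // assumed credentials
             Statement stmt = conn.createStatement()) {
            // Partition lineitem horizontally on l_orderkey; lineitem_raw is a placeholder source table.
            stmt.execute("distributed create table lineitem partition on l_orderkey "
                       + "as select * from lineitem_raw;");
        }
    }
}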
Table 3.3: System Experiment Properties
  Property          Values
  Dataflow          TPC-H, MapReduce
  Operator type     SnF
  Ranking           Derivative, Structure
  Search            Dyn, PW
  Num of VMs        8, 16, 24, 32
  TPC-H Data (GB)   64, 128, 256, 512
  Quantum size      10 seconds

Figure 3.16: Varying data for the Montage, Ligo, and Cybershake dataflows with 50 operators.

Figure 3.17: Running time of the algorithms for the Montage, Ligo, and Cybershake dataflows with 50 PL operators.

Elastic Execution

In our first set of experiments, we examine the elasticity of the TPC-H graphs. Figure 3.18 shows the results. We observe that most of the queries are elastic, i.e., there is a tradeoff between time and money. Three representative examples are queries 2, 3, and 4. Query 2 is inelastic, meaning that adding more machines increases both time and money; this is typical behavior for a distributed system when the data size is small, and query 2 does not use the two large tables lineitem and orders. Query 3 is a typical example of an elastic dataflow: money can be traded for time. Finally, query 4 runs faster and cheaper with more machines.

Scalable Execution

In this set of experiments we show the scalability of our system. Figure 3.19 shows the execution time of the TPC-H queries varying the size of the data. We observe that our system scales well and can handle up to half a TB of data. Query 9 is a difficult query, since it produces a large amount of intermediate results, and we ran out of disk space for the 256 GB and 0.5 TB datasets. Figure 3.20 shows the execution time of the TPC-H queries varying the number of virtual machines. We used the 64 GB dataset in order to be able to run with 8 virtual machines. We observe that our system can effectively exploit the available resources and run queries faster as the number of machines increases.

Compare with Hive

In our final set of experiments, we compared the schedules generated by our algorithms with Hive. Since the goal of Hive is to produce plans that run as fast as possible, we chose to execute the fastest plan in the skyline. Figure 3.21 shows the results. We observe that our system is able to run the queries faster.
Figure 3.18: TPC-H queries on ADP, varying the number of machines (time and money per query for 8, 16, 24, and 32 VMs).

Figure 3.21: Execution time of TPC-H queries on Hive and ADP with 64 GB of data and 32 VMs.

Figure 3.22: 2D space of TPC-H on Hive and ADP with 64 GB of data and 32 VMs.

Figure 3.22 shows the same queries in the time-money 2D space. In order to find the monetary cost of the queries in Hive, we used the total MapReduce CPU time spent, as reported by Hive. This measure does not include the time spent for network I/O, so it is a lower bound of the actual cost. We observe that we are able to run not only faster but also cheaper than MapReduce. Analyzing in depth the performance factors that make our system run faster is out of the scope of this deliverable; here we present a summary.

MapReduce graphs: We support complex dataflows that better fit the model of SQL enhanced with UDFs, whereas MapReduce dataflows are restricted to 3 phases (map, combine, reduce). An example is the group-by operator: the best way to execute it is with a tree of partial group-by operators. It is relatively easy to express such a tree dataflow using our system. In MapReduce, however, we need several jobs to do that when the height of the tree is more than 2, introducing substantial accidental complexity. We also have to be careful about how we assign keys to pairs in the intermediate jobs (all except the last job): assigning the group-by column as the key will not work for trees with height > 2.

Figure 3.19: Execution time of TPC-H on ADP varying the data size, with 32 VMs.

Figure 3.20: Execution time of TPC-H on ADP with 64 GB of data, varying the number of VMs.

MapReduce initialization cost: A small factor influencing the performance of Hive is the initialization cost of each MapReduce job. This becomes a problem especially when the number of jobs is large. Tenzing [9] solved this problem by keeping a pool of Map and Reduce jobs already initialized and used on demand. The initialization in ADP is minimal and is performed only once per job.

Synchronization of each MapReduce phase: Another factor that affects the performance is the synchronization of each MR job. Delays are common in clusters due to network or disk latencies, and they cause machines to waste valuable time waiting for a small number of Map or Reduce jobs to finish. ADP creates a single job per query, so the different phases of the execution are blended together and any possible delays do not affect the whole query.

Partitioning exploitation: We exploit the partitioning of the tables.
If the data is partitioned on the same key as the join key, we perform a hash join without re-partitioning (the equivalent for Hive is a join in the reduce phase). This reduces the data transferred through the network.

Hand-Optimized Scripts: In this work, we focus on scheduling the dataflow graphs, i.e., the final stage of optimization in ADP. Finding the best sequence of joins to execute is considered a given: the query scripts are given as input to the system. For future work, this will serve as a baseline to evaluate different rewriting techniques applied at the first level of optimization of ADP.

MapReduce

We converted the TPC-H queries into MapReduce dataflows using Hive's explain capability. With our language abstractions we can easily express MapReduce dataflows. The following example shows how we express distributed hash joins using the abstraction of MapReduce (join on the reduce side).

distributed create temp table kv_A partition on cA as select cA, ... from input_A;
distributed create temp table kv_B partition on cB as select cB, ... from input_B;
distributed create table output as select reduce(value) from kv_A, kv_B where cA = cB;

Figure 3.23 shows the results of the experiment. We observe that the difference between Hive and our system is not significant for MapReduce dataflows.

Figure 3.23: Execution time of TPC-H expressed as MapReduce on Hive and ADP with 64 GB of data and 32 VMs.

3.7.3 Conclusions of Experiments

We emphasize that this work focuses on the elasticity of dataflow graphs. The experiments with Hive presented in this section serve as a proof that our system is comparable with state-of-the-art database management systems. The results presented here are encouraging, and we leave for future work the comparison using larger datasets and more complex queries running on a cluster with more machines. The conclusions of the experimental study are summarized as follows.

• A very significant elasticity exists in common MapReduce tasks.
• Using our model and the proposed algorithms, we are able to extract more elasticity than MapReduce.
• The proposed solution of dynamic programming refined by simulated annealing is a promising approach to the problem of elastic dataflow processing on the Cloud.
• A good ranking of the operators has a great impact on the quality of the results produced by our algorithms.
• Our algorithms are able to efficiently explore the 2D space and find tradeoffs between time and money.

3.8 Conclusions and Future Work

In this chapter we presented an efficient algorithm for the scheduling of dataflow graphs on the cloud based on both completion time and monetary cost. The algorithm ranks the operators of the dataflow based on their influence.
We implemented the algorithm in our platform for data processing flows on the cloud. Through several experiments we showed that our methods can efficiently explore the 2D space of time and money and are able to find good trade-offs. Our future work moves in several directions.

Theoretical bounds: We want to provide theoretical bounds on the error of the skylines produced by our algorithm. Works such as [39, 40] prove that the approximation of the skyline is a tractable problem.

Elasticity Prediction: The ranking of the operators could be used to predict the elasticity of a dataflow. Preliminary results show that this is feasible: dataflows with low elasticity tend to have some operators that dominate the others, while dataflows with high elasticity do not have heavy backbone operators. This observation leads to an enhancement of the algorithm: if the dataflow is not elastic, run a much simpler algorithm that minimizes completion time; given that the dataflow is not elastic, the money is then also minimized.

Belief Propagation: We also plan to use belief propagation techniques to rank the operators, along with new scheduling algorithms that exploit the rich information that belief propagation algorithms produce, such as the probability distribution of the solutions.

In the next chapter we present the integration of ADP into the Optique platform and the progress that has been made in the communication with other components.

Chapter 4

Integration of ADP With the Optique Platform

In this chapter we present the integration of the distributed query processing and optimization module into the Optique platform during the first year of the project. We describe the communication with other modules and the current state of the process.

4.1 Progress

First, we collaborated closely with other work packages (especially WP2 and WP6) to clearly define the role of ADP in the general Optique architecture. The results of this work have been described in more detail in deliverable D2.1 and also in [1]. Figure 4.1 is borrowed from the aforementioned deliverable and shows the architecture of the distributed query processing component, including the communication with the query transformation module and with external data sources. Then, we focused on the integration of ADP with the other components of Optique. More specifically, we have been implementing the JDBC interface that other Optique Platform components can use. With respect to the integration with the platform, we have created some instances of ADP running on virtual machines on the Optique shared infrastructure. We have imported data from relational databases for both use cases of the project and we have run queries issued by the Query Transformation module.

4.2 JDBC Interface

We have developed a JDBC interface to integrate with the other components that use ADP. We chose JDBC because it is well defined and the integration is very simple (e.g., Quest uses JDBC already). The design is shown in Figure 4.2. The front-end issues queries in the form of HTTP POST requests. The results are given to the client as a JSON stream. The ADP gateway is an HTTP server that listens for requests. An example of a result to a request is shown below.

{schema: [[id, integer],[name, text]], time:5.2, error:[]}
[0, Germany]
[1, Greece]
[2, Italy]
...

The first line contains the metadata of the result. It is a JSON dictionary that includes the schema, the execution time, and the possible errors of the execution. More fields may be added in the future. After the metadata, each line contains one record; each record is a JSON array with the values of the columns. This allows the data to be sent as streams.
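To make the protocol concrete, here is a minimal client sketch (ours, purely illustrative and not part of the ADP code base) that posts a query to the gateway over HTTP and consumes the stream line by line, treating the first line as metadata and every following line as one record. How exactly the query is carried in the POST body is not specified in this deliverable; here it is simply sent as the raw body, and a real client would use a proper JSON parser instead of printing raw lines.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

// Illustrative only: talking to the ADP gateway directly over HTTP, following the stream
// format described above (first line = JSON metadata, then one JSON array per record).
// The URL and the query are assumptions made for the example.
class GatewayClient {
    public static void main(String[] args) throws Exception {
        URL url = new URL("http://10.254.11.18:9090/home/adp/databases/siemens"); // gateway + database path
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("POST");
        conn.setDoOutput(true);
        byte[] body = "select * from message limit 10;".getBytes(StandardCharsets.UTF_8); // hypothetical query
        try (OutputStream out = conn.getOutputStream()) { out.write(body); }

        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(conn.getInputStream(), StandardCharsets.UTF_8))) {
            String metadata = in.readLine();          // JSON dictionary: schema, time, errors
            System.out.println("metadata: " + metadata);
            String record;
            while ((record = in.readLine()) != null)  // each further line is one record (a JSON array)
                System.out.println("record:  " + record);
        }
    }
}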
Figure 4.1: General architecture of the ADP component within the Optique System.

Figure 4.2: The JDBC front-end and the ADP gateway are connected using HTTP protocols transferring JSON encoded streams.

Listing 4.1 presents some sample code for using ADP through the JDBC driver.

Listing 4.1: Using the ADP JDBC driver

Class.forName("madgik.adp.jdbc.AdpDriver");
String siemensDatabase = "http://10.254.11.18:9090/home/adp/databases/siemens";
String siemensQuery = "select * from ...;";
Connection conn = DriverManager.getConnection("jdbc:adp:" + siemensDatabase, "adp", "siemens");
Statement stmt = conn.createStatement();
ResultSet rs = stmt.executeQuery(siemensQuery);
while (rs.next()) {
    ...
}
rs.close();
stmt.close();

The database is specified by the address of the ADP gateway and the path of the database. In our example, the address is 10.254.11.18:9090 and the database path is /home/adp/databases/siemens. The rest of the code is standard JDBC.

4.3 Queries Provided by the Use Cases

We have successfully imported data from the NPD (one of the data sources used in the Statoil use case) and Siemens databases that are set up in Optique's development infrastructure, and we have successfully executed example queries provided by the partners. All the SQL queries can be found in Appendix A.

Table 4.1 shows the execution times for the queries of the NPD dataset of the Statoil use case. We observe that ADP has a constant overhead. This is due to optimization, scheduling, monitoring, reporting, etc., which are essential for distributed processing. The data is not large enough to show the benefits of using a distributed processing engine. However, we observe that even in this setting, for the first query of NPD, we are able to significantly reduce the running time. We anticipate large benefits with larger amounts of data.
Table 4.1: NPD Queries (running times in seconds)
  Query No   MySQL Running Time   ADP Running Time
  01         104.637404           8.313
  02         1.420361             8.291
  03         0.091786             8.340
  04         0.026232             8.248
  05         0.600685             8.331
  06         0.002712             8.462
  07         0.001985             8.421
  08         0.010112             8.474
  09         0.042385             8.428
  10         0.040352             8.377
  12         0.40598              8.328
  13         0.156954             8.491
  14         0.088088             8.478
  15         4.309496             12.236
  16         0.008964             8.482
  17         0.00743              8.436
  18         0.002539             8.430

We measured the overhead using very simple queries and found that it is around 8 seconds. We are working on methods to reduce this pre-processing time, although for large datasets the overhead is insignificant. Figure 4.3 shows the Statoil queries after removing the overhead.

Figure 4.3: The queries of the Statoil use case on MySQL and ADP.

Table 4.2 shows the execution times for the Siemens use case, and Figure 4.4 shows the same data in a plot. We observe that ADP is slower than PostgreSQL. This is because we use only one machine, so we are not able to exploit the parallelism of the queries and amortize the overhead of the distributed engine. When we integrate more tightly with the platform, and especially with the eCloudManager, we will be able to execute the queries in a distributed fashion and accelerate the execution.

Table 4.2: Siemens Queries (running times in seconds)
  Query No   PostgreSQL Running Time   ADP Running Time
  01         4.473                     28.905
  02         1.599                     9.230
  03         0.405                     8.001
  04         0.446                     7.606
  05         0.468                     8.082
  06         0.486                     7.593
  07         0.447                     7.354
  08         0.325                     68.300

Figure 4.4: The queries of the Siemens use case on PostgreSQL and ADP.

Chapter 5

Conclusions

In this deliverable we presented the work that has been carried out in the context of work package WP7 during the first year of the Optique project. We started by describing the ADP system for distributed query processing, with emphasis on the language abstractions of the system. We then proceeded to the main research results regarding the optimization of dataflow execution on the cloud. We also reported on the current status of the implementation and of the communication with other components of the Optique platform.

In the second year of the project we will turn our attention to the efficient execution of continuous and temporal OBDA queries, taking into consideration the work done for tasks 5.2 and 5.3 of WP5. Furthermore, we will pursue a tighter integration with the query transformation module by considering possible feedback that will be useful for the optimization process of query rewriting, such as execution statistics. We will also integrate tightly with the platform, and especially with the eCloudManager, in order to run ADP in distributed mode. Finally, during the first year of the project it became evident that the execution of queries over different data sources will be crucial for successfully tackling the project's use cases, especially the Statoil use case. Work related to federation and query execution over distributed data sources should therefore be pushed earlier, and during the second year we will take into account problems related to this aspect.
We plan to work closely with the query transformation module to achieve efficient execution of federated queries by taking into account factors like the constraints and QoS of different components. 38 Bibliography [1] Azza Abouzeid, Kamil Bajda-Pawlikowski, Daniel J. Abadi, Alexander Rasin, and Avi Silberschatz. HadoopDB: An architectural hybrid of MapReduce and DBMS technologies for analytical workloads. PVLDB, 2(1):922–933, 2009. [2] Amazon. "amazon elastic map reduce, http://aws.amazon.com/elasticmapreduce/", 2011. [3] Apache. "Mahout : Scalable machine-learning and data-mining library, http://mahout.apache.org/", 2010. [4] Apache. Apache Hadoop, http://hadoop.apache.org/, 2011. [5] Apache. TPC-H Benchmark, http://www.tpc.org/tpch/, 2012. [6] S. Bharathi, A. Chervenak, E. Deelman, G. Mehta, Mei-Hui Su, and K. Vahi. Characterization of scientific workflows. pages 1 –10, nov. 2008. [7] Craig Chambers, Ashish Raniwala, Frances Perry, Stephen Adams, Robert R. Henry, Robert Bradshaw, and Nathan Weizenbaum. FlumeJava: easy, efficient data-parallel pipelines. SIGPLAN Not., 45(6):363– 375, 2010. [8] Biswapesh Chattopadhyay, Liang Lin, Weiran Liu, Sagar Mittal, Prathyusha Aragonda, Vera Lychagina, Younghee Kwon, and Michael Wong. Tenzing a sql implementation on the mapreduce framework. PVLDB, 4(12):1318–1327, 2011. [9] Jeffrey Dean and Sanjay Ghemawat. "MapReduce: Simplified Data Processing on Large Clusters". In 6th Symposium on Operating System Design and Implementation, pages 137–150, 2004. [10] E. Deelman and et. al. "Pegasus: Mapping Large Scale Workflows to Distributed Resources in Workflows in e-Science". Springer, 2006. [11] Ewa Deelman et al. Managing large-scale workflow execution from resource provisioning to provenance tracking: The cybershake example. In IEEE e-Science, page 14, 2006. [12] Ewa Deelman, Carl Kesselman, and more. "GriPhyN and LIGO, Building a Virtual Data Grid for Gravitational Wave Scientists". In IEEE HPDC, pages 225–, 2002. [13] Ewa Deelman, Gurmeet Singh, Miron Livny, G. Bruce Berriman, and John Good. The cost of doing science on the cloud: the montage example. In IEEE/ACM SC, page 50, 2008. [14] David J. DeWitt and Jim Gray. Parallel database systems: The future of high performance database systems. Communications of the ACM, 35(6):85–98, 1992. [15] Ronald Fagin, Ravi Kumar, and D. Sivakumar. Comparing top k lists. In SODA, pages 28–36, 2003. [16] Daniela Florescu and Donald Kossmann. Rethinking cost and performance of database systems. SIGMOD Record, 38(1):43–48, 2009. 39 Optique Deliverable D7.1 Techniques for Distributed Query Planning and Execution: One-Time Queries [17] Luis Miguel Vaquero Gonzalez, Luis Rodero Merino, Juan Caceres, and Maik Lindner. "A break in the clouds: towards a cloud definition". Computer Communication Review, 39(1):50–55, 2009. [18] Ronald L. Graham. "Bounds on Multiprocessing Timing Anomalies". SIAM Journal of Applied Mathematics, 17(2):416–429, 1969. [19] Hadapt. "hadapt analytical platform, http://www.hadapt.com/", 2011. [20] Herodotos Herodotou and Shivnath Babu. Profiling, what-if analysis, and cost-based optimization of mapreduce programs. PVLDB, 4(11):1111–1122, 2011. [21] Jin Huang, Bin Jiang, Jian Pei, Jian Chen, and Yong Tang. Skyline distance: a measure of multidimensional competence. Knowl. Inf. Syst., 34(2):373–396, 2013. [22] Yannis E. Ioannidis. The history of histograms (abridged). In VLDB, pages 19–30, 2003. [23] Michael Isard, Mihai Budiu, Yuan Yu, Andrew Birrell, and Dennis Fetterly. 
"Dryad: distributed dataparallel programs from sequential building blocks". In EuroSys, pages 59–72, 2007. [24] Joseph C. Jacob et al. "Montage: a grid portal and software toolkit for science, grade astronomical image mosaicking". Int. J. Comput. Sci. Eng., 4(2):73–87, 2009. [25] Youngdae Kim, Gae won You, and Seung won Hwang. Escaping a dominance region at minimum cost. In DEXA, pages 800–807, 2008. [26] Herald Kllapi, Dimitris Bilidas, Ian Horrocks, Yannis Ioannidis, Ernesto Jiménez-Ruiz, Evgeny Kharlamov, Manolis Koubarakis, and Dmitiy Zheleznyakov. Distributed query processing on the cloud: the optique point of view (short paper). In 10th OWL: Experiences and Directions Workshop (OWLED), 2013. [27] Herald Kllapi, Eva Sitaridi, Manolis M. Tsangaris, and Yannis E. Ioannidis. Schedule optimization for data processing flows on the cloud. In Proc. of SIGMOD, pages 289–300, 2011. [28] Donald Kossmann. The state of the art in distributed query processing. ACM Computing Surveys, 32(4):422–469, 2000. [29] Donald Kossmann. "The state of the art in distributed query processing". ACM Comput. Surv., 32(4):422–469, 2000. [30] Yu-Kwong Kwok and Ishfaq Ahmad. "Benchmarking and Comparison of the Task Graph Scheduling Algorithms". J. Parallel Distrib. Comput., 59(3):381–422, 1999. [31] Jiexing Li, Arnd Christian König, Vivek R. Narasayya, and Surajit Chaudhuri. Robust estimation of resource consumption for sql queries using statistical techniques. PVLDB, 5(11):1555–1566, 2012. [32] Harold Lim, Herodotos Herodotou, and Shivnath Babu. Stubby: A transformation-based optimizer for mapreduce workflows. PVLDB, 5(11):1196–1207, 2012. [33] Michael J. Litzkow, Miron Livny, and Matt W. Mutka. "Condor - A Hunter of Idle Workstations". In ICDCS, pages 104–111, 1988. [34] David T. Liu and Michael J. Franklin. "The Design of GridDB: A Data-Centric Overlay for the Scientific Grid". In VLDB, pages 600–611, 2004. [35] Sergey Melnik, Andrey Gubarev, Jing Jing Long, Geoffrey Romer, Shiva Shivakumar, Matt Tolton, and Theo Vassilakis. Dremel: Interactive analysis of web-scale datasets. PVLDB, 3(1):330–339, 2010. 40 Optique Deliverable D7.1 Techniques for Distributed Query Planning and Execution: One-Time Queries [36] Christopher Olston et al. "Pig latin: a not-so-foreign language for data processing". In SIGMOD Conference, pages 1099–1110, 2008. [37] M. Tamer Özsu and Patrick Valduriez. Principles of Distributed Database Systems. Prentice-Hall, 2 edition, 1999. [38] Suraj Pandey, Linlin Wu, Siddeswara Mayura Guru, and Rajkumar Buyya. A particle swarm optimization-based heuristic for scheduling workflow applications in cloud computing environments. In IEEE AINA, pages 400–407, 2010. [39] Christos H. Papadimitriou and Mihalis Yannakakis. On the approximability of trade-offs and optimal access of web sources. In FOCS, pages 86–92, 2000. [40] Christos H. Papadimitriou and Mihalis Yannakakis. "Multiobjective Query Optimization". In PODS, 2001. [41] Rob Pike, Sean Dorward, Robert Griesemer, and Sean Quinlan. Interpreting the data: Parallel analysis with Sawzall. Scientific Programming, 13(4):277–298, 2005. [42] Garey M. R., Johnson D. S., and Sethi Ravi. "The Complexity of Flowshop and Jobshop Scheduling". Mathematics of operations research, 1(2):117–129, 1976. [43] Srinath Shankar and David J. DeWitt. "Data driven workflow planning in cluster management systems". In HPDC, pages 127–136, 2007. [44] João Nuno Silva, Luís Veiga, and Paulo Ferreira. Heuristic for resources allocation on utility computing infrastructures. 
In Bruno Schulze and Geoffrey Fox, editors, MGC, page 9. ACM, 2008.

[45] Alkis Simitsis, Kevin Wilkinson, Malú Castellanos, and Umeshwar Dayal. QoX-driven ETL design: reducing the cost of ETL consulting engagements. In SIGMOD Conference, pages 953–960, 2009.

[46] Alkis Simitsis, Kevin Wilkinson, Malú Castellanos, and Umeshwar Dayal. Optimizing analytic data flows for multiple execution engines. In SIGMOD Conference, pages 829–840, 2012.

[47] Michael Stonebraker, Paul M. Aoki, Witold Litwin, Avi Pfeffer, Adam Sah, Jeff Sidell, Carl Staelin, and Andrew Yu. Mariposa: A wide-area distributed database system. VLDB J., 5(1):48–63, 1996.

[48] Ashish Thusoo et al. Hive - a petabyte scale data warehouse using Hadoop. In ICDE, pages 996–1005, 2010.

[49] Mads Torgersen. Querying in C#: how language integrated query (LINQ) works. In OOPSLA Companion, pages 852–853, 2007.

[50] Manolis M. Tsangaris et al. Dataflow processing and optimization on grid and cloud infrastructures. IEEE Data Eng. Bull., 32(1):67–74, 2009.

[51] Xiaoli Wang, Yuping Wang, and Hai Zhu. Energy-efficient task scheduling model based on MapReduce for cloud computing using genetic algorithm. JCP, 7(12):2962–2970, 2012.

[52] Fatos Xhafa and Ajith Abraham, editors. Metaheuristics for Scheduling in Distributed Computing Environments. Studies in Computational Intelligence. Springer, 2008.

[53] Yuan Yu, Michael Isard, Dennis Fetterly, Mihai Budiu, Úlfar Erlingsson, Pradeep Kumar Gunda, and Jon Currey. DryadLINQ: A system for general-purpose distributed data-parallel computing using a high-level language. In OSDI, pages 1–14, 2008.

Glossary

ADP     Athena Distributed Processing
API     Application Programming Interface
ART     ADP Run Time
CFM     Connection and Function Manager
DBMS    Data Base Management System
DP      Dynamic Programming
ETL     Extract Transform Load
FIFO    First In First Out
HTTP    Hypertext Transfer Protocol
JDBC    Java Database Connectivity
JSON    JavaScript Object Notation
IaaS    Infrastructure as a Service
NL      Nested Loops Optimizer
NPD     Norwegian Petroleum Directorate
OBDA    Ontology-based Data Access
PaaS    Platform as a Service
PL      Pipeline
PW      Parallel Wave
SaaS    Software as a Service
QoS     Quality of Service
RDBMS   Relational Data Base Management System
RMI     Remote Method Invocation
SA      Simulated Annealing
SnF     Store and Forward
SQL     Structured Query Language
UDF     User Defined Function
VM      Virtual Machine
W3C     World Wide Web Consortium
WP      Work Package

Appendix A

Test Queries from the Use Cases

This appendix contains the test queries that have been provided by the use case partners and have been executed on the ADP instances on the fluidOps infrastructure.

A.1 NPD Queries

Query No 1

SELECT DISTINCT a.prlName, a.cmpLongName, a.prlLicenseeInterest, a.prlLicenseeDateValidTo
FROM licence_licensee_hst a
WHERE a.prlLicenseeDateValidTo IN
  (SELECT MAX(b.prlLicenseeDateValidTo)
   FROM licence_licensee_hst b
   WHERE a.prlName = b.prlName
   GROUP BY b.prlName)
ORDER BY a.prlName

Query No 2

SELECT a.prlName, a.cmpLongName, a.prlOperDateValidFrom
FROM licence_oper_hst a
WHERE a.prlOperDateValidFrom IN
  (SELECT MAX(b.prlOperDateValidFrom)
   FROM licence_oper_hst b
   WHERE a.prlName = b.prlName)
ORDER BY a.prlName
Query No 3

SELECT prlName, prlDateGranted, prlDateValidTo
FROM licence
ORDER BY prlName

Query No 4

SELECT wlbProductionLicence, MIN(wlbEntryDate)
FROM wellbore_exploration_all
GROUP BY wlbProductionLicence
ORDER BY wlbProductionLicence

Query No 5

SELECT prlName, cmpLongName, prlLicenseeDateValidFrom as fromDate, prlLicenseeDateValidTo as toDate
FROM licence_licensee_hst
ORDER BY prlName, fromDate

Query No 6

SELECT fldName, SUBSTRING_INDEX(wlbName, '-', 1)
FROM field
ORDER BY fldName

Query No 7

SELECT fldName, fldRemainingOE, fldRemainingOil, fldRemainingGas, fldRemainingCondensate
FROM field_reserves
ORDER BY fldRemainingOE DESC

Query No 8

SELECT fclBelongsToName as field, fclName as facility, fclKind as type
FROM facility_fixed
WHERE fclBelongsToKind = 'FIELD'
ORDER BY field

Query No 9

SELECT wlbProductionLicence, COUNT(DISTINCT wlbWell)
FROM (
  SELECT wlbProductionLicence, wlbWell FROM wellbore_development_all
  UNION
  SELECT wlbProductionLicence, wlbWell FROM wellbore_exploration_all
  UNION
  SELECT wlbProductionLicence, wlbWell FROM wellbore_shallow_all
) AS t
GROUP BY wlbProductionLicence

Query No 10

SELECT DISTINCT *
FROM (
  (SELECT wlbName, (wlbTotalCoreLength * 0.3048) AS lenghtM
   FROM wellbore_core
   WHERE wlbCoreIntervalUom = '[ft]' AND (wlbTotalCoreLength * 0.3048) > 30)
  UNION
  (SELECT wlbName, wlbTotalCoreLength AS lenghtM
   FROM wellbore_core
   WHERE wlbCoreIntervalUom = '[m]' AND wlbTotalCoreLength > 30)
) as t

Query No 11

SELECT DISTINCT strat.lsuName, cores.wlbName AS wlbName
FROM wellbore_core AS cores, strat_litho_wellbore AS strat
WHERE cores.wlbNpdidWellbore = strat.wlbNpdidWellbore
  AND ((cores.wlbCoreIntervalUom = '[ft]'
        AND (GREATEST(cores.wlbCoreIntervalTop * 0.3048, strat.lsuTopDepth)
             < LEAST(cores.wlbCoreIntervalBottom * 0.3048, strat.lsuBottomDepth)))
    OR (cores.wlbCoreIntervalUom = '[m]'
        AND (GREATEST(cores.wlbCoreIntervalTop, strat.lsuTopDepth)
             < LEAST(cores.wlbCoreIntervalBottom, strat.lsuBottomDepth))))
ORDER BY strat.lsuName, wlbName
Query No 12

SELECT DISTINCT cores.wlbName, cores.lenghtM, wellbore.wlbDrillingOperator, wellbore.wlbCompletionYear
FROM (
  (SELECT wlbName, wlbNpdidWellbore, (wlbTotalCoreLength * 0.3048) AS lenghtM
   FROM wellbore_core WHERE wlbCoreIntervalUom = '[ft]')
  UNION
  (SELECT wlbName, wlbNpdidWellbore, wlbTotalCoreLength AS lenghtM
   FROM wellbore_core WHERE wlbCoreIntervalUom = '[m]')
) as cores,
(
  (SELECT wlbNpdidWellbore, wlbDrillingOperator, wlbCompletionYear FROM wellbore_development_all)
  UNION
  (SELECT wlbNpdidWellbore, wlbDrillingOperator, wlbCompletionYear FROM wellbore_exploration_all)
  UNION
  (SELECT wlbNpdidWellbore, wlbDrillingOperator, wlbCompletionYear FROM wellbore_shallow_all)
) as wellbore
WHERE wellbore.wlbNpdidWellbore = cores.wlbNpdidWellbore
  AND wellbore.wlbDrillingOperator LIKE '%STATOIL%'
  AND wlbCompletionYear >= 2008
  AND lenghtM > 50
ORDER BY cores.wlbName

Query No 13

SELECT DISTINCT cores.wlbName, cores.lenghtM, wellbore.wlbDrillingOperator, wellbore.wlbCompletionYear
FROM (
  (SELECT wlbName, wlbNpdidWellbore, (wlbTotalCoreLength * 0.3048) AS lenghtM
   FROM wellbore_core WHERE wlbCoreIntervalUom = '[ft]')
  UNION
  (SELECT wlbName, wlbNpdidWellbore, wlbTotalCoreLength AS lenghtM
   FROM wellbore_core WHERE wlbCoreIntervalUom = '[m]')
) as cores,
(
  (SELECT wlbNpdidWellbore, wlbDrillingOperator, wlbCompletionYear FROM wellbore_development_all)
  UNION
  (SELECT wlbNpdidWellbore, wlbDrillingOperator, wlbCompletionYear FROM wellbore_exploration_all)
  UNION
  (SELECT wlbNpdidWellbore, wlbDrillingOperator, wlbCompletionYear FROM wellbore_shallow_all)
) as wellbore
WHERE wellbore.wlbNpdidWellbore = cores.wlbNpdidWellbore
  AND wellbore.wlbDrillingOperator LIKE '%STATOIL%'
  AND wlbCompletionYear < 2008
  AND lenghtM < 10
ORDER BY cores.wlbName

Query No 14

SELECT field.fldName, prfPrdOeNetMillSm3, prfPrdOilNetMillSm3, prfPrdGasNetBillSm3,
       prfPrdNGLNetMillSm3, prfPrdCondensateNetMillSm3
FROM field_production_monthly a, field
WHERE prfNpdidInformationCarrier = fldNpdidField
  AND prfPrdOeNetMillSm3 IN
      (SELECT MAX(prfPrdOeNetMillSm3)
       FROM field_production_monthly b
       WHERE a.prfNpdidInformationCarrier = b.prfNpdidInformationCarrier
       GROUP BY b.prfInformationCarrier)
ORDER BY prfPrdOeNetMillSm3 DESC

Query No 15

SELECT field.fldName, AVG(prfPrdOeNetMillSm3), AVG(prfPrdOilNetMillSm3), AVG(prfPrdGasNetBillSm3),
       AVG(prfPrdNGLNetMillSm3), AVG(prfPrdCondensateNetMillSm3)
FROM field_production_yearly, field
WHERE prfNpdidInformationCarrier = fldNpdidField
  AND prfYear < 2013 -- exclude current, and incomplete, year
GROUP BY prfInformationCarrier
ORDER BY AVG(prfPrdOeNetMillSm3) DESC

Query No 16
SELECT field.fldName, SUM(prfPrdOeNetMillSm3), SUM(prfPrdOilNetMillSm3), SUM(prfPrdGasNetBillSm3),
       SUM(prfPrdNGLNetMillSm3), SUM(prfPrdCondensateNetMillSm3)
FROM field_production_yearly, field
WHERE prfNpdidInformationCarrier = fldNpdidField
  -- AND prfYear < 2013 -- exclude current, and incomplete, year
GROUP BY prfInformationCarrier
ORDER BY SUM(prfPrdOeNetMillSm3) DESC

Query No 17

SELECT prfInformationCarrier, SUM(prfPrdOilNetMillSm3) AS sumOil, SUM(prfPrdGasNetBillSm3) AS sumGas
FROM field_production_monthly AS prod, field
WHERE prod.prfNpdidInformationCarrier = field.fldNpdidField
  AND prod.prfYear = 2010
  AND prod.prfMonth >= 1 AND prod.prfMonth <= 6
  AND field.cmpLongName = 'Statoil Petroleum AS'
GROUP BY prfInformationCarrier
ORDER BY prfInformationCarrier

A.2 Siemens Queries

Query No 1

SELECT value FROM measurement
WHERE sensor=57 and "Timestamp" > '2009-01-01 00:00:00' and "Timestamp" < '2009-12-31 23:00:00'

Query No 2

SELECT "eventtext", COUNT("eventtext") AS c FROM message
WHERE "Timestamp" >= '2009-01-01 00:00:00' AND "Timestamp" <= '2009-01-03 23:00:00'
GROUP BY "eventtext" ORDER BY c DESC

Query No 3

SELECT "eventtext", COUNT("eventtext") AS c FROM message
WHERE "Timestamp" >= '2009-01-01 00:00:00' AND "Timestamp" <= '2009-01-03 23:00:00'
GROUP BY "eventtext" ORDER BY c DESC LIMIT 10

Query No 4

SELECT "assembly", COUNT("message") AS eventFrequency FROM message
WHERE "category"=6 AND "Timestamp" >= '2005-01-01 00:00:00' AND "Timestamp" <= '2005-12-31 23:00:00'
GROUP BY "assembly" ORDER BY eventFrequency DESC LIMIT 10

Query No 5

SELECT "category", COUNT("category") AS categoryFrequency FROM message
WHERE "Timestamp" > '2005-03-25 00:00:00' AND "Timestamp" < '2006-04-04 00:00:00'
GROUP BY "category" ORDER BY categoryFrequency DESC LIMIT 5

Query No 6

SELECT "Timestamp" FROM message
WHERE "eventtext"='Controller fault' AND "Timestamp" > '2007-11-15 00:00:00' AND "Timestamp" < '2007-12-31 23:59:59'
ORDER BY "Timestamp" ASC LIMIT 1

Query No 7

SELECT eventtext, COUNT(eventtext) AS freq FROM message
WHERE "assembly"=6 AND "Timestamp" >= '2005-01-01 00:00:00' AND "Timestamp" <= '2005-01-01 01:44:47'
GROUP BY eventtext ORDER BY freq DESC LIMIT 10

Query No 8

SELECT COUNT(sensor) FROM measurement, message
WHERE measurement.sensor=35 AND measurement."Timestamp"=message."Timestamp"
  AND measurement."Timestamp" >= '2005-01-01 00:00:00' AND measurement."Timestamp" <= '2005-01-10 00:00:00'
  AND measurement.value > 745