OGSA-DAI DQP
A Developer’s View

Bartosz Dobrzelecki
Applications Consultant, EPCC
bartosz@epcc.ed.ac.uk
+44 131 650 5137

User’s View

DQPConfiguration.xml

    <DQPConfiguration xmlns="http://ogsadai.org.uk/dqp/namespaces/2008/12">
      <dataResources>
        <resource url="http://localhost:8080/dai/services"
                  dsos="DataSourceService"
                  drerID="DataRequestExecutionResource"
                  resourceID="MySQLResource"
                  isLocal="true"/>
        <resource url="http://localhost:8090/dai/services"
                  dsos="DataSourceService"
                  drerID="DataRequestExecutionResource"
                  resourceID="Resource2"/>
        <resource url="http://localhost:8095/dai/services"
                  dsos="DataSourceService"
                  drerID="DataRequestExecutionResource"
                  resourceID="MySQLResource"
                  alias="MySQL"/>
      </dataResources>
      <evaluationResources>
        <resource url="http://localhost:8085/dai/services"
                  drerID="DataRequestExecutionResource"/>
      </evaluationResources>
    </DQPConfiguration>

    MySQLResource_employee
    MySQL_employee

Query processing steps

[Figure: query processing pipeline. SQL query expression -> SQL Parser -> Abstract Syntax Tree -> LQP Builder -> Logical Query Plan -> Optimiser chain -> Optimised LQP -> Partitioner -> Partitioned LQP -> Workflow Builder -> OGSA-DAI Requests and Sub-workflows -> Execute -> Results]

Query execution

[Figure: an SQL Query is sent as an OGSA-DAI Request to the OGSA-DAI DQP Coordinator, which returns the Result. The Coordinator sends SubWorkflow OGSA-DAI Requests to OGSA-DAI Data Nodes 1, 2 and 3; data flows between the nodes; the databases are DB1, DB2 and DB3.]

Producing Abstract Syntax Tree (AST)

• First step: parse SQL and generate AST.
• We use ANTLR 3 to generate code from grammars.
• Two grammars:
  – SQL to AST
  – AST to SQL (tree grammar)
• The tree grammar is used in our OGSA-DAI Views product, which implements read-only SQL views by rewriting the AST.
• In DQP the tree grammar is used to generate string representations for column definitions, conditions, etc.

AST is a contract

• We do not expect AST to be changed.
• However, we do provide a mechanism for exposing new operators to the language surface.

    SELECT A.aname AS name
    FROM aircraft A, certified C
    WHERE A.aid = C.aid

Relation valued functions

    SELECT A.aname AS name
    FROM outerUnion(
      (SELECT * FROM aircraft A),
      (SELECT * FROM certified C),
      'ALL') A

Logical Query Plan

• Second step: translate AST to a logical query plan.

    SELECT aname AS name FROM aircraft WHERE aid = 10

• Operator anatomy.

[Figure: an Operator has an OperatorID, a parent, children, a Heading (list of Attributes) and operator-specific internals; an Attribute is (name, source, type).]

Operators

• Behaviour defined in the Operator interface
  – Validation – checks if operator gets all the input data it needs, detects missing attributes, ambiguities, deals with correlation, performs type checking.
  – Update – updates operator internals after it was (re)connected.
• Operator, Heading, Attribute objects can be annotated with arbitrary annotations (key :String -> value :Object)
  – Sample uses:
    – Attribute is sorted, correlated, temporary
    – Which physical algorithm for join operator
    – Estimated cardinality
  – There will be a set of default annotations

Operator family

• Unary:
  – SELECT
  – PROJECT
  – RENAME
  – DUPLICATE ELIMINATION
  – SORT
  – GROUP BY
  – SCALAR GROUP BY
  – ONE ROW ONLY
  – TABLE SCAN
  – EXCHANGE
• Binary:
  – INNER JOIN
  – PRODUCT
  – UNION
  – INTERSECTION
  – DIFFERENCE
  – FULL OUTER JOIN
  – [LEFT][RIGHT] OUTER JOIN
  – [ANTI] SEMI JOIN
  – APPLY
  – [UNARY][BINARY][SCAN] REL_FUNCTION

Data Dictionary

• Data Dictionary provides information about federated data resources, available evaluators (DRERs), logical and physical table schemas.
• It is populated when the resource is initialised.
• Most of the entries can be annotated – you can plug in your own code to be executed on initialisation – you may want to annotate attributes with histograms.
• The TABLE_SCAN operator builds its Heading using data from the Data Dictionary (on update).
• After assembling, the LQP is validated.

Optimisation

• After successful validation the LQP is optimised by a chain of optimisers.
• This chain is defined as part of the Compiler configuration.
• Optimisers need to implement a single method:

    Operator optimise(Operator lqpRoot,
                      DataDictionary dataDictionary,
                      CompilerConfiguration compilerConfiguration)
        throws LQPException;

Default optimisers

• Query normalisation + heuristics
  – Remove redundant operators
  – Select Push Down + implicit join detection
  – Rename Pull Up
  – Project Pull Up
• Join ordering
• Partitioning – finding best places for EXCHANGE operators
• TABLE_SCAN implosion – pushing as much processing as we can to the RDBMS

Normalisation

AST to LQP translator is not trying to be smart - it takes it easy.

    SELECT Temp.name, Temp.AvgSalary
    FROM (
      SELECT A.aid, A.aname AS name, AVG(E.salary) AS AvgSalary
      FROM aircraft A, certified C, employees E
      WHERE A.aid = C.aid AND C.eid = E.eid AND A.cruisingrange > 1000
      GROUP BY A.aid, A.aname
    ) AS Temp

LQP is then normalised by a chain of optimisers.

Join Ordering

• Not there yet.
• Will be based on the same cost model as in OGSA-DQP.
• We will also reuse the same algorithm that produces left-deep trees.
• More sophisticated models and algorithms (considering bushy trees, semi joins, etc.) will be implemented later on.
• You can always implement your own and replace the default.

Partitioning optimiser

• Pluggable optimiser decides how to split LQP into partitions by inserting the EXCHANGE operator.
• Default optimiser will put most load on the “local” evaluator (DRER) – otherwise it will choose randomly.

TABLE_SCAN Implosion

• Not there yet.
• We will always try to push as much processing as we can to the RDBMS.
• TABLE_SCAN “eats” as much of a tree as it can and builds up an equivalent SQL query.

    SELECT * FROM
      ( SELECT * FROM aircraft
        WHERE aircraft.cruisingrange>1000 ) aircraft
      JOIN
      ( SELECT * FROM certified ) certified
      ON aircraft.aid=certified.aid

SQL support level of a relational resource

• TABLE_SCAN implosion needs to know what level of SQL is supported by the underlying resource.
  – fully featured RDBMS
  – simple SQL interface for csv files supporting only simple filtering of records
  – a web service wrapper
• Relational resources will expose a resource property – a serialised object implementing the SQLSupportLevel interface, similar to that defined by JDBC: java.sql.DatabaseMetaData

    public boolean supportsColumnAliasing()
    public boolean supportsCorrelatedSubqueries()
    public boolean supportsSubqueriesInComparisons()
    public boolean supportsSubqueriesInExists()
    ...

Executing the plan

• Build phase
  – Each LQP Operator has an associated Activity Pipeline Builder class which takes in an Operator and returns an Activity Output.
  – Most operators can be mapped directly to a single Activity.
  – Some operators may have different implementations (for example the join operator); the builder chooses the default one or is guided by an Annotation.
  – Operator -> Builder class mapping is configurable.
• Setup phase
  – For each EXCHANGE a Data Source Resource is created.
• Execution phase
  – All workflows (partitions) are submitted.
  – Coordinator always executes a sub-workflow (with at least the EXCHANGE_CONSUMER operator)

Extensibility points

• New Operator can be introduced by mapping a relation valued function to an Operator, and the Operator to an Activity Pipeline Builder.
• New Operator can be included in the default query normalisation by providing strategies for SELECT push down, RENAME/PROJECT pull up.
• Optimisation chain is configurable – it is easy to plug in new LQP transformations.
• Alternative physical operator implementations can be introduced by replacing default Activity Pipeline Builders – annotations can be used to choose between several implementations.
• Scalar, aggregate and relation valued User Defined Functions will be supported.

Introducing a new operator

    SELECT A.aname AS name
    FROM outerUnion(
      (SELECT * FROM aircraft A),
      (SELECT * FROM certified C),
      'ALL') A

• LQP Builder will check if there is a mapping from outerUnion -> Operator and use the Operator object in the LQP.
• If there is no mapping – look for a relation valued function outerUnion in the Function Repository and connect the generic RELVAL_FUNCTION operator.

CompilerConfiguration.xml

    <LQPCompilerConfiguration xmlns="http://ogsadai.org.uk/dqp/namespaces/2008/12">
      <builders operator="GROUP_BY"
                default="uk.org.ogsadai.dqp.execute.workflow.GroupBy"/>
      <builders operator="INNER_THETA_JOIN"
                default="uk.org.ogsadai.dqp.execute.workflow.ProductSelect">
        <builder name="HASH_JOIN"
                 class="uk.org.ogsadai.dqp.execute.workflow.HashJoin"/>
      </builders>
      <relationFunction name="outerUnion" operator="OUTER_UNION"/>
      <operator name="OUTER_UNION"
                class="uk.org.ogsadai.dqp.lqp.operators.extra.OuterUnionOperator"/>
      <builders operator="OUTER_UNION"
                default="uk.org.ogsadai.dqp.execute.workflow.OuterUnion"/>
      <optimisationChain>
        <optimiser class="uk.org.ogsadai.dqp.lqp.optimiser.QueryNormaliser"/>
        <optimiser class="uk.org.ogsadai.dqp.lqp.optimiser.SelectPushDown"/>
      </optimisationChain>
    </LQPCompilerConfiguration>

User Defined Functions

• Three types
  – Scalar
      SELECT editDistance(a.name, 'John') FROM a
  – Aggregate
      SELECT * FROM a HAVING a.age<median(a.age)
  – Relation valued
    – Unary
        SELECT * FROM sample(a, 0.75)
    – Binary
        SELECT * FROM fuse((SELECT * FROM a), (SELECT * FROM b))
    – Scan (tuple producing)
        SELECT * FROM randomInt(0, 10, 1000)
• Implementations of sub
interfaces of the Function interface.
• Function Repository is part of the Data Dictionary.

Discovering Evaluator Capabilities

• We assume that every evaluation resource has the same set of activities and UDFs.
• Checking if activities are supported is quite easy
  – Get list of supported activities from each evaluation resource (DRER)
  – Ask Activity Pipeline Builder for a list of required activities
• Checking for UDF availability is more tricky
  – Introduce UDF Resource + “GetUDFSchemas” activity
  – Match by name and parameter list, types, return type
  – Relation valued functions are problematic – they need to validate themselves inside LQP and provide headings – this is dynamic – function schema as a script?
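
The two checks described above can be sketched in a few lines of Java. This is only an illustration under stated assumptions, not DQP code: the class CapabilityCheck, the record UdfSchema, the evaluator IDs and the activity names are all hypothetical, and the record syntax needs Java 16 or later. The activity check is a set difference between what the builders require and what each DRER reports; the UDF check is an exact match on name, parameter types and return type. Relation valued functions are left out because their headings are dynamic.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical sketch: none of these names come from the OGSA-DAI DQP code base.
class CapabilityCheck {

    /** Hypothetical UDF schema, compared by name, parameter types and return type. */
    record UdfSchema(String name, List<String> parameterTypes, String returnType) {}

    /**
     * For each evaluator (DRER), returns the required activities it does not
     * report as supported. Evaluators that support everything are omitted.
     */
    static Map<String, Set<String>> missingActivities(
            Set<String> required, Map<String, Set<String>> supportedByEvaluator) {
        Map<String, Set<String>> missing = new LinkedHashMap<>();
        for (Map.Entry<String, Set<String>> e : supportedByEvaluator.entrySet()) {
            Set<String> gap = new TreeSet<>(required); // copy, then subtract
            gap.removeAll(e.getValue());
            if (!gap.isEmpty()) {
                missing.put(e.getKey(), gap);
            }
        }
        return missing;
    }

    /** True if the evaluator advertises a schema identical to the one needed. */
    static boolean udfAvailable(UdfSchema needed, Set<UdfSchema> advertised) {
        return advertised.contains(needed);
    }

    public static void main(String[] args) {
        // Placeholder activity names and evaluator IDs, for illustration only.
        Set<String> required = Set.of("scan", "sort");
        Map<String, Set<String>> supported = new LinkedHashMap<>();
        supported.put("evaluator-1", Set.of("scan", "sort", "hashJoin"));
        supported.put("evaluator-2", Set.of("scan"));
        System.out.println(missingActivities(required, supported));

        UdfSchema needed =
                new UdfSchema("editDistance", List.of("VARCHAR", "VARCHAR"), "INTEGER");
        System.out.println(udfAvailable(needed, Set.of(needed)));
    }
}
```

An empty map from missingActivities means every evaluator can run the plan; a non-empty one names the evaluators that could not host a partition and why.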