Ingres Top “n” Changes
Feb. 13, 2008

1. Introduction

This document discusses the changes required to allow Ingres to compile and execute top “n” queries with reasonable efficiency. Included are changes to QEF to support the possible need to restart a query if the required “n” rows are not returned by the optimal query plan, and to implement a “priority queue” operation in which only the first “n” rows according to some ordering are returned. Changes to OPF are required to optimize certain top “n” queries using the scheme described in the doctoral thesis of Donko Donjerkovic, and to identify the need for priority queues at appropriate locations in the compiled query plan. Changes are required to optimizedb to augment histograms with the maximum error estimate that is subsequently used by OPF to perform the probabilistic determination of the best top “n” query plan, as described in Donko’s thesis.

2. Probabilistic Top “n” Query Optimization

The top “n” optimization approach espoused in Donko’s thesis is based on the technique of compiling a range predicate (the so-called cutoff predicate) into the query, selecting only rows whose ranking attribute is less than (or greater than, depending on whether the ORDER BY clause is ascending or descending) some constant value. Any query plan built around a cutoff predicate must be prepared for the possibility that the value of the cutoff constant (called κ in Donko’s thesis) will result in fewer than “n” rows being selected, and hence for the need to restart the query to retrieve the remaining rows.

The intuition behind Donko’s thesis is to develop plans for successive κ values, each of which has an expected cost that incorporates not only the cost estimate of the optimal plan (with cutoff predicate), but also the cost of a restart plan factored by the probability of restart. The resulting query plan can be thought of as the union of the optimal top “n” plan with the restart plan (which simply uses the complement of the top “n” cutoff predicate to return the remaining rows). A “stop after n rows” operator at the top of the query plan (already implemented in Ingres) will prevent execution of the restart component when the optimal plan returns at least “n” rows.

Creating this plan (or pair of complementary plans) involves multiple calls to the query optimizer with successive κ values. Optimization proceeds as usual for each pair of queries, computing the cost of the best plan in each call. However, in addition to plan costs, the optimizer must also return the probability that the optimal plan (the non-restart plan) will NOT return “n” rows. The expected cost of the plans for a given κ value is the sum of the cost estimate for the optimal plan and the cost estimate for the restart plan multiplied by the probability that it will be required. The chosen κ is the value that minimizes this expected cost.
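To make the cost model concrete, here is a minimal sketch in C of the expected cost calculation and the choice of κ. It is purely illustrative and is not OPF code: the KAPPA_EST structure, the function names and the cost and probability numbers are all hypothetical.

    /* Illustrative sketch only -- not OPF code.  Shows the expected-cost
    ** calculation used to compare candidate kappa values: the cost of the
    ** cutoff plan plus the cost of the restart plan weighted by the
    ** probability that the restart will actually be needed.
    */
    #include <stdio.h>

    typedef struct
    {
        double kappa;         /* candidate cutoff constant */
        double opt_cost;      /* cost of best plan with the cutoff predicate */
        double restart_cost;  /* cost of the complementary (restart) plan */
        double p_restart;     /* probability the cutoff plan returns < n rows */
    } KAPPA_EST;

    static double
    expected_cost(const KAPPA_EST *e)
    {
        return e->opt_cost + e->p_restart * e->restart_cost;
    }

    int
    main(void)
    {
        /* Hypothetical numbers for three candidate kappa values.  A larger
        ** kappa makes the cutoff plan more expensive but the restart less
        ** likely; the expected cost trades the two off.
        */
        KAPPA_EST est[] =
        {
            { 100.0,  40.0, 300.0, 0.60 },   /*  40 + 0.60*300 = 220   */
            { 250.0,  70.0, 280.0, 0.15 },   /*  70 + 0.15*280 = 112   */
            { 500.0, 130.0, 260.0, 0.02 },   /* 130 + 0.02*260 = 135.2 */
        };
        int best = 0;
        int i;

        for (i = 1; i < 3; i++)
            if (expected_cost(&est[i]) < expected_cost(&est[best]))
                best = i;

        printf("chosen kappa = %g (expected cost %g)\n",
               est[best].kappa, expected_cost(&est[best]));
        return 0;
    }

In OPF itself, the per-κ plan costs and restart probabilities would of course be produced by the enumeration passes described in the next section, rather than supplied as constants.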
3. OPF Optimization of Top “n”

Based on discussions with Donko, it is proposed that the logic to drive the identification of the optimal κ value for the cutoff predicate should be located in opj_joinop(), as a loop outside the loop that enumerates each of the subqueries identified by query rewrite. Rewrite will identify queries to be optimized in this way, and will flag subqueries for which the cutoff predicate is required. The opj_joinop() loop will then propose successive values of κ and enumerate the subqueries with each such value. Expected costs of the top “n” query will be returned for each κ, and the plan with the lowest expected cost will be chosen once the costs for successive κ’s differ by a sufficiently small amount. For each plan optimized with a κ value in the cutoff predicate, a plan will also be optimized with the complement of the cutoff predicate to act as the restart query, producing the remaining rows in the event that the optimized plan doesn’t produce the required “n” rows.

The rewrite changes will be straightforward, likely requiring a few new pieces of information to be added to the global state structure of OPF (OPS_STATE) and to the subquery structures (OPS_SUBQUERY). Even the loop in opj_joinop() should be relatively simple to code. Some mechanism will be needed to generate the successive κ values (Donko used the golden section search algorithm, though something even more basic such as binary search could be used). The loop then needs only to keep track of the expected cost values from each κ and the corresponding CO-trees, and apply the termination condition appropriately.

The difficult part of the OPF work will be to generate the probabilities that the compiled plan will return fewer than the “n” rows requested. This won’t require a rewrite of OPF cost analysis, but it will require new information to be tracked as plans are assembled. Specifically, we will need to accumulate probabilities that the fragment being compiled will produce the required “n” rows. New logic will be required in the selectivity estimation functions to use the error bounds built by optimizedb to produce these probabilities. This is the part of Donko’s algorithms that I don’t fully understand yet, but I’m getting there.

Very rough estimates of implementation effort for the OPF changes are as follows:
- rewrite changes to identify top “n” optimization potential and update the OPS_STATE and OPS_SUBQUERY structures with appropriate information: 1 week,
- opj_joinop() changes to drive cutoff parameter estimation: 1-2 weeks,
- code generation changes to flag sorts requiring priority queue processing: 2 days,
- query optimizer changes to produce probability estimates: 1 month,
- adding referential relationship catalogs (see Implementation Issues, below): 2 weeks.

4. optimizedb Changes

Estimation of the expected costs of different top “n” query plans with different cutoff values using Donko’s probabilistic optimization requires that histograms be augmented with a maximum selectivity error value. This is effectively the maximum error that Ingres would produce for range predicates involving any value of the column. A second pass over the data from which a histogram is built would be required in optimizedb, in which the selectivity of “x > v” is estimated for each value v in the column. The estimated value (computed using the same mechanism OPF uses) is then compared to the actual value (which can be computed, since the real data is being processed). The largest difference between estimated and actual selectivity is then recorded with the histogram. The error estimate can be saved either as a new column in iistatistics (a catalog change) or as a new field at the end of the free form histogram data in iihistogram. A rough estimate of the implementation effort for these changes is 2 weeks.
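The following sketch illustrates the shape of the proposed second pass. It is an illustration only: the HISTOGRAM layout is a simplification rather than the iihistogram format, the column data is made up, and the estimator is a generic linear-interpolation range estimator standing in for the mechanism OPF actually uses.

    /* Schematic sketch of the proposed second optimizedb pass -- not actual
    ** optimizedb code.  For every value v in the column the selectivity of
    ** "x > v" is estimated from the histogram, compared with the true
    ** selectivity computed from the data, and the largest absolute
    ** difference is retained as the maximum error.
    */
    #include <stdio.h>
    #include <math.h>

    #define NCELLS 4

    typedef struct
    {
        double upper[NCELLS];   /* upper boundary value of each cell */
        double frac[NCELLS];    /* fraction of the rows falling in each cell */
    } HISTOGRAM;

    /* Estimate the selectivity of "x > v": whole cells above v count in
    ** full, and the cell containing v contributes a linearly interpolated
    ** portion (uniformity assumed within a cell).
    */
    static double
    est_gt(const HISTOGRAM *h, double v, double lowbound)
    {
        double sel = 0.0;
        double lower = lowbound;
        int    i;

        for (i = 0; i < NCELLS; i++)
        {
            if (v <= lower)
                sel += h->frac[i];              /* whole cell above v */
            else if (v < h->upper[i])
                sel += h->frac[i] *
                       (h->upper[i] - v) / (h->upper[i] - lower);
            lower = h->upper[i];
        }
        return sel;
    }

    int
    main(void)
    {
        /* Made-up sorted column data and a 4-cell histogram built over it. */
        double    col[] = { 1, 2, 2, 3, 5, 8, 13, 21, 34, 55 };
        int       nrows = 10;
        HISTOGRAM h = { { 5, 15, 35, 55 }, { 0.5, 0.2, 0.2, 0.1 } };
        double    maxerr = 0.0;
        double    actual, est, err;
        int       above;
        int       i, j;

        for (i = 0; i < nrows; i++)
        {
            above = 0;
            for (j = 0; j < nrows; j++)     /* actual selectivity of x > v */
                if (col[j] > col[i])
                    above++;

            actual = (double)above / nrows;
            est = est_gt(&h, col[i], 0.0);
            err = fabs(est - actual);
            if (err > maxerr)
                maxerr = err;
        }

        /* maxerr is the value that would be recorded with the histogram
        ** (in iistatistics or iihistogram) for OPF's probability estimates.
        */
        printf("maximum selectivity error = %g\n", maxerr);
        return 0;
    }

The point is only the shape of the computation: estimate the selectivity for each value, compare it against the true selectivity, and keep the largest difference.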
5. QEF Changes

Two changes of significance are required in QEF. The first is the ability to restart a query that hasn’t returned the requisite “n” rows from the optimized top “n” query plan. As described earlier, a complementary plan will be compiled that returns the result rows not produced by the optimized plan – effectively reversing the cutoff predicate. This could be done in one of several ways. A single query plan could be produced which UNIONs the optimized plan with the complementary plan, only executing the complementary plan if the optimized plan fails to produce “n” rows. This is definitely the easiest option, though it would incur the overhead of two query plans (including initialization in QEF) when, hopefully, only one would be executed. The other, slightly more efficient, approach would be to detach the optimized and restart plans and only initialize the latter if actually required. The UNION approach could be achieved with almost no work at all in QEF, and a trivial amount of work in OPF. The detached plan approach would require additional logic in qeq.c to trigger the initialization and execution of the restart plan when required. This would take 1-2 weeks of implementation work.

The second change required in QEF is support for priority queues. This is an access technique that returns only the top “n” rows according to some ordering scheme. One way of implementing it is to augment the QEF sort to load only the top “n” rows into its sort heap. The first “n” rows are loaded as usual, but thereafter only rows whose ranking attribute places them in the current top “n” are loaded, and each such row simply replaces an existing row that is no longer in the current top “n”. All that is required is to track the current “nth” ranking attribute value to determine whether new rows are to be retained or discarded. This technique should be easily introduced into the existing QEF sort. It would be unusual to encounter an “n” value too large for the QEF memory sort, but for such cases the DMF sort could be extended similarly. If “n” is large enough that even the DMF sort overflows to disk, the normal DMF disk sort could be performed (with no optimization) and only the first “n” rows would be returned. Appropriate flags and the “n” value itself would have to be added to the QEN_TSORT node structure. If priority queues are to be supported in the DMF sort, equivalent changes would be required in the DMR_CB structure. The OPF changes to achieve this would be 1-2 days. Changes to the QEF sort shouldn’t take more than 1-2 weeks, and changes to the DMF sort should also be on the order of 1-2 weeks.
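A minimal sketch of the priority queue technique for an ascending ranking attribute follows. It is illustrative only: the TOPN_HEAP structure and function names are hypothetical, rows are reduced to their ranking keys, and none of the existing QEF/DMF sort code is shown.

    /* Minimal sketch of a "top n" (priority queue) sort -- illustrative
    ** only.  For an ascending ORDER BY it retains the n smallest ranking
    ** values in a bounded max-heap: once n rows are loaded, a new row is
    ** kept only if its ranking value beats the current "nth" value (the
    ** heap root), in which case it replaces the displaced row.
    */
    #include <stdio.h>

    typedef struct
    {
        double *key;   /* ranking attribute of the retained rows (max-heap) */
        int     n;     /* maximum rows to retain (the "n" of top "n") */
        int     count; /* rows currently retained */
    } TOPN_HEAP;

    static void
    sift_down(TOPN_HEAP *h, int i)
    {
        for (;;)
        {
            int    l = 2 * i + 1, r = 2 * i + 2, big = i;
            double t;

            if (l < h->count && h->key[l] > h->key[big]) big = l;
            if (r < h->count && h->key[r] > h->key[big]) big = r;
            if (big == i)
                break;
            t = h->key[i]; h->key[i] = h->key[big]; h->key[big] = t;
            i = big;
        }
    }

    static void
    sift_up(TOPN_HEAP *h, int i)
    {
        while (i > 0)
        {
            int    p = (i - 1) / 2;
            double t;

            if (h->key[p] >= h->key[i])
                break;
            t = h->key[i]; h->key[i] = h->key[p]; h->key[p] = t;
            i = p;
        }
    }

    /* Offer one row to the heap; it is retained only if it is still in
    ** the current top "n".
    */
    static void
    topn_put(TOPN_HEAP *h, double key)
    {
        if (h->count < h->n)
        {
            h->key[h->count++] = key;     /* first n rows load as usual */
            sift_up(h, h->count - 1);
        }
        else if (key < h->key[0])         /* beats the current nth value */
        {
            h->key[0] = key;              /* replace the displaced row */
            sift_down(h, 0);
        }
        /* otherwise the row is discarded */
    }

    int
    main(void)
    {
        double    rows[] = { 9, 3, 7, 1, 8, 2, 6, 5, 4, 0 };
        double    buf[3];
        TOPN_HEAP h = { buf, 3, 0 };
        int       i;

        for (i = 0; i < 10; i++)
            topn_put(&h, rows[i]);

        /* The heap now holds the 3 smallest keys (unordered); the final
        ** sort phase would return them in ORDER BY sequence.
        */
        for (i = 0; i < h.count; i++)
            printf("%g ", h.key[i]);
        printf("\n");
        return 0;
    }

A descending ORDER BY would use the mirror image: a min-heap retaining the “n” largest keys.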
6. Implementation Issues

The probabilistic approach extends to reasonably complex queries. Donko’s thesis supplies guidelines for dealing with range predicates, equality predicates, equijoins and unions. However, the more complex the query, the less accurate the estimates will be (e.g. in the presence of subselects) and the lower the quality of the resulting plans. Moreover, the approach is effective only when the ranking attribute is a column in a table (or possibly a scalar expression involving a column). Most importantly, it does NOT address ranking attributes that are the result of aggregate functions (sum, avg, etc.). In TPC H, for example, 3 of the 5 top “n” queries rank on aggregates and cannot be handled by the proposed technique. Most of the top “n” TPC DS queries likewise involve aggregate ranking attributes. In TPC E, however, it appears that most top “n” queries rank on columns and could use the proposed optimization.

A trivial change to QEF could be made to exclude priority queues from the QEF sort and pass them straight through to the DMF sort; that would result in priority queues being implemented only in the DMF sort.

While I understand most of the techniques presented in Donko’s thesis, I still don’t fully understand the mechanisms used to generate the probabilities of different cardinalities in the result set.

Using this approach for top “n” join queries depends on knowledge of equijoins that map referential relationships. This information has never been available in a useful form in the Ingres catalogs. I have long promoted the idea of new, simple catalogs that record the definitions of referential relationships in a form that is easily usable by OPF. Moreover, the “maximum error” statistic that needs to be computed for each histogram would be best stored in a new column in the iistatistics catalog. So it seems likely that this feature will require catalog changes.

7. Summary

This document describes a methodology for optimizing and executing top “n” queries, along with the changes that will be required to implement it. As can be seen, this is not a trivial project. The project is also not conducive to incremental implementation – most of the work described here will be required before any benefit to top “n” processing will be seen. An encouraging sign is that John Galloway ran the TPC H top “n” query 17 with an explicitly coded cutoff predicate and reduced the execution time from 88 to 7 seconds. So it would seem that optimization of top “n” queries is definitely worth the effort.