Online Aggregation Joseph M. Hellerstein Peter J. Haas Helen J. Wang

Motivation for Online Aggregation
• Traditional aggregation takes a long time to
return a very small final result from a large
amount of data
• The result does not have to be very precise!
• Online aggregation allows users to observe
the progress of their queries and control
execution on the fly
New interface for aggregation
• Observe the progress of their queries
• Control execution on the fly
Statistical estimation
• Users need not specify a stopping condition a priori
• The interface is easier for users with no
statistical background
• It requires more powerful statistical
estimation techniques (Hoeffding’s
inequality versus Chebyshev’s inequality)
Related work
• Online Analytical Processing (OLAP): batch mode
and precise computation
• “fast first” query processing: similar to Online
Aggregation but with simpler performance goals
• APPROXIMATE: defines an approximate
relational algebra used to process standard
relational queries in an iteratively refined manner
Usability goals
• Continuous observation
• Control of time/precision
• Control of fairness/partiality
Performance goals
• Minimum time to accuracy: produce a
useful estimate of the final answer ASAP
• Minimum time to completion: secondary
goal, assume user will terminate processing
long before the final answer is produced
• Pacing: guarantee a smoothly and continuously
improving display
A naïve approach
• Use user-defined output functions supported by
POSTGRES to provide simple running aggregates
• Cannot support complex aggregates like
aggregation with GROUP BY
• Current systems optimize only for accurate
solutions
• Skip factor k – the DBMS only ships an update to
the user interface after k input tuples have been
processed
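As an illustration of the skip-factor idea (our own Python sketch, not the paper's POSTGRES user-defined output functions), a running AVG that ships one display update per k input tuples:

```python
import random

def running_avg(tuples, k=1000):
    """Running AVG with skip factor k: ship an updated estimate to
    the user interface only after every k input tuples, instead of
    after every tuple. (Illustrative sketch, not the paper's code.)"""
    total, n = 0.0, 0
    for t in tuples:
        total += t
        n += 1
        if n % k == 0:
            yield total / n          # one display update per k tuples
    if n % k:                        # final update if the scan ends mid-batch
        yield total / n

random.seed(42)
data = [random.gauss(50, 10) for _ in range(10_000)]
updates = list(running_avg(data, k=1000))   # 10 updates instead of 10,000
```

The skip factor trades display freshness for per-tuple overhead; the final update always equals the exact average over the tuples seen.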
Random access to data
We need to retrieve data in random order to
produce meaningful statistical estimation.
Three ways to get records in random order:
• Heap scans
• Index scans
• Sampling from indices (less efficient)
Non-blocking GROUP BY and DISTINCT
• Sorting is blocking, and groups are produced
only one at a time after the sort completes
• Hashing is non-blocking, but the hash table
needs to fit in memory for good
performance
• Hybrid Cache (an extension of hybrid
hashing) might be good
Index striding
(diagram: index probes locate groups k1, k2, k3, which are then read in round-robin order)
• Hash-based grouping can be unfair
• Solution: probe the index to find all the
groups and then process tuples from each
group in a “round robin” fashion
• Can control speed by weighting the
schedule
• Fair for groups with different cardinalities
Index striding - Continued
• Efficiency: will be as good as scanning a
relation via a clustered secondary index if
– Index is the primary access method or
– Relation is clustered by the grouping columns
or
– Index keys contain both the grouping and
aggregation columns, with the grouping
columns as a prefix
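The round-robin delivery can be sketched as follows (hypothetical Python, with in-memory lists standing in for per-group index scans):

```python
def index_stride(groups, weights=None):
    """Probe-identified groups are read round-robin; weights[g] tuples
    are taken from group g per round, so the per-group update rate can
    be tuned. Exhausted groups drop out while the rest continue.
    (Sketch only; real index striding reads from a B-tree index.)"""
    weights = weights or {g: 1 for g in groups}
    iters = {g: iter(ts) for g, ts in groups.items()}
    while iters:
        for g in list(iters):
            for _ in range(weights[g]):
                try:
                    yield g, next(iters[g])
                except StopIteration:
                    del iters[g]     # small group finished early
                    break

groups = {"A": [0, 1, 2, 3, 4], "B": [100, 101, 102]}
out = list(index_stride(groups))     # alternates A, B until B runs out
```

Even though group A is larger, both groups receive updates at the same rate until the smaller group is exhausted, which is the fairness property the slide describes.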
Non-blocking join algorithms (1)
• Sort-merge join is not acceptable for online
aggregation because sorting is blocking
• Hybrid hash join blocks for the time required to
hash the inner relation
• Pipeline hash join techniques may be appropriate
for online aggregations when both relations are
large
• Merge join (without sort) and hash join produce
ordered output – not good for statistical
estimation
Non-blocking join algorithms (2)
• The “safest” traditional join algorithm is
nested loop, particularly if there is an index
on the inner relation
• More on ripple join coming next…
Issues in optimization (1)
• Avoid sorting completely
• It is undesirable to produce results ordered on
aggregation or grouping columns
• Divide the cost model into two parts:
– Time t_d spent in blocking (“dead”) operations
– Time t_o spent producing output
Use the cost function f(t_o) + g(t_d), where f is linear and
g is super-linear, to “tax” operations with too much dead
time
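For instance, with a quadratic g (our choice; the slide only requires g super-linear), a plan that blocks for a long stretch is heavily penalized:

```python
def online_cost(t_output, t_dead, alpha=1.0):
    """f(t_o) + g(t_d): linear charge for output time, super-linear
    (here quadratic) penalty for dead time, 'taxing' plans that block."""
    return alpha * t_output + t_dead ** 2

# A plan that blocks 10s then streams 20s loses to one that streams
# 28s with only 2s of blocking, even though total time is the same:
slow_start = online_cost(t_output=20, t_dead=10)   # 120.0
steady     = online_cost(t_output=28, t_dead=2)    #  32.0
```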
Issues in optimization (2)
• Preference to plans that maximize user
control (such as index-striding)
• Trade off between output rate of a query
and its time to completion
– Create natural controls in this regard for naïve
users
– Run multiple versions of a query is a solution
but will waste computing resources
Running confidence intervals (1)
• Confidence parameter p(0,1) is
prespecified
• Display a precision parameter єn such that
running aggregate Yn is within  єn of the
final answer μ with probability
approximately equal to p. [Yn- єn,Yn+ єn]
contains μ with probability approximately
equal to p
Running confidence intervals (2)
• Three types can be constructed from the n retrieved records:
– Conservative confidence intervals, based on Hoeffding’s
inequality or a recent extension of it; valid for all
n ≥ 1
– Large-sample confidence intervals, based on central
limit theorems (CLTs); valid once n is large enough
(but still small relative to the table size)
– Deterministic confidence intervals, which contain μ with
probability 1; useful only for very large n
Running confidence intervals (3)
• SELECT AVG(exp) FROM R;
v(i) (1 ≤ i ≤ m): the value of exp applied to tuple i
L_i: the random index of the i-th tuple retrieved from R
a and b: a priori bounds with a ≤ v(i) ≤ b for 1 ≤ i ≤ m
• Conservative confidence interval equations:
Y_n = (1/n) Σ_{i=1..n} v(L_i)
μ = (1/m) Σ_{i=1..m} v(i)
P{ |Y_n − μ| ≤ ε_n } ≥ 1 − 2 e^(−2 n ε_n² / (b − a)²)
ε_n = (b − a) [ (1/(2n)) · ln( 2 / (1 − p) ) ]^(1/2)
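A quick numeric check of the Hoeffding half-width ε_n = (b − a)·√(ln(2/(1 − p))/(2n)), with hypothetical bounds a = 0, b = 100 and p = 0.95:

```python
import math

def hoeffding_eps(n, a, b, p):
    """Half-width of the conservative running confidence interval:
    eps_n = (b - a) * sqrt( ln(2/(1-p)) / (2n) )."""
    return (b - a) * math.sqrt(math.log(2.0 / (1.0 - p)) / (2.0 * n))

# The interval shrinks as 1/sqrt(n): 100x more tuples -> 10x tighter.
eps_100 = hoeffding_eps(100, a=0, b=100, p=0.95)      # ~13.6
eps_10k = hoeffding_eps(10_000, a=0, b=100, p=0.95)   # ~1.36
```

Because the bound uses only a, b, and n, it can be displayed from the very first retrieved tuple, which is why it is called conservative.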
Running confidence intervals (4)
• Large-sample confidence interval equations
• By central limit theorems (CLTs), Y_n approaches a normal
distribution with mean μ and variance s²/n as n, the
sample size, increases; s² can be replaced by the estimator
T_{n,2}(v):
T_{n,2}(v) = (1/(n − 1)) Σ_{i=1..n} ( v(L_i) − Y_n )²
P{ |Y_n − μ| ≤ ε_n } ≈ 2 Φ( ε_n n^(1/2) / T_{n,2}(v)^(1/2) ) − 1
Φ(z_p) = (p + 1)/2
ε_n = z_p ( T_{n,2}(v) / n )^(1/2)
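The large-sample computation can be sketched in Python (statistics.NormalDist supplies Φ⁻¹; the sample data here is synthetic):

```python
import math
import random
import statistics
from statistics import NormalDist

def clt_eps(sample, p):
    """Large-sample half-width eps_n = z_p * sqrt(T_{n,2}(v)/n),
    where T_{n,2}(v) is the sample variance and Phi(z_p) = (p+1)/2."""
    z_p = NormalDist().inv_cdf((p + 1) / 2)   # ~1.96 for p = 0.95
    t_n2 = statistics.variance(sample)        # (1/(n-1)) * sum (v(L_i) - Y_n)^2
    return z_p * math.sqrt(t_n2 / len(sample))

random.seed(1)
sample = [random.uniform(0, 100) for _ in range(10_000)]
y_n = statistics.fmean(sample)
eps = clt_eps(sample, p=0.95)
# Report: AVG lies in [y_n - eps, y_n + eps] with ~95% confidence.
```

Unlike the Hoeffding bound, this interval adapts to the observed spread of the data, so it is typically much tighter once n is moderately large.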
Performance issues – skip factor
Performance issues – index striding (a large group)
Performance issues – index striding (a small group)
Conclusion
• aggregates have running output and
confidence intervals
• hash-based grouping and duplicate
elimination are non-blocking
• index striding gives more control over
GROUP BY
Future work
• Graphical user interface
• Nested queries
• Checkpointing and continuation
Ripple Joins for Online Aggregation
Peter J. Haas
Joseph M. Hellerstein
Join algorithms for online aggregation
• Sort-merge and hash join algorithms are
blocking algorithms – not acceptable in
online aggregation
• Merge (without sorting) provides ordered
output – bad for statistical estimator in
online aggregation
• Nested loop join is the best, but...
An artificial bad example for nested loop join
• SELECT AVG(S.a + R.b/10000000) FROM
R, S
If R is the inner relation, then for each tuple
from S we must scan all of R, yet the tiny
R.b/10000000 term means that scan contributes
almost nothing to the running estimate.
Overview of ripple join (1)
Overview of ripple join (2)
• online nested-loops join is a special case of
ripple join
Ripple join algorithms (1)
• It can be viewed as a generalization of nested-loops
join in which the traditional roles of “inner”
and “outer” relation are continually interchanged
during processing
(diagram: at step n the join has explored an (n − 1) × (n − 1) rectangle of R × S, which grows by one new row of R and one new column of S per step)
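The stepping pattern can be sketched as a square (1 × 1) ripple join (illustrative Python; lists stand in for random tuple retrieval from disk):

```python
def ripple_join(R, S, pred):
    """At each step, draw one new tuple from each relation and join it
    against everything seen so far, growing the explored R x S
    rectangle from the corner. Results stream out as they are found."""
    seen_r, seen_s = [], []
    for r, s in zip(R, S):
        seen_r.append(r)
        seen_s.append(s)
        for s_old in seen_s:          # new r vs. all of S seen so far
            if pred(r, s_old):
                yield (r, s_old)
        for r_old in seen_r[:-1]:     # new s vs. previously seen R
            if pred(r_old, s):
                yield (r_old, s)

out = list(ripple_join([1, 2, 3], [2, 3, 4], lambda r, s: r == s))
```

With equal per-step draws from both inputs this is the special case the slide mentions; weighting the draws gives the general aspect-ratio version.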
Ripple join algorithms (2)
• We need to modify the iterator for this
algorithm – it has to keep track of more
state
• Aspect ratios do not have to be 1
• It can be extended to multiple ripple joins
Ripple join variants
• Block ripple join improves performance on
I/O
• Indexed ripple join = index-enhanced
nested-loops join; the roles of the inner/outer
relations no longer alternate
• Hash ripple join can be used for equijoin
queries – two hash tables in memory for
both R and S
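A minimal symmetric-hashing sketch of the hash ripple join for an equijoin (our Python illustration; in the real algorithm tuples arrive by random retrieval and the two tables are bounded by available memory):

```python
from collections import defaultdict

def hash_ripple_join(R, S, key):
    """One in-memory hash table per input. Each newly retrieved tuple
    is probed against the other input's table and then inserted into
    its own, so earlier tuples are never rescanned."""
    h_r, h_s = defaultdict(list), defaultdict(list)
    for r, s in zip(R, S):
        h_r[key(r)].append(r)
        for s_match in h_s[key(r)]:   # new R tuple probes S's table
            yield (r, s_match)
        h_s[key(s)].append(s)
        for r_match in h_r[key(s)]:   # new S tuple probes R's table
            yield (r_match, s)

R = [("r1", 1), ("r2", 2)]
S = [("s1", 2), ("s2", 1)]
out = list(hash_ripple_join(R, S, key=lambda t: t[1]))
```

Because each new tuple probes only the opposite hash table, every join result is emitted exactly once and output begins immediately.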
Estimators for SUM, COUNT and AVG
sum = ( |R| · |S| / (|R_n| · |S_n|) ) Σ_{(r,s) ∈ R_n × S_n} exp_p(r, s)
count = ( |R| · |S| / (|R_n| · |S_n|) ) Σ_{(r,s) ∈ R_n × S_n} one_p(r, s)
avg = sum / count
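These estimators can be checked with a toy sketch (hypothetical Python; exp_p and one_p are realized by filtering on the join predicate, so non-joining pairs contribute 0):

```python
def running_estimates(R, S, Rn, Sn, exp, pred):
    """Scale the totals over the sampled cross product Rn x Sn by
    |R||S| / (|Rn||Sn|); pairs failing the predicate contribute 0."""
    scale = (len(R) * len(S)) / (len(Rn) * len(Sn))
    pairs = [(r, s) for r in Rn for s in Sn if pred(r, s)]
    sum_est = scale * sum(exp(r, s) for r, s in pairs)
    count_est = scale * len(pairs)
    avg_est = sum_est / count_est if count_est else None
    return sum_est, count_est, avg_est

# With the full relations as the "sample", the estimates are exact:
R = S = [1, 2, 3]
est = running_estimates(R, S, R, S, exp=lambda r, s: r + s,
                        pred=lambda r, s: r == s)
```

Note that the scale factor cancels in avg = sum/count, so the AVG estimate is simply the sample ratio.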
Confidence intervals
• Use central limit theorems (CLT’s) to
compute “large-sample” confidence
intervals
• Fix the problems in classic CLT’s with
newly defined s2 for different aggregate
queries
s² = Σ_{k=1..K} d(k) / β_k
Ripple optimization: choosing aspect ratios (1)
• The blocking factor β is prespecified; we want
to optimize the β_k’s – the aspect-ratio
parameters
• Minimize s² = Σ_{k=1..K} d(k) / β_k
subject to:
β_1 β_2 β_3 ⋯ β_K ≤ β⁻¹ c (decided by the animation speed)
1 ≤ β_k ≤ m_k / β for 1 ≤ k ≤ K
β_1, β_2, β_3, …, β_K integer
Choosing aspect ratios (2)
• Solve a relaxed version of the optimization
problem by dropping all constraints other
than the first one
• Adjust the ratios accordingly during
execution to satisfy the other constraints
• Start the ripple join with all β_k = 1 and begin
updating the aspect ratios after a certain
amount of time
Performance (1)
• SELECT ONLINE_AVG(enroll.grade)
FROM enroll, student
WHERE enroll.sid = student.sid
AND student.honors_code IS NULL;
• Hash ripple join gives tight confidence interval
within seconds
• Nested-loops join takes over 10 seconds to begin
producing output
• Best batch join algorithm (hybrid hash join) takes
208 seconds to complete
Performance (2)
Performance (3) – choosing aspect ratios
• Select a lower animation speed to allow a
wider range of aspect ratios
• Query:
SELECT ONLINE_AVG(d.grade/a.grade)
FROM enroll d, enroll a
WHERE d.college = “Education”
AND a.college = “Agriculture”
AND a.year = d.year;
Performance (4) – choosing aspect ratios
Conclusion
• The bigger the database, the more attractive
online join algorithms appear
• User can control animation speed to trade
off between precision and updating speed
• The system will adjust aspect ratios
accordingly to achieve the best updating
performance
Future work
• Optimization for ripple joins – choice
between many variants of ripple joins
• Parallelization of ripple joins
Online Dynamic Reordering for Interactive Data Processing
Vijayshankar Raman
Bhaskaran Raman
Joseph M. Hellerstein
Motivating applications
• Unclustered index striding in online
aggregation might be very inefficient
because of random I/Os
• Scalable spreadsheets want to provide
services like sorting, scrolling and jumping
instantaneously
• Sort algorithm used in some query plans is
not the best solution
The Prefetch and Spool (P&S) algorithm
• Uses the time the consumer spends processing
delivered tuples to prefetch interesting tuples
into the buffer
• The mix of groups kept in the buffer
depends on the user’s interest
• Spools tuples (in chunks) that are currently
uninteresting onto a side-disk
• Phase 1 scans from the input; Phase 2
scans from the side-disk in a chosen order
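A toy two-phase sketch (hypothetical Python; a dict of lists stands in for the side-disk, and the buffer policy is our own simplification):

```python
from collections import deque

class PrefetchAndSpool:
    """Phase 1 scans the input, keeping tuples from high-interest
    groups in a bounded buffer and spooling the rest; Phase 2 drains
    the side-disk in descending order of user interest."""
    def __init__(self, interest, buffer_cap=4):
        self.interest = interest                   # group -> priority
        self.buffer = deque()
        self.side_disk = {g: [] for g in interest}
        self.cap = buffer_cap

    def phase1(self, scan):
        for group, tup in scan:
            if len(self.buffer) < self.cap and self.interest[group] > 1:
                self.buffer.append((group, tup))   # interesting: keep
            else:
                self.side_disk[group].append(tup)  # spool to side-disk

    def phase2(self):
        yield from self.buffer
        for g in sorted(self.side_disk, key=self.interest.get, reverse=True):
            for tup in self.side_disk[g]:
                yield g, tup

ps = PrefetchAndSpool(interest={"hot": 2, "cold": 1})
ps.phase1([("cold", 1), ("hot", 2), ("cold", 3), ("hot", 4)])
out = list(ps.phase2())   # hot tuples delivered before cold ones
```

The real algorithm interleaves delivery with spooling and spools in disk-sized chunks; this sketch only shows how interest reorders delivery without sorting.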
Policies for online reordering
• User preference and metric (confidence,
rate and strict) decide delivery priority of
different groups
• We want to maximize the rate at which the
feedback function rises as the number of
tuples we retrieve increases
Performance (1)
• Compared to Index Striding
– Much better than unclustered index striding
because of less random I/Os
– For extremely small groups, IS might be better
• Works better for slower processing rate
because P&S has more time to build up the
buffer between “gets”
Performance (2)
• Works well for scalable spreadsheets
• After P&S finishes Phase 1, tuples at a point
of interest can be retrieved in milliseconds,
while a blocking sort takes 15 minutes
• Random jump is not as good as short jump
because nearby data has higher priority
(using strict metric)
Performance (3)
• Can be used to replace sort in some
traditional query plans
• Exact ordering is not needed in some query
plans, and reordering can provide a non-blocking
service that benefits pipelined parallelism
and interactive estimation techniques like
online aggregation
Future work
• Other feedback functions for different
scenarios
• Providing data for multiple users
• Usages in graphical data visualization