AIM - University of Illinois Urbana

Boolean + Ranking:
Querying a Database by K-Constrained
Optimization
Zhen Zhang
Joint work with: Seung-won Hwang, Kevin C. Chang, Min
Wang, Christian A. Lang, Yuan-chi Chang
The Database and Info. Systems Lab.
University of Illinois at Urbana-Champaign
Many queries naturally combine Boolean and ranking
Traditional databases
    Boolean query (qualifying constraint):
    dept = CS and year = 2
+
Information retrieval
    Ranking query (quantifying function):
    Top 5 ranked by gpa
Database applications on Web
    Find top answers
    B: dept = CS and year = 2
    R: gpa
Motivating scenarios

Data retrieval:

Find houses in certain price range with good
price/sqrft ratio
Select h.address from House h, CrimeRate c
Where (h.price ≤ 200k or h.price ≥ 400k) and h.zipcode = c.zipcode
Order by h.size/|h.price-300k| * c.crimerate
Limit 10

Data analysis:

Find products with highest sale increase in
consecutive years
Select s1.itemid from Sales s1, Sales s2
Where s1.itemid = s2.itemid and s2.year - s1.year = 1
Order by s2.sale - s1.sale Limit 10
Boolean + Ranking form a coherent goal
function

Boolean B + Ranking R = Goal function G
For a tuple t:
\[
G(t) = B(t) \cdot R(t) =
\begin{cases}
R(t) & \text{if } B(t) \text{ is true} \\
0 \text{ (i.e., lowest score)} & \text{if } B(t) \text{ is false}
\end{cases}
\]
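As a minimal sketch of the goal function, the combination of B and R can be written as a small Python helper. The names `make_goal`, `students`, and the sample records are hypothetical and serve only as illustration; the sketch also assumes that R is non-negative, so that 0 really is the lowest score.

```python
def make_goal(B, R):
    """Combine a Boolean condition B and a ranking function R into G."""
    def G(t):
        # G(t) = R(t) if B(t) is true, otherwise 0 (the lowest score)
        return R(t) if B(t) else 0.0
    return G

# The student query from the earlier slide: B: dept = CS and year = 2, R: gpa
B = lambda t: t["dept"] == "CS" and t["year"] == 2
R = lambda t: t["gpa"]
G = make_goal(B, R)

# Hypothetical sample tuples, not taken from the talk
students = [
    {"name": "s1", "dept": "CS", "year": 2, "gpa": 3.6},
    {"name": "s2", "dept": "EE", "year": 2, "gpa": 3.9},
]
scores = [G(t) for t in students]
```

Here the second student has the higher gpa but fails B, so its goal score drops to 0.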
The nature of Boolean + Ranking is a
K-constrained optimization query

Optimize goal function G over database D

Goal function G:
    h.size/|h.price-300k|   [h.price ≤ 200k or h.price ≥ 400k]

Database D:
    #    Addr                 Zip      Price    Size
    1.   Oak park, Chicago    60644    600K     4500
    2.   Mattis, Champaign    61821    350K     2000
    3.   …                    …        150K     1000
    4.   …                    …        250K     2000
    5.   …                    …        300K     3500
    6.   …                    …        80K      500
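For reference, the example query can be evaluated by scanning every tuple of D. The following is a naive baseline sketch, not the search method of the talk; it assumes prices are expressed in thousands of dollars and uses only the Price and Size columns of the table. The function names are illustrative.

```python
import heapq

# (tuple id, price in $k, size) taken from the table of D
D = [(1, 600, 4500), (2, 350, 2000), (3, 150, 1000),
     (4, 250, 2000), (5, 300, 3500), (6, 80, 500)]

def B(price, size):
    # Boolean condition: h.price <= 200k or h.price >= 400k
    return price <= 200 or price >= 400

def R(price, size):
    # Ranking function: h.size / |h.price - 300k|
    return size / abs(price - 300)

def G(price, size):
    # B is tested first, so R is never evaluated at price == 300
    return R(price, size) if B(price, size) else 0.0

def top_k(tuples, k):
    # Score every tuple, then keep the k highest (score, id) pairs
    scored = [(G(p, s), tid) for tid, p, s in tuples]
    return heapq.nlargest(k, scored)

result = top_k(D, 3)
```

Under these assumptions the three best tuples are 1, 3, and 6; tuples 2, 4, and 5 fail the Boolean condition and score 0.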
What is the query evaluation mechanism?
Boolean query
+
Ranking query
How to answer?
Current techniques lack a global search mechanism

If evaluated as separate operators
[Figure: two plans over D, each chaining a Boolean query operator and a ranking query operator (BR)]
    Current techniques optimize only condition-by-condition
If searched by an overall goal function G as a ranking function
[Figure: goal function G, combining B and R, applied over D]
    Current techniques restrict G to be monotonic
Our thesis: Evaluate query as its nature suggests!
OPT*: Optimize G over D
    Function optimization of G
    Discrete state search over D
We view compound index as discrete space
[Table: database D with columns Addr, Zip, Price, Size; the same six tuples as shown earlier]
[Figure: tuples 1-6 plotted in the plane of size (x-axis) and Price (k) (y-axis). Hierarchical index nodes a1, a2, a3, …, a6, a7 partition the size axis and b1, b2, b3, …, b6, b7 partition the price axis.]
Mij = (ai, bj)
[Figure: the same size/price plane, now labeled with states such as M11, M22, M23, M32, M33, …, M55, M56, M66, M75, M76, M77, each pairing a node ai of the size index with a node bj of the price index; tuples 1, 2, 4, 5 appear at the bottom.]
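The state definition Mij = (ai, bj) can be illustrated with a small sketch that maps each tuple to the pair of index ranges containing it. The cut points below are the finest ranges visible in the figure; which index node owns which range is an assumption, so the sketch identifies states by range positions rather than by the names ai and bj.

```python
from bisect import bisect_right
from collections import defaultdict

# Finest ranges read off the figure (assumed to be leaf ranges of each index)
SIZE_CUTS = [1500, 3000, 4000]   # 0-1500, 1500-3000, 3000-4000, 4000-6000
PRICE_CUTS = [100, 250, 350]     # 0-100, 100-250, 250-350, 350-600 (in $k)

# (tuple id, price in $k, size) from database D
D = [(1, 600, 4500), (2, 350, 2000), (3, 150, 1000),
     (4, 250, 2000), (5, 300, 3500), (6, 80, 500)]

def state_of(price, size):
    """Return (i, j): position of the size range and of the price range."""
    return bisect_right(SIZE_CUTS, size), bisect_right(PRICE_CUTS, price)

# Group tuples by the discrete state (grid cell) they fall into
cells = defaultdict(list)
for tid, price, size in D:
    cells[state_of(price, size)].append(tid)
```

Each cell of the resulting grid plays the role of one state of the combined index space.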
Conceptually, the states Mij form one combined space
[Figure: the same state diagram, with the size index and the price index merged into a single search space]
How to perform the search in the space?

What is the search mechanism?
    How to conceptually view the index space of D for search
How to guide the search?
    How to use function G to focus the search
Challenge 1: What is the search mechanism?
We encode as A* because it’s optimal


What A* is: an algorithm for finding the shortest path
Why we choose it: completeness and optimality with proper heuristics
    Complete: guaranteed to find the shortest path
    Optimal: visits the fewest nodes among searches that use the same heuristic
[Figure: example weighted graph with a path from origin to destination]
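For readers unfamiliar with A*, a textbook version can be sketched as follows. This is a generic illustration, not code from the talk: the graph and its edge weights are hypothetical, and the heuristic used in the example is the trivial `h = 0`, under which A* behaves like Dijkstra's algorithm.

```python
import heapq

def a_star(graph, h, start, goal):
    """graph: node -> list of (neighbor, cost); h: node -> optimistic remaining cost."""
    frontier = [(h(start), 0, start, [start])]   # (f = g + h, g, node, path)
    best = {start: 0}                            # cheapest known cost to each node
    while frontier:
        f, g, node, path = heapq.heappop(frontier)
        if node == goal:
            return g, path
        if g > best.get(node, float("inf")):
            continue                             # stale queue entry, skip it
        for nxt, cost in graph.get(node, []):
            g2 = g + cost
            if g2 < best.get(nxt, float("inf")):
                best[nxt] = g2
                heapq.heappush(frontier, (g2 + h(nxt), g2, nxt, path + [nxt]))
    return None

# Hypothetical example graph
graph = {
    "origin": [("a", 5), ("b", 3)],
    "a": [("destination", 1)],
    "b": [("a", 1), ("destination", 7)],
}
cost, path = a_star(graph, lambda n: 0, "origin", "destination")
```

A heuristic that never overestimates the remaining cost keeps the result a shortest path while letting the search skip unpromising nodes.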
Encoding our problem into shortest path is
challenging

K-constrained optimization: find a tuple with maximal score
Shortest path: find a path with minimal distance
How to encode:
    a tuple → a path?
    score of tuple → distance of path?
Therefore, we encode K-constrained opt. as:

How to encode a tuple to a path?
    Add a virtual target t*, reachable only through tuples
How to encode the maximal tuple as the minimal path?
    Quality of a path depends solely on the tuple it passes through
    For a tuple state t: D(t, t*) = -G(t)
    For two states r, u: D(r, u) = 0
[Figure: state graph from M11 through M22, M23, M32, M33, …, M55, M56, M66, M67, M75, M76, M77 down to tuples 1, 2, 4, 5; links between states have weight 0, and links from tuples to t* have weights such as -G(1) and -G(4)]
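The encoding can be checked on a toy graph: with weight 0 between states and weight -G(t) from each tuple to t*, the shortest path to t* passes through the tuple with maximal G. The sketch below assumes a hypothetical three-state layout (`links`) and reuses the scores of the house example with prices in $k; it is an illustration of the encoding, not the search procedure itself.

```python
TARGET = "t*"

# Goal scores of the six example tuples (0 where the Boolean condition fails)
G = {1: 15.0, 2: 0.0, 3: 1000 / 150, 4: 0.0, 5: 0.0, 6: 500 / 220}

# Hypothetical tiny state graph: which states lead to which states or tuples
links = {
    "M11": ["M22", "M33"],
    "M22": [3, 6, 4],
    "M33": [1, 2, 5],
}

def edges(node):
    if node in G:
        # Tuple state: its only link goes to t*, with D(t, t*) = -G(t)
        return [(TARGET, -G[node])]
    # Any other link between two states has D(r, u) = 0
    return [(child, 0.0) for child in links.get(node, [])]

def shortest(node):
    """Minimal distance from node to t*, plus the path that achieves it."""
    if node == TARGET:
        return 0.0, [TARGET]
    best = (float("inf"), [])
    for nxt, w in edges(node):
        d, p = shortest(nxt)
        if w + d < best[0]:
            best = (w + d, [node] + p)
    return best

dist, path = shortest("M11")
```

The minimal distance equals -G of the best tuple, so minimizing distance and maximizing score select the same tuple.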
Challenge 2: How to guide the search?
We use function opt. to sketch the landscape of G


Function optimization measures quality of states
Function optimization enables:



1. How to define heuristics?
2. How to configure space?
3. Where to start the search?
1. Define admissible heuristics: Measure
tightest upper bound

To guarantee completeness
    A* requires admissible heuristics, i.e., estimates that are optimistic
To ensure admissible heuristics
    Function optimization gives the tightest upper bound
        Analytical approaches
        Numeric analysis package
H(region) = OPTMAX(G, region), i.e., the maximal value of G in the region
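For the running house query, an analytical upper bound over a rectangular region can be sketched as below. This is one possible way to realize H(region) for this particular G, assuming prices in $k and regions given as closed ranges; the function name `upper_bound` is illustrative, and other goal functions would need their own analysis or a numeric optimizer.

```python
def upper_bound(size_lo, size_hi, price_lo, price_hi):
    """Upper bound of G = size/|price-300| under [price <= 200 or price >= 400].

    size_lo is unused because R grows with size, so only size_hi matters.
    """
    # Part of the price range that satisfies the Boolean condition
    feasible = []
    if price_lo <= 200:
        feasible.append((price_lo, min(price_hi, 200)))
    if price_hi >= 400:
        feasible.append((max(price_lo, 400), price_hi))
    if not feasible:
        return 0.0   # B is false everywhere in the region, so G is 0
    # R is largest at the maximal size and at the feasible price closest to 300;
    # feasible ranges never contain 300, so the closest point is an endpoint
    gap = min(min(abs(lo - 300), abs(hi - 300)) for lo, hi in feasible)
    return size_hi / gap

h_low = upper_bound(0, 1500, 0, 100)
h_mid = upper_bound(1500, 3000, 250, 350)
h_high = upper_bound(4000, 6000, 350, 600)
```

The bound is optimistic by construction: tuple 1 (price 600K, size 4500) lies in the last region and scores 15, which is below that region's bound.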
2. Configure descending space: disconnect
uphills

To guarantee optimality
    A* requires descending heuristics
To ensure descending heuristics
    Remove uphill links
[Figure: the state graph (M11; M22, M23, M32, M33; …; M55, M56, M66, M67, M75, M76, M77; tuples 1, 2, 4, 5) with uphill links removed]
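Removing uphill links amounts to a filter over the link set: a link from r to u survives only if the bound of u does not exceed the bound of r. The sketch below uses hypothetical bound values and a hypothetical neighborhood among three states; only the filtering rule reflects the slide.

```python
def remove_uphill(links, H):
    """Keep a link r -> u only if H[u] <= H[r], so bounds never rise along a path."""
    return {r: [u for u in nbrs if H[u] <= H[r]] for r, nbrs in links.items()}

# Hypothetical upper bounds and neighbor links for three states
H = {"M55": 7.5, "M56": 30.0, "M66": 60.0}
links = {"M55": ["M56"], "M56": ["M55", "M66"], "M66": ["M56"]}

descending = remove_uphill(links, H)
```

After filtering, every remaining path visits states in non-increasing order of their bounds.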
3. Find the right start point: start from local optima

To guarantee correctness
    Every tuple state must be reachable from the start states
    Taking only downhill links requires starting from high points
To ensure reachability
    Initial states should contain all local optima
[Figure: the descending state graph (M11; M22, M23, M32, M33; …; M55, M56, M66, M67, M75, M76, M77; tuples 1, 2, 4, 5) with the start states marked]
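The reachability argument can be tested on a small example: if the start set contains every local optimum, following only downhill links reaches every state, while a start set missing one optimum does not. The five states A to E, their bounds, and the chain layout are hypothetical.

```python
def local_optima(links, H):
    """States with no neighbor whose bound is strictly higher."""
    return sorted(s for s, nbrs in links.items()
                  if all(H[n] <= H[s] for n in nbrs))

def reachable(starts, links, H):
    """States reachable from the start states using downhill links only."""
    seen, stack = set(starts), list(starts)
    while stack:
        s = stack.pop()
        for n in links[s]:
            if H[n] <= H[s] and n not in seen:
                seen.add(n)
                stack.append(n)
    return seen

# Hypothetical chain of states A - B - C - D - E with two peaks (B and D)
H = {"A": 5, "B": 9, "C": 4, "D": 8, "E": 2}
links = {"A": ["B"], "B": ["A", "C"], "C": ["B", "D"],
         "D": ["C", "E"], "E": ["D"]}

starts = local_optima(links, H)
covered = reachable(starts, links, H)
partial = reachable(["B"], links, H)
```

Starting from B alone leaves D and E unreachable, because the link from C to D goes uphill.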
Putting together:
Executing A* on the configured space
top-down
[Figure: A* traversal from M11 down through M22, M23, M32, M33, …, M55, M56, M57, M66, M67, M75, M76, M77 to tuples 1, 2, 4, 5]
Search is implemented as a priority-queue-driven traversal
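A priority-queue-driven traversal of this kind can be sketched as a generic best-first search: regions are queued by their upper bound, tuples by their exact score, and a tuple is reported as soon as it reaches the front of the queue. This is a simplified illustration under assumed inputs, not the exact OPT* procedure; the two-region hierarchy and its bounds are hypothetical, and the tuple scores are those of the house example with prices in $k.

```python
import heapq
from itertools import count

def best_first_top_k(root, k, bound, children, is_tuple):
    tie = count()                        # tie-breaker so nodes are never compared
    heap = [(-bound(root), next(tie), root)]
    answers, accessed = [], 0
    while heap and len(answers) < k:
        neg, _, node = heapq.heappop(heap)
        accessed += 1
        if is_tuple(node):
            # No queued region or tuple can beat this score any more
            answers.append((node, -neg))
        else:
            for c in children(node):
                heapq.heappush(heap, (-bound(c), next(tie), c))
    return answers, accessed

# Hypothetical hierarchy over the six tuples, with admissible region bounds
tree = {"root": ["left", "right"], "left": [3, 6, 4], "right": [1, 2, 5]}
score = {1: 15.0, 2: 0.0, 3: 1000 / 150, 4: 0.0, 5: 0.0, 6: 500 / 220}
region_bound = {"root": 60.0, "left": 7.5, "right": 60.0}

bound = lambda n: score[n] if n in score else region_bound[n]
top1, accessed = best_first_top_k("root", 1, bound, tree.get, lambda n: n in score)
top2, _ = best_first_top_k("root", 2, bound, tree.get, lambda n: n in score)
```

For k = 1 the region `left` is never expanded, since its bound (7.5) is lower than the score already found (15).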
top-down vs. bottom-up
[Figure: the top-down traversal and the bottom-up traversal of the state graph, shown side by side]
Bottom-up approach is always better than top-down
Experiments

Comparison vs.
    Boolean then ranking
    Ranking then Boolean
Metrics:
    nodes accessed = Nl + Nt
Settings:
    Benchmark queries over real dataset
    Controlled queries over synthetic dataset
Benchmark queries

Datasets:
    19,706 real estate listings crawled online
Queries:
    Q1: size * bedrms / |price-450k| : [40k<=price<=50k]
    Q2: size * e^bedrms / |price-350k| : [price<400k and size>4000]
    Q3: size/price : [bedrms=3 or bedrms=4]
[Figure: comparison of BR_clustered, BR_unclustered, and OPT* on Q1, Q2, Q3]
Controlled queries

Datasets
    Three randomly generated datasets of 100k points
    Uniform, Gaussian, log-normal
Queries
    Linear average queries: (e.g., 0.4*a + 0.6*b)
    Nearest neighbor queries: (e.g., (x-3)^2 + (y-4)^2)
    Join queries: (0.4*R.a + 0.6*S.b: R.c=S.d)
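Synthetic data of the three kinds listed above can be generated with Python's standard `random` module, as in the sketch below. The distribution parameters, the seed, and the reduced dataset size are assumptions made for illustration; the talk does not state them. Negating the squared distance, so that nearer points rank higher, is likewise an assumption about how the nearest neighbor query is scored.

```python
import random

def make_dataset(kind, n, seed=0):
    """Generate n two-dimensional points; all parameters are illustrative."""
    rng = random.Random(seed)
    draw = {
        "uniform": lambda: rng.uniform(0.0, 1.0),
        "gaussian": lambda: rng.gauss(0.5, 0.15),
        "lognormal": lambda: rng.lognormvariate(0.0, 0.5),
    }[kind]
    return [(draw(), draw()) for _ in range(n)]

def linear_average(a, b):
    # Linear average query from the slide: 0.4*a + 0.6*b
    return 0.4 * a + 0.6 * b

def nearest_neighbor(x, y):
    # Squared distance to (3, 4), negated so that a higher score is better
    return -((x - 3) ** 2 + (y - 4) ** 2)

points = make_dataset("uniform", 1000)
best_linear = max(points, key=lambda p: linear_average(*p))
best_nn = max(points, key=lambda p: nearest_neighbor(*p))
```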
Conclusion

Problem
    Study K-constrained optimization queries as Boolean + ranking
Abstraction
    Encode K-constrained optimization into shortest path problem
Framework
    Develop OPT* to process K-constrained optimization
Thank you!
Questions?








How to implement function optimization?
How do we compare with RankSQL?
If bottom-up is always better, why consider top-down?
Computing upper bound for each region is costly
Random vs. sequential I/O
Assuming indices on every attribute?
Materialize state space for every query?
Exponential number of states when the number of attributes grows
    Not every attribute has an index on it
    Selectively choose the right index (attribute) to use
    We do perform experiments to study how the system scales with #attr
Your algorithm is not optimal because you change the space