Automated Ranking of
Database Query Results
Sanjay Agrawal, Surajit Chaudhuri, Gautam Das,
Aristides Gionis
Presented by
Archana Vijayalakshmanan
4/11/2006
Contents





Introduction
Different ranking functions
Breaking ties
Implementation
Conclusion
Introduction



Automated ranking of query results is a popular aspect of IR.
Database systems support only a Boolean query model, which leads to two problems:
• Empty answers
• Many answers
Automated ranking takes a user query and maps it to a Top-K query
with a ranking function.
Automated Ranking functions for the ‘Empty
Answers Problem’

IDF Similarity

QF Similarity

QFIDF Similarity
IDF Similarity
Analogy with IR: a word w corresponds to an <attribute, value> pair, and a document d corresponds to a tuple.

IR technique
Q = set of keywords
IDF(w) = log(N / F(w))
TF(w, d) = frequency of occurrence of w in d
Database (categorical attributes only)
T = <t1, ..., tm>
Q = <q1, ..., qm>, from a condition of the form "WHERE A1 = q1 ..."
IDFk(t) = log(n / Fk(t))
n = number of tuples in the database
Fk(t) = number of tuples in the database where Ak = t
The cosine similarity between a query and a
document is the normalized dot product
of the two corresponding vectors.
Similarity between T and Q is
$$SIM(T, Q) = \sum_{k=1}^{m} S_k(t_k, q_k)$$
This is the sum of the corresponding similarity coefficients over
all attributes.
• In IR, the similarity function is known as cosine similarity
with TF-IDF weightings.
• For databases, the dot product is un-normalized and TF is irrelevant;
the similarity function is known as IDF similarity.
E.g., query = {CONVERTIBLE, NISSAN}
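The IDF similarity above can be illustrated with a small Python sketch. It assumes the per-attribute coefficient is S_k(t_k, q_k) = IDF_k(q_k) when t_k = q_k and 0 otherwise, which the slide does not state explicitly. The toy table `cars` and the helper names `idf_tables` and `idf_similarity` are hypothetical.

```python
import math
from collections import Counter


def idf_tables(rows):
    """For each attribute k, map every value t to IDF_k(t) = log(n / F_k(t))."""
    n = len(rows)
    m = len(rows[0])
    tables = []
    for k in range(m):
        freq = Counter(row[k] for row in rows)  # F_k(t): tuples with A_k = t
        tables.append({t: math.log(n / f) for t, f in freq.items()})
    return tables


def idf_similarity(tup, query, tables):
    """SIM(T, Q): sum of S_k(t_k, q_k) over the attributes named in the query.

    Assumption: S_k(t_k, q_k) = IDF_k(q_k) if t_k == q_k, else 0.
    `query` maps an attribute index k to the requested value q_k.
    """
    return sum(tables[k].get(q, 0.0) for k, q in query.items() if tup[k] == q)


# Hypothetical toy table: (body style, manufacturer).
cars = [
    ("SEDAN", "NISSAN"),
    ("SEDAN", "NISSAN"),
    ("SEDAN", "TOYOTA"),
    ("CONVERTIBLE", "TOYOTA"),
]
tables = idf_tables(cars)
query = {0: "CONVERTIBLE", 1: "NISSAN"}

# No tuple satisfies both conditions, so a Boolean query returns nothing;
# ranking by similarity still orders the tuples.
ranked = sorted(cars, key=lambda t: idf_similarity(t, query, tables), reverse=True)
```

In this toy table CONVERTIBLE is rarer than NISSAN, so a match on CONVERTIBLE contributes a larger IDF and the convertible ranks first.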
Generalizations of IDF similarity

For numeric data

Inappropriate to use the previous similarity coefficients.

The frequency of a numeric value depends on nearby values.

Discretizing a numeric attribute into a categorical one is problematic.
• Solution:

Let {t1, t2, ..., tn} be the values of attribute A. For every value t,
IDF(t) is based on the sum of "contributions" to t from every other point ti,
with contributions modeled as Gaussian distributions.

Similarity function: [formula image not preserved]; it uses a
bandwidth parameter.

For range/set of values
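For the numeric-data case above, the slide's formula did not survive, so the following Python sketch rests on assumptions. It assumes IDF(t) = log(n / sum_i exp(-0.5 * ((t_i - t) / h)^2)) and S(t, q) = exp(-0.5 * ((t - q) / h)^2) * IDF(q), with h as the bandwidth parameter. These forms are one Gaussian-kernel reading of the description, not a transcription of the slide. The price list and function names are hypothetical.

```python
import math


def numeric_idf(t, values, h):
    """Assumed form: log(n / sum_i exp(-0.5 * ((t_i - t) / h) ** 2)).

    Values surrounded by many nearby points get a low IDF; isolated
    values get a high one. The sum here runs over all points in `values`,
    including t itself when it is present (an assumption).
    """
    n = len(values)
    density = sum(math.exp(-0.5 * ((ti - t) / h) ** 2) for ti in values)
    return math.log(n / density)


def numeric_similarity(t, q, values, h):
    """Assumed form: Gaussian closeness of t to q, scaled by IDF(q)."""
    closeness = math.exp(-0.5 * ((t - q) / h) ** 2)
    return closeness * numeric_idf(q, values, h)


# Hypothetical prices: four clustered values and one outlier.
prices = [10_000, 10_500, 11_000, 11_200, 40_000]
h = 1_000.0  # bandwidth parameter, chosen arbitrarily for the example

idf_common = numeric_idf(11_000, prices, h)  # value inside the cluster
idf_rare = numeric_idf(40_000, prices, h)    # isolated value
```

A larger h smooths over more neighbours, so more values count as "nearby" and IDF differences shrink.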
QF Similarity


The importance of attribute values is determined by the frequency
of their occurrence in the workload.
For categorical data

Query frequency: $$QF(q) = \frac{RQF(q)}{RQFMax}$$
RQF(q) = raw frequency of occurrence of value q of attribute A in the query strings of the workload
RQFMax = raw frequency of the most frequently occurring value in the workload

$$S(t, q) = \begin{cases} QF(q), & \text{if } q = t \\ 0, & \text{otherwise} \end{cases}$$
Similarity between pairs of different categorical attribute
values can also be derived from the workload, e.g., to find
S(TOYOTA, HONDA):

Analyze the IN clauses of queries.
If a certain pair of values often occurs together in the workload, they are similar, e.g.,
queries with C as "MFR IN {TOYOTA, HONDA, NISSAN}".
• Several recent queries in the workload by a specific user repeatedly request
TOYOTA and HONDA.
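A minimal Python sketch of QF similarity for one categorical attribute follows. `qf_table` implements QF(q) = RQF(q) / RQFMax as defined above. `in_clause_similarity` is an assumption: it uses Jaccard overlap of the queries whose IN sets mention each value as one plausible way to quantify "often occur together", since the slide gives no formula. The workload data and all names are hypothetical.

```python
from collections import Counter


def qf_table(workload_values):
    """QF(q) = RQF(q) / RQFMax for the values of one attribute in the workload."""
    rqf = Counter(workload_values)  # RQF(q): raw frequency of each value
    rqf_max = max(rqf.values())     # RQFMax: frequency of the most common value
    return {q: count / rqf_max for q, count in rqf.items()}


def qf_similarity(t, q, qf):
    """S(t, q) = QF(q) if q == t, else 0."""
    return qf.get(q, 0.0) if t == q else 0.0


def in_clause_similarity(a, b, in_clauses):
    """Assumed measure: Jaccard overlap of the queries that mention a and b."""
    with_a = {i for i, vals in enumerate(in_clauses) if a in vals}
    with_b = {i for i, vals in enumerate(in_clauses) if b in vals}
    union = with_a | with_b
    return len(with_a & with_b) / len(union) if union else 0.0


# Hypothetical workload: values requested for the MFR attribute.
workload = ["TOYOTA", "TOYOTA", "HONDA", "TOYOTA", "NISSAN", "HONDA"]
qf = qf_table(workload)

# Hypothetical IN clauses, one set per workload query.
in_clauses = [{"TOYOTA", "HONDA", "NISSAN"}, {"TOYOTA", "HONDA"}, {"FERRARI"}]
toyota_honda = in_clause_similarity("TOYOTA", "HONDA", in_clauses)
```

Here TOYOTA and HONDA always appear in the same IN clauses, so they come out as maximally similar under the assumed measure, while TOYOTA and FERRARI never co-occur.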
QFIDF Similarity

QF is purely workload-based, a big disadvantage when workloads are insufficient or
unreliable.

For QFIDF similarity:

S(t, q) = QF(q) * IDF(q) when t = q

where QF(q) = (RQF(q) + 1) / (RQFMax + 1).
Thus we get a small nonzero value even if a value is never referenced in
the workload.
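The smoothing effect of the "+1" terms can be seen in a short Python sketch. It combines the smoothed QF(q) = (RQF(q) + 1) / (RQFMax + 1) with IDF(q) = log(n / F(q)) for a single categorical attribute. The sample database column, the workload, and the helper name `qfidf_table` are hypothetical.

```python
import math
from collections import Counter


def qfidf_table(db_values, workload_values):
    """Map each database value q to QF(q) * IDF(q) using the smoothed QF."""
    n = len(db_values)
    freq = Counter(db_values)       # F(q): tuples holding value q
    rqf = Counter(workload_values)  # RQF(q); a Counter returns 0 for unseen values
    rqf_max = max(rqf.values(), default=0)
    return {
        q: ((rqf[q] + 1) / (rqf_max + 1)) * math.log(n / f)
        for q, f in freq.items()
    }


# Hypothetical manufacturer column and workload.
db = ["TOYOTA"] * 5 + ["HONDA"] * 4 + ["LOTUS"]
workload = ["TOYOTA", "TOYOTA", "HONDA"]
weights = qfidf_table(db, workload)
```

LOTUS never appears in the workload, yet its weight stays positive because the smoothed QF is 1/3 rather than 0; with plain QF it would drop out of the ranking entirely.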
Breaking ties


Problem: Many tuples may tie for the same similarity score and get
ordered arbitrarily. This arises in both the empty-answers and many-answers problems.
Solution: Determine weights for the missing attribute values that
reflect their "global importance" for ranking purposes, using
workload information.

Extend QF similarity: use the quantity [formula image not preserved]
to break ties.
• Extending IDF similarity by using IDF values presents challenges.
Implementation
Pre-processing component
Query-processing component

Pre-processing component

Compute and store a representation of the similarity function in auxiliary
database tables.

For categorical data, compute IDF(t) (resp. QF(t)) from the frequency
of occurrence of values in the database (resp. workload) and store the results in auxiliary
database tables.
• For numeric data, store an approximate representation of the smooth function
IDF() (resp. QF()), so that the function value can be retrieved at runtime.
Query processing component

Main task: Given a query Q and an integer K, retrieve the Top-K tuples
from the database using one of the ranking functions.

The ranking function is extracted in the pre-processing phase.
• An SQL DBMS is used to solve the Top-K problem.

Handling a simpler query-processing problem

Input: table R with M categorical columns and key column TID; C, a
conjunction of the form Ak=qk.....; and an integer K.
• Output: the Top-K tuples of R most similar to Q.
• Similarity function: overlap similarity.
Implementation of Top-K operator


Traditional approach
Index-based approach

The overlap similarity function satisfies the following monotonicity property, so the Threshold Algorithm (TA) can be adapted:
If T and U are two tuples such that for all k, Sk(tk,qk) < Sk(uk,qk), then SIM(T,Q) < SIM(U,Q).

To adapt TA, sorted-access and random-access methods are implemented.
The algorithm performs sorted access for each attribute, retrieves the complete tuples with the corresponding TIDs
by random access, and maintains a buffer of the Top-K tuples seen so far.
Index-based TA (ITA)
[Figure: ITA using sorted access and random access]
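The sorted-access and random-access loop described above can be sketched in Python as a simplified, in-memory variant of the Threshold Algorithm. This is an illustration under assumptions, not the ITA implementation from the slides: sorted access is simulated by ordering TIDs so that tuples matching the query value come first, random access is a dictionary lookup by TID, and per-attribute scores are `weights[a]` on a match and 0 otherwise (all weights equal to 1 gives overlap similarity). The table and all names are hypothetical.

```python
import heapq


def score(tup, query, weights):
    """SIM(T, Q) with S_k = weights[k] on a match and 0 otherwise."""
    return sum(weights[a] for a, q in query.items() if tup[a] == q)


def ta_top_k(table, query, weights, k):
    """Return up to k (TID, score) pairs with the highest scores.

    table: dict TID -> tuple; query: dict attribute index -> value.
    """
    # Simulated sorted access: per attribute, matching TIDs first.
    lists = {
        a: sorted(table, key=lambda tid, a=a, q=q: table[tid][a] != q)
        for a, q in query.items()
    }
    seen = {}  # TID -> full score, filled in as tuples are encountered
    top = []
    for depth in range(len(table)):
        threshold = 0.0
        for a, q in query.items():
            tid = lists[a][depth]  # sorted access on attribute a
            tup = table[tid]       # random access: fetch the whole tuple by TID
            if tid not in seen:
                seen[tid] = score(tup, query, weights)
            # Best score any unseen tuple could still get on attribute a.
            threshold += weights[a] if tup[a] == q else 0.0
        top = heapq.nlargest(k, seen.items(), key=lambda item: item[1])
        # Stop once the k-th best seen score reaches the threshold.
        if len(top) == k and top[-1][1] >= threshold:
            break
    return top


# Hypothetical table keyed by TID: (body style, manufacturer).
table = {
    1: ("SEDAN", "NISSAN"),
    2: ("CONVERTIBLE", "TOYOTA"),
    3: ("CONVERTIBLE", "NISSAN"),
    4: ("SEDAN", "TOYOTA"),
}
query = {0: "CONVERTIBLE", 1: "NISSAN"}
weights = {0: 1.0, 1: 1.0}  # equal weights: overlap similarity
result = ta_top_k(table, query, weights, k=2)
```

Early termination relies on the monotonicity property: once every sorted list has moved past its matching tuples, no unseen tuple can beat the buffered Top-K, so the scan stops without scoring the remaining tuples.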
Conclusion


Thus TF-IDF based techniques were extended to
numerical and mixed data.
Workload tracking was used as a weak form of
collaborative filtering.