Automated Ranking of Database Query Results

Sanjay Agrawal, Surajit Chaudhuri, Gautam Das, Aristides Gionis
Presented by Archana Vijayalakshmanan
4/11/2006

Contents
• Introduction
• Different ranking functions
• Breaking ties
• Implementation
• Conclusion

Introduction
• Automated ranking of query results is a popular aspect of IR.
• Database systems support only a Boolean query model, which leads to two problems:
  • Empty answers
  • Many answers
• Automated ranking of query results takes a user query and maps it to a Top-K query with a ranking function.

Automated ranking functions for the ‘Empty Answers Problem’
• IDF Similarity
• QF Similarity
• QFIDF Similarity

IDF Similarity

IR technique:
• Word w, document d
• Q = set of keywords
• IDF(w) = log(N / F(w))
• TF(w, d) = frequency of occurrence of w in d
• Cosine similarity between query and document is the normalized dot product of the two corresponding vectors.
• The similarity function is known as cosine similarity with TF-IDF weightings.

Database (categorical attributes only):
• An <attribute, value> pair plays the role of the word w; a tuple plays the role of the document d.
• T = <t1, ..., tm>, Q = <q1, ..., qm>
• Condition is “WHERE A1 = q1 AND ... AND Am = qm”
• IDFk(t) = log(n / Fk(t))
  • n = number of tuples in the database
  • Fk(t) = number of tuples in the database where Ak = t
• Similarity between T and Q is the sum of the corresponding similarity coefficients over all attributes:

$$\mathrm{SIM}(T, Q) = \sum_{k=1}^{m} S_k(t_k, q_k)$$

• The dot product is un-normalized.
• TF is irrelevant.
• The similarity function is known as IDF similarity.

Example: query = {CONVERTIBLE, NISSAN}

Generalizations of IDF similarity

For numeric data
• It is inappropriate to use the previous similarity coefficients.
• The frequency of a numeric value depends on nearby values.
• Discretizing a numeric attribute into a categorical one is problematic.
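Before the numeric case is addressed, the categorical IDF similarity defined so far can be made concrete. The following is a minimal sketch, not the authors' implementation. It assumes that the coefficient Sk(tk, qk) equals IDFk(tk) when tk = qk and 0 otherwise (the slides do not spell this out for IDF, so this is taken by analogy with the QF coefficient), and the `cars` table, attribute names, and function names are all hypothetical.

```python
import math
from collections import Counter

def idf_tables(rows, attributes):
    """Per attribute k, map each value t to IDF_k(t) = log(n / F_k(t))."""
    n = len(rows)
    tables = {}
    for a in attributes:
        freq = Counter(row[a] for row in rows)  # F_k(t) for every value t
        tables[a] = {v: math.log(n / f) for v, f in freq.items()}
    return tables

def idf_similarity(row, query, tables):
    """SIM(T, Q): sum of IDF_k(q_k) over the attributes where t_k == q_k.

    Assumption: a non-matching attribute contributes 0.
    """
    return sum(tables[a].get(q, 0.0) for a, q in query.items() if row[a] == q)

# Hypothetical toy table (not from the slides).
cars = [
    {"MFR": "NISSAN", "TYPE": "SEDAN"},
    {"MFR": "NISSAN", "TYPE": "CONVERTIBLE"},
    {"MFR": "TOYOTA", "TYPE": "SEDAN"},
    {"MFR": "HONDA", "TYPE": "SEDAN"},
]
tables = idf_tables(cars, ["MFR", "TYPE"])
query = {"MFR": "NISSAN", "TYPE": "CONVERTIBLE"}
ranked = sorted(cars, key=lambda r: idf_similarity(r, query, tables), reverse=True)
```

In this toy table CONVERTIBLE is rarer than NISSAN, so a match on TYPE contributes more to the score than a match on MFR, which is the intended effect of IDF weighting.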
Solution:
• Let {t1, t2, ..., tn} be the values of attribute A.
• For every value t, take the sum of the “contributions” to t from every other point ti.
• Contributions are modeled as Gaussian distributions.
• The similarity function depends on a bandwidth parameter.

For range/set of values

QF Similarity
• The importance of attribute values is determined by the frequency of their occurrence in the workload.
• For categorical data, the query frequency is QF(q) = RQF(q) / RQFMax, where
  • RQF(q) = raw frequency of occurrence of value q of attribute A in the query strings of the workload
  • RQFMax = raw frequency of the most frequently occurring value in the workload
• S(t, q) = QF(q) if q = t, and 0 otherwise.
• Similarity between pairs of different categorical attribute values can also be derived from the workload, e.g. to find S(TOYOTA, HONDA):
  • Analyzing IN clauses of queries: if a certain pair of values often occurs together in the workload, the values are similar, e.g. queries with C as “MFR IN {TOYOTA, HONDA, NISSAN}”.
  • Several recent queries in the workload by a specific user repeatedly requesting TOYOTA and HONDA.

QFIDF Similarity
• QF is purely workload-based, which is a big disadvantage for insufficient or unreliable workloads.
• For QFIDF similarity, S(t, q) = QF(q) * IDF(q) when t = q, where QF(q) = (RQF(q) + 1) / (RQFMax + 1).
• Thus we get a small non-zero value even if a value is never referenced in the workload.

Breaking ties
• Problem: many tuples may tie for the same similarity score and get ordered arbitrarily. This arises in both the empty-answers and the many-answers problems.
• Solution: determine weights for the missing attribute values that reflect their “global importance” for ranking purposes, using workload information.
• Extend QF similarity and use the resulting quantity to break ties.
• Extending IDF similarity by using IDF values presents challenges.

Implementation
• Pre-processing component
• Query-processing component

Pre-processing component
• Compute and store a representation of the similarity function in auxiliary database tables.
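As an illustration of storing the similarity function in auxiliary tables, here is a minimal sketch under stated assumptions. Python's built-in `sqlite3` stands in for the SQL DBMS, the table name `cars`, the column `mfr`, the auxiliary table name `aux_<attribute>`, and the function `build_aux_tables` are hypothetical, and the workload is reduced to a plain list of referenced values. It computes IDF(t) = log(n / F(t)) from the data and the smoothed QF(q) = (RQF(q) + 1) / (RQFMax + 1) from the workload.

```python
import math
import sqlite3
from collections import Counter

def build_aux_tables(conn, table, attribute, workload_values):
    """Store IDF and smoothed QF for each value of one categorical attribute.

    Assumption: `table` and `attribute` are trusted identifiers (they are
    interpolated into the SQL text, which is unsafe for untrusted input).
    """
    cur = conn.cursor()
    n = cur.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]
    freq = cur.execute(
        f"SELECT {attribute}, COUNT(*) FROM {table} GROUP BY {attribute}"
    ).fetchall()
    rqf = Counter(workload_values)           # RQF(q); missing values count as 0
    rqf_max = max(rqf.values(), default=0)   # RQFMax
    cur.execute(
        f"CREATE TABLE aux_{attribute} (value TEXT PRIMARY KEY, idf REAL, qf REAL)"
    )
    cur.executemany(
        f"INSERT INTO aux_{attribute} VALUES (?, ?, ?)",
        [(v, math.log(n / f), (rqf[v] + 1) / (rqf_max + 1)) for v, f in freq],
    )
    conn.commit()

# Hypothetical data and workload (NISSAN is never referenced in the workload).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE cars (tid INTEGER PRIMARY KEY, mfr TEXT)")
conn.executemany(
    "INSERT INTO cars (mfr) VALUES (?)",
    [("TOYOTA",), ("TOYOTA",), ("HONDA",), ("NISSAN",)],
)
workload = ["TOYOTA"] * 6 + ["HONDA"] * 3
build_aux_tables(conn, "cars", "mfr", workload)
aux = {v: (idf, qf) for v, idf, qf in conn.execute("SELECT value, idf, qf FROM aux_mfr")}
```

At query time the QFIDF coefficient for a matching value would be the product of the two stored columns. Because of the +1 smoothing, NISSAN still receives a small non-zero QF even though it never appears in the workload.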
• For categorical data, to compute IDF(t) (resp. QF(t)), compute the frequency of occurrence of each value in the database (resp. workload) and store the results in auxiliary database tables.
• For numeric data, an approximate representation of the smooth function IDF() (resp. QF()) is stored, so that the function value can be retrieved at runtime.

Query-processing component
• Main task: given a query Q and an integer K, retrieve the Top-K tuples from the database using one of the ranking functions.
• The ranking function is extracted in the pre-processing phase.
• A SQL DBMS is used for solving the Top-K problem.
• Handling a simpler query-processing problem:
  • Input: table R with M categorical columns and key column TID, a condition C that is a conjunction of the form Ak = qk ..., and an integer K.
  • Output: the Top-K tuples of R most similar to Q.
  • Similarity function: overlap similarity.

Implementation of Top-K operator
• Traditional approach
• Index-based approach
  • The overlap similarity function satisfies the following monotonic property: if T and U are two tuples such that for all k, Sk(tk, qk) < Sk(uk, qk), then SIM(T, Q) < SIM(U, Q).
  • Adapt the Threshold Algorithm (TA).
  • To adapt TA, sorted and random access methods were implemented.
  • The algorithm performs sorted access for each attribute, retrieves the complete tuple for each corresponding TID by random access, and maintains a buffer of the Top-K tuples seen so far.

Index-based TA (ITA)
• Sorted access
• Random access

Conclusion
• TF-IDF based techniques were extended to numerical and mixed data.
• Workload tracking was used as a weak form of collaborative filtering.
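The threshold-style Top-K retrieval described in the implementation slides can be sketched as follows. This is a simplified in-memory illustration, not the ITA implementation from the paper. It assumes each attribute k contributes a fixed weight when tk = qk and 0 otherwise (all weights equal to 1 would simply count matching attributes), that sorted access for an attribute is an index lookup returning the matching TIDs, and that random access is a lookup by TID. The function name `top_k_threshold`, the weights, and the sample rows are hypothetical.

```python
import heapq

def top_k_threshold(rows, query, weights, k):
    """rows: {tid: {attr: value}}; query: {attr: value}; weights: {attr: match score}."""
    # Sorted access: per attribute, the TIDs whose value matches q_k
    # (score weights[a]); every other tuple scores 0 on that attribute.
    lists = {a: iter(sorted(t for t, r in rows.items() if r[a] == q))
             for a, q in query.items()}
    last_seen = {a: weights[a] for a in query}  # upper bound for unseen tuples
    seen, heap = set(), []                      # min-heap of (score, tid), size <= k

    def full_score(tid):                        # random access by TID
        row = rows[tid]
        return sum(weights[a] for a, q in query.items() if row[a] == q)

    while any(v > 0 for v in last_seen.values()):
        for a in query:
            tid = next(lists[a], None)
            if tid is None:
                last_seen[a] = 0.0              # list exhausted: the rest score 0 here
                continue
            if tid not in seen:
                seen.add(tid)
                heapq.heappush(heap, (full_score(tid), tid))
                if len(heap) > k:
                    heapq.heappop(heap)         # keep only the best k seen so far
        threshold = sum(last_seen.values())     # best score an unseen tuple could have
        if len(heap) == k and heap[0][0] >= threshold:
            break                               # no unseen tuple can enter the Top-K
    return sorted(heap, reverse=True)

# Hypothetical data and weights.
rows = {
    1: {"MFR": "NISSAN", "TYPE": "CONVERTIBLE"},
    2: {"MFR": "NISSAN", "TYPE": "SEDAN"},
    3: {"MFR": "TOYOTA", "TYPE": "CONVERTIBLE"},
    4: {"MFR": "HONDA", "TYPE": "SEDAN"},
    5: {"MFR": "TOYOTA", "TYPE": "SEDAN"},
}
query = {"MFR": "NISSAN", "TYPE": "CONVERTIBLE"}
weights = {"MFR": 1.0, "TYPE": 2.0}
top2 = top_k_threshold(rows, query, weights, 2)
top1 = top_k_threshold(rows, query, weights, 1)
```

The early stop is sound only because of the monotonic property: a tuple that has not yet appeared in any sorted list cannot score higher than the sum of the last-seen per-attribute scores. If fewer than K tuples match any attribute, this sketch returns fewer than K results rather than padding with zero-score tuples.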