Efficient Top-k Searching According to User Preferences Based on Fuzzy Functions with Usage of Tree-Oriented Data Structures

Matúš Ondreička
Supervised by Prof. Jaroslav Pokorný
Faculty of Mathematics and Physics, Department of Software Engineering
Charles University in Prague, Czech Republic
VLDB 2011 PhD Workshop

Research - outline
  - introduction
    - top-k problem, user preferences, fuzzy functions
  - related work
  - technical solutions
    - tree-oriented data structures
      - set of B+-trees
      - multidimensional B+-tree
      - multidimensional B+-tree with lists
    - MD-algorithm, MXT-algorithm
  - experiments, current results
  - motivation of future research

Top-k problem
  - top-k: searching for the (few) best k objects
    - objects with more attributes
    - k objects with the highest rating according to user preferences
  - user preferences based on fuzzy functions
  - efficient top-k searching without accessing all the objects
  - full support of the model of user preferences
    - local preferences
    - global preferences

Model of user preferences
  - local preferences
    - objects are preferred according to one attribute
    - modeled with a fuzzy function fU(x): xA → [0, 1]
    - an attribute's domain is continuous
      [Figure: fuzzy function fU(x), ranging from 0 (0%) to 1 (100%), plotted over xA from 0€ to 1000€]
    - an attribute's domain is discrete: each value is evaluated, e.g.
      ACER := 0.6, APPLE := 1.0, DELL := 0.9, SONY := 0.8
  - global preferences
    - objects are preferred according to more attributes
    - modeled with an aggregation function @U(x): (f1U(x), ..., fmU(x)) → [0, 1]
    - e.g. weighted average
      @U(x) = (w1 · f1U(x) + ... + wm · fmU(x)) / (w1 + ... + wm)

Motivation and related work
  - XML, multimedia, the Web, etc.
  - relational databases
    - Ilyas, Beskales, Soliman: A survey of top-k query processing techniques in relational database systems. 2008.
    - ranking functions
    - query optimization
  - Fagin's algorithms
    - Fagin, R., Lotem, A., Naor, M.: Optimal aggregation algorithms for middleware. Journal of Computer and System Sciences 66, 2003.
    - support only monotone ranking functions
    - based on sorted lists
    - no support for local user preferences
  BASIC MOTIVATION FOR OUR RESEARCH

Usage of B+-tree
  - local user preference given by a fuzzy function
    - on a monotone interval: moving in the leaf level
  - "ways" in the leaf level
    - moving continuously on all "ways"
    - comparing objects on different "ways"
    - choosing the biggest on all the "ways"
  - objects are obtained during the computation of the algorithm with ratings in descending order by the fuzzy function fU
  [Figure: B+-tree with keys from 0.0 to 1.0 and objects stored in the leaf level, "ways" w1 to w5 running along the leaf level, and the fuzzy function fU plotted over the key range 0 to 1]

Fagin's algorithms
  - TA (threshold algorithm) and NRA (no random access)
  - search for the best k objects according to a monotone aggregate function @ without accessing all objects
  - preconditions
    - a set of objects X with values of m attributes A1, ..., Am
    - objects from the set X are stored in m lists L1, ..., Lm
    - lists contain pairs (x, ax)
    - lists are sorted in descending order
    - monotone aggregation function @

      L1          L2          L3
      (x1, 1.0)   (x3, 1.0)   (x1, 1.0)
      (x2, 0.8)   (x4, 0.8)   (x4, 0.8)
      (x3, 0.6)   (x6, 0.6)   (x3, 0.6)
      (x4, 0.4)   (x1, 0.4)   (x5, 0.4)
      (x5, 0.2)   (x2, 0.2)   (x2, 0.2)
      (x6, 0.0)   (x5, 0.0)   (x6, 0.0)

  - multi-user solution
    - lists are based on B+-trees
    - the algorithm can get pairs (x, fU(x)) from a B+-tree sequentially in descending order according to the user's fuzzy function fU(x)
  [Figure: one B+-tree per attribute A1, A2, A3; objects in the leaf levels appear as x1 x2 x3 x4 x5 x6 (A1), x5 x2 x1 x6 x4 x3 (A2), x6 x2 x5 x3 x4 x1 (A3)]

Multidimensional B-tree
  - the MDB-tree allows indexing a set of objects by m > 1 attributes in one data structure
  - m levels, values of one attribute are stored in each level
  - nodes are B+-trees, whose leaf nodes are linked in two directions

      object   A1    A2    A3
      A        0.0   0.4   0.3
      B        0.0   0.4   0.5
      C        0.0   1.0   0.5
      D        0.0   1.0   0.5
      E        0.5   0.9   1.0
      F        1.0   0.0   0.0
      G        1.0   0.0   0.0
      H        1.0   0.0   0.7
      I        1.0   0.4   0.7
      J        1.0   0.7   0.4
      K        1.0   0.7   0.6

  [Figure: the MDB-tree built from these objects, with A1 keys 0.0, 0.5, 1.0 in the first level, A2 keys in the second level, A3 keys in the third level, and the objects A to K attached to the third level]

MD-algorithm
  - searches for the best k objects in a multidimensional B-tree (MDB-tree) without getting all the objects
  - principle of the MD-algorithm
    - it searches the MDB-tree with a recursive procedure
    - it uses a temporary list TK of the current best k objects
    - it uses the best rating B(S) of a B+-tree S
    - monotone aggregate function @
  - definition, analogous to Fagin's TA-algorithm
    - B(S) of a B+-tree S in the i-th level of the MDB-tree:
      B(S) = @(k1, ..., ki-1, 1, ..., 1)
  - example: @(xA1, xA2) = xA1 + xA2
  [Figure: two-level MDB-tree with first-level keys 0.4, 0.7, 0.8 and objects A to H, annotated with B(S) = 1 + 1 = 2.0, B(S) = 0.8 + 1 = 1.8 and B(S) = 0.8 + 0.7 = 1.5]

Searching the best 3 objects
  [Figure: step-by-step search for the best 3 objects in a three-level MDB-tree with B+-trees S1 to S10, the fuzzy functions f1U(x), f2U(x), f3U(x) and the temporary list TK (object and rating for the 1st, 2nd and 3rd place). Annotated bounds: B(S2) = 1.0 + 1 + 1 = 3.0, B(S3) = 1.0 + 0.6 + 1 = 2.6, and on the last level 1.0 + 0.6 + 0.5 = 2.1, 1.0 + 0.6 + 0.6 = 2.2, 1.0 + 0.6 + 0.2 = 1.8]

MXT-algorithm
  - based on integration of the MD-algorithm and the TA-algorithm
  - uses a new data structure: multidimensional B+-tree with lists
  - first n attributes (nominal)
    - stored and searched in the same way as in the MD-algorithm
  - last m - n attributes (ordinal)
    - stored as groups of m - n Fagin's sorted lists
    - searched by instances of Fagin's TA-algorithm

      object   A1    A2    A3    A4
      x1       1.0   0.7   1.0   0.3
      x2       1.0   0.7   0.8   0.0
      x3       1.0   0.7   0.5   1.0
      x4       1.0   0.7   0.4   0.7
      x5       1.0   0.7   0.2   0.1
      x6       1.0   0.7   0.0   0.6
      ...      ...   ...   ...   ...

  [Figure: multidimensional B+-tree with lists; the levels for A1 and A2 are B+-trees, and each key of the A2 level holds a group of sorted lists for A3 and A4. For the objects in the table the lists are
      A3: {x1, 1.0} {x2, 0.8} {x3, 0.5} {x4, 0.4} {x5, 0.2} {x6, 0.0}
      A4: {x3, 1.0} {x4, 0.7} {x6, 0.6} {x1, 0.3} {x5, 0.1} {x2, 0.0}]

An example of results
  - implemented top-k algorithms
    - TA-algorithm, MD-algorithm, MXT-algorithm
    - using lists based on B+-trees
    - implementation in Java
    - data structures have been tested in memory (not on disk)
  - test results
    - the number of obtained objects
  - real data
    - 8 822 flats for rent in Prague
    - ||dom(District)|| = 10, ||dom(Type)|| = 10, ||dom(Area)|| = 229, ||dom(Price)|| = 411
  - real user's preferences
    - the user prefers flats of some types in specific districts, smaller prices and bigger areas

Motivation, future research
  - improvements of performance of algorithms: heuristics, improvement of data structures
  - attribute dependencies between more attributes
  - similarity measures: finding the k objects most similar to a given object (the object can be a user preference)
  - user feedback: after running the first top-k query, the user tunes his/her preferences and executes the next top-k query
  - different data models
  - very large data sets
  - tree-oriented data structures allow dynamising the environment while solving a top-k problem
  - data streams: tree-oriented data structure as a sliding window
  - parallel computing: in the MXT-algorithm construction, instances of the TA-algorithm would be computed concurrently
  - different models of user preferences
  - automatic arrangement of levels in the MDB-tree with lists, managing empty values
  - monitoring the distribution of the key values in nodes
  - approximations, uncertain data, heterogeneous data
  - web environment: more information resources distributed on the web

An application TreeTopK

Thank you for your attention!
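
The threshold algorithm (TA) named on the slide about Fagin's algorithms can be illustrated with a short Java sketch, Java being the language the slides mention for the implementation. This is a minimal sketch under stated assumptions, not the author's implementation: the class ThresholdAlgorithmSketch and the methods topK, ranked and slideLists are hypothetical names, random access is simulated with in-memory maps, and a plain sum is assumed as the monotone aggregate function @. The input is the three sorted lists L1, L2, L3 shown on that slide.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.function.ToDoubleFunction;

/** Illustrative sketch of Fagin's threshold algorithm (TA); all names are hypothetical. */
final class ThresholdAlgorithmSketch {

    /** Attribute values used in every list on the slide, in descending order. */
    static final double[] SLIDE_VALUES = {1.0, 0.8, 0.6, 0.4, 0.2, 0.0};

    /** Builds one sorted list of pairs (x, a_x) from object ids given in descending order. */
    static List<Map.Entry<String, Double>> ranked(String... ids) {
        List<Map.Entry<String, Double>> list = new ArrayList<>();
        for (int i = 0; i < ids.length; i++) {
            list.add(Map.entry(ids[i], SLIDE_VALUES[i]));
        }
        return list;
    }

    /** The lists L1, L2, L3 from the slide. */
    static List<List<Map.Entry<String, Double>>> slideLists() {
        return List.of(
                ranked("x1", "x2", "x3", "x4", "x5", "x6"),
                ranked("x3", "x4", "x6", "x1", "x2", "x5"),
                ranked("x1", "x4", "x3", "x5", "x2", "x6"));
    }

    /**
     * Returns the k best (object, rating) pairs in descending order of rating.
     * Assumes k >= 1, every object occurs in every list, all lists have equal
     * length, and the aggregate function is monotone.
     */
    static List<Map.Entry<String, Double>> topK(
            List<List<Map.Entry<String, Double>>> lists, int k,
            ToDoubleFunction<double[]> aggregate) {
        int m = lists.size();
        // Random access is simulated with one id -> value map per list.
        List<Map<String, Double>> lookup = new ArrayList<>();
        for (List<Map.Entry<String, Double>> list : lists) {
            Map<String, Double> byId = new HashMap<>();
            for (Map.Entry<String, Double> e : list) {
                byId.put(e.getKey(), e.getValue());
            }
            lookup.add(byId);
        }
        Set<String> seen = new HashSet<>();
        List<Map.Entry<String, Double>> best = new ArrayList<>();
        for (int row = 0; row < lists.get(0).size(); row++) {
            double[] last = new double[m]; // values read at the current depth
            for (int i = 0; i < m; i++) {
                Map.Entry<String, Double> e = lists.get(i).get(row); // sorted access
                last[i] = e.getValue();
                if (seen.add(e.getKey())) {
                    double[] values = new double[m];
                    for (int j = 0; j < m; j++) {
                        values[j] = lookup.get(j).get(e.getKey()); // random access
                    }
                    best.add(Map.entry(e.getKey(), aggregate.applyAsDouble(values)));
                }
            }
            // Keep only the k best objects seen so far.
            best.sort((a, b) -> Double.compare(b.getValue(), a.getValue()));
            if (best.size() > k) {
                best = new ArrayList<>(best.subList(0, k));
            }
            // No unseen object can be rated above the threshold.
            double threshold = aggregate.applyAsDouble(last);
            if (best.size() == k && best.get(k - 1).getValue() >= threshold) {
                break;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Plain sum as the monotone aggregate (an assumption for this sketch).
        System.out.println(topK(slideLists(), 2, v -> v[0] + v[1] + v[2]));
    }
}
```

With the sum as aggregate and k = 2, the sketch stops after reading three rows of the lists and returns x1 and then x3, whose ratings are 1.0 + 0.4 + 1.0 = 2.4 and 0.6 + 1.0 + 0.6 = 2.2 (up to floating-point rounding). In the multi-user setting described on the slide, the pairs would instead be read from B+-trees in descending order of the user's fuzzy function fU(x); that part is not modeled here.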