Improving Retrieval Accuracy in Web Databases Using Attribute Dependencies
Ravi Gummadi & Anupam Khulbe
gummadi@asu.edu – akhulbe@asu.edu
Computer Science Department
Arizona State University

Agenda
• Introduction [Ravi]
• SmartINT System [Anupam]
• Query Processing [Anupam]
  – Source Selection
  – Tuple Expansion
• Learning [Anupam]
• Experiments [Ravi]
• Conclusion & Future Work [Ravi]

INTRODUCTION

Introduction
Consider a table with the universal relation from the vehicle domain. It describes an imaginary schema containing all the attributes of a vehicle.

VIN | MID | Make | Model | Vehicle type | Price | Engine | Cylinders | Miles | Dealer | Address
V001 | HACC96 | Honda | Accord | Fullsize | 19000 | K24A4 | 6 | 45k | Frank | 1011 E Lemon St, Scottsdale, AZ
V002 | TYCRA08 | Toyota | Corolla | Midsize | 14000 | F23A1 | 4 | 80k | Frank | 1011 E Lemon St, Scottsdale, AZ
V003 | TYCRA09 | Toyota | Corolla | Midsize | 16000 | 155 HP | 4 | 50k | John | 900 10th Street, Tucson, AZ
V004 | TYCRY09 | Toyota | Camry | Fullsize | 12000 | 2AZ-FE I4 | 6 | 109k | Steven | 601 Apache Blvd, Glendale, AZ
V005 | HACV08 | Honda | Civic | Midsize | 11500 | F23A1 | 4 | 120k | Frank | 1011 E Lemon St, Scottsdale, AZ

Normalized Tables
Lossless normalization of the universal relation by the database administrator. The figure marks a primary key (VIN) and a foreign key, and names the tables Cars-for-Sale, Dealer-Info and Car-Reviews. Columns shown, with values in row order:
• VIN: V001, V002, V003, V004, V005
• MID: HACC96, TYCRA08, TYCRA09, TYCRY09, HACV08
• Make: Honda, Toyota, Toyota, Toyota, Honda
• Model: Accord, Corolla, Corolla, Camry, Civic
• Miles: 45k, 80k, 50k, 109k, 120k
• Dealer: Frank, Frank, John, Steven, Frank
• Price: 19000, 14000, 16000, 12000, 11500
• Engine: K24A4, F23A1, 155 HP, 2AZ-FE I4, F23A1
• Cylinders: 6, 4, 4, 6, 4
• Review: Excellent, Good, Average, Excellent, Very Good
• Vehicle type: Midsize, Fullsize, SUV, Fullsize, Midsize

Dealer-Info
Name | Address
Frank | 1011 E Lemon St, Scottsdale, AZ
Steven | 601 Apache Blvd, Glendale, AZ
John | 900 10th Street, Tucson, AZ

Query Processing
SELECT make, mid, model FROM cars-for-sale c, car-reviews r WHERE cylinders = 4 AND price < $15k

MID | Make | Model
TYCRA08 | Toyota | Corolla
HACV08 | Honda | Civic

• Complete Data
• Certain Query
• Lossless Normalization
• Accurate Results

Advent of Web (in context of Vehicle Domain)
Figure labels: Database Administrator; Used Car Dealers; Car Reviewers; Customers Selling Cars; Engine Makers.

A Sample Data Model

Used Car Dealers – t_dealer_info
Name | Address
Frank | 1011 E Lemon St, Scottsdale, AZ
Steven | 601 Apache Blvd, Glendale, AZ
John | 900 10th Street, Tucson, AZ

Customers Selling Cars – t_car_sales
MID | Make | Model | Price
HACC96 | Honda | Accord | 19000
HACV08 | Honda | Civic | 12000
TYCRY08 | Toyota | Camry | 14500
TYCRA09 | Toyota | Corolla | 14500

Car Reviewers – t_car_reviews
Model_name | Review | Vehicle-type | Dealer
Corolla | Excellent | Midsize | Frank
Accord | Good | Fullsize | Frank
Highlander | Average | SUV | John
Camry | Excellent | Fullsize | Steven
Civic | Very Good | Midsize | Frank

Engine Makers – t_eng_makers
MID | Mdl | Engine | Cylinders
HACC96 | Accord | K24A4 | 6
TYCRA08 | Corolla | F23A1 | 4
TYCRA09 | Corolla | 155 HP | 4
TYCRY09 | Camry | 2AZ-FE I4 | 6
HACV08 | Civic | F23A1 | 4
HACV07 | Civic | J27B1 | 4

Annotations on the data model:
• VIN field masked: hidden sensitive information
• Schema heterogeneity
• Unavailability of information
• Key might not be the shared attribute

Vehicles Revisited
Figure: a user query posed over the four sources (Engine Makers, Car Reviewers, Customers Selling Cars, Used Car Dealers); two of the sources are labeled Table 1 and Table 2.

Query is Partial….
SELECT make, model FROM cars-for-sale c, car-reviews r WHERE cylinders = 4 AND price < $15k
• The attributes from one source are not visible in the other sources in WebDBs; the query is not complete
• The tables are not visible to the users

Approaches – Single Table
• Answering queries from a single table
• Unable to propagate constraints; inaccurate results
SELECT make, model WHERE cylinders = 4 AND price < $15k

Result from Customers Selling Cars alone:
MID | Make | Model | Price
HACV08 | Honda | Civic | 12000
TYCRY08 | Toyota | Camry | 14500
TYCRA09 | Toyota | Corolla | 14500

Inaccurate result: Camry has 6 cylinders.

Approaches – Direct Join
• Join the tables based on the shared attribute
• Leads to spurious tuples which do not exist
SELECT make, model WHERE cylinders = 4 AND price < $15k

Result of joining Customers Selling Cars with Engine Makers on Model = Mdl:
Make | Mdl | Price | Engine | Cylinders
Honda | Civic | 12000 | F23A1 | 4
Honda | Civic | 12000 | J27B1 | 4
Toyota | Corolla | 14500 | F23A1 | 4
Toyota | Corolla | 14500 | 155 HP | 4

Spurious results: the join generates extra tuples.

Why is JOIN not working?
The Rules of Normalization
• Eliminate Repeating Groups
• Eliminate Redundant Data
• Eliminate Columns Not Dependent On Key
These cannot be ensured in Autonomous Web Databases. In normalization all columns are dependent on the key, which is NOT necessarily true in ad hoc normalization!!
http://www.datamodel.org/NormalizationRules.html

Dependencies….
• Shared attribute(s) is not the ‘Key’!
• The shared attribute’s relation with other columns is unknown!!
• LEARN the dependencies between them
• Mine Functional Dependencies (FDs) among the columns
  – Neat; works quite well ‘IF ONLY’ the data is clean
  – Lot of noisy data in Web Databases
• Instead consider APPROXIMATE FUNCTIONAL DEPENDENCIES

Approximate Functional Dependencies
• Approximate Functional Dependencies (AFDs) are rules denoting approximate determinations at the attribute level.
  – AFDs are of the form (X ~~> Y), where X and Y are sets of attributes
  – X is the “determining set” and Y is called the “dependent set”
  – Rules with singleton dependent sets are of high interest
• Examples of AFDs
  – Nationality ~~> Language
  – Make ~~> Model
  – (Job Title, Experience) ~~> Salary

Using AFDs for Query Processing
• These AFDs make up for the missing dependency information between columns.
• They help in propagating constraints distributed across tables.
• They help in predicting the attributes distributed across tables.
• They assist in completing the entity information by predicting the related attributes.

MID | Make | Model | Price
HACV08 | Honda | Civic | 12000
TYCRA09 | Toyota | Corolla | 14500

Summary
• Traditional query processing does not hold for Autonomous Web Databases.
• Problems like incomplete/noisy data, imprecise queries and ad hoc normalization exist.
• Schema heterogeneity can be countered by existing works.
• (Still) Missing PK-FK information leads to inaccurate joins.
• Mine Approximate Functional Dependencies and use them to make up for the missing PK-FK information.

Problem Statement
Given a collection of ad hoc normalized tables, the attribute mappings between the tables and a partial query, return to the user an accurate result set covering the majority of attributes described in the universal relation.
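
The notion of an approximate functional dependency can be made concrete with a small sketch. The slides do not say which confidence measure SmartINT uses, so the code below assumes one common choice: the largest fraction of rows that can be kept such that X -> Y holds exactly (that is, 1 minus the g3 error). The function names `afd_confidence` and `predict` are hypothetical, and the rows are the Engine Makers table from the sample data model.

```python
from collections import Counter, defaultdict

# Engine Makers rows from the sample data model (MID column omitted).
ENGINE_MAKERS = [
    {"Mdl": "Accord", "Engine": "K24A4", "Cylinders": 6},
    {"Mdl": "Corolla", "Engine": "F23A1", "Cylinders": 4},
    {"Mdl": "Corolla", "Engine": "155 HP", "Cylinders": 4},
    {"Mdl": "Camry", "Engine": "2AZ-FE I4", "Cylinders": 6},
    {"Mdl": "Civic", "Engine": "F23A1", "Cylinders": 4},
    {"Mdl": "Civic", "Engine": "J27B1", "Cylinders": 4},
]


def _group(rows, lhs, rhs):
    """Count the rhs values observed for each combination of lhs values."""
    groups = defaultdict(Counter)
    for row in rows:
        key = tuple(row[attr] for attr in lhs)
        groups[key][row[rhs]] += 1
    return groups


def afd_confidence(rows, lhs, rhs):
    """Assumed measure (1 - g3 error): fraction of rows that survive when
    every lhs group keeps only its most frequent rhs value."""
    if not rows:
        return 0.0
    groups = _group(rows, lhs, rhs)
    kept = sum(counts.most_common(1)[0][1] for counts in groups.values())
    return kept / len(rows)


def predict(rows, lhs, rhs, lhs_values):
    """Return (most likely rhs value, P(rhs = value | lhs = lhs_values)),
    or None when lhs_values never occurs in the rows."""
    counts = _group(rows, lhs, rhs).get(tuple(lhs_values))
    if not counts:
        return None
    value, n = counts.most_common(1)[0]
    return value, n / sum(counts.values())


if __name__ == "__main__":
    print(afd_confidence(ENGINE_MAKERS, ["Mdl"], "Cylinders"))
    print(afd_confidence(ENGINE_MAKERS, ["Mdl"], "Engine"))
    print(predict(ENGINE_MAKERS, ["Mdl"], "Cylinders", ["Civic"]))
```

On these six rows, Mdl ~~> Cylinders holds for every row, whereas Mdl ~~> Engine holds for only four of the six, which is why a model name can be used to predict the cylinder count but not the exact engine.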
SMART-INT(EGRATOR) & RELATED WORK

SmartINT Framework
Figure components: Query Interface, Query, Source Selection, Tuple Expansion, Result Set, AFDMiner, Statistics Learner, Attribute Mapping, Web Database, Graph of Tables, Tree of Tables.

Related Work – Attribute Mapping
• Large body of research over the past few years
• Automatic and manual approaches
  – LSD (Doan et al, SIGMOD 2001)
  – Simiflood (Melnik et al, ICDE 2002)
  – Cupid (J. Madhavan et al, VLDB 2001)
  – SEMINT (Clifton et al, TKDE 2000)
  – Clio (Hernandez et al, SIGMOD 2001)
• Schema mapping (translation rules) is more difficult!!
• 1-1 attribute mapping is comparatively easier and can be automated

Related Work – Query Interface
• Imprecise queries
  – Vague (A. Motro, ACM TOIS 1998)
  – AIMQ (U. Nambiar et al, ICDE 2006)
  – QUIC (Kambhampati et al, CIDR 2007)
• Keyword search
  – BANKS (Bhalotia et al, ICDE 2002)
  – DISCOVER (Hristidis et al, VLDB 2003)
  – KITE (Mayassam et al, ICDE 2007)
• PK-FK assumption does not hold!!

Related Work – Web Database
• Query processing on Web databases is an important research problem
  – Ives et al, SIGMOD 2004
  – Lembo et al, KRDB 2002
• QPIAD (G. Wolf et al, VLDB 2007) from DBYochan, close to ours in spirit, uses AFD-based prediction to make up for missing data.

Related Work – AFD Mining
• FD/AFD mining is an important problem in the DB community
• Mining AFDs as approximations of FDs with few error tuples
  – CORDS
  – TANE
• Mining them as condensed representations of association rules
  – AFDMiner (Kalavagattu, MS Thesis, ASU 2008)

QUERY PROCESSING

Query Answering Task
SELECT Make, Vehicle-type WHERE cylinders = 4 AND price < $15k
• Distributed constraints: price is in Customers Selling Cars, cylinders is in Engine Makers
• Distributed attributes: Make is in Customers Selling Cars, Vehicle-type is in Car Reviewers
• Attributes need to be integrated
• Attribute match: Model, Model_name and Mdl
• Result set should adhere to all the constraints distributed across tables
Figure: the four tables of the sample data model.

Query Answering Approach
• Select a tree
• Propagate constraints to the root table
• Process root table constraints to generate “seed” tuples
• Predict attributes using AFDs to expand seed tuples
• Direction of constraint propagation and attribute prediction matters!
• Role of AFDs: accuracy of constraint propagation and attribute prediction depends on AFD confidence

SOURCE SELECTION

Selecting the best tree
Objective: Given a graph of tables and a query, select the most relevant tree of tables of size up to k.
Requirements
1. Need to estimate the relevance of a table when some of the constraints are not mapped onto its attributes
2. Need a relevance function for a tree of tables

Constraint Propagation
• Table 1 (Make, Model, Price) carries the constraint Price < 15k
• Table 2 (Mdl, Engine, Cylinders) carries the constraint Cylinders = 4
• Propagate Cylinders = 4 to Table 1, where it becomes Model = Corolla or Civic
• The constraints and the other information are distributed across the two tables
• The AFD provides the conditional probability P2(Cylinders = 4 | Mdl = model_i)

Relevance of a tree
Factors?
1. Root table relevance
2. Value overlap: what fraction of tuples in the base table can be expanded by the child table?
3. AFD confidence: how accurately can the value be predicted?
Relevance of tree T w.r.t. query q: [formula shown as an image on the slide]
Figure: a tree rooted at T1 (Make, Model, Price) with child tables T2 and T3, drawn from the Car Reviewers and Engine Makers tables. Constraints on T1: C1: Price < 15k; C2: Model = ‘Corolla’ or ‘Civic’.

Relevance of a table
Factors?
1. Fraction of query attributes provided: horizontal relevance
2. Conformance to constraints: vertical relevance
Example: SELECT Make, Vehicle-type WHERE cylinders = 4 AND price < $15k, with C1: Price < 15k and C2: Model = ‘Corolla’ or ‘Civic’ on the (Make, Model, Price) table and Cylinders = 4 on the (Mdl, Engine, Cylinders) table.

TUPLE EXPANSION

Tuple Expansion
• Tuple expansion operates on the tree of tables given by source selection
• It has two main steps
  1. Constructing the schema
  2. Populating the tuples

Phase 1: Constructing schema
SELECT Make, Vehicle-type WHERE cylinders = 4 AND price < $15k
• Tree of tables: Table 1 (Make, Model, Price) and Table 3 (Model_name, Review, Vehicle-type, Dealer)
• Constructed schema: Make | Model | Price | Model_name | Vehicle-type

Phase 2: Populating the tuples
• Local constraint: Price < 15k
• Translated constraint: Model = Corolla or Civic
• Evaluate the constraints on Table 1, then predict Vehicle-type from Table 3

Make | Model | Vehicle-type
Honda | Civic | Midsize
Toyota | Corolla | Midsize

LEARNING

AFD Mining
• The problem of AFD mining is to learn all AFDs that hold over a given relational table
• Two costs:
  1. The major cost is the combinatorial cost of traversing the search space
  2. The cost of visiting data to validate each rule (to compute the interestingness measures)
• The search process for AFDs is exponential in the number of attributes

Specificity
• Normalized with the worst-case Specificity, i.e., X is a key
• The Specificity measure captures our intuition of different types of AFDs.
• It is based on information entropy
  – Shares similar motivations with the way SplitInfo is defined in decision trees while computing Information Gain Ratio
• Follows monotonicity
  – The Specificity of a subset is equal to or lower than the Specificity of the set (based on the Apriori property)

Lattice Traversal
Specificity follows monotonicity. Lattice of attribute sets:
ABCD
ABC  ABD  ACD  BCD
AB  AC  AD  BC  BD  CD
A  B  C  D
∅
• AFDMiner mines rules with high Confidence and low Specificity, which are apt for works like QPIAD, but SmartINT requires rules with high Specificity. So we change the direction of traversal so that we can use the monotonicity of Specificity to prune more nodes.
• Upper bound on Specificity: bottom-up traversal makes sense, stopping at the node that reaches the Specificity threshold
• Traversal direction through the lattice depends on the pruning techniques available

Lattice Traversal
Specificity follows monotonicity (same lattice as above).
• Lower bound on Specificity: top-down traversal makes sense
• Once a node reaches the Specificity threshold, all the nodes below it are pruned off
• Traversal direction through the lattice depends on the pruning techniques available

Pruning Strategies
1. Pruning off non-shared attributes
  – SmartINT is not interested in non-shared attributes in the determining set. It is only interested in rules with shared attributes in the determining set.
2. Pruning by Specificity
  – Specificity(Y) ≥ Specificity(X), where Y is a superset of X
  – If Specificity(X) < minSpecificity, we can prune all AFDs with X and its subsets as the determining set

EXPERIMENTAL EVALUATION

Experimental Hypothesis
In the context of Autonomous Web Databases, learning Approximate Functional Dependencies (AFDs) and using them in query answering results in better retrieval accuracy than the direct-join or single-table approaches.

Experimental Setup
• Performed experiments over Vehicle data crawled from Google Base
• 350,000 tuples
• Generated different partitions of the tables
• Posed queries on the data with varying projected attributes and varying constraints
• Implemented in Java
• Source code at the following location [in development]: http://24cross7.svnrepository.com/svn/sorcerer/trunk/code/smartintweb
• Data stored in a MySQL database

Evaluation Methodology
• We should have the ‘Oracular Truth’ to compare the approaches
• MASTER TABLE: the table containing all the tuples with the universal relation, which serves as oracular truth
• Split the MASTER TABLE into different partitions
• Issue queries over both the partitioned tables and the master table
  – Compare the results and measure precision

Correctness & Completeness
• Need two metrics analogous to Precision and Recall at the tuple level
• Let's consider a tuple from the master table (ground truth, 8 attributes) and the corresponding tuple from one of the approaches (6 attributes), whose values are RIGHT, WRONG, RIGHT, WRONG, RIGHT, WRONG
• Correctness of a tuple = fraction of the retrieved values that are correct; here it is 3/6
• Completeness of a tuple = fraction of the master tuple's values that are retrieved; here it is 6/8

Precision & Recall
• Result set from the master table (8 attributes) compared with the result set from one of the approaches (6 attributes)
• Precision = average correctness of the tuples
• Recall = cumulative completeness of the tuples returned

Varying No. of Projected Attributes
[Plots: Precision vs Attributes, Recall vs Attributes, F-measure vs Attributes; x-axis: 2, 4, 6 attributes; y-axis: 0 to 1]
Around 0.55 improvement in F-measure.

Varying No. of Constraints
[Plots: Precision vs Constraints, Recall vs Constraints, F-measure vs Constraints; x-axis: 2, 3, 4 constraints; y-axis: 0 to 1]

Other Experiments
• Comparison with multiple join paths
  [Bar chart: Precision, Recall and F-measure for Join: Model; Join: Year; Join: Model, Year; SmartInt]
  – SmartINT performed better than all possible joins
• Variable width expansion
  – The dip in F-measure can be used to stop the expansion

Learning Evaluation
• AFDMiner performs better than the TANE approach
• Both the execution time and the quality of the mined AFDs are better than with TANE
• Kalavagattu 2008, M.S. Thesis

DEMO [work in progress]
http://149.169.227.245:8080/smartintweb/

CONCLUSION & FUTURE WORK

Conclusion
• Autonomous Web Databases call for novel systems to counter the problems due to the uncertainty of the Web.
• SmartINT makes an effort to answer one such issue: missing PK-FK information
• The system gave a good improvement in terms of F-measure over approaches like Single Table and Direct Join.
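
The tuple-level metrics behind the F-measure figures cited above can be sketched as follows. Correctness and completeness follow the definitions on the evaluation slides. Recall is taken here as the mean completeness over the returned tuples, which is an assumption, since the slides say only "cumulative completeness"; the F-measure is the usual harmonic mean. The function names and the values in `RETRIEVED` are hypothetical and chosen only to reproduce the 3/6 and 6/8 example.

```python
# Ground-truth tuple with 8 attributes (values from the universal relation).
TRUTH = {
    "MID": "HACV08", "Make": "Honda", "Model": "Civic", "Price": 11500,
    "Engine": "F23A1", "Cylinders": 4, "Vehicle-type": "Midsize",
    "Dealer": "Frank",
}

# Hypothetical tuple returned by one approach: 6 attributes, 3 of them right.
RETRIEVED = {
    "Make": "Honda", "Model": "Accord", "Price": 11500,
    "Engine": "J27B1", "Cylinders": 4, "Vehicle-type": "Fullsize",
}


def correctness(retrieved, truth):
    """Fraction of the retrieved values that match the ground truth."""
    if not retrieved:
        return 0.0
    right = sum(1 for attr, val in retrieved.items() if truth.get(attr) == val)
    return right / len(retrieved)


def completeness(retrieved, truth):
    """Fraction of the ground-truth attributes that were retrieved at all."""
    if not truth:
        return 0.0
    return len(set(retrieved) & set(truth)) / len(truth)


def f_measure(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def evaluate(pairs):
    """pairs: list of (retrieved, truth) tuples.
    Precision = mean correctness; recall = mean completeness (assumed)."""
    if not pairs:
        return 0.0, 0.0, 0.0
    precision = sum(correctness(r, t) for r, t in pairs) / len(pairs)
    recall = sum(completeness(r, t) for r, t in pairs) / len(pairs)
    return precision, recall, f_measure(precision, recall)


if __name__ == "__main__":
    print(evaluate([(RETRIEVED, TRUTH)]))
```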
Traditional Database vs Autonomous Web

Aspect | Traditional Database | Autonomous Web | Work addressing it
Data | Complete | Incomplete | QPIAD (VLDB ‘07, VLDBJ ‘09)
Query | Certain | Imprecise | AIMQ (ICDE ‘06), QUIC (CIDR ‘07)
Normalization | Lossless | Ad hoc | SmartINT (Submitted to ICDE ‘09)
Results | Accurate | Probabilistic |

Future Work
• Back-door JOIN
  – Can SmartINT be used as a back-door approach to join tables?
  – SmartINT performs as well as other systems when the PK-FK relation is present
  – In the absence of such information, other systems fail whereas SmartINT gives good accuracy
• Vertical Aggregation
  – Taking into account the vertical overlap between the tables
  – In the absence of substantial overlap, the strength of the AFDs would not help to retrieve accurate results
• Discover Key Info
  – Using AFDMiner to discover key information

Future Work
• Top ‘KW’ search
  – Striking a balance between the number of tuples and the width of the tuple
  – The more you expand, the less precise the results are going to be
• Diverse results
  – Providing the user with a diverse set of results

Thank you…
• Prof. Subbarao Kambhampati
• Prof. Pat Langley
• Prof. Jieping Ye
• Special thanks to
  – Aravind Kalavagattu
  – Raju Balakrishnan

QUESTIONS

Individual Contribution
• Problem Identification and Formulation
  – Identifying the problem: Joint work
  – Using AFDs for Tuple Expansion: Gummadi
  – Source Selection: Khulbe
• System Development and Evaluation
  – Initial framework setup: Gummadi
  – Tuple Expansion, Experiments (multiple join paths, variable width expansion): Gummadi
  – Source Selection, Experiments (comparison with direct-join and single-table approaches): Khulbe
• Writing
  – Introduction, Related Work, System Description: Gummadi
  – Preliminaries, Source Selection: Khulbe
  – Experiments: Joint work
  – Learning: Aravind Kalavagattu