Issues in Bridging DB & IR (11/21)

Administrivia
• Homework 4 socket is open. *PLEASE* start working; there may not be an extra week before submission.
• Considering making Homework 4 subsume the second exam. Okay?
Topics coming up:
• DB/IR (1.5 classes); Collection Selection (0.5 classes)
• Social Network Analysis (1 class); Web services (1 class)
• Interactive review/summary (last class)

DB and IR: Two Parallel Universes

                         Database Systems                     Information Retrieval
canonical application:   accounting                           libraries
data type:               numbers, short strings               text
foundation:              algebraic / logic based              probabilistic / statistics based
search paradigm:         Boolean retrieval (exact queries,    ranked retrieval (vague queries,
                         result sets/bags)                    result lists)

Parallel universes forever? (CIDR 2005)

DB vs. IR
• DBs allow structured querying. Queries and results (tuples) are different kinds of objects. Soundness and completeness are expected. The user is expected to know what she is doing.
• IR only supports unstructured querying. Queries and results are both documents! High precision and recall are hoped for. The user is expected to be a dunderhead.

Top-down Motivation: Applications (1) - Customer Support
Typical data:
• Customers (CId, Name, Address, Area, Category, Priority, ...)
• Requests (RId, CId, Date, Product, ProblemType, Body, RPriority, WFId, ...)
• Answers (AId, RId, Date, Class, Body, WFId, WFStatus, ...)
Why customizable scoring?
• wealth of different apps within this app class
• different customer classes
• adjustment to evolving business needs
• scoring on text + structured data (weighted sums, language models, skyline, w/ correlations, etc.)
Typical queries (e.g., from a premium customer from Germany):
  "A notebook, model ... configured with ..., has a problem with the driver of its Wave-LAN card. I already tried the fix ..., but received error message ..."
  → request classification & routing; find similar requests
Platform desiderata (from the app developer's viewpoint):
• Flexible ranking and scoring on text, categorical, and numerical attributes
• Incorporation of dimension hierarchies for products, locations, etc.
• Efficient execution of complex queries over text and data attributes
• Support for high update rates concurrently with high query load

Top-down Motivation: Applications (2)
More application classes:
• Global health-care management for monitoring epidemics
• News archives for journalists, press agencies, etc.
• Product catalogs for houses, cars, vacation places, etc.
• Customer relationship management in banks, insurance, telecom, etc.
• Bulletin boards for social communities
• P2P personalized & collaborative Web search
• etc., etc.

Top-down Motivation: Applications (3)
Next wave, Text2Data: use information-extraction technology (regular expressions, HMMs, lexicons, other NLP and ML techniques) to convert text documents into relational facts, moving up in the value chain.
Example: "The CIDR'05 conference takes place in Asilomar from Jan 4 to Jan 7, and is organized by D.J. DeWitt, Mike Stonebreaker, ..."

  Conference:        Name   Year   Location   Date       Prob
                     CIDR   2005   Asilomar   05/01/04   0.95
  ConfOrganization:  Name   Year   Chair      Prob
                     CIDR   2005   P68        0.9
                     CIDR   2005   P35        0.75
  People:            Id     Name
                     P35    Michael Stonebraker
                     P68    David J. DeWitt

• facts now have confidence scores
• queries involve probabilistic inferences and result ranking
• relevant for "business intelligence"
(CIDR 2005)
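To make the Text2Data idea concrete, here is a minimal, hypothetical sketch of turning the example sentence into relational facts with confidence scores using hand-written regular expressions. The patterns, the confidence values, and the exact fact schema are illustrative assumptions, not part of any system cited above; a real extractor would use HMMs or other ML models to produce the probabilities.

```python
import re

# Toy Text2Data extractor: regular expressions turn a sentence into
# (relation, attributes, confidence) facts. The confidences are made-up
# constants standing in for what an HMM/ML extractor would produce.
SENTENCE = ("The CIDR'05 conference takes place in Asilomar from Jan 4 to Jan 7, "
            "and is organized by D.J. DeWitt, Mike Stonebreaker")

conference_pat = re.compile(
    r"The (?P<name>\w+)'(?P<year>\d{2}) conference takes place in (?P<loc>\w+)")
organizer_pat = re.compile(r"organized by (?P<chairs>.+)$")

facts = []
m = conference_pat.search(SENTENCE)
if m:
    facts.append(("Conference",
                  {"Name": m.group("name"),
                   "Year": 2000 + int(m.group("year")),
                   "Location": m.group("loc")},
                  0.95))                      # assumed extraction confidence
m = organizer_pat.search(SENTENCE)
if m:
    for chair in m.group("chairs").split(","):
        facts.append(("ConfOrganization", {"Chair": chair.strip()}, 0.8))

# Probabilistic facts can then be ranked by confidence when queried.
for rel, attrs, prob in sorted(facts, key=lambda f: -f[2]):
    print(rel, attrs, prob)
```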
Some specific problems
1. How to handle textual attributes in data processing (e.g., joins)?
2. How to support keyword-based querying over normalized relations?
3. How to handle imprecise queries? (Ullas Nambiar's work)
4. How to do query processing for top-k results? (Surajit et al.'s paper in CIDR 2005)

1. Handling text fields in data tuples
• Often you have database relations some of whose fields are "textual", e.g., a movie database which has, in addition to year, director, etc., a column called "Review" that is unstructured text.
• Normal DB operations ignore this unstructured stuff (you can't join over it).
• SQL sometimes supports a "Contains" constraint (e.g., give me movies that contain "Rotten" in the review).

STIR (Simple Text in Relations)
• The elements of a tuple are seen as documents (rather than atoms).
• The query language is the same as SQL except for a "similarity" predicate.

Soft joins: WHIRL [Cohen]
• We can extend the notion of joins to "similarity joins", where similarity is measured in terms of vector similarity over the text attributes.
• The join tuples are output in ranked form, with rank proportional to the similarity.
• Neat idea... but it does have some implementation difficulties: most tuples in the cross product will have non-zero similarities, so we need query processing that will somehow produce only the highly ranked tuples.
• WHIRL uses A* search to focus on the top-k answers. (See Surajit et al., CIDR 2005, who argue for a whole new query algebra to support top-k query processing.)

WHIRL queries
Assume two relations:
• review(movieTitle, reviewText): an archive of reviews, e.g.
  – The Hitchhiker's Guide to the Galaxy, 2005: "This is a faithful re-creation of the original radio series - not surprisingly, as Adams wrote the screenplay ..."
  – Men in Black, 1997: "Will Smith does an excellent job in this ..."
  – Space Balls, 1987: "Only a die-hard Mel Brooks fan could claim to enjoy ..."
• listing(theatre, movieTitle, showTimes, ...): what is now showing, e.g.
  – Star Wars Episode III, The Senator Theater, 1:00, 4:15, & 7:30pm
  – Cinderella Man, The Rotunda Cinema, 1:00, 4:30, & 7:30pm

• "Find reviews of sci-fi comedies" [movie domain]:
    FROM review as r SELECT * WHERE r.text ~ 'sci fi comedy'
  (like standard ranked retrieval of "sci-fi comedy")
• "Where is [that sci-fi comedy] playing?":
    FROM review as r, listing as s SELECT * WHERE r.title ~ s.title AND r.text ~ 'sci fi comedy'
  (best answers: titles that are similar to each other, e.g., "Hitchhiker's Guide to the Galaxy" and "The Hitchhiker's Guide to the Galaxy, 2005", and review text similar to "sci-fi comedy")

• Similarity is based on TF/IDF: rare words are most important.
• The search for high-ranking answers uses inverted indices:
  – It is easy to find the (few) items that match on "important" terms.
  – The search for strong matches can prune "unimportant" terms.
  – E.g., years are common in the review archive, so they have low weight; "hitchhiker" occurs only in movie00137, while "the" occurs in movie001, movie003, movie007, movie008, movie013, movie018, movie023, movie0031, ...

WHIRL results
• This sort of worked:
  – Interactive speeds (< 0.3 s/query) with a few hundred thousand tuples.
  – For 2-way joins, average precision (sort of like the area under the precision-recall curve) ranged from 85% to 100% on 13 problems in 6 domains.
  – Average precision was better than 90% on 5-way joins.
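The following is a small sketch of the WHIRL-style soft-join idea, not Cohen's actual implementation: score each pair of join-attribute values by TF/IDF cosine similarity and return the pairs in ranked order. A real system uses inverted indices and A*-style search to avoid materializing the cross product; this brute-force version only illustrates the semantics, and all relation contents are toy data.

```python
import math
import re
from collections import Counter

def tokenize(text):
    return re.findall(r"\w+", text.lower())

def tfidf_vectors(texts):
    """Build unit-length TF/IDF vectors for a small corpus of strings (toy version)."""
    docs = [Counter(tokenize(t)) for t in texts]
    n = len(docs)
    df = Counter(term for d in docs for term in d)
    vecs = []
    for d in docs:
        v = {t: (1 + math.log(c)) * math.log(n / df[t]) for t, c in d.items()}
        norm = math.sqrt(sum(w * w for w in v.values())) or 1.0
        vecs.append({t: w / norm for t, w in v.items()})
    return vecs

def cosine(u, v):
    return sum(w * v.get(t, 0.0) for t, w in u.items())

review = [("The Hitchhiker's Guide to the Galaxy, 2005", "faithful re-creation ..."),
          ("Men in Black, 1997", "Will Smith does an excellent job ..."),
          ("Space Balls, 1987", "die-hard Mel Brooks fans only ...")]
listing = [("Star Wars Episode III", "The Senator Theater"),
           ("Hitchhiker's Guide to the Galaxy", "The Rotunda Cinema")]

# Soft join on the title attributes: rank every (review, listing) pair by title similarity.
titles = [r[0] for r in review] + [l[0] for l in listing]
vecs = tfidf_vectors(titles)
rv, lv = vecs[:len(review)], vecs[len(review):]

ranked = sorted(((cosine(rv[i], lv[j]), review[i][0], listing[j][0])
                 for i in range(len(review)) for j in range(len(listing))),
                reverse=True)
for score, rtitle, ltitle in ranked:
    print(f"{score:.2f}  {rtitle}  ~  {ltitle}")
```

Running this prints the Hitchhiker's pair at the top with a non-zero score and the unrelated pairs near zero, which is exactly the "ranked join tuples" behavior described above.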
WHIRL and soft integration
• WHIRL worked for a number of web-based demo applications, e.g., integrating data from 30-50 smallish web DBs with < 1 FTE of labor.
• WHIRL could link many data types reasonably well, without engineering.
• WHIRL generated numerous papers (SIGMOD98, KDD98, Agents99, AAAI99, TOIS2000, AIJ2000, ICML2000, JAIR2001).
• WHIRL was relational (but see ELIXIR, SIGIR2001).
Limitations:
• WHIRL users need to know the schema of the source DBs.
• WHIRL's query-time linkage worked only for TF/IDF, token-based distance metrics, i.e., text fields with few misspellings.
• WHIRL was memory-based: all data must be centrally stored (no federated data), so small datasets only.

The WHIRL vision was very radical: everything was inter-dependent.
Query Q:  SELECT R.a, S.a, S.b, T.b FROM R, S, T WHERE R.a ~ S.a AND S.b ~ T.b   (~ means TF/IDF-similar)
• Link items as needed by Q.
• Incrementally produce a ranked list of possible links, with "best matches" first.
• The user (or a downstream process) decides how much of the list to generate and examine.
Example of soft matches:

  R.a       S.a      S.b      T.b
  Anhai     Anhai    Doan     Doan
  Dan       Dan      Weld     Weld
  William   Will     Cohen    Cohn
  Steve     Steven   Minton   Mitton
  William   David    Cohen    Cohn

String Similarity Metrics
• TF/IDF measures are not really very good at handling similarity between short textual attributes (e.g., titles); string similarity metrics are more suitable.
• String similarity can be handled in terms of:
  – Edit distance: the number of primitive operations (such as "backspace" and "overtype") needed to convert one string into another.
  – N-gram distance (see below).

N-gram distance
• An n-gram of a string is a contiguous n-character subsequence of the string ("space" can be treated as a special character). The 3-grams of the string "hitchhiker" are {hit, itc, tch, chh, hhi, hik, ike, ker}.
• A string S can be represented as the set of its n-grams; similarity between two strings can then be defined in terms of the similarity between these sets, e.g., using Jaccard similarity.
• N-grams are to strings what k-shingles are to documents:
  – Document duplicate detection is often done in terms of the set similarity between the documents' shingles.
  – Each shingle is hashed to a hash signature, and a Jaccard similarity is computed between the documents' shingle sets.
  – Useful for plagiarism detection (the Turnitin software does this).
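A small sketch of the n-gram string similarity just described: represent each string as its set of 3-grams (with a padding character standing in for the "space as a special character" trick) and compare the sets with Jaccard similarity. The padding convention and the example strings are assumptions for illustration; edit distance would be an obvious alternative.

```python
def ngrams(s, n=3, pad="#"):
    """Return the set of n-grams of s; '#' pads the ends so that boundary
    characters also appear in full n-grams (set pad='' to disable)."""
    s = pad * (n - 1) + s.lower() + pad * (n - 1)
    return {s[i:i + n] for i in range(len(s) - n + 1)}

def jaccard(a, b):
    return len(a & b) / len(a | b) if a | b else 1.0

def ngram_sim(s1, s2, n=3):
    return jaccard(ngrams(s1, n), ngrams(s2, n))

if __name__ == "__main__":
    # The 8 plain 3-grams of "hitchhiker" from the slide.
    print(sorted(ngrams("hitchhiker", pad="")))
    # A near-misspelling still scores high; unrelated strings score low.
    print(round(ngram_sim("Stonebraker", "Stonebreaker"), 3))
    print(round(ngram_sim("Stonebraker", "DeWitt"), 3))
```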
2. Supporting keyword search on databases
How do we answer a query like "Soumen Sunita"? Issues:
• The schema is normalized (not everything is in one table).
• How do we rank multiple tuples that contain the keywords?

Motivation
• Keyword search of documents on the Web has been enormously successful: simple and intuitive, no need to learn any query language.
• Database querying using keywords is desirable:
  – SQL is not appropriate for casual users.
  – Form interfaces are cumbersome: they require a separate form for each type of query (confusing for casual users of Web information systems) and are not suitable for ad hoc queries.
• Many Web documents are dynamically generated from databases (e.g., catalog data). Keyword querying of the generated Web documents may miss answers that need to combine information on different pages, and suffers from duplication overheads.

Examples of keyword queries
• On a railway reservation database: "mumbai bangalore"
• On a university database: "database course"
• On an e-store database: "camcorder panasonic"
• On a book store database: "sudarshan databases"

Differences from IR/Web search
• Related data is split across multiple tuples due to normalization, e.g.:
    Paper(paper-id, title, journal), Author(author-id, name), Writes(author-id, paper-id, position), Cites(citing-paper-id, cited-paper-id)
• Different keywords may match tuples from different relations; what joins are to be computed can only be decided on the fly.

Connectivity
• Tuples may be connected by foreign key and object references, inclusion dependencies and join conditions, implicit links (shared words), etc.
• We would like to find sets of (closely) connected tuples that match all the given keywords.

Basic model (BANKS: keyword search over databases)
• The database is modeled as a graph: nodes = tuples; edges = references between tuples (foreign keys, inclusion dependencies, ...). Edges are directed.
• Example: the paper tuple "MultiQuery Optimization" has writes edges to its author tuples "S. Sudarshan" and "Prasan Roy".
• Answer example for the query "sudarshan roy": the tree rooted at the paper "MultiQuery Optimization", connected through writes edges to the author tuples "S. Sudarshan" and "Prasan Roy".

Combining keyword search and browsing
• In catalog-searching applications, keywords may restrict the answers to a small set, and the user then needs to browse the answers.
• If there are multiple answers, hierarchical browsing over the answers is required.

What BANKS does
• The whole DB is seen as a directed graph (edges correspond to foreign keys); answers are subgraphs, ranked by edge weights.
• Solutions are rooted weighted trees:
  – Nodes are tuples from the tables. Node weight can be defined in terms of "PageRank"-style notions (e.g., back-degree).
  – Edges are foreign-primary key references between tuples across tables. Links are given domain-specific weights; BANKS uses log(1 + x), where x is the number of back links. A Paper→Writes link is seen as stronger than a Paper→Cites link.
  – The tuples in the tree must cover the keywords.
  – The relevance of a tree is based on its weight, which is a combination of its node and link weights.
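Below is a highly simplified sketch of the BANKS idea on the toy example above: model tuples as nodes in a graph, find the nodes that match each keyword, and report a rooted tree (here just a root whose immediate neighborhood covers all keywords). It ignores BANKS' actual backward-expanding search, edge directions, and log-based edge weights; the data and the "smaller tree ranks higher" scoring are illustrative assumptions only.

```python
# Tuples as graph nodes (id -> text); foreign-key links as undirected edges.
nodes = {
    "paper1":  "MultiQuery Optimization",
    "author1": "S. Sudarshan",
    "author2": "Prasan Roy",
    "paper2":  "Keyword Search in Databases",
}
edges = {("paper1", "author1"), ("paper1", "author2"), ("paper2", "author1")}
adj = {}
for a, b in edges:
    adj.setdefault(a, set()).add(b)
    adj.setdefault(b, set()).add(a)

def matches(keyword):
    """Nodes whose tuple text contains the keyword."""
    return {n for n, text in nodes.items() if keyword.lower() in text.lower()}

def answer_trees(keywords):
    """Tiny stand-in for BANKS: try each node as a root and check whether every
    keyword is matched by the root or one of its direct neighbors."""
    keyword_sets = [matches(k) for k in keywords]
    answers = []
    for root in nodes:
        reachable = {root} | adj.get(root, set())
        if all(ks & reachable for ks in keyword_sets):
            leaves = set()
            for ks in keyword_sets:
                leaves |= (ks & reachable)
            # score: trees that use fewer edges rank higher
            answers.append((len(leaves - {root}), root, sorted(leaves)))
    return sorted(answers)

for size, root, leaves in answer_trees(["sudarshan", "roy"]):
    print(f"root={nodes[root]!r:30} covers={[nodes[l] for l in leaves]}")
```

On this data the only answer is the tree rooted at "MultiQuery Optimization" covering both author tuples, mirroring the answer example on the slide.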
11/23: Imprecise Queries; Collection Selection

Part III: Answering Imprecise Queries [ICDE 2006; WebDB 2004; WWW 2004]

Why imprecise queries?
• The user really wants a 'sedan' priced around $7000. A feasible precise query is: Make = "Toyota", Model = "Camry", Price ≤ $7000.
• But what about the price of a Honda Accord? Is there a Camry for $7100?
• Results of the precise query: Toyota Camry, $7000, 1999; Toyota Camry, $7000, 2001; Toyota Camry, $6700, 2000; Toyota Camry, $6500, 1998; ...
• Solution: support imprecise queries.

Dichotomy in query processing
• Databases: the user knows what she wants; the user query completely expresses the need; answers exactly match the query constraints.
• IR systems: the user has an idea of what she wants; the user query captures the need to some degree; answers are ranked by degree of relevance.
• Imprecise queries on databases cross this divide.

Existing approaches
• Similarity search over a vector space (WHIRL, W. Cohen, 1998): data must be stored as vectors of text.
• Enhanced database model (VAGUE, A. Motro, 1998): add a 'similar-to' operator to SQL; distances are provided by an expert/system designer.
• Support for similarity search and query refinement over abstract data types (Binderberger et al., 2003).
• User guidance (Proximity Search, Goldman et al., 1998): users provide information about the objects required and their possible neighborhood.
Limitations: (1) the user/expert must provide similarity measures; (2) new operators are needed to use the distance measures; (3) not applicable over autonomous databases.
Our objectives: (1) minimal user input; (2) database internals not affected; (3) domain-independent and applicable to Web databases.

Imprecise queries vs. empty queries
• The "empty query" problem arises when the user's query, when submitted to the database, leads to an empty set of answers. We want to develop methods that can automatically, minimally relax such an empty query and resubmit it so that the user gets some results.
• Existing approaches to the empty query problem are mostly syntactic: they rely on relaxing various query constraints, and little attention is paid to the best order in which to relax the constraints.
• The imprecise query problem is a general case of the empty query problem:
  – We may have a non-empty set of answers to the base query.
  – We are interested not just in giving some tuples but in giving them in the order of relevance.

General ideas for supporting imprecise queries
The main issues are:
1. How to rewrite the base query so that more relevant tuples can be retrieved.
2. How to rank the retrieved tuples in the order of relevance.
A spectrum of approaches is possible, including:
1. Data-dependent approaches
2. User-dependent approaches
3. Collaborative approaches
We will look at an approach that is basically data-dependent.

AFD-based query relaxation: the pipeline
Given an imprecise query Q over relation R:
1. Map: convert "like" to "=", giving the precise query Qpr = Map(Q).
2. Derive the base set Abs = Qpr(R).
3. Use the base set as a set of relaxable selection queries.
4. Using AFDs, find the relaxation order.
5. Derive the extended set by executing the relaxed queries.
6. Use concept similarity to measure tuple similarities.
7. Prune tuples below a threshold.
8. Return the ranked set.

An example
• Relation: CarDB(Make, Model, Price, Year)
• Imprecise query Q: CarDB(Model like "Camry", Price like "10k")
• Base query Qpr: CarDB(Model = "Camry", Price = "10k")
• Base set Abs:
  – Make = "Toyota", Model = "Camry", Price = "10k", Year = "2000"
  – Make = "Toyota", Model = "Camry", Price = "10k", Year = "2001"

Obtaining the extended set
• Problem: given the base set, find tuples from the database similar to the tuples in the base set.
• Solution:
  – Consider each tuple in the base set as a selection query, e.g. Make = "Toyota", Model = "Camry", Price = "10k", Year = "2000".
  – Relax each such query to obtain "similar" precise queries, e.g. Make = "Toyota", Model = "Camry", Price = "", Year = "2000".
  – Execute the relaxed queries and keep the tuples having similarity above some threshold.
• Challenge: which attribute should be relaxed first? Make? Model? Price? Year?
• Solution: relax the least important attribute first.
• Least important attribute: an attribute whose binding value, when changed, has minimal effect on the values binding the other attributes.
  – It does not decide the values of other attributes; its own value may depend on other attributes.
  – E.g., changing/relaxing Price will usually not affect other attributes, but changing Model usually affects Price.
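A minimal sketch of the relaxation loop just described: treat each base-set tuple as a selection query and drop bindings one at a time, least important attribute first. The relaxation order and the toy CarDB rows are hard-coded assumptions here; in the AIMQ-style approach the order comes from mined AFDs, as described on the next slides.

```python
# Toy CarDB relation: each tuple is a dict of attribute bindings.
CARDB = [
    {"Make": "Toyota", "Model": "Camry",   "Price": "10k", "Year": "2000"},
    {"Make": "Toyota", "Model": "Camry",   "Price": "9k",  "Year": "1999"},
    {"Make": "Toyota", "Model": "Corolla", "Price": "10k", "Year": "2001"},
    {"Make": "Honda",  "Model": "Accord",  "Price": "10k", "Year": "2000"},
]

# Assumed relaxation order: least important attribute first.
RELAX_ORDER = ["Price", "Model", "Year", "Make"]

def select(query):
    """Evaluate a precise selection query (a dict of required bindings)."""
    return [t for t in CARDB if all(t[a] == v for a, v in query.items())]

def relaxations(base_tuple):
    """Yield progressively relaxed queries derived from one base-set tuple."""
    query = dict(base_tuple)
    yield dict(query)                      # the un-relaxed query first
    for attr in RELAX_ORDER:
        query.pop(attr, None)              # drop the least important binding next
        yield dict(query)

base = {"Make": "Toyota", "Model": "Camry", "Price": "10k", "Year": "2000"}
seen = set()
for q in relaxations(base):
    for t in select(q):
        key = tuple(sorted(t.items()))
        if key not in seen:                # report each tuple at its tightest match
            seen.add(key)
            print(f"relaxed {sorted(set(base) - set(q))} -> {t}")
```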
• The dependence between attributes is useful for deciding their relative importance. It is captured with Approximate Functional Dependencies (AFDs) and approximate keys: "approximate" in the sense that they are obeyed by a large percentage (but not all) of the tuples in the database. They can be mined with TANE, an algorithm by Huhtala et al. [1999].

Attribute ordering
Given a relation R:
• Determine the AFDs and approximate keys.
• Pick the key with the highest support, say Kbest.
• Partition the attributes of R into key attributes (belonging to Kbest) and non-key attributes (not belonging to Kbest).
• Sort the subsets using influence weights:

  $\mathit{InfluenceWeight}(A_i) = \dfrac{\sum_{j}\bigl(1 - \mathit{error}(A' \rightarrow A_j)\bigr)}{|A'|}$, where $A_i \in A' \subseteq R$, $j \neq i$ and $j = 1 \ldots |\mathit{Attributes}(R)|$

• The attribute relaxation order is: all non-key attributes first, then the key attributes.
• Example: CarDB(Make, Model, Year, Price); key attributes: Make, Year; non-key attributes: Model, Price; order: Price, Model, Year, Make.
• Multi-attribute relaxation uses an independence assumption:
  – 1-attribute relaxations: {Price, Model, Year, Make}
  – 2-attribute relaxations: {(Price, Model), (Price, Year), (Price, Make), ...}

Tuple similarity
• Tuples obtained after relaxation are ranked according to their similarity to the corresponding tuples in the base set:

  $\mathit{Similarity}(t_1, t_2) = \sum_i \mathit{AttrSimilarity}\bigl(\mathit{value}(t_1[A_i]), \mathit{value}(t_2[A_i])\bigr) \cdot W_i$

  where the $W_i$ are normalized influence weights, $\sum W_i = 1$, $i = 1 \ldots |\mathit{Attributes}(R)|$.
• Value similarity:
  – Euclidean for numerical attributes, e.g., Price, Year.
  – Concept similarity for categorical attributes, e.g., Make, Model.

Concept (value) similarity
• A concept is any distinct attribute-value pair, e.g., Make=Toyota.
• It can be visualized as a selection query binding a single attribute, and is represented as a supertuple, e.g., ST(Q_Make=Toyota).
• Supertuple for the concept Make=Toyota:

  Model:  Camry: 3, Corolla: 4, ...
  Year:   2000: 6, 1999: 5, 2001: 2, ...
  Price:  5995: 4, 6500: 3, 4000: 6, ...
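As a concrete illustration, here is a small sketch that builds the supertuple of a concept (an attribute=value pair) from a toy car table and compares two concepts by Jaccard similarity over the values they co-occur with. This is a simplified stand-in for the correlation-based commonality measure defined on the next slide; the data, and the choice to ignore value counts in the comparison, are assumptions.

```python
from collections import Counter

ROWS = [
    {"Make": "Toyota", "Model": "Camry",   "Year": 2000, "Price": 5995},
    {"Make": "Toyota", "Model": "Camry",   "Year": 1999, "Price": 6500},
    {"Make": "Toyota", "Model": "Corolla", "Year": 2001, "Price": 4000},
    {"Make": "Honda",  "Model": "Accord",  "Year": 2000, "Price": 6200},
    {"Make": "Honda",  "Model": "Civic",   "Year": 2001, "Price": 4000},
]

def supertuple(attr, value):
    """Supertuple of the concept attr=value: for every *other* attribute,
    the bag of values that co-occur with the concept."""
    st = {}
    for row in ROWS:
        if row[attr] != value:
            continue
        for a, v in row.items():
            if a != attr:
                st.setdefault(a, Counter())[v] += 1
    return st

def jaccard(a, b):
    return len(a & b) / len(a | b) if a | b else 0.0

def concept_similarity(attr, v1, v2):
    """Average Jaccard similarity of the two supertuples, attribute by attribute
    (sets of co-occurring values; counts are ignored in this sketch)."""
    s1, s2 = supertuple(attr, v1), supertuple(attr, v2)
    attrs = set(s1) | set(s2)
    if not attrs:
        return 0.0
    return sum(jaccard(set(s1.get(a, {})), set(s2.get(a, {}))) for a in attrs) / len(attrs)

print(supertuple("Make", "Toyota"))
print("Sim(Toyota, Honda) =", round(concept_similarity("Make", "Toyota", "Honda"), 3))
```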
• Concept similarity is estimated as the percentage of correlated values common to two given concepts:

  $\mathit{Similarity}(v_1, v_2) = \mathit{Commonality}\bigl(\mathit{Correlated}(v_1, \mathit{values}(A_i)),\ \mathit{Correlated}(v_2, \mathit{values}(A_i))\bigr)$, where $v_1, v_2 \in A_j$, $i \neq j$, and $A_i, A_j \in R$.

• It is measured as the Jaccard similarity among the supertuples representing the concepts:

  $\mathit{JaccardSim}(A, B) = \dfrac{|A \cap B|}{|A \cup B|}$

Concept (value) similarity graph
[Figure: a graph over car makes (Dodge, Nissan, Honda, BMW, Ford, Chevrolet, Toyota) whose edges are labeled with estimated concept similarities in the 0.11-0.25 range.]

Empirical evaluation
Goal:
• Evaluate the effectiveness of the query relaxation and of the concept learning.
Setup:
• A database of used cars, CarDB(Make, Model, Year, Price, Mileage, Location, Color), populated with 30k tuples from Yahoo Autos.
• Concept similarity estimated for Make, Model, Location, Color.
• Two query relaxation algorithms:
  – RandomRelax: randomly picks the attribute to relax.
  – GuidedRelax: uses the relaxation order determined using approximate keys and AFDs.

Evaluating the effectiveness of relaxation
Test scenario:
• 10 randomly selected base queries from CarDB.
• Relevant answers: the 20 tuples showing similarity > Є, with 0.5 < Є < 1.
• Tuple similarity is a weighted summation of attribute similarities: Euclidean distance for Year, Price, Mileage; concept similarity for Make, Model, Location, Color.
• Limit of 64 relaxed queries per base query (128 possible at most, with 7 attributes).
• Efficiency is measured using the metric

  $\mathit{Work/RelevantTuple} = \dfrac{|\mathit{ExtractedTuples}|}{|\mathit{RelevantExtracted}|}$

Efficiency of relaxation
[Charts: Work/RelevantTuple over the 10 test queries for Є = 0.5, 0.6, 0.7, under RandomRelax and GuidedRelax.]
• RandomRelax: on average 8 tuples extracted per relevant tuple for Є = 0.5, increasing to 120 tuples for Є = 0.7; not resilient to changes in Є.
• GuidedRelax: on average 4 tuples extracted per relevant tuple for Є = 0.5, going up to 12 tuples for Є = 0.7; resilient to changes in Є.

Summary
An approach for answering imprecise queries over Web databases:
• Mine and use AFDs to determine attribute importance.
• A domain-independent concept similarity estimation technique.
• Tuple similarity score as a weighted sum of attribute similarity scores.
Empirical evaluation shows:
• Reasonable concept similarity models are estimated.
• Sets of similar precise queries are efficiently identified.
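To tie the pieces together, here is a small sketch of the tuple-ranking step summarized above: the similarity between a relaxed-query answer and a base-set tuple is a weighted sum of per-attribute similarities, with a Euclidean-style measure for numeric attributes and concept similarity for categorical ones. The weights, the numeric ranges, and the categorical similarity values are illustrative assumptions; in the actual approach they come from AFD mining and supertuple comparison.

```python
# Assumed normalized importance weights (would come from AFD mining); they sum to 1.
WEIGHTS = {"Make": 0.35, "Model": 0.30, "Price": 0.20, "Year": 0.15}

# Assumed categorical concept similarities (would come from supertuple Jaccard).
CONCEPT_SIM = {("Toyota", "Honda"): 0.25, ("Camry", "Accord"): 0.30}

NUMERIC = {"Price", "Year"}
RANGES = {"Price": 20000.0, "Year": 20.0}   # assumed ranges to normalize numeric distance

def attr_similarity(attr, v1, v2):
    if v1 == v2:
        return 1.0
    if attr in NUMERIC:
        # Euclidean-style similarity scaled by an assumed attribute range.
        return max(0.0, 1.0 - abs(v1 - v2) / RANGES[attr])
    return CONCEPT_SIM.get((v1, v2), CONCEPT_SIM.get((v2, v1), 0.0))

def tuple_similarity(t1, t2):
    return sum(WEIGHTS[a] * attr_similarity(a, t1[a], t2[a]) for a in WEIGHTS)

base = {"Make": "Toyota", "Model": "Camry", "Price": 10000, "Year": 2000}
candidates = [
    {"Make": "Toyota", "Model": "Camry",  "Price": 9500,  "Year": 1999},
    {"Make": "Honda",  "Model": "Accord", "Price": 10000, "Year": 2000},
]
for cand in sorted(candidates, key=lambda c: -tuple_similarity(base, c)):
    print(round(tuple_similarity(base, cand), 3), cand)
```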
Collection Selection / Meta Search

Introduction
• Metasearch engine: a system that provides unified access to multiple existing search engines.
• Metasearch engine components:
  – Database selector: identifies potentially useful databases (collections) for each user query.
  – Document selector: identifies potentially useful documents returned from the selected databases.
  – Query dispatcher and result merger: ranks the selected documents.
• Pipeline: Collection Selection → Query Execution (over, e.g., WSJ, WP, FT, CNN, NYT) → Results Merging.

Evaluating collection selection
• Let c1..cj be the collections chosen to be accessed for the query Q, and let d1..dk be the top documents returned from these collections.
• We compare these results to the results that would have been returned from a central union database.
  – Ground truth: the ranking of documents that the retrieval technique (say vector space or Jaccard similarity) would have produced over a central union database that is the union of all the collections.
• Compare the precision of the documents returned by accessing only the selected collections against this ground truth.

General scheme & challenges
• Get a representative of each database: a representative is a sample of files from the database.
  – Challenge: get an unbiased sample when you can only access the database through queries.
• Compare the query to the representatives to judge the relevance of a database.
  – Coarse approach: convert the representative files into a single file (a super-document). Take the (vector) similarity between the query and the super-document of a database to judge that database's relevance.
  – Finer approach: keep each representative as a mini-database. Union the mini-databases to get a central mini-database. Apply the query to the central mini-database and find the top-k answers from it. Decide the relevance of each database based on which of those answers came from which database's representative.
• You can also use an estimate of the size of the database.
• What about overlap between collections? (See the ROSCO paper.)

Uniform probing for content summary construction
• Automatic extraction of document frequency statistics from uncooperative databases [Callan and Connell, TOIS 2001; Callan et al., SIGMOD 1999].
• Main ideas:
  – Pick a word and send it as a query to database D:
    • RandomSampling-OtherResource (RS-Ord): pick the word from a dictionary.
    • RandomSampling-LearnedResource (RS-Lrd): pick the word from the retrieved documents.
  – Retrieve the top-k documents returned.
  – If the number of retrieved documents exceeds a threshold T, stop; otherwise restart at the beginning. (k = 4, T = 300.)
  – Compute the sample document frequency for each word that appeared in a retrieved document.

CORI Net approach (representative as a super-document)
• Representative statistics: the document frequency of each term in each database, and the database frequency of each term.
• Main ideas:
  – Visualize the representative of a database as a super-document, and the set of all representatives as a database of super-documents.
  – Document frequency becomes term frequency in the super-document, and database frequency becomes document frequency in the super-database.
  – Ranking scores can then be computed using a similarity function such as the cosine function.

ReDDE approach (representative as a mini-collection)
• Use the representatives as mini-collections.
• Construct a union representative that is the union of the mini-collections (such that each document keeps information about which collection it was sampled from).
• Send the query to the union collection first and get the top-k ranked results.
  – See which of the top-k results came from which mini-collection: the collections are ranked in terms of how much their mini-collections contributed to the top-k answers of the query.
  – Scale the number of returned results by the expected size of the actual collection.
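A minimal sketch of the ReDDE-style "representative as a mini-collection" idea described above: pool the sampled documents into a union mini-collection, rank them centrally for the query, and credit each collection by how many of the top-k pooled results came from its sample, scaled by an estimate of the collection's true size. The toy samples, the word-overlap ranking function, and the size estimates are all assumptions.

```python
# Sampled "representative" documents per collection (document text only).
SAMPLES = {
    "WSJ": ["bank mergers hit record", "stock markets rally", "fed raises rates"],
    "CNN": ["election coverage continues", "bank mergers face scrutiny"],
    "NYT": ["new restaurant reviews", "city council votes on housing"],
}
# Assumed estimates of the actual collection sizes.
EST_SIZE = {"WSJ": 100000, "CNN": 80000, "NYT": 120000}

def score(query, doc):
    """Toy central ranking function: word overlap between the query and a document."""
    q, d = set(query.lower().split()), set(doc.lower().split())
    return len(q & d) / len(q | d)

def redde_rank(query, k=3):
    # Build the central union of mini-collections, remembering each document's source.
    pooled = [(score(query, doc), coll) for coll, docs in SAMPLES.items() for doc in docs]
    topk = sorted(pooled, reverse=True)[:k]
    # Credit collections by their contribution, scaled by estimated size / sample size.
    credit = {}
    for s, coll in topk:
        if s > 0:
            credit[coll] = credit.get(coll, 0.0) + EST_SIZE[coll] / len(SAMPLES[coll])
    return sorted(credit.items(), key=lambda kv: -kv[1])

print(redde_rank("bank mergers"))
```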
(We didn't cover the material beyond this point.)

Selecting among overlapping collections [Thomas Hernandez, MS Thesis Defense, Arizona State University, 10/21/2004]
• There is overlap between collections, e.g., in a news metasearcher or a bibliography search engine.
• Objectives: retrieve a variety of results; avoid collections with irrelevant or redundant results.
• Existing work (e.g., CORI) assumes the collections are disjoint!
• ROSCO approach: collection selection that accounts for coverage and overlap, in the same pipeline (Collection Selection → Query Execution over WSJ, WP, FT, CNN, NYT → Results Merging), e.g., for the query "bank mergers".

Challenge: defining & computing overlap
• Collection overlap may be non-symmetric, or "directional". [Figure A: collections C1 and C2 with partially overlapping ranked result lists.]
• Document overlap may be non-transitive. [Figure B: collections C1, C2, and C3 and their ranked result lists.]

Gathering overlap statistics
• Approximate the overlap between 3+ collections using only pairwise overlaps.
• Consider the query result set of a particular collection as a single bag of words, and approximate overlap as the intersection between the result-set bags.

Controlling statistics
• Objectives: limit the number of statistics stored; improve the chances of having statistics for new queries.
• Solution: identify frequent item sets among the queries (Apriori algorithm) and store statistics only with respect to these frequent item sets.

The online component
• Purpose: determine the collection order for a user query.
• Offline component: gather coverage and overlap information for past queries; identify frequent item sets among the queries; compute statistics for the frequent item sets.
• Online component: (1) map the user query to the stored item sets; (2) compute statistics for the query using the mapped item sets; (3) determine the collection order for the query.

Training our system
• Training set: 90% of the query list.
• Gathering statistics for the training queries: probing of the 15 collections.
• Identifying frequent item sets: support threshold used: 0.05% (i.e., 9 queries); 681 frequent item sets found.
• Computing statistics for the item sets: the statistics fit in a 1.28 MB file. Sample entry:
    network,neural 22 MIX15 0.11855 CI,SC 747 AG 0.07742 AD 0.01893 SC,MIX15 801.13636 ...
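The sketch below illustrates, under stated assumptions, the offline/online split just described: mine frequent keyword item sets from past queries (a toy one-pass counter standing in for Apriori), attach coverage statistics to those item sets, and at query time map a new query to its stored item sets to pick a collection order. The query log, the coverage numbers, and the support threshold are all made up, and overlap statistics are omitted for brevity.

```python
from collections import Counter
from itertools import combinations

# Offline: past query log (each query is a set of keywords) plus, per query,
# the observed coverage of each collection (fractions are made up).
QUERY_LOG = [
    ({"neural", "network"}, {"SC": 0.6, "AG": 0.2, "AD": 0.1}),
    ({"neural", "network", "learning"}, {"SC": 0.5, "AG": 0.3, "AD": 0.1}),
    ({"bank", "mergers"}, {"FT": 0.7, "WSJ": 0.5}),
]
SUPPORT = 2   # assumed minimum support (number of queries) for an item set

def frequent_itemsets(log, support):
    """Toy stand-in for Apriori: count all 1- and 2-keyword subsets."""
    counts = Counter()
    for keywords, _ in log:
        for size in (1, 2):
            for itemset in combinations(sorted(keywords), size):
                counts[itemset] += 1
    return {s for s, c in counts.items() if c >= support}

def itemset_statistics(log, itemsets):
    """Average per-collection coverage over the queries containing each item set."""
    sums = {s: Counter() for s in itemsets}
    hits = Counter()
    for keywords, coverage in log:
        for s in itemsets:
            if set(s) <= keywords:
                hits[s] += 1
                for coll, cov in coverage.items():
                    sums[s][coll] += cov
    return {s: {c: v / hits[s] for c, v in sums[s].items()} for s in itemsets if hits[s]}

# Online: map a new query to its stored item sets and rank the collections.
def collection_order(query, stats):
    merged = Counter()
    for s, coverage in stats.items():
        if set(s) <= query:
            merged.update(coverage)
    return [c for c, _ in merged.most_common()]

stats = itemset_statistics(QUERY_LOG, frequent_itemsets(QUERY_LOG, SUPPORT))
print(collection_order({"neural", "network", "models"}, stats))
```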
Performance evaluation
• Measuring the number of new and duplicate results:
  – A duplicate result has cosine similarity > 0.95 with at least one already-retrieved result.
  – A new result is one that has no duplicate.
• Oracular approach: knows which collection has the most new results, and so retrieves a large portion of the new results early.

Evaluation of collection selection
[Evaluation charts not reproduced.]
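A small sketch of the evaluation rule above: a retrieved result counts as a duplicate if its cosine similarity with some already-kept result exceeds 0.95, and otherwise as new. The bag-of-words vectors and the toy result strings are assumptions made for illustration.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two documents as bag-of-words vectors."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[t] * vb[t] for t in va)
    na = math.sqrt(sum(v * v for v in va.values()))
    nb = math.sqrt(sum(v * v for v in vb.values()))
    return dot / (na * nb) if na and nb else 0.0

def count_new_and_duplicates(results, threshold=0.95):
    kept, duplicates = [], 0
    for doc in results:
        if any(cosine(doc, seen) > threshold for seen in kept):
            duplicates += 1
        else:
            kept.append(doc)          # a "new" result
    return len(kept), duplicates

retrieved = [
    "bank mergers hit a new record this quarter",
    "bank mergers hit a new record this quarter",   # exact duplicate
    "regulators review the latest bank mergers",
]
print(count_new_and_duplicates(retrieved))   # -> (2, 1)
```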