Query Processing and Networking Infrastructures
Day 1 of 2
Joe Hellerstein, UC Berkeley
September 20, 2002

Two Goals
- Day 1: primer on query processing
  - Targeted to networking/OS folk
  - Bias: systems issues
- Day 2: seed some cross-fertilized research
  - Especially with networking
  - Thesis: dataflow convergence of query processing and routing
  - Clearly other resonances here: dataflow HW architectures, event-based systems designs, ML and control theory, online algorithms

(Sub)Space of Possible Topics
- Traditional relational QP: optimization & execution
- Distributed & federated QP
- Parallel QP
- Adaptive QP
- Indexing
- Data reduction, compression
- Boolean text search
- Visual querying & data visualization
- Traditional text ranking
- Transactional storage & networking
- Data model & query language design
- Active DBs (trigger systems)
- NFNF data models (OO, XML, "semistructured")
- Online and approximate QP
- Media queries, feature extraction & similarity search
- Hypertext ranking
- Data streams & continuous queries
- Statistical data analysis ("mining")

Likely Topics Here
- A subset of the list above

Plus Some Speculative Ones
- The likely topics, plus: content routing, indirection architectures, peer-to-peer QP, sensornet QP, network monitoring

Outline
- Day 1: query processing crash course
  - Intro: queries as indirection
  - How do relational databases run queries?
  - How do search engines run queries?
  - Scaling up: cluster parallelism and distribution
- Day 2: research synergies w/networking
  - Queries as indirection, revisited
  - Useful (?) analogies to networking research
  - Some of our recent research at the seams
  - Some of your research?
  - Directions and collective discussion

Getting Off on the Right Foot
- Roots: database and IR research
  - "Top-down" traditions ("applications"): usually begin with semantics and models
- Common misconceptions:
  - "Query processing = Oracle or Google." Need not be so heavyweight or monolithic! Many reusable lessons within.
  - "IR search and DB querying are fundamentally different." Very similar from a query processing perspective, with many similarities in other data models as well.
  - "Querying is a synchronous, interactive process." Triggers, rules and "continuous queries" are not so different from plain old queries.
- So… we'll go bottom-up
  - Focus on reusable building blocks
  - Attempt to be language- and model-agnostic; illustrate with various querying scenarios

Confession: Two Biases
- Relational query engines
  - Most mature and general query technology
  - Best documented in the literature
  - Conceptually general enough to "capture" most all other models/schemes
- Everybody does web searches
  - So it's both an important app and an inescapable usage bias we carry around
  - It will inform our discussion; it shouldn't skew it
- Lots of other query systems/languages you can keep in mind as we go: LDAP, DNS, XSL/XPath/XQuery, Datalog

What Are Queries For? I
- Obvious answer: search and analysis over big data sets
  - Search: select data of interest
    - Boolean expressions over content, sometimes with an implicit ordering on results
  - Analysis: construct new information from base data
    - Compute functions over each datum
    - Concatenate related records (join)
    - Partition into groups, summarize (aggregates)
  - Aside: "mining" vs. "querying"? As a rule of thumb, think of mining as WYGIWIGY ("what you get is what I give you").
- Not the most general, powerful answer…

What Are Queries For? II
- Queries bridge a (large!) level of indirection
  - Declarative programming: say what you want, not how to get it
  - Easy(er) to express
  - Allows the "how" to change under the covers
- A critical issue! Not just for querying: method invocation, data update, etc.
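An aside not in the original slides: a toy Python sketch of the point above. The caller states what it wants (employees in a given department, in a made-up schema); the engine can swap the "how" -- full scan vs. a hash index -- without the caller changing.

```python
# A toy illustration of physical data independence (hypothetical schema/names).
# The caller expresses *what* it wants; the "engine" is free to change *how*.

employees = [
    {"name": "Ann",   "dept": "sales"},
    {"name": "Bob",   "dept": "eng"},
    {"name": "Carol", "dept": "eng"},
]

def plan_scan(rows, dept):
    """Physical plan 1: full scan with a filter."""
    return [r for r in rows if r["dept"] == dept]

def plan_index(rows, dept, _index={}):
    """Physical plan 2: build (once) and probe a hash index on dept."""
    if not _index:
        for r in rows:
            _index.setdefault(r["dept"], []).append(r)
    return _index.get(dept, [])

def query(dept, use_index=False):
    """The declarative interface: 'employees in dept'. The plan choice is hidden."""
    plan = plan_index if use_index else plan_scan
    return plan(employees, dept)

# Same answer either way -- the "how" changed, the "what" did not.
assert query("eng") == query("eng", use_index=True)
```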
Motivation for this Indirection
- Critical when rates of change differ across layers
  - In particular, when d(app)/dt << d(environment)/dt
  - E.g. DB apps are used for years, decades (!!)
  - E.g. networked environments: high rates of change (??)
- The DB literature calls this "data independence"

Data Independence: Background
- Bad Old Days: hierarchical and "network" (yep!) data models
  - Nesting & pointers mean that apps explicitly traverse data, and become brittle when data layouts change
- Apps with persistent data have slow d(app)/dt
- And the database environments change faster!
  - Logical changes to the representation (schema)
  - Physical changes in storage (indexes, layouts, HW)
  - DBs often shared by multiple apps!
- In the Bad Old Days, all apps had to be rewritten on change

It's a SW Engineering Thing
- Analogy: imagine if your C structs were to survive for decades
  - You'd keep them very simple
  - Encapsulation to allow future mods
- Similar analogy to NWs
  - Protocol simplicity is good
  - Soft state is good (discourages hardcoded refs to transient resources)
- But the fun systems part follows directly: achieve the goal with respectable performance over a dynamic execution environment

Codd's Data Independence
- Ted Codd, IBM, c. 1969 and forward; Turing Award 1981
- Two layers of indirection: Applications -> Logical Representation (schema) -> Physical Representation (storage)
  - Logical independence: spanned by views and query rewriting
  - Physical independence: spanned by query optimization and execution

A More Architectural Picture
- Query Rewriter: takes a declarative query over views and bridges logical independence, emitting a declarative query over base tables
- Query Processor:
  - Optimizer: bridges physical independence, emitting a (procedural) query plan
  - Executor: runs the plan through an iterator API over the Access Methods
- N.B.: this classical QP architecture raises some problems. To be revisited!

Access Methods & Indexing

Access Methods
- Base data access layer
- Model: data stored in unordered collections (relations, tables), one type per collection
- Interface: iterators
  - Open(predicate) -> cursor
    - Usually simple predicates: attribute op constant
    - op is usually arithmetic (<, >, =), though we'll see extensions (e.g. multi-d ops)
  - Next(cursor) -> datum (of known type)
  - Close(cursor)
  - Insert(datum of correct type)
  - Delete(cursor)

Typical Access Methods
- "Heap" files
  - Unordered array of records, usually sequential on disk
  - Predicates just save cross-layer costs
- Traditional index AMs
  - B-trees (actually "B+"-trees: all data at the leaves, so you can scan across leaves for range search)
    - Predicates (<, >, =, between) result in fewer I/Os
    - Random I/Os (at least to find the beginning of a range)
  - Linear hash index [Litwin '78]: supports equality predicates only
- This is it for IR and standard relational DBs
  - Though when IR folks say "indexing", they sometimes mean all of query processing
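A minimal sketch (hypothetical names, in-memory lists standing in for disk pages) of a heap-file access method exposing the Open/Next/Close iterator interface described above, with a simple attribute-op-constant predicate:

```python
import operator

# Minimal sketch of a heap-file access method with the iterator interface
# described above (Open/Next/Close). Names and structure are illustrative.
OPS = {"<": operator.lt, "=": operator.eq, ">": operator.gt}

class HeapFile:
    def __init__(self, records):
        self.records = list(records)      # stand-in for unordered pages on disk

    def open(self, attr=None, op=None, const=None):
        """Open(predicate) -> cursor. Predicate is 'attr op const' or empty."""
        return {"pos": 0, "attr": attr, "op": OPS[op] if op else None, "const": const}

    def next(self, cursor):
        """Next(cursor) -> datum, or None when exhausted."""
        while cursor["pos"] < len(self.records):
            rec = self.records[cursor["pos"]]
            cursor["pos"] += 1
            if cursor["op"] is None or cursor["op"](rec[cursor["attr"]], cursor["const"]):
                return rec                # predicate just saves cross-layer calls
        return None

    def close(self, cursor):
        cursor["pos"] = len(self.records)

# Usage: scan all records with age > 30.
emp = HeapFile([{"name": "a", "age": 25}, {"name": "b", "age": 40}])
cur = emp.open("age", ">", 30)
while (r := emp.next(cur)) is not None:
    print(r)
emp.close(cur)
```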
Primary & Secondary Indexes
- (Figure on the original slides.) In a primary index, the directory sits over the data records themselves; in a secondary index, the leaves hold (key, ptr) pairs that point into the data.

An Exotic Forest of Search Trees
- Multi-dimensional indexes
  - For geodata, multimedia search, etc.
  - Dozens! E.g. the R-tree family, disk-based QuadTrees, kdB-trees
  - And of course "linearizations" with B-trees
- Path indexes
  - For XML and OO path queries, e.g. XFilter
- Etc.
- Lots of one-off indexes, often many per workload
  - No clear winners here
  - An extensible indexing scheme would be nice

Generalized Search Trees (GiST) [Hellerstein et al., VLDB 95]
- What is a (tree-based) DB index? Typically:
  - A clustering of data into leaf blocks
  - Hierarchical summaries (subtree predicates -- SPs) for the pointers in directory blocks
- Can realize that abstraction with a simple interface: the user registers opaque SP objects with a few methods
  - Consistent(q, p): should query q traverse this subtree?
  - Penalty(d, p): how bad is it to insert d below p?
  - Union(p1, p2): form an SP that includes both p1 and p2
  - PickSplit({p1, …, pn}): partition the SPs into 2 groups
- Tree maintenance, concurrency, and recovery are all doable under the covers
- Covers many popular multi-dimensional indexes
  - Most of which had no concurrency/recovery story
- http://gist.cs.berkeley.edu
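A loose illustration of the GiST extension interface -- not the Berkeley implementation -- using 1-D intervals as the subtree predicates, which makes the tree behave like a simplified B+-tree/R-tree:

```python
# A loose sketch of the GiST extension interface using 1-D intervals as
# subtree predicates (SPs). Illustrative only, not the Berkeley code.
class IntervalSP:
    def __init__(self, lo, hi):
        self.lo, self.hi = lo, hi

    def consistent(self, q_lo, q_hi):
        """Consistent(q, p): should a query for [q_lo, q_hi] traverse this subtree?"""
        return not (q_hi < self.lo or q_lo > self.hi)

    def penalty(self, d):
        """Penalty(d, p): how much would inserting value d enlarge this SP?"""
        return max(0, self.lo - d) + max(0, d - self.hi)

    @staticmethod
    def union(p1, p2):
        """Union(p1, p2): the smallest SP covering both."""
        return IntervalSP(min(p1.lo, p2.lo), max(p1.hi, p2.hi))

    @staticmethod
    def pick_split(sps):
        """PickSplit({p1..pn}): partition SPs into two groups (here: by low key)."""
        sps = sorted(sps, key=lambda p: p.lo)
        mid = len(sps) // 2
        return sps[:mid], sps[mid:]

# Generic GiST code handles search, insert, split, concurrency, etc., calling
# only these four methods -- e.g. search prunes a subtree whenever
# sp.consistent(q_lo, q_hi) is False.
children = [IntervalSP(0, 10), IntervalSP(20, 30), IntervalSP(25, 50)]
print([i for i, sp in enumerate(children) if sp.consistent(22, 27)])  # -> [1, 2]
```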
Some Additional Indexing Tricks [O'Neil/Quass, SIGMOD 97]
- Bitmap indexing
  - Many matches per value in a (secondary) index? Rather than storing pointers to the heap file in the leaves, store a bitmap of matches in a (sorted) heap file.
  - Only works if file reorganization is infrequent
  - Can make intersection, COUNT, etc. quicker during query processing
  - Can mix/match bitmaps and lists in a single index
  - Works with any (secondary) index with duplicate matches
- "Vertical partitioning" / "columnar storage"
  - Again, for sorted, relatively static files

Query Processing Dataflow Infrastructures

Dataflow Infrastructure
- The dataflow abstraction is very simple
  - "Box-and-arrow" diagrams
  - (Typed) collections of objects flow along the edges
- The details can be tricky
  - "Push" or "pull"? More to it than that: how do control-flow and dataflow interact?
  - Where does the data live? We don't want to copy data, but if we pass pointers, where does the "real" data live?

Iterators
- Most uniprocessor DB engines use iterators
  - Open() -> cursor
  - Next(cursor) -> typed record
  - Close(cursor)
- Simple and elegant
  - Control-flow and dataflow are coupled
  - Familiar single-threaded, procedure-call API
  - Data refs passed on the stack, no buffering
- Blocking-agnostic
  - Works with blocking ops (e.g. sort) and with pipelined ops
- Note: well-behaved iterators "come up for air" in inner loops, e.g. for interrupt handling

Where is the In-Flight Data?
- In a standard DBMS, raw data lives in disk format in a shared Buffer Pool
- Iterators pass references into the BufPool
  - A tuple "slot" per iterator input
  - Never copy along the edges of the dataflow
  - Join results are arrays of refs to base tables
- Operators may "pin" pages in the BufPool
  - The BufPool never replaces pinned pages
  - Ops should release pins ASAP (esp. across Next() calls!!)
- Some operators copy data into their internal state
  - They can "spill" this state to private disk space

Weaknesses of Simple Iterators
- Evolution of uniprocessor architectures to parallel architectures, esp. "shared-nothing" clusters
  - Opportunity for pipelined parallelism
  - Opportunity for partition parallelism: take a single "box" in the dataflow and split it across multiple machines
- Problems with iterators in this environment
  - Spoils the pipelined parallelism opportunity
  - Polling (Next()) across the network is inefficient: nodes sit idle until polled, and during comm
  - A blocking producer blocks its consumer
- But we'd like to keep the iterator abstraction
  - Especially to save legacy query processor code
  - And to simplify debugging (single-threaded, synchronous)

Exchange [Graefe, SIGMOD 90]
- Encapsulate partition parallelism & asynchrony; keep the iterator API between ops
- The Exchange operator partitions input data by content, e.g. on join or sort keys
- Note the basic architectural idea: encapsulate dataflow tricks in operators, leaving the infrastructure untouched
  - We'll see this again next week, e.g. in Eddies

Exchange Internals
- Really 2 operators, XIN and XOUT
  - XIN sits at the "top" of the producer graph: it pulls from below and pushes results to XOUT's queue
  - XOUT sits at the bottom of the consumer graph and spins on its local queue
  - One thread becomes two
- A routing table/function in XIN supports partition parallelism, e.g. for parallel sort, join, etc.
- Producer and consumer still see the iterator API
  - The queue + thread barrier turns NW-based "push" into iterator-style "pull"

Exchange Benefits?
- Remember the iterator limitations?
  - "Spoils the pipelined parallelism opportunity": solved by the Exchange thread boundary
  - "Polling (Next()) across the network is inefficient": solved by XIN pushing to XOUT's queue
  - "A blocking producer blocks its consumer": still a problem!

Exchange Limitations
- Doesn't allow consumer work to overlap with blocking producers
  - E.g. streaming data sources, events
  - E.g. sort, some join algorithms
- The entire consumer graph blocks if the XOUT queue is empty
  - Control flow is coupled to dataflow, so XOUT won't return without data
  - The queue is encapsulated from the consumer
- But… note that the Exchange model is fine for most traditional DB query processing
  - It may need to be extended for new settings…

Fjords [Madden/Franklin, ICDE 01]
- A thread of control per operator, with queues between operators
- Asynch or synch calls
  - Can do asynch poll-and-yield iteration in each operator (for both consumer and producer)
  - Or can do synchronous get_next iteration
- Can get traditional behavior if you want:
  - Synch polls + a queue of size 1 = iterators
  - Synch consumer, asynch producer = Exchange
- Asynch calls solve the blocking problem of Exchange
- Disadvantages:
  - Lots of "threads"
    - Best done in an event-programming style, not OS threads
    - Operators really have to "come up for air" ("yield")
    - Need to write your own scheduler
  - Harder to debug
- But:
  - Maximizes flexibility for operators at the endpoints
  - Still provides a fairly simple interface for operator-writers
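A minimal sketch of the Exchange idea (assuming nothing about Volcano's actual code): a producer thread pulls from its subplan and hash-partitions rows into consumer-side queues, while each consumer keeps an ordinary pull-style next():

```python
import threading, queue

# Sketch of an Exchange-style operator: a producer thread pulls from its
# subplan and routes each tuple by content (hash partitioning) to one of
# several consumer-side queues; consumers keep a pull-style next() API.
# Purely illustrative.
DONE = object()   # end-of-stream marker

class XIn:
    """Producer side: pulls from an iterator, pushes to partitioned queues."""
    def __init__(self, subplan, queues, key):
        self.subplan, self.queues, self.key = subplan, queues, key
        threading.Thread(target=self._run, daemon=True).start()

    def _run(self):
        for row in self.subplan:
            self.queues[hash(self.key(row)) % len(self.queues)].put(row)
        for q in self.queues:
            q.put(DONE)

class XOut:
    """Consumer side: looks like an ordinary iterator (pull API preserved)."""
    def __init__(self, q):
        self.q = q

    def next(self):
        row = self.q.get()          # blocks if the producer hasn't pushed yet
        return None if row is DONE else row

# Two consumers, partitioned on the join/sort key "k".
queues = [queue.Queue(maxsize=8), queue.Queue(maxsize=8)]
XIn(iter([{"k": i} for i in range(6)]), queues, key=lambda r: r["k"])
out0 = XOut(queues[0])
while (row := out0.next()) is not None:
    print("consumer 0 got", row)
```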
Basic Relational Operators and Implementation

Relational Algebra Semantics
- Selection: σ_p(R) returns all rows in R that satisfy predicate p
- Projection: π_C(R) returns all rows in R projected to the columns in C
  - In the strict relational model, remove duplicate rows; in SQL, preserve duplicates (multiset semantics)
- Cartesian product: R × S
- Union: R ∪ S
- Difference: R − S
  - Note: for union and difference, R and S must have matching schemata
- Join: R ⋈_p S = σ_p(R × S)
- Missing: grouping & aggregation, sorting

Operator Overview: Basics
- Selection
  - Typically "free", so "pushed down"; often omitted from diagrams
- Projection
  - In SQL, typically "free" (no duplicate elimination), so "pushed down"
  - Always pass the minimal set of columns downstream
  - Typically omitted from diagrams
- Cartesian product
  - An unavoidable nested loop to generate the output
- Union
  - Concat, or concat followed by duplicate elimination

Operator Overview, Cont.
- Unary operators: grouping & sorting
  - Grouping can be done with hash or sort schemes (as we'll see)
- Binary matching: joins/intersections
  - Alternative algorithms: nested loops, loop with index lookup (index N.L.), sort-merge, hash join
- Don't forget: these have to be written as iterators
  - Every time you get called with Next(), you adjust your state and produce an output record

Unary External Hashing [Bratbergsengen, VLDB 84]
- E.g. GROUP BY, DISTINCT
- Two hash functions, hc (coarse) and hf (fine), and two phases
- Phase 1: for each tuple of the input, hash via hc into a "spill" partition to be put on disk
  - B-1 blocks of memory hold output buffers, for writing a block at a time per partition
  - (Figure: the input relation streams through memory; hc routes each tuple to one of B-1 partition output buffers, which spill to disk.)
- Phase 2: for each partition, read it off disk and hash into a main-memory hashtable via hf
  - For DISTINCT, when you find a value already in the hashtable, discard the copy
  - For GROUP BY, associate some aggregate state (e.g. a running SUM) with each group in the hash table, and maintain it
  - (Figure: each partition is read back and hashed via hf into an in-memory hash table of k < B pages; results flow to an output buffer.)

External Hashing: Analysis
- To utilize memory well in Phase 2, we'd like each partition to be ~B blocks big
- Hence this works in two phases when B >= √|R|, roughly -- the same requirement as two-pass external sorting!
- Else, recursively partition the partitions in Phase 2
- Can be made to pipeline, to adapt nicely to small data sets, etc.
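A small sketch of the two phases for GROUP BY ... SUM, with in-memory lists standing in for the on-disk spill partitions; hc and hf are just two different hash functions, as above:

```python
# Sketch of two-phase external hashing for GROUP BY ... SUM(val).
# Lists stand in for on-disk spill partitions; hc and hf are the coarse
# and fine hash functions from the slides. Illustrative only.
NUM_PARTITIONS = 3                       # plays the role of the B-1 output buffers

def hc(key):                             # coarse hash: pick a spill partition
    return hash(("coarse", key)) % NUM_PARTITIONS

def hf(key):                             # fine hash: key of the in-memory table (a dict here)
    return key

def external_hash_group_by(tuples):
    # Phase 1: scan the input once, spilling each tuple to a partition.
    partitions = [[] for _ in range(NUM_PARTITIONS)]
    for key, val in tuples:
        partitions[hc(key)].append((key, val))

    # Phase 2: process one partition at a time; each should fit in memory.
    for part in partitions:
        table = {}                       # in-memory hashtable keyed via hf
        for key, val in part:
            table[hf(key)] = table.get(hf(key), 0) + val   # maintain agg state
        for key, total in table.items():
            yield key, total             # a group lives in exactly one partition

rows = [("eng", 10), ("sales", 5), ("eng", 7), ("hr", 3), ("sales", 1)]
print(dict(external_hash_group_by(rows)))   # totals per group: eng 17, sales 6, hr 3
```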
Hash Join (GRACE) [Fushimi et al., VLDB 84]
- Phase 1: partition each relation on the join key with hc, spilling to disk
- Phase 2: build each partition of the smaller relation into a hashtable via hf; scan the matching partition of the bigger relation, and for each tuple probe the hashtable via hf for matches
- We'd like each partition of the smaller relation to fit in memory
  - So it works well if B >= √|smaller|
  - The size of the bigger relation is irrelevant!! (vs. sort-merge join)
- Popular optimization: hybrid hash join [DeWitt/Katz/Olken/Shapiro/Stonebraker/Wood, SIGMOD 84]
  - Partition #0 doesn't spill -- it builds and probes immediately
  - Partitions 1 through n use the rest of memory for output buffers
- (Figure: in Phase 1, each input relation streams through memory and is split via hc into B-1 spill partitions of R & S on disk; in Phase 2, for each partition i, a hash table on Ri of k < B-1 pages is built via hf, Si streams through an input buffer and probes via hf, and join results flow to an output buffer.)

Symmetric Hash Join [Mikkilineni & Su, TOSE 88] [Wilschut & Apers, PDIS 91]
- A pipelining, in-core variant: build and probe symmetrically
- Correctness: each output tuple is generated when its last-arriving component appears
- Can be extended to the out-of-core case
  - Tukwila [Ives & Halevy, SIGMOD '99]
  - XJoin: spill and read partitions multiple times; correctness guaranteed by timestamping tuples and partitions [Urhan & Franklin, DEBull '00]
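A minimal in-core illustration of the symmetric hash join (not the Tukwila/XJoin code): each side keeps its own hash table, and every arriving tuple is inserted into its own side's table and then probed against the other side's:

```python
from collections import defaultdict

# In-core symmetric (pipelining) hash join sketch: build and probe on both
# sides as tuples arrive, in any interleaved order. Illustrative only.
class SymmetricHashJoin:
    def __init__(self):
        self.tables = {"R": defaultdict(list), "S": defaultdict(list)}

    def insert(self, side, key, tup):
        """Insert a tuple from side 'R' or 'S'; return any new join results."""
        other = "S" if side == "R" else "R"
        self.tables[side][key].append(tup)
        # An output tuple is produced exactly when its last-arriving
        # component appears, so nothing is missed or duplicated.
        return [(tup, match) if side == "R" else (match, tup)
                for match in self.tables[other][key]]

join = SymmetricHashJoin()
arrivals = [("R", 1, "r1"), ("S", 2, "s1"), ("S", 1, "s2"), ("R", 1, "r2")]
for side, key, tup in arrivals:
    for r, s in join.insert(side, key, tup):
        print("joined:", r, s)    # ("r1", "s2"), then ("r2", "s2")
```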
Relational Query Engines

A Basic SQL Primer
- SELECT [DISTINCT] <output expressions>
  FROM <tables>
  [WHERE <predicates>]
  [GROUP BY <gb-expression> [HAVING <h-predicates>]]
  [ORDER BY <expression>]
- Semantics:
  - Join the tables in the FROM clause, applying the predicates in the WHERE clause
  - If GROUP BY, partition the results by group, and maintain the aggregate output expressions per group
  - Delete groups that don't satisfy the HAVING clause
  - If ORDER BY, sort the output accordingly

Examples
- Single-table S-F-W; DISTINCT, ORDER BY
- Multi-table S-F-W, and self-joins
- Scalar output expressions
- Aggregate output expressions, with and without DISTINCT
- GROUP BY, HAVING
- Nested queries, uncorrelated and correlated

A Dopey Query Optimizer
- For each S-F-W query block, create a plan that:
  - Forms the Cartesian product of the FROM-clause tables
  - Applies the WHERE-clause predicates (σ over ×)
  - Incredibly inefficient: huge intermediate results!
- Then, as needed:
  - Apply the GROUP BY clause
  - Apply the HAVING clause
  - Apply any projections and output expressions
  - Apply duplicate elimination and/or ORDER BY

An Oracular Query Optimizer
- For each possible correct plan:
  - Run the plan (infinitely fast)
  - Measure its performance in reality
- Pick the best plan, and run it in reality

A Standard Query Optimizer
- Three aspects to the problem:
  - Legal plan space (transformation rules)
  - Cost model
  - Search strategy

Plan Space
- Many legal algebraic transformations, e.g.:
  - A Cartesian product followed by a selection can be rewritten as a join
  - Join is commutative and associative, so the join tree can be reordered arbitrarily
    - NP-hard to find the best join tree in general
  - Selections should (usually) be "pushed down"
  - Projections can be "pushed down"
- And "physical" choices
  - Choice of access methods
  - Choice of join algorithms
  - Taking advantage of the sorted nature of some streams
    - Complicates dynamic programming, as we'll see

Cost Model & Selectivity Estimation
- The cost of a physical operator can be modeled fairly accurately
  - E.g. the number of random and sequential I/Os
  - Requires metadata about the input tables: number of rows (cardinality), bytes per tuple (physical schema)
- In a query pipeline, metadata on intermediate tables is trickier
  - Cardinality? Requires "selectivity" (COUNT) estimation
    - Wet-finger estimates
    - Histograms, joint distributions and other summaries
    - Sampling

Search Strategy
- Dynamic programming
  - Used in most commercial systems
  - IBM's System R [Selinger et al., SIGMOD 79]
- Top-down
  - Branch and bound with memoization
  - Exodus, Volcano & Cascades [Graefe, SIGMOD 87, ICDE 93, DEBull 95]
  - Used in a few commercial systems (Microsoft SQL Server, especially)
- Randomized
  - Simulated annealing, etc. [Ioannidis & Kang, SIGMOD 90]

Dynamic Programming
- Use the principle of optimality: any subtree of the optimal plan is itself optimal for its subexpression
- Plans are enumerated in N passes (if N relations are joined):
  - Pass 1: find the best 1-relation plan for each relation
  - Pass 2: find the best way to join the result of each 1-relation plan (as outer) to another relation (all 2-relation plans)
  - …
  - Pass N: find the best way to join the result of an (N-1)-relation plan (as outer) to the N'th relation (all N-relation plans)
- This gives all left-deep plans; the generalization is easy…
- A wrinkle: physical properties (e.g. sort orders) violate the principle of optimality!
  - Use partial-order dynamic programming, i.e. keep the undominated plans at each step -- optimal for each setting of the physical properties (each "interesting order")
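A toy rendering of the left-deep dynamic program, under a made-up cost model (cost = total size of intermediate results, with independent join selectivities); System R's actual cost model and interesting-order handling are richer:

```python
from itertools import combinations

# Toy Selinger-style dynamic programming over left-deep join orders.
# Hypothetical inputs: per-relation cardinalities and pairwise join
# selectivities; cost = total size of intermediate results produced.
card = {"R": 1000, "S": 100, "T": 10}
sel = {frozenset(["R", "S"]): 0.01, frozenset(["S", "T"]): 0.1,
       frozenset(["R", "T"]): 1.0}          # 1.0 = no join predicate (cross product)

def join_size(size, rels, new_rel):
    out = size * card[new_rel]
    for r in rels:
        out *= sel[frozenset([r, new_rel])]
    return out

def best_left_deep_plan(relations):
    # best[subset] = (cost, result_size, plan)
    best = {frozenset([r]): (0.0, card[r], r) for r in relations}
    for k in range(2, len(relations) + 1):
        for subset in map(frozenset, combinations(relations, k)):
            for new_rel in subset:                      # the last relation joined in
                rest = subset - {new_rel}
                cost, size, plan = best[rest]
                out = join_size(size, rest, new_rel)
                cand = (cost + out, out, f"({plan} join {new_rel})")
                if subset not in best or cand[0] < best[subset][0]:
                    best[subset] = cand
    return best[frozenset(relations)]

cost, size, plan = best_left_deep_plan(["R", "S", "T"])
print(plan, "cost:", cost)    # e.g. ((T join S) join R) under these made-up numbers
```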
Relational Architecture Review
- (Figure: the classic stack.) Query Parsing and Optimization -> Query Executor -> Access Methods -> Buffer Management -> Disk Space Management -> DB, with the Lock Manager and Log Manager alongside.

Text Search

Information Retrieval
- A research field traditionally separate from databases
  - Goes back to IBM, Rand and Lockheed in the 50's; G. Salton at Cornell in the 60's
  - Lots of research since then
- Products traditionally separate
  - Originally, document management systems for libraries, government, law, etc.
  - Gained prominence in recent years due to web search
- Today: simple IR techniques, showing similarities to DBMS techniques you already know

IR vs. DBMS
- Seem like very different beasts:
  - IR: imprecise semantics; DBMS: precise semantics
  - IR: keyword search; DBMS: SQL
  - IR: unstructured data format; DBMS: structured data
  - IR: read-mostly, add docs occasionally; DBMS: expects a reasonable number of updates
  - IR: page through the top k results; DBMS: generates the full answer
- Under the hood, not as different as they might seem
- But in practice, you have to choose between the two

IR's "Bag of Words" Model
- Typical IR data model: each document is just a bag of words ("terms")
- Detail 1: "stop words"
  - Certain words are considered irrelevant and not placed in the bag, e.g. "the", or HTML tags like <H1>
- Detail 2: "stemming"
  - Using English-specific rules, convert words to their basic form, e.g. "surfing", "surfed" --> "surf"
- Detail 3: we may decorate the words with other attributes
  - E.g. position, font info, etc.
  - Not exactly "bag of words" after all

Boolean Text Search
- Find all documents that match a Boolean containment expression, e.g.:
  - "Windows" AND ("Glass" OR "Door") AND NOT "Microsoft"
- Note: query terms are also filtered via stemming and stop words
- When web search engines say "10,000 documents found", that's the Boolean search result size

Text "Indexes"
- When IR folks say "index" or "indexing", they usually mean more than what DB people mean
  - In our terms, both "tables" and indexes
  - Really a logical schema (i.e. tables) with a physical schema (i.e. indexes)
- Usually not stored in a DBMS; the tables are implemented as files in a file system

A Simple Relational Text Index
- Create and populate a table InvertedFile(term string, docID int64)
- Build a B+-tree or hash index on InvertedFile.term
  - May be lots of duplicate docIDs per term
  - Secondary index: list compression per term is possible
- This is often called an "inverted file" or "inverted index"
  - It maps from words -> docs, whereas normal files map docs to the words in the doc (?!)
- Can now do single-word text search queries

Handling Boolean Logic
- How to do "term1" OR "term2"? Union of the two docID sets
- How to do "term1" AND "term2"? Intersection (ID join) of the two docID sets!
- How to do "term1" AND NOT "term2"? Set subtraction -- also a join algorithm
- How to do "term1" OR NOT "term2"? Union of "term1" and "NOT term2", where "NOT term2" = all docs not containing term2. Yuck! Usually forbidden at the UI/parser.
- Refinement: in what order should terms be handled if you have many ANDs/NOTs? E.g. "Windows" AND ("Glass" OR "Door") AND NOT "Microsoft"

Boolean Search in SQL
- (SELECT docID FROM InvertedFile WHERE term = 'window'
   INTERSECT
   SELECT docID FROM InvertedFile WHERE term = 'glass' OR term = 'door')
  EXCEPT
  SELECT docID FROM InvertedFile WHERE term = 'Microsoft'
  ORDER BY magic_rank()
- Really there's only one query (template) in IR: single-table selects, UNION, INTERSECT, EXCEPT
  - Note that INTERSECT is shorthand for an equijoin on a key
- Often there's only one query plan in the system, too!
- magic_rank() is the "secret sauce" in the search engines

Fancier: Phrases and "NEAR"
- Suppose you want a phrase, e.g. "Happy Days"
- Add a position attribute to the schema: InvertedFile(term string, docID int64, position int), indexed on term
- Enhance the join condition in the query: you can't use the INTERSECT syntax, but the query is nearly the same:
  SELECT I1.docID
  FROM InvertedFile I1, InvertedFile I2
  WHERE I1.term = 'HAPPY' AND I2.term = 'DAYS'
    AND I1.docID = I2.docID
    AND I2.position - I1.position = 1
  ORDER BY magic_rank()
- Can relax to "term1" NEAR "term2": positions within k of each other (e.g. |I2.position - I1.position| < k)

Classical Document Ranking
- TF·IDF (term frequency · inverse document frequency)
- For each term t in the query q:
  - QueryTermRank = (#occurrences of t in q) [TF] × log((total #docs) / (#docs containing t)) [IDF] / normalization-factor
  - For each doc d in the Boolean result:
    - DocTermRank = (#occurrences of t in d) × log((total #docs) / (#docs containing t)) / normalization-factor
    - Rank(d) += DocTermRank × QueryTermRank
- Requires more in our schema:
  - InvertedFile(term string, docID int64, position int, DocTermRank float)
  - TermInfo(term string, numDocs int)
  - Can compress DocTermRank non-relationally
- This basically works fine for raw text
  - There are other schemes, but this is the standard TF·IDF
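Pulling the last few slides together, a small self-contained sketch (hypothetical documents; stop words, stemming and normalization omitted) of an inverted index used first for a Boolean AND and then for TF·IDF ranking of the surviving docIDs:

```python
import math
from collections import defaultdict

# Tiny inverted index with Boolean AND filtering plus TF*IDF ranking.
# Hypothetical documents; stop words/stemming omitted for brevity.
docs = {
    1: "glass door for windows house",
    2: "microsoft windows glass ui",
    3: "wooden door and glass windows",
}

inverted = defaultdict(lambda: defaultdict(int))   # term -> {docID: term count}
for doc_id, text in docs.items():
    for term in text.split():
        inverted[term][doc_id] += 1

def idf(term):
    return math.log(len(docs) / max(1, len(inverted[term])))

def boolean_and(terms):
    sets = [set(inverted[t]) for t in terms]
    return set.intersection(*sets) if sets else set()

def rank(query_terms, candidates):
    scores = defaultdict(float)
    for t in query_terms:
        query_term_rank = query_terms.count(t) * idf(t)      # TF * IDF in the query
        for d in candidates:
            doc_term_rank = inverted[t].get(d, 0) * idf(t)   # TF * IDF in the doc
            scores[d] += doc_term_rank * query_term_rank
    return sorted(scores.items(), key=lambda kv: -kv[1])

query = ["windows", "glass", "door"]
hits = boolean_and(query)                 # Boolean result: docs 1 and 3
print(rank(query, hits))                  # ranked (docID, score) pairs
```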
Some Additional Ranking Tricks
- Phrases/proximity: the ranking function can incorporate position
- Query expansion, suggestions: keep a similarity matrix on terms, and expand/modify people's queries
- Document expansion: add terms to a doc, e.g. from the "anchor text" of references to the doc
- Not all occurrences are created equal: mess with DocTermRank based on fonts, position in the doc (title, etc.)

Hypertext Ranking
- Also factor in the graph structure
  - Social network theory (citation analysis)
  - "Hubs and Authorities" (Clever), "PageRank" (Google)
  - Intuition: recursively weighted in-degrees and out-degrees
  - Math: an eigenvector computation
- PageRank sure seems to help
  - Though word on the street is that other factors matter as much: anchor text, title/bold text, etc.

Updates and Text Search
- Text search engines are designed to be query-mostly
  - Deletes and modifications are rare
  - Can postpone updates (nobody notices, no transactions!)
- Can't afford to go offline for an update?
  - Do updates in batch (rebuild the index)
  - Create a 2nd index on a separate machine, then replace the 1st index with the 2nd
  - Can do this incrementally with a level of indirection
- So no concurrency control problems
- Can compress to a search-friendly, update-unfriendly format
- For these reasons, text search engines and DBMSs are usually separate products
  - Also, text-search engines tune that one SQL query to death! The benefits of a special-case workload.

Architectural Comparison
- (Figure: DBMS stack vs. search engine stack.)
  - DBMS: Query Optimization and Execution over Relational Operators, Files and Access Methods, Buffer Management, and Disk Space Management, with concurrency and recovery needed, over the DB.
  - Search engine: Search String Modifier and Ranking Algorithm over "The Query", The Access Method, and OS-provided buffer and disk space management, over the DB -- in effect a simple DBMS.

Revisiting Our IR/DBMS Distinctions
- Data modeling & query complexity
  - DBMS: supports any schema & queries
    - Requires you to define a schema
    - Complex query language (hard for folks to learn)
    - Multiple applications at the output; RowSet API (cursors)
  - IR: supports only one schema & query
    - No schema design required (unstructured text)
    - Trivial query language
    - Single application behavior: page through the output in rank order, ignore most of the output
- Storage semantics
  - DBMS: online, transactional storage
  - IR: batch, unreliable storage

Distribution & Parallelism

Roots
- Distributed QP vs. parallel QP
  - Distributed QP was envisioned as a k-node intranet for k ~= 10
    - Sound old-fashioned? Think of multiple hosting sites (e.g. one per continent)
  - Parallel QP grew out of DB Machine research: all in one room, one administrator
- Parallel DBMS architecture options: shared-nothing, shared-everything, shared-disk
  - Shared-nothing is the most general and most scalable

Distributed QP: Semi-Joins [Bernstein/Goodman '79]
- The main query processing issue in the distributed DB literature: use semi-joins
  - R ⋉ S = π_R(R ⋈ S)
  - Observe that R ⋈ S = (R ⋉ π(S)) ⋈ S
- Assume each table lives at one site and R is bigger. To reduce communication: ship S's join columns to R's site, do the semi-join there, and ship the result to S's site for the join
- Notes
  - I'm sloppy about duplicates in my definitions above
  - Semi-joins aren't always a win: an extra cost estimation task for a distributed optimizer

Bloom Joins [Babb, TODS 79]
- A constant optimization on semi-joins
- Idea: in (R ⋉ π(S)) ⋈ S, the final join re-checks the predicate anyway, so the semi-join can safely return "false hits" from R
  - Rather than shipping π(S), ship a superset: a particular kind of lossy set compression is allowed
- Bloom filter (B. Bloom, 1970)
  - Hash each value in a set via k independent hash functions onto an array of n bits; check membership correspondingly
  - By tuning k and n, you can control the false hit rate
  - Rediscovered recently in the web literature, with some new wrinkles (Mitzenmacher's compressed B.F.'s, Rhea's attenuated B.F.'s)
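A minimal Bloom filter sketch (hash functions simulated by salting Python's built-in hash; a real implementation would use independent hashes over a packed bit array), and how it would be used in a Bloom join -- ship the filter on S's join keys to R's site and send back only the R tuples that might match:

```python
# Minimal Bloom filter sketch: k salted hash functions over an n-bit array.
# Illustrative only -- real implementations use independent hashes and a
# packed bit array, and pick n and k for a target false-hit rate.
class BloomFilter:
    def __init__(self, n_bits=64, k=3):
        self.n, self.k = n_bits, k
        self.bits = [False] * n_bits

    def _positions(self, value):
        return [hash((salt, value)) % self.n for salt in range(self.k)]

    def add(self, value):
        for pos in self._positions(value):
            self.bits[pos] = True

    def might_contain(self, value):
        # False means definitely absent; True may be a false hit.
        return all(self.bits[pos] for pos in self._positions(value))

# Bloom-join use: build the filter over S's join keys at S's site, ship the
# n-bit array (not the keys) to R's site, and send back only "possible" R tuples.
s_keys = {101, 205, 307}
bf = BloomFilter()
for key in s_keys:
    bf.add(key)

r_tuples = [(100, "a"), (101, "b"), (205, "c"), (999, "d")]
candidates = [t for t in r_tuples if bf.might_contain(t[0])]
print(candidates)   # contains (101, 'b') and (205, 'c'); may include false hits
```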
Sideways Information Passing
- These ideas generalize more broadly
  - (Figure: a combiner takes a "set o' stuff" plus the output of a costly set generator; passing information from the first input sideways into the generator turns it into a less costly generator.)
  - E.g. "magic sets" rewriting in Datalog & SQL
  - Tricky to do optimally in those settings, but the wins can be very big

Parallelism 101 [see DeWitt & Gray, CACM 92]
- Pipelined vs. partitioned parallelism
  - Pipelined is typically inter-operator: nominal benefits in a dataflow
  - Partitioned is typically intra-operator, e.g. a hash join or sort using k nodes
- Speedup & scaleup
  - Speedup: x = old_time / new_time; ideal: linear
  - Scaleup: small_sys_elapsed_small_problem / big_sys_elapsed_big_problem; ideal: 1
    - Transaction scaleup: N times as many TPC-C's for N machines
    - Batch scaleup: N times as big a DB for a query on N machines

Impediments to Good Parallelism
- Startup overheads: amortized for big queries
- Interference: usually the result of unpredictable communication delays (comm cost, empty pipelines)
- Skew
- Of these, skew is the real issue in DBs -- the work is otherwise "embarrassingly parallel", i.e. it works

Data Layout
- Horizontal partitioning
  - For each table, assign rows to machines by some key, or assign arbitrarily (round-robin)
- Vertical partitioning
  - Sort the table, and slice off columns
  - Usually not a parallelism trick, but nice for processing queries on read-mostly data (projection is free!)

Intra-Operator Parallelism
- E.g. for hash join:
  - Every site with a horizontal partition of either R or S fires off a scan thread
  - Every storage site reroutes its data among the join nodes based on a hash of the join column(s)
  - Upon receipt, each site does a local hash join
- Recall Exchange!

Skew Handling
- Skew happens. Even when hashing? Yep.
- Can pre-sample and/or pre-summarize the data to partition better
- Solving skew on the fly is harder: you need to migrate accumulated dataflow state
  - FLuX: Fault-Tolerant, Load-balancing eXchange

In Current Architectures
- All DBMSs can run on shared memory, and many on shared-nothing
  - The high end belongs to clusters
- The biggest web-search engines run on clusters (Google, Inktomi)
  - And they use pretty textbook DB stuff for Boolean search
  - Fun tradeoffs between answer quality and availability/management here (the Inktomi story)

Precomputations

Views and Materialization
- A view is a logical table: a query with a name
  - In general, not updatable
- If it is to be used often, it could be materialized
  - Pre-compute and/or cache the result
  - Could even choose to do this for common query sub-expressions; needn't require a DBA to say "this is a view"

Challenges in Materialized Views
- Three main issues:
  - Given a workload, which views should be materialized?
  - Given a query, how can mat-views be incorporated into the query optimizer?
  - As base tables are updated, how can views be incrementally maintained?
- See the readings book, Gupta & Mumick

Precomputation in IR
- Often want to save the results of common queries
  - E.g. no point re-running "Britney Spears" or "Harry Potter" as Boolean queries
  - Can also use them as subquery results: e.g. the query "Harry Potter Loves Britney Spears" can use the "Harry Potter" and "Britney Spears" results
- A constrained version of mat-views
  - No surprise -- a constrained relational workload
  - And consistency of the mat-view with the raw tables is not critical, so maintenance is not such an issue

Precomputed Aggregates
- Aggregation queries work on numerical sets of data, so math tricks apply
- Theme: replace the raw data with small statistical summaries and get approximate results
  - Some trivial, some fancy: histograms, wavelets, samples, dependency-based models, random projections, etc.
- Heavily used in query optimizers for selectivity estimation (a COUNT aggregate)
- Spate of recent work on approximate query processing for AVG, SUM, etc. [Garofalakis, Gehrke & Rastogi tutorial, SIGMOD '02]

A Taste of Next Week: Query Dataflows Meet NWs
- Some more presentation
  - Indirection in space and time
  - Thematic similarities and differences in NW/QP
  - Adaptive QP in Telegraph: Eddies, SteMs, FLuX
  - A taste of QP in sensor networks: revisit a NW "classic" through a DB lens -- TinyDB (TAG), Directed Diffusion
  - A taste of QP in p2p: the PIER project at Berkeley, parallel QP over DHTs
- Presentations from MITers?
- Open discussion

Contact
- jmh@cs.berkeley.edu
- http://www.cs.berkeley.edu/~jmh