Interactive Data Exploration Using Semantic Windows
Alexander Kalinin, Ugur Cetintemel, Stan Zdonik

Interactive Data Exploration (IDE)
(Figure: "Where's Waldo?" vs. "Where's Horrible Gelatinous Blob?")
Searching for "interesting stuff" within big data
• Exploratory analysis: ad-hoc & repetitive
  • Questions are not well defined
  • "Interesting" can be complex
• Human-in-the-loop operation
  • Fast, online results
  • Query refinement

Exploratory Queries: An SDSS example
• Searching for regions of interest
  • "Celestial 3-5° by 5-7° rectangular regions with average brightness > 0.8"
• Shape-based conditions
  • "3-5° by 5-7° regions"
• Content-based conditions
  • "average brightness > 0.8"
(Figure: such regions, labeled Semantic Windows, on a Sloan Digital Sky Survey (SDSS) image)

"Celestial 3-5° by 5-7° regions with average brightness > 0.8" in SQL
1. Divide the data into cells
2. Enumerate all regions
3. Final filtering (> 0.8)

No native support for exploratory constructs!
• SQL queries
  • No power set
  • GROUP BY – no overlaps
  • OVER – too restrictive
• Performance problems
  • Large CPU overhead
  • Hard to optimize
  • No interactivity

SQL/SW Extensions for Data Exploration

SELECT lb(ra), rb(ra), lb(dec), rb(dec), avg(brightness)
FROM sdss
GRID BY ra BETWEEN 100 AND 300 STEP 1
        dec BETWEEN 5 AND 40 STEP 1
HAVING avg(brightness) > 0.8
   AND size(ra) = 3
   AND size(dec) >= 1 AND size(dec) <= 3

(Figure: the ra/dec cell grid defined by the GRID BY clause)

Search Process Outline
1. Dynamically enumerate windows (subject to pruning)
2. Study them in order of utility
3. Output the windows satisfying the conditions
Focus is on online results!

Enumerating Windows
(Figure: a window growing one cell at a time over a numbered grid)
Extension:
• Any dimension
• One step

Cost-aware Solver
• Best-first search based on the utility
  • Utility = f(benefit, cost)
• Benefit – how close a window is to satisfying the conditions
  • Computed for the aggregates from the content-based conditions
  • A distance between the required value and the estimated value
• Cost – how expensive it is to read a window from disk
  • Measured in the number of cells we have to read
  • Adjustments are made for skewed data

Best-first Search
(Figures: two steps of the search; the priority queue, ordered by utility, determines which window is read and extended next)

Optimizations
• Cost and benefit are estimated by sampling
  • Uniform – sample the whole search space
  • Stratified – sample each cell uniformly
• Aggregate values are cached in a cell cache
  • Dynamic utility updates
  • Avoids re-reading the same cells
• Constraint-based pruning during the search

Pruning
Shape-based conditions:
• Size > 1
• Shape is ? x 2
(Figure: candidate extensions eliminated by the shape-based conditions)

Prefetching
• Problem: small reads
  • Help online results
  • Hurt total performance
• Window-locality vs. disk-locality
  • Poor disk page utilization
  • Thrashing: reading the same pages multiple times
• Prefetching: read a neighborhood with every window
  • Larger reads, fewer of them
  • Better disk page utilization
(Figure: cell read order without prefetching vs. with prefetching)

Adaptive Prefetching
• How much to prefetch?
  • Large reads might hurt online results
• Progress-driven scheme (sketched below):
  • Finding new results? Prefetch a small amount
  • No new results? Increase the prefetch exponentially
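Below is a minimal, self-contained Python sketch of the search loop the preceding slides describe: candidate windows sit in a priority queue ordered by utility, are extended one grid step at a time, are pruned by the shape-based conditions, and every window read goes through a cell cache that also prefetches a surrounding neighborhood whose size follows the progress-driven rule. All names, the toy grid, and the 0.01 cost weight are illustrative rather than the system's actual API, and benefit/cost are computed from cached values or a fixed default instead of the sampling-based estimates the real solver uses.

import heapq
import itertools
import random

random.seed(0)
GRID = 30                                      # cells per dimension (toy grid)
DISK = {(x, y): random.random()                # avg brightness per cell "on disk"
        for x in range(GRID) for y in range(GRID)}

THRESHOLD = 0.8                                # content condition: avg(brightness) > 0.8
MAX_W, MAX_H = 3, 3                            # shape condition: at most 3 x 3 cells

cell_cache = {}                                # cells already read from "disk"
disk_reads = 0

def read_window(win, prefetch):
    """Read a window's cells, prefetching a prefetch-cell border around it."""
    global disk_reads
    (x0, x1), (y0, y1) = win
    lo_x, hi_x = max(0, x0 - prefetch), min(GRID - 1, x1 + prefetch)
    lo_y, hi_y = max(0, y0 - prefetch), min(GRID - 1, y1 + prefetch)
    for c in itertools.product(range(lo_x, hi_x + 1), range(lo_y, hi_y + 1)):
        if c not in cell_cache:
            cell_cache[c] = DISK[c]
            disk_reads += 1
    return [cell_cache[(x, y)] for x in range(x0, x1 + 1) for y in range(y0, y1 + 1)]

def utility(win):
    """Utility = f(benefit, cost): distance to the threshold minus a read-cost penalty."""
    (x0, x1), (y0, y1) = win
    cells = [(x, y) for x in range(x0, x1 + 1) for y in range(y0, y1 + 1)]
    est = sum(cell_cache.get(c, 0.5) for c in cells) / len(cells)  # crude estimate
    benefit = -max(0.0, THRESHOLD - est)       # 0 once the condition looks satisfied
    cost = sum(1 for c in cells if c not in cell_cache)
    return benefit - 0.01 * cost

def extensions(win):
    """Grow the window by one step along either dimension, pruning by shape."""
    (x0, x1), (y0, y1) = win
    out = []
    if x1 + 1 < GRID and x1 + 2 - x0 <= MAX_W:
        out.append(((x0, x1 + 1), (y0, y1)))
    if y1 + 1 < GRID and y1 + 2 - y0 <= MAX_H:
        out.append(((x0, x1), (y0, y1 + 1)))
    return out

def search():
    seeds = [((x, x), (y, y)) for x in range(GRID) for y in range(GRID)]
    heap = [(-utility(w), w) for w in seeds]
    heapq.heapify(heap)
    seen, results, prefetch = set(), [], 1
    while heap:
        _, win = heapq.heappop(heap)           # most promising window first
        if win in seen:
            continue
        seen.add(win)
        vals = read_window(win, prefetch)
        if len(vals) > 1 and sum(vals) / len(vals) > THRESHOLD:
            results.append(win)                # report online; keep reads small
            prefetch = 1
        else:
            prefetch = min(prefetch * 2, 8)    # no new result: prefetch more
        for ext in extensions(win):
            # stale priorities are fine for a sketch; the real solver updates
            # utilities dynamically as cached aggregates change
            heapq.heappush(heap, (-utility(ext), ext))
    return results

if __name__ == "__main__":
    found = search()
    print(len(found), "windows found,", disk_reads, "cells read from disk")

Running it prints how many windows satisfied the condition and how many cells were fetched from "disk"; lowering the prefetch cap reduces upfront I/O at the price of more re-reads, which is exactly the online-vs-total tension measured on the next slide.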
Online vs. Total Performance Results
• 35GB data set (part of the SDSS)
• 4GB total memory (1GB shared buffer)
• First results in 10-20 seconds
(Figure: time in seconds vs. % of results returned, comparing static and adaptive prefetching against total PostgreSQL time)

Distributed Semantic Windows Architecture
(Figure: a coordinator above several workers; each worker runs a query executor, functions/estimations, and a window processor over cell data, with a data manager on top of a local DBMS)
• Coordinator
  • Starts workers
  • Collects results
• Data overlap
  • Windows belong to multiple partitions
  • Workers exchange cells
• Asynchronous communication
  • Workers request data
  • No blocking
  • Small overhead
(A minimal sketch of this non-blocking cell exchange appears at the end of the deck.)

Data Overlap in Distributed Search
(Figure: two charts of time in seconds for the first result, all results, and total time, comparing 4 nodes and 8 nodes with no overlap vs. full overlap)

Other Experiments (from the paper)
• Data layout: window-locality vs. disk-locality
  • Hilbert ordering
  • Index-based clustering
  • Sorting by an axis
• Controlling the aggressiveness of prefetching
  • Users can control the prefetch size
  • Smaller result delays vs. total completion time
• Sampling
  • Stratified vs. uniform

Related Work
• OLAP cubes
  • Grid-based aggregation, no exploration
• Online Aggregation (Hellerstein et al.)
  • Approximation, exact result at the end
• Online skylines (Rundensteiner et al.)
  • Careful input/output space analysis to determine candidates
  • Difficult for Semantic Windows: dimensional vs. measurement attributes
• Big data systems (SciBORQ, BlinkDB, etc.)
  • Approximate query answering via sampling

Conclusion and Future Work
• New data exploration framework – Semantic Windows
  • Cost-aware solver
  • Adaptive prefetching to address data layout issues
  • Distributed computation
• What is next?
  • Constraint Programming (CP) can perform exploration
  • DBMS can store and manage data
  • CP + DBMS = Searchlight

Questions?
Supported by:
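As an appendix to the "Distributed Semantic Windows Architecture" slide, here is a minimal Python sketch of the asynchronous, non-blocking cell exchange: a coordinator partitions the cells among workers and collects their results, and a worker whose window overlaps another partition sends a cell request and keeps working instead of blocking on the answer. A round-robin scheduler over in-memory inboxes stands in for real nodes and the network, and the class names, message formats, and toy ownership rule are all illustrative, not the system's actual protocol.

from collections import deque

class Worker:
    def __init__(self, wid, owned_cells):
        self.wid = wid
        self.cells = dict(owned_cells)   # cells stored at this worker
        self.inbox = deque()             # asynchronous messages from peers
        self.todo = deque()              # windows assigned by the coordinator
        self.waiting = {}                # window -> set of cells still missing
        self.results = []

    def step(self, workers):
        """One scheduling step: handle one message or evaluate one window."""
        if self.inbox:
            kind, *payload = self.inbox.popleft()
            if kind == "request":                    # a peer needs one of our cells
                cell, requester = payload
                workers[requester].inbox.append(("reply", cell, self.cells[cell]))
            else:                                    # a reply to one of our requests
                cell, value = payload
                self.cells[cell] = value
                for win, missing in list(self.waiting.items()):
                    missing.discard(cell)
                    if not missing:                  # all remote cells have arrived
                        self._evaluate(win)
                        del self.waiting[win]
        elif self.todo:
            win = self.todo.popleft()
            missing = {c for c in win if c not in self.cells}
            if missing:
                for cell in missing:                 # ask the owner; do not block
                    owner = cell[0] % len(workers)   # toy ownership rule
                    workers[owner].inbox.append(("request", cell, self.wid))
                self.waiting[win] = missing
            else:
                self._evaluate(win)

    def _evaluate(self, win):
        vals = [self.cells[c] for c in win]
        if sum(vals) / len(vals) > 0.8:              # the content-based condition
            self.results.append(win)

    def idle(self):
        return not (self.inbox or self.todo or self.waiting)

def coordinator(num_workers, grid, windows):
    """Start workers, assign windows, run until all are idle, collect results."""
    workers = [Worker(w, {c: v for c, v in grid.items() if c[0] % num_workers == w})
               for w in range(num_workers)]
    for i, win in enumerate(windows):
        workers[i % num_workers].todo.append(tuple(win))
    while not all(w.idle() for w in workers):
        for w in workers:                            # round-robin stands in for the network
            w.step(workers)
    return [win for w in workers for win in w.results]

if __name__ == "__main__":
    # A tiny 4x4 grid split between 2 workers by column parity; the second
    # window straddles the partition boundary and triggers a cell exchange.
    grid = {(x, y): 0.9 if y < 2 else 0.5 for x in range(4) for y in range(4)}
    windows = [[(0, 0), (2, 0)],   # both cells owned by worker 0
               [(1, 1), (2, 1)]]   # one cell at worker 1, one at worker 0
    print(coordinator(2, grid, windows))

In the example at the bottom, the second window spans both partitions, so worker 1 sends a request for the missing cell, continues with its other work, and evaluates the window only once worker 0's reply arrives, mirroring the no-blocking, small-overhead exchange described on the slide.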