Interactive Data Exploration Using Semantic Windows
Alexander Kalinin, Ugur Cetintemel, Stan Zdonik

Interactive Data Exploration (IDE)
(Figure: "Where's Waldo?" vs. "Where's Horrible Gelatinous Blob?")
Searching for "interesting stuff" within big data
• Exploratory analysis: ad-hoc & repetitive
  • Questions are not well defined
  • "Interesting" can be complex
• Human-in-the-loop operation
  • Fast, online results
  • Query refinement

Exploratory Queries: An SDSS example
• Searching for regions of interest
  • "Celestial 3-5° by 5-7° rectangular regions with average brightness > 0.8"
• Shape-based conditions
  • "3-5° by 5-7° regions"
• Content-based conditions
  • "average brightness > 0.8"
(Figure: such regions, labeled Semantic Windows, on a Sloan Digital Sky Survey (SDSS) image)

"Celestial 3-5° by 5-7° regions with average brightness > 0.8" in SQL
1. Divide the data into cells
2. Enumerate all regions
3. Final filtering (> 0.8)

No native support for exploratory constructs!
• SQL queries
  • No power set
  • GROUP BY – no overlaps
  • OVER – too restrictive
• Performance problems
  • Large CPU overhead
  • Hard to optimize
  • No interactivity

SQL/SW Extensions for Data Exploration

SELECT lb(ra), rb(ra), lb(dec), rb(dec), avg(brightness)
FROM sdss
GRID BY ra BETWEEN 100 AND 300 STEP 1
        dec BETWEEN 5 AND 40 STEP 1
HAVING avg(brightness) > 0.8
   AND size(ra) = 3
   AND size(dec) >= 1 AND size(dec) <= 3

(Figure: the ra/dec cell grid defined by the GRID BY clause)

Search Process Outline
1. Dynamically enumerate windows (subject to pruning)
2. Study them in order of utility
3. Output the windows satisfying the conditions
Focus is on online results!

Enumerating Windows
(Figure: a window growing one cell at a time over a numbered grid)
Extension:
• Any dimension
• One step

Cost-aware Solver
• Best-first search based on the utility
  • Utility = f(benefit, cost)
• Benefit – how close a window is to satisfying the conditions
  • Computed for the aggregates from the content-based conditions
  • A distance between the required value and the estimated value
• Cost – how expensive it is to read a window from disk
  • Measured in the number of cells we have to read
  • Adjustments are made for skewed data

Best-first Search
(Figures: two steps of the search; the priority queue, ordered by utility, determines which window is read and extended next)

Optimizations
• Cost and benefit are estimated by sampling
  • Uniform – sample the whole search space
  • Stratified – sample each cell uniformly
• Aggregate values are cached in a cell cache
  • Dynamic utility updates
  • Avoids re-reading the same cells
• Constraint-based pruning during the search

Pruning
Shape-based conditions:
• Size > 1
• Shape is ? x 2
(Figure: candidate extensions eliminated by the shape-based conditions)

Prefetching
• Problem: small reads
  • Help online results
  • Hurt total performance
• Window-locality vs. disk-locality
  • Poor disk page utilization
  • Thrashing: reading the same pages multiple times
• Prefetching: read a neighborhood with every window
  • Larger reads, fewer of them
  • Better disk page utilization
(Figure: cell read order without prefetching vs. with prefetching)

Adaptive Prefetching
• How much to prefetch?
  • Large reads might hurt online results
• Progress-driven scheme (sketched below):
  • Finding new results? Prefetch a small amount
  • No new results? Increase the prefetch exponentially
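Below is a minimal, self-contained Python sketch of the search loop the preceding slides describe: candidate windows sit in a priority queue ordered by utility, are extended one grid step at a time, are pruned by the shape-based conditions, and every window read goes through a cell cache that also prefetches a surrounding neighborhood whose size follows the progress-driven rule. All names, the toy grid, and the 0.01 cost weight are illustrative rather than the system's actual API, and benefit/cost are computed from cached values or a fixed default instead of the sampling-based estimates the real solver uses.

import heapq
import itertools
import random

random.seed(0)
GRID = 30                                      # cells per dimension (toy grid)
DISK = {(x, y): random.random()                # avg brightness per cell "on disk"
        for x in range(GRID) for y in range(GRID)}

THRESHOLD = 0.8                                # content condition: avg(brightness) > 0.8
MAX_W, MAX_H = 3, 3                            # shape condition: at most 3 x 3 cells

cell_cache = {}                                # cells already read from "disk"
disk_reads = 0

def read_window(win, prefetch):
    """Read a window's cells, prefetching a prefetch-cell border around it."""
    global disk_reads
    (x0, x1), (y0, y1) = win
    lo_x, hi_x = max(0, x0 - prefetch), min(GRID - 1, x1 + prefetch)
    lo_y, hi_y = max(0, y0 - prefetch), min(GRID - 1, y1 + prefetch)
    for c in itertools.product(range(lo_x, hi_x + 1), range(lo_y, hi_y + 1)):
        if c not in cell_cache:
            cell_cache[c] = DISK[c]
            disk_reads += 1
    return [cell_cache[(x, y)] for x in range(x0, x1 + 1) for y in range(y0, y1 + 1)]

def utility(win):
    """Utility = f(benefit, cost): distance to the threshold minus a read-cost penalty."""
    (x0, x1), (y0, y1) = win
    cells = [(x, y) for x in range(x0, x1 + 1) for y in range(y0, y1 + 1)]
    est = sum(cell_cache.get(c, 0.5) for c in cells) / len(cells)  # crude estimate
    benefit = -max(0.0, THRESHOLD - est)       # 0 once the condition looks satisfied
    cost = sum(1 for c in cells if c not in cell_cache)
    return benefit - 0.01 * cost

def extensions(win):
    """Grow the window by one step along either dimension, pruning by shape."""
    (x0, x1), (y0, y1) = win
    out = []
    if x1 + 1 < GRID and x1 + 2 - x0 <= MAX_W:
        out.append(((x0, x1 + 1), (y0, y1)))
    if y1 + 1 < GRID and y1 + 2 - y0 <= MAX_H:
        out.append(((x0, x1), (y0, y1 + 1)))
    return out

def search():
    seeds = [((x, x), (y, y)) for x in range(GRID) for y in range(GRID)]
    heap = [(-utility(w), w) for w in seeds]
    heapq.heapify(heap)
    seen, results, prefetch = set(), [], 1
    while heap:
        _, win = heapq.heappop(heap)           # most promising window first
        if win in seen:
            continue
        seen.add(win)
        vals = read_window(win, prefetch)
        if len(vals) > 1 and sum(vals) / len(vals) > THRESHOLD:
            results.append(win)                # report online; keep reads small
            prefetch = 1
        else:
            prefetch = min(prefetch * 2, 8)    # no new result: prefetch more
        for ext in extensions(win):
            # stale priorities are fine for a sketch; the real solver updates
            # utilities dynamically as cached aggregates change
            heapq.heappush(heap, (-utility(ext), ext))
    return results

if __name__ == "__main__":
    found = search()
    print(len(found), "windows found,", disk_reads, "cells read from disk")

Running it prints how many windows satisfied the condition and how many cells were fetched from "disk"; lowering the prefetch cap reduces upfront I/O at the price of more re-reads, which is exactly the online-vs-total tension measured on the next slide.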
Online vs. Total Performance Results
• 35GB data set (part of the SDSS)
• 4GB total memory (1GB shared buffer)
• First results in 10-20 seconds
(Figure: time in seconds vs. % of results returned, comparing static and adaptive prefetching against total PostgreSQL time)

Distributed Semantic Windows Architecture
(Figure: a coordinator above several workers; each worker runs a query executor, functions/estimations, and a window processor over cell data, with a data manager on top of a local DBMS)
• Coordinator
  • Starts workers
  • Collects results
• Data overlap
  • Windows belong to multiple partitions
  • Workers exchange cells
• Asynchronous communication
  • Workers request data
  • No blocking
  • Small overhead
(A minimal sketch of this non-blocking cell exchange appears at the end of the deck.)

Data Overlap in Distributed Search
(Figure: two charts of time in seconds for the first result, all results, and total time, comparing 4 nodes and 8 nodes with no overlap vs. full overlap)

Other Experiments (from the paper)
• Data layout: window-locality vs. disk-locality
  • Hilbert ordering
  • Index-based clustering
  • Sorting by an axis
• Controlling the aggressiveness of prefetching
  • Users can control the prefetch size
  • Smaller result delays vs. total completion time
• Sampling
  • Stratified vs. uniform

Related Work
• OLAP cubes
  • Grid-based aggregation, no exploration
• Online Aggregation (Hellerstein et al.)
  • Approximation, exact result at the end
• Online skylines (Rundensteiner et al.)
  • Careful input/output space analysis to determine candidates
  • Difficult for Semantic Windows: dimensional vs. measurement attributes
• Big data systems (SciBORQ, BlinkDB, etc.)
  • Approximate query answering via sampling

Conclusion and Future Work
• New data exploration framework – Semantic Windows
  • Cost-aware solver
  • Adaptive prefetching to address data layout issues
  • Distributed computation
• What is next?
  • Constraint Programming (CP) can perform exploration
  • DBMS can store and manage data
  • CP + DBMS = Searchlight

Questions?
Supported by:
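As an appendix to the "Distributed Semantic Windows Architecture" slide, here is a minimal Python sketch of the asynchronous, non-blocking cell exchange: a coordinator partitions the cells among workers and collects their results, and a worker whose window overlaps another partition sends a cell request and keeps working instead of blocking on the answer. A round-robin scheduler over in-memory inboxes stands in for real nodes and the network, and the class names, message formats, and toy ownership rule are all illustrative, not the system's actual protocol.

from collections import deque

class Worker:
    def __init__(self, wid, owned_cells):
        self.wid = wid
        self.cells = dict(owned_cells)   # cells stored at this worker
        self.inbox = deque()             # asynchronous messages from peers
        self.todo = deque()              # windows assigned by the coordinator
        self.waiting = {}                # window -> set of cells still missing
        self.results = []

    def step(self, workers):
        """One scheduling step: handle one message or evaluate one window."""
        if self.inbox:
            kind, *payload = self.inbox.popleft()
            if kind == "request":                    # a peer needs one of our cells
                cell, requester = payload
                workers[requester].inbox.append(("reply", cell, self.cells[cell]))
            else:                                    # a reply to one of our requests
                cell, value = payload
                self.cells[cell] = value
                for win, missing in list(self.waiting.items()):
                    missing.discard(cell)
                    if not missing:                  # all remote cells have arrived
                        self._evaluate(win)
                        del self.waiting[win]
        elif self.todo:
            win = self.todo.popleft()
            missing = {c for c in win if c not in self.cells}
            if missing:
                for cell in missing:                 # ask the owner; do not block
                    owner = cell[0] % len(workers)   # toy ownership rule
                    workers[owner].inbox.append(("request", cell, self.wid))
                self.waiting[win] = missing
            else:
                self._evaluate(win)

    def _evaluate(self, win):
        vals = [self.cells[c] for c in win]
        if sum(vals) / len(vals) > 0.8:              # the content-based condition
            self.results.append(win)

    def idle(self):
        return not (self.inbox or self.todo or self.waiting)

def coordinator(num_workers, grid, windows):
    """Start workers, assign windows, run until all are idle, collect results."""
    workers = [Worker(w, {c: v for c, v in grid.items() if c[0] % num_workers == w})
               for w in range(num_workers)]
    for i, win in enumerate(windows):
        workers[i % num_workers].todo.append(tuple(win))
    while not all(w.idle() for w in workers):
        for w in workers:                            # round-robin stands in for the network
            w.step(workers)
    return [win for w in workers for win in w.results]

if __name__ == "__main__":
    # A tiny 4x4 grid split between 2 workers by column parity; the second
    # window straddles the partition boundary and triggers a cell exchange.
    grid = {(x, y): 0.9 if y < 2 else 0.5 for x in range(4) for y in range(4)}
    windows = [[(0, 0), (2, 0)],   # both cells owned by worker 0
               [(1, 1), (2, 1)]]   # one cell at worker 1, one at worker 0
    print(coordinator(2, grid, windows))

In the example at the bottom, the second window spans both partitions, so worker 1 sends a request for the missing cell, continues with its other work, and evaluates the window only once worker 0's reply arrives, mirroring the no-blocking, small-overhead exchange described on the slide.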