Interactive Data Exploration Using Semantic Windows
Alexander Kalinin, Ugur Cetintemel, Stan Zdonik

Interactive Data Exploration (IDE)
Where’s Waldo?
Where’s Horrible Gelatinous Blob?
Searching for “interesting stuff” within big data
• Exploratory analysis: ad-hoc & repetitive
• Questions are not well defined
• “Interesting” can be complex
• Human-in-the-loop operation
• Fast, online results
• Query refinement
Exploratory Queries: An SDSS example
• Searching for regions of interest in the Sloan Digital Sky Survey (SDSS)
• “Celestial 3-5° by 5-7° rectangular regions with average brightness > 0.8”
• Shape-based conditions
• “3-5° by 5-7° regions”
• Content-based conditions
• “average brightness > 0.8”
• Together, these conditions define Semantic Windows
“Celestial 3-5° by 5-7° regions with average brightness > 0.8” in SQL
1. Divide the data into cells
2. Enumerate all regions
3. Final filtering (> 0.8)
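The three steps above can be sketched outside SQL as a brute-force loop. This is only an illustration of why the approach is expensive, not the authors' SQL formulation. The toy grid, the equal-sized-cells assumption, and the name `naive_windows` are hypothetical.

```python
from itertools import product

def naive_windows(cells, widths, heights, threshold):
    """Brute force: enumerate every rectangular window and filter at the end.

    cells[i][j] holds the average brightness of grid cell (i, j). All cells
    are assumed to contain the same number of objects, so a window average
    is the mean of its cell averages (an assumption of this sketch).
    """
    n_rows, n_cols = len(cells), len(cells[0])
    results = []
    # Step 2: enumerate all regions of every allowed shape and position.
    for h, w in product(heights, widths):
        for top in range(n_rows - h + 1):
            for left in range(n_cols - w + 1):
                values = [cells[i][j]
                          for i in range(top, top + h)
                          for j in range(left, left + w)]
                avg = sum(values) / len(values)
                # Step 3: final filtering on the content-based condition.
                if avg > threshold:
                    results.append((top, left, h, w, avg))
    return results

# Step 1 (dividing the data into cells) is assumed done: a toy 3x4 grid.
grid = [
    [0.9, 0.9, 0.1, 0.1],
    [0.9, 0.9, 0.1, 0.1],
    [0.1, 0.1, 0.1, 0.1],
]
found = naive_windows(grid, widths=[2], heights=[2], threshold=0.8)
```

Every window is computed in full before the filter is applied, so no result can be returned early and no candidate is skipped.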
No native support for exploratory constructs!
• SQL queries
• No power set
• GROUP BY – no overlaps
• OVER – too restrictive
• Performance problems
• Large CPU overhead
• Hard to optimize
• No interactivity
SQL/SW Extensions for Data Exploration
SELECT lb(ra), rb(ra), lb(dec), rb(dec),
       avg(brightness)
FROM sdss
GRID BY ra BETWEEN 100 AND 300 STEP 1
        dec BETWEEN 5 AND 40 STEP 1
HAVING avg(brightness) > 0.8 AND
       size(ra) = 3 AND
       size(dec) >= 1 AND
       size(dec) <= 3
[Figure: the search grid with axes ra and dec]
Search Process Outline
1. Dynamically enumerate windows (subject to pruning)
2. Study in order of utility
3. Output the windows satisfying the conditions
Focus is on online results!
Enumerating Windows
Extension:
• Any dimension
• One step
[Figure: windows numbered 1-4 on the grid]
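A minimal sketch of enumeration by extension, assuming a window is stored as one (low, high) cell range per dimension and may grow by one step in either direction. The representation and the function names are hypothetical. The code enumerates the windows reachable from one starting cell; a full search would start from every cell.

```python
from collections import deque

def extensions(window, grid_shape):
    """Yield windows obtained by growing `window` one step in one dimension.

    A window is a tuple of (low, high) cell ranges, one per dimension, with
    `high` exclusive. Growth in either direction is an assumption here.
    """
    for d, (lo, hi) in enumerate(window):
        if lo > 0:
            yield window[:d] + ((lo - 1, hi),) + window[d + 1:]
        if hi < grid_shape[d]:
            yield window[:d] + ((lo, hi + 1),) + window[d + 1:]

def enumerate_windows(grid_shape, start):
    """Enumerate every window reachable from `start`, each exactly once."""
    seen = {start}
    queue = deque([start])
    while queue:
        window = queue.popleft()
        yield window
        for nxt in extensions(window, grid_shape):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)

# On a 2x2 grid, four windows contain the top-left cell.
all_windows = list(enumerate_windows((2, 2), ((0, 1), (0, 1))))
```

The `seen` set matters because the same window can be reached through different sequences of extensions.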
Cost-aware Solver
• Best-first search based on the utility
• Utility = f(benefit, cost)
• Benefit – how close a window is to satisfying the conditions
• Computed for the aggregates from content-based conditions
• A distance between the required value and the estimated value
• Cost – how expensive it is to read a window from disk
• Measured in cells we have to read
• Adjustments are made for skewed data
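The utility-ordered exploration described above can be sketched with a priority queue. The slides do not give the form of f(benefit, cost), so the weighted sum below, the `weight` parameter, and the normalization of cost by `max_cells` are assumptions made for illustration only.

```python
import heapq

def benefit(estimated, required):
    """Closeness of the estimated aggregate to the required value, in [0, 1].

    Assumes a '>' condition and aggregate values within [0, 1].
    """
    return 1.0 if estimated > required else 1.0 - (required - estimated)

def utility(estimated, required, cells_to_read, max_cells, weight=0.5):
    """Hypothetical f(benefit, cost): weighted sum of benefit and cheapness."""
    cost = cells_to_read / max_cells          # cost measured in cells to read
    return weight * benefit(estimated, required) + (1 - weight) * (1 - cost)

def explore_in_utility_order(candidates, required, max_cells):
    """Return candidate names in order of decreasing utility.

    candidates: iterable of (name, estimated_avg, cells_to_read).
    """
    heap = []
    for name, est, cells in candidates:
        # heapq is a min-heap, so push the negated utility.
        heapq.heappush(heap, (-utility(est, required, cells, max_cells), name))
    order = []
    while heap:
        _, name = heapq.heappop(heap)
        order.append(name)
    return order

order = explore_in_utility_order(
    [("a", 0.85, 4), ("b", 0.50, 1), ("c", 0.85, 1)],
    required=0.8, max_cells=4,
)
```

Here window "c" comes first because it is both promising and cheap, while the equally promising but expensive window "a" comes last.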
Best-first Search
Priority Queue (utility-ordered)
[Figure: windows 1-4 on the grid and the priority queue holding them with utilities 0.98, 0.85, 0.80, and 0.79; the window with the highest utility is explored first]
Optimizations
• Cost and benefit are estimated by sampling
• Uniform – sample the whole search space
• Stratified – sample each cell uniformly
• Aggregate values are cached in a cell cache
• Dynamic utility updates
• Avoiding re-reads of the same cells
• Constraint-based pruning during the search
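Two of the optimizations above, stratified sampling and the cell cache, can be sketched as follows. This is a simplified illustration under stated assumptions: data lives in an in-memory dictionary of non-empty cells, samples are drawn on the fly rather than precomputed, and the names `CellCache`, `window_avg`, and `stratified_estimate` are hypothetical.

```python
import random

class CellCache:
    """Caches exact per-cell (sum, count) so a cell is read at most once."""

    def __init__(self, read_cell):
        self._read_cell = read_cell   # function: cell id -> list of values
        self._cache = {}
        self.reads = 0

    def aggregate(self, cell):
        if cell not in self._cache:
            values = self._read_cell(cell)
            self.reads += 1
            self._cache[cell] = (sum(values), len(values))
        return self._cache[cell]

def window_avg(cache, cells):
    """Exact window average computed from cached cell aggregates."""
    total = count = 0
    for cell in cells:
        s, n = cache.aggregate(cell)
        total += s
        count += n
    return total / count

def stratified_estimate(data, cells, per_cell, rng):
    """Estimate a window average from `per_cell` samples drawn in every cell."""
    total = weight = 0
    for cell in cells:
        values = data[cell]
        sample = rng.sample(values, min(per_cell, len(values)))
        # Weight each cell's sample mean by the cell's true size.
        total += (sum(sample) / len(sample)) * len(values)
        weight += len(values)
    return total / weight

data = {0: [0.9, 0.9, 0.9], 1: [0.1, 0.1, 0.1], 2: [0.5, 0.5, 0.5]}
cache = CellCache(lambda cell: data[cell])
exact_a = window_avg(cache, [0, 1])      # reads cells 0 and 1
exact_b = window_avg(cache, [1, 2])      # cell 1 comes from the cache
estimate = stratified_estimate(data, [0, 1], per_cell=2, rng=random.Random(0))
```

Two overlapping windows cover four cell visits but trigger only three reads, which is the effect the cell cache is meant to achieve.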
Pruning
Shape-based conditions:
• Size > 1
• Shape is ? x 2
[Figure: windows numbered 1-4 on the grid]
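One way shape-based pruning could work is sketched below. It assumes windows only grow by extension, so a window that already exceeds a maximum size can be discarded together with all of its extensions. The window representation, the function names, and the limits (taken from the size(ra) and size(dec) conditions of the SQL/SW example) are illustrative assumptions, not the solver's actual pruning rules.

```python
def can_prune(window, max_size):
    """True if the window already exceeds a maximum size in some dimension.

    Assumes windows only grow by extension, so every extension of such a
    window violates the same shape-based condition and can be skipped too.
    """
    return any(hi - lo > limit for (lo, hi), limit in zip(window, max_size))

def satisfies_shape(window, min_size, max_size):
    """True if every dimension's size lies within [min_size, max_size]."""
    return all(lo_lim <= hi - lo <= hi_lim
               for (lo, hi), lo_lim, hi_lim in zip(window, min_size, max_size))

# size(ra) = 3, 1 <= size(dec) <= 3; windows are ((ra_lo, ra_hi), (dec_lo, dec_hi)).
min_size, max_size = (3, 1), (3, 3)
too_small = ((0, 2), (0, 1))   # not a result yet, but still extendable
too_large = ((0, 4), (0, 1))   # can never become a result
valid = ((0, 3), (0, 2))
```

A window that is too small is kept in the search because an extension may still satisfy the conditions; only a window that is too large is pruned.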
Prefetching
• Problem: small reads
• Help online results
• Hurt total performance
• Window-locality vs. disk-locality
• Poor disk page utilization
• Thrashing: reading the same pages multiple times
• Prefetching: read a neighborhood with every window
• Fewer, larger reads
• Better disk page utilization
[Figure: read order of windows 1-4 in two panels, "No prefetching" and "With prefetching"]
Adaptive Prefetching
• How much to prefetch?
• Large reads might hurt online results
• Progress-driven scheme:
• Finding new results? Prefetch a small amount
• No new results? Increase the prefetch exponentially
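The progress-driven scheme can be sketched as a small state machine. The slides specify only "a small amount" and "exponentially", so the reset to an initial size, the growth factor of 2, and the upper cap are assumptions, and the class name `AdaptivePrefetcher` is hypothetical.

```python
class AdaptivePrefetcher:
    """Progress-driven prefetch size: reset on new results, else grow."""

    def __init__(self, initial=1, factor=2, maximum=64):
        self.initial = initial    # small prefetch used while results keep coming
        self.factor = factor      # exponential growth factor (assumed)
        self.maximum = maximum    # cap on the prefetch size (assumed)
        self.size = initial

    def update(self, found_new_result):
        """Return the prefetch size to use for the next window read."""
        if found_new_result:
            self.size = self.initial
        else:
            self.size = min(self.size * self.factor, self.maximum)
        return self.size

prefetcher = AdaptivePrefetcher(initial=1, factor=2, maximum=8)
sizes = [prefetcher.update(hit)
         for hit in [False, False, False, False, True, False]]
```

With these settings the prefetch size grows 2, 4, 8, stays at the cap of 8, drops back to 1 when a result is found, and then starts growing again.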
Online vs. Total Performance Results
• 35GB data set (part of the SDSS)
• 4GB total memory (1GB shared buffer)
• First results in 10-20 seconds
[Figure: time (s, 0 to 6000) vs. % of results returned (20% to 100%); series: Static, Adaptive, PostgreSQL total]
Distributed Semantic Windows Architecture
[Figure: architecture diagram with a Coordinator and three Workers; component labels: Query Executor, Functions/Estimations, Window Processor, Cell Data, Data Manager, DBMS]
• Coordinator
• Starts workers
• Collects results
• Data Overlap
• Windows belong to multiple partitions
• Workers exchange cells
• Asynchronous communication
• Workers request data
• No blocking
• Small overhead
Data Overlap in Distributed Search
[Figure: two bar charts of time (s) for First Result, All Results, and Total Time; one compares 4 nodes with no overlap vs. full overlap, the other 8 nodes with no overlap vs. full overlap]
Other Experiments (from the paper)
• Data layout: window-locality vs. disk-locality
• Hilbert ordering
• Index-based clustering
• Sorting by an axis
• Controlling the aggressiveness of prefetching
• Users can control the size of prefetching
• Smaller result delays vs. total completion time
• Sampling
• Stratified vs. uniform
Related Work
• OLAP cubes
• Grid-based aggregation, no exploration
• Online Aggregation (Hellerstein et al.)
• Approximation, exact result at the end
• Online skylines (Rundensteiner et al.)
• Careful input/output space analysis to determine candidates
• Difficult for Semantic Windows: dimensional vs. measurement attributes
• Big data systems (SciBORQ, BlinkDB, etc.)
• Approximate query answering via sampling
Conclusion and Future Work
• New data exploration framework – Semantic Windows
• Cost-aware solver
• Adaptive prefetching to address data layout issues
• Distributed computation
• What is next?
• Constraint Programming (CP) can perform exploration
• DBMS can store and manage data
• CP + DBMS = Searchlight
Questions?
Supported by: