Powerpoint Slides - University of California, Irvine

advertisement
Quality Aware Sensor Database
(QUASAR) Project**
Sharad Mehrotra
Department of Information and
Computer Science
University of California, Irvine
**Supported in part by a collaborative NSF ITR grant entitled “real-time data capture, analysis, and
querying of dynamic spatio-temporal events” in collaboration with UCLA, U. Maryland, U. Chicago
UCI Database Group
Talk Outline
• Quasar Project
–
–
–
–
–
motivation and background
data collection and archival components
query processing
tracking application using QUASAR framework
challenges and ongoing work
• Brief overview of other research projects
– MARS Project - incorporating similarity retrieval and
refinement over structured and semi-structured data to aid
interactive data analysis/mining
– Database as a Service (DAS) Project - supporting the
application service provider model for data management
UCI Database Group
Emerging Computing
Infrastructure…
In-body, in-cell, in-vitro spaces
• Generational advances to
computing infrastructure
– sensors will be everywhere
• Emerging applications with
limitless possibilities
Instrumented
wide-area spaces
– real-time monitoring and
control, analysis
• New challenges
Roadsi de
Bas e stati on
To the fixed
Infrastruc ture
(Intern et)
Ad hoc (802.11) link
Cel lular (CDPD?) li nk
Roadsi de
Bas e stati on
To the fixed
Infrastruc ture
(Intern et)
Ad hoc (802.11) link
Cel lular (CDPD?) li nk
Immediate vici nity
area boundary
(single-hop)
Immediate vici nity
area boundary
(single-hop)
– limited bandwidth & energy
– highly dynamic systems
• System architectures are
due for an overhaul
– at all levels of the system OS,
middleware, databases,
applications
UCI Database Group
Impact to Data Management …
Data/query request
Data producers
server
Data/query result
client
• Traditional data management
–
–
–
–
client-server architecture
efficient approaches to data storage & querying
query shipping versus data shipping
data changes with explicit update
• Emerging Challenge
– data producers must be considered as “first class”
entities
• sensors generate continuously changing highly dynamic
data
• sensors may store, process, and communicate data
UCI Database Group
Data Management Architecture
Issues
producer
cache
Data/query request
Data/query result
Data producers
client
server
• Where to store data?
– Do not store -- stream model
• not suitable if we wish to archive data for future analysis
or if data is too important to lose
– at the producers
• limited storage, network, compute resources
– at the servers
• server may not be able to cope with high data production
rates. May lead to data staleness and/or wasted
resources
• Where to compute?
– At the client, server, data producers
UCI Database Group
Quasar Architecture
• Hierarchical architecture
client
server
data flow
Query flow
Client cache
Server cache
& archive
producer
cache
producer
UCI Database Group
– data flows from producers to
server to clients periodically
– queries flow the other way:
• If client cache does not suffices,
then
• query routed to appropriate
server
• If server cache does not suffice,
then access current data at
producer
– This is a logical architecture-producers could also be
clients.
Quasar: Observations & Approach
• Applications can tolerate errors in sensor data
– applications may not require exact answers:
•
small errors in location during tracking or error in answer to
query result may be OK
– data cannot be precise due to measurement errors,
transmission delays, etc.
• Communication is the dominant cost
– limited wireless bandwidth, source of major energy drain
• Quasar Approach
– exploit application error tolerance to reduce communication
between producer and server
– Two approaches
• Minimize resource usage given quality constraints
• Maximize quality given resource constraints
UCI Database Group
Quality-based Data Collection
Problem
Sensor time series
…p[n], p[n-1], …, p[1]
• Let P = < p[1], p[2], …, p[n] > be a sequence of environmental
measurements (time series) generated by the producer, where n
= now
• Let S = <s[1], s[2], …, s[n]> be the server side representation of
the sequence
• A within- quality data collection protocol guarantees that
for all i
error(p[i], s[i]) < 
•  is derived from application quality tolerance
UCI Database Group
Simple Data Collection Protocol
Sensor time series
…p[n], p[n-1], …, p[1]
• sensor Logic (at time step n)
Let p’ = last value sent to server
if error(p[n], p’) > 
send p[n] to server
• server logic (at time step n)
If new update p[n] received at step n
s[n] = p[n]
Else
s[n] = last update sent by sensor
– guarantees maximum error at server less than equal to 
UCI Database Group
Exploiting Prediction Models
• Producer and server agree upon a prediction model (M, )
• Let spred[i] be the predicted value at time i based on (M, )
• sensor Logic (at time step n)
if error(p[n], spred[n] ) > 
send p[n] to server
• server logic (at time step n)
•
If new update p[n] received at step n
s[n] = p[n]
Else
s[n] = spred[n] based on model (M, )
UCI Database Group
Challenges in Prediction
• Simple versus complex models?
• Complex and more accurate models require more parameters
(that will need to be transmitted).
• Goal is to minimize communication not necessarily best
prediction
• How is a model M generated?
• static -- one out of a fixed set of models
• dynamic -- dynamically learn a model from data
• When should a model M or parameters  be changed?
• immediately on model violation:
– too aggressive -- violation may be a temporary
phenomena
• never changed:
– too conservative -- data rarely follows a single model
UCI Database Group
Challenges in Prediction (cont.)
• who does the model update?
• Server
– Long-haul prediction models possible, since server
maintains history
– might not predict recent behavior well since server does
not know exact S sequence; server has only samples
– extra communication to inform the producer
• Producer
– better knowledge of recent history
– long haul models not feasible since producer does not
have history
– producers share computation load
• Both
UCI Database Group
– server looks for new models, sensor performs parameter
fitting given existing models.
Archiving Sensor Data
• Often sensor-based applications are built with only the real-time
utility of time series data.
– Values at time instants <<n are discarded.
• Archiving such data consists of maintaining the entire S
sequence, or an approximation thereof.
• Importance of archiving:
– Discovering large-scale patterns
– Once-only phenomena, e.g., earthquakes
– Discovering “events” detected post facto by “rewinding” the time
series
– Future usage of data which may be not known while it is being
collected
UCI Database Group
Problem Formulation
• Let P = < p[1], p[2], …, p[n] > be the sensor time series
• Let S = < s[1], s[2], …, s[n] > be the server side representation
• A within archive quality data archival protocol guarantees that
error(p[i], s[i]) < archive
• Trivial Solution: modify collection protocol to collect data at
quality guarantee of min(archive , collect)
– then prediction model by itself will provide a archive quality data
stream that can be archived.
• Better solutions possible since
– archived data not needed for immediate access by real-time or
forecasting applications (such as monitoring, tracking)
– compression can be used to reduce data transfer
UCI Database Group
Data Archival Protocol
Sensor updates for
data collection
…p[n], p[n-1],
..
Compressed representation
for archiving
compress
Sensor memory buffer
processing at sensor exploited to reduce
communication cost and hence battery drain
• Sensors compresses observed time series p[1:n] and sends a
lossy compression to the server
• At time n :
– p[1:n-nlag] is at the server in compressed form s’ [1:n-nlag] withinarchive
– s[n-nlag+1:n] is estimated via a predictive model (M, )
• collection protocol guarantees that this remains within- collect
– s[n+1:] can be predicted but its quality is not guaranteed
(because it is in the future and thus the sensor has not observed
these values)
UCI Database Group
Piecewise Constant Approximation
(PCA)
• Given a time series Sn = s[1:n] a piecewise constant approximation
of it is a sequence
PCA(Sn) = < (ci, ei) >
that allows us to estimate s[j] as:
scapt [j] = ci
if j in [ei-1+1, ei]
= c1
if j<e1
Value
c1
c3
c2
UCI Database Group
e1
e2
c4
Time
e3
e4
Online Compression using PCA
•
Goal: Given stream of sensor values, generate a within-archive PCA
representation of a time series
•
Approach (PMC-midrange)
– Maintain m, M as the minimum/maximum values of observed samples
since last segment
– On processing p[n], update m and M if needed
• if M - m > 2archive , output a segment ((m+M )/2, n)
6
Value
Example: archive =
1.5
4
3
2.5
2
Time
1
UCI Database Group
2
3
4
5
Online Compression using PCA
• PMC-MR …
– guarantees that each segment compresses the corresponding
time series segment to within-archive
– requires O(1) storage
– is instance optimal
• no other PCA representation with fewer segments can meet the
within-archive constraint
• Variant of PMC-MR
– PMC-MEAN, which takes the mean of the samples seen thus far
instead of mid range.
UCI Database Group
Improving PMC using Prediction
• Observation: Prediction models guarantee a within- collect version of
the time series at server even before the compressed time series
arrives from the producer.
• Can the prediction model be exploited to reduce the overhead of
compression.
– If archive> collect no additional effort is required for archival --> simply
archive the predicted model.
• Approach:
– Define an error time series E[i] = p[i]-spred[i]
– Compress E[1:n] to within-archive instead of compressing p[1:n]
– The archive contains the prediction parameters and the compressed
error time series
– Within-archive of E[I] + (M,
archive version of p
UCI Database Group
) can be used to reconstruct a within-
Combing Compression and Prediction
(Example)
25
30
25
Predicted Time
Series
20
15
20
Compressed Time
Series
15
(7 segments)
Actual Time
Series
10
Actual Time
Series
10
5
5
0
0
-5
0
0
10
20
30
40
50
60
Actual – Predicted
0.5
0
-0.5
-1
-1.5
Compressed Error
-2.5
-3
-3.5
-4
UCI Database Group
20
Error =
1
-2
10
-5
(2 segments)
30
40
50
60
Estimating Time Series Values
• Historical samples (before n-nlag) is maintained at the server withinarchive
• Recent samples (between n-nlag+1 and n) is maintained by the
sensor and predicted at the server.
• If an application requires q precision, then:
– if q  collect then it must wait for  time in case a parameter refresh is en
route
– if q  archive but q < collect then it may probe the sensor or wait for a
compressed segment
– Otherwise only probing meets precision
• For future samples (after n) immediate probing not available as an
option
UCI Database Group
Experiments
• Data sets:
– Synthetic Random-Walk
• x[1] = 0 and x[i]=x[i-1]+sn where sn drawn uniformly from [-1,1]
– Oceanographic Buoy Data
• Environmental attributes (temperature, salinity, wind-speed, etc.) sampled at
10min intervals from a buoy in the Pacific Ocean (Tropical Atmosphere
Ocean Project, Pacific Marine Environment Laboratory)
– GPS data collected using IPAQs
• Experiments to test:
– Compression Performance of PMC
– Benefits of Model Selection
– Query Accuracy over Compressed Data
– Benefits of Prediction/Compression Combination
UCI Database Group
Compression Performance
K/n ratio: number of segments/number of samples
UCI Database Group
Query Performance Over Compressed
Data
“How many sensors have values >v?” (Mean selectivity = 50)
UCI Database Group
Impact of Model Selection
• Objects moved at
approximately
constant speed (+
measurement noise)
•Three models used:
• loc[n] = c
• loc[n] = c+vt
• loc[n] = c+vt+0.5at2
K/n ratio: number of segments/number of
samples. pred is the localization tolerance in
meters
UCI Database Group
•Parameters v, a were
estimated at sensor
over moving-window
of 5 samples
Combining Prediction with
Compression
K/n ratio: number of segments/number of samples
UCI Database Group
GPS Mobility Data from Mobile Clients (iPAQs)
QUASAR Client
Time Series
Latitude Time Series:
1800 samples
Compressed Time Series
(PMC-MR, ICDE 2003)
Accuracy of ~100 m
130 segments
UCI Database Group
Query Processing in Quasar
• Problem Definition
– Given
• sensor time series with quality-guarantees captured at
the server
• A query with a specified quality-tolerance
– Return
• query results incurring least cost
• Techniques depend upon
– nature of queries
– Cost measures
• resource consumption -- energy, communication, I/O
• query response time
UCI Database Group
Aggregate Queries
minQ = 2
8
7
maxQ = 7
2
6
Q
sumQ = 2+7+6 = 15
9
avgQ = 15/3 = 5
3
S
UCI Database Group
countQ = 3
Processing Aggregate Queries
(minimize producer probe)
Let S = <s1,s2, …,sn> be set of sensors that meet the query criteria
si.high = sipred[t] + jpred
sj.low = sipred[t] - jpred
• MIN Query
– c=
sn
s3
s2
minj(si.high)
– b = c - query
– Probe all sensors where sj.low < b
s1
a
b
c
• only s1 and s3 will be probed
5
3
• Sum Query
– select a minimal subset S’  S such that
si in S’ (jpred) >= si in S(jpred)- query
– If query = 15, only s1 will be probed
UCI Database Group
5
s4
s3
2
10
s5
s2
s1
Minimizing Cost at Server
• Error tolerance of queries can be exploited to reduce
processing at server.
• Key Idea
– Use a multi-resolution index structure (MRA-tree) for
processing aggregate queries at server.
– An MRA-Tree is a modified multi-dimensional index trees (RTree, quadtree, Hybrid tree, etc.)
– A non-leaf node contains (for each of its subtrees) four
aggregates {MIN,MAX,COUNT,SUM}
– A leaf node contains the actual data points (sensor models)
UCI Database Group
MRA Tree Data Structure
Spatial View
Tree Structure View
A
D
S1
S2
S3
S4
B
E
B
C
G
S5
F
A
C
S6
S7
D
F
G
S8
S1
UCI Database Group
E
S2
S3
S4
S5
S6
S7
S8
MRA-Tree Node Structure
Non-Leaf Node
Leaf Node
min
2
4
1
6
M1
M2
M3
max
4
5
2
6
1
2
3
count
3
2
4
1
sum
9
9
4
6
Probe “Pointers”
(each costs 2 messages)
Disk Page Pointers
UCI Database Group
(each costs 1 I/O)
Node Classification
• Two sets of nodes:
– NP (partial contribution to the query)
– NC (complete contribution)
is contained
Q
contains
N Q
N
partially overlaps
Q
N
UCI Database Group
disjoint
N
Q
Aggregate Queries using MRA Tree
• Initialize NP with the root
• At each iteration: Remove one node N from NP and for each
Nchild of its children
– discard, if Nchild disjoint with Q
– insert into NP if Q is contained or partially overlaps with Nchild
– “insert” into NC if Q contains Nchild (we only need to maintain
aggNC)
– compute the best estimate based on contents of NP and NC
N
UCI Database Group
Q
MIN (and MAX)
Interval
9
minNC = min { 4, 5 } = 4
minNP = min { 3, 9 } = 3
4
L = min {minNC, minNP} = 3
H = minNC = 4
5
hence, I = [3, 4]
Estimate
Traversal
Lower bound:
Choose N  NP:
E(minQ) = L = 3
minN = minNP
UCI Database Group
3
MRA Tree Traversal
• Progressive answer refinement until
NP is exhausted
• Greedy priority-based local decision
for next node to be explored based
on:
– Cost (1 I/O or 2 messages)
– Benefit (Expected Reduction in
answer uncertainty)
A
B
D
S1
C
E
S2
S3
UCI Database Group
F
S4
S5
G
S6
S7
S8
Adaptive Tracking of mobile objects in
sensor networks
Track visualization
object
Base station 1
Wireless link
Show me the approximate
track of the object with
precision 
Server
Wireless Sensor Grid
Base station 2
Base station 3
 Tracking Architecture
 A network of wireless acoustic sensors arranged as a grid transmitting via a
base station to server
 A track of the mobile object generated at the base station or server
 Objective
 Track a mobile object at the server such that the track deviates from the real
trajectory within a user defined error threshold track with minimum
communication overhead.
UCI Database Group
Sensor Model
 Wireless sensors : battery operated, energy constrained
 Operate on the received acoustic waveforms
 Signal attenuation of target object given by : Is(t) = P /4 r2
P : source object power
r= distance of object from sensor
Is(t) = intensity reading at time t at ith sensor
Ith : Intensity threshold at ith sensor
UCI Database Group
Sensor States
S2
Receive BS message
Ii < I th
Ii < I th
S0
S1
(Initial state)
Ii > I th
• S0 : Monitor ( processor on, sensor on, radio off )
– shift to S1 if intensity above threshold
• S1 : Active state ( processor on, sensor on, radio on)
– send intensity readings to base station.
– On receiving message from BS containing error tolerance shift to S2
• S2 : Quasi-active (processor on, sensor on, radio intermittent)
– send intensity reading to BS if error from previous reading exceeds
error threshold
Quasar Collection approach used in Quasi-active state
UCI Database Group
Server side protocol
 Server maintains:
 list of sensors in the active/ quasi-active state
 history of their intensity readings over a period of time
 Server Side Protocol
 convert track quality to a relative intensity error at sensors
 Send relative intensity error to sensor when sensor state =
S1( quasi- active state)
 Triangulate using n sensor readings at discrete time
intervals.
UCI Database Group
Basic Triangulation Algorithm (using 3 sensor
readings)
P: source object power, Ii = intensity reading at ith
sensor
(x1, y1)
(x2, y2)
(x-x1)2 + (y- y1)2 = P/4 I1
(x-x2)2 + (y- y2)2 = P/4 I2
(x, y)
(x-x3)2 + (y- y3)2 = P/4 I3
(x3, y3)
Solving we get (x, y)=f(x1,x2,x3,y1,y2,y3, P,I1, I2 , I3, )
 More complex approaches to amalgamate more than three sensor
readings possible
 Based on numerical methods -- do not provide a closed form equation
between sensor reading and tracking location !
Server can use simple triangulation to convert track quality to sensor
intensity quality tolerances and a more complex approach to track.
UCI Database Group
Adaptive Tracking : Mapping track quality to
sensor reading
Claim 1 (power constant)
 I1
Intensity
( I1 )
Let Ii be the intensity value of sensor
| Δ Ii | Ii ξ /(1  Iiξ )
If
then, track quality
is guaranteed to be within track
2
ti
time
t( i+1 )
2
/ C and C is a constant
where    track
derived from the known locations of the
sensors and the power of the object.
 I2
Intensity
( I2 )
ti
time
t( i+1 )
Claim 2 (power varies between [Pmin , Pmax ])
 I3
 If
Intensity
( I3 )
time
ti
Y (m)
UCI Database Group
then
t( i+1 )
 track
X (m)
2
Pmin 2  track
| I i | 2 [ I i
 I i Pmax ]
Pmax
C'
track quality is guaranteed to be within track
where C’ = C/ P2 and is a constant .
 The above constraint is a conservative
estimate. Better bounds possible
Adaptive Tracking: prediction to improve
performance
 Communication overhead further reduced by exploiting the predictability
of the object being tracked
Static Prediction : sensor & server agree on a set of prediction models
only 2 models used: stationary & constant velocity
Who Predicts: sensor based mobility prediction protocol
 Every sensor by default follows a stationary model
Based on its history readings may change to constant velocity model
(number of readings limited by sensor memory size)
informs server of model switch
UCI Database Group
Actual Track versus track on Adaptive
Tracking (error tolerance 20m)
• A restricted random motion : the object starts at (0,d) and moves from one
node to another randomly chosen node until it walks out of the grid.
UCI Database Group
Energy Savings due to Adaptive Tracking
total energy consumption over all sensor nodes for random mobility
model with varying track or track error.
 significant energy savings using adaptive precision protocol over non
adaptive tracking ( constant line in graph)
UCI Database Group
 for a random model, prediction does not work well !
Energy consumption with Distance from BS
total energy consumption over all sensor nodes for random mobility model with
varying base station distance from sensor grid.
 As base station moves away, one can expect energy consumption to increase
since transmission cost varies as d n ( n =2 )
 adaptive precision algorithm gives us better results with increasing base
station
distance
UCI Database Group
Challenges & Ongoing Work
• Ongoing Work:
–
–
–
–
Supporting a larger class of SQL queries
Supporting continuous monitoring queries
Larger class of sensors (e.g., video sensors)
Better approaches to model fitting/switching in
prediction
• In the future:
–
–
–
–
distributed Quasar architecture
optimizing quality given resource constraints
supporting applications with real-time constraints
dealing with failures
UCI Database Group
The DAS Project**
Goals:
Support Database as a Service on the
Internet
Collaboration:
IBM (Dr. Bala Iyer)
UCI (Gene Tsudik)
** Supported in part by NSF ITR grant entitled “Privacy in Database as a Service” and by the
IBM Corporation
UCI Database Group
Software as a Service
• Get …
– what you need
– when you need
• Pay …
– what you use
• Don’t worry …
– how to deploy, implement, maintain, upgrade
UCI Database Group
Software As a Service: Why?
• Advantages
– reduced cost to client
• pay for what you use
and not for hardware,
software infrastructure
or personnel to deploy,
maintain, upgrade…
– reduced overall cost
• cost amortization across
users
– Better service
• leveraging experts
across organizations
UCI Database Group
• Driving Forces
– Faster, cheaper, more accessible
networks
– Virtualization in server and
storage technologies
– Established e-business
infrastructures
• Already in Market
– ERP and CRM (many examples)
– More horizontal storage services,
disaster recovery services, e-mail
services, rent-a-spreadsheet
services etc.
– Sun ONE, Oracle Online
Services, Microsoft .NET My
Services etc
Better Service for Cheaper
Database As a Service
Most Significant DB Execution Problems
Ease of Administration
58%
Qualified Administrators
57%
Compatibility
51%
Qualified Programmers
51%
Platform Independence
40%
0
• Why?
10
20
30
40
50
60
% of respondents (Source: InfoWeek Research)
– Most organizations need DBMSs
– DBMSs extremely complex to deploy, setup, maintain
– require skilled DBAs with high cost
UCI Database Group
70
What do we want to do?
Server
Internet
User
Application Service Provider (ASP)
• Database as a Service (DAS) Model
– DB management transferred to service provider for
• backup, administration, restoration, space management,
upgrades etc.
– use the database “as a service” provided by an ASP
• use SW, HW, human resources of ASP, instead of your own
BUT….
UCI Database Group
Challenges
• Economic/business model?
– How to charge for service, what kind of service
guarantees can be offered, costing of guarantees,
liability of service provider.
• Powerful interfaces to support complete application
development environment
– User Interface for SQL, support for embedded SQL
programming, support for user defined interfaces, etc.
• Scalability in the web environment
– overheads due to network latency (data proxies?)
• Privacy and Security
– Protecting data at service providers from intruders and
attacks.
– Protecting clients from misuse of data by service
providers
– Ensuring result integrity
UCI Database Group
Data privacy from service provider
Server
User Data
User
Untrusted
Application Service Provider
Encrypted User
Database
Server Site

The problem is we do not trust “the service provider” for sensitive
information!

Fact 1: Theft of intellectual property due to database vulnerabilities costs
American businesses $103 billion annually
Fact 2: 45% of those attacks are conducted by insiders!

encrypt the data and store it

but still be able to run queries over the encrypted data

do most of the work at the server

(CSI/FBI Computer Crime and Security Survey, 2001)
UCI Database Group
System Architecture
Client Site
Server Site
Encrypted
Results
Result
Filter
Temporary
Results
Client Side
Query
?
Server Side
Query
Service Provider
Query
Translator
Original Query
Metadata
?
Encrypted User
Database
?
Actual Results
UCI Database Group
User
NetDB2 Service
2
• Developed in
collaboration with IBM
• Deployed on the Internet
about 2 years ago
1
4
3
UCI Database Group
– Been used by 15
universities and more
than 2500 students to
help teaching database
classes
• Currently offered
through IBM Scholars
Program
MARS Project**
Goals: integration of similarity retrieval
and query refinement over structured
and semi-structured databases to help
interactive data analysis/mining
**Supported in part by NSF CAREER award, NSF grant entitled “learning digital behavior”
and a KDD grant entitled “Mining events and entities over large spatio-temporal data sets”
UCI Database Group
Similarity Search in Databases (SR)
Honda sedan,
inexpensive,
after 1994,
around LA
Exact Search semantics (unranked)
MARS-QL
select * from user_car_catalog
where model ~= Honda Accord,
year >= 1994, price <= 4K,
location ~= LA
Used Car Catalog
Alice
Honda sedan,
inexpensive,
after 1994,
around LA
Bob
UCI Database Group
Year Model
Mileage
Transmission
Location
Color Price
...
Similarity is Subjective:
results reflect personal
interpretation of
`around’,`inexpensive’,
and relative importance
Honda Accord
95
150K
LA
3500
A
Honda Accord
94
90K
LA
3975
M
Similarity search (Alice – price more important)
Honda Accord
94
90K
LA
3975
M
1
Honda Accord
95
150K
LA
3500
A
1
Toyota Camry
94
100K
Malibu
3500
A
.8
Honda Accord
94
60K
Irvine
5000
A
.7
Honda Accord
94
70K
San
Diego
4500
A
.6
Similarity search (Bob – location more important)
Honda Accord
94
90K
LA
3975
M
1
Honda Accord
95
150K
LA
3500
A
1
Honda Prelude
95
50K
LA
6000
A
.8
Honda Accord
98
30K
LA
6500
A
.7
Honda Accord
94
60K
Irvine
5000
A
.5
Query Refinement (QR)
Mileage also
important
Results
–
Honda Accord
94
90K
LA
3975
M
1

Honda Accord
95
150K
LA
3500
A
1

Toyota Camry
94
100K
Malibu
3500
A
.8

Honda Accord
94
60K
Irvine
5000
A
.7

Honda Accord
94
70K
San
Diego
4500
A
.6
Refined Query
select * from user_car_catalog
where model ~= Honda Accord,
year >= 1994, price <= 4K,
location ~= LA,
mileage~=60K
UCI Database Group
Refined Results
Honda Accord
94
90K
LA
3975
M
.9
Honda Accord
94
60K
Irvine
5000
A
.8
Honda Accord
94
70K
4500
A
.6
Toyota Camry
94
100K
San
Diego
Malibu
3500
A
.6
Honda Accord
93
80K
San
Diego
4500
A
.5
Why are SR and QR important?
• Most queries are similarity searches
– Specially in exploratory data analysis tasks (e.g., catalog search)
– Users have only a partial idea of their information need
• Existing Search technologies (text retrieval, SQL) do not
provide appropriate support for SR and (almost) no support
for QR.
– Users must artificially convert similarity queries to keyword-searches
or exact-match queries
– Good mappings difficult or not feasible
• Lack of good knowledge of the underlying data or its structure
• Exact-match may be meaningless for certain data types (e.g., images, text, multimedia)
UCI Database Group
Similarity Access and Interactive Mining
Architecture
Search Client
Query/Feedback
Ranked Results
Query Session
Manager
Initial
Query
Ranking
Rules
Feedback
Ranked
Results
Similarity
Query
Processor
Schemes
Feedback
Table
Feedback-based
Refinement
Method
History-based
Refinement
Method
Refinement
Manager
Answer
Table
Legend:
--- logging
__ Process
Scores
Table
Query Log
Manager/Miner
Query
Results
Query
Log
Types
Database
UCI Database Group
ORDBMS
Similarity
Operators
MARS Challenges...
• Learning queries from
– user interactions
– user profiles
– past history of other users
• Efficient implementation of
– similarity queries
– refined queries
• Role of similarity queries in
– OLAP
– interactive data mining
UCI Database Group
Query-Session Manager
Similarity Query Processor
-parse the query
- check query validity
-generate schema for support tables
- maintain sessions registry
-executes query on ORDBMS
- ranks results (e.g. can exclude already
see tuples, etc)
- logs query(query or Top-k)
Refinement Manager
Query Log Manager/Miner
- maintains a registry of query refinement
policies (content/collaborative)
- generates the scores table
- identifies and invokes intra-predicate refiners.
- maintains query log
. Initial-Final pair
. Top-K results
. Complete trajectory
- Query-query similarity (can have multiple policies)
- Query clustering
UCI Database Group
Text Search Technologies
(Altavista, Verity, Vality, Infoseek)
Approach
convert enterprise
structured data into a
searchable text index.
Strengths
support ranked
retrieval
can handle missing
data, synonyms, data
entry errors
UCI Database Group
Limitations
cannot capture semantics
of relationships in data
cannot capture semantics
of non-text fields (e.g.,
multimedia)
limited support for
refinement or preferences
in current systems
cannot express similarity
queries over structured or
semi-structured data (e.g.,
price, location)
Movies
…
…
…
…
…
…
Actors
…
…
…
…
…
…
Directors
…
…
…
…
…
…
Al Pacino acted
in a movie
directed by
Francis Ford
Coppola
Honda accord
near LA
approx. $4000
SQL-based Search Technologies
Oracle, Informix, DB2, Mercado
Approach
translate similarity
query into exact SQL
query.
1994 Honda
accord near
LA approx.
$4000
Strengths
support structured as
well as semistructured data
support for arbitrary
data types
Scalable attributebased lookup
UCI Database Group
select *
from user_car_catalog
where model = Honda
Accord and 1993  year 1995 and
dist(90210)  50 and price < 5000
Limitations
translation is difficult or not
possible
difficult to guess right
ranges
causes near misses
not feasible for nonnumeric fields
cannot rank answers
based on relevance
does not account for user
preference or query
refinement
Download