slides - iLab! - University of Southern California

AIMS: An Immersidata
Management System
Cyrus Shahabi
Computer Science Department &
Integrated Media Systems Center
University of Southern California
Los Angeles, CA 90089-0781
shahabi@usc.edu
http://infolab.usc.edu
CIDR’03
Outline

Definitions and Motivating Applications

Immersive Data Types (focus: immersidata)

AIMS Architecture

Subsystems: Acquisition, Storage & Querying

Current Status (demo, if time permits)

Conclusion and Future Work
Immersive Environments


Immersive Environments allow a user to become
immersed within an augmented or virtual reality
environment in order to interact with people,
objects, places, and databases.
Examples







Office of the Future (UNC)
Fire Fighter Training System (Georgia Tech)
Planetary Exploration (JPL)
Physical/Occupational Therapy System (Haifa Univ.)
Virtual Classroom and Office (USC IMSC)
Haptic Museum (USC IMSC)
MRE: Mission Rehearsal Exercise (USC ICT)
Thesis (1)


It is absolutely critical to understand the data
generated by and for immersive environments
For example, from the data acquired from a
user’s interactions with an immersive
environment (i.e., immersidata), we can learn
about the user’s behavior to:
• Study human factor issues
• Measure the effectiveness of the environment
• Customize the information delivery
• Identify pitfalls in the system
• Better understand the user’s intentions
• Improve the system performance
For immersive and multimedia community!
For database community:
• Immersive sensors are the user interfaces of the future;
as a research community we should study their
generated data or we will miss the boat.
Example: Immersive Sensor Data Streams
<Si, x, y, z, t, v>
Application (1) : Immersive Sensor Pattern Recognition
On-Line Query & Analysis
[Figure: sensor data from the immersive environment is matched by the recognition system against a DB of labeled patterns, and candidate commands are scored. Labels shown: Play, Run, Stop (0.72, 0.15, 0.63); Zoom-In, Zoom-Out (0.92, 0.25).]
Application (1) : American Sign Language (ASL) as well-defined patterns
1. User makes ASL signs w/ a glove
2. Sensor values sampled over time
3. Semantic description of hand
4. ASL signs recognized
Recognition modules:
-SVD
-Bayesian Classifiers
-Neural Net
[Figure components: Acquisition Module; Immersidata Database; Spatio-Temporal (moving sensors) Query Evaluation; example signs C, E, F]
Application (1) : ASL On-Line Q&A …
On-Line query and analysis challenges:
• A hand sign is composed of a sequence of data samples across
multiple sensor streams
• A sequence for one sign has no fixed length (i.e., we can’t tell when
one sign ends and the next starts!)
An example statement in American Sign Language (ASL): I like yellow shoes
Two interdependent problems (a chicken-and-egg problem)
should be addressed:
• Isolate signs
• Recognize the isolated sign
Application (2) : Immersive Classroom
Off-Line Query & Analysis



Study attention performance for Normal & ADHD-Diagnosed Children
A classroom as a virtual environment (virtual
students, a virtual teacher, desks, a blackboard,
a window to the playground, doors)
Presence of distracters




Paper airplane
Ambient classroom noise
Students walking
Cars passing outside, visible through the window
Application (2) : IC Off-Line Q&A …



User, wearing HMD, is immersed into the class
Trackers monitor body movements and stream data
to the database
Task: pressing a button when a particular letter
pattern is seen on the virtual blackboard (e.g., AX)
[Figure: displayed characters, distracters, head/arm/leg sensor data, and mouse clicks are all streamed into the DB.]
Application (2) – IC Off-Line Q&A …

Off-line query and analysis:

Range-sum queries
• Sum of body movements
• Average reaction time to the patterns
• Number of correct hits

Classification and clustering
• Use a classification technique to differentiate between
normal and ADHD-diagnosed subjects (e.g., SVM)

Distinguishing hyperactive children from non-hyperactive ones by
automatically analyzing tracker data: major
impact in psychotherapy, since it can discriminate
and specify a diagnosis in a manner not possible
with existing traditional methods
Thesis (2)

Immersive applications in training and simulation
domains share common data storage and
analysis requirements (i.e., dealing with sensor
data streams, a.k.a. immersidata)

Hence, instead of building customized systems
for the “acquisition, storage and querying”
needs of each immersive application, one can
design a general-purpose system addressing
many of the shared requirements
Focus: Immersidata [MIS’99]

Data acquired from user’s interaction with the
immersive environment




Subject body positions
Subject recognized gestures
Can be analyzed to learn about user’s behavior
Specifications





Multidimensional <si, x, y, z, t, v>
Spatio-Temporal
Continuous Data Streams (CDS)
Potentially large in size and bandwidth requirements
Noisy
…, <sn, xn, yn, zn, hn, pn, rn, tn>, …, <s1, x1, y1, z1, h1, p1, r1, t1>, …
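One way to picture a single immersidata sample is as a small typed record. The sketch below is illustrative only: the field names, the helper `by_sensor`, and the reading of h, p, r as heading, pitch, and roll are assumptions that the slide does not state.

```python
from typing import Iterable, Iterator, NamedTuple

class Sample(NamedTuple):
    """One tuple <s, x, y, z, h, p, r, t> from a sensor stream (field meanings assumed)."""
    s: int      # sensor id
    x: float    # position
    y: float
    z: float
    h: float    # assumed: heading
    p: float    # assumed: pitch
    r: float    # assumed: roll
    t: float    # timestamp

def by_sensor(stream: Iterable[Sample], sensor_id: int) -> Iterator[Sample]:
    """Filter a multiplexed stream down to one sensor, preserving arrival order."""
    return (smp for smp in stream if smp.s == sensor_id)

# Made-up samples from two sensors, interleaved as they might arrive.
stream = [
    Sample(1, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.00),
    Sample(2, 1.0, 0.5, 0.2, 0.0, 0.0, 0.0, 0.00),
    Sample(1, 0.1, 0.0, 0.0, 0.0, 0.0, 0.0, 0.01),
]
sensor1 = list(by_sensor(stream, 1))
```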
AIMS: An Immersidata Management System
[Architecture figure: sensor data streams flow through four modules]
1. Acquisition module
• DWPT basis selection for each dimension
• Transformation
2. Storage module
• Wavelets packing into disk blocks or DB BLOBS
• Immersidata storage (file-system + OR-DBMS)
• Users states and contexts
3. User interaction module
• Application-specific GUI
• Pattern isolation heuristic
• Pattern matching: SVD-based measure
4. Query & analysis module
• ProPolyne [web] services
Challenges of AIMS Subsystems

Acquisition [SIGMETRICS’01, ICME’02]
• Data should be filtered and transformed (similar to signals)
• Database friendly signal processing techniques are required

Storage [SIGMOD’03?]
• Physical level of storage system should be designed to store
transformed data (e.g., wavelet coefficients)
• Block allocation strategies considering query patterns

Offline Query and Analysis [EDBT’02, PODS’02]
• Approximate, progressive, and efficient polynomial analytical query
on large amount of multidimensional data

Online Query and Analysis [MMM’03]
• Common challenges with querying continuous data streams
• Real-time pattern recognition on aggregation of multiple data
streams that are incrementally completing
• Data from all streams form the meaningful data
Approaches:
1. Acquisition Module
• INPUT: Multidimensional streams
• OUTPUT: Wavelet coefficients




Receives multidimensional sensor streams
Selects, in real time, a different basis per dimension
(optimally) from the DWPT (Discrete Wavelet Packet
Transform) library
Applies a multidimensional transformation to the data
(generates multi-resolution representations of the data)
NOTE: no compression is applied, so no data is
lost by this process
Approaches:
2. Storage Module
• INPUT: Wavelet coefficients
• OUTPUT: disk blocks
metadata records


Optimally packs related wavelet coefficients into
disk blocks (to reduce future I/O cost) and stores
them in the file system or within an OR-DBMS
Records the corresponding disk-block information in the
DBMS (Database Management System) for future
queries
Optimal Disk Placement for Wavelet Data
Tiling - Blocking (Haar wavelets)
Approaches:
3. User Interaction Module
• INPUT: Camera/speech/tracker/immersive-sensor
• OUTPUT: application commands and queries
user profile/state and application context




Receives data from various input devices (beyond keyboard
and mouse) used by the user (e.g., for data visualization
purposes)
Understands the set of requested actions (SVD + mutual information)
Translates actions to application-specific commands and/or
database queries (takes user profile & context into account)
Also stores a history of users’ interactions to be mined off-line
and/or on-line to extract user state/behavior and application
context to facilitate future interactions by the same user (e.g.,
personalization/customization)
Approaches:
4. Query & Analysis Module
• INPUT: Range and point queries
• OUTPUT: Aggregate values/Integrated events




Transforms queries into the same wavelet domain as the data
Performs queries efficiently (and perhaps approximately or
progressively) in the wavelet domain
Displays the correct resolution/granularity of aggregate
value(s) and/or events to the user based on user profile (e.g.,
tolerable latency time) and/or system requirements and/or data
availability
An event is tagged with space (e.g., latitude, longitude and
altitude), time and a bag of attributes
AIMS Main Theme:
Data Manipulation, Query & Analysis in the
WAVELET Domain


Main idea/distinction: storage is cheap and queries
are ad-hoc; let’s keep all the wavelet coefficients!
(no data compression)
Intuition: At the data population time, we don’t know
which coefficients are more/less important
• Different from the signal-processing objective of reconstructing
the entire signal as well as possible
• This has been observed by [Garofalakis & Gibbons,
SIGMOD’02], but they proposed other ways to drop
coefficients assuming a uniform workload
Opportunity: At the query time, however, we have
the knowledge of what is important to the pending
query
AIMS Main Theme: Q&A of Wavelets

Define range-sum query as dot product of query
vector and data vector (also observed by [Gilbert et al.,
VLDB’2001] but no query transformation)
• Offline: Multidimensional wavelet transform of data
• At the query time: “lazy” wavelet transform of
query vector (very fast)
• Dot product of query and data vectors in the
transformed domain → exact result
• Choose high-energy query coefficients only → fast
approximate result (90% accuracy by retrieving < 10% of
data)
• Choose query coefficients in order of energy →
progressive result
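The dot-product view can be checked on a toy example. The sketch below is not ProPolyne itself: it assumes a one-dimensional domain, made-up data, and a plain orthonormal Haar transform computed in full (not the "lazy" query transform). It only shows that the range-sum is preserved in the wavelet domain and that adding query coefficients in decreasing order of magnitude converges to the exact answer.

```python
import numpy as np

def haar(v):
    """Orthonormal Haar DWT of a length-2^j vector (coarsest coefficient first)."""
    v = np.asarray(v, dtype=float)
    details = []
    while len(v) > 1:
        s = (v[0::2] + v[1::2]) / np.sqrt(2)   # summary coefficients
        d = (v[0::2] - v[1::2]) / np.sqrt(2)   # detail coefficients
        details.insert(0, d)                   # coarser levels go in front
        v = s
    return np.concatenate([v] + details)

rng = np.random.default_rng(0)
data = rng.integers(0, 10, size=16).astype(float)   # toy data vector (made up)
query = np.zeros(16)
query[5:13] = 1.0                                    # COUNT-style range [5, 12]

data_hat = haar(data)      # would be done offline
query_hat = haar(query)    # would be done at query time

exact = float(query @ data)
in_wavelet_domain = float(query_hat @ data_hat)      # equal: the transform is orthonormal

# Progressive evaluation: consume query coefficients in decreasing order of magnitude.
order = np.argsort(-np.abs(query_hat))
progressive = np.cumsum(query_hat[order] * data_hat[order])
```

Each entry of `progressive` is the estimate after one more coefficient pair has been used; the last entry equals the exact range-sum. Only the nonzero query coefficients contribute, so only the matching data coefficients would need to be fetched.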
Current Status: ProPolyne
Demonstration
AIMS with a Twist!
[Architecture figure: the same four modules, now fed by remote sensor data streams; the tuple <x, y, z, t, value> becomes <lat, long, altitude, t, temperature>]
1. Acquisition module
• DWPT basis selection for each dimension
• Transformation
2. Storage module
• Wavelets packing into disk blocks or DB BLOBS
• Sensor data storage (file-system + DBMS)
• Users states and contexts
3. User interaction module
• Application-specific GUI
• Pattern isolation heuristic
• Pattern matching: SVD-based measure
4. Query & analysis module
• ProPolyne [web] services
Conclusion and Future Work


A new application domain, immersive applications, and one of
its data sets, immersidata, were introduced
Database challenges involved in managing immersidata
were discussed:
• Some direct adoption of the typical database research
techniques (e.g., OLAP)
• Some modifications/extensions of the current research
contributions (e.g., in the area of data streams) that are not
applicable immediately
The design of AIMS, an innovative data systems architecture,
was reported
Future Work
• I/O efficient ways for wavelet transformation and incremental
update
• Hybrid sorting of both data and query coefficients
• Prototypical implementation of an end-to-end application using
AIMS
• Performance evaluation
Application (3) – Physical/Occupational Therapy
Both On-Line and Off-Line Q&A







Rehabilitation research using virtual environments and gaming technologies
Enables individuals with severe physical disabilities to use their residual
motor abilities in more efficient and less fatiguing ways
Patient watches her video projected on a 2-d virtual environment
Video cameras track body movements
Animated target characters are manipulated within the environment
Patient is asked to hit the targets to score points
Potential data analysis tasks
• Offline analysis of user performance in order to find specific motor disabilities
• Online analysis of body movements to add more targets in the directions that
need more exercise
Thanks!
Haptic Data Acquisition [SIGMETRICS’01]

Temporal aspect: at what rate should the values of
the sensors be sampled?
• Trade-off between accuracy & bandwidth utilization

Fixed Sampling:
• Sampling at a constant rate; the maximum speed is a
function of system speed and/or the haptic glove

Group Sampling:
• Intuitive grouping of sensors; a different sampling rate for
each group

Adaptive Sampling:
• Dynamic sampling; within a window of the session, every
sensor is sampled at its individual optimal rate
ProPolyne Features


“Measure” can be any polynomial on any
combination of attributes
• Can support COUNT, SUM, AVERAGE
• Also supports Covariance, Kurtosis, etc.
• All using one set of pre-computed aggregates

Independent of how well the data set can be
compressed/approximated by wavelets
• Because: We show “range-sum queries” can always be
approximated well by wavelets (not always Haar though!)

Low update cost: $O(\log^d N)$

Can be used for exact, approximate and progressive
range-sum query evaluation
Polynomial Range-Sum Queries

Polynomial range-sum queries: Q(R, f, I)
• I is a finite instance of schema F
• $R \subseteq \mathrm{Dom}(F)$ is the range
• $f : \mathrm{Dom}(F) \to \mathbb{R}$ is a polynomial of degree d

$$Q(R, f, I) = \sum_{x \in I \cap R} f(x)$$

Example: F = (Age, Salary)
R: (25 < age < 40) & (55k ≤ salary < 150k)

Instance I:
Age   Salary
25    $50k
28    $55k
30    $58k
50    $100k
55    $130k
57    $120k

COUNT: $f(x) = \mathbf{1}(x) = 1$
$$Q(R, \mathbf{1}, I) = \sum_{x \in R \cap I} \mathbf{1}(x) = \mathbf{1}(28, 55k) + \mathbf{1}(30, 58k) = 2$$

SUM: $f(x) = \mathrm{salary}(x)$
$$Q(R, \mathrm{salary}, I) = \sum_{x \in R \cap I} f(x) = \mathrm{salary}(28, 55k) + \mathrm{salary}(30, 58k) = 113k$$

$$Q(R, \mathrm{salary} \times \mathrm{age}, I) = \sum_{x \in R \cap I} \mathrm{salary}(x) \times \mathrm{age}(x) = f(28, 55k) + f(30, 58k) = 1540k + 1740k = 3280k$$

$$\mathrm{Cov}(\mathrm{age}, \mathrm{salary}) = \frac{Q(R, \mathrm{salary} \times \mathrm{age}, I)}{Q(R, \mathbf{1}, I)} - \frac{Q(R, \mathrm{age}, I)\, Q(R, \mathrm{salary}, I)}{(Q(R, \mathbf{1}, I))^2}$$
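The example can be replayed directly on the six-row table. This is a minimal sketch with salaries expressed in thousands; the names `in_range` and `Q` are illustrative, and the lower salary bound is taken as inclusive (an assumption) so that the tuple (28, 55k) qualifies, as it does in the slide's results.

```python
# The six-row instance I of schema F = (Age, Salary), salaries in thousands.
I = [(25, 50), (28, 55), (30, 58), (50, 100), (55, 130), (57, 120)]

def in_range(x):
    """Range R; the lower salary bound is inclusive so that (28, 55k) is counted."""
    age, salary = x
    return 25 < age < 40 and 55 <= salary < 150

def Q(f, instance=I):
    """Polynomial range-sum: sum of f(x) over the tuples of the instance that fall in R."""
    return sum(f(x) for x in instance if in_range(x))

count = Q(lambda x: 1)                    # COUNT
total_salary = Q(lambda x: x[1])          # SUM(salary)
sum_age = Q(lambda x: x[0])               # SUM(age)
sum_product = Q(lambda x: x[0] * x[1])    # SUM(salary * age)

# Covariance assembled from the four range-sums above.
cov = sum_product / count - (sum_age * total_salary) / count**2
```

All four aggregates are range-sums of polynomials of degree at most two, which is why one evaluation mechanism can serve COUNT, SUM, AVERAGE, and covariance alike.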
Polynomial Range-Sum Queries as
“Vector Queries”



The data frequency distribution of I is the function
$D_I : \mathrm{Dom}(F) \to \mathbb{Z}$ that maps a point x to the number of times
it occurs in I
To emphasize the fact that a query is an operator on the
data frequency distribution, we write
$$Q(R, f, I) = Q(R, f, D_I)$$
Example: D(25,50)=D(28,55)=…=D(57,120)=1 and D(x)=0
otherwise.
Hence:
$$Q(R, f, D_I) = \sum_{x \in \mathrm{Dom}(F)} f(x)\, \chi_R(x)\, D_I(x)$$
where $\chi_R(x) = 1$ if $x \in R$ and $\chi_R(x) = 0$ if $x \notin R$
Or, as a Vector Query:
$$Q(R, f, D_I) = \langle f \chi_R, D_I \rangle$$
with $f \chi_R$ the query vector and $D_I$ the data vector
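The inner-product form can be made concrete with dense arrays. The sketch below assumes a small integer domain (ages 0..63, salaries 0..255 in thousands); those sizes and the variable names are illustrative choices. It uses the same six-row table and treats the lower salary bound as inclusive so that (28, 55k) falls in R.

```python
import numpy as np

# Toy domain sizes (assumptions): age in 0..63, salary in 0..255 (thousands).
AGES, SALARIES = 64, 256
I = [(25, 50), (28, 55), (30, 58), (50, 100), (55, 130), (57, 120)]

# Data frequency distribution D_I: how many times each point of Dom(F) occurs in I.
D = np.zeros((AGES, SALARIES))
for age, salary in I:
    D[age, salary] += 1

# Query vector f * chi_R for SUM(salary) over 25 < age < 40, 55 <= salary < 150.
age_grid, salary_grid = np.meshgrid(np.arange(AGES), np.arange(SALARIES), indexing="ij")
chi_R = (age_grid > 25) & (age_grid < 40) & (salary_grid >= 55) & (salary_grid < 150)
query = salary_grid * chi_R

answer = float(np.sum(query * D))   # inner product <f chi_R, D_I>
```

Here `answer` is the SUM(salary) range-sum in thousands. The query vector depends only on R and f, and the data vector only on I, which is what allows each to be transformed independently.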
Overview of Wavelets
[Figure: filter-bank tree. The array a[i], $0 \le i < 2^j$, is split into Ha[i] and Ga[i], $0 \le i < 2^{j-1}$; Ha is split into H²a[i] and GHa[i], $0 \le i < 2^{j-2}$; H²a is split into H³a[i] and GH²a[i], $0 \le i < 2^{j-3}$.]
H operator: computes a local average of the array a at every other point to produce an array of summary coefficients
Example (Haar) h=[1/2,1/2]
G operator: measures how much values in the array a vary inside each of the summarized blocks to compute an array of detail coefficients
Example (Haar) g=[1/2,-1/2]
DWT of a: â
[Figure labels: summary coefficients of a at level 2; detail coefficients of a at level 2, aka wavelet coefficients of a]
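The two operators are easy to try out with the Haar filters quoted on the slide. This is a minimal sketch: the function names, the input array, and the coarsest-first output ordering are illustrative choices, and the filters are applied to non-overlapping pairs, which is what "at every other point" amounts to for length-2 filters.

```python
import numpy as np

def H(a, h=(0.5, 0.5)):
    """Summary operator: weighted local average over non-overlapping pairs."""
    a = np.asarray(a, dtype=float)
    return h[0] * a[0::2] + h[1] * a[1::2]

def G(a, g=(0.5, -0.5)):
    """Detail operator: how much the two values inside each pair differ."""
    a = np.asarray(a, dtype=float)
    return g[0] * a[0::2] + g[1] * a[1::2]

def dwt(a):
    """Full decomposition [H^j a, G H^(j-1) a, ..., GHa, Ga] for len(a) == 2**j."""
    a = np.asarray(a, dtype=float)
    details = []
    while len(a) > 1:
        details.insert(0, G(a))   # keep the details of this level
        a = H(a)                  # recurse on the summary
    return np.concatenate([a] + details)

a = [4, 6, 10, 12, 8, 8, 0, 2]    # made-up array of length 2^3
coeffs = dwt(a)
```

Applying `H` repeatedly halves the array each time, while `G` records what each averaging step discards, so the coefficients together carry the same information as the original array.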
Naive Evaluation of Vector Queries
Using Wavelets

Hence, vector queries can be computed in the wavelet-transformed space as:
$$Q(R, f, D) = \langle \widehat{f\chi_R}, \hat{D} \rangle = \sum_{\eta_0, \ldots, \eta_{d-1} = 0}^{N-1} \widehat{f\chi_R}(\eta_0, \ldots, \eta_{d-1})\, \hat{D}(\eta_0, \ldots, \eta_{d-1})$$
Algorithm:

Off-line transformation of data vector (or “data distribution
function”, i.e., D, to be exact)
• $O(|I|\, l^d \log^d N)$ for sparse data, $O(|I|) = O(N^d)$ for dense data

Transform the query vector at submission
• $O(N^d)$ !

Sum up the products of the corresponding elements of data
and query vectors
• Retrieving elements of data vector: $O(N^d)$ !
Fast Evaluation of Vector Queries Using
Wavelets

Main intuitions:



“Query vector” can be transformed quickly because
most of the coefficients are known in advance
“Transformed query vector” has a large number of
negligible (e.g., zero) values (independent of how well
data can be approximated by wavelets)
Example: Haar filter & COUNT function on R=[5,12] on
the domain of integers from 0 to 15:
$$\chi_R = \{0,0,0,0,0,1,1,1,1,1,1,1,1,0,0,0\}$$
$$\hat{\chi}_R = \left\{2,\ -\tfrac{1}{2},\ -\tfrac{3}{2\sqrt{2}},\ \tfrac{3}{2\sqrt{2}},\ 0,\ -\tfrac{1}{2},\ 0,\ \tfrac{1}{2},\ 0,\ 0,\ -\tfrac{1}{\sqrt{2}},\ 0,\ 0,\ 0,\ \tfrac{1}{\sqrt{2}},\ 0\right\}$$
(coefficients ordered as H⁴a, GH³a, GH²a, GHa, Ga; orthonormal Haar, with each detail coefficient taken as (left − right)/√2)
At each step, you know the zeros
Exact Evaluation of Vector Queries
Query:
SUM(salary) when
(25 < age < 40) &
(55k < salary < 150k)
# of Nonzero Coordinates: 4380
# of Wavelet Coefficients: 837
Approximate Evaluation of Vector Queries
Optimal Disk Placement for Wavelet Data




The goal is to store wavelet coefficients efficiently
Efficiently means fast access to stored data, low I/O
complexity, and few disk accesses
How to achieve this: create locality of
reference
Designed for wavelet overlap queries, but can be extended
to polynomial range-sum queries over multidimensional
data
Optimal Disk Placement for Wavelet Data
Discrete Wavelet Transform
[Figure: the DWT maps eight time-domain samples x0 … x7 to eight wavelet-domain coefficients indexed 0 … 7.]
SVD Background

The idea of SVD is based on the following theorem
of linear algebra:

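If matrix $X \in \mathbb{R}^{m \times n}$, then there exist column-orthonormal
matrices $U \in \mathbb{R}^{m \times r}$ and $V \in \mathbb{R}^{n \times r}$ and a diagonal matrix $A \in \mathbb{R}^{r \times r}$ such that
$$X = U \times A \times V^{T}$$
where $A = \mathrm{diag}(a_1, a_2, \ldots, a_r)$ and
$$a_1 \ge a_2 \ge \ldots \ge a_r$$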
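The theorem can be checked numerically with NumPy's `numpy.linalg.svd`. The matrix below is made up; `full_matrices=False` requests the reduced form, in which U has as many columns as there are diagonal values, and the routine returns V already transposed.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(6, 4))          # made-up 6 x 4 matrix

# Reduced SVD: U is 6 x 4, a holds the diagonal of A, Vt is V transposed (4 x 4).
U, a, Vt = np.linalg.svd(X, full_matrices=False)
A = np.diag(a)

reconstructed = U @ A @ Vt           # should reproduce X up to rounding
```

NumPy returns the diagonal values in non-increasing order, matching the ordering stated in the theorem.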
Weighted-Sum SVD


Each data sequence can be represented
as a matrix whose columns (r of them) are the
sensors, so the number of columns is fixed
The similarity metric of two data
sequences is defined on the ‘square’
matrices
• Multiplying each matrix by its transpose gives an r × r matrix,
which eliminates the effect of the two matrices having
different numbers of rows (i.e., the time dimension)
Weighted-Sum SVD
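Problem: Obtain the similarity of the input sequence and the pattern
[Figure: the input sequence is squared into an r × r matrix (q11 … qrr) and SVD-decomposed into vectors e1, e2, … , er and values c1, c2, … , cr; the pattern is squared into an r × r matrix (p11 … prr) and SVD-decomposed into vectors f1, f2, … , fr and values d1, d2, … , dr. Each side gets weights: cw1, cw2, … , cwr with cw1+cw2+…+cwr=1, and dw1, dw2, … , dwr with dw1+dw2+…+dwr=1.]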
Weighted-Sum SVD
Problem: Obtain the similarity of the input sequence (vectors e1, e2, … , er with weights cw1, cw2, … , cwr)
and the pattern (vectors f1, f2, … , fr with weights dw1, dw2, … , dwr)
$$\Theta_1 = \sum_{i=1}^{r} cw_i \, (e_i \cdot f_i), \qquad \Theta_2 = \sum_{i=1}^{r} dw_i \, (e_i \cdot f_i)$$
The similarity of input sequence
and the pattern
=min(Θ1, Θ2)
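A rough sketch of this measure in NumPy follows. Several details are assumptions, since the slides do not spell them out: the square matrix is formed as `seq.T @ seq`, the weights are taken to be the eigenvalues normalised to sum to 1, and the absolute value of each dot product is used to remove the sign ambiguity of eigenvectors. The function names and test data are illustrative.

```python
import numpy as np

def weighted_basis(seq):
    """Return (vectors, weights) for a (time x sensors) matrix.

    The r x r matrix seq.T @ seq is decomposed; the weights are its eigenvalues
    normalised to sum to 1 (an assumption about how cw_i and dw_i are derived).
    """
    seq = np.asarray(seq, dtype=float)
    square = seq.T @ seq
    values, vectors = np.linalg.eigh(square)      # eigenvalues in ascending order
    order = np.argsort(values)[::-1]              # reorder: largest first
    values, vectors = values[order], vectors[:, order]
    return vectors, values / values.sum()

def similarity(input_seq, pattern_seq):
    e, cw = weighted_basis(input_seq)
    f, dw = weighted_basis(pattern_seq)
    # |e_i . f_i|: the absolute value is an assumption (eigenvector signs are arbitrary).
    dots = np.abs(np.sum(e * f, axis=0))
    return min(float(cw @ dots), float(dw @ dots))

rng = np.random.default_rng(2)
pattern = rng.normal(size=(40, 5))                  # 40 samples, 5 sensors (made up)
same_but_longer = np.vstack([pattern, pattern])     # same content, twice as many rows
unrelated = rng.normal(size=(25, 5))

s_same = similarity(same_but_longer, pattern)
s_other = similarity(unrelated, pattern)
```

Because both sequences are reduced to r × r matrices first, inputs with different numbers of rows can be compared directly; the repeated pattern scores essentially 1 despite being twice as long.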
The Ridge-Climbing Heuristic

Procedure:



Compute the accumulated similarity values (ASVs)
between the input sequence and all vocabulary
sequences
Keep track of all ASVs
For each vocabulary sequence, check whether the ASV
is monotonically increasing, and whether a maximum is
reached
• Yes: put this vocabulary into the candidates pool


Choose the vocabulary from the candidates pool with
biggest maximal value
Isolate the recognized stream
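The procedure can be sketched over precomputed ASV histories. How the ASVs are accumulated is left out here, and the reading of "a maximum is reached" as "rose strictly at every step and then dropped at the latest step" is an assumption; the function name and sample numbers are illustrative.

```python
def ridge_climb(asv_history):
    """Pick the vocabulary whose accumulated similarity value (ASV) has peaked.

    asv_history maps a vocabulary name to its ASVs so far, one per time step.
    A sequence counts as 'peaked' if it rose strictly at every step and then
    dropped at the latest step (an assumed reading of 'a maximum is reached').
    Returns (name, peak_index), or None if there is no candidate yet.
    """
    candidates = {}
    for name, asvs in asv_history.items():
        if len(asvs) < 3:
            continue                      # too short to show a rise and a drop
        rising, last = asvs[:-1], asvs[-1]
        climbed = all(b > a for a, b in zip(rising, rising[1:]))
        if climbed and last < rising[-1]:
            candidates[name] = rising[-1] # remember the maximum that was reached
    if not candidates:
        return None
    best = max(candidates, key=candidates.get)
    return best, len(asv_history[best]) - 2

# Made-up ASVs for a three-word vocabulary.
history = {
    "like":   [0.10, 0.35, 0.60, 0.82, 0.70],
    "yellow": [0.05, 0.10, 0.12, 0.15, 0.18],
    "I":      [0.20, 0.15, 0.10, 0.08, 0.05],
}
result = ridge_climb(history)
```

The returned index marks where the recognized sign ends in the input, so the caller can cut the stream there and restart accumulation for the next sign.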
The Ridge-Climbing Heuristic
Assume the database has only three vocabulary sequences: like, yellow, and I.
[Figure: plots of ASVs over time for like, yellow, and I as the input sequence arrives. When a maximum is reached, the sign is isolated and the ASVs are reset.]