AIMS: An Immersidata Management System
Cyrus Shahabi
Computer Science Department & Integrated Media Systems Center
University of Southern California
Los Angeles, CA 90089-0781
shahabi@usc.edu
http://infolab.usc.edu
CIDR’03

Outline
• Definitions and Motivating Applications
• Immersive Data Types (focus: immersidata)
• AIMS Architecture
• Subsystems: Acquisition, Storage & Querying
• Current Status (demo, if time permits)
• Conclusion and Future Work

Immersive Environments
Immersive Environments allow a user to become immersed within an augmented or virtual reality environment in order to interact with people, objects, places, and databases.
Examples:
• Office of the Future (UNC)
• Fire Fighter Training System (Georgia Tech)
• Planetary Exploration (JPL)
• Physical/Occupational Therapy System (Haifa Univ.)
• Virtual Classroom and Office (USC IMSC)
• Haptic Museum (USC IMSC)
• MRE: Mission Rehearsal Exercise (USC ICT)

Thesis (1)
It is absolutely critical to understand the data generated by and for immersive environments.
For example, from the data acquired from a user’s interactions with an immersive environment (i.e., immersidata), we can learn about the user’s behavior (for the immersive and multimedia community and for the database community) to:
• Study human factor issues
• Measure the effectiveness of the environment
• Customize the information delivery
• Identify pitfalls in the system
• Better understand the user’s intentions
• Improve the system performance
Immersive sensors are the user interfaces of the future; as a research community we should study their generated data or we will miss the boat.

Example: Immersive Sensor Data Streams
<Si, x, y, z, t, v>

Application (1): Immersive Sensor Pattern Recognition
On-Line Query & Analysis
[Figure: immersive environment; recognition system; DB of labeled patterns (Play, Run, Stop, Zoom-In, Zoom-Out; values shown: 0.72, 0.15, 0.63, 0.92, 0.25); recognized command]

Application (1): American Sign Language (ASL) as well-defined patterns
1.
User makes ASL signs with a glove
2. Sensor values sampled over time
3. Semantic description of hand
4. ASL signs recognized
[Figure: example signs C, E, F; Acquisition Module; Immersidata Database; Spatio-Temporal (moving sensors) Query Evaluation; hand model labeled p1, p2, fi, di]
Recognition modules:
• SVD
• Bayesian Classifiers
• Neural Net

Application (1): ASL On-Line Q&A …
On-Line query and analysis challenges:
• A hand sign is composed of a sequence of data samples across multiple sensor streams
• A sequence for one sign has no fixed length (i.e., can’t tell when one ends and the other starts!)
An example statement in American Sign Language (ASL): I like yellow shoes
Two problems with interdependent solutions (a chicken-and-egg problem) should be addressed:
• Isolate signs
• Recognize the isolated sign

Application (2): Immersive Classroom
Off-Line Query & Analysis
• Study attention performance for Normal & ADHD-Diagnosed Children
• A classroom as a virtual environment (virtual students, a virtual teacher, desks, a blackboard, a window to the playground, doors)
• Presence of distracters
  • Paper airplane
  • Ambient classroom noise
  • Students walking
  • Cars passing outside, visible through the window

Application (2): IC Off-Line Q&A …
• User, wearing HMD, is immersed into the class
• Trackers monitor body movements and stream data to the database
• Task: pressing a button when a particular letter pattern is seen on the virtual blackboard (e.g., AX)
[Figure: displayed characters, head sensor data, arm sensor data, leg sensor data, mouse clicks, and distracters all flow into the DB]

Application (2): IC Off-Line Q&A …
Off-line query and analysis:
• Range-sum queries
  • Sum of body movements
  • Average reaction time to the patterns
  • Number of correct hits
• Classification and clustering
  • Use a classification technique to differentiate between normal and ADHD-diagnosed subjects (e.g., SVM)
Distinguishing hyperactive kids from normal by automatically analyzing tracker data: major impact in psychotherapy, able to discriminate and specify
diagnosis in a manner not possible using existing traditional methods

Thesis (2)
• Immersive applications in training and simulation domains share common data storage and analysis requirements (i.e., dealing with sensor data streams, aka immersidata)
• Hence, instead of building customized systems for the “acquisition, storage and querying” needs of each immersive application, one can design a general-purpose system addressing many of the shared requirements

Focus: Immersidata [MIS’99]
• Data acquired from user’s interaction with the immersive environment
  • Subject body positions
  • Subject recognized gestures
• Can be analyzed to learn about user’s behavior
• Specifications
  • Multidimensional <si, x, y, z, t, v>
  • Spatio-Temporal
  • Continuous Data Streams (CDS)
  • Potentially large in size and bandwidth requirements
  • Noisy
…, <sn,xn,yn,zn,hn,pn,rn,tn>, …
…, <s1,x1,y1,z1,h1,p1,r1,t1>, …

AIMS: An Immersidata Management System
[Figure: architecture; sensor data streams feed the modules below]
1. Acquisition module
  • DWPT basis selection for each dimension
  • Transformation
2. Storage module
  • Wavelets packing into disk blocks or DB BLOBS
3. User interaction module
  • Application-specific GUI
  • Pattern isolation heuristic
  • Pattern matching: SVD-based measure
4. Query & analysis module
  • ProPolyne [web] services
Immersidata storage (file-system + OR-DBMS); Users states and contexts

Challenges of AIMS Subsystems (references: Acquisition [SIGMETRICS’01, ICME’02]; Storage [SIGMOD’03?])
Acquisition:
• Data should be filtered and transformed (similar to signals)
• Database-friendly signal processing techniques are required
Storage:
• Physical level of the storage system should be designed to store transformed data (e.g., wavelet coefficients)
  • Block allocation strategies considering query patterns
Offline Query and Analysis [EDBT’02, PODS’02]:
• Approximate, progressive, and efficient polynomial analytical queries on large amounts of multidimensional data
Online Query and Analysis [MMM’03]:
• Common challenges with querying continuous data streams
• Real-time pattern recognition on aggregation of multiple data streams that are incrementally completing
• Data from all streams form the meaningful data

Approaches: 1. Acquisition Module
• INPUT: Multidimensional streams
• OUTPUT: Wavelet coefficients
Steps:
• Receives multidimensional sensor streams
• In real time, selects a different basis per dimension (optimally) from the DWPT (Discrete Wavelet Packet Transforms) library
• Applies a multidimensional transformation to the data (generates multi-resolution representations of the data)
• NOTE: no compression is applied, no data will be lost by this process

Approaches: 2. Storage Module
• INPUT: Wavelet coefficients
• OUTPUT: disk blocks, metadata records
Steps:
• Optimally packs related wavelet coefficients into disk blocks (to reduce future I/O cost) and stores them in the file system or within the OR-DBMS
• Includes the corresponding disk block info in the DBMS (Database Management System) for future queries

Optimal Disk Placement for Wavelet Data
[Figure: Tiling - Blocking (Haar wavelets)]

Approaches: 3.
User Interaction Module
• INPUT: Camera/speech/tracker/immersive-sensor
• OUTPUT: application commands and queries; user profile/state and application context
Steps:
• Receives data from various input devices (beyond keyboard and mouse) used by the user (e.g., for data visualization purposes)
• Understands the set of requested actions (SVD + mutual information)
• Translates actions to application-specific commands and/or database queries (takes user profile & context into account)
• Also stores a history of users’ interactions to be mined off-line and/or on-line to extract user state/behavior and application context, to facilitate future interactions by the same user (e.g., personalization/customization)

Approaches: 4. Query & Analysis Module
• INPUT: Range and point queries
• OUTPUT: Aggregate values/Integrated events
Steps:
• Transforms queries into the same wavelet domain as the data
• Performs queries efficiently (and perhaps approximately or progressively) in the wavelet domain
• Displays the correct resolution/granularity of aggregate value(s) and/or events to the user based on user profile (e.g., tolerable latency time) and/or system requirements and/or data availability
• An event is tagged with space (e.g., latitude, longitude and altitude), time, and a bag of attributes

AIMS Main Theme: Data Manipulation, Query & Analysis in the WAVELET Domain
Main idea/distinction: storage is cheap and queries are ad-hoc; let’s keep all the wavelet coefficients!
(no data compression)
Intuition: At the data population time, we don’t know which coefficients are more/less important
• Different from the signal-processing objective of reconstructing the entire signal as well as possible
• This has been observed by [Garofalakis & Gibbons, SIGMOD’02], but they proposed other ways to drop coefficients assuming a uniform workload
Opportunity: At the query time, however, we have the knowledge of what is important to the pending query

AIMS Main Theme: Q&A of Wavelets
• Define range-sum query as dot product of query vector and data vector (also observed by [Gilbert et al., VLDB’2001] but no query transformation)
• Offline: Multidimensional wavelet transform of data
• At the query time: “lazy” wavelet transform of query vector (very fast)
• Dot product of query and data vectors in the transformed domain → exact result
• Choose high-energy query coefficients only → fast approximate result (90% accuracy by retrieving < 10% of data)
• Choose query coefficients in order of energy → progressive result

Current Status: ProPolyne Demonstration

AIMS with a Twist!
[Figure: architecture; Remote Sensor Data Streams <x, y, z, t, value>, e.g., <lat, long, altitude, t, temperature>]
3. User interaction module: Application-specific GUI; Pattern isolation heuristic; Pattern matching: SVD-based measure
1. Acquisition module: DWPT basis selection for each dimension; Transformation
4. Query & analysis module
2.
Storage module: Wavelets packing into disk blocks or DB BLOBS
Query & analysis module: ProPolyne [web] services
Users states and contexts; Sensor Data storage (file-system + DBMS)

Conclusion and Future Work
• A new application domain, immersive applications, and one of its data sets, immersidata, were introduced
• Database challenges involved in managing immersidata were discussed:
  • Some direct adoption of the typical database research techniques (e.g., OLAP)
  • Some modifications/extensions of the current research contributions (e.g., in the area of data streams) that are not applicable immediately
• The design of AIMS, an innovative data systems architecture, was reported
• Future Work
  • I/O-efficient ways for wavelet transformation and incremental update
  • Hybrid sorting of both data and query coefficients
  • Prototypical implementation of an end-to-end application using AIMS
  • Performance evaluation

Application (3): Physical/Occupational Therapy
Both On-Line and Off-Line Q&A
• Rehabilitation research using virtual environments and gaming technologies
• Enables individuals with severe physical disabilities to use their residual motor abilities in more efficient and less fatiguing ways
  • Patient watches her video projected on a 2-D virtual environment
  • Video cameras track body movements
  • Animated target characters are manipulated within the environment
  • Patient is asked to hit the targets to score more points
• Potential data analysis tasks
  • Offline analysis of user performance in order to find specific motor disabilities
  • Online analysis of body movements to add more targets in the directions which need more exercise

Thanks!

Haptic Data Acquisition [SIGMETRICS’01]
Temporal aspect: at what rate should the values of the sensors be sampled?
Trade-off between accuracy & bandwidth utilization
• Fixed Sampling: Sampling at a constant rate; max value of speed is a function of system speed and/or haptic glove
• Group Sampling: Intuitive grouping of sensors; different sampling rate for each group
• Adaptive Sampling: Dynamic sampling; within a window of session, every sensor is sampled at an individual optimal rate

ProPolyne Features
• “Measure” can be any polynomial on any combination of attributes
  • Can support COUNT, SUM, AVERAGE
  • Also supports Covariance, Kurtosis, etc.
  • All using one set of pre-computed aggregates
• Independent of how well the data set can be compressed/approximated by wavelets
  • Because: We show “range-sum queries” can always be approximated well by wavelets (not always HAAR though!)
• Low update cost: \(O(\log^d N)\)
• Can be used for exact, approximate and progressive range-sum query evaluation

Polynomial Range-Sum Queries
Polynomial range-sum queries: \(Q(R, f, I)\)
• \(I\) is a finite instance of schema \(F\)
• \(R \subseteq \mathrm{Dom}(F)\) is the range
• \(f : \mathrm{Dom}(F) \to \mathbb{R}\) is a polynomial of degree \(d\)
\[ Q(R, f, I) = \sum_{x \in I \cap R} f(x) \]
Example: \(F\) = (Age, Salary), \(R\): (25 < age < 40) & (55k ≤ salary < 150k), and \(I\) is:
  Age   Salary
  25    $50k
  28    $55k
  30    $58k
  50    $100k
  55    $130k
  57    $120k
COUNT, with \(f(x) = \mathbf{1}(x) \equiv 1\):
\[ Q(R, \mathbf{1}, I) = \sum_{x \in R \cap I} \mathbf{1}(x) = \mathbf{1}(28, 55\mathrm{k}) + \mathbf{1}(30, 58\mathrm{k}) = 2 \]
SUM, with \(f(x) = \mathrm{salary}(x)\):
\[ Q(R, \mathrm{salary}, I) = \sum_{x \in R \cap I} \mathrm{salary}(x) = \mathrm{salary}(28, 55\mathrm{k}) + \mathrm{salary}(30, 58\mathrm{k}) = 113\mathrm{k} \]
Product, with \(f(x) = \mathrm{salary}(x) \times \mathrm{age}(x)\):
\[ Q(R, \mathrm{salary} \times \mathrm{age}, I) = \sum_{x \in R \cap I} \mathrm{salary}(x)\,\mathrm{age}(x) = f(28, 55\mathrm{k}) + f(30, 58\mathrm{k}) = 1540\mathrm{k} + 1740\mathrm{k} = 3280\mathrm{k} \]
Covariance:
\[ \mathrm{Cov}(\mathrm{age}, \mathrm{salary}) = \frac{Q(R, \mathrm{salary} \times \mathrm{age}, I)}{Q(R, \mathbf{1}, I)} - \frac{Q(R, \mathrm{age}, I)\, Q(R, \mathrm{salary}, I)}{(Q(R, \mathbf{1}, I))^2} \]

Polynomial Range-Sum Queries as “Vector Queries”
The data frequency distribution of \(I\) is the function \(D_I : \mathrm{Dom}(F) \to \mathbb{Z}\) that maps a point \(x\) to the number of times it occurs in \(I\).
To emphasize the fact that a query is an operator on the data frequency distribution, we write \(Q(R, f, I) = Q(R, f, D_I)\).
Example: D(25,50) = D(28,55) = … = D(57,120) = 1 and D(x) = 0 otherwise.
Hence:
\[ Q(R, f, D_I) = \sum_{x \in \mathrm{Dom}(F)} f(x)\, \chi_R(x)\, D_I(x), \qquad \text{where } \chi_R(x) = \begin{cases} 1 & \text{if } x \in R \\ 0 & \text{if } x \notin R \end{cases} \]
Or, as a Vector Query:
\[ Q(R, f, D_I) = \langle f \chi_R,\; D_I \rangle \]
where \(f \chi_R\) is the query vector and \(D_I\) is the data vector.
[Figure: the instance I with Age = 25, 28, 30, 50, 55, 57 and Salary = $50k, $55k, $58k, $100k, $130k, $120k]

Overview of Wavelets
• H operator: computes a local average of the array a at every other point to produce an array of summary coefficients. Example (Haar): \(h = [1/\sqrt{2},\, 1/\sqrt{2}]\)
• G operator: measures how much values in the array a vary inside each of the summarized blocks to compute an array of detail coefficients. Example (Haar): \(g = [1/\sqrt{2},\, -1/\sqrt{2}]\)
• Index ranges: \(a[i]\) for \(0 \le i < 2^j\); \(Ha[i]\) and \(Ga[i]\) for \(0 \le i < 2^{j-1}\); \(H^2a[i]\) and \(GHa[i]\) for \(0 \le i < 2^{j-2}\); \(H^3a[i]\) and \(GH^2a[i]\) for \(0 \le i < 2^{j-3}\)
• \(H^2 a\): summary coefficients of a at level 2; \(GHa\): detail coefficients of a at level 2, aka wavelet coefficients of a
• DWT of a: \(\hat{a}\)

Naive Evaluation of Vector Queries Using Wavelets
Hence, vector queries can be computed in the wavelet-transformed space as:
\[ Q(R, f, D) = \langle \widehat{f\chi_R},\; \hat{D} \rangle = \sum_{\eta_0, \dots, \eta_{d-1} = 0}^{N-1} \widehat{f\chi_R}(\eta_0, \dots, \eta_{d-1})\; \hat{D}(\eta_0, \dots, \eta_{d-1}) \]
Algorithm:
• Off-line transformation of the data vector (or “data distribution function”, i.e., D, to be exact)
  • \(O(|I|\, l^d \log^d N)\) for sparse data, \(O(|I|) = O(N^d)\) for dense data
• Transform the query vector at submission
  • \(O(N^d)\)!
• Sum up the products of the corresponding elements of the data and query vectors
  • Retrieving elements of the data vector: \(O(N^d)\)!
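The naive algorithm above can be illustrated with a minimal sketch. It assumes a 1-D domain of 16 integers, the orthonormal Haar filter, and a made-up data frequency distribution `D`; the function names (`haar_dwt`, `dot`) are hypothetical and not part of AIMS or ProPolyne. The query vector is transformed with a full DWT here, not with the “lazy” transform. The sketch checks that the dot product in the wavelet domain matches the direct range sum, and accumulates the products in order of decreasing query-coefficient magnitude to give progressive estimates.

```python
import math

def haar_dwt(a):
    """Full orthonormal Haar DWT of a list whose length is a power of two.

    Output layout: [H^j a, G H^(j-1) a, ..., G H a, G a].
    """
    s = 1 / math.sqrt(2)
    summary = list(a)
    details = []  # detail arrays, finest level first
    while len(summary) > 1:
        pairs = list(zip(summary[0::2], summary[1::2]))
        details.append([s * (x - y) for x, y in pairs])  # G: local variation
        summary = [s * (x + y) for x, y in pairs]         # H: local average
    out = summary
    for d in reversed(details):  # coarsest details first
        out = out + d
    return out

def dot(u, v):
    return sum(x * y for x, y in zip(u, v))

# Hypothetical data frequency distribution D over the domain 0..15.
D = [3, 0, 1, 4, 0, 2, 5, 1, 0, 0, 2, 7, 1, 0, 3, 1]
# COUNT query on R = [5, 12]: f = 1, so the query vector is chi_R.
chi_R = [1 if 5 <= x <= 12 else 0 for x in range(16)]

direct = dot(chi_R, D)                       # sum over the original domain
q_hat, d_hat = haar_dwt(chi_R), haar_dwt(D)
exact = dot(q_hat, d_hat)                    # same sum in the wavelet domain

# Progressive evaluation: visit query coefficients by decreasing magnitude.
order = sorted(range(16), key=lambda k: -abs(q_hat[k]))
partial, estimates = 0.0, []
for k in order:
    partial += q_hat[k] * d_hat[k]
    estimates.append(partial)

# Only the nonzero query coefficients require a data coefficient to be read.
nonzero = sum(1 for c in q_hat if abs(c) > 1e-12)
print(direct, round(exact, 9), nonzero)
```

Because the transform is orthonormal, the two dot products agree up to rounding, and the running sum stops changing once every nonzero query coefficient has been used.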
Fast Evaluation of Vector Queries Using Wavelets
Main intuitions:
• The “query vector” can be transformed quickly because most of the coefficients are known in advance
• The “transformed query vector” has a large number of negligible (e.g., zero) values (independent of how well the data can be approximated by wavelets)
Example: Haar filter & COUNT function on R = [5,12] on the domain of integers from 0 to 15:
\[ \chi_R = \{0,0,0,0,0,1,1,1,1,1,1,1,1,0,0,0\} \]
\[ \hat{\chi}_R = \left\{ 2,\; -\tfrac{1}{2},\; -\tfrac{3}{2\sqrt{2}},\; \tfrac{3}{2\sqrt{2}},\; 0,\; -\tfrac{1}{2},\; 0,\; \tfrac{1}{2},\; 0,\; 0,\; -\tfrac{1}{\sqrt{2}},\; 0,\; 0,\; 0,\; \tfrac{1}{\sqrt{2}},\; 0 \right\} \]
The coefficients are ordered as \(H^4a\) (1 value), \(GH^3a\) (1), \(GH^2a\) (2), \(GHa\) (4), \(Ga\) (8); signs follow the convention \(g = [1/\sqrt{2},\, -1/\sqrt{2}]\).
At each step, you know the zeros.

Exact Evaluation of Vector Queries
Query: SUM(salary) when (25 < age < 40) & (55k ≤ salary < 150k)
[Figure: # of Nonzero Coordinates: 4380; # of Wavelet Coefficients: 837]

Approximate Evaluation of Vector Queries
[Figure]

Optimal Disk Placement for Wavelet Data
• The goal is to efficiently store wavelet coefficients
• Efficiently means fast access to stored data, low I/O complexity, little disk access
• How to achieve this: create a principle of locality of reference
• Designed for wavelet overlap queries, but can be extended for polynomial range-sum queries over multidimensional data

Optimal Disk Placement for Wavelet Data
[Figure: Discrete Wavelet Transform; time domain samples x0 … x7 are mapped by the DWT to wavelet-domain coefficients 0 … 7]

SVD Background
The idea of SVD is based on the following theorem of linear algebra:
If matrix \(X \in \mathbb{R}^{m \times n}\), then there exist column-orthonormal matrices \(U \in \mathbb{R}^{m \times r}\) and \(V \in \mathbb{R}^{n \times r}\) and a diagonal matrix \(A \in \mathbb{R}^{r \times r}\) such that \(X = U A V^T\), where \(A = \mathrm{diag}(a_1, a_2, \dots, a_p)\) with \(a_1 \ge a_2 \ge \dots \ge\)
\(a_p\)

Weighted-Sum SVD
• Each data sequence can be represented as a matrix, where the columns (r) are the sensors and hence their number is fixed
• The similarity metric of two data sequences is defined on the ‘square’ matrices
• Squaring (i.e., multiplying the matrix by its transpose) yields an r × r matrix and eliminates the effect of the two matrices having different numbers of rows (i.e., the time dimension)

Weighted-Sum SVD
Problem: Obtain the similarity of the input sequence and the pattern
• Input sequence: square to the r × r matrix \([q_{11} \dots q_{1r};\; \dots;\; q_{r1} \dots q_{rr}]\); SVD decomposition yields \(e_1, e_2, \dots, e_r\) and \(c_1, c_2, \dots, c_r\); weights \(cw_1, cw_2, \dots, cw_r\) with \(cw_1 + cw_2 + \dots + cw_r = 1\)
• Pattern: square to the r × r matrix \([p_{11} \dots p_{1r};\; \dots;\; p_{r1} \dots p_{rr}]\); SVD decomposition yields \(f_1, f_2, \dots, f_r\) and \(d_1, d_2, \dots, d_r\); weights \(dw_1, dw_2, \dots, dw_r\) with \(dw_1 + dw_2 + \dots + dw_r = 1\)

Weighted-Sum SVD
Problem: Obtain the similarity of the input sequence (\(e_1, \dots, e_r\) with weights \(cw_1, \dots, cw_r\)) and the pattern (\(f_1, \dots, f_r\) with weights \(dw_1, \dots, dw_r\))
\[ \Theta_1 = \sum_{i=1}^{r} cw_i\, (e_i \cdot f_i), \qquad \Theta_2 = \sum_{i=1}^{r} dw_i\, (e_i \cdot f_i) \]
The similarity of the input sequence and the pattern = \(\min(\Theta_1, \Theta_2)\)

The Ridge-Climbing Heuristic
Procedure:
• Compute the accumulated similarity values (ASVs) between the input sequence and all vocabulary sequences
• Keep track of all ASVs
• For each vocabulary sequence, check whether the ASV is monotonically increasing, and whether a maximum is reached
  • Yes: put this vocabulary into the candidates pool
• Choose the vocabulary from the candidates pool with the biggest maximal value
• Isolate the recognized stream

The Ridge-Climbing Heuristic
Assume the database only has three vocabulary sequences: like, yellow, and I.
[Figure: ASVs over time for like, yellow, and I against the input sequence. When the maximum is reached for like: Isolate! Reset the ASVs]
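The ridge-climbing procedure can be sketched as follows. This is a minimal illustration under stated assumptions, not the AIMS implementation: the stream and the vocabulary patterns are plain strings, `toy_similarity` is a hypothetical stand-in for the weighted-sum SVD measure, “a maximum is reached” is interpreted as “the ASV rose strictly since the last reset and then dropped (or the stream ended)”, and `min_peak` is an added threshold that keeps patterns that never matched out of the candidates pool.

```python
def toy_similarity(window, pattern):
    """Hypothetical stand-in for the similarity measure: fraction of
    matching positions, penalized when the window outgrows the pattern."""
    matches = sum(1 for a, b in zip(window, pattern) if a == b)
    extra = max(0, len(window) - len(pattern))
    return (matches - extra) / len(pattern)

def ridge_climb(stream, vocabulary, similarity, min_peak=0.0):
    """Isolate and recognize vocabulary patterns in a stream.

    Returns (word, start, end) triples; the sign is stream[start:end].
    """
    recognized = []
    start = 0
    while start < len(stream):
        history = {word: [] for word in vocabulary}  # ASVs since last reset
        winner = None
        for t in range(start, len(stream) + 1):
            at_end = t == len(stream)
            candidates = {}
            for word, pattern in vocabulary.items():
                h = history[word]
                if not at_end:
                    h.append(similarity(stream[start:t + 1], pattern))
                # The ridge: all values before the latest one keep rising...
                ridge = h if at_end else h[:-1]
                rising = all(a < b for a, b in zip(ridge, ridge[1:]))
                # ...and the latest value drops (or the stream has ended).
                dropped = at_end or (len(h) >= 2 and h[-1] < h[-2])
                if ridge and rising and dropped and ridge[-1] > min_peak:
                    candidates[word] = ridge[-1]
            if candidates:
                # Pick the candidate with the biggest maximal value.
                winner = max(candidates, key=candidates.get)
                recognized.append((winner, start, t))
                start = t  # isolate the sign and reset the ASVs
                break
        if winner is None:
            break  # nothing more can be recognized
    return recognized

vocabulary = {"like": "like", "yellow": "yellow", "I": "I"}
signs = ridge_climb("Ilikeyellow", vocabulary, toy_similarity)
print(signs)
```

With the three-word vocabulary from the slide, the sketch splits the unsegmented input into I, like, and yellow, which shows how isolation and recognition are resolved together.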