Databases

Smart Home Technologies
Data Management and Databases
Databases for Smart Homes



Requirements
Database Types
Database Technologies


Smart Home Databases
Data Mining
Data Storage Requirements

Sensor data








Temperature (15 @ 8 Kbps)
Humidity (15 @ 8 Kbps)
Gas (15 @ 8 Kbps)
Light (15 @ 8 Kbps)
Motion (15 @ 8 Kbps)
Pressure (100 @ 8 Kbps)
Microphone (15 @ 500 Kbps)
Camera (15 @ 10 Mbps)
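As a rough check on the scale involved, the sensor counts and per-sensor rates listed above can be summed. The sketch below assumes 1 Mbps = 1,000 Kbps, 1 GB = 10**9 bytes, and that every sensor streams continuously; the function names are illustrative only.

```python
# Sensor inventory from the list above: name -> (count, rate per sensor in Kbps).
# Assumption: 1 Mbps = 1,000 Kbps.
SENSORS = {
    "Temperature": (15, 8),
    "Humidity": (15, 8),
    "Gas": (15, 8),
    "Light": (15, 8),
    "Motion": (15, 8),
    "Pressure": (100, 8),
    "Microphone": (15, 500),
    "Camera": (15, 10_000),
}


def total_sensors(sensors):
    """Number of individual sensors across all types."""
    return sum(count for count, _ in sensors.values())


def aggregate_kbps(sensors):
    """Combined data rate in Kbps if every sensor streams at its listed rate."""
    return sum(count * rate for count, rate in sensors.values())


def gigabytes_per_day(kbps):
    """Raw volume per day, assuming continuous streaming and 1 GB = 10**9 bytes."""
    bits_per_day = kbps * 1_000 * 86_400
    return bits_per_day / 8 / 10**9


if __name__ == "__main__":
    rate = aggregate_kbps(SENSORS)
    print(total_sensors(SENSORS), "sensors,", rate, "Kbps")
    print(round(gigabytes_per_day(rate), 1), "GB per day if stored raw")
```

The cameras dominate the total, which is one reason the choice of what to store (raw, pre-processed, or compressed) matters so much.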
Data Storage Requirements

User data

Multimedia






Phone messages/conversations (500 Kbps – 10 Mbps)
Music (500 Kbps)
TV/Radio broadcasts (500 Kbps – 10 Mbps)
Home movies (10 Mbps)
Images
Computer



Programs
Data files
Operating systems
Data Storage Issues

Issues

Query frequency and type

Sampling/recording rates








205 sensors (158,900 Kbps)
Multimedia recordings
Simultaneous playback
Analysis, prediction, decision-making queries
Transaction granularity
Historical data, decay
Security and privacy
Centralized vs. distributed
What Data to Store

Type of Data




Raw data
Pre-processed
Compressed
Frequency of Data Storage for
Sensor Data

Tradeoff between precision and
quantity
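One possible way to trade precision for quantity is a deadband filter: store a reading only when it differs from the last stored value by more than a threshold. This is a minimal sketch of that idea, not a method taken from the slides; `deadband_filter` and the sample readings are hypothetical.

```python
def deadband_filter(readings, threshold):
    """Keep a (timestamp, value) reading only if its value differs from the
    last stored value by more than `threshold`. The first reading is always kept."""
    stored = []
    for timestamp, value in readings:
        if not stored or abs(value - stored[-1][1]) > threshold:
            stored.append((timestamp, value))
    return stored


if __name__ == "__main__":
    # Hypothetical temperature samples: (seconds since start, degrees F).
    temps = [(0, 70.0), (1, 70.1), (2, 70.2), (3, 71.0), (4, 71.1), (5, 69.5)]
    for threshold in (0.0, 0.5, 2.0):
        kept = deadband_filter(temps, threshold)
        print(f"threshold={threshold}: stored {len(kept)} of {len(temps)} readings")
```

A larger threshold stores fewer rows but loses small fluctuations, which may or may not matter to later analysis and prediction queries.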
Sensor Data Example

















9/8/2002 2:0:1 AM~A5 (Coffee Maker) ON
9/8/2002 1:6:59 AM~A9 (A/C) ON
9/8/2002 3:58:52 AM~A0 (Stereo) ON
9/8/2002 5:57:0 AM~A2 (Kitchen Light) ON
9/8/2002 3:1:42 AM~A5 (Coffee Maker) OFF
9/8/2002 7:8:3 AM~A3 (Stove) ON
9/8/2002 12:54:52 PM~A10 (Bathroom Light) ON
9/8/2002 4:58:5 AM~A0 (Stereo) OFF
9/8/2002 8:1:20 AM~A3 (Stove) OFF
9/8/2002 9:6:10 AM~A8 (Computer) ON
9/8/2002 10:8:19 AM~A4 (Bathtub Heater) ON
9/8/2002 11:9:4 AM~A0 (Stereo) ON
9/8/2002 9:4:5 AM~A8 (Computer) OFF
9/8/2002 10:9:4 AM~A4 (Bathtub Heater) OFF
9/8/2002 2:2:5 PM~A10 (Bathroom Light) OFF
9/8/2002 2:52:37 PM~A0 (Stereo) OFF
9/8/2002 4:2:0 PM~A9 (A/C) OFF
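Before such events can be stored or queried, each date and event string has to be parsed into typed fields. The sketch below assumes the date is month/day/year and that an event has the form "time~device-id (device name) ON|OFF", as in the sample above; `DeviceEvent` and `parse_event` are hypothetical names.

```python
import re
from datetime import datetime
from typing import NamedTuple


class DeviceEvent(NamedTuple):
    timestamp: datetime
    device_id: str
    device_name: str
    state: str


# "<h:m:s AM|PM>~<device id> (<device name>) <ON|OFF>"; fields are not zero-padded.
EVENT_PATTERN = re.compile(
    r"^(\d{1,2}:\d{1,2}:\d{1,2} [AP]M)~(\w+) \((.+)\) (ON|OFF)$"
)


def parse_event(date_text, event_text):
    """Parse one date string and one event string into a DeviceEvent.

    Assumption: dates are month/day/year (the sample data does not say).
    """
    match = EVENT_PATTERN.match(event_text.strip())
    if match is None:
        raise ValueError(f"unrecognized event: {event_text!r}")
    time_text, device_id, device_name, state = match.groups()
    timestamp = datetime.strptime(
        f"{date_text.strip()} {time_text}", "%m/%d/%Y %I:%M:%S %p"
    )
    return DeviceEvent(timestamp, device_id, device_name, state)


if __name__ == "__main__":
    print(parse_event("9/8/2002", "1:6:59 AM~A9 (A/C) ON"))
```

Once parsed, the events can be sorted by timestamp; the sample log is not in chronological order.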
Media Viewing Example
Watching Events
Date | Day | Mood | Start | End | Device | Program name | Type | Comments | Others | Rating
020302 | Su | normal | 1330 | 1600 | T | nba basketball | sports | dallas mavericks go team | none | 5
020302 | Su | normal | 1700 | 2100 | t | super bowl | sports | gotta watch the commercials | Dad | 5
020402 | m | normal | 1900 | 2000 | t | boston public | drama | hot teachers | none | 5
020402 | m | normal | 2000 | 2100 | t | ally mcbeal | drama | funny lawyers | none | 4
020402 | m | normal | 2300 | 100 | V | WWF RAW | wrestling | testosterone | none | 5
020502 | t | normal | 2100 | 2200 | t | philly | drama | hot lawyers | none | 4
020602 | w | bored | 1830 | 2200 | t | nba basketball | sports | GO MAVS | none | 5
020702 | th | tired | 1900 | 2100 | t | wwf smackdown | wrestling | its me soap | none | 5
020702 | th | tired | 2100 | 2200 | t | ER | drama | good show | none | 4
020802 | f | excited | 1900 | 2230 | t | olympics | sports | gotta watch | none | 4
020902 | sa | excited | 1900 | 2230 | t | olympics | sports | gotta watch | none | 4
021002 | su | ecstatic | 1500 | 1800 | t | NBA allstar game | sports | gotta see what happens | none | 3
012802 | M | normal | 1900 | 2000 | T | Boston Public | Drama | hot chicks teaching | none | 5
012802 | M | normal | 2000 | 2100 | T | Ally McBeal | Drama | hot chicks lawyering | none | 5
Multimedia Example

Digital Silhouettes (Predictive Networks)

Predicting web surfing behavior ($$$)


Microsoft (2002) tracks TV viewing preferences
140 data items for each user
Demographics (50): subcategories within gender, age, income, education, occupation, and race
Content preferences (90): e.g., golf, music, yoga
Database Types / Data Models






Relational
OO
Hybrid (Object-Relational)
Temporal
Deductive
Others

Spatial, …
Example Data Representations

Relational


We all know…flat tables of atomic attributes with
foreign key relationships
OO

Complex data representations: multivalued, composite
Temporal


Relational model: add valid start, end dates to
each table (versions of info and when valid)
Includes time, events, durations…
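A minimal sketch of the valid-time idea on a relational table, using Python's built-in sqlite3 module. The `thermostat_setting` table, its columns, and the sample rows are hypothetical; the sketch assumes ISO-format date strings and half-open validity periods (start inclusive, end exclusive).

```python
import sqlite3


def build_db():
    """Create an in-memory table in which each row version carries its validity period."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """CREATE TABLE thermostat_setting (
               room        TEXT    NOT NULL,
               target_temp INTEGER NOT NULL,
               valid_start TEXT    NOT NULL,  -- ISO date, inclusive
               valid_end   TEXT    NOT NULL   -- ISO date, exclusive
           )"""
    )
    conn.executemany(
        "INSERT INTO thermostat_setting VALUES (?, ?, ?, ?)",
        [
            ("living room", 68, "2002-01-01", "2002-06-01"),
            ("living room", 74, "2002-06-01", "2002-10-01"),
            ("living room", 70, "2002-10-01", "9999-12-31"),  # open-ended version
        ],
    )
    return conn


def setting_on(conn, room, day):
    """Return the target temperature that was valid for `room` on `day`, or None."""
    row = conn.execute(
        "SELECT target_temp FROM thermostat_setting "
        "WHERE room = ? AND valid_start <= ? AND ? < valid_end",
        (room, day, day),
    ).fetchone()
    return None if row is None else row[0]


if __name__ == "__main__":
    conn = build_db()
    print(setting_on(conn, "living room", "2002-09-08"))
```

Old versions are never overwritten; an update closes the current row's period and inserts a new row, so historical queries remain answerable.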
Operations

DDL/DML (data def/manip languages)



SQL
OQL
Update operations


Built-in insert, delete, update
Stored procedures for triggers, active
(ECA) rules
Example Operations for
Temporal Databases

INCLUDES



Rows valid in a certain time period
BEFORE/AFTER a time condition
Set operations

Union, intersection of 2 time periods
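The operations above can be sketched on a simple time-period type. Exact semantics differ between temporal query languages; this sketch assumes half-open periods over integer timestamps, and the `Period` class and its method names are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Period:
    start: int  # inclusive
    end: int    # exclusive

    def includes(self, other: "Period") -> bool:
        """True if `other` lies entirely within this period."""
        return self.start <= other.start and other.end <= self.end

    def before(self, t: int) -> bool:
        """True if the whole period ends at or before time t."""
        return self.end <= t

    def after(self, t: int) -> bool:
        """True if the whole period starts after time t."""
        return self.start > t

    def intersection(self, other: "Period") -> Optional["Period"]:
        """Overlap of the two periods, or None if they do not overlap."""
        start, end = max(self.start, other.start), min(self.end, other.end)
        return Period(start, end) if start < end else None

    def union(self, other: "Period") -> Optional["Period"]:
        """Single covering period if the two overlap or touch, else None."""
        if max(self.start, other.start) <= min(self.end, other.end):
            return Period(min(self.start, other.start), max(self.end, other.end))
        return None


if __name__ == "__main__":
    morning, breakfast = Period(6, 12), Period(7, 8)
    print(morning.includes(breakfast), morning.intersection(Period(10, 14)))
```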
Active DB

Event-Condition-Action rules


Relational


Allow for decisions to be made in the database
instead of a separate application
Implemented as triggers
Challenges

Rule consistency


(2+ rules do not contradict)
Guaranteed termination

Trigger loops (T1 <->T2)
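One common way to check for possible trigger loops is to build a graph of which rule can fire which other rule and look for a cycle. The sketch below uses depth-first search for that; it is a conservative static check (a cycle means a loop is possible, not certain), and `find_trigger_loop` and the rule names are hypothetical.

```python
def find_trigger_loop(triggers):
    """Return one cycle as a list of rule names (first name repeated at the end),
    or None if no rule can transitively re-trigger itself.

    `triggers` maps each rule name to the rules its action may fire.
    """
    WHITE, GRAY, BLACK = 0, 1, 2  # unvisited, on current path, finished
    color = {}
    path = []

    def visit(rule):
        color[rule] = GRAY
        path.append(rule)
        for nxt in triggers.get(rule, ()):
            state = color.get(nxt, WHITE)
            if state == GRAY:
                # nxt is already on the current path: found a loop.
                return path[path.index(nxt):] + [nxt]
            if state == WHITE:
                cycle = visit(nxt)
                if cycle:
                    return cycle
        path.pop()
        color[rule] = BLACK
        return None

    for rule in triggers:
        if color.get(rule, WHITE) == WHITE:
            cycle = visit(rule)
            if cycle:
                return cycle
    return None


if __name__ == "__main__":
    print(find_trigger_loop({"T1": ["T2"], "T2": ["T1"]}))
    print(find_trigger_loop({"T1": ["T2"], "T2": ["T3"], "T3": []}))
```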
Smart Home Active DB Example


Java, Postgres, Jess rules
Event classification (local & composite)

Data Manipulation Events: TV show being viewed (channel, time, genre…)
Temporal Events (instance, recurring): Set temp to 70 degrees at 7:00am workdays
Exception Events: Power failure
Behavioral Events: Time children home from school; dinner time
Active DB Example (TCU)
Title | Event | Condition | Action
TV View Menu | TV turned on | Molly is holding remote | Display shows matching Molly’s preferences
Entry Lighting | Inhabitant enters house | Light level < threshold | Adjust lighting to predetermined level
Aromatherapy | Every Friday night when Hanna sits on sofa | Always | Release aroma
Night Idle | John on sofa idle > 15 minutes, TV & lights are on | No other inhabitant in room | Turn off all devices in the room
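To make the Event-Condition-Action structure concrete, here is a small in-memory sketch of two of the rules above. It is not the TCU implementation (which the slides describe as Java, Postgres, and Jess rules); the event names, the threshold, the preset level, and the `dispatch` function are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

State = Dict[str, object]


@dataclass
class Rule:
    title: str
    event: str                           # event name that wakes the rule
    condition: Callable[[State], bool]   # checked against the current home state
    action: Callable[[State], None]      # runs only if the condition holds


def dispatch(event: str, state: State, rules: List[Rule]) -> List[str]:
    """Run every rule registered for `event` whose condition holds; return their titles."""
    fired = []
    for rule in rules:
        if rule.event == event and rule.condition(state):
            rule.action(state)
            fired.append(rule.title)
    return fired


LIGHT_THRESHOLD = 30  # hypothetical units
PRESET_LEVEL = 75     # hypothetical predetermined level


def set_preset_lighting(state: State) -> None:
    state["light_level"] = PRESET_LEVEL


def release_aroma(state: State) -> None:
    state["aroma_released"] = True


RULES = [
    Rule("Entry Lighting", "inhabitant_enters_house",
         lambda s: s["light_level"] < LIGHT_THRESHOLD, set_preset_lighting),
    Rule("Aromatherapy", "hanna_sits_on_sofa_friday_night",
         lambda s: True, release_aroma),  # condition "Always"
]

if __name__ == "__main__":
    home = {"light_level": 10}
    print(dispatch("inhabitant_enters_house", home, RULES), home)
```

In an active database the same structure would typically live in triggers, so the decision is made where the data is rather than in a separate application.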
Distributed vs. Centralized

Centralized database can produce a
bottleneck




Large volume of data input
Large database
Large volume of queries
In distributed databases, data consistency,
replication, and retrieval can be more
problematic



Consistency of schemas
Retrieval in case the data location is not known
Communication overhead to ensure database
consistency
SmartHome
Database Architecture

Centralized vs. distributed?







Answer: Both
Central storage of high demand, persistent data
Distributed storage of low demand, dynamic data
Distributed queries
Push processing toward sensors
Adaptive, hierarchical organization
End-effector autonomy (“smart sensor”)
Database Systems

Commercial







DB2
Empress
Informix
Oracle
MS Access
MS SQL
Sybase

Free



Berkeley DB
PostgreSQL
MySQL
UTA MavHome DB

Active



Reactive & proactive (e.g., to predict)
Distributed
Information collection agents

Rules




Local Agent: what data they need to collect
Distributed: coordinate overall monitoring of collected
information
Continuous monitoring of events
Extension of SNOOP
Microsoft Easy Living DB
(2002)

Relational: fast & robust, but awkward for some data

World Model DB describes:

Computing devices
People and their personal preferences/settings
Services
Rooms and doorways
Serves as an abstraction layer between sensors and the applications that use data from sensors

e.g., new sensors -> no change to applications
Stanford Interactive
Workspace

Uses LORE

A semi-structured XML DB system


Still available, but work stopped in 2000
Data stored is a catalog of (index to) documents, images, 3-D models, application-specific domain models
Sensor Database Systems

COUGAR project





www.cs.cornell.edu/database/cougar
Query processing over ad-hoc sensor
networks
Small database component (QueryProxy) at
each sensor
Sensor clusters provide local aggregations
(e.g., min, max, mean)
Assumes centralized index of all data sources
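The idea of local aggregation can be illustrated with partial aggregates that each cluster computes and a parent then merges. This is a sketch of the general technique only, not COUGAR's actual interface; `Partial`, `summarize`, and `merge` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Partial:
    """Summary a sensor cluster can ship upstream instead of raw readings."""
    minimum: float
    maximum: float
    total: float
    count: int

    @property
    def mean(self) -> float:
        return self.total / self.count


def summarize(readings: Sequence[float]) -> Partial:
    """Aggregate one cluster's readings locally."""
    if not readings:
        raise ValueError("cannot summarize an empty set of readings")
    return Partial(min(readings), max(readings), sum(readings), len(readings))


def merge(a: Partial, b: Partial) -> Partial:
    """Combine two cluster summaries without access to the raw readings."""
    return Partial(
        min(a.minimum, b.minimum),
        max(a.maximum, b.maximum),
        a.total + b.total,
        a.count + b.count,
    )


if __name__ == "__main__":
    kitchen = summarize([68.0, 70.0, 72.0])
    bedroom = summarize([64.0, 66.0])
    house = merge(kitchen, bedroom)
    print(house.minimum, house.maximum, house.mean)
```

Carrying the sum and count, rather than each cluster's mean, is what lets the overall mean be computed correctly after merging.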
Siemens Netabase

“The network is the database.”


Sensor networks




Navas and Wynblatt, ACM SIGMOD 2001
Large number of data sources (10^5)
Volatile data and data organization
“Thin” data servers on scaled-down hardware
Netabase approach




Query decomposition
Characteristic routing (à la IP routing)
Local joins
Query evaluation
Siemens Netabase

www.netabasesoftware.com
Data Warehouses

Repositories for data mining activities



Aggregates/summaries of data help efficiency
Optimized for decision-support, not
transaction processing
Definition (Elmasri, page 900)

“A subject-oriented, integrated, non-volatile, time-variant collection of data in support of management’s decisions”

Replace “management” with “smart home agents”
Warehouse Properties





Very large: 100 gigabytes to many terabytes
Tends to include historical data
Workload: mostly complex queries that access lots of data, and
do many scans, joins, aggregations. Tend to look for "the big
picture".
Updates pumped to warehouse in batches (overnight)
Data may be heavily summarized and/or consolidated in
advance (must be done in batches too, must finish overnight).

Research work has been done (e.g. "materialized views") -- a small
piece of the problem.
02.15.04 from http://redbook.cs.berkeley.edu/lec28.html
Data Warehouses

Data Cleaning




Data Migration: simple transformation rules (replace "gender" with "sex")
Data Scrubbing: use domain-specific knowledge (e.g. zip codes) to modify
data. Try parsing and fuzzy matching from multiple sources.
Data Auditing: discover rules and relationships (or signal violations thereof).
Not unlike data mining.
Data Loading



can take a very long time! (Sorting, indexing, summarization, integrity
constraint checking, etc.) Parallelism a must.
Full load: like one big transaction; the change from old data to new is atomic.
Incremental loading ("refresh") makes sense for big warehouses, but
transaction model is more complex – have to break the load into lots of
transactions, and commit them periodically to avoid locking
everything. Need to be careful to keep metadata & indices consistent along
the way.
Data Mining Definition


Discovery of new information in terms of patterns or
rules from vast amounts of data
Extracts patterns that can’t readily be found by
asking the right questions (queries)


TOO MUCH DATA FOR HUMANS
Emerged from



Artificial Intelligence: machine learning, neural nets, genetic algorithms
Statistics
Operations Research
Data Mining Steps


Data selection: pick the data needed
Data cleansing: fix bad data (e.g., spelling, zip codes); hard to deal with missing, erroneous, conflicting, redundant data
Enrichment: add data (e.g., age, gender, income)
Data transformation: aggregate (e.g., zip codes -> regions)
Data mining
Reporting on discovered knowledge
Types of Results

Association rules: buy diapers -> buy lots of beer
Sequential patterns: buy house -> buy furniture within months
Classification trees: types of buyers (upscale, bargain-conscious, …)
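For association rules such as the diapers and beer example, two standard measures are support (how often the items appear together) and confidence (how often the consequent appears given the antecedent). A minimal sketch with made-up baskets; the data and function names are hypothetical.

```python
def support(transactions, items):
    """Fraction of transactions that contain every item in `items`."""
    wanted = set(items)
    if not transactions:
        return 0.0
    return sum(1 for t in transactions if wanted <= set(t)) / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Of the transactions containing `antecedent`, the fraction that also
    contain `consequent`. Returns 0.0 if the antecedent never occurs."""
    base = support(transactions, antecedent)
    if base == 0:
        return 0.0
    return support(transactions, set(antecedent) | set(consequent)) / base


if __name__ == "__main__":
    # Hypothetical market baskets.
    baskets = [
        {"diapers", "beer"},
        {"diapers", "beer", "chips"},
        {"diapers", "milk"},
        {"milk", "bread"},
    ]
    print("support:", support(baskets, {"diapers", "beer"}))
    print("confidence:", confidence(baskets, {"diapers"}, {"beer"}))
```

A mining algorithm would typically report only rules whose support and confidence both exceed user-chosen minimums.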
Why do it?


Make more money
Science & medicine
Data Mining Goals


Find patterns to predict future events
Find major groupings


Groupings of buyers, stars, diseases …
Find which group something belongs to

creditworthiness
Data Mining Results







Association rules
Classification hierarchies
Clustering
Sequential patterns
Patterns within time series
Type of result, inputs & algorithms vary
Often interested in some combination of
these types of Knowledge
Clustering

Unsupervised learning techniques






Training samples are unclassified
Vs. supervised learning (classification)
Drug categories for depression
Categories of TV viewers
Categories of buyers (likely, unlikely)
Categories of households?

Single male, mother/children, conventional
(M/D/kids), DINKs.
Sequential Patterns


Detecting associations among events
with certain temporal relationships
Example:




Cardiac bypass for blocked arteries
AND within 18 months, high blood urea
THEN kidney failure likely in next 18
months
Particularly important in smart homes
Sequential Pattern Discovery

Sequence of itemsets

Grocery store purchases by 1 person
(3 itemsets)


{soy milk, bread, chocolate}, {bananas,
chocolate}, {lettuce, tomato, chocolate}
2 Subsequences

(1) {soy milk, bread, chocolate}, {bananas, chocolate}
(2) {bananas, chocolate}, {lettuce, tomato, chocolate}
Sequential Pattern Discovery

The support for a sequence S is the percentage of sequences in the given set U that contain S as a subsequence.

That is: in what fraction of the sequences does S show up?
Find all subsequences from the given sequence sets that have at least a user-defined minimum support.
The sequence S1, S2, …, Sn predicts that a customer who buys itemset S1 is likely to buy itemset S2, then S3, …
Prediction support is based on the frequency of this sequence in the past
Creating good algorithms raises many research issues
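The support computation can be sketched directly. This assumes the usual containment definition from the sequential-pattern literature (each itemset of the pattern must be a subset of a distinct itemset of the sequence, in order), which is somewhat more general than the exact, contiguous subsequences shown in the grocery example; the extra shopper histories are made up.

```python
def is_subsequence(pattern, sequence):
    """True if the itemsets of `pattern` can be matched, in order, to distinct
    itemsets of `sequence` such that each pattern itemset is a subset of its match."""
    position = 0
    for itemset in pattern:
        # Greedily advance to the earliest remaining itemset that contains this one.
        while position < len(sequence) and not set(itemset) <= set(sequence[position]):
            position += 1
        if position == len(sequence):
            return False
        position += 1
    return True


def support(pattern, sequences):
    """Fraction of the sequences in which `pattern` occurs as a subsequence."""
    if not sequences:
        return 0.0
    return sum(is_subsequence(pattern, s) for s in sequences) / len(sequences)


def frequent(patterns, sequences, min_support):
    """Keep only the candidate patterns that reach the minimum support."""
    return [p for p in patterns if support(p, sequences) >= min_support]


if __name__ == "__main__":
    shoppers = [
        [{"soy milk", "bread", "chocolate"}, {"bananas", "chocolate"},
         {"lettuce", "tomato", "chocolate"}],
        [{"bananas"}, {"chocolate"}],   # hypothetical second shopper
        [{"bread"}],                    # hypothetical third shopper
    ]
    print(support([{"bananas"}], shoppers))
    print(support([{"bananas", "chocolate"}, {"lettuce", "tomato", "chocolate"}], shoppers))
```

This brute-force check only scores candidate patterns it is given; generating good candidates efficiently is where most of the algorithmic difficulty lies.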
Patterns Within Time Series

Finding 2 patterns that occur over time



2003 stock prices of Choice Homes and
Home Depot
2 products show same sales pattern in
summer but different one in winter
Solar magnetic wind patterns may predict
earth atmospheric changes
Time Series Pattern Discovery

Time series are sequences of events




Event could be a transaction (closing daily
stock price)
Look at sequences over n days, or
Longest period in which change is no
greater than 1%
Comparing

Must define similarity measures
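Two of the ideas above can be sketched briefly: finding the longest stretch in which change stays within 1%, and comparing two series with a similarity measure. The sketch assumes "change" means the relative change between consecutive observations and uses plain Euclidean distance as one possible measure; both are assumptions, and other definitions are equally reasonable.

```python
import math


def longest_stable_run(prices, max_change=0.01):
    """Length, in observations, of the longest run in which each step-to-step
    relative change is at most `max_change` (0.01 = 1%)."""
    if not prices:
        return 0
    best = current = 1
    for previous, value in zip(prices, prices[1:]):
        if previous != 0 and abs(value - previous) / abs(previous) <= max_change:
            current += 1
        else:
            current = 1
        best = max(best, current)
    return best


def euclidean_distance(a, b):
    """Distance between two equal-length series; smaller means more similar."""
    if len(a) != len(b):
        raise ValueError("series must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


if __name__ == "__main__":
    closing = [100.0, 100.5, 101.0, 105.0, 105.2]  # hypothetical daily closes
    print(longest_stable_run(closing))
    print(euclidean_distance([0.0, 0.0], [3.0, 4.0]))
```

In practice series are often normalized before comparison so that two stocks with the same shape but different price levels count as similar.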
Other Approaches in Data Mining

Neural nets

Infer a function from a set of examples




Supervised & unsupervised algorithms
Capabilities



Non-parametric curve-fitting
Interpolates to solve new problems
classification
time-series prediction
Disadvantages

can’t see what it learned (not declarative)
Other Approaches in Data Mining

Genetic algorithms

Set up





Representation (strings over an alphabet)
Evaluation (fitness) function
Parameters: # of generations, cross-over rate,
mutation rate, etc.
Randomized (probabilistic operators),
parallel search over search space
Used for problem solving and clustering
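The set-up above maps onto a very small program. The sketch below evolves bit strings toward all ones (a toy fitness function), using tournament selection and elitism; those two choices, the parameter values, and all names are assumptions for illustration rather than anything prescribed by the slides.

```python
import random


def fitness(bits):
    """Toy evaluation function: the number of '1' characters in the string."""
    return bits.count("1")


def crossover(a, b, rng):
    """Single-point crossover of two equal-length strings (length >= 2)."""
    point = rng.randrange(1, len(a))
    return a[:point] + b[point:]


def mutate(bits, rate, rng):
    """Flip each bit independently with probability `rate`."""
    return "".join(
        ("1" if c == "0" else "0") if rng.random() < rate else c for c in bits
    )


def evolve(length=20, population_size=30, generations=40,
           crossover_rate=0.7, mutation_rate=0.02, seed=0):
    """Return (best string found, best fitness recorded per generation)."""
    rng = random.Random(seed)
    population = ["".join(rng.choice("01") for _ in range(length))
                  for _ in range(population_size)]
    history = [max(map(fitness, population))]
    for _ in range(generations):
        # Elitism: carry the current best forward unchanged.
        children = [max(population, key=fitness)]
        while len(children) < population_size:
            # Tournament selection: the fitter of two random individuals.
            a = max(rng.sample(population, 2), key=fitness)
            b = max(rng.sample(population, 2), key=fitness)
            child = crossover(a, b, rng) if rng.random() < crossover_rate else a
            children.append(mutate(child, mutation_rate, rng))
        population = children
        history.append(max(map(fitness, population)))
    return max(population, key=fitness), history


if __name__ == "__main__":
    best, history = evolve()
    print(best, fitness(best))
```

The population is what makes the search parallel: many candidate solutions are evaluated and recombined in each generation instead of refining a single one.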