Smart Home Technologies
Data Management and Databases

Databases for Smart Homes
  Requirements
  Database Types
  Database Technologies
  Smart Home Databases
  Data Mining

Data Storage Requirements
  Sensor data (number of sensors @ data rate per sensor)
    Temperature (15 @ 8 Kbps)
    Humidity (15 @ 8 Kbps)
    Gas (15 @ 8 Kbps)
    Light (15 @ 8 Kbps)
    Motion (15 @ 8 Kbps)
    Pressure (100 @ 8 Kbps)
    Microphone (15 @ 500 Kbps)
    Camera (15 @ 10 Mbps)

Data Storage Requirements
  User data
    Multimedia
      Phone messages/conversations (500 Kbps – 10 Mbps)
      Music (500 Kbps)
      TV/Radio broadcasts (500 Kbps – 10 Mbps)
      Home movies (10 Mbps)
      Images
    Computer
      Programs
      Data files
      Operating systems

Data Storage Issues
  Issues
    Query frequency and type
      Sampling/recording rates
        205 sensors (158,900 Kbps)
      Multimedia recordings
      Simultaneous playback
      Analysis, prediction, decision-making queries
    Transaction granularity
    Historical data, decay
    Security and privacy
    Centralized vs. distributed

What Data to Store
  Type of Data
    Raw data
    Pre-processed
    Compressed
  Frequency of Data Storage for Sensor Data
    Tradeoff between precision and quantity

Sensor Data Example
  9/8/2002 2:0:1 AM~A5 (Coffee Maker) ON
  9/8/2002 1:6:59 AM~A9 (A/C) ON
  9/8/2002 3:58:52 AM~A0 (Stereo) ON
  9/8/2002 5:57:0 AM~A2 (Kitchen Light) ON
  9/8/2002 3:1:42 AM~A5 (Coffee Maker) OFF
  9/8/2002 7:8:3 AM~A3 (Stove) ON
  9/8/2002 12:54:52 PM~A10 (Bathroom Light) ON
  9/8/2002 4:58:5 AM~A0 (Stereo) OFF
  9/8/2002 8:1:20 AM~A3 (Stove) OFF
  9/8/2002 9:6:10 AM~A8 (Computer) ON
  9/8/2002 10:8:19 AM~A4 (Bathtub Heater) ON
  9/8/2002 11:9:4 AM~A0 (Stereo) ON
  9/8/2002 9:4:5 AM~A8 (Computer) OFF
  9/8/2002 10:9:4 AM~A4 (Bathtub Heater) OFF
  9/8/2002 2:2:5 PM~A10 (Bathroom Light) OFF
  9/8/2002 2:52:37 PM~A0 (Stereo) OFF
  9/8/2002 4:2:0 PM~A9 (A/C) OFF

Media Viewing Example
  Watching Events
  Date   | Day | Mood     | Start | End  | Device | Program name     | Type      | Comments                    | Others | Rating
  020302 | Su  | normal   | 1330  | 1600 | T      | nba basketball   | sports    | dallas mavericks go team    | none   | 5
  020302 | Su  | normal   | 1700  | 2100 | t      | super bowl       | sports    | gotta watch the commercials | Dad    | 5
  020402 | m   | normal   | 1900  | 2000 | t      | boston public    | drama     | hot teachers                | none   | 5
  020402 | m   | normal   | 2000  | 2100 | t      | ally mcbeal      | drama     | funny lawyers               | none   | 4
  020402 | m   | normal   | 2300  | 100  | V      | WWF RAW          | wrestling | testosterone                | none   | 5
  020502 | t   | normal   | 2100  | 2200 | t      | philly           | drama     | hot lawyers                 | none   | 4
  020602 | w   | bored    | 1830  | 2200 | t      | nba basketball   | sports    | GO MAVS                     | none   | 5
  020702 | th  | tired    | 1900  | 2100 | t      | wwf smackdown    | wrestling | its me soap                 | none   | 5
  020702 | th  | tired    | 2100  | 2200 | t      | ER               | drama     | good show                   | none   | 4
  020802 | f   | excited  | 1900  | 2230 | t      | olympics         | sports    | gotta watch                 | none   | 4
  020902 | sa  | excited  | 1900  | 2230 | t      | olympics         | sports    | gotta watch                 | none   | 4
  021002 | su  | ecstatic | 1500  | 1800 | t      | NBA allstar game | sports    | gotta see what happens      | none   | 3
  012802 | M   | normal   | 1900  | 2000 | T      | Boston Public    | Drama     | hot chicks teaching         | none   | 5
  012802 | M   | normal   | 2000  | 2100 | T      | Ally McBeal      | Drama     | hot chicks lawyering        | none   | 5

Multimedia Example
  Digital Silhouettes (Predictive Networks)
    Predicting web surfing behavior ($$$)
  Microsoft (2002): track TV viewing preferences
    140 data items for each user
      Demographics (50)
        Subcategories within gender, age, income, education, occupation, and race
      Content preferences (90)
        golf, music, yoga

Database Types / Data Models
  Relational
  OO
  Hybrid (Object-Relational)
  Temporal
  Deductive
  Others
    Spatial, …

Example Data Representations
  Relational
    We all know… flat tables of atomic attributes with foreign key relationships
  OO
    Complex data representations: multivalued, composite
  Temporal
    Relational model: add valid start and end dates to each table (versions of info and when valid)
    Includes time, events, durations…

Operations
  DDL/DML (data definition/manipulation languages)
    SQL
    OQL
  Update operations
    Built-in insert, delete, update
    Stored procedures for triggers, active (ECA) rules

Example Operations for Temporal Databases
  INCLUDES
    Rows valid in a certain time period
  BEFORE/AFTER a time condition
  Set operations
    Union, intersection of 2 time periods

Active DB
  Event-Condition-Action rules
  Relational
  Allow for decisions to be made in the database instead of a separate application
  Implemented as triggers
  Challenges
    Rule consistency (2+ rules do not contradict)
    Guaranteed termination
      Trigger loops (T1 <-> T2)

Smart Home Active DB Example
  Java, Postgres, Jess rules
  Event classification (local & composite)
    Data Manipulation Events
    Temporal Events (instance, recurring)
      Set temp to 70 degrees at 7:00am workdays
    Exception Events
      TV show being viewed (channel, time, genre…)
      Power failure
    Behavioral Events
      Time children home from school; dinner time

Active DB Example (TCU)
  Title          | Event                                             | Condition                   | Action
  TV View Menu   | TV turned on                                      | Molly is holding remote     | Display shows matching Molly’s preferences
  Entry Lighting | Inhabitant enters house                           | Light level < threshold     | Adjust lighting to predetermined level
  Aromatherapy   | Every Friday night when Hanna sits on sofa        | Always                      | Release aroma
  Night Idle     | John on sofa idle > 15 minutes, TV & lights are on | No other inhabitant in room | Turn off all devices in the room

Distributed vs. Centralized
  Centralized database can produce a bottleneck
    Large volume of data input
    Large database
    Large volume of queries
  In distributed databases, data consistency, replication, and retrieval can be more problematic
    Consistency of schemas
    Retrieval in case the data location is not known
    Communication overhead to ensure database consistency

SmartHome Database Architecture
  Centralized vs. distributed?
  Answer: Both
    Central storage of high demand, persistent data
    Distributed storage of low demand, dynamic data
  Distributed queries
    Push processing toward sensors
  Adaptive, hierarchical organization
  End-effector autonomy (“smart sensor”)

Database Systems
  Commercial
    DB2
    Empress
    Informix
    Oracle
    MS Access
    MS SQL
    Sybase
  Free
    Berkeley DB
    PostgreSQL
    MySQL

UTA MavHome DB
  Active
    Reactive & proactive (e.g., to predict)
  Distributed
    Information collection agents
  Rules
    Local agent: what data they need to collect
    Distributed: coordinate overall monitoring of collected information
  Continuous monitoring of events
  Extension of SNOOP

Microsoft Easy Living DB (2002)
  Relational World Model DB
    Fast & robust, but awkward for some data
  Describes:
    Computing devices
    People and their personal preferences/settings
    Services
    Rooms and doorways
  Serves as an abstraction layer between sensors and the applications that use data from sensors
    e.g., new sensors require no change to applications

Stanford Interactive Workspace
  Uses LORE
    A semi-structured XML DB system
    Still available, but work stopped in 2000
  Data stored is a catalog of (index to) documents, images, 3-D models, application-specific domain models

Sensor Database Systems
  COUGAR project
    www.cs.cornell.edu/database/cougar
    Query processing over ad-hoc sensor networks
    Small database component (QueryProxy) at each sensor
    Sensor clusters provide local aggregations (e.g., min, max, mean)
    Assumes centralized index of all data sources

Siemens Netabase
  “The network is the database.”
  Sensor networks
  Navas and Wynblatt, ACM SIGMOD 2001
  Large number of data sources (10^5)
  Volatile data and data organization
  “Thin” data servers on scaled-down hardware
  Netabase approach
    Query decomposition
    Characteristic routing (à la IP routing)
    Local joins
    Query evaluation

Siemens Netabase
  www.netabasesoftware.com

Data Warehouses
  Repositories for data mining activities
  Aggregates/summaries of data help efficiency
  Optimized for decision-support, not transaction processing
  Definition
    (Elmasri, page 900) “A subject-oriented, integrated, non-volatile, time-variant collection of data in support of management’s decisions”
    Replace “management” with “smart home agents”

Warehouse Properties
  Very large: 100 gigabytes to many terabytes
  Tends to include historical data
  Workload: mostly complex queries that access lots of data and do many scans, joins, and aggregations. Tend to look for "the big picture".
  Updates pumped to warehouse in batches (overnight)
  Data may be heavily summarized and/or consolidated in advance (must be done in batches too, must finish overnight). Research work has been done (e.g. "materialized views"), but it covers only a small piece of the problem.
  02.15.04 from http://redbook.cs.berkeley.edu/lec28.html

Data Warehouses
  Data Cleaning
    Data Migration: simple transformation rules (replace "gender" with "sex")
    Data Scrubbing: use domain-specific knowledge (e.g. zip codes) to modify data. Try parsing and fuzzy matching from multiple sources.
    Data Auditing: discover rules and relationships (or signal violations thereof). Not unlike data mining.
  Data Loading
    Can take a very long time! (Sorting, indexing, summarization, integrity constraint checking, etc.) Parallelism a must.
    Full load: like one big transaction; the change from old data to new is atomic.
    Incremental loading ("refresh") makes sense for big warehouses, but the transaction model is more complex: the load has to be broken into lots of transactions that are committed periodically to avoid locking everything. Need to be careful to keep metadata & indices consistent along the way.
  02.15.04 from http://redbook.cs.berkeley.edu/lec28.html

Data Warehouses
  02.15.04 from http://redbook.cs.berkeley.edu/lec28.html

Data Mining
  Definition
    Discovery of new information in terms of patterns or rules from vast amounts of data
    Extracts patterns that can’t readily be found by asking the right questions (queries)
    TOO MUCH DATA FOR HUMANS
  Emerged from
    Artificial Intelligence: machine learning, neural nets, genetic algorithms
    Statistics
    Operations Research

Data Mining Steps
  Data selection: pick the data needed
  Data cleansing
    Fix bad data (e.g., spelling, zip codes)
    Hard to deal with missing, erroneous, conflicting, redundant data
  Enrichment
    Add data (e.g., age, gender, income)
  Data transformation
    Aggregate (e.g., zip codes into regions)
  Data mining
  Reporting on discovered knowledge

Types of Results
  Association rules
    Buy diapers, buy lots of beer
  Sequential patterns
    Buy house, buy furniture within months
  Classification trees
    Types of buyers (upscale, bargain-conscious, …)
  Why do it?
    Make more money
    Science & medicine

Data Mining Goals
  Find patterns to predict future events
  Find major groupings
    Groupings of buyers, stars, diseases …
  Find which group something belongs to
    Creditworthiness

Data Mining Results
  Association rules
  Classification hierarchies
  Clustering
  Sequential patterns
  Patterns within time series
  Type of result, inputs & algorithms vary
  Often interested in some combination of these types of knowledge

Clustering
  Unsupervised learning techniques
    Training samples are unclassified
    Vs. supervised learning (classification)
  Drug categories for depression
  Categories of TV viewers
  Categories of buyers (likely, unlikely)
  Categories of households?
    Single male, mother/children, conventional (M/D/kids), DINKs.
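
As a rough illustration of the unsupervised idea on the Clustering slide, the sketch below runs a plain k-means loop over a handful of made-up TV viewers. K-means is only one of many clustering techniques and the slides do not name a specific one. The feature pairs (weekly hours of sports, weekly hours of drama), the `kmeans` helper, and the starting centroids are all assumptions chosen for the example.

```python
from math import dist  # Euclidean distance, Python 3.8+


def kmeans(points, centroids, max_iters=100):
    """Plain k-means; `centroids` are caller-chosen starting centres."""
    centroids = [tuple(c) for c in centroids]
    clusters = [[] for _ in centroids]
    for _ in range(max_iters):
        # Assignment step: each point joins its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its members
        # (an empty cluster keeps its old centroid).
        new_centroids = [
            tuple(sum(axis) / len(members) for axis in zip(*members))
            if members else c
            for members, c in zip(clusters, centroids)
        ]
        if new_centroids == centroids:
            break  # converged: no centroid moved
        centroids = new_centroids
    return centroids, clusters


# Hypothetical viewers: (weekly hours of sports, weekly hours of drama).
viewers = [(9.0, 1.0), (8.0, 2.0), (10.0, 0.0),
           (1.0, 7.0), (2.0, 9.0), (0.0, 8.0)]
centroids, clusters = kmeans(viewers, [viewers[0], viewers[3]])
print(centroids)  # [(9.0, 1.0), (1.0, 8.0)]
```

No labels are supplied: the two groups (sports-heavy and drama-heavy viewers) emerge from the distances alone, which is the contrast with supervised classification drawn on the slide.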

Sequential Patterns
  Detecting associations among events with certain temporal relationships
  Example: Cardiac bypass for blocked arteries AND within 18 months, high blood urea THEN kidney failure likely in next 18 months
  Particularly important in smart homes

Sequential Pattern Discovery
  Sequence of itemsets
    Grocery store purchases by 1 person (3 itemsets)
      {soy milk, bread, chocolate}, {bananas, chocolate}, {lettuce, tomato, chocolate}
    2 subsequences
      {soy milk, bread, chocolate}, {bananas, chocolate}
      {bananas, chocolate}, {lettuce, tomato, chocolate}

Sequential Pattern Discovery
  The support for a sequence S is the % of the given set U of sequences of which S is a subsequence.
    That is: how many times does S show up?
  Find all subsequences from the given sequence sets that have a user-defined minimum support.
  The sequence S1, S2, … Sn is a predictor of the “fact” that a customer who buys itemset S1 is likely to buy itemset S2, then S3, …
    Prediction support is based on the frequency of this sequence in the past
  Many research issues remain in creating good algorithms

Patterns Within Time Series
  Finding 2 patterns that occur over time
    2003 stock prices of Choice Homes and Home Depot
    2 products show the same sales pattern in summer but a different one in winter
    Solar magnetic wind patterns may predict earth atmospheric changes

Time Series Pattern Discovery
  Time series are sequences of events
    Event could be a transaction (closing daily stock price)
  Look at sequences over n days, or
  Longest period in which change is no greater than 1%
  Comparing
    Must define similarity measures

Other Approaches in Data Mining
  Neural nets
    Infer a function from a set of examples
    Supervised & unsupervised algorithms
    Capabilities
      Non-parametric curve-fitting
      Interpolates to solve new problems
        classification
        time-series prediction
    Disadvantages
      Can’t see what it learned (not declarative)

Other Approaches in Data Mining
  Genetic algorithms
    Set up
      Representation (strings over an alphabet)
      Evaluation (fitness) function
      Parameters: # of generations, cross-over rate, mutation rate, etc.
    Randomized (probabilistic operators), parallel search over search space
    Used for problem solving and clustering
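
The set-up items on the genetic algorithms slide can be sketched as a small loop. This is a minimal illustration, not a reference implementation: the binary alphabet, the toy fitness function (count of '1' characters), tournament selection, elitism, and every default parameter value are assumptions picked for the example.

```python
import random

ALPHABET = "01"  # representation: fixed-length strings over this alphabet


def fitness(individual):
    """Toy evaluation function (an assumption): number of '1' characters."""
    return individual.count("1")


def evolve(length=20, pop_size=30, generations=40,
           crossover_rate=0.7, mutation_rate=0.02, seed=0):
    rng = random.Random(seed)  # seeded so runs are repeatable
    population = ["".join(rng.choice(ALPHABET) for _ in range(length))
                  for _ in range(pop_size)]
    best_per_generation = []
    for _ in range(generations):
        best = max(population, key=fitness)
        best_per_generation.append(fitness(best))
        next_population = [best]  # elitism: keep the best string unchanged
        while len(next_population) < pop_size:
            # Tournament selection: the fitter of two random individuals.
            mother = max(rng.sample(population, 2), key=fitness)
            father = max(rng.sample(population, 2), key=fitness)
            child = mother
            if rng.random() < crossover_rate:
                cut = rng.randrange(1, length)  # single-point cross-over
                child = mother[:cut] + father[cut:]
            # Mutation: each position is re-drawn with a small probability.
            child = "".join(
                rng.choice(ALPHABET) if rng.random() < mutation_rate else ch
                for ch in child)
            next_population.append(child)
        population = next_population
    best = max(population, key=fitness)
    best_per_generation.append(fitness(best))
    return best, best_per_generation


best, history = evolve()
print(best, history[0], history[-1])
```

The population is what makes the search "parallel" in the slide's sense: many candidate strings are evaluated and recombined in each generation. Because the best string is carried over unchanged, the best fitness recorded per generation never decreases.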