CS240B Notes by
Carlo Zaniolo
UCLA CSD
1
Continuous, unbounded, rapid, time-varying streams of data elements
Occur in a variety of modern applications
Network monitoring and traffic engineering
Sensor networks, RFID tags
Telecom call records
Financial applications
Web logs and click-streams
Manufacturing processes
DSMS = Data Stream Management System
2
Amazon/Cougar (Cornell) – sensors
Aurora (Brown/MIT) – sensor monitoring, dataflow
Hancock (AT&T) – Telecom streams
Niagara (OGI/Wisconsin) – Internet DBs & XML
OpenCQ (Georgia) – triggers, view maintenance
Stream (Stanford) – general-purpose DSMS
Tapestry (Xerox) – publish/subscribe filtering
Telegraph (Berkeley) – adaptive engine for sensors
Gigascope: AT&T Labs – Network Monitoring
Stream Mill (UCLA) - power & extensibility
3
Data Models
Relational streams, but XML streams are important too
Tuple Time-Stamping
Order is important
Windows
Query Languages: Extensions of SQL or XQuery
To support continuous (i.e., persistent) queries on transient data: a reversal of roles.
Blocking operators excluded
Query Plans:
New execution models (main memory oriented)
Optimized scheduling for response time or memory
Quality of Service (QoS) & Approximation
Synopses
Sampling
Load shedding.
4
Several Startups
Streambase,
Coral8,
Apama, and
Truviso.
Oracle and DBMS companies
Publish/subscribe
Complex Event Processing (CEP)
Limitations: only simple applications—e.g. continuous queries expressed in SQL
No Support for Data Stream Mining queries.
5
Many applications: click stream analysis, intrusion detection,...
Many fast & light algorithms developed for stream mining.
Ensembles, Moment, SWIM, etc.
Analyst should be able to focus on high-level mining tasks.
Leaving QoS and lower-level issues to the system.
Integration of mining methods into Data Stream Management Systems (DSMS) is required
Many research challenges.
Stream Mill Miner (SMM) is the first DSMS designed for that.
6
Data Stream Management Systems (DSMS)
Data stream mining applications so far ignored by DSMS … although
A. DSMS technology is required for data stream mining
QoS, query scheduling, synopses, sampling, windows, ...
B. But supporting DM applications is difficult since current DSMS only support simple query languages based on SQL.
Conclusion: either a shotgun wedding ... or a research breakthrough is needed here!
7
A Difficult Problem: the Inductive DBMS
Experience
Initial attempts to support mining queries in relational DBMS: Unsuccessful
OR-DBMS do not fare much better [Sarawagi’ 98].
In 1996 the ‘high-road’ approach by Imielinski & Mannila, who called for a quantum leap in functionality:
High-level declarative languages for DM.
Extensions for query processing and optimization.
The research area of Inductive DBMS was thus born
Inspired DMQL, Mine Rule, MSQL, etc.
Suffer from limited generality and performance issues.
8
Vendors have taken a ‘low-road’ approach.
A library of mining functions using a cache-mining approach
IBM DB2 Intelligent Miner
Oracle Data Miner
MS OLE DB for DM: mining models
Closed systems,
Lacking in coverage and user-extensibility.
Not as popular as dedicated, stand-alone mining systems, such as Weka
9
A comprehensive set of mining algorithms, and tools.
Generic algorithms over arbitrary data sets.
Independent of the number of columns in tables.
Open and extensible system based on Java.
These are the features that we want in our SMM— starting from SQL rather than Java!
Not an easy task ...why?
10
SMM Contributions
Build on Stream Mill DSMS and its SQL-based continuous query language and enabling technology.
Language and System Extensions:
Genericity,
Extensibility, and
Performance
A suite of stream mining algorithms.
Existing ones and
Newly developed in this project—e.g., SWIM.
High level mining model for better
Usability
Control of mining process.
11
From SQL to Online Mining in SMM: step by step
Naïve Bayesian Classifier (NBC).
Important and frequently used.
Schema-specific NBC: simple to express in SQL with count and sum aggregates. But a generic NBC is still preferable.
Genericity: one function independent of the number of columns involved.
Schema independence in SQL?
12
Weka
Arrays of type real.
SMM
Verticalization.
Similar arrays, but in tables.
Built-in table function to reduce any table to this form.
Thus, generic UDAs work with this schema.
And further improvements are also supported in SMM
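As a rough illustration of verticalization, the sketch below turns each wide row into (row id, column, value) triples, so that downstream code no longer depends on how many columns the table has. It is written in Python rather than ESL; the function name `verticalize`, the `row_id` field, and the sample rows are assumptions made for this example and do not reproduce SMM's built-in table function.

```python
from typing import Any, Dict, Iterator, Tuple


def verticalize(row_id: int, row: Dict[str, Any]) -> Iterator[Tuple[int, str, Any]]:
    """Turn one wide row into (row_id, column, value) triples.

    Hypothetical helper: the output schema is the same for every input table,
    whatever its number of columns.
    """
    for column, value in row.items():
        yield (row_id, column, value)


# Two sample rows with four columns each (illustrative data only).
rows = [
    {"Outlook": "sunny", "Temp": "hot", "Humidity": "high", "Wind": "weak"},
    {"Outlook": "rain", "Temp": "mild", "Humidity": "high", "Wind": "strong"},
]

# Flatten every row; a table with 4 or 40 columns yields the same 3-field schema.
vertical = [t for i, r in enumerate(rows) for t in verticalize(i, r)]
```

A generic aggregate can then consume `vertical` one triple at a time without knowing the original schema.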
13
Most mining tasks cannot be implemented in SQL.
Solution: Define complex functions by User Defined Aggregates (UDAs)
Complex mining tasks can be viewed as aggregates
UDAs natively defined in SQL make the language computationally complete [Wang’ 04]
Turing-complete over static data
Non-blocking complete over data streams
Natural extensions to support windows and delta computations for data streams [Bai’ 06]
UDAs can be defined in a PL, for better performance
14
Windowed UDA Example – Continuous Sum
WINDOW AGGREGATE sum(val REAL): REAL {
  TABLE state (tot REAL);
  INITIALIZE : {
    INSERT INTO state VALUES (val);
  }
  ITERATE : {
    UPDATE state SET tot = tot + val;
  }
  EXPIRE : {
    UPDATE state SET tot = tot - oldest().val;
  }
  /* No TERMINATE state */
}
For efficient differential computation
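The INITIALIZE / ITERATE / EXPIRE states of the windowed aggregate above can be mimicked outside a DSMS. The following Python sketch assumes a count-based window and keeps the window contents in a deque. The class name `WindowedSum` and its `add` method are hypothetical, and this is not how Stream Mill implements windows internally.

```python
from collections import deque


class WindowedSum:
    """Differential sum over the last `size` values (count-based window)."""

    def __init__(self, size):
        self.size = size
        self.window = deque()   # values currently inside the window
        self.tot = None         # running total, like the `state` table

    def add(self, val):
        if self.tot is None:
            self.tot = val      # INITIALIZE: first tuple seeds the state
        else:
            self.tot += val     # ITERATE: add the incoming value
        self.window.append(val)
        if len(self.window) > self.size:
            # EXPIRE: subtract the oldest value instead of re-summing the window
            self.tot -= self.window.popleft()
        return self.tot


agg = WindowedSum(size=3)
results = [agg.add(v) for v in [1.0, 2.0, 3.0, 4.0, 5.0]]
```

Each arrival costs constant work regardless of window size, which is the point of the differential formulation.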
15
UDAs invoked with standard SQL:2003 syntax of OLAP functions.
SELECT learn(ts.Column, ts.Value, t.dec)
  OVER (ROWS 1000 PRECEDING)
FROM trainingstream AS t,
  TABLE (verticalize(Outlook, Temp, Humidity, Wind)) AS ts
Powerful framework:
Concept drifts-shifts
Association rule mining
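To make the idea of a schema-independent learner concrete, here is a minimal Python sketch of a Naïve Bayesian classifier that only ever sees (column, value, decision) counts, in the spirit of the `learn(ts.Column, ts.Value, t.dec)` call above. The class `GenericNaiveBayes`, the use of Laplace smoothing, and the toy training data are all assumptions of this example; it is not the SMM UDA, and it ignores windows.

```python
import math
from collections import Counter, defaultdict


class GenericNaiveBayes:
    """Naive Bayes over (column, value, decision) counts; any number of columns."""

    def __init__(self):
        self.class_counts = Counter()         # decision -> number of rows
        self.value_counts = Counter()         # (column, value, decision) -> count
        self.values_seen = defaultdict(set)   # column -> distinct values seen

    def learn(self, row, decision):
        self.class_counts[decision] += 1
        for column, value in row.items():     # one update per verticalized tuple
            self.value_counts[(column, value, decision)] += 1
            self.values_seen[column].add(value)

    def classify(self, row):
        total = sum(self.class_counts.values())
        best, best_score = None, -math.inf
        for decision, n in self.class_counts.items():
            score = math.log(n / total)
            for column, value in row.items():
                k = len(self.values_seen[column]) or 1
                count = self.value_counts[(column, value, decision)]
                score += math.log((count + 1) / (n + k))   # Laplace smoothing
            if score > best_score:
                best, best_score = decision, score
        return best


nbc = GenericNaiveBayes()
training = [
    ({"Outlook": "sunny", "Wind": "weak"}, "no"),
    ({"Outlook": "sunny", "Wind": "strong"}, "no"),
    ({"Outlook": "rain", "Wind": "weak"}, "yes"),
    ({"Outlook": "overcast", "Wind": "weak"}, "yes"),
]
for row, decision in training:
    nbc.learn(row, decision)

prediction_a = nbc.classify({"Outlook": "sunny", "Wind": "strong"})
prediction_b = nbc.classify({"Outlook": "rain", "Wind": "weak"})
```

The same class would accept rows with four or forty columns unchanged, because the counts are keyed by column name rather than by position in a fixed schema.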
16
A window can be divided into panes (called slides)
Tumbling windows when the size of the slide is equal to or larger than that of the window
The slide/window combination is great for data stream mining.
Simple construct added to support slides in UDAs
Allowed us to build a flexible and efficient library of data stream mining UDAs
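The window/slide combination can be sketched as follows: each slide is reduced to a partial result, and the window result is recomputed from the partial results whenever a slide completes. This Python example assumes count-based windows whose size is a multiple of the slide size; the function name `sliding_sums` is hypothetical and the construct SMM adds to UDAs is not shown here.

```python
from collections import deque


def sliding_sums(stream, window, slide):
    """Yield the window sum each time a slide completes (count-based).

    Assumes `window` is a positive multiple of `slide`; with window == slide
    the result is a tumbling window.
    """
    if slide <= 0 or window % slide != 0:
        raise ValueError("window must be a positive multiple of slide")
    panes_per_window = window // slide
    panes = deque()            # partial sums of the slides inside the window
    current, n = 0.0, 0
    for val in stream:
        current += val
        n += 1
        if n == slide:         # a slide has just been completed
            panes.append(current)
            if len(panes) > panes_per_window:
                panes.popleft()            # the oldest slide expires
            yield sum(panes)   # early results cover a partially filled window
            current, n = 0.0, 0


sliding = list(sliding_sums([1, 2, 3, 4, 5, 6], window=4, slide=2))
tumbling = list(sliding_sums([1, 2, 3, 4, 5, 6], window=2, slide=2))
```

Results are produced once per slide rather than once per tuple, which suits mining tasks where recomputing on every arrival would be too expensive.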
17
SWIM [Mozafari’ 08] – Maintaining frequent patterns over large windows with slides.
Differentially computes frequent patterns as slides enter (or expire out of) the window.
Uses efficient ‘Verifiers’ based on conditional counting.
Trade-off between Delay and Performance
Performance gain over existing algorithms.
19
If pattern p is frequent in a window, it must be frequent in at least one of its slides, so keep a union of the frequent patterns of all slides (PT)
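[Figure: the window moves from W4 to W5 over slides S4 S5 S6 S7; S4 is the expired slide and S7 the new one. Initially PT = F4 U F5 U F6. The mining algorithm mines the new slide and F7 is added to PT; frequencies are counted/updated; PT is pruned.]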
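The property stated above (a pattern frequent in the window must be frequent in at least one slide) means the union of per-slide frequent patterns is a complete candidate set for the window. The Python sketch below checks this on toy data with brute-force counting. It is a simplified illustration only: the function names, the `max_len` cap, and the data are assumptions, and it omits what makes SWIM efficient (verifiers, differential count updates, pruning).

```python
from itertools import combinations


def frequent_in(transactions, min_support, max_len=2):
    """Brute-force frequent itemsets (up to max_len items) in a list of transactions."""
    counts = {}
    for t in transactions:
        items = sorted(set(t))
        for k in range(1, max_len + 1):
            for combo in combinations(items, k):
                counts[combo] = counts.get(combo, 0) + 1
    threshold = min_support * len(transactions)
    return {p for p, c in counts.items() if c >= threshold}


def window_frequent(slides, min_support, max_len=2):
    """Frequent itemsets of the whole window, using only per-slide candidates."""
    pt = set()                     # PT: union of the slides' frequent patterns
    for s in slides:
        pt |= frequent_in(s, min_support, max_len)
    window = [t for s in slides for t in s]
    threshold = min_support * len(window)
    result = set()
    for p in pt:                   # count only the candidates held in PT
        count = sum(1 for t in window if set(p) <= set(t))
        if count >= threshold:
            result.add(p)
    return result


slides = [
    [["a", "b"], ["a", "c"]],
    [["a", "b"], ["b", "c"]],
    [["a", "b", "c"], ["c"]],
]
from_slides = window_frequent(slides, min_support=0.5)
brute_force = frequent_in([t for s in slides for t in s], min_support=0.5)
```

On this data both routes give the same answer, as the property predicts; the benefit in a streaming setting is that only the newest slide has to be mined when the window moves.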
20
Concept Drifts/Shifts—Complex Processes
Ensemble based methods.
Weighted bagging [Wang’ 03], adaptive boosting [Chu’ 04], inductive transfer [Forman’ 06].
Generic support, e.g. adaptive boosting (below).
21
Built-in Online Mining Algorithms In SMM
Online classifiers
Naïve Bayesian
Decision Tree
K-nearest Neighbor
Online clustering
DBScan [Ester’ 96]
IncDBScan
Windowed K-means*
DenStream* [Cao’ 06]
CluStream
Association rule mining
Approximate frequent items
SWIM [Mozafari’ 08]
Moment [Chi’ 04]
AFPIM
Time series/sequence queries
SQL-TS [Sadri’ 01]
Many more …
Already supported
To be supported
22
Complex SQL queries to invoke built-in and user-defined mining algorithms.
An open and extensible system
Most analysts would prefer using a high-level mining language that
supports uniform invocation of built-in and user-defined mining algorithms (no SQL required)
describes the workflow of the mining process
is also open and extensible to incorporate newly defined mining algorithms.
24
Example: Defining a Mining Model
CREATE MODEL TYPE NaiveBayesianClassifier {
  SHAREDTABLES (DescriptorTbl),
  Learn ( UDA LearnNaiveBayesian,
    WINDOW TRUE,
    PARTABLES (), % names of param tables required by the method
    PARAMETERS () % additional parameters to be specified for input
  ),
  Classify ( UDA ClassifyNaiveBayesian,
    WINDOW TRUE,
    PARTABLES (),
    PARAMETERS ()
  )
};
25
Example: Using a Mining Model
Creating an instance:
CREATE MODEL INSTANCE NaiveBayesianInstance
AS NaiveBayesianClassifier;
Uniform invocation of mining tasks:
RUN NaiveBayesianInstance.Learn WITH TrainingSet;
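The model type / model instance / RUN pattern can be approximated in a general-purpose language as a registry of named tasks plus per-instance state. The Python sketch below is a loose analogy only: `ModelType`, `ModelInstance`, and the toy majority-class learner are hypothetical names invented for this example, the instance's `state` dictionary merely stands in for shared tables, and none of this reflects how SMM compiles mining models.

```python
from collections import Counter


class ModelType:
    """A named set of tasks (e.g. Learn, Classify), each backed by a function."""

    def __init__(self, name, tasks):
        self.name = name
        self.tasks = tasks             # task name -> callable(state, data)

    def instance(self):
        return ModelInstance(self)


class ModelInstance:
    """One instance of a model type, with its own state."""

    def __init__(self, model_type):
        self.model_type = model_type
        self.state = {}

    def run(self, task, data):
        # Uniform invocation: every task is called the same way.
        return self.model_type.tasks[task](self.state, data)


# Toy stand-in for a real learner: predicts the most common training label.
def learn_majority(state, data):
    state.setdefault("class_counts", Counter()).update(label for _, label in data)


def classify_majority(state, data):
    majority = state["class_counts"].most_common(1)[0][0]
    return [majority for _ in data]


majority_type = ModelType(
    "MajorityClassifier",
    {"Learn": learn_majority, "Classify": classify_majority},
)
inst = majority_type.instance()
inst.run("Learn", [
    ({"Outlook": "sunny"}, "no"),
    ({"Outlook": "rain"}, "yes"),
    ({"Outlook": "rain"}, "yes"),
])
predictions = inst.run("Classify", [{"Outlook": "sunny"}])
```

Adding a new algorithm then means registering another set of task functions, without changing how analysts invoke them.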
26
SMM Vs. Weka
NBC and decision tree classifier
Datasets [UCI-MLR]
• Iris: 5 attributes
• Heart disease: 13 attributes
Overhead of integrating algorithms into SMM
The SWIM algorithm standalone vs. integrated
Dataset [IBM Quest]
• Trans len 20, Pattern len 5, Tuples 50K
27
Comparison with Weka: NBC-Iris
28
29
Comparison with Weka: Decision Tree - Iris
30
Integration Overhead: Integrated SWIM vs. Standalone SWIM
31
One server, multiple clients
Server (on Linux): hosts the ESL language and manages storage and continuous queries
Client (Java based GUI): allows the user to specify streams, queries, etc.
32
SMM integrates new solutions for several difficult problems:
Usability by high-level mining models
Extensibility by user-defined mining models that call on UDAs with windows
Suite of built-in data stream mining UDAs
Generic mining UDAs by Verticalization & other techniques
Performance
SMM is the first of its kind: more and better systems will follow in its footsteps.
33
Faster & lighter mining algorithms
E.g. online algorithms for clustering
Integration of other mining algorithms
Data flow in mining models
Similar solution for databases
34
35
[Arasu’ 04] Arvind Arasu and Jennifer Widom. Resource sharing in continuous sliding-window aggregates. In VLDB , pages 336–347, 2004.
[Babcock’ 02] B. Babcock, S. Babu, M. Datar, R. Motwani, and J. Widom. Models and issues in data stream systems. In PODS, 2002.
[Bai’ 06] Yijian Bai, Hetal Thakkar, Chang Luo, Haixun Wang, and Carlo Zaniolo. A data stream language and system designed for power and extensibility. In CIKM, pages 337–346, 2006.
[Cao’ 06] F. Cao, M. Ester, W. Qian, and A. Zhou. Density-based clustering over an evolving data stream with noise. In SIAM International Conference on Data Mining (SDM), 2006.
[Chi’ 04] Y. Chi, H. Wang, P. S. Yu, and R. R. Muntz. Moment: Maintaining closed frequent itemsets over a stream sliding window. In Proceedings of the 2004 IEEE International Conference on Data Mining (ICDM’04), November 2004.
[Chu’ 04] F. Chu and C. Zaniolo. Fast and light boosting for adaptive mining of data streams. In PAKDD, volume 3056, 2004.
[Ester’ 96] Martin Ester, Hans-Peter Kriegel, Jörg Sander, and Xiaowei Xu. A density-based algorithm for discovering clusters in large spatial databases with noise. In Second International Conference on Knowledge Discovery and Data Mining, pages 226–231, 1996.
[Forman’ 06] George Forman. Tackling concept drift by temporal inductive transfer. In SIGIR , pages 252–259, 2006.
36
[Imielinski’ 96] Tomasz Imielinski and Heikki Mannila. A database perspective on knowledge discovery. Commun. ACM , 39(11):58–64, 1996.
[Law’ 04] Yan-Nei Law, Haixun Wang, and Carlo Zaniolo. Data models and query language for data streams. In VLDB , pages 492–503, 2004.
[Mozafari’ 08] Barzan Mozafari, Hetal Thakkar, and Carlo Zaniolo. Verifying and mining frequent patterns from large windows over data streams. In International Conference on Data Engineering (ICDE), 2008.
[Sadri’ 01] Reza Sadri, Carlo Zaniolo, Amir Zarkesh, and Jafar Adibi. Optimization of sequence queries in database systems. In PODS, Santa Barbara, CA, May 2001.
[Sarawagi’ 98] S. Sarawagi, S. Thomas, and R. Agrawal. Integrating association rule mining with relational database systems: Alternatives and implications. In SIGMOD, 1998.
[UCI-MLR] http://archive.ics.uci.edu/ml/datasets.html
[Wang’ 03] H. Wang, W. Fan, P. S. Yu, and J. Han. Mining concept-drifting data streams using ensemble classifiers. In SIGKDD, 2003.
37