

Extending DSMS for Data Stream Mining

CS240B Notes by Carlo Zaniolo



Data Streams

• Continuous, unbounded, rapid, time-varying streams of data elements
• Occur in a variety of modern applications
  • Network monitoring and traffic engineering
  • Sensor networks, RFID tags
  • Telecom call records
  • Financial applications
  • Web logs and click-streams
  • Manufacturing processes
• DSMS = Data Stream Management System


Many Research Projects …

• Amazon/Cougar (Cornell) – sensors
• Aurora (Brown/MIT) – sensor monitoring, dataflow
• Hancock (AT&T) – telecom streams
• Niagara (OGI/Wisconsin) – Internet DBs & XML
• OpenCQ (Georgia) – triggers, view maintenance
• Stream (Stanford) – general-purpose DSMS
• Tapestry (Xerox) – publish/subscribe filtering
• Telegraph (Berkeley) – adaptive engine for sensors
• Gigascope (AT&T Labs) – network monitoring
• Stream Mill (UCLA) – power & extensibility


Technology Challenges

• Data Models
  • Relational streams, but XML streams are important too
  • Tuple time-stamping
  • Order is important
  • Windows
• Query Languages: extensions of SQL or XQuery
  • To support continuous (i.e., persistent) queries on transient data: a reversal of roles.
  • Blocking operators excluded
• Query Plans:
  • New execution models (main-memory oriented)
  • Optimized scheduling for response time or memory
• Quality of Service (QoS) & Approximation
  • Synopses
  • Sampling
  • Load shedding


Commercial Developments

• Several startups
  • StreamBase,
  • Coral8,
  • Apama, and
  • Truviso.
• Oracle and DBMS companies
• Publish/subscribe
• Complex Event Processing (CEP)

Limitations: only simple applications, e.g., continuous queries expressed in SQL
• No support for data stream mining queries.


Data Stream Mining

• Many applications: click-stream analysis, intrusion detection, ...
• Many fast & light algorithms developed for stream mining.
  • Ensembles, Moment, SWIM, etc.
• Analysts should be able to focus on high-level mining tasks,
  • leaving QoS and lower-level issues to the system.
• Integration of mining methods into Data Stream Management Systems (DSMS) is required.
  • Many research challenges.

Stream Mill Miner (SMM) is the first DSMS designed for that.


Data Stream Management Systems (DSMS)

• Data stream mining applications have so far been ignored by DSMS … although

A. DSMS technology is required for data stream mining
  • QoS, query scheduling, synopses, sampling, windows

B. But supporting DM applications is difficult, since current DSMS only support simple query languages based on SQL.

Conclusion: either a shotgun wedding ... or a research breakthrough is needed here!


A Difficult Problem: the Inductive DBMS


• Initial attempts to support mining queries in relational DBMS: unsuccessful
  • OR-DBMS do not fare much better [Sarawagi’ 98].
• In 1996, Imielinski & Mannila proposed the ‘high-road’ approach, calling for a quantum leap in functionality:
  • High-level declarative languages for DM.
  • Extensions for query processing and optimization.
• The research area of Inductive DBMS was thus born.
  • It inspired DMQL, Mine Rule, MSQL, etc.
  • These suffer from limited generality and performance issues.


DBMS Vendors

• Vendors have taken a ‘low-road’ approach.
  • A library of mining functions using a cache-mining approach
    • IBM DB2 Intelligent Miner
    • Oracle Data Miner
    • MS OLE DB for DM: mining models
  • Closed systems,
  • Lacking in coverage and user-extensibility.
• Not as popular as dedicated, stand-alone mining systems, such as Weka



• A comprehensive set of mining algorithms and tools.
• Generic algorithms over arbitrary data sets.
  • Independent of the number of columns in tables.
• Open and extensible system based on Java.

These are the features that we want in our SMM, starting from SQL rather than Java!

Not an easy task ... why?


SMM Contributions

• Build on Stream Mill DSMS and its SQL-based continuous query language and enabling technology.
• Language and System Extensions:
  • Genericity,
  • Extensibility, and
  • Performance
• A suite of stream mining algorithms.
  • Existing ones and
  • Newly developed in this project, e.g., SWIM.
• High-level mining model for better
  • Usability
  • Control of the mining process.


From SQL to Online Mining in SMM: step by step

• Naïve Bayesian Classifier (NBC).
  • Important and frequently used.
• Schema-specific NBC: simple to express in SQL, using count and sum aggregates. But a generic NBC is still preferable.
• Genericity: one function, independent of the number of columns involved.
• Schema independence in SQL?



• Weka
  • Arrays of type real.
• Verticalization.
  • Similar arrays, but in tables.
  • Built-in table function to reduce any table to this form.
  • Thus, generic UDAs work with this schema.
  • Further improvements are also supported in SMM.
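To make the idea of verticalization concrete, here is a minimal Python sketch. It assumes that verticalizing a row means emitting one (row id, column name, value) triple per cell; the function name `verticalize` echoes the table function used in the slides, but this signature and the sample rows are illustrative assumptions, not the SMM implementation.

```python
from typing import Any, Dict, Iterator, List, Tuple


def verticalize(rows: List[Dict[str, Any]],
                columns: List[str]) -> Iterator[Tuple[int, str, Any]]:
    """Yield one (row_id, column, value) triple per cell of each row.

    A consumer of these triples never needs to know how many columns
    the original table had, which is what makes it schema-independent.
    """
    for row_id, row in enumerate(rows):
        for column in columns:
            yield (row_id, column, row[column])


# Hypothetical sample data, using the column names from the slides.
rows = [
    {"Outlook": "sunny", "Temp": "hot", "Humidity": "high", "Wind": "weak"},
    {"Outlook": "rain", "Temp": "mild", "Humidity": "high", "Wind": "strong"},
]
triples = list(verticalize(rows, ["Outlook", "Temp", "Humidity", "Wind"]))
```

Each input row becomes four triples here, and a table with twenty columns would go through the same function unchanged.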



• Most mining tasks cannot be implemented in SQL.
• Solution: define complex functions by User Defined Aggregates (UDAs)
  • Complex mining tasks can be viewed as aggregates
• UDAs natively defined in SQL make the language computationally complete [Wang’ 04]
  • Turing-complete over static data
  • Non-blocking complete over data streams
• Natural extensions to support windows and delta computations for data streams [Bai’ 06]
• UDAs can be defined in a PL, for better performance


Windowed UDA Example – Continuous Count



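TABLE state (tot real);

/* On each new tuple: add its value to the running total */
UPDATE state SET tot = tot + val;

/* On each expiring tuple: subtract the oldest value, for efficient differential computation */
UPDATE state SET tot = tot - oldest().val;

/* No TERMINATE state */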
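The two UPDATE statements above amount to a differential computation: each arriving value is added to the state and each expiring value is subtracted, so the window is never rescanned. The following Python sketch mirrors that logic under the assumption of a count-based window; the function name `windowed_sum` and the `window_size` parameter are illustrative and not part of ESL.

```python
from collections import deque
from typing import Deque, Iterable, Iterator


def windowed_sum(values: Iterable[float], window_size: int) -> Iterator[float]:
    """Emit the sum of the last `window_size` values after every arrival."""
    window: Deque[float] = deque()
    tot = 0.0
    for val in values:
        if len(window) == window_size:
            # Counterpart of: tot = tot - oldest().val
            tot -= window.popleft()
        window.append(val)
        # Counterpart of: tot = tot + val
        tot += val
        # A result is produced per tuple; nothing waits for end of input.
        yield tot


results = list(windowed_sum([1, 2, 3, 4, 5], window_size=3))
```

The cost per tuple is constant regardless of the window size, which is the point of maintaining the state differentially.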


Online Mining in SMM

• UDAs invoked with the standard SQL:2003 syntax of OLAP functions.

SELECT learn(ts.Column, ts.Value, t.dec)


FROM trainingstream AS t,

TABLE (verticalize(Outlook, Temp, Humidity, Wind)) AS ts

• Powerful framework:
  • Concept drifts/shifts
  • Association rule mining
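As a rough illustration of what a generic `learn` aggregate could compute from verticalized (column, value) pairs plus a decision label, the Python sketch below keeps only counts, so it works for any number of columns. The class `GenericNaiveBayes`, its method names, the Laplace smoothing, and the training rows are all assumptions made for this example; they are not taken from SMM.

```python
from collections import Counter, defaultdict
from typing import Dict, Hashable, Optional, Set


class GenericNaiveBayes:
    """Count-based Naive Bayes over (column, value, label) observations."""

    def __init__(self) -> None:
        self.class_counts: Counter = Counter()
        self.cell_counts: Counter = Counter()  # (column, value, label) -> count
        self.values_seen: Dict[str, Set[Hashable]] = defaultdict(set)

    def learn_row(self, row: Dict[str, Hashable], label: Hashable) -> None:
        """Consume one training tuple, one (column, value) pair at a time."""
        self.class_counts[label] += 1
        for column, value in row.items():
            self.cell_counts[(column, value, label)] += 1
            self.values_seen[column].add(value)

    def classify(self, row: Dict[str, Hashable]) -> Optional[Hashable]:
        """Return the label with the highest smoothed posterior score."""
        total = sum(self.class_counts.values())
        best_label, best_score = None, -1.0
        for label, n_label in self.class_counts.items():
            score = n_label / total
            for column, value in row.items():
                k = len(self.values_seen[column]) or 1
                # Laplace smoothing (an assumption of this sketch).
                score *= (self.cell_counts[(column, value, label)] + 1) / (n_label + k)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


nbc = GenericNaiveBayes()
training = [
    ({"Outlook": "sunny", "Wind": "weak"}, "no"),
    ({"Outlook": "sunny", "Wind": "strong"}, "no"),
    ({"Outlook": "rain", "Wind": "weak"}, "yes"),
    ({"Outlook": "overcast", "Wind": "weak"}, "yes"),
]
for row, dec in training:
    nbc.learn_row(row, dec)
prediction = nbc.classify({"Outlook": "sunny", "Wind": "weak"})
```

Because the state is a set of counters, the same structure could also be updated differentially when old tuples expire from a window.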


The Slide Construct

• A window can be divided into panes (called slides)
  • Tumbling windows arise when the size of the slide is equal to or larger than that of the window
• The slide/window combination is great for data stream mining.
• Simple construct added to support slides in UDAs
  • Allowed us to build a flexible and efficient library of data stream mining UDAs
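The relationship between windows and slides can be sketched in Python as follows. The sketch assumes count-based windows whose size is a multiple of the slide size and uses a sum as the aggregate; the name `sliding_sums` and its parameters are hypothetical and do not reflect the ESL slide syntax.

```python
from collections import deque
from typing import Deque, Iterable, Iterator


def sliding_sums(values: Iterable[float], window_size: int,
                 slide_size: int) -> Iterator[float]:
    """Emit the window sum each time a slide (pane) is completed.

    Assumes window_size is a multiple of slide_size.
    """
    panes_per_window = max(1, window_size // slide_size)
    # deque(maxlen=...) silently drops the oldest pane when a new one arrives.
    pane_sums: Deque[float] = deque(maxlen=panes_per_window)
    current, filled = 0.0, 0
    for val in values:
        current += val
        filled += 1
        if filled == slide_size:
            pane_sums.append(current)
            current, filled = 0.0, 0
            yield sum(pane_sums)


data = [1, 2, 3, 4, 5, 6, 7, 8]
sliding = list(sliding_sums(data, window_size=4, slide_size=2))
tumbling = list(sliding_sums(data, window_size=4, slide_size=4))
```

With a slide of 2 the windows overlap and results arrive every two tuples; with a slide of 4 the windows are disjoint, which is the tumbling case.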




Association Rule Mining

• SWIM [Mozafari’ 08]: maintaining frequent patterns over large windows with slides.
  • Differentially computes frequent patterns as slides enter (or expire from) the window.
  • Uses efficient ‘verifiers’ based on conditional counting.
  • Trade-off between delay and performance
  • Performance gain over existing algorithms.


SWIM (Sliding Window Incremental Miner)

• If pattern p is frequent in a window, it must be frequent in at least one of its slides, so keep a union of the frequent patterns of all slides (PT).

[Figure: a window of slides S4 (expired), S5, S6, and S7 (new), with PT = F4 ∪ F5 ∪ F6. When S7 arrives: count/update frequencies and add F7 to PT. When S4 expires: count/update frequencies and prune PT.]
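The property behind PT (a pattern frequent in the window must be frequent in at least one slide) can be illustrated with a deliberately simplified Python sketch. It is not the SWIM algorithm: it enumerates itemsets of at most two items by brute force, recounts PT over the whole window instead of using verifiers, and reports without delay. All names and the sample slides are assumptions.

```python
from collections import Counter, deque
from itertools import combinations
from typing import Deque, FrozenSet, Iterable, Iterator, List, Set, Tuple

Transaction = FrozenSet[str]
Pattern = FrozenSet[str]


def frequent_in_slide(slide: List[Transaction], min_support: float,
                      max_len: int = 2) -> Set[Pattern]:
    """Brute-force frequent itemsets (up to max_len items) of one slide."""
    counts: Counter = Counter()
    for t in slide:
        for k in range(1, max_len + 1):
            for combo in combinations(sorted(t), k):
                counts[frozenset(combo)] += 1
    threshold = min_support * len(slide)
    return {p for p, c in counts.items() if c >= threshold}


def frequent_in_window(slides: Iterable[List[Transaction]], min_support: float,
                       slides_per_window: int) -> Iterator[Set[Pattern]]:
    """After each new slide, report the patterns frequent in the current window."""
    window: Deque[Tuple[List[Transaction], Set[Pattern]]] = deque(maxlen=slides_per_window)
    for slide in slides:
        window.append((slide, frequent_in_slide(slide, min_support)))
        # PT: union of the per-slide frequent sets of the unexpired slides.
        # Rebuilding it here prunes patterns that only the expired slide held.
        pt: Set[Pattern] = set().union(*(f for _, f in window))
        transactions = [t for s, _ in window for t in s]
        threshold = min_support * len(transactions)
        # Only candidates in PT need to be counted over the window.
        yield {p for p in pt if sum(1 for t in transactions if p <= t) >= threshold}


fs = frozenset
slides = [
    [fs({"a", "b"}), fs({"a", "b"}), fs({"c"})],
    [fs({"a"}), fs({"a", "c"}), fs({"b"})],
    [fs({"c"}), fs({"c"}), fs({"a", "c"})],
]
results = list(frequent_in_window(slides, min_support=0.5, slides_per_window=2))
```

The candidate set stays small because only patterns that were frequent in some slide are ever counted against the full window.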


Concept Drifts/Shifts—Complex Processes

• Ensemble-based methods.
  • Weighted bagging [Wang’ 03], adaptive boosting [Chu’ 04], inductive transfer [Forman’ 06].
• Generic support, e.g., for adaptive boosting.
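To give a feel for how an ensemble can follow a concept shift, here is a small Python sketch of a chunk-based ensemble whose members are weighted by their accuracy on the most recent chunk. It is a generic simplification written for this note, not the weighted bagging, adaptive boosting, or inductive transfer algorithms cited above; the lookup-table base learner and all names are assumptions.

```python
from collections import Counter, defaultdict
from typing import Callable, Dict, List, Tuple

Example = Tuple[str, str]  # (feature value, label); one feature for brevity
Classifier = Callable[[str], str]


def train_lookup(chunk: List[Example]) -> Classifier:
    """Base learner: predict the most common label seen for each feature value."""
    by_value: Dict[str, Counter] = defaultdict(Counter)
    overall: Counter = Counter()
    for x, y in chunk:
        by_value[x][y] += 1
        overall[y] += 1
    default = overall.most_common(1)[0][0]
    table = {x: c.most_common(1)[0][0] for x, c in by_value.items()}
    return lambda x: table.get(x, default)


def accuracy(clf: Classifier, chunk: List[Example]) -> float:
    return sum(clf(x) == y for x, y in chunk) / len(chunk)


class WeightedEnsemble:
    def __init__(self, max_members: int = 3) -> None:
        self.max_members = max_members
        self.members: List[Tuple[float, Classifier]] = []

    def update(self, chunk: List[Example]) -> None:
        """Train a new member on the chunk and re-weight all members on it."""
        candidates = [clf for _, clf in self.members] + [train_lookup(chunk)]
        scored = [(accuracy(clf, chunk), clf) for clf in candidates]
        scored.sort(key=lambda wc: wc[0], reverse=True)
        self.members = scored[: self.max_members]  # drop the weakest members

    def predict(self, x: str) -> str:
        votes: Dict[str, float] = defaultdict(float)
        for weight, clf in self.members:
            votes[clf(x)] += weight
        return max(votes, key=votes.get)


ensemble = WeightedEnsemble()
ensemble.update([("a", "pos"), ("a", "pos"), ("b", "neg")])
before_shift = ensemble.predict("a")
ensemble.update([("a", "neg"), ("a", "neg"), ("b", "pos")])  # the concept flips
after_shift = ensemble.predict("a")
```

After the second chunk the older member scores zero on recent data, so its vote carries no weight and the prediction follows the new concept.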


Built-in Online Mining Algorithms In SMM

• Online classifiers
  • Naïve Bayesian
  • Decision Tree
  • K-nearest Neighbor
• Online clustering
  • DBScan [Ester’ 96]
  • IncDBScan
  • Windowed K-means*
  • DenStream* [Cao’ 06]
  • CluStream
• Association rule mining
  • Approximate frequent items
  • SWIM [Mozafari’ 08]
  • Moment [Chi’ 04]
• Time series/sequence queries
  • SQL-TS [Sadri’ 01]
• Many more …

Already supported

To be supported





• Complex SQL queries to invoke built-in and user-defined mining algorithms.
  • An open and extensible system
• Most analysts would prefer using a high-level mining language that
  • supports uniform invocation of built-in and user-defined mining algorithms (no SQL required),
  • describes the workflow of the mining process, and
  • is also open and extensible to incorporate newly defined mining algorithms.


Example: Defining a Mining Model

CREATE MODEL TYPE NaiveBayesianClassifier {
  SHAREDTABLES (DescriptorTbl),

  Learn ( UDA LearnNaiveBayesian,

    PARTABLES (),   % names of param tables required by the method
    PARAMETERS ()   % additional parameters to be specified for input

  Classify ( UDA ClassifyNaiveBayesian,





Example: Using a Mining Model

• Creating an instance:

AS NaiveBayesianClassifier;

• Uniform invocation of mining tasks:

RUN NaiveBayesianInstance.Learn WITH TrainingSet;
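The design idea here, a model type that bundles named tasks with shared state and an instance on which every task is run the same way, can be mimicked in a few lines of Python. This is only an analogy for the mining-model language: `ModelType`, `ModelInstance`, `run`, and the majority-class tasks are hypothetical names chosen for this sketch.

```python
from collections import Counter
from typing import Any, Callable, Dict, List

Task = Callable[[Dict[str, Any], List[Any]], Any]


class ModelType:
    """Bundles the named tasks of a mining model (compare Learn, Classify)."""

    def __init__(self, name: str, tasks: Dict[str, Task]) -> None:
        self.name = name
        self.tasks = tasks


class ModelInstance:
    """One instance of a model type; `shared` plays the role of shared tables."""

    def __init__(self, model_type: ModelType) -> None:
        self.model_type = model_type
        self.shared: Dict[str, Any] = {}

    def run(self, task: str, data: List[Any]) -> Any:
        # Every task is invoked the same way, whatever algorithm is behind it.
        return self.model_type.tasks[task](self.shared, data)


def learn_majority(shared: Dict[str, Any], labels: List[Any]) -> None:
    shared["majority"] = Counter(labels).most_common(1)[0][0]


def classify_majority(shared: Dict[str, Any], rows: List[Any]) -> List[Any]:
    return [shared["majority"] for _ in rows]


majority_type = ModelType(
    "MajorityClassifier",
    {"Learn": learn_majority, "Classify": classify_majority},
)
instance = ModelInstance(majority_type)
instance.run("Learn", ["yes", "no", "yes"])
predictions = instance.run("Classify", [{"Outlook": "sunny"}, {"Outlook": "rain"}])
```

Registering a new algorithm then means adding another model type; code that calls `run` does not change.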



• SMM vs. Weka
  • NBC and decision tree classifier
  • Datasets [UCI]
    • Iris: 5 attributes
    • Heart disease: 13 attributes
• Overhead of integrating algorithms into
  • The SWIM algorithm standalone vs. integrated
  • Dataset [IBM Quest]
    • Trans len 20, Pattern len 5, Tuples 50K

Comparison with Weka: NBC-Iris


Comparison with Weka: NBC-HD


Comparison with Weka: Decision Tree - Iris


Integration Overhead: Integrated SWIM vs. Standalone SWIM


The Stream Mill System

• One server, multiple clients
  • Server (on Linux): hosts the ESL language and manages storage and continuous queries
  • Client (Java-based GUI): allows the user to specify streams, queries, etc.



• SMM integrates new solutions for several difficult problems:
  • Usability by high-level mining models
  • Extensibility by user-defined mining models that call on UDAs with windows
  • Suite of built-in data stream mining UDAs
  • Generic mining UDAs by verticalization & other techniques
  • Performance
• SMM is the first of its kind: more and better systems will follow in its footsteps.


Future Work

• Faster & lighter mining algorithms
  • E.g., online algorithms for clustering
• Integration of other mining algorithms
• Data flow in mining models
• Similar solution for databases


Thank you!



• [Arasu’ 04] Arvind Arasu and Jennifer Widom. Resource sharing in continuous sliding-window aggregates. In VLDB, pages 336–347, 2004.
• [Babcock’ 02] B. Babcock, S. Babu, M. Datar, R. Motwani, and J. Widom. Models and issues in data stream systems. In PODS, 2002.
• [Bai’ 06] Yijian Bai, Hetal Thakkar, Chang Luo, Haixun Wang, and Carlo Zaniolo. A data stream language and system designed for power and extensibility. In CIKM, pages 337–346, 2006.
• [Cao’ 06] F. Cao, M. Ester, W. Qian, and A. Zhou. Density-based clustering over an evolving data stream with noise. To appear in Proceedings of SIAM 2006.
• [Chi’ 04] Y. Chi, H. Wang, P. S. Yu, and R. R. Muntz. Moment: Maintaining closed frequent itemsets over a stream sliding window. In Proceedings of the 2004 IEEE International Conference on Data Mining (ICDM’04), November 2004.
• [Chu’ 04] F. Chu and C. Zaniolo. Fast and light boosting for adaptive mining of data streams. In PAKDD, volume 3056, 2004.
• [Ester’ 96] Martin Ester, Hans-Peter Kriegel, Jörg Sander, and Xiaowei Xu. A density-based algorithm for discovering clusters in large spatial databases with noise. In Second International Conference on Knowledge Discovery and Data Mining, pages 226–231, 1996.
• [Forman’ 06] George Forman. Tackling concept drift by temporal inductive transfer. In SIGIR, pages 252–259, 2006.
• [Imielinski’ 96] Tomasz Imielinski and Heikki Mannila. A database perspective on knowledge discovery. Commun. ACM, 39(11):58–64, 1996.
• [Law’ 04] Yan-Nei Law, Haixun Wang, and Carlo Zaniolo. Data models and query language for data streams. In VLDB, pages 492–503, 2004.
• [Mozafari’ 08] Barzan Mozafari, Hetal Thakkar, and Carlo Zaniolo. Verifying and mining frequent patterns from large windows over data streams. In International Conference on Data Engineering (ICDE), 2008.
• [Sadri’ 01] Reza Sadri, Carlo Zaniolo, Amir Zarkesh, and Jafar Adibi. Optimization of sequence queries in database systems. In PODS, Santa Barbara, CA, May 2001.
• [Sarawagi’ 98] S. Sarawagi, S. Thomas, and R. Agrawal. Integrating association rule mining with relational database systems: Alternatives and implications. In SIGMOD, 1998.
• [Wang’ 03] H. Wang, W. Fan, P. S. Yu, and J. Han. Mining concept-drifting data streams using ensemble classifiers. In SIGKDD, 2003.

