Author - UCLA Computer Science

advertisement
UCLA
Computer Science Department
High-performance
Pattern Detection and Discovery
for
Databases and Data Streams
Barzan Mozafari
Adviser: Prof. Carlo Zaniolo
Committee Members:
Prof. Junghoo Cho,
Prof. D. Stott Parker, and
Prof. Mark Hansen
Winter 2011
Big Picture
1. Query Languages that
allow for the expression of
complex patterns
2. Scalable Systems that
support such languages
and can handle massive,
high-arrival data
3. Efficient, One-pass
Algorithms that can mine
large amounts of stored or
streaming data and extract
useful patterns
Data
Overview
• Introduction
• Query Languages for Pattern Detection
–
–
–
–
Kleene-* Constructs in SQL
Nested Words [SIGMOD’10, VLDB’10]
Optimization [Work in progress]
XSeq [Work in progress]
• Conclusion
Complex Event Patterns
• Sequences in DBs and CEP over data
streams
• Academic and industrial interest:
– SQL-TS [PODS ‘01]
– SASE [2006], SASE+ [2008]
– SQL Change proposal, 2007 (by Oracle, IBM and
Streambase)
– Other industrial and academic languages:
• Cayuga & CEL
• CEDR
• Microsoft CEP & LINQ
Our Contribution: K*SQL
1.
A powerful language for:
i. Expressing more complex patterns on relational streams
and sequences
ii. Querying data with more complex structures, e.g, XML and
genomic data
2.
3.
A unifying engine for sequence patterns and XML
New optimization techniques
•
4.
5.
pattern search over nested words
Efficient query execution backend for other
languages
XSeq: An XPath-resembling language to bring
Kleene-* to XML applications
Regular Expressions in SQL
rfid_readings (Time, SensorType, ensorId, ItemId)
Nested Kleene-*: K*SQL
Timestamp
BadgeID
Room
1226633804799
26
Room12
1226633805799
2
Room7
1226633806799
26
Room14
1226633807799
5
Room37
1226633808799
5
Room37
…
…
…
Employees who spend >1 hour in the lab but leave
without going to decontamination room
SELECT
badgeID
Lab
Lab
Room2
FROM rfid
PARTITION BY badgeID
ORDER BY timestamp
AS PATTERN
Room12
Room7
Lab
Room2
Room7
Exit
Employees who spend >1 hour in the lab but leave
without going to decontamination room
SELECT
badgeID
L
Lab
L
Lab
Room2
FROM rfid
PARTITION BY badgeID
ORDER BY timestamp
AS PATTERN ( L
)
WHERE L.room = ‘Lab’
Room12
Room7
Lab
Room2
Room7
Exit
Employees who spend >1 hour in the lab but leave
without going to decontamination room
SELECT
badgeID
L+
L
Lab
L
Lab
Room2
FROM rfid
PARTITION BY badgeID
ORDER BY timestamp
AS PATTERN ( L+
)
WHERE L.room = ‘Lab’
Room12
Room7
Lab
Room2
Room7
Exit
Employees who spend >1 hour in the lab but leave
without going to decontamination room
SELECT
badgeID
FROM rfid
PARTITION BY badgeID
ORDER BY timestamp
AS PATTERN ( L+ O+ )
WHERE L.room = ‘Lab’
AND O.room != ‘Decontamination’
L+
O+
L
Lab
L
Lab
O
Room2
O
Room12
O
Room7
Lab
Room2
Room7
Exit
Employees who spend >1 hour in the lab but leave
without going to decontamination room
SELECT
badgeID
L+
R
FROM rfid
PARTITION BY badgeID
ORDER BY timestamp
AS PATTERN ( (R: L+ O*) )
WHERE L.room = ‘Lab’
AND O.room != ‘Decontamination’
O+
L+
R
O+
L
Lab
L
Lab
R
Room2
R
Room12
R
Room7
L
Lab
R
Room2
R
Room7
Exit
Employees who spend >1 hour in the lab but leave
without going to decontamination room
SELECT
badgeID
L+
R
FROM rfid
PARTITION BY badgeID
ORDER BY timestamp
AS PATTERN ( (R: L+ O*)+ )
R+
WHERE L.room = ‘Lab’
AND O.room != ‘Decontamination’
O+
L+
R
O+
L
Lab
L
Lab
R
Room2
R
Room12
R
Room7
L
Lab
R
Room2
R
Room7
Exit
Employees who spend >1 hour in the lab but leave
without going to decontamination room
SELECT
badgeID
L+
R
FROM rfid
PARTITION BY badgeID
ORDER BY timestamp
AS PATTERN ( (R: L+ O*)+ X)
R+
WHERE L.room = ‘Lab’
AND O.room != ‘Decontamination’
AND X.room = ‘Exit’
O+
L+
R
O+
L
Lab
L
Lab
R
Room2
R
Room12
R
Room7
L
Lab
R
Room2
R
Room7
X
Exit
Employees who spend >1 hour in the lab but leave
without going to decontamination room
SELECT
badgeID
L+
R
FROM rfid
PARTITION BY badgeID
ORDER BY timestamp
AS PATTERN ( (R: L+ O*)+ X)
R+
WHERE L.room = ‘Lab’
AND O.room != ‘Decontamination’
AND X.room = ‘Exit’
AND sum(R.Last(L).timestamp –
R.First(L).timestamp)
> 3600
O+
L+
R
O+
L
Lab
L
Lab
R
Room2
R
Room12
R
Room7
L
Lab
R
Room2
R
Room7
X
Exit
Strictly More Expressive, through:
(i)Nested Kleene-*, (ii) Labels, i.e. Aliases
SELECT
badgeID
FROM rfid
PARTITION BY badgeID
ORDER BY timestamp
AS PATTERN ( (R: L+ O*)+ X)
WHERE L.room = ‘Lab’
AND O.room != ‘Decontamination’
AND X.room = ‘Exit’
AND sum(R.Last(L).timestamp –
R.First(L).timestamp)
> 3600
Strictly More Expressive, through:
(i)Nested Kleene-*, (ii) Labels, i.e. Aliases
SELECT badgeID,
L+
Last(R).Last(L).timestamp
– First(R).First(L).timestamp)
R
FROM rfid
+
O
PARTITION BY badgeID
ORDER BY timestamp
AS PATTERN ( (R: L+ O*)+ X)
L+
R+
WHERE L.room = ‘Lab’
R
AND O.room != ‘Decontamination’
O+
AND X.room = ‘Exit’
AND sum(R.Last(L).timestamp –
R.First(L).timestamp)
> 3600
L
Lab
L
Lab
R
Room2
R
Room12
R
Room7
L
Lab
R
Room2
R
Room7
X
Exit
K*SQL Checkpoint
1. A powerful language with a very efficient
implementation based on FSA
2. Subsumes SQL-MR, SASE+, Cayuga,
SQL-TS
3. Many interesting applications
– including queries on semistructured documents
Very natural question:
Can we handle full XML?
Automata and XML
Word Automata (FSA): only linear structure is
explicit, cannot model parenthesis languages
Ordered Tree Automata (OTA): only hierarchical
structure is explicit, exponentially less succinct
for word queries
Pushdown Automata (PDA): Many problems are
undecidable; expensive complexity
Advances in the Automata World
Nested Words [Alur’06]
Linear sequence + well-nested edges
Positions labeled with symbols in S
a1
a2
a3
a4
a5
a6
a7
a8
a9 a10
a11 a12
Positions classified as:
Call positions: both linear and hierarchical successors
Return positions: both linear and hierarchical predecessors
Internal positions: otherwise
20
Nested Word Applications
XML Document
<conference>
<name>
CAV 2006
</name>
<location>
<city>
Seattle
</city>
<hotel>
Sheraton
</hotel>
</location>
<sponsor>
MSR
</sponsor>
<sponsor>
Cadence
</sponsor>
</conference>
Program
global int x;
bool P() {
…
x = 3;
if Q x = 1 ;
…
}
bool Q () {
local int y;
…
x = y;
return (x==0);
}
RNA Sequence
Primary structure: Linear
sequence of nucleotides
(A, C, G, U)
Secondary structure:
Hydrogen bonds
between nucleotides
U
G
C
G
C
A
A
C
U
G
C
A
C
G
G
U
Odious Comparison
Property
FSA
NWA
PDA
input is read from left to right
Yes
Yes
Yes
Deterministic automata as expressive as
non-deterministic ones
Yes
Yes
No
Closed under complementation
Yes
Yes
Only for DPDA
w/ final state
Closed under union, intersection,
concatenation, and Kleene-*
Yes
Yes
No
Emptiness
Decidable
Decidable
Decidable
membership, language inclusion, language
equivalence
Decidable
Decidable
Undecidable
Can recognize paranthesis languages?
No
Yes
Yes
NWA is exponentially more succinct than Tree Automata
No query language has been proposed for NW
XML Sigmod Record:SAX-3
<!ELEMENT SigmodRecord (issue)* >
<!ELEMENT issue
(volume,number,articles) >
<!ELEMENT volume (#PCDATA)>
<!ELEMENT number (#PCDATA)>
<!ELEMENT articles (article)* >
<!ELEMENT article
(title,initPage,endPage,authors) >
<!ELEMENT title (#PCDATA)>
<!ELEMENT initPage (#PCDATA)>
<!ELEMENT endPage (#PCDATA)>
<!ELEMENT authors (author)* >
<!ELEMENT author (#PCDATA)>
<!ATTLIST author position CDATA
#IMPLIED>
tagI
ndex
Type
Token
Value
1
open
Sigmod
Record
_
2
open
issue
_
3
open
volume
_
4
text
_
11
5
close
volume
_
6
open
number
_
…
…
…
…
25
open
author
_
26
attribute
position
01
27
text
_
Karen
Botnich
…
…
…
…
<SigmodRecord>
<issue>
…
<article>
<title>
Implementation of GEM
</title>
<initPage>
45
</initPage>
…
<authors>
…
<author
position="01">
Carlo Zaniolo
</author>
…
</authors>
</article>
….
XPath
Find articles of Carlo Zaniolo
as the 2nd co-author
//article
[authors/author
[@position = "01" and
text()="Carlo Zaniolo"]
]/title/text()
K*SQL
Question: Can we query nested words in
K*SQL?
In particular:
can we express traditional XML queries
– i.e. those often expressed via XPath/XQuery:
Find articles of Carlo Zaniolo
as the 2nd co-author
SELECT Title.token AS articleName
FROM sigmod_record
AS PATTERN
(
)
WHERE
<SigmodRecord>
<issue>
…
<article>
<title>
Implementation of GEM
</title>
<initPage>
45
</initPage>
…
<aut hors>
…
<author
position="01">
Carlo Zaniolo
</author>
…
</authors>
</article>
….
Find articles of Carlo Zaniolo
as the 2nd co-author
<SigmodRecord>
<issue>
…
SELECT Title.token AS articleName
FROM sigmod_record
AS PATTERN
<article>
(OpArt
)
WHERE OpArt.value = ‘<article>’
<title>
Implementation of GEM
</title>
<initPage>
45
</initPage>
…
<authors>
…
<author
position="01">
Carlo Zaniolo
</author>
…
</authors>
</article>
….
Find articles of Carlo Zaniolo
as the 2nd co-author
<SigmodRecord>
<issue>
…
SELECT Title.token AS articleName
FROM sigmod_record
AS PATTERN
(OpArt
<article>
)
WHERE OpArt = open(‘article’)
<title>
Implementation of GEM
</title>
<initPage>
45
</initPage>
…
<authors>
…
<author
position="01">
Carlo Zaniolo
</author>
…
</authors>
</article>
….
Find articles of Carlo Zaniolo
as the 2nd co-author
SELECT Title.token AS articleName
FROM sigmod_record
AS PATTERN
(OpArt OpTitl
)
WHERE OpArt = open(‘article’)
AND OpTitl = open(‘title’)
<SigmodRecord>
<issue>
…
<article>
<title>
Implementation of GEM
</title>
<initPage>
45
</initPage>
…
<authors>
…
<author
position="01">
Carlo Zaniolo
</author>
…
</authors>
</article>
….
Find articles of Carlo Zaniolo
as the 2nd co-author
SELECT Title.token AS articleName
FROM sigmod_record
AS PATTERN
(OpArt OpTitl Title
)
WHERE OpArt = open(‘article’)
AND OpTitl = open(‘title’)
<SigmodRecord>
<issue>
…
<article>
<title>
Implementation of
GEM
</title>
<initPage>
45
</initPage>
…
<authors>
…
<author
position="01">
Carlo Zaniolo
</author>
…
</authors>
</article>
….
Find articles of Carlo Zaniolo
as the 2nd co-author
SELECT Title.token AS articleName
FROM sigmod_record
AS PATTERN
(OpArt OpTitl Title ClTitl
)
WHERE OpArt = open(‘article’)
AND OpTitl = open(‘title’) AND ClTitl =
close(‘title’)
<SigmodRecord>
<issue>
…
<article>
<title>
Implementation of GEM
</title>
<initPage>
45
</initPage>
…
<authors>
…
<author
position="01">
Carlo Zaniolo
</author>
…
</authors>
</article>
….
Find articles of Carlo Zaniolo
as the 2nd co-author
SELECT Title.token AS articleName
FROM sigmod_record
AS PATTERN
(OpArt OpTitl Title ClTitl E*
)
WHERE OpArt = open(‘article’)
AND OpTitl = open(‘title’) AND ClTitl = close(‘title’)
AND isElement(E)
<SigmodRecord>
<issue>
…
<article>
<title>
Implementation of GEM
</title>
<initPage>
45
</initPage>
…
<authors>
…
<author
position="01">
Carlo Zaniolo
</author>
…
</authors>
</article>
….
Find articles of Carlo Zaniolo
as the 2nd co-author
SELECT Title.token AS articleName
FROM sigmod_record
AS PATTERN
(OpArt OpTitl Title ClTitl E*
OpAuths
)
WHERE OpArt = open(‘article’)
AND OpTitl = open(‘title’) AND ClTitl = close(‘title’)
AND isElement(E)
AND OpAuths = open(‘authors’)
<SigmodRecord>
<issue>
…
<article>
<title>
Implementation of GEM
</title>
<initPage>
45
</initPage>
…
<authors>
…
<author
position="01">
Carlo Zaniolo
</author>
…
</authors>
</article>
….
Find articles of Carlo Zaniolo
as the 2nd co-author
<SigmodRecord>
<issue>
…
SELECT Title.token AS articleName
<article>
<title>
FROM sigmod_record
Implementation of GEM
AS PATTERN
</title>
(OpArt OpTitl Title ClTitl E*
<initPage>
OpAuths E*
45
)
</initPage>
WHERE OpArt = open(‘article’)
…
AND OpTitl = open(‘title’) AND ClTitl = close(‘title’) <authors>
AND isElement(E)
AND OpAuths = open(‘authors’)
AND ClArt = close(‘article’)
…
<author
position="01">
Carlo Zaniolo
</author>
…
</authors>
</article>
….
Find articles of Carlo Zaniolo
as the 2nd co-author
<SigmodRecord>
<issue>
…
SELECT Title.token AS articleName
<article>
<title>
FROM sigmod_record
Implementation of GEM
AS PATTERN
</title>
(OpArt OpTitl Title ClTitl E*
<initPage>
OpAuths E* OpAu
45
)
</initPage>
WHERE OpArt = open(‘article’)
…
AND OpTitl = open(‘title’) AND ClTitl = close(‘title’) <authors>
…
AND isElement(E)
AND OpAuths = open(‘authors’)
AND OpAu = open(‘author’)
<author
position="01">
Carlo Zaniolo
</author>
…
</authors>
</article>
….
Find articles of Carlo Zaniolo
as the 2nd co-author
<SigmodRecord>
<issue>
…
SELECT Title.token AS articleName
<article>
<title>
FROM sigmod_record
Implementation of GEM
AS PATTERN
</title>
(OpArt OpTitl Title ClTitl E*
<initPage>
OpAuths E* OpAu Pos
45
)
</initPage>
WHERE OpArt = open(‘article’)
…
AND OpTitl = open(‘title’) AND ClTitl = close(‘title’) <authors>
…
AND isElement(E)
<author
AND OpAuths = open(‘authors’)
AND OpAu = open(‘author’)
AND pos.type = ‘attr’ AND pos.value = ’01’
AND pos.token = ‘position’
position="01">
Carlo Zaniolo
</author>
…
</authors>
</article>
….
Find articles of Carlo Zaniolo
as the 2nd co-author
<SigmodRecord>
<issue>
…
SELECT Title.token AS articleName
<article>
<title>
FROM sigmod_record
Implementation of GEM
AS PATTERN
</title>
(OpArt OpTitl Title ClTitl E*
<initPage>
OpAuths E* OpAu Pos
45
)
</initPage>
WHERE OpArt = open(‘article’)
…
AND OpTitl = open(‘title’) AND ClTitl = close(‘title’) <authors>
…
AND isElement(E)
<author
AND OpAuths = open(‘authors’)
position="01">
AND OpAu = open(‘author’)
Carlo Zaniolo
AND pos = attribute (‘position’, ’01’)
</author>
…
</authors>
</article>
….
Find articles of Carlo Zaniolo
as the 2nd co-author
<SigmodRecord>
<issue>
…
SELECT Title.token AS articleName
<article>
<title>
FROM sigmod_record
Implementation of GEM
AS PATTERN
</title>
(OpArt OpTitl Title ClTitl E*
<initPage>
OpAuths E* OpAu Pos Author
45
)
</initPage>
WHERE OpArt = open(‘article’)
…
AND OpTitl = open(‘title’) AND ClTitl = close(‘title’) <authors>
…
AND isElement(E)
<author
AND OpAuths = open(‘authors’)
position="01">
AND OpAu = open(‘author’)
AND pos = attribute(‘position’, ‘01’)
AND author.token = `Carlo Zaniolo’
Carlo Zaniolo
</author>
…
</authors>
</article>
….
Find articles of Carlo Zaniolo
as the 2nd co-author
<SigmodRecord>
<issue>
…
SELECT Title.token AS articleName
<article>
<title>
FROM sigmod_record
Implementation of GEM
AS PATTERN
</title>
(OpArt OpTitl Title ClTitl E*
<initPage>
OpAuths E* OpAu Pos Author ClAu
45
)
</initPage>
WHERE OpArt = open(‘article’)
…
AND OpTitl = open(‘title’) AND ClTitl = close(‘title’) <authors>
…
AND isElement(E)
<author
AND OpAuths = open(‘authors’)
position="01">
AND OpAu = open(‘author’)
Carlo Zaniolo
AND pos = attribute(‘position’, ‘01’)
AND author.value = `Carlo Zaniolo’
AND ClAu = close(‘author’)
</author>
…
</authors>
</article>
….
Find articles of Carlo Zaniolo
as the 2nd co-author
<SigmodRecord>
<issue>
…
SELECT Title.token AS articleName
<article>
<title>
FROM sigmod_record
Implementation of GEM
AS PATTERN
</title>
(OpArt OpTitl Title ClTitl E*
<initPage>
OpAuths E* OpAu Pos Author ClAu E*
45
)
</initPage>
WHERE OpArt = open(‘article’)
…
AND OpTitl = open(‘title’) AND ClTitl = close(‘title’) <authors>
…
AND isElement(E)
<author
AND OpAuths = open(‘authors’)
position="01">
AND OpAu = open(‘author’)
Carlo Zaniolo
AND pos = attribute(‘position’, ‘01’)
</author>
AND author.value = `Carlo Zaniolo’
AND ClAu = close(‘author’)
…
</authors>
</article>
….
Find articles of Carlo Zaniolo
as the 2nd co-author
<SigmodRecord>
<issue>
…
SELECT Title.token AS articleName
<article>
<title>
FROM sigmod_record
Implementation of GEM
AS PATTERN
</title>
(OpArt OpTitl Title ClTitl E*
<initPage>
OpAuths E* OpAu Pos Author ClAu E*
45
ClAuths ClArt)
</initPage>
WHERE OpArt = open(‘article’)
…
AND OpTitl = open(‘title’) AND ClTitl = close(‘title’) <authors>
…
AND isElement(E)
<author
AND OpAuths = open(‘authors’)
position="01">
AND OpAu = open(‘author’)
Carlo Zaniolo
AND pos = attribute(‘position’, ‘01’)
</author>
AND author.token = `Carlo Zaniolo’
…
AND ClAu = close(‘author’)
AND ClAuths = close(‘authors’)
</authors>
</article>
AND ClArt = close(‘article’)
….
Find articles of Carlo Zaniolo
as the 2nd co-author
<SigmodRecord>
<issue>
…
SELECT Title.token AS articleName
<article>
<title>
FROM sigmod_record
Implementation of GEM
AS PATTERN
</title>
(OpArt OpTitl Title ClTitl E*
<initPage>
OpAuths E* OpAu Pos Author ClAu E*
45
ClAuths ClArt)
</initPage>
WHERE OpArt = open(‘article’)
…
AND OpTitl = open(‘title’) AND ClTitl = close(‘title’) <authors>
…
AND isElement(E)
<author
AND OpAuths = open(‘authors’)
position="01">
AND OpAu = open(‘author’)
Carlo Zaniolo
AND pos = attribute(‘position’, ‘01’)
</author>
AND author.token = `Carlo Zaniolo’
…
AND ClAu = close(‘author’)
</authors>
AND ClAuths = close(‘authors’)
</article>
….
AND ClArt = close(‘article’)
Find articles of Carlo Zaniolo
as the 2nd co-author
<SigmodRecord>
<issue>
…
SELECT Title.token AS articleName
<article>
<title>
FROM sigmod_record
Implementation of GEM
AS PATTERN
</title>
(OpArt OpTitl Title ClTitl E*
<initPage>
OpAuths E* OpAu Pos Author ClAu E*
45
ClAuths ClArt)
</initPage>
WHERE OpArt = open(‘article’)
…
AND OpTitl = open(‘title’) AND ClTitl = close(‘title’) <authors>
…
AND isElement(E)
<author
AND OpAuths = open(‘authors’)
position="01">
AND OpAu = open(‘author’)
Carlo Zaniolo
AND pos = attribute(‘position’, ‘01’)
</author>
AND author.token = `Carlo Zaniolo’
…
AND ClAu = close(‘author’)
</authors>
AND ClAuths = close(‘authors’)
</article>
….
AND ClArt = close(‘article’)
Sequence Queries over XML:
‘W’-Patterns in Stocks
<!ELEMENT Stocks (Stock)* >
<!ELEMENT Stock (symbol, date, price, volume)>
<!ELEMENT symbol (#PCDATA)>
<!ELEMENT date (#PCDATA)>
<!ELEMENT price (#PCDATA)>
<!ELEMENT volume (#PCDATA)>
W-patterns in NASDAQ transactions with volume>1000
SELECT FIRST(Z).FIRST(X).Sym.token
FROM Nasdaq PARTITION BY Y.X.Sym.token
AS PATTERN
(Z: (X: OpSt Sym Date OP Price1 CP
OpV Volume ClV ClSt)*
(Y: OpSt Sym Date OP Price2 CP
OpV Volume ClV ClSt)*
)^2
WHERE
OpSt = open(‘Stock’) AND ClSt = open(‘Stock’)
AND OP = open(‘price’) AND CP = close(‘price’)
AND OpV = open(‘volume’) AND ClV = close(‘volume’)
AND INT(volume.token) >= 100
AND Z.X.price1.token < Z.PREV(X).price1.token
AND Z.Y.price2.token > Z.PREV(Y).price2.token
<Stock symbol=“YHOO”
date=“01-01-2010
23:10:00”>
<price> 18.50 </price>
<volume> 21 </volume>
</Stock>
<Stock symbol=“YHOO”
date=“01-01-2010
23:16:00”>
<price> 18.70 </price>
<volume> 11 </volume>
</Stock>
…
W-patterns in NASDAQ transactions with volume>1000
SELECT FIRST(Z).FIRST(X).Sym.token
FROM Nasdaq PARTITION BY Y.X.Sym.token
AS PATTERN
(Z: (X: OpSt Sym Date OP Price1 CP
OpV Volume ClV ClSt)*
(Y: OpSt Sym Date OP Price2 CP
OpV Volume ClV ClSt)*
)^2
WHERE
OpSt = open(‘Stock’) AND ClSt = open(‘Stock’)
AND OP = open(‘price’) AND CP = close(‘price’)
AND OpV = open(‘volume’) AND ClV = close(‘volume’)
AND INT(volume.token) >= 100
AND Z.X.price1.token < Z.PREV(X).price1.token
AND Z.Y.price2.token > Z.PREV(Y).price2.token
<Stock symbol=“YHOO”
date=“01-01-2010
23:10:00”>
<price> 18.50 </price>
<volume> 21 </volume>
</Stock>
<Stock symbol=“YHOO”
date=“01-01-2010
23:16:00”>
<price> 18.70 </price>
<volume> 11 </volume>
</Stock>
…
W-patterns in NASDAQ transactions with volume>1000
SELECT FIRST(Z).FIRST(X).Sym.token
FROM Nasdaq PARTITION BY Y.X.Sym.token
AS PATTERN
(Z: (X: OpSt Sym Date OP Price1 CP
OpV Volume ClV ClSt)*
(Y: OpSt Sym Date OP Price2 CP
OpV Volume ClV ClSt)*
)^2
WHERE
OpSt = open(‘Stock’) AND ClSt = open(‘Stock’)
AND OP = open(‘price’) AND CP = close(‘price’)
AND OpV = open(‘volume’) AND ClV = close(‘volume’)
AND INT(volume.token) >= 100
AND Z.X.price1.token < Z.PREV(X).price1.token
AND Z.Y.price2.token > Z.PREV(Y).price2.token
X*
Y*
X*
Y*
<Stock symbol=“YHOO”
date=“01-01-2010
23:10:00”>
<price> 18.50 </price>
<volume> 21 </volume>
</Stock>
<Stock symbol=“YHOO”
date=“01-01-2010
23:16:00”>
<price> 18.70 </price>
<volume> 11 </volume>
</Stock>
…
Optimization in K*SQL
• Compile-Time:
– Inferring inter-predicate implications
– Query re-writing, e.g. adding more constrainst
– Greedy predicate assignment
• Run-Time: Avoiding unnecessary backtracks
– VPSearch: Extending KMP search algorithm
to nested words and visibly pushdown words
– Optimizing non-determinisitc queries
• i.e. all-match query modes
K*SQL vs. XML Engines
References
• [1] Data mining: Staking a claim on your privacy. Information and
Privacy Commissioner, Ontario, Jan. 1998.
• [2] Directive on privacy protection. European Union, Oct. 1998.
• [3] The end of privacy. The Economist, May 1999.
• [4] Daniel J. Abadi, Donald Carney, Ugur C etintemel, Mitch
Cherniack, Christian Convey, C. Erwin, Eduardo F. Galvez, M.
Hatoun, Anurag Maskey, Alex Rasin, A. Singer, Michael
Stonebraker, Nesime Tatbul, Ying Xing, R. Yan, and Stanley B.
Zdonik. Aurora: A data stream management system. In SIGMOD
Conference, page 666, 2003.
• [5] Mads Sig Ager, Olivier Danvy, and Henning Korsholm Rohde.
Fast partial evaluation of pattern matching in strings. In PEPM,
2003.
• [6] Jagrati Agrawal, Yanlei Diao, Daniel Gyllstrom, and Neil
Immerman. E cient pattern matching over event streams. In
SIGMOD '08: Proceedings of the 2008 ACM SIGMOD international
conference on Management of data, pages 147{160, New York, NY,
USA, 2008. ACM.
References
• [7] R. Agrawal and R. Srikant. Fast algorithms for mining
association rules in large
• databases. In VLDB, 1994.
• [8] Rakesh Agrawal and Ramakrishnan Srikant. Privacypreserving data mining. In SIG• MOD, 2000.
• [9] Shipra Agrawal, Vijay Krishnan, and Jayant R.
Haritsa. On addressing e ciency
• concerns in privacy-preserving mining. In DASFAA,
2004.
• [10] Rajeev Alur. Marrying words and trees. In PODS,
2007.
References
• [11] Rajeev Alur, Marcelo Arenas, Pablo Barcelo, Kousha Etessami,
Neil Immerman, and1 Leonid Libkin. First-order and temporal logics
for nestedwords. In LICS, 2007.
• [12] Rajeev Alur, Swarat Chaudhuri, and P. Madhusudan.
Languages of nested trees. In CAV, 2006.
• [13] Rajeev Alur and P. Madhusudan. Visibly pushdown languages.
In STOC, pages 202{ 211, 2004.
• [14] Rajeev Alur and P. Madhusudan. Adding nesting structure to
words. In Developments in Language Theory, pages 1{13, 2006.
• [15] Arvind Arasu, Brian Babcock, Shivnath Babu, Mayur Datar,
Keith Ito, Itaru Nishizawa, Justin Rosenstein, and Jennifer Widom.
Stream: The stanford stream data manager. In SIGMOD, 2003.
• [16] Brian Babcock, Mayur Datar, and Rajeev Motwani. Load
shedding for aggregation queries over data streams. In ICDE '04:
Proceedings of the 20th International Conference on Data
Engineering, page 350, Washington, DC, USA, 2004. IEEE
Computer Society.
References
•
•
•
•
•
•
•
[17] RICARDO A. BAEZA-YATES and GASTON H. GONNET. Fast text
searching for regular expressions or automaton searching on tries. 1996.
[18] Yijian Bai, Hetal Thakkar, Chang Luo, Haixun Wang, and Carlo Zaniolo.
A data stream language and system designed for power and extensibility. In
CIKM, pages 337{346, 2006.
[19] Gerard Berry and Ravi Sethi. From regular expressions to deterministic
automata.
[20] Philip Bille and Martin Farach-Colton. Fast and compact regular
expression matching. 2008.
[21] Ronnie Chaiken, Bob Jenkins, Paul Larson, Bill Ramsey, Darren
Shakib, Simon Weaver, and Jingren Zhou. Scope: Easy and e cient parallel
processing of massive data sets. VLDB, 29(2):282{318, 2008.
[22] Ronnie Chaiken, Bob Jenkins, Per- Ake Larson, Bill Ramsey, Darren
Shakib, Simon Weaver, and Jingren Zhou. Scope: easy and e cient parallel
processing of massive data sets. PVLDB, 1(2):1265{1276, 2008.
[23] Hei Chan and Adnan Darwiche. Sensitivity analysis in Bayesian
networks: From single to multiple parameters. In 20'th Conference on
Uncertainty in Arti
cial Intelligence (UAI), 2004.
References
•
•
•
•
•
•
•
•
[24] Y. Chi, H. Wang, P. S. Yu, and R. R. Muntz. Moment: Maintaining closed
frequent itemsets over a stream sliding window. In Proceedings of the 2004 IEEE
International Conference on Data Mining (ICDM'04), November 2004.
[25] Yun Chi, Philip S. Yu, Haixun Wang, and Richard R. Muntz. Loadstar: A load
shedding scheme for classifying data streams. In SDM, 2005.
[26] Alexandre Ev
mievski, Johannes Gehrke, and Ramakrishnan Srikant. Limiting privacy breaches in
privacy preserving data mining. In PODS, 2003.
[27] Sudipto Guha, Dimitrios Gunopulos, and Nick Koudas. Correlating synchronous
and asynchronous data streams. In KDD, pages 529{534, 2003.
[28] Daniel Gyllstrom, Eugene Wu 0002, Hee-Jin Chae, Yanlei Diao, Patrick
Stahlberg, and Gordon Anderson. Sase: Complex event processing over streams.
CoRR, abs/cs/0612128, 2006.
[29] J. Han, J. Pei, and Y. Yin. Mining frequent patterns without candidate generation.
In SIGMOD, 2000.
[30] HARUO HOSOYA, JERO ME VOUILLON, and BENJAMIN C. PIERCE. Regular
expression types for xml. ACM Transactions on Programming Languages and
Systems, 27(1):46{90, January 2005.
[31] Jeong-Hyon Hwang, Sanghoon Cha, Ugur C etintemel, and Stanley B. Zdonik.
Borealisr: a replication-transparent stream processing system for wide-area
monitoring applications. In SIGMOD Conference, pages 1303{1306, 2008.
References
•
•
•
•
•
•
•
•
[32] E. Keogh and M. Pazzani. Learning augmented bayesian classi
ers: A comparison of distribution-based and classi
cation-based approaches. In 7th. Int'l Workshop on AI and Statistics, 1999.
[33] Donald E. Knuth, James H. Morris Jr., and Vaughan R. Pratt. Fast
pattern matching in strings. SIAM J. Comput., 6(2):323{350, 1977.
[34] S. Rao Kosaraju. E cient tree pattern matching. 1989.
[35] Yan-Nei Law, Haixun Wang, and Carlo Zaniolo. Query languages and
data models for database sequences and data streams. In VLDB, 2004.
[36] Yan-Nei Law and Carlo Zaniolo. Improving the accuracy of continuous
aggregates and mining queries on data streams under load shedding.
[37] Yan-Nei Law and Carlo Zaniolo. Improving the accuracy of
continuous aggregates and mining queries on data streams under load
shedding. Int. J. Bus. Intell. Data Min., 3(1):99{117, 2008.
[38] JaeGil Lee, Jiawei Han, Xiaolei Li, and Hector Gonzalez. Traclass:
Trajectory classification using hierarchical regionbased and trajectorybased
clustering. VLDB, 29(2):282{ 318, 2008.
[39] C.K.-S. Leung, Q.I. Khan, and T. Hoque. Cantree: A tree structure for
efficient incremental mining of frequent patterns. In ICDM, 2005.
References
•
•
•
•
•
•
•
[40] Feifei Li, Jimeng Sun, Spiros Papadimitriou, George A. Mihaila, and Ioana
Stanoi. Hiding in the crowd: Privacy preservation on evolving streams through
correlation tracking. In ICDE, pages 686{695, 2007.
[41] Barzan Mozafari, Hetal Thakkar, and Carlo Zaniolo. Verifying and mining
frequent patterns from large windows over data streams. In the 24th International
Conference on Data Engineering (ICDE), 2008.
[42] Barzan Mozafari and Carlo Zaniolo. Publishing naive bayesian classi
ers: Privacy without accuracy loss. In the 35th International Conference on Very
Large Data Bases (VLDB), 2009.
[43] Barzan Mozafari and Carlo Zaniolo. A scalable algorithm for optimal load
shedding with aggregates and mining queries. In Under review process, 2009.
[44] Gonzalo Navarro and Mathieu Rafnot. Fast regular expression search. WAE,
pages 198{212, 1999.
[45] Christopher Olston, Benjamin Reed, Utkarsh Srivastava, Ravi Kumar, and
Andrew Tomkins. Pig latin: a not-so-foreign language for data processing. In Jason
Tsong-Li Wang, editor, SIGMOD Conference, pages 1099{1110. ACM, 2008.
[46] Jian Pei, Jiawei Han, Behzad Mortazavi-Asl, Jianyong Wang, Helen Pinto,
Qiming Chen, Umeshwar Dayal, and M. Hsu. Mining sequential patterns by patterngrowth:The PrefixSpan approach. IEEE TKDE, 16(11):1424{1440, November 2004.
References
• [47] Vibhor Rastogi, Sungho Hong, and Dan Suciu. The boundary
between privacy and utility in data publishing. In VLDB, 2007. 34
• [48] Reza Sadri, Carlo Zaniolo, Amir M. Zarkesh, and Jafar Adibi. A
sequential pattern query language for supporting instant data mining
for e-services. In VLDB, pages 653{656, 2001.
• [49] Reza Sadri, Carlo Zaniolo, Amir M. Zarkesh, and Jafar Adibi.
Expressing and optimizing sequence queries in database systems.
ACM Trans. Database Syst., 29(2):282{318, 2004.
• [50] Nesime Tatbul, Ugur C etintemel, Stanley B. Zdonik, Mitch
Cherniack, and Michael Stonebraker. Load shedding in a data
stream manager. In VLDB, pages 309{320, 2003.
• [51] Nesime Tatbul and Stanley B. Zdonik. Window-aware load
shedding for aggregation queries over data streams. In VLDB,
pages 799{810, 2006.
References
•
•
•
•
•
•
[52] Hetal Thakkar, Barzan Mozafari, and Carlo Zaniolo. A data stream
mining system. In ICDM, pages 79{88, 2008.
[53] Hetal Thakkar, Barzan Mozafari, and Carlo Zaniolo. Designing an
inductive data stream management system: the stream mill experience. In
SSPS in conjunction with EDBT, pages 79{88, 2008.
[54] Yi-Cheng Tu, Song Liu, Sunil Prabhakar, and Bin Yao. Load shedding
in stream databases: A control-based approach. In VLDB, pages 787{798,
2006.
[55] Haixun Wang, Carlo Zaniolo, and Chang Luo. Atlas: A small but
complete sql extension for data mining and data streams. In VLDB, pages
1113{1116, 2003.
[56] J. T. Yao and M. Zhang. A fast tree pattern matching algorithm for xml
query.
[57] Fred Zemke, Andrew Witkowski, Mitch Cherniak, and Latha Colby.
Pattern matching in sequences of rows. In [sql change proposal, march
2007], http://asktom.oracle.com/tkyte/row-patternrecogniton-11-public.pdf
Download