Sequence OLAP - The University of Hong Kong

OLAP on Sequence Data
Published in SIGMOD 2008, Vancouver, Canada.
Authors : Eric Lo, Ben Kao, Wai-Shing Ho, Sau-Dan Lee, Chun Kit Chui and David W. Cheung
Presenter : Chun Kit Chui (Kit), The University of Hong Kong
ckchui@cs.hku.hk
“OLAP on Sequence Data” , Presenter : Chun Kit Chui (Kit)
OLAP on
Sequence Data
Problem Motivation
Sequence Data Cube and Cuboids
New OLAP operations
System architecture
Experimental evaluations
Future works
OLAP on
Sequence Data

Web server access logs
Many kinds of real-life data exhibit logical ordering
among their data items and are thus sequential in nature.
Stock market data (e.g. U.S. Oil Fund ETF, Mexco Energy Corp)
Web server access logs (Web retailer selling sportswear products)
The product dimension is associated with a concept hierarchy in which the finest level of abstraction is product ID, followed by product type and brand.
Time | member-ID | URL | Product | Product type | Brand
2008-1-01 00:01 | 688 | /product.html?pid=12800 | 12800 | Nike shoes | Nike
… | … | … | … | … | …
2008-1-01 00:02 | 688 | /product.html?pid=13250 | 13250 | Adidas shoes | Adidas
2008-1-01 00:10 | 14230 | /product.html?pid=324 | 324 | Puma shoes | Puma
… | … | … | … | … | …
2008-1-01 02:45 | 688 | /product.html?pid=12800 | 12800 | Nike shoes | Nike
… | … | … | … | … | …
2008-1-01 03:49 | 688 | /product.html?pid=329 | 329 | Adidas T-shirts | Adidas
… | … | … | … | … | …
2008-1-01 03:45 | 14230 | /checkout.xhtml | Nil | Nil | Nil
Sequence Data
Web server access logs (Web retailer selling sportswear products)
From the access logs we can trace back the browsing sequences of all members.
Browsing sequence of member 688: < Nike shoes, Adidas shoes, Nike shoes >
Manager: “I would like to know the number of members that did comparison shopping and their distributions over all product web page to product web page pairs within 2008 Quarter 1.”
Sequence Data
The query is referring to a particular kind of pattern in the browsing sequences.
The comparison shopping semantics can be expressed by the pattern template < X, Y, X >.
Pattern template < X, Y, X > | # Members
< Nike Shoes, Adidas Shoes, Nike Shoes > | ?
< Nike Shoes, Puma Shoes, Nike Shoes > | 5,432
< Nike Shoes, Nike Shoes, Nike Shoes > | 13,200
… | …
< Adidas Shoes, Nike Shoes, Adidas Shoes > | 1,020
< Adidas Shoes, Puma Shoes, Adidas Shoes > | 4,331
Sequence Data
Instantiated pattern of < X, Y, X > | # Members
< Nike Shoes, Adidas Shoes, Nike Shoes > | 1
< Nike Shoes, Puma Shoes, Nike Shoes > | ?
< Nike Shoes, Nike Shoes, Nike Shoes > | ?
… | …
< Adidas Shoes, Nike Shoes, Adidas Shoes > | ?
< Adidas Shoes, Puma Shoes, Adidas Shoes > | ?
< Nike Shoes, Adidas Shoes, Nike Shoes > is one of the instantiations of the pattern template.
Since the browsing sequence of member 688 contains the pattern, the sequence contributes a count of 1 to the cell.
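The counting rule can be sketched in Python. This is only a minimal illustration under stated assumptions, not the system's implementation: the names `instantiations` and `count_members` are hypothetical, and the sketch assumes a pattern must occur as a contiguous run of page views.

```python
def instantiations(sequence, template):
    """Return the set of instantiated patterns of `template` found in `sequence`.

    Assumption: a pattern must occur as a contiguous run of page views.
    Equal symbols must bind to equal values; different symbols may bind to
    the same value, so < Nike shoes, Nike shoes, Nike shoes > is allowed.
    """
    n = len(template)
    found = set()
    for start in range(len(sequence) - n + 1):
        window = tuple(sequence[start:start + n])
        binding = {}
        # setdefault returns the value already bound to the symbol, if any.
        if all(binding.setdefault(s, v) == v for s, v in zip(template, window)):
            found.add(window)
    return found


def count_members(browsing, template):
    """Each member adds at most 1 to the cell of every pattern they possess."""
    counts = {}
    for member, sequence in browsing.items():
        for pattern in instantiations(sequence, template):
            counts[pattern] = counts.get(pattern, 0) + 1
    return counts


# Browsing sequence of member 688, as shown on the slide.
browsing = {688: ["Nike shoes", "Adidas shoes", "Nike shoes"]}
cells = count_members(browsing, ("X", "Y", "X"))
print(cells)
```

Because a set of patterns is collected per member before counting, a member who repeats the same pattern several times still contributes only once to that cell.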
The aggregated number of members is counted and a tabulated view of the sequence data should be returned.
Sequence Data
< X, Y, X > | # Members
< Nike Shoes, Adidas Shoes, Nike Shoes > | 200,000
< Nike Shoes, Puma Shoes, Nike Shoes > | 5,432
< Nike Shoes, Nike Shoes, Nike Shoes > | 13,200
… | …
< Adidas Shoes, Nike Shoes, Adidas Shoes > | 1,020
< Adidas Shoes, Puma Shoes, Adidas Shoes > | 4,331
Sequence OLAP system
• Support “pattern based” grouping and aggregation.
Follow-up query
Manager: “So many members did comparison shopping between Nike shoes and Adidas shoes. I would like to further investigate whether those members would browse one more product and, if so, what that product is.”
Sequence OLAP system
• Support “pattern based” grouping and aggregation.
• Obtain query results in real time (OLAP feature).
Result
< X, Y, X, Z >, X=“Nike Shoes”, Y=“Adidas Shoes”, Z=Any | # Members
< Nike Shoes, Adidas Shoes, Nike Shoes, Nike Shoes > | 15,000
< Nike Shoes, Adidas Shoes, Nike Shoes, Adidas T-shirts > | 180,000
< Nike Shoes, Adidas Shoes, Nike Shoes, Puma Shoes > | 9,000
… | …
The new query can be expressed by appending a pattern symbol “Z” to form a new pattern template <X,Y,X,Z>.
The result shows the statistics of one more browsing step after the comparison shopping between Nike Shoes and Adidas Shoes.
The manager finds that the Adidas T-shirts page is the most popular next page for the members who did comparison shopping between the Nike shoes and Adidas shoes pages.
Manager: “The comparison shopping patterns displayed at the ‘product type’ abstraction level are too detailed. I would like to view some higher-level statistics.”
Sequence OLAP system
• Support “pattern based” grouping and aggregation.
• Obtain query results in real time (OLAP feature).
• Provide OLAP operations to ease sequence analysis.
A simple “roll up” operation on the pattern template transforms the summary statistics to the brand abstraction level.
“Product type” abstraction level
< X, Y, X > | # Members
< Nike Shoes, Adidas Shoes, Nike Shoes > | 200,000
< Nike Shoes, Puma Shoes, Nike Shoes > | 5,432
< Nike Shoes, Nike Shoes, Nike Shoes > | 13,200
… | …
< Adidas Shoes, Nike Shoes, Adidas Shoes > | 1,020
< Adidas Shoes, Puma Shoes, Adidas Shoes > | 4,331
“Brand” abstraction level
< X, Y, X > | # Members
< Nike, Adidas, Nike > | 3,150,000
< Nike, Puma, Nike > | 2,180,000
< Nike, Nike, Nike > | 19,000,000
… | …
Concept hierarchy: the brand Nike covers the product types Nike shoes, Nike T-shirts, Nike Basketballs and Nike socks.
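One way to picture the roll-up is to map every page view to its brand through the concept hierarchy and then re-evaluate the pattern on the rolled-up sequences. The sketch below does this in Python on toy data. The hierarchy dictionary `BRAND_OF`, the helper names and member 701 are all assumptions made for illustration, and contiguous matching is assumed.

```python
# Hypothetical concept hierarchy: product type -> brand.
BRAND_OF = {
    "Nike shoes": "Nike",
    "Nike T-shirts": "Nike",
    "Adidas shoes": "Adidas",
    "Adidas T-shirts": "Adidas",
    "Puma shoes": "Puma",
}


def roll_up(sequence, hierarchy):
    """Replace each product type by its parent in the concept hierarchy."""
    return [hierarchy[item] for item in sequence]


def has_pattern(sequence, pattern):
    """True if `pattern` occurs in `sequence` as a contiguous run."""
    n = len(pattern)
    return any(
        tuple(sequence[i:i + n]) == tuple(pattern)
        for i in range(len(sequence) - n + 1)
    )


# Toy data: member 688 is from the slides, member 701 is invented.
browsing = {
    688: ["Nike shoes", "Adidas shoes", "Nike shoes"],
    701: ["Nike shoes", "Adidas T-shirts", "Nike T-shirts"],
}

type_level = sum(
    has_pattern(seq, ("Nike shoes", "Adidas shoes", "Nike shoes"))
    for seq in browsing.values()
)
brand_level = sum(
    has_pattern(roll_up(seq, BRAND_OF), ("Nike", "Adidas", "Nike"))
    for seq in browsing.values()
)
print(type_level, brand_level)
```

In this toy data only member 688 matches at the product type level, while both members match at the brand level, which is why brand-level counts can be much larger than any single product-type cell.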
Research Objective
To design and implement an OLAP system that is able to
• support “pattern based” grouping and aggregation.
• obtain query results in real time.
  - Especially optimized for interactive/iterative queries.
• provide OLAP operations to ease explorative analysis of sequence data.
< X, Y > | # Members
< Nike, Adidas > | 1,315,000
< Nike, Puma > | 6,480,000
< Nike, Nike > | 3,189,000
… | …
< X, Y, X > | # Members
< Nike, Adidas, Nike > | 315,000
< Nike, Puma, Nike > | 2,180,000
< Nike, Nike, Nike > | 189,000
… | …
RFID Logs
Smart card
• Radio-frequency identification (RFID) is an automatic identification method, relying on storing and remotely retrieving data using devices called RFID tags.
• The smart card system in public transit
  - Octopus card in Hong Kong, Orca card in Seattle (2009), etc.
  - Electronic money
  - Travel histories of passengers are logged in a database.
  - Generates a massive amount of sequence data.
Event Database
[Figure: smart card and card reader]
Time | Card-ID | Location | Action | Amount
2008-7-25 09:01 | Kit | Shatin | in | -
2008-7-25 09:25 | Kit | Central | out | - $5
… | … | … | … | …
2008-7-25 18:23 | Kit | Central, Machine #10 | Add value | + $100
2008-7-25 18:25 | Kit | Central | in | -
… | … | … | … | …
2008-7-25 18:49 | Kit | Shatin | out | - $5
… | … | … | … | …
Payment can be done easily by waving the card over the card reader.
Marketing Manager: “The number of round-trip passengers and their distributions over all origin-destination station pairs within 2008 Quarter 4.”
Round trip statistics (Stations level)
< X, Y, Y, X > | # Users
< Shatin, Central, Central, Shatin > | 2,032
< Shatin, Admiralty, Admiralty, Shatin > | 1,982
… | …
< Admiralty, Central, Central, Admiralty > | 22,822
< Admiralty, Kowloon, Kowloon, Admiralty > | 10,020
Sequence OLAP system
• Support “pattern based” grouping and aggregation.
• Obtain query results in real time.
• Provide OLAP operations to ease explorative analysis.
Sequence Data Cuboid
Preliminary
Sequence Cuboid (S-Cuboid)
• A logical view of sequence data at a particular degree of summarization.
• Sequences can be characterized by
  - attributes’ values of the events in the sequence (e.g. time, spending, product type)
  - the subsequence/substring patterns they possess (e.g. <X,Y,X>, <X,Y,Y,X>)
An S-Cuboid
< X, Y, Y, X > | # Users
< Shatin, Central, Central, Shatin > | 2
< Kowloon, Admiralty, Admiralty, Kowloon > | 9
… | …
Phase 1. Sequence Formation
Event Database
Time | Card-ID | Location | Action | Amount
2008-6-09 00:01 | Kit | Shatin | in | 0
2008-6-09 02:25 | Kit | Central | out | -5
… | … | … | … | …
2008-6-14 02:23 | Kit | Central, Machine #10 | Add value | +100
2008-6-14 02:25 | Kit | Central | in | 0
… | … | … | … | …
2008-6-14 18:49 | Kit | Shatin | out | -5
… | … | … | … | …
Event Selection: an event selection step selects a set of relevant records and attributes.
After event selection:
Time | Card-ID | Location | Action | Amount
2008-6-09 00:01 | Kit | Shatin | in | 0
2008-6-09 02:25 | Kit | Central | out | -5
… | … | … | … | …
2008-6-14 02:25 | Kit | Central | in | 0
… | … | … | … | …
2008-6-14 18:49 | Kit | Shatin | out | -5
… | … | … | … | …
Sequence Formation: a sequence formation step forms sequences from the event dataset.
User : Individual, Time : Day
Seq ID | Sequence of events
S1 | < e1, e2, e102, e180 >
S2 | < e3, e7, e8, e12, e19, e232, e234, e235 >
S3 | < e4, e5, e9, e13, e14, e290, e292, e352 >
… | …
Sequences can be formed per day and for each individual user. This gives a number of daily travel sequences for each user.
E.g. S1 is Kit’s trip on Monday.
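Forming daily sequences per individual user amounts to clustering the selected events by (card ID, day) and ordering each cluster by time. The Python sketch below illustrates this. The function name `form_sequences` and the tuple layout are assumptions, and the dates are zero-padded versions of the timestamps shown on the slides.

```python
from collections import defaultdict
from datetime import datetime

# (time, card_id, location, action) records; deliberately not in time order.
events = [
    ("2008-06-09 23:49", "Kit", "Shatin", "out"),
    ("2008-06-09 00:01", "Kit", "Shatin", "in"),
    ("2008-06-14 18:49", "Kit", "Shatin", "out"),
    ("2008-06-09 22:25", "Kit", "Central", "in"),
    ("2008-06-09 02:25", "Kit", "Central", "out"),
    ("2008-06-14 02:25", "Kit", "Central", "in"),
]


def form_sequences(events):
    """Cluster events by (card ID, day) and order each cluster by time."""
    clusters = defaultdict(list)
    for time_str, card_id, location, action in events:
        ts = datetime.strptime(time_str, "%Y-%m-%d %H:%M")
        clusters[(card_id, ts.date())].append((ts, location, action))
    return {
        key: [(location, action) for _, location, action in sorted(cluster)]
        for key, cluster in clusters.items()
    }


daily = form_sequences(events)
for key, sequence in sorted(daily.items()):
    print(key, sequence)
```

Changing the clustering key from `ts.date()` to `ts.year` would give one sequence per user per year instead of per day.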
Sequences can also be formed according to the time dimension at the abstraction level of year, per individual user.
User : Individual, Time : Year
Seq ID | Sequence of events
S1 | < e1, e2, e102, e180, e1002, e1800, e1801, … >
S2 | < e3, e7, e8, e12, e19, e232, e234, e235, e2134, e2135 >
S3 | < e4, e5, e9, e13, e14, e290, e292, e352, e3252, … >
… | …
E.g. S1 is Kit’s trip in 2008.
Phase 2. S-Cuboid construction
Sequence Grouping: a sequence grouping step groups the sequences that share the same dimensions’ values into a sequence group.
E.g. travel sequences are grouped according to their fare groups.
[Figure: sequences S1, S2, S3, S4, … arranged in a grid with axes User (individual: Kit, Ben, Shing, …, or fare-group, e.g. Regular) and Time (day).]
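The sequence grouping step can be sketched as a group-by on the global dimension values of each sequence. In the Python sketch below, the `FARE_GROUP` mapping, the assignment of sequences to users and the function name `group_sequences` are all hypothetical; only the “Regular” fare group appears on the slides.

```python
from collections import defaultdict

# Hypothetical mapping from individual user to fare group.
FARE_GROUP = {"Kit": "Regular", "Ben": "Regular", "Shing": "Student"}

# Hypothetical global dimension values attached to each formed sequence.
sequences = {
    "S1": {"user": "Kit", "day": "2008-06-09"},
    "S2": {"user": "Ben", "day": "2008-06-09"},
    "S3": {"user": "Shing", "day": "2008-06-09"},
    "S4": {"user": "Kit", "day": "2008-06-10"},
}


def group_sequences(sequences, fare_group):
    """Group sequence IDs that share the same (fare-group, day) values."""
    groups = defaultdict(list)
    for seq_id, dims in sequences.items():
        groups[(fare_group[dims["user"]], dims["day"])].append(seq_id)
    return dict(groups)


groups = group_sequences(sequences, FARE_GROUP)
for key, members in groups.items():
    print(key, members)
```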
Pattern Grouping: the pattern grouping step further groups the sequences according to the “patterns” they possess.
Pattern X,Y,Y,X, with X (Location : station) and Y (Location : station)
[Figure: each sequence group (User : fare-group, Time : day) is further divided along the pattern dimensions X and Y.]
Each cell represents an instantiated pattern, e.g. <Shatin, Central, Central, Shatin>.
We assign a sequence to a cell if that sequence contains the instantiated pattern. E.g. the cell with X = Shatin and Y = Central contains S1 and S3.
Events of S1:
Event | Time | Card-ID | Location | Action | Amount
e1 | 2008-6-09 00:01 | Kit | Shatin | in | 0
e2 | 2008-6-09 02:25 | Kit | Central | out | -5
… | … | … | … | … | …
e102 | 2008-6-09 22:25 | Kit | Central | in | 0
… | … | … | … | … | …
e180 | 2008-6-09 23:49 | Kit | Shatin | out | -5
… | … | … | … | … | …
Finally, an aggregation function is applied to the sequences in each cuboid cell.
E.g. the cell <Shatin, Central, Central, Shatin> holds S1 and S3, so its aggregated value is Count: 2.
4D S-Cuboid
< X, Y, Y, X > | # Users
< Shatin, Central, Central, Shatin > | 2
< Shatin, Kowloon, Kowloon, Shatin > | 9
… | …
The four dimensions are the global dimensions (User : fare-group, Time : day) and the pattern dimensions (X and Y, both Location : station).
Sequence Cuboid query language
This query specifies the construction of the S-Cuboid that answers the round-trip query in the running example.
• Sequence Formation: form individual daily travel sequences.
• Sequence Grouping: we specify the global dimensions in the sequence grouping step. Group the sequences with the same fare-group within the same day.
• Pattern Grouping: group the sequences according to the pattern template <X,Y,Y,X>, where X and Y refer to the location dimension at the station abstraction level.
The predicates further increase the expressive power of pattern matching in the query language. What exactly is a round-trip pattern?
[Query annotations: global dimensions, pattern template, pattern dimensions]
E.g. Kit < Shatin, Central, Central, Shatin, Shatin, Central, Central, Shatin >
The cell restriction defines how to deal with situations in which a data sequence contains multiple occurrences of a cell’s pattern.
E.g. a sequence contributes 1 count whenever we can find one match of the pattern in the sequence.
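The effect of the cell restriction can be seen on Kit's example sequence, which contains the round-trip pattern twice. The Python sketch below contrasts the restriction described above (a sequence contributes 1 if at least one match exists) with counting every occurrence. The all-occurrences variant is shown only for contrast and is an assumption, as are the helper name `occurrences` and the use of contiguous matching.

```python
def occurrences(sequence, pattern):
    """Number of positions where `pattern` occurs as a contiguous run."""
    n = len(pattern)
    return sum(
        1
        for i in range(len(sequence) - n + 1)
        if tuple(sequence[i:i + n]) == tuple(pattern)
    )


kit = ["Shatin", "Central", "Central", "Shatin",
       "Shatin", "Central", "Central", "Shatin"]
round_trip = ("Shatin", "Central", "Central", "Shatin")

all_matches = occurrences(kit, round_trip)
# Restriction from the slide: one count per sequence if any match exists.
per_sequence = 1 if all_matches > 0 else 0
print(all_matches, per_sequence)
```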
Any change to the cuboid specification transforms the S-Cuboid into another one.
E.g. changing the pattern template to (X,Y,Y,X,Z) generates another S-Cuboid.
Properties of S-Cuboids
• Exponential number of S-cuboids
  - The length of the pattern template is unbounded, e.g. pattern template (X,Y,Y,X,A,B,…)
  - Recall that changing the pattern template essentially changes the cuboid specification and thus generates a new cuboid.
• Non-summarizable
Traditional OLAP
In traditional OLAP systems, data are summarizable, i.e. summaries at a finer abstraction level can be used to construct the summary at a higher abstraction level.
E.g. seven finer summaries with # Sales = 1 each add up to the coarser summary # Sales = 7 for the whole week. Summarizable!
“OLAP on Sequence Data” , Presenter : Chun Kit Chui (Kit)
Properties of S-Cuboids
Sequence Database
Seq ID

Can we compute the S-Cuboid
with pattern <X,Y> (coarser
summary) from the S-Cuboid
with pattern <X,Y,Z> (finer
summary) without looking at
the sequence database?
S-Cuboid (Finer aggregates)
< X, Y, Z >
Count
< Kowloon, Central, Kowloon >
1
< X, Y >
Count
< Kowloon, Central, Central >
1
< Kowloon, Central>
2
Sequence of events
Infinite number of S-cuboids
Kit < Kowloon, Central, Kowloon, Central >
S-Cuboid
(Coarser aggregates)
Ben < Kowloon, Central, Central, Kowloon >

The number of S-Cuboid
pattern
dimensions is
infinite
S-Cuboid
(Finer aggregates)
Sequence Database
Seq ID
< X, Y, Z >
Count
Pattern Template
(X,Y,Y,X,A,B,…)
< Kowloon, Central, Kowloon >
1
Sequence of events

Kit < Kowloon, Central, Kowloon, Central, Central >
< Kowloon, Central, Central >
1
(Coarser aggregates)
< X, Y >
Count
< Kowloon, Central>
1
Ben < Kowloon, Admiralty >

Non-summarizable
The problem is that we don’t know if the counts in
these two patterns are generated from the same
sequence, or two different sequences.
Properties of S-Cuboids

Infinite number of S-cuboids
    The number of S-Cuboid pattern dimensions is infinite.
    Pattern Template (X,Y,Y,X,A,B,…)

Non-summarizable
    Can we compute the S-Cuboid with pattern <X,Y> (coarser summary) from the S-Cuboid with pattern <X,Y,Z> (finer summary) without looking at the sequence database?

    First sequence database
        Seq ID    Sequence of events
        Kit       < Kowloon, Central, Kowloon, Central >
        Ben       < Kowloon, Central, Central, Kowloon >

        S-Cuboid (Finer aggregates)
        < X, Y, Z >                       Count
        < Kowloon, Central, Kowloon >     1
        < Kowloon, Central, Central >     1

        S-Cuboid (Coarser aggregates)
        < X, Y >                          Count
        < Kowloon, Central >              2

    Second sequence database
        Seq ID    Sequence of events
        Kit       < Kowloon, Central, Kowloon, Central, Central >
        Ben       < Kowloon, Admiralty >

        S-Cuboid (Finer aggregates)
        < X, Y, Z >                       Count
        < Kowloon, Central, Kowloon >     1
        < Kowloon, Central, Central >     1

        S-Cuboid (Coarser aggregates)
        < X, Y >                          Count
        < Kowloon, Central >              1

    The problem is that we don’t know whether the counts in these two patterns are generated from the same sequence or from two different sequences.
    Coarser aggregates cannot be computed solely from the corresponding finer aggregates.

Traditional OLAP vs. Sequence OLAP (figure)
    Traditional OLAP: seven finer cells with # Sales = 1 each roll up to a single coarser cell, # Sales = 7 for the whole week. Summarizable!
    Sequence OLAP: the finer cells < A, B, A > and < A, B, B >, each with #Sequences = 1, do not determine #Sequences for the coarser cell < A, B >. Non-summarizable!
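The Kit and Ben example can be replayed in a few lines. The following is only an illustrative sketch, not the paper's implementation: it assumes substring matching, counts each sequence at most once per cell, and looks only at cells that start with <Kowloon, Central>. The helper names `substrings` and `s_cuboid` are hypothetical.

```python
from collections import Counter

def substrings(seq, n):
    """Distinct contiguous length-n patterns occurring in one sequence."""
    return {tuple(seq[i:i + n]) for i in range(len(seq) - n + 1)}

def s_cuboid(db, n, prefix=()):
    """For each length-n pattern starting with `prefix`, count the sequences containing it."""
    counts = Counter()
    for seq in db.values():
        for pattern in substrings(seq, n):
            if pattern[:len(prefix)] == tuple(prefix):
                counts[pattern] += 1
    return counts

K, C, A = "Kowloon", "Central", "Admiralty"
db1 = {"Kit": [K, C, K, C], "Ben": [K, C, C, K]}
db2 = {"Kit": [K, C, K, C, C], "Ben": [K, A]}

# Finer aggregates: cells of <X,Y,Z> that extend <Kowloon, Central>.
finer1 = s_cuboid(db1, 3, prefix=(K, C))
finer2 = s_cuboid(db2, 3, prefix=(K, C))

# Coarser aggregates: the <Kowloon, Central> cell of <X,Y>.
coarser1 = s_cuboid(db1, 2, prefix=(K, C))
coarser2 = s_cuboid(db2, 2, prefix=(K, C))

print(finer1 == finer2)                       # identical finer cells
print(coarser1[(K, C)], coarser2[(K, C)])     # different coarser counts
```

Both databases produce the same finer cells, yet the coarser count differs, so no function of the finer cells alone can return the coarser count.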
Properties of S-Cuboids

Exponential number of S-cuboids
    The length of the pattern template is infinite.
    Pattern Template (X,Y,Y,X,A,B,…)
    Full materialization is impossible!
Non-summarizable
    Coarser aggregates cannot be computed solely from the corresponding finer aggregates.
    → Partial materialization is infeasible!
Properties of S-Cuboids

Research direction
    Precompute some other auxiliary data structures so that queries can be computed online using the pre-built data structures.

S-OLAP Specific Operations
Assist explorative analysis of the sequence data
S-OLAP specific operations

Navigate between cuboids with ease
Traditional OLAP operations for Global Dimensions
    SLICE, DICE, ROLL-UP, DRILL-DOWN, etc.
New S-OLAP operations for Pattern Dimensions / Pattern Template
    APPEND(X)                (X,Y,Y) → (X,Y,Y,X)
    DE-TAIL                  (X,Y,Y,X) → (X,Y,Y)
    PREPEND(Z)               (X,Y,Y,X) → (Z,X,Y,Y,X)
    DE-HEAD                  (Q,Y,Y,X) → (Y,Y,X)
    PATTERN-ROLL-UP(X)       (X,Y,Y,X) → (X,Y,Y,X), with X at a coarser abstraction level
    PATTERN-DRILL-DOWN(X)    (X,Y,Y,X) → (x,Y,Y,x), with x at a finer abstraction level
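
The four structural operations act on the pattern template alone, which a short sketch can make concrete. This is a hypothetical illustration that models a template as a tuple of symbols; the function names are not from the paper, and PATTERN-ROLL-UP / PATTERN-DRILL-DOWN are left out because they change a symbol's abstraction level rather than the template's shape.

```python
def append(template, symbol):
    """APPEND: add a symbol at the end of the pattern template."""
    return tuple(template) + (symbol,)

def prepend(template, symbol):
    """PREPEND: add a symbol at the front of the pattern template."""
    return (symbol,) + tuple(template)

def de_tail(template):
    """DE-TAIL: drop the last symbol of the pattern template."""
    if not template:
        raise ValueError("cannot DE-TAIL an empty pattern template")
    return tuple(template[:-1])

def de_head(template):
    """DE-HEAD: drop the first symbol of the pattern template."""
    if not template:
        raise ValueError("cannot DE-HEAD an empty pattern template")
    return tuple(template[1:])

print(append(("X", "Y", "Y"), "X"))        # ('X', 'Y', 'Y', 'X')
print(de_tail(("X", "Y", "Y", "X")))       # ('X', 'Y', 'Y')
print(prepend(("X", "Y", "Y", "X"), "Z"))  # ('Z', 'X', 'Y', 'Y', 'X')
print(de_head(("Q", "Y", "Y", "X")))       # ('Y', 'Y', 'X')
```

APPEND followed by DE-TAIL (or PREPEND followed by DE-HEAD) returns the original template, which is what lets a user step back and forth between neighbouring S-cuboids.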
Sequence OLAP: pattern template < X, Y >

Tell me the summary statistics of the single-trip travel patterns of passengers among different Rail Lines, please.

CUBOID by SUBSTRING(X,Y) WITH
    X as location at “Rail Lines”,
    Y as location at “Rail Lines”
LEFT-MAXIMALITY (x1, y1) WITH
    x1.action = “in” AND
    y1.action = “out”

S-Cuboid 1 (10 * 10 cells)
    < X, Y >, X and Y at Line level          # Passenger
    < Tsuen Wan Line, Island Line >          120,000
    < Island Line, Tsuen Wan Line >          8,000
    …                                        …

More detailed statistics of passengers traveling from the Tsuen Wan Line to each of the Island Line stations, please.

SLICE, P-DRILL-DOWN
Instead of specifying the S-Cuboid construction query, a SLICE plus a P-DRILL-DOWN(Y) is done.

S-Cuboid 2 (1 * 14 cells)
    < X, Y >, X at Line level, Y at Station level
    X=“Tsuen Wan Line”, Y=“Island Line”      # Passenger
    < Tsuen Wan Line, Central >              100,000
    < Tsuen Wan Line, Admiralty >            8,300
    < Tsuen Wan Line, Wan Chai >             4,030
    < Tsuen Wan Line, Causeway Bay >         12,430
    …                                        …

APPEND(Y)

S-Cuboid 3 (1 * 14 * 14 cells)
    < X, Y, Y >, X at Line level, Y at Station level
    X=“Tsuen Wan Line”, Y=“Island Line”              # Passenger
    < Tsuen Wan Line, Central, Central >             90,000
    < Tsuen Wan Line, Admiralty, Admiralty >         8,300
    < Tsuen Wan Line, Wan Chai, Wan Chai >           4,030
    < Tsuen Wan Line, Admiralty, Admiralty >         2,430
    …                                                …

DE-TAIL

The S-OLAP operations not only assist the exploratory analysis of the sequence data, they also hide all the technical details of specifying the S-Cuboid query from the business users.
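
The step from S-Cuboid 1 to S-Cuboid 2 can be sketched as a filter followed by a regrouping at a finer level. The code below is a toy illustration under several assumptions: the trips and the station-to-line mapping `LINE_OF` are made up for the example, each trip is treated as a two-event sequence <origin, destination>, and the function names are hypothetical.

```python
from collections import Counter

# Assumed toy concept hierarchy: station -> rail line.
LINE_OF = {
    "Tsuen Wan": "Tsuen Wan Line",
    "Kwai Fong": "Tsuen Wan Line",
    "Wan Chai": "Island Line",
    "Causeway Bay": "Island Line",
}

# Assumed toy data: one (origin station, destination station) pair per trip.
trips = [
    ("Tsuen Wan", "Wan Chai"),
    ("Kwai Fong", "Wan Chai"),
    ("Tsuen Wan", "Causeway Bay"),
    ("Wan Chai", "Tsuen Wan"),
]

def at_line(station):
    return LINE_OF[station]

def at_station(station):
    return station

def cuboid(trips, x_level, y_level):
    """Count trips per <X, Y> cell, with X and Y mapped to the chosen levels."""
    counts = Counter()
    for origin, destination in trips:
        counts[(x_level(origin), y_level(destination))] += 1
    return counts

# S-Cuboid 1: X and Y both at line level.
cuboid1 = cuboid(trips, at_line, at_line)

# SLICE on X = "Tsuen Wan Line", Y = "Island Line", then P-DRILL-DOWN(Y).
sliced = [t for t in trips
          if at_line(t[0]) == "Tsuen Wan Line" and at_line(t[1]) == "Island Line"]
cuboid2 = cuboid(sliced, at_line, at_station)

print(cuboid1)
print(cuboid2)
```

The drill-down is recomputed from the sliced trips rather than from `cuboid1`, since a coarser cell does not carry the station-level detail.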
System Architecture

Components (figure): Event Dataset, Sequence Query Engine, Sequence Cache, Sequence OLAP Engine, Auxiliary Data Structures, Cuboid Repository, User Interface (queries in, results out).

Event Dataset
    The raw data of an S-OLAP system is a set of events that are deposited in an Event Dataset.
Sequence Query Engine
    The job of the Sequence Query Engine is to compose sets of event sequences out of the event dataset (Phase 1 in S-Cuboid construction).
User Interface
    The User Interface provides certain user-friendly components to help a user specify an S-cuboid.
Cuboid Repository
    Given an S-Cuboid query, the S-OLAP Engine consults a Cuboid Repository to see if such an S-cuboid has been previously computed and stored.
Auxiliary Data Structures
    The S-OLAP Engine computes the S-cuboid with the help of certain Auxiliary Data Structures.
Auxiliary Data Structures
    Counter-based approach
    Inverted indices approach

Counter-Based approach
    Each cell in an S-cuboid is associated with a counter.
    To determine the counters’ values, the entire set of sequences is scanned.
    For each sequence s, we determine the cells whose associated patterns are contained in s and increment each of these counters by 1.
    Basic and simple.
    But processing iterative queries requires counting from scratch.
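
A minimal sketch of the counter-based scan is given below. It is an illustration rather than the authors' code, and it makes two assumptions: patterns are matched as substrings, and positions that share a template symbol must hold the same value while different symbols may still bind the same value. The names `instantiate` and `counter_based` are hypothetical.

```python
from collections import Counter

def instantiate(template, window):
    """Return the cell for `window` if it fits `template`, otherwise None."""
    binding = {}
    for symbol, value in zip(template, window):
        # The same symbol must always be bound to the same value.
        if binding.setdefault(symbol, value) != value:
            return None
    return tuple(window)

def counter_based(sequences, template):
    """Scan every sequence once and increment one counter per contained cell."""
    counters = Counter()
    n = len(template)
    for seq in sequences:
        cells = set()                      # count a sequence once per cell
        for i in range(len(seq) - n + 1):
            cell = instantiate(template, seq[i:i + n])
            if cell is not None:
                cells.add(cell)
        for cell in cells:
            counters[cell] += 1
    return counters

sequences = [
    ["K", "C", "C", "K"],
    ["K", "C", "K", "C"],
    ["K", "C", "C", "K", "C", "C", "K"],
]
print(counter_based(sequences, ("X", "Y")))
print(counter_based(sequences, ("X", "Y", "Y", "X")))
```

Each call walks the whole list of sequences again, which is the "counting from scratch" cost that iterative queries pay under this approach.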
S-OLAP query evaluation

Inverted-Index Approach
    Based on the fragment cube (X. Li, J. Han, and H. Gonzalez. VLDB 2004) concept.
    A set of inverted indices is created by preprocessing the data offline.
        Algorithm BuildIndex (see paper)
    During query processing, the relevant inverted indices are joined based on the matching pattern, in real time.
        Algorithm QueryIndices (see paper)
    A by-product of answering a query is the creation of new inverted indices.
        Newly built indices are useful to the processing of iterative S-OLAP operations (see paper for algorithms)
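
The idea of joining inverted indices can be sketched as follows. This is a simplified illustration and not the paper's BuildIndex or QueryIndices algorithms: it assumes substring patterns, maps each pattern to the set of sequence IDs containing it, and verifies each joined candidate against only the sequences that appear in both input lists. All function names are hypothetical.

```python
from collections import defaultdict

def build_index(db, n):
    """Inverted index: length-n pattern -> IDs of the sequences containing it."""
    index = defaultdict(set)
    for sid, seq in db.items():
        for i in range(len(seq) - n + 1):
            index[tuple(seq[i:i + n])].add(sid)
    return dict(index)

def contains(seq, pattern):
    """True if `pattern` occurs contiguously in `seq`."""
    n = len(pattern)
    return any(tuple(seq[i:i + n]) == pattern for i in range(len(seq) - n + 1))

def join(db, left, right):
    """Join two length-n indices into a length-(n+1) index."""
    joined = {}
    for p, p_ids in left.items():
        for q, q_ids in right.items():
            if p[1:] != q[:-1]:
                continue                      # patterns do not overlap
            candidate = p + q[-1:]
            # Only sequences present in both lists can contain the candidate.
            ids = {sid for sid in p_ids & q_ids if contains(db[sid], candidate)}
            if ids:
                joined[candidate] = ids
    return joined

db = {
    "Kit": ["Kowloon", "Central", "Kowloon", "Central", "Central"],
    "Ben": ["Kowloon", "Admiralty"],
}
index2 = build_index(db, 2)           # built offline
index3 = join(db, index2, index2)     # by-product of answering an APPEND query
counts3 = {pattern: len(ids) for pattern, ids in index3.items()}
print(counts3)
```

Cell counts for the two-symbol template come straight from the list lengths in `index2`, and the joined `index3` can be kept for the next iterative query.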
Experiments

A prototype S-OLAP system was implemented using C++.
Real Data
    Passenger traveling history.
    KDD Cup 2000
        Clickstream data from a web retailer selling legwear and legcare products.
        50,524 sequences.
KDD Cup 2000 Question 1
    Look for page-click patterns.
    We answer this question in an exploratory way via three iterative queries.
Experiments: KDD Cup 2000 Question 1

Qa: Look for the statistics of all 2-step navigations at the “page category” level.
The corresponding pattern template to capture the 2-step navigation semantics is <X,Y>.

Cuboid Qa (44*44 cells)
    < X, Y >, X, Y at “page category” level      # User sessions
    < Main page, Product Catalog >               6,524
    …                                            …
    < Product Catalog, Legwear Product >         2,201
    …                                            …
    < Main page, Promotion ad >                  852
    …                                            …
    < Product Catalog, Legcare Product >         150

Comparatively speaking, very few visitors browse from a product catalog page to a Legcare product page.

Qb (1. SLICE, 2. P-DRILL-DOWN): Since many visitors browse from the product catalog to a legwear product page, what exactly are the products they browse?

Cuboid Qb (1*279 cells)
    < X, Y > (sliced), X at “page category” level; Y at “page” level      # User sessions
    < Product Catalog, Null >                                             181
    < Product Catalog, PID - 34839 >                                      172
    < Product Catalog, PID - 34897 >                                      163
    …                                                                     …

The most popular product that visitors browse from the catalog page is product 34839 (DKNY skin legwear collection product).

Qc: APPEND(Z)

Cuboid Qc (1*279*279 cells)
    < X, Y, Z > (sliced), X at “page category” level; Y, Z at “page” level      # User sessions
    …                                                                           …
    < Product Catalog, PID - 34839, PID - 34839 >                               17
    < Product Catalog, PID - 34839, PID - 34897 >                               14
    …                                                                           …

Runtime of the inverted-index approach (II) and the counter-based approach (CB)
    The runtime of II is higher than that of CB in Qa because we include the index precomputation time in Qa.
    For the iterative queries, II takes advantage of processing only the sequences that possess the pattern < Product Catalog, Legwear Product >.
Experiments on synthetic data

Study the scalability of the Counter-Based approach (CB) and the Inverted-Index approach (II) under a series of APPEND operations:
    QA1: SUBSTRING(X,Y)
    SLICE + APPEND → QA2 (X,Y,Z)
    SLICE + APPEND → QA3 (X,Y,Z,A)
    SLICE + APPEND → QA4 (X,Y,Z,A,B)
    SLICE + APPEND → QA5 (X,Y,Z,A,B,C)
Experiments on synthetic data

Cumulative runtime (plot)
    Both CB and II scale linearly w.r.t. the number of sequences.
    II outperformed CB in all datasets in this experiment.
    II precomputation time: less than 4 secs in all cases.
Cumulative # sequences scanned (plot)
    CB scans the entire dataset once on each iterative query.
    For QA1, II does not need to scan any data sequences because the query can be answered by inverted indices directly.
Experiments on synthetic data

Vary
    Average sequence length (L)
    Data distribution (Skew factor)
    Domain of the events (I)
    P-ROLL-UP operation
    P-DRILL-DOWN operation
    <X,Y,Y,X> pattern templates
    Substring / Subsequence pattern templates
(See technical report)
Conclusion

We propose a new online analytical processing system for sequence data analysis (the S-OLAP system).
The proposed system is motivated by real-life problems.
    Page click analysis
    RFID log analysis
    …etc
We defined basic concepts
    S-Cuboid, S-Cube
Identified two properties of S-Cube
    Infinite number of S-Cuboids
    Non-summarizable
Illustrated the usability of the proposed S-OLAP system through a prototype system that works on real data.
The End
Thank you!
Synthetic dataset generator

Synthetic sequence databases are synthesized in the following manner:
    The generated sequence database has D sequences.
    Each sequence s in a dataset is generated independently.
        The sequence length l, with mean L, is first determined by a random variable following a Poisson distribution.
        Then, we repeatedly add events to the sequence until the target length l is reached.
        The first event symbol is randomly selected according to a pre-determined distribution following Zipf’s law with parameters I and Θ.
            I is the number of possible symbols, and
            Θ is the skew factor.
        Subsequent events are generated one after the other using a Markov chain of degree 1.
            The conditional probabilities are pre-determined and are skewed according to Zipf’s law.
    All the generated sequences form a single sequence group, which serves as the input data to the algorithms.
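
The generation procedure can be sketched in a few lines of standard-library Python. This is a hedged reconstruction rather than the authors' generator: the Zipf weight form 1/rank^Θ, the per-symbol random ranking of successors, the minimum length of 1, and the names `zipf_weights`, `poisson`, and `generate` are all assumptions made for the illustration.

```python
import math
import random

def zipf_weights(n, theta):
    """Assumed Zipf form: the weight of rank r is 1 / r**theta."""
    return [1.0 / (rank ** theta) for rank in range(1, n + 1)]

def poisson(rng, mean):
    """Draw a Poisson-distributed integer (Knuth's multiplication method)."""
    limit = math.exp(-mean)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= limit:
            return k
        k += 1

def generate(D, L, I, theta, seed=0):
    """Generate D sequences with mean length L over I symbols, skew theta."""
    rng = random.Random(seed)
    symbols = list(range(I))
    weights = zipf_weights(I, theta)
    # Assumption: each symbol has its own fixed ranking of successor symbols,
    # so the conditional probabilities are pre-determined and Zipf-skewed.
    successors = {s: rng.sample(symbols, I) for s in symbols}
    db = []
    for _ in range(D):
        length = max(1, poisson(rng, L))   # assumption: no empty sequences
        seq = [rng.choices(symbols, weights)[0]]
        while len(seq) < length:
            # Degree-1 Markov chain: the next event depends on the last one only.
            seq.append(rng.choices(successors[seq[-1]], weights)[0])
        db.append(seq)
    return db

db = generate(D=50, L=8, I=5, theta=0.9, seed=1)
print(len(db), db[0])
```

Fixing the seed makes a generated database reproducible, which helps when comparing two approaches on the same input.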
Related Work

Sequence Databases:
    PREDATOR (Seshadri, Livny, and Ramakrishnan; SIGMOD 94, VLDB 96)
    DEVise (Ramakrishnan et al.; SSDBM 98)
    TS-SQL (Sadri et al.; PODS 01)
OLAP
    Data-cube operator (Gray et al.; 95), iceberg-cube, star-schema, …, etc.
OLAP on unconventional data
    RFID-cube (Gonzalez, Han, and Li; VLDB 06)
    Stream-cube (Chen et al.; VLDB 02)
    XML-cube (Wiwatwattana et al.; ICDE 07)