OLAP on Sequence Data
Published in SIGMOD 2008, Vancouver, Canada.
Authors: Eric Lo, Ben Kao, Wai-Shing Ho, Sau-Dan Lee, Chun Kit Chui and David W. Cheung
Presenter: Chun Kit Chui (Kit), The University of Hong Kong, ckchui@cs.hku.hk

Outline
• Problem Motivation
• Sequence Data Cube and Cuboids
• New OLAP operations
• System architecture
• Experimental evaluations
• Future work

Sequence Data
Many kinds of real-life data exhibit logical ordering among their data items and are thus sequential in nature. Examples:
• Web server access logs
• Stock market data (e.g. U.S. OIL FUND ETF, MEXCO ENERGY CORP)

Web server access logs (web retailer selling sportswear products)
The product dimension is associated with a concept hierarchy in which the finest level of abstraction is product ID, followed by product type, and brand.

Time              Member ID  URL                       Product  Product type     Brand
2008-1-01 00:01   688        /product.html?pid=12800   12800    Nike shoes       Nike
…                 …          …                         …        …                …
2008-1-01 00:02   688        /product.html?pid=13250   13250    Adidas shoes     Adidas
2008-1-01 00:10   14230      /product.html?pid=324     324      Puma shoes       Puma
…                 …          …                         …        …                …
2008-1-01 02:45   688        /product.html?pid=12800   12800    Nike shoes       Nike
…                 …          …                         …        …                …
2008-1-01 03:49   688        /product.html?pid=329     329      Adidas T-shirts  Adidas
…                 …          …                         …        …                …
2008-1-01 03:45   14230      /checkout.xhtml           Nil      Nil              Nil
From the access logs we can trace back the browsing sequences of all members.

Browsing sequence of member 688: < Nike shoes, Adidas shoes, Nike shoes >

Manager: “I would like to know the number of members who did comparison shopping, and their distribution over all product-page to product-page pairs, within 2008 Quarter 1.”
The manager's query refers to a particular kind of pattern in the browsing sequences. The comparison shopping semantics can be expressed by the pattern template < X, Y, X >.

< X, Y, X >                                     # Members
< Nike Shoes, Adidas Shoes, Nike Shoes >        ?
< Nike Shoes, Puma Shoes, Nike Shoes >          5,432
< Nike Shoes, Nike Shoes, Nike Shoes >          13,200
…                                               …
< Adidas Shoes, Nike Shoes, Adidas Shoes >      1,020
< Adidas Shoes, Puma Shoes, Adidas Shoes >      4,331
Pattern template < X, Y, X > and its instantiated patterns:

Instantiated pattern                            # Members
< Nike Shoes, Adidas Shoes, Nike Shoes >        1
< Nike Shoes, Puma Shoes, Nike Shoes >          ?
< Nike Shoes, Nike Shoes, Nike Shoes >          ?
…                                               …
< Adidas Shoes, Nike Shoes, Adidas Shoes >      ?
< Adidas Shoes, Puma Shoes, Adidas Shoes >      ?

< Nike Shoes, Adidas Shoes, Nike Shoes > is one of the instantiations of the pattern template. Since the browsing sequence of member 688 contains the pattern, the sequence contributes a count of 1 to the cell.

The aggregated number of members is counted for every cell, and a tabulated view of the sequence data should be returned.
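The counting described above can be sketched in a few lines of Python. This is a minimal illustration, not the system's implementation: it assumes patterns are matched as contiguous substrings, and the function names `instantiations` and `count_members` as well as the second member's sequence are hypothetical.

```python
from collections import Counter

def instantiations(template, sequence):
    """Return the set of instantiated patterns of `template` that occur
    as a contiguous substring of `sequence` (substring matching is an assumption)."""
    found = set()
    k = len(template)
    for start in range(len(sequence) - k + 1):
        window = sequence[start:start + k]
        binding = {}
        # A window matches when every symbol is bound to a single value.
        # Different symbols may bind to the same value, as in <Nike, Nike, Nike>.
        if all(binding.setdefault(sym, val) == val
               for sym, val in zip(template, window)):
            found.add(tuple(window))
    return found

def count_members(template, sequences):
    """Count, per instantiated pattern, the members whose sequence contains it."""
    counts = Counter()
    for seq in sequences.values():
        # A member contributes at most 1 to each cell.
        counts.update(instantiations(template, seq))
    return counts

sequences = {
    688: ["Nike shoes", "Adidas shoes", "Nike shoes"],  # from the slides
    14230: ["Puma shoes"],                              # hypothetical, too short to match
}
cells = count_members(("X", "Y", "X"), sequences)
for pattern, n in cells.items():
    print(pattern, n)
```

Using a set per member keeps a sequence from being counted twice in the same cell when the pattern occurs more than once.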
Result of the query:

< X, Y, X >                                     # Members
< Nike Shoes, Adidas Shoes, Nike Shoes >        200,000
< Nike Shoes, Puma Shoes, Nike Shoes >          5,432
< Nike Shoes, Nike Shoes, Nike Shoes >          13,200
…                                               …
< Adidas Shoes, Nike Shoes, Adidas Shoes >      1,020
< Adidas Shoes, Puma Shoes, Adidas Shoes >      4,331

Sequence OLAP system
• Support “pattern based” grouping and aggregation.
Follow-up query
Manager: “So many members did comparison shopping between Nike shoes and Adidas shoes. I would like to further investigate whether those members would browse one more product and, if so, which product.”

The new query can be expressed by appending a pattern symbol “Z” to form a new pattern template < X, Y, X, Z >, with X = “Nike Shoes”, Y = “Adidas Shoes”, Z = Any.

< X, Y, X, Z >                                               # Members
< Nike Shoes, Adidas Shoes, Nike Shoes, Nike Shoes >         15,000
< Nike Shoes, Adidas Shoes, Nike Shoes, Adidas T-shirts >    180,000
< Nike Shoes, Adidas Shoes, Nike Shoes, Puma Shoes >         9,000
…                                                            …

The result shows the statistics of one more browsing step after the comparison shopping between Nike Shoes and Adidas Shoes.

Sequence OLAP system
• Support “pattern based” grouping and aggregation.
• Obtain query results in real time (OLAP feature).
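The follow-up query fixes two pattern symbols and leaves the appended one free. A minimal Python sketch of that idea follows; it assumes substring matching, the helper names `matches` and `next_step_counts` are hypothetical, and all member sequences except that of member 688 (read off the access log) are made up for illustration.

```python
from collections import Counter

def matches(template, window, fixed):
    """True if `window` instantiates `template` under the fixed symbol values."""
    binding = dict(fixed)  # start from the sliced symbols, e.g. X and Y
    return all(binding.setdefault(sym, val) == val
               for sym, val in zip(template, window))

def next_step_counts(sequences, fixed, template=("X", "Y", "X", "Z")):
    """Count members per instantiation of `template`, with some symbols fixed."""
    counts = Counter()
    k = len(template)
    for seq in sequences.values():
        seen = set()
        for i in range(len(seq) - k + 1):
            window = tuple(seq[i:i + k])
            if matches(template, window, fixed):
                seen.add(window)
        counts.update(seen)  # each member counts once per cell
    return counts

N, A = "Nike shoes", "Adidas shoes"
sequences = {
    688: [N, A, N, "Adidas T-shirts"],  # member 688 in the access log
    1: [N, A, N, "Puma shoes"],         # hypothetical
    2: [N, A, N, "Adidas T-shirts"],    # hypothetical
    3: [A, N, A, N],                    # hypothetical, X is not Nike shoes
}
result = next_step_counts(sequences, fixed={"X": N, "Y": A})
for pattern, n in result.most_common():
    print(pattern[-1], n)
```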
The manager finds that the Adidas T-shirts page is the most popular next page for the members who did comparison shopping between the Nike shoes and Adidas shoes pages.

Manager: “The comparison shopping patterns displayed at the “product type” abstraction level are too detailed. I would like to view some higher-level statistics.”

Sequence OLAP system
• Support “pattern based” grouping and aggregation.
• Obtain query results in real time (OLAP feature).
• Provide OLAP operations to ease sequence analysis.

A simple “roll up” operation on the pattern template transforms the summary statistics from the “product type” abstraction level to the “brand” abstraction level.

“Brand” abstraction level
< X, Y, X >                  # Members
< Nike, Adidas, Nike >       3,150,000
< Nike, Puma, Nike >         2,180,000
< Nike, Nike, Nike >         19,000,000
…                            …

Concept hierarchy example: the brand Nike covers the product types Nike shoes, Nike T-shirts, Nike Basketballs and Nike socks.

Research Objective
To design and implement an OLAP system that is able to
• support “pattern based” grouping and aggregation;
• obtain query results in real time, especially optimized for interactive/iterative queries;
• provide OLAP operations to ease explorative analysis of sequence data.

[Figure: example Sequence OLAP result tables for the pattern templates < X, Y > and < X, Y, X > at the brand level]

RFID Logs
Radio-frequency identification (RFID) is an automatic identification method that relies on storing and remotely retrieving data using devices called RFID tags.
Smart card systems in public transit, e.g. the Octopus card in Hong Kong and the Orca card in Seattle (2009):
• Electronic money.
• The travel history of passengers is logged in a database.
• Generates a massive amount of sequence data.
Payment can be done easily by waving the smart card over the card reader. Each use of the card is logged in an event database:

Event Database
Time              Card-ID  Location             Action     Amount
2008-7-25 09:01   Kit      Shatin               in         -
2008-7-25 09:25   Kit      Central              out        - $5
…                 …        …                    …          …
2008-7-25 18:23   Kit      Central Machine #10  Add value  + $100
2008-7-25 18:25   Kit      Central              in         -
…                 …        …                    …          …
2008-7-25 18:49   Kit      Shatin               out        - $5
…                 …        …                    …          …

Marketing Manager: “The number of round-trip passengers and their distributions over all origin-destination station pairs within 2008 Quarter 4.”

Result: round trip statistics (station level)
< X, Y, Y, X >                                   # Users
< Shatin, Central, Central, Shatin >             2,032
< Shatin, Admiralty, Admiralty, Shatin >         1,982
…                                                …
< Admiralty, Central, Central, Admiralty >       22,822
< Admiralty, Kowloon, Kowloon, Admiralty >       10,020

Sequence OLAP system
• Support “pattern based” grouping and aggregation.
• Obtain query results in real time.
• Provide OLAP operations to ease explorative analysis.

Sequence Data Cuboid
A logical view of sequence data at a particular degree of summarization.

Preliminary: Sequence Cuboid (S-Cuboid)
An S-Cuboid is a logical view of sequence data at a particular degree of summarization. Sequences can be characterized by
• the attribute values of the events in the sequence (e.g. time, spending, product type);
• the subsequence/substring patterns they possess (e.g. <X,Y,X>, <X,Y,Y,X>).

An S-Cuboid for the round-trip query:
< X, Y, Y, X >                                   # Users
< Shatin, Central, Central, Shatin >             2
< Kowloon, Admiralty, Admiralty, Kowloon >       9
…                                                …

Phase 1. Sequence Formation
Event selection: an event selection step selects a set of relevant records and attributes from the event database.

Event Database
Time              Card-ID  Location             Action     Amount
2008-6-09 00:01   Kit      Shatin               in         0
2008-6-09 02:25   Kit      Central              out        -5
…                 …        …                    …          …
2008-6-14 02:23   Kit      Central Machine #10  Add value  +100
2008-6-14 02:25   Kit      Central              in         0
2008-6-14 18:49   Kit      Shatin               out        -5
…                 …        …                    …          …

Selected events
Time              Card-ID  Location  Action  Amount
2008-6-09 00:01   Kit      Shatin    in      0
2008-6-09 02:25   Kit      Central   out     -5
…                 …        …         …       …
2008-6-14 02:25   Kit      Central   in      0
2008-6-14 18:49   Kit      Shatin    out     -5
…                 …        …         …       …
A sequence formation step forms sequences from the selected events.

Sequence formation with User : Individual, Time : Day
Seq ID   Sequence of events
S1       < e1, e2, e102, e180 >                               (Kit's trip on Monday)
S2       < e3, e7, e8, e12, e19, e232, e234, e235 >
S3       < e4, e5, e9, e13, e14, e290, e292, e352 >
…        …

Sequences can be formed per day and for each individual user. This gives a number of daily travel sequences for each user, e.g. S1 is Kit's trip on Monday.

Sequences can also be formed according to the time dimension at the abstraction level of year, per individual user.
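Sequence formation as described above amounts to grouping events by user and time unit and ordering each group by timestamp. The sketch below illustrates this under assumptions: the function name `form_sequences`, the tuple layout of the events and the `granularity` parameter are hypothetical, and only the four sample events from the slides are used.

```python
from collections import defaultdict
from datetime import datetime

# Selected events: (time, card id, location, action, amount), as in the slides.
events = [
    ("2008-6-09 00:01", "Kit", "Shatin", "in", 0),
    ("2008-6-09 02:25", "Kit", "Central", "out", -5),
    ("2008-6-14 02:25", "Kit", "Central", "in", 0),
    ("2008-6-14 18:49", "Kit", "Shatin", "out", -5),
]

def form_sequences(events, granularity="day"):
    """Group events per (user, time unit) and order each group by time."""
    fmt = {"day": "%Y-%m-%d", "year": "%Y"}[granularity]
    groups = defaultdict(list)
    for time_str, card_id, location, action, _amount in events:
        ts = datetime.strptime(time_str, "%Y-%m-%d %H:%M")
        groups[(card_id, ts.strftime(fmt))].append((ts, location, action))
    # Sorting the tuples orders the events of each sequence by timestamp.
    return {key: [(loc, act) for _, loc, act in sorted(evts)]
            for key, evts in groups.items()}

daily = form_sequences(events, "day")
yearly = form_sequences(events, "year")
print(daily[("Kit", "2008-06-09")])
print(len(yearly[("Kit", "2008")]))
```

Switching `granularity` from "day" to "year" changes only the grouping key, which mirrors how the slides form daily and yearly sequences from the same events.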
Sequence formation with User : Individual, Time : Year
Seq ID   Sequence of events
S1       < e1, e2, e102, e180, e1002, e1800, e1801, … >       (Kit's trips in 2008)
S2       < e3, e7, e8, e12, e19, e232, e234, e235, e2134, e2135 >
S3       < e4, e5, e9, e13, e14, e290, e292, e352, e3252, … >
…        …

Phase 2. S-Cuboid construction
Sequence grouping: a sequence grouping step groups the sequences that share the same dimension values into a sequence group. E.g. instead of User : individual (Kit, Ben, Shing, …) and Time : day, travel sequences can be grouped according to their fare groups (User : fare-group, e.g. the Regular group) and Time : day.

[Figure: grid of sequences (S1, S2, S3, S4, …) arranged by the dimensions User and Time : day]
Pattern grouping: the pattern grouping step further groups the sequences of each sequence group according to the “patterns” they possess. With the pattern template <X,Y,Y,X>, where X and Y are locations at the station level, each cell represents an instantiated pattern, e.g. <Shatin, Central, Central, Shatin>. A sequence is assigned to a cell if it contains the cell's instantiated pattern.

Example: S1 = < e1, e2, e102, e180 >
Event   Time              Card-ID  Location  Action  Amount
e1      2008-6-09 00:01   Kit      Shatin    in      0
e2      2008-6-09 02:25   Kit      Central   out     -5
…       …                 …        …         …       …
e102    2008-6-09 22:25   Kit      Central   in      0
…       …                 …        …         …       …
e180    2008-6-09 23:49   Kit      Shatin    out     -5
…       …                 …        …         …       …

The locations of S1 are Shatin, Central, Central, Shatin, so S1 is assigned to the cell X = Shatin, Y = Central. S3 falls into the same cell.

Aggregation: finally, an aggregation function is applied to the sequences in each cuboid cell. E.g. the cell X = Shatin, Y = Central holds S1 and S3, so Count: 2.
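The sequence grouping, pattern grouping and aggregation steps can be combined in one small sketch. It is only an illustration under assumptions: patterns are matched as substrings, COUNT is the aggregate, the name `build_s_cuboid` is hypothetical, and the fare groups, days and station lists below are made-up inputs chosen so that two sequences land in the Shatin/Central cell as on the slide.

```python
from collections import Counter

def build_s_cuboid(sequences, template):
    """sequences: iterable of (fare_group, day, [station, ...]).
    Returns COUNT per cell (fare_group, day, value of X, value of Y, ...)."""
    cuboid = Counter()
    k = len(template)
    symbols = list(dict.fromkeys(template))  # distinct symbols, first-seen order
    for fare_group, day, stations in sequences:
        cells = set()
        for i in range(len(stations) - k + 1):
            window = stations[i:i + k]
            binding = {}
            if all(binding.setdefault(s, v) == v for s, v in zip(template, window)):
                # Global dimensions followed by the pattern dimensions.
                cells.add((fare_group, day) + tuple(binding[s] for s in symbols))
        cuboid.update(cells)  # a sequence contributes 1 count per matching cell
    return cuboid

sequences = [
    ("Regular", "2008-06-09", ["Shatin", "Central", "Central", "Shatin"]),  # like S1
    ("Regular", "2008-06-09", ["Shatin", "Central", "Central", "Shatin"]),  # like S3
    ("Student", "2008-06-09", ["Kowloon", "Admiralty"]),                    # no round trip
]
cuboid = build_s_cuboid(sequences, ("X", "Y", "Y", "X"))
for cell, count in cuboid.items():
    print(cell, count)
```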
4D S-Cuboid
The resulting S-Cuboid has four dimensions:
• Global dimensions: User (fare-group) and Time (day).
• Pattern dimensions: X and Y (Location : station) of the pattern template <X,Y,Y,X>.
Each cell holds an aggregated value, e.g. Count: 2 for the cell X = Shatin, Y = Central, which contains S1 and S3.

< X, Y, Y, X >                              # Users
< Shatin, Central, Central, Shatin >        2
< Shatin, Kowloon, Kowloon, Shatin >        9
…                                           …

Sequence Cuboid query language
A query in this language specifies the construction of the S-Cuboid that answers the round-trip query in the running example (“The number of round-trip passengers and their distributions over all origin-destination station pairs within 2007 Quarter 4.”). It consists of the following parts:
• Sequence formation: form individual daily travel sequences.
• Sequence grouping: specify the global dimensions, i.e. group the sequences with the same fare-group within the same day.
• Pattern grouping: group the sequences according to the pattern template <X,Y,Y,X>, where X and Y refer to the location dimension at the station abstraction level.

What exactly is a round-trip pattern? Predicates on the matched events further increase the expressive power of pattern matching in the query language.

The query thus specifies the global dimensions, the pattern template and the pattern dimensions. Any change to the cuboid specification transforms the S-Cuboid into another one, e.g. changing the pattern template to (X,Y,Y,X,Z) generates another S-Cuboid.

The cell restriction defines how to deal with a data sequence that contains multiple occurrences of a cell's pattern, e.g. a sequence contributes 1 count whenever at least one match of the pattern can be found in it. For example:
Kit: < Shatin, Central, Central, Shatin, Shatin, Central, Central, Shatin >
This sequence contains the pattern < Shatin, Central, Central, Shatin > twice. Under the restriction that a sequence contributes 1 count whenever at least one match is found, it adds only 1 to the cell.

Properties of S-Cuboids
• Exponential number of S-Cuboids: the length of the pattern template is unbounded, e.g. (X,Y,Y,X,A,B,…). Recall that changing the pattern template essentially changes the cuboid specification and thus generates a new cuboid.
• Non-summarizable.

In traditional OLAP systems, data are summarizable, i.e. summaries at a finer abstraction level can be used to construct the summary at a higher abstraction level. E.g. daily sales of 1 on each of the 7 days add up to 7 sales for the whole week.

Sequence OLAP: if the cells < A, B, A > and < A, B, B > each contain 1 sequence, how many sequences does the cell < A, B > contain?
Can we compute the S-Cuboid with pattern <X,Y> (coarser summary) from the S-Cuboid with pattern <X,Y,Z> (finer summary) without looking at the sequence database?
Consider two sequence databases that give the same counts in the finer cells shown below but different counts in the coarser cell.

Sequence database 1
Seq ID   Sequence of events
Kit      < Kowloon, Central, Kowloon, Central >
Ben      < Kowloon, Central, Central, Kowloon >

S-Cuboid (finer aggregates)                 S-Cuboid (coarser aggregates)
< X, Y, Z >                      Count      < X, Y >                Count
< Kowloon, Central, Kowloon >    1          < Kowloon, Central >    2
< Kowloon, Central, Central >    1

Sequence database 2
Seq ID   Sequence of events
Kit      < Kowloon, Central, Kowloon, Central, Central >
Ben      < Kowloon, Admiralty >

S-Cuboid (finer aggregates)                 S-Cuboid (coarser aggregates)
< X, Y, Z >                      Count      < X, Y >                Count
< Kowloon, Central, Kowloon >    1          < Kowloon, Central >    1
< Kowloon, Central, Central >    1

The problem is that we do not know whether the counts in these two finer patterns are generated from the same sequence or from two different sequences.
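The two example databases can be checked mechanically. The sketch below recomputes the cell counts from the sequences on the slides; it assumes substring matching with one count per sequence, and the helper name `count_cell` is hypothetical.

```python
def count_cell(db, pattern):
    """Number of sequences in `db` that contain `pattern` as a contiguous substring."""
    k = len(pattern)
    return sum(
        any(tuple(seq[i:i + k]) == pattern for i in range(len(seq) - k + 1))
        for seq in db.values()
    )

K, C, A = "Kowloon", "Central", "Admiralty"
db1 = {"Kit": [K, C, K, C], "Ben": [K, C, C, K]}
db2 = {"Kit": [K, C, K, C, C], "Ben": [K, A]}

for name, db in (("database 1", db1), ("database 2", db2)):
    finer = (count_cell(db, (K, C, K)), count_cell(db, (K, C, C)))
    coarser = count_cell(db, (K, C))
    print(name, "finer:", finer, "coarser:", coarser)
```

Both databases yield the finer counts (1, 1), yet the coarser count is 2 for the first and 1 for the second, so no function of the finer counts alone can return the right answer in both cases.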
Traditional OLAP is summarizable, whereas sequence OLAP is non-summarizable: the count of < A, B > cannot be derived from the counts of < A, B, A > and < A, B, B > alone.
S-Cuboid (Finer aggregates) < X, Y, Z > Count < Kowloon, Central, Kowloon > 1 < X, Y > Count < Kowloon, Central, Central > 1 < Kowloon, Central> 2 Sequence of events Infinite number of S-cuboids Kit < Kowloon, Central, Kowloon, Central > S-Cuboid (Coarser aggregates) Ben < Kowloon, Central, Central, Kowloon > The number of S-Cuboid pattern dimensions is infinite S-Cuboid (Finer aggregates) Sequence Database Seq ID < X, Y, Z > Count Pattern Template (X,Y,Y,X,A,B,…) < Kowloon, Central, Kowloon > 1 Sequence of events Kit < Kowloon, Central, Kowloon, Central, Central > < Kowloon, Central, Central > 1 (Coarser aggregates) < X, Y > Count < Kowloon, Central> 1 Ben < Kowloon, Admiralty > Non-summarizable The problem is that we don’t know if the counts in these two patterns are generated from the same sequence, or two different sequences. Coarser aggregates cannot be computed solely from the corresponding finer aggregates. “OLAP on Sequence Data” , Presenter : Chun Kit Chui (Kit) Properties of S-Cuboids Exponential number of S-cuboids The length of the pattern template is infinite Pattern Template (X,Y,Y,X,A,B,…) Full materialization is impossible! Non-summarizable Coarser aggregates cannot be computed solely from the corresponding finer aggregates. Partial materialization is infeasible! “OLAP on Sequence Data” , Presenter : Chun Kit Chui (Kit) Properties of S-Cuboids Research direction Precompute some other auxiliary data structures so that queries can be computed online using the pre-built data structures “OLAP on Sequence Data” , Presenter : Chun Kit Chui (Kit) S-OLAP Specific Operations Assist explorative analysis of the sequence data “OLAP on Sequence Data” , Presenter : Chun Kit Chui (Kit) S-OLAP specific operations Navigate between cuboids with ease Traditional OLAP operations for Global Dimensions SLICE, DICE, ROLL-UP, DRILL-DOWN, etc. 
  New S-OLAP operations for Pattern Dimensions / Pattern Template:
    APPEND(X):              (X,Y,Y)   -> (X,Y,Y,X)
    DE-TAIL:                (X,Y,Y,X) -> (X,Y,Y)
    PREPEND(Z):             (X,Y,Y,X) -> (Z,X,Y,Y,X)
    DE-HEAD:                (Q,Y,Y,X) -> (Y,Y,X)
    PATTERN-ROLL-UP(X):     (X,Y,Y,X) -> (X,Y,Y,X), with X at a coarser abstraction level
    PATTERN-DRILL-DOWN(X):  (X,Y,Y,X) -> (x,Y,Y,x), with X at a finer abstraction level

Sequence OLAP: pattern template < X, Y >
  Request: "Tell me the summary statistics of the single-trip travel patterns of passengers among different Rail Lines, please."

  CUBOID by SUBSTRING(X,Y)
    WITH X as location at "Rail Lines",
         Y as location at "Rail Lines"
    LEFT-MAXIMALITY (x1, y1)
    WITH x1.action = "in" AND y1.action = "out"

S-Cuboid 1 (10 * 10 cells)
  < X, Y >, X and Y at Line level      # Passenger
  < Tsuen Wan Line, Island Line >      120,000
  < Island Line, Tsuen Wan Line >      8,000
  …                                    …

  Request: "More detailed statistics of passengers traveling from the Tsuen Wan Line to each of the Island Line stations, please."

SLICE, P-DRILL-DOWN
S-Cuboid 2 (1 * 14 cells)
  < X, Y >, X at Line level, Y at Station level; sliced on X = "Tsuen Wan Line", Y = "Island Line"
                                       # Passenger
  < Tsuen Wan Line, Central >          100,000
  < Tsuen Wan Line, Admiralty >        8,300
  < Tsuen Wan Line, Wan Chai >         4,030
  < Tsuen Wan Line, Causeway Bay >     12,430
  …                                    …

Instead of specifying a new S-Cuboid construction query, the user performs a SLICE followed by a P-DRILL-DOWN(Y).
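The SLICE plus P-DRILL-DOWN step above can be illustrated with a small sketch. It is not the system's implementation: the station-to-line mapping, the toy trips, and the helper name `cuboid` are all hypothetical, the "in"/"out" action predicates and LEFT-MAXIMALITY are ignored, and only templates with distinct symbols such as (X,Y) are handled.

```python
from collections import Counter

# Hypothetical concept hierarchy: station -> rail line (toy data, not from the talk).
LINE_OF = {
    "Tsuen Wan": "Tsuen Wan Line",
    "Mong Kok": "Tsuen Wan Line",
    "Central": "Island Line",
    "Admiralty": "Island Line",
    "Wan Chai": "Island Line",
}

def cuboid(sequences, levels, slice_on=None):
    """Count sequences per cell for a substring pattern template.

    levels:   one abstraction function per pattern position.
    slice_on: optional dict {position: required line-level value}.
    """
    k = len(levels)
    counts = Counter()
    for seq in sequences:
        cells = set()
        for i in range(len(seq) - k + 1):
            window = seq[i:i + k]
            if slice_on and any(LINE_OF[window[p]] != v for p, v in slice_on.items()):
                continue
            cells.add(tuple(f(e) for f, e in zip(levels, window)))
        counts.update(cells)  # a sequence adds at most 1 to each cell
    return counts

def at_line(station):
    return LINE_OF[station]

def at_station(station):
    return station

# Toy trips: each sequence is the list of stations in one passenger's trip.
trips = [
    ["Tsuen Wan", "Central"],
    ["Mong Kok", "Admiralty"],
    ["Mong Kok", "Central"],
    ["Central", "Tsuen Wan"],
]

# S-Cuboid 1: X and Y both at line level.
cuboid1 = cuboid(trips, [at_line, at_line])

# S-Cuboid 2: SLICE on (Tsuen Wan Line, Island Line), then drill Y down to stations.
cuboid2 = cuboid(trips, [at_line, at_station],
                 slice_on={0: "Tsuen Wan Line", 1: "Island Line"})
```

In this sketch the drill-down is simply a change of the abstraction function at one pattern position, which is why the user does not need to restate the whole cuboid query.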
APPEND(Y)
S-Cuboid 3 (1 * 14 * 14 cells)
  < X, Y, Y >, X at Line level, Y at Station level; sliced on X = "Tsuen Wan Line", Y = "Island Line"
                                              # Passenger
  < Tsuen Wan Line, Central, Central >        90,000
  < Tsuen Wan Line, Admiralty, Admiralty >    8,300
  < Tsuen Wan Line, Wan Chai, Wan Chai >      4,030
  < Tsuen Wan Line, Admiralty, Admiralty >    2,430
  …                                           …

DE-TAIL
  Removes the last pattern symbol and returns from S-Cuboid 3 to S-Cuboid 2.

The S-OLAP operations not only assist the exploratory analysis of the sequence data, they also hide all the technical details of specifying the S-Cuboid query from the business users.
System Architecture
  [Architecture diagram: User Interface, Sequence OLAP Engine, Cuboid Repository, Auxiliary Data Structures, Sequence Query Engine, Sequence Cache, Event Dataset; queries flow in from the User Interface and results flow back.]

  Event Dataset
    The raw data of an S-OLAP system is a set of events that are deposited in an Event Dataset.
  Sequence Query Engine, Sequence Cache
    The job of the Sequence Query Engine is to compose sets of event sequences out of the event dataset (Phase 1 in S-Cuboid construction).
  User Interface
    The User Interface provides certain user-friendly components to help a user specify an S-cuboid.
  Sequence OLAP Engine, Cuboid Repository
    Given an S-Cuboid query, the S-OLAP Engine consults a Cuboid Repository to see if such an S-cuboid has been previously computed and stored.
  Auxiliary Data Structures
    The S-OLAP Engine computes the S-cuboid with the help of certain Auxiliary Data Structures.
Auxiliary Data Structures
  Counter-based approach
  Inverted-index approach

Counter-Based approach
  Each cell in an S-cuboid is associated with a counter.
  To determine the counters' values, the entire set of sequences is scanned.
  For each sequence s, we determine the cells whose associated patterns are contained in s and increment each such counter by 1.
  Basic and simple, but processing iterative queries requires counting from scratch.

S-OLAP query evaluation: Inverted-Index approach
  Based on the fragment cube concept (X. Li, J. Han, and H. Gonzalez; VLDB 2004).
  A set of inverted indices is created by preprocessing the data offline (Algorithm BuildIndex, see paper).
  During query processing, the relevant inverted indices are joined in real time, based on the matching pattern.
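The two auxiliary approaches described so far can be contrasted with a minimal sketch. It is an illustration under simplifying assumptions, not the paper's BuildIndex or QueryIndices algorithms: patterns are plain substrings at a single abstraction level, the index is built for two-symbol patterns only, and the function names are hypothetical. The data is the Kit and Ben example from the non-summarizability slide.

```python
from collections import defaultdict

def contains(seq, pattern):
    """True if pattern occurs in seq as a contiguous substring."""
    k = len(pattern)
    return any(tuple(seq[i:i + k]) == pattern for i in range(len(seq) - k + 1))

def counter_based(sequences, k):
    """Scan every sequence; bump each matching cell's counter once per sequence."""
    counters = defaultdict(int)
    for seq in sequences:
        cells = {tuple(seq[i:i + k]) for i in range(len(seq) - k + 1)}
        for cell in cells:
            counters[cell] += 1
    return dict(counters)

def build_index(sequences):
    """Inverted index: two-symbol pattern -> ids of the sequences containing it."""
    index = defaultdict(set)
    for sid, seq in enumerate(sequences):
        for i in range(len(seq) - 1):
            index[(seq[i], seq[i + 1])].add(sid)
    return dict(index)

def join_indices(index, sequences):
    """Join (a,b) with (b,c) lists, then verify (a,b,c) only on candidate sequences."""
    longer = {}
    for (a, b), left in index.items():
        for (b2, c), right in index.items():
            if b != b2:
                continue
            pattern = (a, b, c)
            hits = {sid for sid in left & right if contains(sequences[sid], pattern)}
            if hits:
                longer[pattern] = hits
    return longer

sequences = [
    ["Kowloon", "Central", "Kowloon", "Central", "Central"],  # Kit
    ["Kowloon", "Admiralty"],                                  # Ben
]
pairs = build_index(sequences)
triples = join_indices(pairs, sequences)
```

Because the index keeps sequence ids rather than bare counts, the cell < Kowloon, Central > is correctly counted once even though Kit's sequence feeds two finer cells, and the join touches only the sequences listed for both input patterns.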
  The join is carried out by Algorithm QueryIndices (see paper).
  A by-product of answering a query is the creation of new inverted indices.
  Newly built indices are useful for processing iterative S-OLAP operations (see paper for algorithms).

Experiments
  A prototype S-OLAP system was implemented in C++.
  Real data
    Passenger traveling history.
    KDD Cup 2000: clickstream data from a web retailer selling legwear and legcare products; 50,524 sequences.
  KDD Cup 2000 Question 1: look for page-click patterns.
    We answer this question in an exploratory way via three iterative queries.

Experiments
  Qa: Look for the statistics of all 2-step navigations at the "page category" level.
  The corresponding pattern template to capture the 2-step navigation semantics is <X,Y>.

  Cuboid Qa (44*44 cells)
    < X, Y >, X and Y at "page category" level    # User sessions
    < Main page, Product Catalog >                 6,524
    …                                              …
    < Product Catalog, Legwear Product >           2,201
    …                                              …
    < Main page, Promotion ad >                    852
    …                                              …
    < Product Catalog, Legcare Product >           150

  Comparatively speaking, very few visitors browse from a product catalog page to a Legcare product page.

  Qb: Many visitors browse from the product catalog to a legwear product page. What exactly are the products they browse?
  Cuboid Qb is obtained from Cuboid Qa by (1) SLICE and (2) P-DRILL-DOWN.

  Cuboid Qb (1*279 cells)
    < X, Y > (sliced), X at "page category" level; Y at "page" level    # User sessions
    < Product Catalog, Null >                                            181
    < Product Catalog, PID - 34839 >                                     172
    < Product Catalog, PID - 34897 >                                     163
    …                                                                    …

  The most popular product that visitors browse from the catalog page is product 34839 (a DKNY skin legwear collection product).

  Qc: APPEND(Z)

  Cuboid Qc (1*279*279 cells)
    < X, Y, Z > (sliced), X at "page category" level; Y, Z at "page" level    # User sessions
    …                                                                          …
    < Product Catalog, PID - 34839, PID - 34839 >                              17
    < Product Catalog, PID - 34839, PID - 34897 >                              14
    …                                                                          …

  The runtime of the Inverted-Index approach (II) is higher than that of the Counter-Based approach (CB) in Qa because the index precomputation time is included in Qa.
  For the iterative queries, II has the advantage of processing only the sequences that possess the pattern < Product Catalog, Legwear Product >.

Experiments on synthetic data
  Study the scalability of the Counter-Based approach (CB) and the Inverted-Index approach (II) under a series of APPEND operations:
    QA1: SUBSTRING(X,Y)
    QA2: SLICE + APPEND -> (X,Y,Z)
    QA3: SLICE + APPEND -> (X,Y,Z,A)
    QA4: SLICE + APPEND -> (X,Y,Z,A,B)
    QA5: SLICE + APPEND -> (X,Y,Z,A,B,C)

Experiments on synthetic data
  Cumulative runtime
    Both CB and II scale linearly w.r.t. the number of sequences.
    II outperformed CB on all datasets in this experiment.
    II precomputation time: less than 4 seconds in all cases.
  Cumulative number of sequences scanned
    CB scans the entire dataset once for each iterative query.
    For QA1, II does not need to scan any data sequences because the query can be answered by the inverted indices directly.

Experiments on synthetic data
  Further experiments (see technical report) vary:
    Average sequence length (L)
    Data distribution (skew factor)
    Domain of the events (I)
    P-ROLL-UP operation
    P-DRILL-DOWN operation
    <X,Y,Y,X> pattern templates
    Substring / Subsequence pattern templates

Conclusion
  We propose a new online analytical processing system for sequence data analysis (the S-OLAP system).
  The proposed system is motivated by real-life problems:
    Page click analysis
    RFID log analysis
    etc.
  We defined the basic concepts S-Cuboid and S-Cube.
  We identified two properties of the S-Cube:
    Infinite number of S-Cuboids
    Non-summarizable
  We illustrated the usability of the proposed S-OLAP system through a prototype system that works on real data.

The End
  Thank you!

Synthetic dataset generator
  Synthetic sequence databases are synthesized in the following manner:
    The generated sequence database has D sequences.
    Each sequence s in a dataset is generated independently.
    The sequence length l, with mean L, is first determined by a random variable following a Poisson distribution.
    Then, we repeatedly add events to the sequence until the target length l is reached.
    The first event symbol is randomly selected according to a pre-determined distribution following Zipf's law with parameters I and Θ, where I is the number of possible symbols and Θ is the skew factor.
    Subsequent events are generated one after the other using a Markov chain of degree 1.
    The conditional probabilities are pre-determined and are skewed according to Zipf's law.
    All the generated sequences form a single sequence group, which serves as the input data to the algorithms.
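The generation procedure listed above can be sketched as follows. This is an assumption-laden illustration rather than the authors' generator: the way the Zipf-skewed conditional probabilities are laid out (the most likely successor of symbol p is p + 1, modulo I) is a hypothetical choice, symbols are represented as integers 0..I-1, and the Poisson sampler uses Knuth's multiplication method, which is adequate for small means.

```python
import math
import random

def zipf_weights(n, theta):
    """Unnormalized Zipf weights for ranks 1..n with skew factor theta."""
    return [1.0 / rank ** theta for rank in range(1, n + 1)]

def poisson(rng, mean):
    """Draw a Poisson-distributed integer (Knuth's method)."""
    limit = math.exp(-mean)
    k, p = 0, rng.random()
    while p > limit:
        k += 1
        p *= rng.random()
    return k

def generate(D, L, I, theta, seed=0):
    """Return D sequences over symbols 0..I-1 with mean length L."""
    rng = random.Random(seed)
    weights = zipf_weights(I, theta)
    symbols = range(I)
    database = []
    for _ in range(D):
        length = poisson(rng, L)  # target length l for this sequence
        seq = []
        while len(seq) < length:
            rank = rng.choices(symbols, weights=weights)[0]
            if not seq:
                # First event: drawn directly from the Zipf distribution.
                seq.append(rank)
            else:
                # Degree-1 Markov step: Zipf-skewed choice relative to the
                # previous symbol (hypothetical layout of the transition table).
                seq.append((seq[-1] + 1 + rank) % I)
        database.append(seq)
    return database

db = generate(D=100, L=5, I=10, theta=0.9, seed=1)
```

Passing an explicit seed keeps runs reproducible, which is convenient when comparing the two approaches on the same synthetic database.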
Related Work
  Sequence databases
    PREDATOR (Seshadri, Livny, and Ramakrishnan; SIGMOD 94, VLDB 96)
    DEVise (Ramakrishnan et al.; SSDBM 98)
    TS-SQL (Sadri et al.; PODS 01)
  OLAP
    Data-cube operator (Gray et al.; 95), iceberg-cube, star-schema, etc.
  OLAP on unconventional data
    RFID-cube (Gonzalez, Han, and Li; VLDB 06)
    Stream-cube (Chen et al.; VLDB 02)
    XML-cube (Wiwatwattana et al.; ICDE 07)