Advanced Querying - Information Systems

Advanced Querying
OLAP
Data Warehousing
Database Applications
• Transaction processing
– Online setting
– Supports day-to-day operation of business
• Decision support
– Offline setting
– Strategic planning (statistics)
Transaction Processing
• Operational setting
• Up-to-date = critical
• Simple data
• Simple queries
Example: flight reservations
• ticket sales
• do not sell a seat twice
• reservation, date, name
• Give flight details of X; list flights to Y
Transaction Processing
• Database must support
– simple data
• tables
– simple queries
• select from where …
– consistency & integrity CRITICAL
– concurrency
• Relational databases, Object-Oriented,
Object-Relational
Decision Support
• Off-line setting
• « Historical » data
• Summarized data
• Different databases
• Statistical queries
Example: flight company
• Evaluate ROI flights
• Flights of last year
• # passengers on line L
• Passengers, fuel costs, maintenance info
• Average % of seats sold/month/destination
Data Warehouse
A decision support DB that is maintained separately from the
organization’s operational databases.
Why Separate Data Warehouse?
• High performance for both systems
– DBMS— tuned for OLTP
• access methods, indexing, concurrency control, recovery
– Warehouse—tuned for OLAP
• complex OLAP queries, multidimensional view, consolidation.
• Different functions and different data
– Missing data: Decision support requires historical data which
operational DBs do not typically maintain
– Data consolidation: DS requires consolidation (aggregation,
summarization) of data from heterogeneous sources
– Data quality: different sources typically use inconsistent data
representations, codes and formats which have to be reconciled
Three-Tier Architecture
[Figure: three-tier data warehouse architecture]
• Data Sources: operational DBs and other sources; Extract, Transform, Load, Refresh through a Monitor & Integrator
• Data Storage: Data Warehouse, Data Marts, Metadata; serves the OLAP server
• OLAP Engine: OLAP Server (e.g. ROLAP Server)
• Front-End Tools: Analysis, Query/Reporting, Data Mining
OLAP
• OLAP = OnLine Analytical Processing
– Online = no waiting for answers
• OLAP system = system that supports
analytical queries that are dimensional
in nature.
This Lecture
• Examples of decision support queries
• Data Cubes
– Conceptual data model
– Typical operations
• Implementation
– ROLAP vs MOLAP
– Indexing structures
• SQL:1999 support for OLAP
Examples of Queries
• Flight company: evaluate ticket sales
– give total, average, minimal, maximal amount
– per date: week, month, year
– by destination/source port/country/continent
– by ticket type
– by # of connections
–…
Characteristics
• One special attribute: amount → measure
• Other attributes: select relevant regions → dimensions
• Different levels of generality (month, year, …) → hierarchies
• Measure data is summarized: sum, min, max, average → aggregations
Supermarket example
• Evaluate the sales of products
– Product cost in $ [measure]
– Customer: ID, city, state, country, … [dimension, with hierarchy]
– Store: chain, size, location, … [dimension, with hierarchy]
– Product: brand, type, … [dimension, with hierarchy]
–…
• What are the measure and dimensional attributes, where are the hierarchies?
Why dimensions?
• Multidimensional view on the data
[Figure: cube with axes store, customer and product; each cell holds the cost in $]
Cross Tabulation
• Cross-tabulations are highly useful
– Sales of clothes, June to August '06
Rows: Date (month, June to August 2006); columns: Product (color)

          Blue   Red   Orange   Total
June        51    25      158     234
July        58    20      120     198
August      65    22       51     138
Total      174    67      329     570
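As a minimal sketch, the totals of such a cross-tabulation can be computed from the inner cells with plain Python. The `sales` dictionary restates the inner cells of the clothes example; the function name `cross_tab_totals` is an illustrative assumption, not part of the lecture.

```python
# Inner cells of the clothes cross-tab: (month, color) -> number of items sold.
sales = {
    ("June", "Blue"): 51, ("June", "Red"): 25, ("June", "Orange"): 158,
    ("July", "Blue"): 58, ("July", "Red"): 20, ("July", "Orange"): 120,
    ("August", "Blue"): 65, ("August", "Red"): 22, ("August", "Orange"): 51,
}

def cross_tab_totals(cells):
    """Return (row totals, column totals, grand total) of a 2-D cross-tab."""
    row_totals, col_totals = {}, {}
    for (row, col), value in cells.items():
        row_totals[row] = row_totals.get(row, 0) + value
        col_totals[col] = col_totals.get(col, 0) + value
    return row_totals, col_totals, sum(cells.values())

rows, cols, grand = cross_tab_totals(sales)
print(rows)   # totals per month
print(cols)   # totals per color
print(grand)  # grand total
```

The row totals, column totals and grand total correspond to the three kinds of aggregated cells that a data cube generalizes to more dimensions.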
Data cubes
• Extension of Cross-Tables to multiple dimensions
• Conceptual notion
[Figure: the clothes cross-tab (June to August, Blue/Red/Orange) with annotations: the row and column headers are the dimensions; the inner cells are the data points (1st level of aggregation); the Total row and Total column are aggregated w.r.t. one dimension (X or Y); the corner cell (570) is aggregated w.r.t. X and Y]
Data Cubes
[Figure: 3-D data cube with dimensions Date (1Qtr, 2Qtr, 3Qtr, 4Qtr, sum), Product (TV, PC, VCR, sum) and Country (Ireland, France, Germany, sum)]
Data Cubes
• Base cuboid = the n-dimensional cuboid over all n dimensions (no aggregation)
• The topmost 0-D cuboid, which holds the highest level of summarization, is called the apex cuboid
• The lattice of cuboids forms a data cube
Lattice of Cuboids
• 0-D (apex): all
• 1-D: product; date; country
• 2-D: product, date; product, country; date, country
• 3-D (base): product, date, country
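A small sketch of how this lattice can be enumerated: every cuboid is one subset of the dimensions, so n dimensions (without hierarchies) give 2^n cuboids. The function name `cuboids` is an assumption made for illustration.

```python
from itertools import combinations

def cuboids(dimensions):
    """List all cuboids (subsets of the dimensions), from apex to base."""
    result = []
    for k in range(len(dimensions) + 1):
        result.extend(combinations(dimensions, k))
    return result

lattice = cuboids(["product", "date", "country"])
for group_by in lattice:
    # The empty subset is the apex cuboid, shown as "all" in the lattice.
    print(group_by if group_by else "all")
```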
Operations with Data Cubes
Scenario:
• Before starting the analysis task:
– what data?
• select a few relevant dimensions
• define hierarchy
• aggregation functions of interest
– Pre-materialize
• load data
• compute counts/max, min, avg, … beforehand
Operations with Data Cubes
• What operations can you think of that an
analyst might find useful? (e.g., store)
– only look at stores in the Netherlands
– look at cities instead of individual stores
– look at the cross-table for product-date
– restrict analysis to 2006, product O1
– go back to a finer granularity at the store level
Roll-Up
• Move in one dimension from a lower
granularity to a higher one
– store → city
– cities → country
– product → product type
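A minimal sketch of a roll-up along the store hierarchy, assuming the measure is aggregated with a sum. The store names, the `store_to_city` mapping and the sales figures are hypothetical.

```python
from collections import defaultdict

# Hypothetical store -> city mapping (one level of the store hierarchy).
store_to_city = {"s1": "Eindhoven", "s2": "Eindhoven", "s3": "Amsterdam"}

# Hypothetical sales per (store, product).
sales_by_store = {("s1", "p1"): 10, ("s2", "p1"): 5, ("s3", "p1"): 7, ("s1", "p2"): 3}

def roll_up(cells, parent_of):
    """Aggregate (sum) the first dimension up to its parent level."""
    rolled = defaultdict(int)
    for (member, *rest), value in cells.items():
        rolled[(parent_of[member], *rest)] += value
    return dict(rolled)

sales_by_city = roll_up(sales_by_store, store_to_city)
print(sales_by_city)
```

Drilling down is the reverse move: it returns from `sales_by_city` to the finer `sales_by_store`, which therefore has to be kept (or recomputed from the base data).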
Drill-down
• Move in one dimension from a higher
granularity to a lower one
– city → store
– country → cities
– product type → product
• Drill-through:
– go back to the original, individual data records
Pivoting
• Change the dimensions that are
“displayed”; select a cross-tab.
– look at the cross-table for product-date
– display cross-table for date-customer
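Pivoting can be sketched as choosing which two dimensions form the cross-table and summing over the rest. The cube contents and the helper name `cross_table` below are hypothetical.

```python
from collections import defaultdict

DIMENSIONS = ("product", "date", "customer")

# Hypothetical cube cells: (product, date, customer) -> amount.
cube = {
    ("p1", "2006-06", "c1"): 10,
    ("p1", "2006-07", "c2"): 4,
    ("p2", "2006-06", "c1"): 8,
    ("p2", "2006-06", "c2"): 2,
}

def cross_table(cells, row_dim, col_dim):
    """Project the cube onto two dimensions, summing over the remaining ones."""
    r, c = DIMENSIONS.index(row_dim), DIMENSIONS.index(col_dim)
    table = defaultdict(int)
    for key, value in cells.items():
        table[(key[r], key[c])] += value
    return dict(table)

product_by_date = cross_table(cube, "product", "date")
date_by_customer = cross_table(cube, "date", "customer")
print(product_by_date)
print(date_by_customer)
```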
Slice & dice
• Select a part of the cube by restricting one
or more dimensions
– restrict analysis to “city = Eindhoven”
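A sketch of slice and dice as a filter on cell coordinates; restricting a single dimension is usually called a slice, restricting several a dice. The cube contents and the function name `dice` are assumptions for illustration.

```python
DIMENSIONS = ("city", "product", "year")

# Hypothetical cube cells: (city, product, year) -> amount.
cube = {
    ("Eindhoven", "O1", 2006): 12,
    ("Eindhoven", "O2", 2006): 7,
    ("Eindhoven", "O1", 2005): 9,
    ("Amsterdam", "O1", 2006): 20,
}

def dice(cells, **allowed):
    """Keep only cells whose coordinates lie in the allowed value sets."""
    positions = {DIMENSIONS.index(dim): set(vals) for dim, vals in allowed.items()}
    return {
        key: value for key, value in cells.items()
        if all(key[i] in vals for i, vals in positions.items())
    }

eindhoven = dice(cube, city=["Eindhoven"])          # slice: one dimension restricted
o1_2006 = dice(cube, product=["O1"], year=[2006])   # dice: several dimensions restricted
print(eindhoven)
print(o1_2006)
```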
Summary of Concepts
• Cube: Multidimensional view on data
– dimensional attributes
– measure attribute
• Operations:
– roll-up/drill-down
– pivoting
– slice and dice
Implementation
• To make query answering more efficient:
consolidate (materialize) aggregations
• Obvious implementation: multidimensional
array.
– Fast lookup of cell(prod. p, date d, prom. pr):
• look up index of p, index of d, index of pr
• index = (p x D x PR) + (d x PR) + pr, where D and PR are the numbers of dates and promotions
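The lookup formula can be sketched as follows, assuming 0-based indexes and a flat (row-major) array. The dimension sizes are hypothetical.

```python
def cell_index(p, d, pr, D, PR):
    """Row-major position of cell (p, d, pr) in a flat array.

    p, d, pr are 0-based indexes of product, date and promotion;
    D and PR are the number of dates and promotions.
    """
    return (p * D * PR) + (d * PR) + pr

P, D, PR = 3, 4, 2           # hypothetical dimension sizes
cube = [0] * (P * D * PR)    # dense array holding one measure per cell

cube[cell_index(2, 1, 0, D, PR)] += 25   # record an amount of 25 in cell (2, 1, 0)
```

Every cell gets a distinct position, so a lookup costs three index lookups plus a few multiplications and additions.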
Implementation
• Multidimensional array
– obvious problem: sparse data
– can easily be solved, though: store only the non-empty cells, keyed on the index, e.g. in a binary search tree or a hash table
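A minimal sketch of the hash-table variant: only non-empty cells are stored, keyed on the same linear index as in the dense array. The class name `SparseCube` and the dimension sizes are hypothetical.

```python
class SparseCube:
    """Stores only non-empty cells in a hash table keyed on the linear index."""

    def __init__(self, sizes):
        self.sizes = sizes          # number of values per dimension
        self.cells = {}             # linear index -> measure

    def _index(self, coords):
        index = 0
        for coord, size in zip(coords, self.sizes):
            index = index * size + coord   # row-major linearisation
        return index

    def add(self, coords, amount):
        i = self._index(coords)
        self.cells[i] = self.cells.get(i, 0) + amount

    def get(self, coords):
        # Cells that were never written count as empty (0).
        return self.cells.get(self._index(coords), 0)

cube = SparseCube(sizes=(1000, 365, 50))   # hypothetical dimension sizes
cube.add((2, 1, 0), 25)
cube.add((2, 1, 0), 5)
print(cube.get((2, 1, 0)), len(cube.cells))
```

Memory use now grows with the number of non-empty cells rather than with the product of the dimension sizes.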
Implementation
• However: very quickly people were confronted with the Data Explosion Problem
– Consolidating the summaries blows up the data enormously!
– The reasons are often misunderstood.
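Data Explosion Problem
• Why?
Suppose:
– n dimensions, every dimension has d values
– d^n possible tuples.
– Number of cells in the cube: (d+1)^n (every dimension gets one extra value, “all”)
– So, this is not the problem: (d+1)^n is only a factor (1 + 1/d)^n larger than d^n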
Data Explosion Problem
• Why?
Suppose
– n dimensions, every dimension has d values
– every dimension has a hierarchy
– most extreme case: binary tree
→ 2d possibilities/dimension → (2d)^n = 2^n x d^n cells
Only partial explanation (factor 2^n comes from an extremely pathological case)
Data Explosion Problem
• Why?
– The problem is that most data is not dense, but sparse.
– Hence, not all d^n combinations occur.
Example: 10 dimensions with 10 values each
– 10^10 = 10 000 000 000 possibilities
– Suppose « only » 1 000 000 are present
– Every tuple increases the count of 2^10 = 1 024 cells!
With hierarchies the effect is even worse:
– If every hierarchy has 5 items, every tuple increases the count of 5^10 = 9 765 625 cells!
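The effect can be checked with a small sketch that, for the case without hierarchies, lists every cube cell a base tuple contributes to. The function names `cells_touched` and `cube_size` are assumptions for illustration.

```python
from itertools import product

ALL = "all"

def cells_touched(tuple_):
    """All cube cells one base tuple contributes to (no hierarchies)."""
    # Each coordinate is either kept or generalised to "all": 2^n cells.
    return set(product(*[(value, ALL) for value in tuple_]))

def cube_size(tuples):
    """Number of non-empty cells in the full cube over the given tuples."""
    cells = set()
    for t in tuples:
        cells |= cells_touched(t)
    return len(cells)

one_tuple = tuple(range(10))             # a single tuple over 10 dimensions
print(len(cells_touched(one_tuple)))     # 1024
print(cube_size([(1, 2), (1, 3)]))       # 6
```

With dense data many tuples share their aggregated cells; with sparse data they share few, so the materialized cube can be far larger than the base data.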
View Selection Problem
• It suffices to precompute some aggregates and compute the others on demand.
– e.g., compute the aggregate on (item-name, color) from the aggregate on (item-name, color, size)
– works for all but a few “non-decomposable” aggregates such as median
• Several optimizations exist for computing multiple aggregates
– Compute aggregates on (item-name, color, size), (item-name, color) and (item-name) in a single DB sort
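A sketch of deriving a coarser aggregate from a finer one, assuming the aggregate is a sum. The data values and the function name `derive` are hypothetical. The last lines illustrate why median does not decompose this way.

```python
from collections import defaultdict
from statistics import median

# Hypothetical materialised aggregate on (item-name, color, size): total quantity.
by_item_color_size = {
    ("shirt", "blue", "S"): 4,
    ("shirt", "blue", "M"): 6,
    ("shirt", "red", "M"): 5,
    ("skirt", "blue", "S"): 2,
}

def derive(aggregate, keep):
    """Derive a coarser SUM aggregate by keeping only the first `keep` attributes."""
    coarser = defaultdict(int)
    for key, total in aggregate.items():
        coarser[key[:keep]] += total
    return dict(coarser)

by_item_color = derive(by_item_color_size, 2)   # no access to the base data needed
by_item = derive(by_item_color, 1)

# Median is not decomposable: the median of group medians differs from the true median.
groups = [[1, 2, 9], [3, 4]]
median_of_medians = median(median(g) for g in groups)
true_median = median(x for g in groups for x in g)
print(by_item_color, by_item, median_of_medians, true_median)
```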
View Selection Problem
[Figure: the lattice of cuboids over product, date and country, from “all” down to (product, date, country), with a subset of the cuboids selected for materialization]
Which views to select: hard research problem!
Implementation
Current systems can be divided into three categories:
– ROLAP (Relational OLAP)
• OLAP supported on top of a relational database
– MOLAP (Multi-Dimensional OLAP)
• Use of special multi-dimensional data structures
– HOLAP: (Hybrid)
• combination of previous two
ROLAP
• Cubes can easily be represented in relational tables: special value “all”

Month   Prod.   Cust.   Price
Jan     p1      c1      10
Jan     p2      c1      8
Jan     p1      c2      10
Feb     p1      c1      9
…
all     p1      c1      102
Jan     all     c1      18
Jan     p1      all     1 230
all     all     c1      4 235
…
all     all     all     1 253 458
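A minimal sketch of how such a relational cube can be filled, assuming price is aggregated with a sum. Only the four base rows of the example (Jan/p1/c1, Jan/p2/c1, Jan/p1/c2, Feb/p1/c1) are used, so totals that depend on rows elided by “…” are not reproduced. The function name `cube_rows` is hypothetical.

```python
from collections import defaultdict
from itertools import product

ALL = "all"

# The four base rows of the example: (month, product, customer, price).
facts = [
    ("Jan", "p1", "c1", 10),
    ("Jan", "p2", "c1", 8),
    ("Jan", "p1", "c2", 10),
    ("Feb", "p1", "c1", 9),
]

def cube_rows(rows):
    """Relational cube: one row per cell, with "all" marking aggregated columns."""
    totals = defaultdict(int)
    for *dims, price in rows:
        # Every base row contributes to each combination of kept/"all" columns.
        for cell in product(*[(value, ALL) for value in dims]):
            totals[cell] += price
    return dict(totals)

cube = cube_rows(facts)
print(cube[("Jan", ALL, "c1")])   # 18
print(cube[(ALL, ALL, ALL)])      # 37
```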
ROLAP
• Typical database schema:
– star schema
• fact table is central
• links to dimensional tables
– Extensions:
• snowflake schema
– dimensions have hierarchy/extra information attached
• fact constellation
– multiple star schemas sharing dimensions
Example of a Star Schema
[Figure: star schema; the central fact table links to each dimension table]
• Fact Table: OrderNO, SalespersonID, CustomerNO, ProdNo, DateKey, CityName, Quantity, Total Price
• Order: Order No, Order Date
• Customer: Customer No, Customer Name, Customer Address, City
• Salesperson: SalespersonID, SalespersonName, City, Quota
• Product: ProductNO, ProdName, ProdDescr, Category, CategoryDescription, UnitPrice
• Date: DateKey, Date
• City: CityName, State, Country
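A hedged sketch of a query against a star schema, using Python's built-in `sqlite3` module. Only a simplified subset of the example schema is created (the fact table with the City and Product dimensions), and all data values are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE City (CityName TEXT PRIMARY KEY, State TEXT, Country TEXT);
    CREATE TABLE Product (ProdNo INTEGER PRIMARY KEY, ProdName TEXT, Category TEXT);
    CREATE TABLE Fact (
        ProdNo INTEGER REFERENCES Product(ProdNo),
        CityName TEXT REFERENCES City(CityName),
        Quantity INTEGER,
        TotalPrice REAL
    );
    INSERT INTO City VALUES ('Eindhoven', 'Noord-Brabant', 'Netherlands'),
                            ('Amsterdam', 'Noord-Holland', 'Netherlands'),
                            ('Paris', 'Ile-de-France', 'France');
    INSERT INTO Product VALUES (1, 'TV', 'Electronics'), (2, 'PC', 'Electronics');
    INSERT INTO Fact VALUES (1, 'Eindhoven', 2, 800.0), (2, 'Amsterdam', 1, 900.0),
                            (1, 'Paris', 3, 1200.0);
""")

# Star join: the fact table joined to two dimension tables, rolled up to country.
rows = conn.execute("""
    SELECT c.Country, p.Category, SUM(f.TotalPrice)
    FROM Fact f
    JOIN City c ON c.CityName = f.CityName
    JOIN Product p ON p.ProdNo = f.ProdNo
    GROUP BY c.Country, p.Category
    ORDER BY c.Country
""").fetchall()
print(rows)
```

The fact table holds only keys and measures; the descriptive attributes used for grouping (Country, Category) come from the dimension tables.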
Example of a Snowflake Schema
[Figure: snowflake schema; dimension tables are normalized into hierarchy tables]
• Fact Table: OrderNO, SalespersonID, CustomerNO, ProdNo, DateKey, CityName, Quantity, Total Price
• Order: Order No, Order Date
• Customer: Customer No, Customer Name, Customer Address, City
• Salesperson: SalespersonID, SalespersonName, City, Quota
• Product: ProductNO, ProdName, ProdDescr, Category, UnitPrice
– Category: CategoryName, CategoryDescr
• Date: DateKey, Date, Month
– Month: Month, Year
– Year: Year
• City: CityName, State, Country
– State: StateName, Country
Example of Fact Constellation
Multiple fact tables share dimension tables
• Sales Fact Table: Time_key, Item_key, Branch_key, Location_key; measures: Unit_sold, Euros_sold, Avg_sales
• Shipping Fact Table: Time_key, Item_key, shipper_key, from_location, to_location; measures: Euros_sold, unit_shipped
• Time: time_key, day, day_of_the_week, month, quarter, year
• Item: item_key, item_name, brand, type, supplier_key
• Branch: branch_key, branch_name, branch_type
• Location: location_key, street, city, province/state, country
• shipper: shipper_key, shipper_name, location_key, shipper_type
SQL 1999 support for OLAP
• see other set of slides