Analysis of Additivity in OLAP Systems
John Horner and Il-Yeol Song
john.horner@drexel.edu
College of Information Science & Technology
Drexel University
Philadelphia, PA 19104
USA
Peter P. Chen
Department of Computer Science
Louisiana State University
Baton Rouge, LA 70803
Online Analytical Processing (OLAP) Systems
• Historical, integrated, relatively static data
• Orders of magnitude larger than transactional systems
• Used for strategic decision making
• Query outputs are nearly always aggregated sets of base data
• Effective summarizability is of paramount concern
Structure
• Facts are measures of interest
• Dimensions are attributes used to
identify, select, group, and aggregate
measures of interest.
• Attributes that are used to aggregate
measures are labeled classification
attributes, and are typically conceptualized
as hierarchies
Operations
• Roll-up increases the level of aggregation along
one or more classification hierarchies
• Drill-down decreases the level of aggregation
along one or more classification hierarchies
• Slice-Dice selects and projects the data
• Pivoting reorients the multi-dimensional data
view to allow exchanging facts for dimensions
symmetrically
• Merging performs a union of separate roll-up
operations
Additivity
• The ability to use the aggregate summation operator to
accurately summarize data is known as Additivity
• A measure is Additive along a dimension if the sum
operator can be used to meaningfully aggregate values
along all hierarchies in that dimension
• Fully-additive measures are additive across all
dimensions
• Semi-additive measures are only additive across certain
dimensions
• Non-additive measures are not additive across any
dimension
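The three definitions above amount to a simple classification rule. The following is a minimal sketch of that rule, not part of the presentation; the function name `classify_additivity` and the dimension names are assumptions chosen for illustration.

```python
def classify_additivity(additive_dims, all_dims):
    """Classify a measure by the dimensions along which SUM is meaningful.

    additive_dims: dimensions along which the measure can be summed
    all_dims: every dimension of the fact (hypothetical names)
    """
    additive = set(additive_dims) & set(all_dims)
    if additive == set(all_dims):
        return "fully-additive"
    if additive:
        return "semi-additive"
    return "non-additive"


dims = {"Date", "Customer", "Product"}
# A sales amount can be summed along every dimension.
print(classify_additivity({"Date", "Customer", "Product"}, dims))
# An account balance cannot be summed across dates.
print(classify_additivity({"Customer", "Product"}, dims))
# A ratio cannot be summed along any dimension.
print(classify_additivity(set(), dims))
```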
Additivity Example
          Customer
Date      100001       100002       100003       100004       TOTAL
1/1/2000  500          700          9890         600          ADDITIVE
2/1/2000  800          450          10050        200          …
3/1/2000  980          900          8700         800          …
4/1/2000  400          360          7800         750          …
…         …            …            …            …            …
TOTAL     NONADDITIVE  …            …            …            …
Classification
Classification: Examples
1.0 Non-Additive
  1.1 Fractions
    1.1.1 Ratios: GMROI, Profitability ratios
    1.1.2 Percentages: Profit margin percent, return percentage
  1.2 Measurements of intensity: Temperature, Blood pressure
  1.3 Average/Maximum/Minimum
    1.3.1 Averages: Grade point average, Temperature
    1.3.2 Maximums: Temperature, Hourly hospital admissions, Electricity usage, Blood pressure
    1.3.3 Minimums: Temperature, Hourly hospital admissions, Electricity usage, Blood pressure
  1.4 Measurements of direction: Wind direction, Cartographic bearings, Geometric angles
  1.5 Identification attributes
    1.5.1 Codes: Zip code, ISBN, ISSN, Area Code, Phone Number, Barcode
    1.5.2 Sequence numbers: Surrogate key, Order number, Transaction number, Invoice number
2.0 Semi-Additive
  2.1 Dirty Data: Missing data, Duplicate data, Incorrect data
  2.2 Changing data: Area codes, Department names, customer address
  2.3 Temporally non-additive: Account balances, Quantity on hand, Quantity sold
  2.4 Categorically non-additive: Basket counts, Quantity on hand, Quantity sold
Non-Additive Measures
• Ratios and Percentages
• Measures of Intensity
• Average / Maximum / Minimum
• Measures of Direction
Semi-Additive Facts
• Dirty Data
• Changing Data
• Temporally Non-Additive
• Categorically Non-Additive
• Not Mutually Exclusive
  – e.g. Measures can be both temporally and categorically non-additive
Causes of Dirty Data
[Figure: CustomerID values, comparing "Customers as Stored in Database" with "Actual Customers". Values shown: 000001; 01245, 4; 20145, 4; 74565, 4; 99999, 9. Callouts: "Arbitrary Missing Data Value" and "Customer who pre-dates system".]
• Summing measures associated with dirty data can result in
inaccurate summaries if not all instances are counted, if
instances are counted multiple times, or if instances are
counted in the wrong group
Rolling-up Dirty Data
[Figure: Transactions mapped onto a Classification Hierarchy, with three callouts:]
– Anomaly will disappear when rolled up to the zip code level
– Anomaly will disappear when rolled up to the State level
– Anomaly will disappear when rolled up to the country level
• As measures are rolled up further along hierarchies, certain
inaccurate values will be merged into the appropriate groups
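The effect described above can be sketched with a tiny example. Everything in it is hypothetical (the zip codes, the amounts and the `roll_up` helper are illustrative assumptions): one transaction is recorded under a wrong zip code that still lies in the correct state, so the zip-level totals are wrong while the state-level totals are right.

```python
from collections import defaultdict

# Hypothetical hierarchy: zip code -> state (values chosen for illustration).
ZIP_TO_STATE = {"19104": "PA", "19103": "PA", "70803": "LA"}

# (recorded zip, correct zip, amount); the second row carries a wrong zip code
# that still lies in the correct state.
transactions = [
    ("19104", "19104", 100),
    ("19103", "19104", 50),
    ("70803", "70803", 70),
]


def roll_up(rows, key):
    """Sum the amount column of rows, grouped by key(row)."""
    totals = defaultdict(int)
    for row in rows:
        totals[key(row)] += row[2]
    return dict(totals)


recorded_by_zip = roll_up(transactions, lambda r: r[0])
correct_by_zip = roll_up(transactions, lambda r: r[1])
recorded_by_state = roll_up(transactions, lambda r: ZIP_TO_STATE[r[0]])
correct_by_state = roll_up(transactions, lambda r: ZIP_TO_STATE[r[1]])

print(recorded_by_zip != correct_by_zip)      # True: zip-level totals differ
print(recorded_by_state == correct_by_state)  # True: state-level totals agree
```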
Hierarchy Completeness
• All instances belong to one higher level instance, which consists of
those instances only
• Complete hierarchy (top), country consists of only the provinces
listed
• Incomplete hierarchy (bottom), not all customers in the city are
stored in the data warehouse; or not all customers in data
warehouse have a city listed
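[Figure: Complete hierarchy (top): Country C1 consists of Provinces Pro1, Pro2 and Pro3. Incomplete hierarchy (bottom): Customers Cust1, Cust2, Custn and Custx at the Customer level, below the City level.]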
Example of Additivity Problems Associated with Incomplete Hierarchies
CustID  City        SalesAmt
1       Washington  100
2       New York    200
999     Unknown     100
4       New York    150
5       Washington  150
6       Washington  150
999     Unknown     100
Total
Summary
City                      Sales
Washington                400
New York                  350
Total                     750
Unknown                   200
Total including Unknown   950
• If Sales are rolled up to the city, but not all customers have a city stored in
the database, then the summary will not accurately portray the sales grouped
by city.
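The roll-up in this example can be reproduced in a few lines. This is a sketch only: the rows are as read from the slide's table, and the per-city error bounds at the end rest on an added assumption that all unknown sales could belong to any single city.

```python
from collections import defaultdict

# Rows as read from the example: (CustID, City, SalesAmt).
# CustID 999 / "Unknown" marks a customer with no city stored.
sales = [
    (1, "Washington", 100),
    (2, "New York", 200),
    (999, "Unknown", 100),
    (4, "New York", 150),
    (5, "Washington", 150),
    (6, "Washington", 150),
    (999, "Unknown", 100),
]

by_city = defaultdict(int)
for _cust_id, city, amount in sales:
    by_city[city] += amount

unknown = by_city["Unknown"]
known_total = sum(v for c, v in by_city.items() if c != "Unknown")
grand_total = sum(by_city.values())

print(dict(by_city))             # {'Washington': 400, 'New York': 350, 'Unknown': 200}
print(known_total, grand_total)  # 750 950

# Assumed error bounds: each city's true sales lie between its known sales
# and its known sales plus all sales with an unknown city.
bounds = {c: (v, v + unknown) for c, v in by_city.items() if c != "Unknown"}
print(bounds)  # {'Washington': (400, 600), 'New York': (350, 550)}
```

The gap between 750 and 950 is the amount that a city-level summary silently leaves out.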
Changing Data
• It is important to track merges, splits, and
overlapping hierarchies, especially those
that affect classification hierarchies, as the
characteristics of the data and
environment change
Changing Data Example
Year  City          Area Code  Population
1990  Philadelphia  215        200
2000  Philadelphia  610        150
2000  Philadelphia  215        150
2000  Philadelphia  484        100
• Area code 215 split into three area codes (215, 610, and 484). The population trend for area code 215 alone shows a decrease from 200 to 150, when in fact the population of the area originally covered by area code 215 has doubled, from 200 to 400.
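The arithmetic behind this example can be checked directly. The sketch below assumes a mapping (`ORIGINAL_CODE`, a hypothetical name) from each current area code back to the code that covered it in 1990; tracking such a mapping is what makes the comparison across years meaningful.

```python
# Rows from the example: (year, city, area_code, population)
rows = [
    (1990, "Philadelphia", "215", 200),
    (2000, "Philadelphia", "610", 150),
    (2000, "Philadelphia", "215", 150),
    (2000, "Philadelphia", "484", 100),
]

# Assumed mapping from each current area code to the code that covered it in 1990.
ORIGINAL_CODE = {"215": "215", "610": "215", "484": "215"}


def population(year, code, use_original_region):
    """Sum population for a year, matching on the current or the original area code."""
    return sum(
        pop
        for y, _city, c, pop in rows
        if y == year and (ORIGINAL_CODE[c] if use_original_region else c) == code
    )


naive_1990 = population(1990, "215", use_original_region=False)
naive_2000 = population(2000, "215", use_original_region=False)
region_2000 = population(2000, "215", use_original_region=True)

print(naive_1990, naive_2000, region_2000)  # 200 150 400
```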
Temporally Non-Additive
• Measures that cannot be meaningfully
added across different time periods are
temporally non-additive
• Examples
– Account balances
– Quantity on hand
Temporally Non-Additive Example
Date      100001       100002       100003       100004       TOTAL
1/1/2000  500          700          9890         600          …
2/1/2000  800          450          10050        200          …
3/1/2000  980          900          8700         800          …
4/1/2000  400          360          7800         750          …
…         …            …            …            …            …
TOTAL     NON…         ADDITIVE     …            …            …
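Temporally Non-Additive SQL
-- Not meaningful: sums each customer's balance snapshots across dates
SELECT CustomerID, SUM(balance)
FROM AccountFact
GROUP BY CustomerID;
-- Meaningful: sums balances across customers within each snapshot date
SELECT date, SUM(balance)
FROM AccountFact
GROUP BY date;
Must group by time interval of snapshot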
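To make the two queries concrete, the sketch below runs them with Python's built-in `sqlite3` module against the four dated rows of the example table, assuming those values are account balances. The column name `snapshot_date` is an assumption used in place of `date`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE AccountFact (snapshot_date TEXT, CustomerID TEXT, balance INTEGER)"
)

customers = ["100001", "100002", "100003", "100004"]
# Balances per snapshot date, one value per customer, as in the example table.
balances = {
    "1/1/2000": [500, 700, 9890, 600],
    "2/1/2000": [800, 450, 10050, 200],
    "3/1/2000": [980, 900, 8700, 800],
    "4/1/2000": [400, 360, 7800, 750],
}
for snapshot_date, values in balances.items():
    for customer, balance in zip(customers, values):
        conn.execute(
            "INSERT INTO AccountFact VALUES (?, ?, ?)",
            (snapshot_date, customer, balance),
        )

# Meaningful: total balance held across customers on each snapshot date.
per_date = conn.execute(
    "SELECT snapshot_date, SUM(balance) FROM AccountFact "
    "GROUP BY snapshot_date ORDER BY snapshot_date"
).fetchall()

# Not meaningful: adds the same customer's balance once per snapshot.
per_customer = conn.execute(
    "SELECT CustomerID, SUM(balance) FROM AccountFact "
    "GROUP BY CustomerID ORDER BY CustomerID"
).fetchall()

print(per_date)
print(per_customer)
```

The per-customer sums (for example 2680 for customer 100001) do not correspond to any amount the customer ever held, which is why the balance is non-additive across time.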
Categorically Non-Additive
• Measures that cannot meaningfully be summed across different types of items can be considered categorically non-additive
• Examples
– Basket counts
– Quantity on hand
Categorically Non-Additive Example
Date      Customer  Item ID  Product Name        …  Basket Count
1/1/2000  1         10001    X Brand Soup        …  5
1/1/2000  1         10002    Y Brand Soup        …  2
1/1/2000  2         12510    Z Brand Television  …  1
1/1/2000  3         10001    X Brand Soup        …  4
…         …         …        …                   …  …
TOTAL     …         …        …                   …  NONADDITIVE
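Categorically Non-Additive SQL
-- Not meaningful: adds basket counts of unlike products (e.g. soup and televisions)
SELECT SUM(BasketCount)
FROM SalesFact;
-- Meaningful: sums basket counts within each product
SELECT ProductName, SUM(BasketCount)
FROM SalesFact
GROUP BY ProductName;
Must group by attribute in product family hierarchy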
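The same contrast can be shown on the four rows of the example table. This is an illustrative sketch in plain Python rather than SQL; the tuple layout is an assumption.

```python
from collections import defaultdict

# Rows from the example: (date, customer, item_id, product_name, basket_count)
sales_fact = [
    ("1/1/2000", 1, 10001, "X Brand Soup", 5),
    ("1/1/2000", 1, 10002, "Y Brand Soup", 2),
    ("1/1/2000", 2, 12510, "Z Brand Television", 1),
    ("1/1/2000", 3, 10001, "X Brand Soup", 4),
]

# Not meaningful: a single number that mixes cans of soup with televisions.
grand_total = sum(row[4] for row in sales_fact)

# Meaningful: basket counts summed within each product.
by_product = defaultdict(int)
for *_rest, product_name, basket_count in sales_fact:
    by_product[product_name] += basket_count

print(grand_total)       # 12
print(dict(by_product))  # {'X Brand Soup': 9, 'Y Brand Soup': 2, 'Z Brand Television': 1}
```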
Others’ Suggestions
• The distinction between meaningful and meaningless aggregation data should be stored in an appendix
  » Hüsemann et al. (2000)
• Data should be normalized into a General Multidimensional Normal Form (GMNF), whereby aggregation anomalies are avoided through a conceptual modeling approach that emphasizes sorting out dimensions, dimensional hierarchies, and which measures belong where
  » Hüsemann et al. (2000)
• We need to rigorously classify hierarchies and detailed characteristics of hierarchies, such as completeness and multiplicity
  » Pourabbas and Rafanelli (1999)
• Conceptual models should explicitly depict hierarchies and aggregation constraints along hierarchies, and a fact glossary should be developed describing how each fact was derived from an ER model
  » Golfarelli and Rizzi (1998)
• Slowly Changing Dimensions (Kimball and Ross, 2002)
  – Type 1: simply overwriting data
  – Type 2: storing the new data instance in a new row, but with a common field to link the dimensions as being the same
  – Type 3: adding a new attribute to the dimension table to store both the new and old values
Our Suggestions
• No simple solution
– Can’t always eliminate potential inaccuracies
– Categorically Non-additive data
– Glossaries may be ignored
– Conceptual models may be overly complex
– This doesn’t mean that we shouldn’t have glossaries and include
constraints in conceptual models
• Online Summarizability Constraints
– Imagine abundance of update anomalies in transactional
systems if possible violations are only stored in glossaries or
conceptual models
• Where measures are imprecise, queries should show
error bounds
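One way to picture an online summarizability constraint is a check that runs when a query is issued, rather than a note in a glossary. The sketch below is a hypothetical illustration: the `MEASURES` metadata, its contents and the `sum_violations` function are assumptions, not a design given in the presentation.

```python
# Hypothetical metadata: for each measure, its dimensions and the subset of
# dimensions along which SUM is meaningful.
MEASURES = {
    "Balance": {
        "dimensions": {"Date", "Customer"},
        "additive": {"Customer"},  # temporally non-additive
    },
    "BasketCount": {
        "dimensions": {"Date", "Customer", "Product"},
        "additive": {"Date", "Customer"},  # categorically non-additive
    },
}


def sum_violations(measure, group_by):
    """Return the non-additive dimensions that SUM(measure) would collapse.

    Any dimension not listed in group_by is aggregated away by the query,
    so it must be one along which the measure is additive.
    """
    meta = MEASURES[measure]
    collapsed = meta["dimensions"] - set(group_by)
    return collapsed - meta["additive"]


print(sum_violations("Balance", ["Customer"]))     # {'Date'}
print(sum_violations("Balance", ["Date"]))         # set()
print(sum_violations("BasketCount", []))           # {'Product'}
print(sum_violations("BasketCount", ["Product"]))  # set()
```

A query tool could use a non-empty result to warn the analyst or to reject the query.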
Hierarchies
• Strict - each object at a lower level belongs to only one
value at a higher level
• Non-strict - can be thought of as a many-to-many
relationship between a higher level of the hierarchy and
the lower level
• Complete - all members belong to one higher-class
object, which consists of those members only
• Incomplete – not complete
• Multiple path - lower object splits into two distinct higher
level objects
• Alternate path - multiple path hierarchy that joins again
at a higher level
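Strictness, and one symptom of incompleteness, can be tested mechanically once a hierarchy is written as a mapping from each lower-level member to its set of higher-level members. The following sketch uses that representation as an assumption; the function names are hypothetical, and the sample mappings are taken from the example tables elsewhere in this deck.

```python
def is_strict(parents_of):
    """True if every lower-level member maps to at most one higher-level member."""
    return all(len(parents) <= 1 for parents in parents_of.values())


def members_without_parent(parents_of):
    """Lower-level members with no higher-level member (a sign of incompleteness)."""
    return sorted(m for m, parents in parents_of.items() if not parents)


# Person -> Department, from the Person/Dept/Project/Hours example.
person_dept = {1: {1}, 2: {1}, 3: {2}, 4: {2}, 5: {2}, 6: {2}}
# Project -> Department, from the denormalized fact table: project 3 has two departments.
project_dept = {1: {1}, 2: {1}, 3: {1, 2}, 4: {2}, 5: {2}}
# Customer -> City, where customer 999 has no city recorded.
customer_city = {1: {"Washington"}, 2: {"New York"}, 999: set()}

print(is_strict(person_dept))                 # True
print(is_strict(project_dept))                # False
print(members_without_parent(customer_city))  # [999]
```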
Hierarchy Strictness
• In strict hierarchies, lower level instances in
hierarchy belong to only one higher level
instance
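[Figure: Strict hierarchy: each Person (P1 to P5) belongs to one Department (D1 or D2). Non-strict hierarchy: Projects (Pr1 to Pr5) and Departments (D1, D2), where a project can belong to more than one department.]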
Example of Additivity Problems Associated with Non-Strict Hierarchies
Project  Dollars
1        10000
2        15000
3        120000
4        50000
5        30000
Total    225000
Denormalized Fact Table
Dept   Project  Dollars
1      1        10000
1      2        15000
1      3        120000
2      3        120000
2      4        50000
2      5        30000
Total           345000
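The two totals differ because project 3 appears once per department in the denormalized table. A minimal sketch of the double count, and of one way to avoid it by counting each project once, is shown below; the variable names are illustrative.

```python
# (Dept, Project, Dollars) rows from the denormalized fact table.
fact = [
    (1, 1, 10000),
    (1, 2, 15000),
    (1, 3, 120000),
    (2, 3, 120000),
    (2, 4, 50000),
    (2, 5, 30000),
]

# Summing every row counts project 3 twice, once for each department.
naive_total = sum(dollars for _dept, _project, dollars in fact)

# Keeping one entry per project removes the duplicate before summing.
project_dollars = {project: dollars for _dept, project, dollars in fact}
correct_total = sum(project_dollars.values())

print(naive_total)    # 345000
print(correct_total)  # 225000
```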
Alternate and Multiple Path Hierarchies
[Figure: a. Alternate Path Classification Hierarchy, with levels Store, City, AreaCode, ZipCode, County, State, Country. b. Multiple Path Classification Hierarchy, with levels Date, Week, Month, DayOfWeek, Quarter, Year.]
• Inaccurate summaries can result
from merging aggregates from
multiple paths of a hierarchy.
Example of Problems Associated with Merging Multiple Path Hierarchies
Person  Dept  Project  Hours
1       1     1        40
2       1     2        100
3       2     2        50
4       2     2        50
5       2     2        40
6       2     2        80
Multiple Path Hierarchy: Person rolls up to both Department and Project
Department 1: 140 hrs
Project 2: 320 hrs
Merged: 460 hrs (should be 360 hrs)
• Adding Hours from all the people in Department 1 with all the people who
worked on Project 2 results in an inaccurate summary because Person 2 is
counted twice.
• The summary would be accurate if each project mapped directly to exactly one department
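The double count in this merge can be reproduced from the table. The sketch below is illustrative: it compares adding the two roll-ups with summing each qualifying row once.

```python
# (Person, Dept, Project, Hours) rows from the example table.
rows = [
    (1, 1, 1, 40),
    (2, 1, 2, 100),
    (3, 2, 2, 50),
    (4, 2, 2, 50),
    (5, 2, 2, 40),
    (6, 2, 2, 80),
]

dept1_hours = sum(h for _p, dept, _proj, h in rows if dept == 1)
project2_hours = sum(h for _p, _dept, proj, h in rows if proj == 2)

# Merging the two roll-ups by addition counts Person 2 in both groups.
merged_naive = dept1_hours + project2_hours

# Summing each row once, if it is in Department 1 or on Project 2, avoids this.
merged_correct = sum(h for _p, dept, proj, h in rows if dept == 1 or proj == 2)

print(dept1_hours, project2_hours)   # 140 320
print(merged_naive, merged_correct)  # 460 360
```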
Our Suggestions (Cont.)
Conclusions
• Recognizing whether measures are fully-, semi-, or non-additive is
essential to identifying and resolving potential inaccurate summaries
in OLAP systems
• Non-additive measures cannot be aggregated using the sum
operator
• Semi-additive measures can sometimes be aggregated using the
sum operator, but at other times cannot
• Therefore, semi-additive measures pose the highest risk of unrecognized inaccurate summaries
• There are several reasons why data could be semi-additive
  – Adding different types of items together
  – Adding measures multiple times in the same summary
  – Not including all instances when aggregating measures
  – Including measures in the wrong groups
• Metadata could be used to alert analysts to potentially inaccurate
queries
References
• Golfarelli, M., Maio, D., and Rizzi, S. (1998). Conceptual Design of Data Warehouses from E/R Schemes. Proceedings of the Thirty-First Hawaii International Conference on System Sciences, 6-9 Jan. 1998, 7, 334-343.
• Hüsemann, B., Lechtenbörger, J., and Vossen, G. (2000). Conceptual data warehouse design. Proc. International Workshop on Design and Management of Data Warehouses, 2000.
• Kimball, R. and Ross, M. (2002). The Data Warehouse Toolkit: Second Edition. John Wiley and Sons, Inc.
• Pourabbas, E. and Rafanelli, M. (1999). Characterizations of hierarchies and some operators in OLAP environments. Proceedings of the 2nd ACM international workshop on Data warehousing and OLAP. Kansas City, Missouri. 54-59.
• Shoshani, A. (1997). OLAP and statistical databases: Similarities and differences. Proceedings of the sixteenth ACM SIGACT-SIGMOD-SIGART symposium on Principles of database systems. Tucson, Arizona. 185-196. ACM Press, New York, NY.