Analysis of Additivity in OLAP Systems John Horner and Il-Yeol Song john.horner@drexel.edu College of Information Science & Technology Drexel University Philadelphia, PA 19104 USA Peter P. Chen Department of Computer Science Louisiana State University Baton Rouge, LA 70803 Online Analytical Processing (OLAP) Systems • Historical, integrated, relatively static data • Magnitudes larger than transactional systems • Used for strategic decision making • Query outputs nearly always aggregated sets of base data • Effective summarizability is of paramount concern 2 Structure • Facts are measures of interest • Dimensions are attributes used to identify, select, group, and aggregate measures of interest. • Attributes that are used to aggregate measures are labeled classification attributes, and are typically conceptualized as hierarchies 3 Operations • Roll-up increases the level of aggregation along one or more classification hierarchies • Drill-down decreases the level of aggregation along one or more classification hierarchies • Slice-Dice selects and projects the data • Pivoting reorients the multi-dimensional data view to allow exchanging facts for dimensions symmetrically • Merging performs a union of separate roll-up operations 4 Additivity • The ability to use the aggregate summation operator to accurately summarize data is known as Additivity • A measure is Additive along a dimension if the sum operator can be used to meaningfully aggregate values along all hierarchies in that dimension • Fully-additive measures are additive across all dimensions • Semi-additive measures are only additive across certain dimensions • Non-additive measures are not additive across any dimension 5 Additivity Example Customer Date 100001 100002 100003 100004 TOTAL 1/1/2000 500 700 9890 600 ADDITIVE 2/1/2000 800 450 10050 200 … 3/1/2000 980 900 8700 800 … 4/1/2000 400 360 7800 750 … … … … … … … TOTAL NONADDITIVE … … … … 6 Classification Classification Examples 1.0 Non-Additive 1.1 Fractions 1.1.1 Ratios GMROI, Profitability ratios 1.1.2 Percentages Profit margin percent, return percentage 1.2 Measurements of intensity Temperature, Blood pressure 1.3 Average/Maximum/Minimum 1.3.1 Averages Grade point average, Temperature 1.3.2 Maximums Temperature, Hourly hospital admissions, Electricity usage, Blood pressure 1.3.3 Minimums Temperature, Hourly hospital admissions, Electricity usage, Blood pressure 1.4 Measurements of direction Wind direction, Cartographic bearings, Geometric angles 1.5 Identification attributes 1.5.1 Codes Zip code, ISBN, ISSN, Area Code, Phone Number, Barcode 1.5.1 Sequence numbers Surrogate key, Order number, Transaction number, Invoice number 2.0 Semi-Additive 2.1 Dirty Data Missing data, Duplicate data, Incorrect data 2.2 Changing data Area codes, Department names, customer address 2.3 Temporally non-additive Account balances, Quantity on hand, Quantity sold 2.4 Categorically non-additive Basket counts, Quantity on hand, Quantity sold 7 Non-Additive Measures • • • • Ratios and Percentages Measures of Intensity Average / Maximum / Minimum Measures of Direction 8 Semi-Additive Facts • • • • • Dirty Data Changing Data Temporally Non-Additive Categorically Non-Additive Not Mutually Exclusive – e.g. Measures can be both temporally and categorically non-additive 9 Causes of Dirty Data CustomerID Arbitrary Missing Data Value 000001 01245 4 20145 4 74565 4 99999 9 Customers as Stored in Database Actual Customers Customer who pre-dates system • Summing measures associated with dirty data can result in inaccurate summaries if not all instances are counted, if instances are counted multiple times, or if instances are counted in the wrong group 10 Rolling-up Dirty Data Classification Hierarchy Transactions Anomaly will disappear when rolled up to the State level Anomaly will disappear when rolled up to the zip code level Anomaly will disappear when rolled up to the country level • As measures are rolled up further along hierarchies, certain inaccurate values will be merged into the appropriate groups 11 Hierarchy Completeness • All instances belong to one higher level instance, which consists of those instances only • Complete hierarchy (top), country consists of only the provinces listed • Incomplete hierarchy (bottom), not all customers in the city are stored in the data warehouse; or not all customers in data warehouse have a city listed Pro1 C1 Country Pro2 Province Pro3 Complete City City Incomplete Cust1 Cust2 Custn Custx Customer 12 Example of Additivity Problems Associated with Incomplete Hierarchies CustID City SalesAmt 1 Washington 100 2 New York 200 999 Unknown 100 4 New York 150 5 Washington 150 6 999 Total Washington Unknown 150 100 Summary City Sales Washington 400 New York 350 Total 750 Unknown 200 950 • If Sales are rolled up to the city, but not all customers have a city stored in the database, then the summary will not accurately portray the sales grouped by city. 13 Changing Data • It is important to track merges, splits, and overlapping hierarchies, especially those that affect classification hierarchies, as the characteristics of the data and environment change 14 Changing Data Example Year City Area Code Population 1990 Philadelphia 215 200 2000 Philadelphia 610 150 2000 2000 Philadelphia Philadelphia 215 484 150 100 • Area code 215 split into 3 area codes. Looking at population trend in 215 area code would show a decrease, when in fact population in area originally covered by 215 area code has doubled. 15 Temporally Non-Additive • Measures that cannot be meaningfully added across different time periods are temporally non-additive • Examples – Account balances – Quantity on hand 16 Temporally Non-Additive Example Date 100001 100002 100003 100004 TOTAL 1/1/2000 500 700 9890 600 … 2/1/2000 800 450 10050 200 … 3/1/2000 980 900 8700 800 … 4/1/2000 400 360 7800 750 … … … … … … … TOTAL NON… ADDITIVE … … … 17 Temporally Non-Additive SQL Select sum(balance), CustomerID From AccountFact Group by CustomerID; Select sum(balance), date From AccountFact Group by date; Must group by time interval of snapshot 18 Categorically Non-Additive • Measures that cannot meaningfully be summed across different types of items can be considered categorically nonadditive • Examples – Basket counts – Quantity on hand 19 Categorically Non-Additive Example Date Customer Item ID Product Name … Basket Count 1/1/2000 1 10001 X Brand Soup … 5 1/1/2000 1 10002 Y Brand Soup … 2 1/1/2000 2 12510 Z Brand Television … 1 1/1/2000 3 10001 X Brand Soup … 4 … … … … … … TOTAL … … … … NONADDITIVE 20 Categorically Non-Additive SQL Select sum(BasketCount) From SalesFact; Select sum(BasketCount), ProductName From SalesFact Group by ProductName; Must group by attribute in product family hierarchy 21 Others’ Suggestions • The distinction between meaningful and meaningless aggregation data should be stored in an appendix » • Data should be normalized into a General Multidimensional Normal Form (GMNF), whereby aggregation anomalies are avoided through a conceptual modeling approach that emphasizes sorting out dimensions, dimensional hierarchies, and which measures belong where. » • Golfarelli and Rizzi (1998) We need to rigorously classify hierarchies and detailed characteristics of hierarchies, such as completeness and multiplicity » • Hüsemann et al (2000) Conceptual models should explicitly depicts hierarchies and aggregation constraints along hierarchies, and a fact glossary should be developed describing how each fact was derived from an ER model » • Hüsemann et al (2000) Pourabbas and Rafanelli (1999) Slowly Changing Dimensions (Kimball and Ross, 2002) – – – Type 1: simply overwriting data Type 2: storing the new data instance in a new row, but with a common field to link the dimensions as being the same Type 3: Adding a new attribute to the dimension table to store both the new and old values 22 Our Suggestions • No simple solution – Can’t always eliminate potential inaccuracies – Categorically Non-additive data – Glossaries may be ignored – Conceptual models may be overly complex – This doesn’t mean that we shouldn’t have glossaries and include constraints in conceptual models • Online Summarizability Constraints – Imagine abundance of update anomalies in transactional systems if possible violations are only stored in glossaries or conceptual models • Where measures are imprecise, queries should show error bounds 23 Hierarchies • Strict - each object at a lower level belongs to only one value at a higher level • Non-strict - can be thought of as a many-to-many relationship between a higher level of the hierarchy and the lower level • Complete - all members belong to one higher-class object, which consists of those members only • Incomplete – not complete • Multiple path - lower object splits into two distinct higher level objects • Alternate path - multiple path hierarchy that joins again at a higher level 24 Hierarchy Strictness • In strict hierarchies, lower level instances in hierarchy belong to only one higher level instance D1 D2 Department Strict P1 P2 P3 D1 P4 P5 D2 Person Department Non-Strict Pr1 Pr2 Pr3 Pr4 Pr5 Project 25 Example of Additivity Problems Associated with Non-Strict Hierarchies Project Dollars 1 10000 2 3 15000 120000 4 50000 5 30000 Total 225000 Denormalized Fact Table Dept Project Dollars 1 1 10000 1 2 15000 1 3 120000 2 3 120000 2 4 50000 2 5 30000 Total 345000 26 Alternate and Multiple Path Hierarchies a. Alternate Path Classification Hierarchy b. Multiple Path Classification Hierarchy Store Date Week City AreaCode Month DayOfWeek ZipCode County State Quarter Year Country • Inaccurate summaries can result from merging aggregates from multiple paths of a hierarchy. 27 Example of Problems Associated with Merging Multiple Path Hierarchies 140 hrs 320 hrs 460 hrs Person Dept Project Hours 1 1 1 40 2 1 2 100 3 2 2 50 4 2 2 50 5 2 2 40 6 2 2 80 Multiple Path Hierarchy Person Department Project Should be 360 hrs • Adding Hours from all the people in Department 1 with all the people who worked on Project 2 results in an inaccurate summary because Person 2 is counted twice. • The summary would not be inaccurate if each project mapped directly to 1 department 28 Our Suggestions (Cont.) 29 Our Suggestions (Cont.) 30 Conclusions • Recognizing whether measures are fully-, semi-, or non-additive is essential to identifying and resolving potential inaccurate summaries in OLAP systems • Non-additive measures cannot be aggregated using the sum operator • Semi-additive measures can sometimes be aggregated using the sum operator, but at other times cannot • Therefore, semi-additive attributes pose the highest risk for unrecognized inaccurate summaries • There are several reasons why data could be semi-additive – – – – Adding different types of items together Adding measures multiple times in the same summary Not including all instances when aggregating measures Including measures in the wrong groups • Metadata could be used to alert analysts to potentially inaccurate queries 31 References • Golfarelli, M., Maio, D., and Rizzi, S. (1998). Conceptual Design of Data Warehouses from E/R Schemes. Proceedings of the Thirty-First Hawaii International Conference, 6-9 Jan. 1998, 7, 334 – 343. • Hüsemann, B., Lechtenbörger, J, and Vossen, G. (2000). Conceptual data warehouse design. Proc. International Workshop on Design and Management of Data Warehouses, 2000. • Kimball, R. and Ross, M. (2002). The Data Warehouse Toolkit: Second Edition. John Wiley and Sons, Inc. • Pourabbas, E. and Rafanelli, M. (1999). Characterizations of hierarchies and some operators in OLAP environments..Proceedings of the 2nd ACM international workshop on Data warehousing and OLAP. Kansas City, Missouri. 54 – 59. • Shoshani, A. (1997) OLAP and statistical databases: Similarities and differences. Proceedings of the sixteenth ACM SIGACT-SIGMOD-SIGART symposium on Principles of database systems. Tucson, Arizona. 185 – 196. ACM Press New York, NY. 32