Week 09 Critiques on Paper: An Analysis of Additivity in OLAP Systems

This week there were a number of excellent critique points that we should discuss.

Name: Non-strict Hierarchies
Discussion: The author defines strict hierarchies and observes that inaccurate summarization can result when a hierarchy is non-strict, as with multiple or alternate path hierarchies.
Location: 2.1 Classification Hierarchies, 1st, 2nd and 3rd paragraphs.
Significance: A summary can be inaccurate if summarizations from different paths of the same hierarchy are merged. Several examples illustrate the anomalous summaries caused by non-strict hierarchies, but the author does not provide a solution to this problem.
Suggestions: A roll-up query over a non-strict hierarchy is affected only if lower-level values are counted multiple times in the query; alternate/multiple path hierarchies in particular can lead to inaccurate aggregation through such double counting. A practical way to manage this problem is to run scripts at query time that check how many times each measure is counted in the summary, so that duplicated values can be identified [1].
Reference: [1] John Horner and Il-Yeol Song. A Taxonomy of Inaccurate Summaries and Their Management in OLAP Systems.

Name: Multiple Path Hierarchies
Discussion: This section outlines the problems caused in summarization when multiple path hierarchies are involved. It goes further to suggest that it is "important to recognize which merges will result in duplicate values in summary data."
Location: Page 84, Section 2.1, Paragraph 3
Significance: An important issue has been identified, but the paper does not provide any guidance on how to tackle it to avoid duplication in summary data.
Suggestion: In order to avoid duplication in summary data, [1] suggests running scripts at query time to identify whether measures are counted in multiple groups or multiple times in a single group.
The script suggests the use of COUNT to identify such ambiguities and avoid duplication.
Reference: [1] John Horner and Il-Yeol Song. A Taxonomy of Inaccurate Summaries and Their Management in OLAP Systems.

Name: Classification Hierarchies
Discussion: In this section the paper presents a general overview of the classification of hierarchies, describing strict, non-strict, complete, incomplete, multiple, and alternate path hierarchies.
Location: Page 84, Section 2.1 Classification Hierarchies
Significance: According to the paper, "classification hierarchies are of central concern because the primary method of rolling-up and drilling-down data is along these pre-defined hierarchies. Additionally, data warehouse design often involves materializing aggregate views along these hierarchies." Therefore, a more detailed view of these classification hierarchies would help.
Suggestion: A conceptual and more systematic view of these hierarchies could be presented [1] as follows, to help arrive at more logical summarizations:
1. Simple hierarchies: hierarchies where the relationships between members can be represented as a tree. They can be further classified into:
   - Symmetric: one path at the schema level, where all levels are mandatory
   - Asymmetric: only one path at the schema level, but some lower levels of the hierarchy are not mandatory
   - Generalized: can contain multiple exclusive paths sharing some levels (ragged hierarchies are a special case)
2. Strict / non-strict hierarchies
3. Multiple hierarchies, which can be classified into:
   - Multiple-inclusive: all members of the splitting level participate simultaneously in parent levels belonging to different hierarchies
   - Multiple-alternative: it is not semantically correct to simultaneously traverse the different composing hierarchies
4. Parallel hierarchies: these result when a dimension has several associated hierarchies.
   They can be classified into:
   - Parallel independent: the different hierarchies do not share levels
   - Parallel dependent: the different hierarchies share some levels
Reference: [1] Elzbieta Malinowski and Esteban Zimányi. OLAP Hierarchies: A Conceptual Perspective.

Name: Cases of Incompleteness in Hierarchies
Discussion: The author gives a definition of completeness in hierarchies: "Completeness in hierarchy means that all members belong to one higher-class object, which consists of those members only." This definition does not say exactly what an incomplete hierarchy looks like. The author gives only one case of an incomplete hierarchy, using the US states example, but there are other cases that also represent incomplete hierarchies.
Location: 2.1 Classification Hierarchies, 5th paragraph.
Significance: Beyond the definition of a complete hierarchy, it is instructive to know the cases of incomplete hierarchies. This will help us handle the summarization problems that result from them.
Suggestions: Incomplete hierarchies fall into three cases: Orphaned-incomplete, Omitted-incomplete, and Not Applicable-incomplete [1]. Orphaned-incomplete means that lower-level data in the hierarchy are stored but not associated with parents. For example, suppose each student belongs to a particular department; Orphaned-incomplete means that a student's record is stored in the database, but the department to which the student belongs is not. Omitted-incomplete means the records have not been stored in the database at all: in the student-department example, an existing student with a department is simply missing from the database. Not Applicable-incomplete means that a stored record has no applicable parent: in the student-department example, a non-matriculated student cannot be assigned to any department.
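As a minimal sketch of how these three cases could be told apart, consider a hypothetical student/department table (the data and field layout here are illustrative assumptions, not taken from the paper):

```python
# Hypothetical student/department data, for illustration only.
departments = {"CS", "EE"}          # known parent members
students = {                        # stored child records: student -> department
    "alice": "CS",    # complete: stored with a valid parent
    "bob": None,      # Orphaned-incomplete: stored, but parent missing
    "carol": "N/A",   # Not Applicable-incomplete: no parent can apply
}
roster = ["alice", "bob", "carol", "dave"]  # ground truth; "dave" was never stored

def classify(student):
    """Classify a child record against the three incompleteness cases."""
    if student not in students:
        return "Omitted-incomplete"         # record never stored at all
    dept = students[student]
    if dept is None:
        return "Orphaned-incomplete"        # stored without its parent
    if dept == "N/A":
        return "Not Applicable-incomplete"  # no applicable parent exists
    return "complete"

for s in roster:
    print(s, classify(s))
```

A query-time check like this only distinguishes Orphaned from Not Applicable if the schema records *why* the parent is absent (here via the `None` vs `"N/A"` sentinel), which is itself a metadata design decision.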
(The figure illustrating these three cases, cited from [1], is not reproduced here.)
Reference: [1] John Horner and Il-Yeol Song. A Taxonomy of Inaccurate Summaries and Their Management in OLAP Systems.

Name: A New Case of Non-additive Measures
Discussion: The author introduces various non-additive facts in detail and offers suggestions for dealing with these measures.
Location: 2.2 Non-additive Measures
Significance: Fractions, measurements of intensity, average/max/min, measurements of direction, and identification attributes are the five kinds of non-additive facts mentioned in the paper. However, non-additive measures are not limited to these five types; a new case can be added to this category.
Suggestions: This new non-additive case concerns heterogeneous measurement units. Aggregating measures with heterogeneous units is meaningless and most likely misleading: an incorrect summary may result if measures with different units are aggregated. All measures have units, such as dollars, meters, or inches, and these units must be stored in metadata. Additive operations should not be allowed on measures with different units unless they can be converted into one common unit [1].
Reference: [1] John Horner and Il-Yeol Song. A Taxonomy of Inaccurate Summaries and Their Management in OLAP Systems.

Name: Summarizing Ratios
Discussion: How to summarize ratios.
Location: 2.2.1 Ratios and Percentages, and Figure 8: Identification and Suggestions for Dealing with Non-additive Attributes
Significance: It is suggested that ratios be summarized by storing numerators and denominators separately, then taking the ratio of the sums. This may be confusing and error prone, and needlessly adds difficulty.
Suggestion: Note that this operation yields a weighted average of the ratios, weighted by the denominators; it equals the plain average of the ratios only when the denominators are equal. In that case it would be much simpler to just use the Average aggregate operator on the ratios instead; otherwise, the intended semantics should determine which computation is used.
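The difference between the two ways of summarizing ratios can be seen with made-up numbers (these figures are illustrative assumptions, not data from the paper):

```python
# Two stores' return ratios (returns / sales); made-up numbers.
numerators = [10, 1]     # returned items per store
denominators = [100, 2]  # items sold per store

# Ratio of the sums: the overall ratio, implicitly weighted by denominators.
ratio_of_sums = sum(numerators) / sum(denominators)   # 11/102, about 0.108

# Plain average of the per-store ratios: each store counts equally.
ratios = [n / d for n, d in zip(numerators, denominators)]
average_of_ratios = sum(ratios) / len(ratios)         # (0.10 + 0.50) / 2 = 0.30

# The two agree only when all denominators are equal.
print(ratio_of_sums, average_of_ratios)
```

Here the tiny store (2 sales, 1 return) dominates the plain average but barely moves the ratio of sums, which is why the choice between the two is a semantic decision, not merely a convenience.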
Name: Enforcing Constraints
Discussion: This section discusses the non-additive nature of measures of intensity, such as temperature and blood pressure. It explains why these measures are non-additive and then describes the problems these facts can cause if they are added along any dimension.
Location: Section 2.2.2, Measures of Intensity
Significance: Towards the end, this section suggests that default aggregate operators for measures be set and made visible to the analyst performing queries.
Suggestion: In OLAP applications the data to be analyzed is substantial, and providing default aggregate operators to an analyst may not help when many measures fall into this category. I think this problem should be dealt with by defining constraints on such attributes and keeping track of those constraints in metadata files made available to the analyst, to ensure such measures are not added during summarization.

Name: Non-numeric Data
Discussion: Dealing with all types of data when summarizing.
Location: Throughout; touched on briefly in 2.2.3 Other Non-additive Facts
Significance: The paper goes on at length about numerical data, describing which measures are additive, which are non-additive, and how to deal with each, but does not seem overly concerned with non-numerical data.
Suggestion: When the data is non-numeric, other ways of working with it must be found. For example, when working with biology data ( http://biolap.sourceforge.net ), new ways of aggregating the data are needed. When looking at protein sequences, it is possible to create numerical similarity scores in order to produce useful data from summarization, or finding the most common traits shared between the sequences may be a good way to aggregate them. It must be noted that summarization is not the only way to aggregate information with OLAP, and that other types of aggregation like these can be applied even to non-numeric data.
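One simple non-numeric aggregate of the kind suggested above is "most common trait": collapse a group of categorical values to the value that occurs most often. A minimal sketch (the residue data is a hypothetical example, not from BioLAP or the paper):

```python
from collections import Counter

# Hypothetical categorical fact data: the amino acid observed at one
# alignment position across several protein sequences.
position_residues = ["A", "G", "A", "A", "T", "G", "A"]

def most_common_trait(values):
    """Aggregate non-numeric values by taking the most frequent one."""
    return Counter(values).most_common(1)[0][0]

consensus = most_common_trait(position_residues)
print(consensus)  # "A" occurs most often at this position
```

An OLAP engine could expose an operator like this alongside SUM and AVG for dimensions or measures whose domain is categorical rather than numeric.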
Name: Other Aggregate Operators
Discussion: Aggregate operators in a data warehouse besides summarization.
Location: 2.2.3 Average/Maximum/Minimum and 2.5 Other Aggregate Operators
Significance: The paper briefly mentions that other aggregate operators exist and are in use, naming Average, Minimum, and Maximum, but does not show that OLAP allows for many other aggregate operators. Showing these would allow for a better understanding of what else can be done to attain more information from a data warehouse.
Suggestion: Oracle's OLAP User's Guide offers a nice list of aggregate operators and their functions. Some other basic operators include First and Last Non-NA Data Value (which return the first and last real data values, respectively). There are also weighted versions of the aggregate operators mentioned, which allow for the inclusion of a weight factor to multiply each value by, to stress the importance of different information. Scaled summation is also possible, allowing a weighted amount to be added to each value before summing. Hierarchical and hierarchical weighted aggregate operators for Average, First, and Last are also available, allowing specific hierarchies to be targeted.
Reference: http://download.oracle.com/docs/cd/B28359_01/olap.111/b28124/aggregate.htm#OLAUG9281

Name: Imprecision
Discussion: This section discusses dirty data and its semi-additive nature. It further describes the summarization issues related to such data and suggests providing a measure of precision when rolling up data.
Location: Section 2.3.1, Dirty Data
Significance: The paper does not tell us explicitly what those measures of precision can be or how they can be helpful when analyzing dirty data.
Suggestion: A possible approach to dealing with imprecise and uncertain data is provided by [1]:
1. Extend the OLAP model to support uncertainty in measure values and imprecision in dimension values. This generalizes the OLAP model to represent ambiguity: the authors propose relaxing the restriction that dimension attributes in a fact must be assigned leaf-level values from the underlying domain hierarchy, in order to model imprecision.
2. Introduce criteria that guide the choice of semantics for aggregation queries over ambiguous data (consistency, faithfulness, and correlation-preservation).
3. Provide a possible-worlds interpretation of data ambiguity, which leads to a novel allocation-based approach to defining semantics for aggregation queries, and a careful study of the choices arising in the treatment of data ambiguity, using the criteria from step 2.
4. Show, through a series of experiments, the quality of this approach in handling imprecision in data.
Reference: [1] Doug Burdick, Prasad M. Deshpande, T. S. Jayram, Raghu Ramakrishnan, and Shivakumar Vaithyanathan. OLAP Over Uncertain and Imprecise Data.

Name: Flagging Data
Discussion: Cleaning data can be difficult because machines are unable to tell what is valid data.
Location: 2.3.1 Dirty Data, last paragraph
Significance: The provided example shows a customer with a zip code of 19050 being mislabeled as 19005 when purchasing items. This data will be inaccurate, and the error will not be detectable once a report at the state level or higher is generated. Could a flag be used to record how accurate the machine thinks the data is?
Suggestion: Data that is prone to user error, or to any other type of frequent error, should have extra "cleaning" before it is stored in the database.
The machine should check this information for validity in a way that depends on the data being captured, and if the data is susceptible to errors then a flag should be stored on that piece of data. An example of extra checking could be the spatial relationship between the store's zip code and the customer's. If a customer's address states that they are in city X, which is 200 miles away from the store where the goods were purchased, then there could be an issue: the customer may have moved without changing their address. This situation could trigger a flag.

Name: Extra Non-additive Measures
Discussion: Temporally non-additive measures and categorically non-additive measures are classified as non-additive.
Location: 2.3.3 Temporally Non-additive Measures and 2.3.4 Categorically Non-additive Measures
Significance: These classifications are stated to be non-additive, although this conflicts with the definition of non-additive.
Suggestion: These classifications are semi-additive, because they are "additive across some dimensions," whereas non-additive measures are "not additive across any dimension." Temporally non-additive measures are "not additive along the time dimensions," although they are additive across all other dimensions. Categorically non-additive measures are not additive on "the dimension that holds the different types of information," although they are additive on other dimensions (i.e., customer and store in the example provided in the paper).

Name: Methods to Identify Inaccurate Summaries
Discussion: The paper examines the effect of additivity on the accuracy and meaning of summary data, and provides methods for dealing with these attributes.
Location: 3. Data Dictionary and Summary Constraints, Figure 8 and Figure 9.
Significance: It is helpful to look ahead to further suggestions for dealing with inaccurate summaries that result from aggregating measures.
Suggestions: More suggestions are listed as follows [1].
- Link all measures to their associated units. Measures can only be aggregated with measures that have similar units.
- Follow the design documents wherever possible. Conforming dimensions is important in reducing the likelihood of misinterpretation of aggregate summaries.
- Track the dimensions on which measures are non-additive. OLAP systems should raise alerts when queries try to summarize measures improperly.
Reference: [1] John Horner and Il-Yeol Song. A Taxonomy of Inaccurate Summaries and Their Management in OLAP Systems.

Name: Business Rules
Discussion: The table presents an approach to identifying the various additive and semi-additive measures; it calls for reviewing business rules to do so.
Location: Figure 8, Page 90
Significance: Business rules identify the way a data entry should be perceived in a business environment. The paper identifies various ambiguities caused when mergers or unions of businesses take place. What happens when data from two different companies with different rules for the same data is combined? The paper does not tell us which business rule should be followed in such scenarios.
Suggestion: Using business rules may cause problems when data from two different companies is combined due to a merger. Therefore, I think that if we want to use business rules to ensure quality summarization, these business rules should evolve over time; otherwise they can cause major conflicts during summarization. The metadata regarding business rules should also reflect changes to those rules, to help avoid such conflicts.
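The metadata-driven safeguards discussed in these critiques (unit tracking, non-additive dimensions, alerts on improper summarization) could be sketched as follows; the measure names and catalog layout are hypothetical, not from the paper:

```python
# Hypothetical metadata catalog for measures; illustration only.
MEASURE_METADATA = {
    "sales_amount":    {"unit": "USD",     "non_additive_dims": set()},
    "inventory_level": {"unit": "items",   "non_additive_dims": {"time"}},
    "temperature":     {"unit": "celsius", "non_additive_dims": {"time", "store", "product"}},
}

def check_summation(measures, group_by_dims):
    """Return a list of alert strings if a SUM over these measures is improper."""
    alerts = []
    # Rule 1: measures with heterogeneous units must not be summed together.
    units = {MEASURE_METADATA[m]["unit"] for m in measures}
    if len(units) > 1:
        alerts.append(f"heterogeneous units {sorted(units)} cannot be summed together")
    # Rule 2: summing collapses every dimension NOT in GROUP BY, so any
    # non-additive dimension that is being collapsed triggers an alert.
    for m in measures:
        collapsed = MEASURE_METADATA[m]["non_additive_dims"] - set(group_by_dims)
        if collapsed:
            alerts.append(f"{m} is non-additive over {sorted(collapsed)}")
    return alerts

# Summing inventory while collapsing time should raise an alert:
print(check_summation(["inventory_level"], group_by_dims=["store"]))
# Summing sales by store is fine (empty alert list):
print(check_summation(["sales_amount"], group_by_dims=["store"]))
```

An OLAP front end could run such a check before executing a roll-up and surface the alerts to the analyst, which is essentially the behavior [1] recommends.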
There are many other types of comparisons and relationships that we might want to analyze in our data warehouses beyond additivity. For Thursday by 9:00 am, I would like each of you to identify and describe at least 5 (and preferably more) types of measures, not already discussed, that could be identified automatically from a data warehouse if suitable metadata were available to aid in this identification, and to explain how this could be done and how they could be used.