Critiques

Week 09
Critiques on Paper: An Analysis of Additivity in OLAP Systems
This week there were a number of excellent critique points that we should discuss.
Name: non-strict hierarchy
Discussion: The author specifies the definition of a strict hierarchy and argues that
inaccurate summarization can result if the hierarchy is non-strict, as in multiple or
alternate path hierarchies.
Location: 2.1 Classification Hierarchies, 1st, 2nd and 3rd paragraphs.
Significance: The summary can be inaccurate if summarizations from different paths
of the same hierarchy are merged. Several examples describe the anomalous
summaries caused by non-strict hierarchies, but the author does not provide any
solution to this problem.
Suggestions: The roll-up query of a non-strict hierarchy is affected only if the
lower-level values are counted multiple times in the query. In particular,
alternate/multiple path hierarchies can lead to inaccurate aggregation through
double counting.
A practical way to manage this problem is to run scripts at query time that check
how many times each measure is counted in the summary. In this way, the
duplicated values can be identified [1].
Reference:
[1] John Horner and Il-Yeol Song. A Taxonomy of Inaccurate Summaries and Their
Management in OLAP Systems.
Name: Multiple Path Hierarchies
Discussion: This section outlines the problems caused in summarization when
multiple path hierarchies are involved. It goes further to suggest that it is "important
to recognize which merges will result in duplicate values in summary data".
Location: Page 84, Section 2.1, Paragraph 3
Significance: An important issue has been identified, but the paper did not provide
any guidance on how to tackle this issue and avoid duplication in summary data.
Suggestion: In order to avoid duplication in summary data, [1] suggests running
scripts at query time to identify whether measures are counted in multiple groups or
multiple times in a single group. The suggested script uses COUNT to identify
such ambiguities and avoid duplications.
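A minimal sketch of such a query-time check (assuming a fact table loaded into a pandas DataFrame; the column names sale_id, region, and amount are illustrative, not from the paper):

    # Detect facts counted more than once when rolling up a non-strict hierarchy.
    import pandas as pd

    # Each row maps a fact (sale_id) to one parent in the roll-up level (region).
    # In a non-strict hierarchy the same fact can roll up along several paths.
    fact_to_parent = pd.DataFrame({
        "sale_id": [1, 2, 2, 3],                       # sale 2 rolls up along two paths
        "region":  ["East", "East", "West", "West"],
        "amount":  [100, 50, 50, 75],
    })

    # COUNT how many parent groups each fact participates in.
    counts = fact_to_parent.groupby("sale_id")["region"].nunique()
    duplicated = counts[counts > 1]

    if not duplicated.empty:
        print("Facts counted in multiple groups:", list(duplicated.index))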
Reference:
[1] John Horner and Il-Yeol Song. A Taxonomy of Inaccurate Summaries and Their
Management in OLAP Systems.
Name: Classification Hierarchies
Discussion: This section of the paper presents a general overview of classification
hierarchies, describing strict, non-strict, complete, incomplete, multiple, and
alternate path hierarchies.
Location: Page 84, Section 2.1 Classification Hierarchies
Significance: According to the paper, "classification hierarchies are of central concern
because the primary method of rolling-up and drilling-down data is along these
pre-defined hierarchies. Additionally, data warehouse design often involves
materializing aggregate views along these hierarchies." Therefore, a more detailed
view of these classification hierarchies would help.
Suggestion: A conceptual and more systematic view of these hierarchies could be
presented [1] as follows to help come up with more logical summarizations:
1. Simple hierarchies: hierarchies in which the relationships between members can
be represented as a tree. They can be further classified into:
- Symmetric: one path at the schema level where all levels are mandatory
- Asymmetric: only one path at the schema level, but some lower levels of the
hierarchy are not mandatory
- Generalized: can contain multiple exclusive paths sharing some levels; the
ragged hierarchy is a special case
2. Non-strict / strict hierarchies
3. Multiple hierarchies, which can be classified into:
- Multiple-inclusive: all members of the splitting level participate
simultaneously in parent levels belonging to different hierarchies
- Multiple-alternative: it is not semantically correct to traverse the different
composing hierarchies simultaneously
4. Parallel hierarchies: these result when a dimension has several associated
hierarchies. They can be classified into:
- Parallel independent: the different hierarchies do not share levels
- Parallel dependent: the different hierarchies share some levels
REFERENCE:
[1] Elzbieta Malinowski and Esteban Zimányi. OLAP Hierarchies: A Conceptual
Perspective.
Name: Cases of Incompleteness in hierarchies
Discussion: The author gives the definition of completeness in hierarchies:
"Completeness in hierarchy means that all members belong to one higher-class
object, which consists of those members only." This definition does not tell exactly
what an incomplete hierarchy looks like. The author specifies only one case of an
incomplete hierarchy, using the US states example. There should be other cases that
also represent incomplete hierarchies.
Location: 2.1 Classification Hierarchies, 5th paragraph.
Significance: Besides the definition of a complete hierarchy, it is instructive for us to
know the cases of incomplete hierarchies. It will help us handle the summarization
problems that result from incomplete hierarchies.
Suggestions: Incomplete hierarchies have three cases, namely
orphaned-incomplete, omitted-incomplete, and not-applicable-incomplete [1]; a
small detection sketch follows the list.
- Orphaned-incomplete means that lower-level data of the hierarchy are stored, but
they are not associated with parents. For example, a student belongs to a
particular department; orphaned-incomplete means that the student's record is
stored in the database, but the department to which he belongs is not.
- Omitted-incomplete means the records have not been stored in the database at
all. In the student-department example, an existing student with a department is
simply not recorded: no record exists in the database for that student or his
department.
- Not-applicable-incomplete means that the stored record does not have an
applicable parent. In the student-department example, a non-matriculated
student cannot be assigned to any department.
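A small illustrative check for the first and third cases, assuming a hypothetical students table in which an orphaned row carries an "UNKNOWN" department and a not-applicable row carries no department at all (the omitted-incomplete case cannot be detected this way, since its rows are simply missing):

    # Illustrative sketch only; column names and sentinel values are assumptions.
    import pandas as pd

    students = pd.DataFrame({
        "student_id": [1, 2, 3, 4],
        "department": ["CS", None, "UNKNOWN", "Math"],
    })

    orphaned       = students[students["department"] == "UNKNOWN"]  # parent exists but was not recorded
    not_applicable = students[students["department"].isna()]        # no valid parent exists

    print(len(orphaned), "orphaned rows;", len(not_applicable), "not-applicable rows")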
Reference:
[1] John Horner and Il-Yeol Song. A Taxonomy of Inaccurate Summaries and Their
Management in OLAP Systems.
Name: A New Case of Non-additive Measures
Discussion: The author introduces various non-additive facts in detail and describes
suggestions for dealing with these measures.
Location: 2.2 Non-additive measures
Significance: Fractions, measurements of intensity, average/max/min, measurements
of direction, and identification attributes are the five kinds of non-additive facts
mentioned in the paper. However, non-additive measures should not be limited
to these five types. A new case of non-additive measure can also be added to this
category.
Suggestions: This new non-additive case is measures with heterogeneous units.
Aggregating measures with heterogeneous units is meaningless and is most likely
misleading: incorrect summaries may result if measures with different units are
aggregated.
All measures have units, such as dollars, meters, or inches, and these units should be
stored in metadata. Additive operations should not be allowed on measures with
different units unless they can be converted into one common unit [1].
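A hedged sketch of how such a metadata check might look, assuming a simple measure-to-unit mapping and a conversion table (all names and factors here are illustrative, not taken from the paper):

    # Sketch: refuse to add measures whose units differ unless a conversion exists.
    UNIT_OF = {"price": "USD", "length": "m", "width": "in"}   # assumed metadata
    TO_METERS = {"m": 1.0, "in": 0.0254}                       # assumed conversions

    def add_measures(name_a, value_a, name_b, value_b):
        unit_a, unit_b = UNIT_OF[name_a], UNIT_OF[name_b]
        if unit_a == unit_b:
            return value_a + value_b
        if unit_a in TO_METERS and unit_b in TO_METERS:
            # Convert both values to a common unit (meters) before adding.
            return value_a * TO_METERS[unit_a] + value_b * TO_METERS[unit_b]
        raise ValueError(f"Cannot add {unit_a} to {unit_b}: no common unit defined")

    print(add_measures("length", 2.0, "width", 10.0))  # converted to meters, then added
    # add_measures("price", 5.0, "length", 2.0)        # would raise: incompatible units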
Reference:
[1] John Horner and Il-Yeol Song. A Taxonomy of Inaccurate Summaries and Their
Management in OLAP Systems.
Name: Summarizing Ratios
Discussion: How to summarize ratios
Location: 2.2.1 Ratios and Percentages and Figure 8: Identification and Suggestions
for dealing with Non-additive Attributes
Significance:
It is suggested that ratios be summarized by storing numerators and
denominators separately, then taking the ratio of the sums. This may be confusing
and error prone, and it needlessly adds difficulty.
Suggestion:
Taking the ratio of the sums yields a denominator-weighted average of the
individual ratios. In many cases it would be simpler to just apply the Average
aggregate operator to the stored ratios instead, accepting that the two summaries
differ when the denominators are unequal.
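A tiny worked example with made-up numbers shows how the two summaries differ:

    # Ratio of sums (the paper's approach) versus a plain average of the stored ratios.
    numerators   = [1, 30]     # e.g. defective items per store
    denominators = [2, 100]    # e.g. items inspected per store

    ratio_of_sums = sum(numerators) / sum(denominators)   # 31/102 ≈ 0.304 (weighted by denominator)
    average_ratio = sum(n / d for n, d in zip(numerators, denominators)) / len(numerators)  # (0.5 + 0.3)/2 = 0.4

    print(ratio_of_sums, average_ratio)   # the two summaries disagree when denominators are unequal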
Name: Enforcing Constraints
Discussion: This section talks about the non-additive nature of measures of
intensity, such as temperature and blood pressure. It explains why these measures
are non-additive and then discusses the problems these facts can cause if they are
added along any dimension.
Location: Section 2.2.2, Measure of Intensity
Significance: Towards the end, this section suggests that default aggregate
operators be set for all measures and made visible to the analyst performing queries.
Suggestion: In OLAP applications the data to be analyzed is substantial, and
providing default aggregate operators to an analyst may not help when there are
a lot of measures that fall into this category. I think this problem should be dealt
with by defining constraints on such attributes and keeping track of these
constraints in metadata files made available to the analyst, to ensure that such
measures are not added during summarization.
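As a hedged sketch, such a constraint could be a small metadata table of allowed aggregate operators per measure, consulted before a roll-up runs (the measure names and table layout are assumptions):

    # Metadata-driven constraint check before summarization.
    CONSTRAINTS = {
        "temperature":    {"avg", "min", "max"},          # intensity measure: never SUM
        "blood_pressure": {"avg", "min", "max"},
        "sales_amount":   {"sum", "avg", "min", "max"},
    }

    def check_aggregate(measure, operator):
        allowed = CONSTRAINTS.get(measure, set())
        if operator not in allowed:
            raise ValueError(f"{operator.upper()} is not permitted on measure '{measure}'")

    check_aggregate("sales_amount", "sum")   # fine
    # check_aggregate("temperature", "sum")  # would raise before an invalid roll-up runs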
Name: Non-numeric Data
Discussion: Dealing with all types of data when summarizing
Location: Throughout, touched briefly upon in 2.2.3 Other Non-additive Facts
Significance:
The paper discusses numerical data at length, describing which measures are
additive, which are non-additive, and how to deal with each, but does not seem
overly concerned with non-numerical data.
Suggestion:
When the data is non-numeric, other ways of working with it must be found. For
example, when working with biology data (http://biolap.sourceforge.net), new
ways of aggregating the data are needed. When looking at protein sequences, it is
possible to create numerical similarity scores in order to derive useful data from
summarization, or finding the most common traits shared between the sequences
may be a good way to aggregate them.
It must be noted that summarization is not the only way to aggregate information
with OLAP, and other types of aggregation like these can be applied even to
non-numeric data.
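A minimal sketch of the "most common trait" idea, assuming each sequence has already been annotated with a set of traits (the names and data are invented for illustration and are not from BioLAP):

    # Aggregate non-numeric data by finding the traits shared most often.
    from collections import Counter

    sequence_traits = {
        "seq1": {"hydrophobic", "helix"},
        "seq2": {"hydrophobic", "sheet"},
        "seq3": {"hydrophobic", "helix"},
    }

    trait_counts = Counter(t for traits in sequence_traits.values() for t in traits)
    print(trait_counts.most_common(2))   # e.g. [('hydrophobic', 3), ('helix', 2)]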
Name: Other Aggregate Operators
Discussion: Aggregate operators in a data warehouse besides summarization
Location: 2.2.3 Average/Maximum/Minimum and 2.5 Other Aggregate Operators
Significance:
The paper notes that other aggregate operators exist and are in use, and briefly
mentions Average, Minimum, and Maximum, but does not show that OLAP
allows for many other aggregate operators. Covering these would be useful, as it
would give a better understanding of what else can be done to attain more
information from a data warehouse.
Suggestion:
Oracle's OLAP User Guide offers a good list of aggregate operators and their functions.
Some other basic operators include First and Last Non-NA Data Value (which return
the first and last real data value, respectively). There are also Weighted versions of
the aggregate operators mentioned, which allow the inclusion of a weight factor
to multiply each value by, to stress the importance of different information. Scaled
summarization is also possible, allowing for the addition of a weighted amount to
each value before summarization.
Hierarchical and Hierarchical Weighted aggregate operators for Average, First, and
Last are also available, allowing specific hierarchies to be targeted.
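Purely as an illustration (this is not Oracle's API), the weighted, scaled, and first/last non-NA variants can be expressed in a few lines, with the weights and scale amount chosen arbitrarily:

    # Toy implementations of a few of the listed operators.
    values  = [10.0, None, 30.0, 40.0]    # None stands in for an NA cell
    weights = [1.0, 2.0, 0.5, 1.0]
    scale   = 5.0

    weighted_sum = sum(v * w for v, w in zip(values, weights) if v is not None)
    scaled_sum   = sum(v + scale for v in values if v is not None)   # add a fixed amount before summing
    first_non_na = next(v for v in values if v is not None)
    last_non_na  = next(v for v in reversed(values) if v is not None)

    print(weighted_sum, scaled_sum, first_non_na, last_non_na)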
Reference:
http://download.oracle.com/docs/cd/B28359_01/olap.111/b28124/aggregate.htm#OLAUG9281
Name: Imprecision
Discussion: This section talks about dirty data and its semi-additive nature. It
further tells us about the summarization issues related to such data and suggests
providing a measure of precision when rolling up data.
Location: Section 2.3.1, Dirty Data
Significance: The paper does not tell us explicitly what those measures of precision
might be or how they can be helpful when analyzing dirty data.
Suggestion: A possible approach to dealing with imprecise and uncertain data is
provided by [1]; a toy sketch of the allocation idea follows the list.
- Extension of the OLAP model to support uncertainty in measure values and
imprecision in dimensional values: this step generalizes the OLAP model to
represent ambiguity. The authors propose relaxing the restriction that dimension
attributes in a fact must be assigned leaf-level values from the underlying domain
hierarchy, in order to model imprecision.
- They then introduce criteria that guide the choice of semantics for aggregation
queries over ambiguous data (the consistency, faithfulness, and
correlation-preservation criteria).
- They then provide a possible-worlds interpretation of data ambiguity that leads
to a novel allocation-based approach to defining semantics for aggregation
queries, along with a careful study of the choices arising in the treatment of data
ambiguity, using the criteria mentioned in the second step.
- Finally, they show through a series of experiments the quality of this approach
for handling imprecision in data.
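As a toy sketch of the allocation idea (the weights and amounts below are invented, and this greatly simplifies the paper's model): one fact recorded only at the state level is split across its leaf cities according to pre-computed allocation weights, and the allocated shares are then aggregated like ordinary leaf-level facts.

    # One imprecise fact, recorded at "PA" (state level), allocated to leaf cities.
    allocation = {"Philadelphia": 0.7, "Pittsburgh": 0.3}     # assumed allocation weights
    imprecise_amount = 100.0

    leaf_facts = {"Philadelphia": 250.0, "Pittsburgh": 80.0}  # precise facts already at leaf level

    # Aggregate at the city level, adding each city's allocated share of the imprecise fact.
    city_totals = {
        city: leaf_facts[city] + imprecise_amount * weight
        for city, weight in allocation.items()
    }
    print(city_totals)   # {'Philadelphia': 320.0, 'Pittsburgh': 110.0}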
Reference:
[1] Doug Burdick, Prasad M. Deshpande, T. S. Jayram, Raghu Ramakrishnan, and
Shivakumar Vaithyanathan. OLAP Over Uncertain and Imprecise Data.
Name: Flagging Data
Discussion: Cleaning of data can be difficult because machines are unable to tell
what is valid data.
Location: 2.3.1 Dirty Data – last paragraph
Significance:
The provided example shows a customer with a zip code of 19050 being mislabeled
as 19005 when purchasing items. This data will be inaccurate, and the error will not
be detectable once a report at the state level or higher is generated. Could a flag be
used to record how accurate the machine believes the data to be?
Suggestion:
Data that is prone to user error, or to any other type of frequent error,
should have extra "cleaning" before it is stored in the database. The machine
should check this information for its validity, which would be highly dependent on
the data being captured, and if it is susceptible to errors then a flag should be stored
on that piece of data.
An example of extra checking could be the spatial relationship between the store's
zip code and the customer's. If a customer's address states that they are in city X,
which is 200 miles away from the store where the goods were purchased, then there
could be an issue: the customer could have moved without changing their
address. This information could trigger a flag.
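A hedged sketch of such a flag, assuming hypothetical centroid coordinates per zip code and an arbitrary distance threshold in miles:

    # Flag a purchase when the customer's zip code is implausibly far from the store's.
    import math

    ZIP_CENTROID = {"19050": (39.94, -75.26), "19005": (40.34, -75.05)}  # assumed centroids
    THRESHOLD_MILES = 100

    def miles_between(a, b):
        # Rough great-circle (haversine) distance, adequate for a sanity flag.
        lat1, lon1, lat2, lon2 = map(math.radians, (*a, *b))
        h = (math.sin((lat2 - lat1) / 2) ** 2
             + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
        return 3959 * 2 * math.asin(math.sqrt(h))

    def flag_record(customer_zip, store_zip):
        return miles_between(ZIP_CENTROID[customer_zip], ZIP_CENTROID[store_zip]) > THRESHOLD_MILES

    print(flag_record("19050", "19005"))   # True only when the distance exceeds the threshold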
Name: Extra non-additive measures
Discussion: The paper classifies temporally non-additive measures and categorically
non-additive measures as non-additive.
Location: 2.3.3 Temporally non-additive measures and 2.3.4 Categorically
non-additive measures
Significance:
These classifications are stated to be non-additive, although this conflicts with the
definition of non-additive.
Suggestion:
These classifications are semi-additive because they are "additive across some
dimensions", whereas non-additive measures are "not additive across any
dimension". Temporally non-additive measures are "not additive along the time
dimension", although they are additive across all other dimensions. Categorically
non-additive measures are not additive on the dimension that holds the different
types of information, although they are additive on the other dimensions (i.e.,
customer and store in the example provided in the paper).
Name: Methods to identify Inaccurate summaries
Discussion: This paper examines the effect of additivity on the accuracy and meaning
of summary data and provides methods for dealing with these attributes.
Location: 3. Data Dictionary and summary constraints, Figure 8 and Figure 9.
Significance: It would be helpful to see more suggestions for dealing with
inaccurate summaries that result from aggregating measures.
Suggestions: More suggestions are listed as follows [1]; a small alerting sketch
follows the list.
- Link all measures to their associated units. Measures can only be aggregated
with measures that have similar units.
- Follow the design documents wherever possible. Conforming dimensions is
important in reducing the likelihood of misinterpretation of aggregate summaries.
- Track the dimensions on which measures are non-additive. OLAP systems should
give alerts when queries try to summarize measures improperly.
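As a sketch of the third suggestion, the system could keep a small metadata map from each measure to the dimensions along which it is non-additive and consult it before a query collapses those dimensions (all names below are assumptions):

    # Warn when a query sums a measure along a dimension where it is non-additive.
    NON_ADDITIVE_ON = {
        "account_balance": {"time"},                        # temporally non-additive
        "temperature":     {"time", "store", "product"},    # non-additive on every dimension
    }

    def warn_if_improper(measure, summed_over_dimensions):
        bad = NON_ADDITIVE_ON.get(measure, set()) & set(summed_over_dimensions)
        if bad:
            print(f"Warning: summing '{measure}' along {sorted(bad)} may produce a meaningless total")

    warn_if_improper("account_balance", ["time"])   # triggers the alert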
Reference:
[1] John Horner and Il-Yeol Song. A Taxonomy of Inaccurate Summaries and Their
Management in OLAP Systems.
Name: Business Rules
Discussion: The table presents an approach to identifying the various additive and
semi-additive measures; it calls for reviewing business rules to do so.
Location: Figure 8, Page 90
Significance: Business rules identify the way a data entry should be perceived
in a business environment. This paper identifies various ambiguities caused when
mergers or unions of businesses take place. What will happen when data from
two different companies with different rules on the same data is combined? The
paper does not tell us which business rule should be followed in such scenarios.
Suggestion: Using business rules may cause problems when data from two different
companies is combined due to a merger. Therefore, I think that if we want to use
business rules to ensure quality summarization, these business rules should evolve
over time; otherwise they can cause major conflicts while summarizing. I also think
the metadata regarding the business rules should reflect the changes to these
business rules, to help avoid such conflicts.
There are many other types of comparisons and relationships that we might want to
analyze in our data warehouses beyond additivity.
For Thursday by 9:00 am, I would like each of you to identify and describe at least 5
(and preferably as many more as possible) types of measures not already discussed
that could be identified automatically from a data warehouse if suitable metadata
were available to aid in this identification, and to explain how this could be done
and how they could be used.