Supporting Imprecision in Multidimen

advertisement
Supporting Imprecision in
Multidimensional Databases
Using Granularities
T. B. Pedersen1,2, C. S. Jensen2, and C. E. Dyreson2
1 Center
for Health Information Services,
Kommunedata, www.kmd.dk
2 Nykredit Center for Database Research,
Department of Computer Science,
Aalborg University, www.cs.auc.dk/NDB
11th SSDBM, Cleveland, Ohio, July 28-30, 1999
Talk Overview
•
•
•
•
•
•
•
•
•
Motivation
Data model and query language context
Handling imprecision
Alternative queries
Imprecision in grouping
Imprecision in computations
Presenting imprecise results
Using pre-aggregated data
Conclusion
11th SSDBM, Cleveland, Ohio, July 28-30, 1999
2
Motivation
• Online Analytical Processing (OLAP) tools are increasingly
used in many different areas:



Business applications
Medical applications
Other scientific applications
• Data “imperfection” is a problem for medical and other
applications.


Some data is missing.
Some data has varying degrees of imprecision.
• Current OLAP tools assume that data imperfections are
handled during the “data cleansing” process.



Not realistic for most cases
Introduces mapping errors
Hides the “true quality” of the data
11th SSDBM, Cleveland, Ohio, July 28-30, 1999
3
Previous Work
• Imprecision versus uncertainty




Imprecision is a property of the content of an attribute.
Uncertainty concerns the degree of truth of a statement.
We handle only imprecision.
Our focus is on imprecision in aggregate queries.
• Most work on imprecision deals with relational databases




Fuzzy sets - specifies a degree of set membership for a value
Partial values - one of a set of values is the true value
Multiple imputation - substitute multiple values for a missing value
High computational complexity
• Only “incomplete datacubes” have treated imprecision in
multidimensional databases.


Imprecision fixed at schema level
Imprecision in computations not handled
11th SSDBM, Cleveland, Ohio, July 28-30, 1999
4
A Diabetes Case Study
•
•
•
•
E/R schema of case study
Patients have a diagnosis.
Diagnoses may be missing.
Diagnoses may be specified at
the Low-level Diagnosis or
Diagnosis Group level.
• HbA1c% (long-term blood sugar
level) measured with several
methods of varying precision.
11th SSDBM, Cleveland, Ohio, July 28-30, 1999
Patient
Has
(0,1)
Diagnosis
(0,n)
* Name
* SSN
* HbA1c%
* Precision
* Code
* Text
D
Low-level
Diagnosis
Diagnosis
Is part of
(1,1)
(1,n)
Diagnosis
Family
5
Data Model - Schema
• Fact type: Patient
• Dimension types: Diagnosis
and HbA1c%
• There are no “measures”, all
data are dimensions.
• Category types: Low-level
Diagnosis, Diagnosis Group,
Precise, Imprecise
• Top category types:
corresponds to ALL of the
dimension.
• Bottom category types: the
lowest level in each dimension
• The category types of a
dimension type form a lattice.
• Category types ~ granularities
11th SSDBM, Cleveland, Ohio, July 28-30, 1999
Diagnosis
Dimension
Type
HbA1c%
Dimension
Type
TDiagnosis
THbA1c%
Diagnosis Group
LL Diagnosis
Imprecise
Precise
Patient
6
Data Model - Instances
• Categories: instances of
category types, consist of
dimension values.
• Top categories contain only one
“T” value.
• Dimensions = categories +
partial order on category values
• Facts: instances of fact type
with separate identity.
• Fact-dimension relations: links
facts to dimensions, may map
to values of any granularity.
• Multidimensional object (MO) =
schema + dimension + facts +
fact-dimension relations
Diagnosis
Dimension
HbA1c%
Dimension
T
T
6
7
ID
NID
Diabetes Diabetes 5.4 5.5 .. 7.0
Jim
11th SSDBM, Cleveland, Ohio, July 28-30, 1999
5
Diabetes
John
Jane
7
Algebraic Query Language
• Close to relational algebra with
aggregation functions

Selection, projection, rename,
union, difference, identitybased join operators
• Aggregation operator: takes
“grouping categories”,
aggregation function and result
dimension as arguments.


Groups together facts
characterized by the same
dimension values, applies
aggregation function to groups,
and places result in result
dimension.
Example: COUNT by Diagnosis
Family and THbA1c%
11th SSDBM, Cleveland, Ohio, July 28-30, 1999
Diagnosis
Dimension
Result
Dimension
T
T
Diabetes
0-2
012
>2
3 ..
{Jim,John,Jane}
8
Overview
•
•
•
•
•
•
•
•
•
Motivation
Data model and query language context
Handling imprecision
Alternative queries
Imprecision in grouping
Imprecision in computations
Presenting imprecise results
Using pre-aggregated data
Conclusion
11th SSDBM, Cleveland, Ohio, July 28-30, 1999
9
Handling Imprecision
• We use the granularity of the
data to capture imprecision.
• The dimensions specify an
imprecision hierarchy.
• Data is mapped to the
appropriate granularity.
• Pseudo-code:

Procedure EvalImprecise(Q,M)
if PreciseEnough(Q,M) then Eval(Q,M)
else Q’=Alternative(Q,M)
if Q’ is ok then Eval(Q’,M)
else
Handle Imprecision in Grouping
Handle Imprecision in Computation
Return Imprecise Result
end if
end if
11th SSDBM, Cleveland, Ohio, July 28-30, 1999
Most Imprecise=T=Unknown
More Imprecise
.
.
.
More precise
Most precise=bottom category
10
Alternative Queries
• We use a “Precision MO” (PMO)
to capture the data granularities.
• The precision MO has a
granularity dimension for every
MO dimension D.
• A granularity dimension has two
categories: T and GranularityD.
• GranularityD contains a value for
every category type in D.
• The set of facts stays the same.
• Facts are mapped to the
appropriate GranularityD value.
T
T
LLD
DG
Jim
11th SSDBM, Cleveland, Ohio, July 28-30, 1999
GranHbA1c%
Dimension
GranDiagnosis
Dimension
TD
John
P
I
TH
Jane
11
Alternative Queries
• Queries are rewritten into
“testing queries” on the PMO
that counts the number of facts
mapped to different granularity
combinations (details in paper).
• The result of the testing queries
can be used to suggest
alternative queries.
• Example: COUNT by Low-level
Diagnosis (and THbA1c%)) on MO



Testing query: COUNT by
GranularityDiagnosis on PMO
Result shows that data is not
precise enough to group on
Low-level Diagnosis
Alternative query: COUNT by
Diagnosis Group
11th SSDBM, Cleveland, Ohio, July 28-30, 1999
Result
Dimension
GranDiagnosis
Dimension
T
T
LLD
DG
{Jim}
TD
0
1
2
{John,Jane}
12
Overview
•
•
•
•
•
•
•
•
•
Motivation
Data model and query language context
Handling imprecision
Alternative queries
Imprecision in grouping
Imprecision in computations
Presenting imprecise results
Using pre-aggregated data
Conclusion
11th SSDBM, Cleveland, Ohio, July 28-30, 1999
13
Imprecision in Grouping
• If facts are mapped to a dimension value of coarser
granularity than the grouping category, with which
dimension value should they be grouped ? - we do not
know !
• So, we return several answers, based on different
groupings of facts:



Conservative grouping: only include in a group what is known to
belong to it - discard imprecise data.
Liberal grouping: include in a group everything that might belong to
it - facts may be in several groups.
Weighted grouping: include everything that might belong, but give
more weight to likely members.
11th SSDBM, Cleveland, Ohio, July 28-30, 1999
14
Conservative Grouping
• Corresponds to the standard
aggregation operator.
• Imprecise data are simply
discarded.
• Example: COUNT by Low-level
Diagnosis (and THbA1c%)) on MO



Jim does not show in any group.
The count for both groups is 1.
The result is “too conservative”
as not all data is accounted for
in the result.
Diagnosis
Dimension
Result
Dimension
T
T
Diabetes
ID
NID
Diabetes Diabetes 0
{John}
11th SSDBM, Cleveland, Ohio, July 28-30, 1999
1
2
{Jane}
15
Liberal Grouping
• We modify the aggregation
operator from the query
language to compute the liberal
grouping (formal semantics in
the paper).
• The computed groups may now
overlap, i.e., the same fact may
be in several groups.
• Example: COUNT by Low-level
Diagnosis (and THbA1c%)) on MO



Jim ends up up both groups.
The count for both groups is 2.
The result is “too liberal” as the
same data may be counted
several times in the result.
Diagnosis
Dimension
T
T
Diabetes
ID
NID
Diabetes Diabetes 0
{Jim,John}
11th SSDBM, Cleveland, Ohio, July 28-30, 1999
Result
Dimension
1
2
{Jim,Jane}
16
Weighted Grouping
• Compromise between the
conservative and liberal
groupings:


Weights assigned to the partial
order on dimension values and
to fact membership in groups
(groups become fuzzy sets).
Aggregation operator modified
to compute the weighted
grouping (see paper).
• Example: COUNT by Low-level
Diagnosis (and THbA1c%)) on MO



80% of Diabetes patient have
ID Diabetes, 20% NID.
Jim ends up up both groups,
but weighted differently.
Result is a weighted COUNT
(details on next slide).
11th SSDBM, Cleveland, Ohio, July 28-30, 1999
Diagnosis
Dimension
Result
Dimension
T
T
1.0
Diabetes
.8
.2
ID
NID
Diabetes Diabetes 0 1.2 1.8
{Jim.8,John1.0}
{Jim.2,Jane1.0}
17
Imprecision in Computation
• We also need to handle imprecision in the aggregate
computation itself, e.g., the varying precision for HbA1c%.
• In computations, we impute the expected value for values
of non-bottom granularity (generalized imputation). This
allows normal computation of the result.
• A precision computation is performed along with the
aggregate computation:




A granularity computation measure (GCM) is used to capture the
imprecision of a dimension value during computation.
A measure combination function (MCF) is used to combine GCMs.
The MCF must be distributive.
A final granularity measure (FGM) represent the “true” imprecision
of a result value.
A final granularity function (FGF) maps from a GCM to a FGM.
11th SSDBM, Cleveland, Ohio, July 28-30, 1999
18
Imprecision in Computation
• Example: WAVG(HbA1c%) by
Low-Level Diagnosis







We use FGM=WAVG(Level) as
a measure of precision (Precise
values have level 0, etc.)
The GCM for a value e with
weight w is: (w * Level(e),w)
The MCF is: h((n1, n2),(n3,n4)) =
(n1+n3,n2+n4)
The FGF is: f(n1,n2) = n1/n2
6.0 & 7.0 imputed for T & 7 in
WAVG(HbA1c%) computation
Jim, John, and Jane have
levels 2, 0, and 1, respectively
Results of query: {(IDD,5.7,.9),
(NIDD,5.6,1.2)}
T 6.0
T
2
1.0
5 6 7 7.0
Diabetes
.8
.2
ID
NID
Diabetes Diabetes 5.4 5.5 .. 7.0
0
1
Jim
11th SSDBM, Cleveland, Ohio, July 28-30, 1999
HbA1c%
Dimension
Diagnosis
Dimension
John
Jane
19
Presenting Imprecise Results
• Several ways to present the
imprecise results to the user:
• Show result values along with
the corresponding final
granularity measure values.


Very precise estimate of result
precision, but hard to grasp.
Resulting (value,FGM) =
{(IDD,5.7,.9),(NIDD,5.6,1.2)}
• Map result values into different
granularities using a value
coarsening function (VCF).


More intuitive result, but less
precise estimate of precision.
Example: Weighted grouping
with VCF: r(x) = v such that xv
and Level(v) = Ceiling(x).
11th SSDBM, Cleveland, Ohio, July 28-30, 1999
Diagnosis
Dimension
Result
Dimension
T
T
Diabetes
5
6
7
ID
NID
Diabetes Diabetes 5.4 5.5 .. 7.0
{Jim.8,John1.0}
{Jim.2,Jane1.0}
20
Using Pre-Aggregated Data
• The approach can utilize pre-aggregated data effectively.
• Aggregate results for the precision MO can usually be fully
materialized due to the relatively small multidimensional
space (~ 1.000.000 cells).
• The aggregate computation with expected values can use
standard pre-aggregation techniques, e.g., partial preaggregation, for good response-time versus
storage/update-time tradeoffs.
• The distributive MCF allows for partial pre-aggregation of
precision results.
11th SSDBM, Cleveland, Ohio, July 28-30, 1999
21
Conclusion
• Current OLAP tools and models do not support the
imprecision found in real-world data.
• We have shown an approach to handling imprecision in
OLAP databases based on the common multidimensional
concept of granularities.
• The approach can suggest alternative queries when data is
not precise enough and handles imprecision both in the
grouping of data and in the aggregate computation.
• The approach has a low computational overhead and is able
to exploit pre-aggregated results effectively.
• The approach can be implemented using existing technology
such as SQL and OLAP tools.
• Future work: presentation of results, MIN/MAX functions, etc.
11th SSDBM, Cleveland, Ohio, July 28-30, 1999
22
Download