Supporting Imprecision in Multidimensional Databases Using Granularities

T. B. Pedersen (1,2), C. S. Jensen (2), and C. E. Dyreson (2)
(1) Center for Health Information Services, Kommunedata, www.kmd.dk
(2) Nykredit Center for Database Research, Department of Computer Science, Aalborg University, www.cs.auc.dk/NDB

11th SSDBM, Cleveland, Ohio, July 28-30, 1999

Talk Overview
• Motivation
• Data model and query language context
• Handling imprecision
• Alternative queries
• Imprecision in grouping
• Imprecision in computations
• Presenting imprecise results
• Using pre-aggregated data
• Conclusion

Motivation
• Online Analytical Processing (OLAP) tools are increasingly used in many different areas:
  - Business applications
  - Medical applications
  - Other scientific applications
• Data “imperfection” is a problem for medical and other applications.
  - Some data is missing.
  - Some data has varying degrees of imprecision.
• Current OLAP tools assume that data imperfections are handled during the “data cleansing” process.
  - Not realistic for most cases
  - Introduces mapping errors
  - Hides the “true quality” of the data

Previous Work
• Imprecision versus uncertainty
  - Imprecision is a property of the content of an attribute.
  - Uncertainty concerns the degree of truth of a statement.
  - We handle only imprecision.
  - Our focus is on imprecision in aggregate queries.
• Most work on imprecision deals with relational databases
  - Fuzzy sets - specifies a degree of set membership for a value
  - Partial values - one of a set of values is the true value
  - Multiple imputation - substitute multiple values for a missing value
  - High computational complexity
• Only “incomplete datacubes” have treated imprecision in multidimensional databases.
  - Imprecision fixed at schema level
  - Imprecision in computations not handled

A Diabetes Case Study
E/R schema of case study (figure)
• Patients have a diagnosis.
• Diagnoses may be missing.
• Diagnoses may be specified at the Low-level Diagnosis or Diagnosis Group level.
• HbA1c% (long-term blood sugar level) is measured with several methods of varying precision.
[Figure: E/R schema. Entities: Patient (Name, SSN, HbA1c%, Precision), Diagnosis (Code, Text), Low-level Diagnosis, Diagnosis Family. Relationships: Has (0,1)/(0,n), Is part of (1,1)/(1,n).]

Data Model - Schema
• Fact type: Patient
• Dimension types: Diagnosis and HbA1c%
• There are no “measures”; all data are dimensions.
• Category types: Low-level Diagnosis, Diagnosis Group, Precise, Imprecise
• Top category types: correspond to ALL of the dimension.
• Bottom category types: the lowest level in each dimension
• The category types of a dimension type form a lattice.
• Category types ~ granularities
[Figure: schema. Diagnosis dimension type: TDiagnosis, Diagnosis Group, LL Diagnosis. HbA1c% dimension type: THbA1c%, Imprecise, Precise. Fact type: Patient.]

Data Model - Instances
• Categories: instances of category types, consisting of dimension values.
• Top categories contain only one “T” value.
• Dimensions = categories + partial order on category values
• Facts: instances of the fact type, with separate identity.
• Fact-dimension relations: link facts to dimensions; may map to values of any granularity.
• Multidimensional object (MO) = schema + dimensions + facts + fact-dimension relations
[Figure: instances. Diagnosis dimension: T; Diabetes; ID Diabetes, NID Diabetes. HbA1c% dimension: T; 5, 6, 7; 5.4, 5.5, .., 7.0. Facts: Jim, John, Jane.]

Algebraic Query Language
• Close to relational algebra with aggregation functions
  - Selection, projection, rename, union, difference, identity-based join operators
• Aggregation operator: takes “grouping categories”, an aggregation function, and a result dimension as arguments.
  - Groups together facts characterized by the same dimension values, applies the aggregation function to the groups, and places the result in the result dimension.
  - Example: COUNT by Diagnosis Family and THbA1c%
[Figure: Diagnosis dimension (T; Diabetes) and result dimension (T; 0-2, >2; 0, 1, 2, 3, ..). Group: {Jim,John,Jane}.]

Handling Imprecision
• We use the granularity of the data to capture imprecision.
• The dimensions specify an imprecision hierarchy.
• Data is mapped to the appropriate granularity.
• Pseudo-code:
    Procedure EvalImprecise(Q,M)
      if PreciseEnough(Q,M) then
        Eval(Q,M)
      else
        Q’ = Alternative(Q,M)
        if Q’ is ok then
          Eval(Q’,M)
        else
          Handle Imprecision in Grouping
          Handle Imprecision in Computation
          Return Imprecise Result
        end if
      end if
[Figure: imprecision hierarchy, from most imprecise (T = Unknown) through more imprecise and more precise levels down to most precise (the bottom category).]

Alternative Queries
• We use a “Precision MO” (PMO) to capture the data granularities.
• The precision MO has a granularity dimension for every MO dimension D.
• A granularity dimension has two categories: T and GranularityD.
• GranularityD contains a value for every category type in D.
• The set of facts stays the same.
• Facts are mapped to the appropriate GranularityD value.
[Figure: precision MO. GranDiagnosis dimension: T; LLD, DG, TD. GranHbA1c% dimension: T; P, I, TH. Facts: Jim, John, Jane.]

Alternative Queries
• Queries are rewritten into “testing queries” on the PMO that count the number of facts mapped to different granularity combinations (details in paper).
• The result of the testing queries can be used to suggest alternative queries.
• Example: COUNT by Low-level Diagnosis (and THbA1c%) on MO
  - Testing query: COUNT by GranularityDiagnosis on PMO
  - Result shows that the data is not precise enough to group on Low-level Diagnosis
  - Alternative query: COUNT by Diagnosis Group
[Figure: testing query result. GranDiagnosis dimension: T; LLD, DG, TD. Result dimension: T; 0, 1, 2. Groups: {Jim} and {John,Jane}.]

Imprecision in Grouping
• If facts are mapped to a dimension value of coarser granularity than the grouping category, with which dimension value should they be grouped? We do not know!
• So, we return several answers, based on different groupings of facts:
  - Conservative grouping: only include in a group what is known to belong to it; discard imprecise data.
  - Liberal grouping: include in a group everything that might belong to it; facts may be in several groups.
  - Weighted grouping: include everything that might belong, but give more weight to likely members.

Conservative Grouping
• Corresponds to the standard aggregation operator.
• Imprecise data are simply discarded.
• Example: COUNT by Low-level Diagnosis (and THbA1c%) on MO
  - Jim does not show in any group.
  - The count for both groups is 1.
  - The result is “too conservative” as not all data is accounted for in the result.
[Figure: Diagnosis dimension (T; Diabetes; ID Diabetes, NID Diabetes) and result dimension (T; 0, 1, 2). Groups: {John} and {Jane}.]

Liberal Grouping
• We modify the aggregation operator from the query language to compute the liberal grouping (formal semantics in the paper).
• The computed groups may now overlap, i.e., the same fact may be in several groups.
• Example: COUNT by Low-level Diagnosis (and THbA1c%) on MO
  - Jim ends up in both groups.
  - The count for both groups is 2.
  - The result is “too liberal” as the same data may be counted several times in the result.
[Figure: Diagnosis dimension (T; Diabetes; ID Diabetes, NID Diabetes) and result dimension (T; 0, 1, 2). Groups: {Jim,John} and {Jim,Jane}.]

Weighted Grouping
• Compromise between the conservative and liberal groupings:
  - Weights are assigned to the partial order on dimension values and to fact membership in groups (groups become fuzzy sets).
  - The aggregation operator is modified to compute the weighted grouping (see paper).
• Example: COUNT by Low-level Diagnosis (and THbA1c%) on MO
  - 80% of Diabetes patients have ID Diabetes, 20% NID Diabetes.
  - Jim ends up in both groups, but weighted differently.
  - The result is a weighted COUNT (details on next slide).
[Figure: Diagnosis dimension with weights (T to Diabetes: 1.0; Diabetes to ID Diabetes: .8; Diabetes to NID Diabetes: .2) and result dimension (T; 0, 1.2, 1.8). Groups: {Jim .8, John 1.0} and {Jim .2, Jane 1.0}.]

Imprecision in Computation
• We also need to handle imprecision in the aggregate computation itself, e.g., the varying precision for HbA1c%.
• In computations, we impute the expected value for values of non-bottom granularity (generalized imputation).
  - This allows normal computation of the result.
• A precision computation is performed along with the aggregate computation:
  - A granularity computation measure (GCM) is used to capture the imprecision of a dimension value during computation.
  - A measure combination function (MCF) is used to combine GCMs. The MCF must be distributive.
  - A final granularity measure (FGM) represents the “true” imprecision of a result value.
  - A final granularity function (FGF) maps from a GCM to an FGM.

Imprecision in Computation
• Example: WAVG(HbA1c%) by Low-Level Diagnosis
  - We use FGM = WAVG(Level) as a measure of precision (precise values have level 0, etc.).
  - The GCM for a value e with weight w is: (w * Level(e), w)
  - The MCF is: h((n1,n2),(n3,n4)) = (n1+n3, n2+n4)
  - The FGF is: f(n1,n2) = n1/n2
  - 6.0 & 7.0 are imputed for T & 7 in the WAVG(HbA1c%) computation
  - Jim, John, and Jane have levels 2, 0, and 1, respectively
  - Results of query: {(IDD,5.7,.9), (NIDD,5.6,1.2)}
[Figure: HbA1c% dimension with levels (level 2: T, imputed 6.0; level 1: 5, 6, 7, with 7 imputed 7.0; level 0: 5.4, 5.5, .., 7.0) and Diagnosis dimension with weights (T to Diabetes: 1.0; Diabetes to ID Diabetes: .8; Diabetes to NID Diabetes: .2). Facts: Jim, John, Jane.]

Presenting Imprecise Results
• Several ways to present the imprecise results to the user:
• Show result values along with the corresponding final granularity measure values.
  - Very precise estimate of result precision, but hard to grasp.
  - Resulting (value,FGM) = {(IDD,5.7,.9),(NIDD,5.6,1.2)}
• Map result values into different granularities using a value coarsening function (VCF).
  - More intuitive result, but less precise estimate of precision.
  - Example: Weighted grouping with VCF: r(x) = v such that x ≤ v and Level(v) = Ceiling(x).
[Figure: Diagnosis dimension (T; Diabetes; ID Diabetes, NID Diabetes) and result dimension (T; 5, 6, 7; 5.4, 5.5, .., 7.0). Groups: {Jim .8, John 1.0} and {Jim .2, Jane 1.0}.]

Using Pre-Aggregated Data
• The approach can utilize pre-aggregated data effectively.
• Aggregate results for the precision MO can usually be fully materialized due to the relatively small multidimensional space (~ 1,000,000 cells).
• The aggregate computation with expected values can use standard pre-aggregation techniques, e.g., partial pre-aggregation, for good response-time versus storage/update-time tradeoffs.
• The distributive MCF allows for partial pre-aggregation of precision results.

Conclusion
• Current OLAP tools and models do not support the imprecision found in real-world data.
• We have shown an approach to handling imprecision in OLAP databases based on the common multidimensional concept of granularities.
• The approach can suggest alternative queries when data is not precise enough and handles imprecision both in the grouping of data and in the aggregate computation.
• The approach has a low computational overhead and is able to exploit pre-aggregated results effectively.
• The approach can be implemented using existing technology such as SQL and OLAP tools.
• Future work: presentation of results, MIN/MAX functions, etc.
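
Supplementary sketch (not part of the original talk): the conservative, liberal, and weighted groupings and the precision computation (GCM, MCF, FGF) can be replayed on the three-patient example in a few lines of Python. The data layout and the function names (memberships, count, fgm) are assumptions made for illustration and are not the operators defined in the paper; the weights (.8/.2) and levels (2, 0, 1) are the ones stated on the slides. The sketch also assumes that any value above the low-level category (here only Diabetes) covers every low-level group, which holds for this tiny hierarchy but not in general.

```python
# Illustrative sketch only: names and data layout are assumptions,
# not the algebra operators defined in the paper.

# Weight of each low-level diagnosis under its parent "Diabetes"
# (80% / 20% split from the weighted-grouping slide).
CHILD_WEIGHTS = {"ID Diabetes": 0.8, "NID Diabetes": 0.2}

# Fact-dimension relation: each patient maps to a value of some granularity.
DIAGNOSIS = {"Jim": "Diabetes", "John": "ID Diabetes", "Jane": "NID Diabetes"}

# Level of each patient's HbA1c% value (0 = precise), as stated on the slides.
LEVEL = {"Jim": 2, "John": 0, "Jane": 1}


def memberships(mode):
    """Return {group: {patient: weight}} for the chosen grouping mode."""
    groups = {g: {} for g in CHILD_WEIGHTS}
    for patient, value in DIAGNOSIS.items():
        if value in groups:              # precise: belongs to exactly one group
            groups[value][patient] = 1.0
        elif mode == "liberal":          # imprecise: might belong to any group
            for g in groups:
                groups[g][patient] = 1.0
        elif mode == "weighted":         # imprecise: split by likelihood
            for g, w in CHILD_WEIGHTS.items():
                groups[g][patient] = w
        # mode == "conservative": imprecise facts are discarded
    return groups


def count(mode):
    """(Weighted) COUNT per low-level diagnosis group."""
    return {g: round(sum(m.values()), 2) for g, m in memberships(mode).items()}


def fgm(members):
    """Final granularity measure WAVG(Level), computed via GCM, MCF and FGF."""
    n1, n2 = 0.0, 0.0
    for patient, w in members.items():
        gcm = (w * LEVEL[patient], w)        # GCM: (w * Level(e), w)
        n1, n2 = n1 + gcm[0], n2 + gcm[1]    # MCF: component-wise sum
    return n1 / n2                           # FGF: f(n1, n2) = n1 / n2


if __name__ == "__main__":
    for mode in ("conservative", "liberal", "weighted"):
        print(mode, count(mode))
    for group, members in memberships("weighted").items():
        print(group, round(fgm(members), 1))
```

Run as a script, this reproduces the counts on the grouping slides (1 and 1, 2 and 2, 1.8 and 1.2) and the FGM values .9 and 1.2 from the WAVG example. Because the MCF is a component-wise sum, the (n1, n2) pairs could also be accumulated per partition and merged later, which is what makes partial pre-aggregation of precision results possible.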