The αβγτ-model: A theory of multidimensional structures of statistics

Bo Sundgren

Paper prepared for the MetaNet conference in Voorburg, the Netherlands, 2-4 April 2001

Statistical information lends itself naturally to multidimensional structuring. Thus statistical tables can be seen as two-dimensional projections of multidimensional structures. From a logical point of view, a statistical table often has more than two dimensions, but if one wants to present the table on paper, or even on a computer monitor, one has to project it onto the two dimensions that are available on those media.

Statistical information – or “statistics” for short – is built up from structured sets of estimates of statistical characteristics. Thus the concept of a statistical characteristic is central to statistics and statistics production.

This paper explains the αβγτ-model of multidimensional structures of statistics, starting from the concept of statistical characteristics and some other basic concepts. The model was first introduced in Sundgren (1973) and Sundgren (1975). It was further developed in several papers, e.g. in Sundgren (1990). More recently it was briefly presented in Sundgren (1999).

1 Basic concepts

1.1 What is a statistical characteristic?

A statistical characteristic S can be defined as a statistical measure, f, applied to the values of a variable, V, of the object instances in a set of objects, O, in order to summarize some aspect of those values:

S = O.V.f

The value of a statistical characteristic, S, is on an aggregated level, or macro level, relative to the values of the object characteristic C = O.V that it summarizes. The object characteristic C has a value of the variable V for each object instance in O.

Example 1: “Average income during the year 1999 for those persons who were registered in Stockholm at the end of the year 2000.”

Here the set of objects, O, usually called the population, is “the persons who were registered in Stockholm at the end of the year 2000”, the variable, V, is “income during the year 1999”, and the statistical measure, f, is “(arithmetical) average”. The object characteristic in the example is “income during the year 1999 of persons registered in Stockholm at the end of the year 2000”.

Example 2: “Number of persons registered in Stockholm at the end of the year 2000.”

This example can be viewed in two ways. According to both interpretations the population is “the persons who were registered in Stockholm at the end of the year 2000”. One view is that the statistical measure is a function, “count”, that summarizes “the frequency aspect” of the objects directly, not via any particular variable. The other view is that the statistical measure summarizes the values of a variable that takes the value “1” for all objects in the population.

Example 3: “The correlation between age at the end of the year 2000 and income during the year 1999 for persons registered in Stockholm at the end of the year 2000.”

In this example, it is not the values of a single variable that are summarized, but the values of a vector of variables. Thus we have to generalize the concept of a statistical characteristic by allowing the “V” in “S = O.V.f” to be interpreted as a vector of variables.
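
The three examples can be made concrete with a small sketch. The following Python fragment is only an illustration of the notation S = O.V.f, not part of the model itself: the person records, the helper names (characteristic, average, count, total, correlation), and all numbers are hypothetical. V is passed as a tuple of variable names, so that it can hold zero, one, or several variables.

```python
from math import sqrt

# Hypothetical object instances in O (persons registered in Stockholm, end of 2000).
persons = [
    {"id": "p1", "age_2000": 30, "income_1999": 200.0},
    {"id": "p2", "age_2000": 40, "income_1999": 300.0},
    {"id": "p3", "age_2000": 50, "income_1999": 400.0},
]

def characteristic(objects, variables, measure):
    """S = O.V.f: apply the measure f to the value vectors of V over O."""
    values = [tuple(obj[v] for v in variables) for obj in objects]
    return measure(values)

def average(values):
    """Arithmetic average of a single variable."""
    return sum(v[0] for v in values) / len(values)

def count(values):
    """Number of object instances; needs no variable at all."""
    return len(values)

def total(values):
    """Sum of a single variable."""
    return sum(v[0] for v in values)

def correlation(values):
    """Pearson correlation of a vector of two variables."""
    n = len(values)
    mx = sum(x for x, _ in values) / n
    my = sum(y for _, y in values) / n
    sxy = sum((x - mx) * (y - my) for x, y in values)
    sxx = sum((x - mx) ** 2 for x, _ in values)
    syy = sum((y - my) ** 2 for _, y in values)
    return sxy / sqrt(sxx * syy)

# Example 1: average of one variable.
S_avg = characteristic(persons, ("income_1999",), average)

# Example 2, first view: count with no variable.
S_count = characteristic(persons, (), count)

# Example 2, second view: sum of a variable that is 1 for every object.
with_one = [dict(p, one=1) for p in persons]
S_count_alt = characteristic(with_one, ("one",), total)

# Example 3: a measure applied to a vector of two variables.
S_corr = characteristic(persons, ("age_2000", "income_1999"), correlation)
```

Both views of Example 2 give the same number here, which is the point of the second interpretation: a count can be treated as an ordinary summary of a constant variable.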

1.2 Estimation of statistical characteristics

If it were possible to make perfectly correct observations of exactly those objects that are in the population aimed for, the target population, then it would be possible to obtain perfectly correct values of statistical characteristics for this population; the complete and true values of the object characteristics would lead to the true values of the statistical characteristics; one would just have to apply the appropriate statistical measures correctly.

In practice this “ideal procedure” for computing statistical characteristics is almost always impossible to implement. Some important reasons for this are:

1. One cannot identify and localize exactly those objects that are in the target population. Typically one uses some kind of list or register, called the frame, in order to find the objects concerned. The set of objects that the frame leads to is called the frame population. The frame population may differ from the target population by containing objects that are not part of the target population (over-coverage), and by not containing objects that are part of the target population (under-coverage).
2. One cannot afford to investigate/observe all objects in the target population. This can lead to a sample survey instead of a complete survey.
3. Regardless of whether one goes for a sample survey or a complete survey, one usually will not succeed in observing all objects and all variables aimed for (total or partial non-response).
4. The observations that are actually made will be subject to errors and uncertainties of different kinds (measurement errors, processing errors, etc.); that is, one will not always be able to obtain the true values of the variables that are to be summarized by the statistical measures.
5. Sometimes it is not even possible to observe the target population and/or the target variables directly.
Then one may observe them indirectly, either by using other sources, like other surveys or administrative registers, or by observing related objects and/or related variables and deriving the objects and variables aimed for. This procedure may also lead to over-coverage, under-coverage, non-response, and measurement errors.

Thus instead of the target population O and the target variables V aimed at, one will in practice obtain an actually observed set of objects O’, differing from O because of over-coverage, under-coverage, sampling, and total non-response, and an actually observed variable V’, differing from V because of total and partial non-response, measurement errors, and processing errors.

Hence one has to be satisfied with approximations, estimates, of the true values of the statistical characteristics aimed for, the target characteristics. These estimates have to be based on the incomplete, erroneous, and uncertain observations that one has been able to make, directly or indirectly.

When estimating the true value of a statistical characteristic S = O.V.f on the basis of an actually observed set of objects, O’, and actually obtained, processed, and finally registered values of a variable, or variable vector, V’, one has to apply a function f’, the estimator, which is somehow related to, but usually not identical with, the statistical measure f. The difference between f’ and f is the result of an attempt to compensate for the deviations of O’ and V’ from the ideal O and V.

In summary, the basic idea of a statistical survey, in a broad sense (including surveys based on administrative registers), implemented by means of a statistical production system, is to estimate the true values of statistical characteristics, O.V.f, on the basis of observed values of object entities, O’.V’, by applying an estimator, f’, to the observed values, thus computing O’.V’.f’.
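
As a rough illustration of the difference between O.V.f and O’.V’.f’, the following Python sketch compares a true average with an estimate computed from an observed register. Everything in it is an assumption made for the example: the incomes, the design weights, the pattern of non-response and measurement error, and in particular the choice of a weighted average over responding objects as f’. The paper does not prescribe any particular estimator.

```python
# True values O.V for a hypothetical target population (unknown in practice).
true_income = {"p1": 100.0, "p2": 200.0, "p3": 300.0, "p4": 400.0}

def f(values):
    """Statistical measure f: arithmetic average of the true values."""
    return sum(values) / len(values)

# Observed register O'.V': (object id, observed income or None, design weight).
# p1 was not sampled, p3 did not respond, p4 carries a measurement error.
observed = [
    ("p2", 200.0, 2.0),
    ("p3", None, 2.0),
    ("p4", 410.0, 1.0),
]

def f_prime(register):
    """Estimator f': weighted average over the responding objects only."""
    responses = [(y, w) for _, y, w in register if y is not None]
    weighted_sum = sum(y * w for y, w in responses)
    weight_total = sum(w for _, w in responses)
    return weighted_sum / weight_total

S = f(list(true_income.values()))   # target characteristic, O.V.f
S_est = f_prime(observed)           # estimate, O'.V'.f'
deviation = S_est - S               # known here only because O.V was made up
```

In a real survey the deviation cannot be computed, since the true values are not available; it can only be estimated.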

1.3 Estimates of the uncertainties of estimates

It follows from what has been said that estimates of statistical characteristics, and hence statistics as such, are subject to uncertainties of different kinds. One can try to decrease these uncertainties by improving the statistical production processes, including the estimation procedures, but some uncertainties will always remain. At best the uncertainties can themselves be estimated, once again with some uncertainty, of course. The estimates, and the more verbal descriptions, that one can produce in order to describe and quantify the uncertainties of estimates of statistical characteristics may form the basis for quality declarations of the produced statistics.

We have thus identified the three main tasks of the science of statistics production:

1. designing a statistical survey, in a broad sense, in order to be able to make optimal estimations of statistical target characteristics within certain restrictions (time, costs, etc.)
2. designing an optimal process for estimating given statistical target characteristics on the basis of a given statistical production system
3. designing an optimal process for estimating the uncertainties of given estimation procedures in a given statistical production system, thereby providing the basis for a quality declaration of the produced statistics

Naturally there has to be a lot of interaction and feedback between the three main tasks.

1.4 Summary of the basic concepts

Figure 1 summarizes the discussion so far.

2 Multidimensional structures of statistics

As should be clear from the previous chapter of this paper, when we talk about “statistics” in daily life, we actually mean “estimates of statistical characteristics”, or, even better, “quality-declared estimates of statistical characteristics”.

[Figure 1: diagram relating the target population, O, with target variable, V (persons p1-p9 with incomes inc1-inc9), the frame population, the observed objects, and the observation register with microdata and metadata. A process observes the variable "income during 1999" for a sample of the objects in the frame population; the observed values are collected, registered, and prepared (coding, editing, correction). The statistical measure, f, computes the true value of the statistical target characteristic, S = O.V.f, from the true values. The estimator, f', computes the estimated value, S' = O'.V'.f', from the registered and "cleaned" observed values, O'.V', and their metadata. Further expressions estimate and describe the uncertainty in the estimate, that is, the deviation of the estimated value from the true value of the statistical characteristic.]

Figure 1. Some basic concepts of statistics production: statistical characteristics, estimates of statistical characteristics, and estimates of the uncertainties of estimates of statistical characteristics.

Statistics are typically presented by means of tables and graphs. Traditionally paper was the medium, but nowadays computer-supported media, like displays and CDs, provide powerful and attractive alternatives.
Among other things, computer support makes it possible for the user to make the final decisions as to how the statistics should actually be presented. For example, pivot functions enable the user to rearrange the dimensions of a statistical table, moving variables between the stub and the heading, switching the order of variables, etc.

Statistical tables can be perceived in two different ways, depending on whether you focus on how they look when presented on a piece of paper or on a computer-supported display, or whether you focus on the logic behind these presentations. When looked upon in the first way, statistical tables often seem to be very complex. When looked upon in the second way, many statistical tables become quite simple. The αβγτ-model focuses on the fundamental logic and information structure behind statistical tables and other forms of presentation of statistics.

2.1 Structuring statistics by means of population crossclassification

A typical statistical table contains statistics concerning a family of statistical characteristics, where the family members are related in a certain way.

Example 4. Consider the following statistics: “Average income during the year 1999 for those persons who were registered in Stockholm at the end of the year 2000: by sex and age group.”

Suppose that there are two sexes, male and female, and three age groups, young, middle-aged, and old. Then Example 4 specifies at least six statistics in addition to the statistic in Example 1 earlier in this paper. The six statistics are represented by the six cells in the following crossclassification:

O = Persons registered in Stockholm at the end of the year 2000

            Young    Middle-aged    Old
Male
Female

Thus the original target population, O, has been subdivided into six subpopulations, or domains of interest, and the subpopulations have been formed by crossclassifying the original population by means of the variables “sex” and “age group”.
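
The subdivision into domains of interest can be sketched in a few lines of Python. The records, the variable names, and the helpers crossclassify and average below are hypothetical and serve only to show how one statistical measure is applied to each cell of the crossclassification.

```python
from collections import defaultdict

# Hypothetical persons in the target population O.
persons = [
    {"sex": "male", "age_group": "young", "income_1999": 100.0},
    {"sex": "male", "age_group": "young", "income_1999": 300.0},
    {"sex": "female", "age_group": "old", "income_1999": 250.0},
    {"sex": "female", "age_group": "middle-aged", "income_1999": 400.0},
]

def crossclassify(objects, variables):
    """Split O into subpopulations, one per observed combination of values."""
    domains = defaultdict(list)
    for obj in objects:
        cell = tuple(obj[v] for v in variables)
        domains[cell].append(obj)
    return dict(domains)

def average(objects, variable):
    """Arithmetic average of one variable over a set of objects."""
    return sum(obj[variable] for obj in objects) / len(objects)

# One statistic per cell: average income by sex and age group.
domains = crossclassify(persons, ("sex", "age_group"))
table = {cell: average(objs, "income_1999") for cell, objs in domains.items()}
```

Cells that contain no persons simply do not appear in the result; a presentation layer would have to decide how to show them.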

When statistics are presented in statistical tables, some marginal sums are usually computed as well. In the example above we could get:

O = Persons registered in Stockholm at the end of the year 2000

            Young    Middle-aged    Old    All age-groups
Male
Female
All sexes

This kind of structuring of statistics, and underlying statistical characteristics, corresponds to the γ-dimension of the αβγτ-model.

2.2 Major dimensions of the αβγτ-model: the αβγτ-matrix

Figure 2 illustrates the following discussion of the four major dimensions in the αβγτ-model. It contains a matrix, a so-called αβγτ-matrix, with different columns for the major dimensions, as well as for the major components of the concept of a statistical characteristic.

2.2.1 The α-dimension

The α-dimension contains the populations of the statistical characteristics. In figure 2 there are the following populations:

α1: “Persons registered in Sweden at the end of t” in S1, S2, and S3.
α2: “Domestic migrations during t” in S4 and S6.
α3: “Domestic migrations during t, where the target commune of the migration has a lower tax rate during t than the home commune of the migrating person” in S5.

“Domestic migrations” are defined as “migrations inside Sweden”.

Note that “O” in “O.V.f” denotes alternatively the whole α-population, Oα, or each one of the subdomains of interest, Oα by Vγ, or Oα\Vγ, defined by the classification Oα\Vγ, where Vγ is a crossclassification of n variables, the γ-variables: Vγ = Vγ1 × Vγ2 × ... × Vγn.

2.2.2 The β-dimension

The β-dimension contains the summarizing functions of the statistical characteristics, that is, statistical measures that are applied to zero, one, or more variables, the β-variables. Thus a statistical measure may have zero, one, or more arguments.
Some examples:

count          counts the number of object instances in O; a function with zero arguments
sum            sums the values of a variable V; a function with one argument
average        averages the values of a variable V; a function with one argument
correlation    computes the correlation between two variables, V1 and V2; thus two arguments
percentage     computes the percentage of object instances in O satisfying a Boolean variable V; a function with one argument

The average function can alternatively be expressed as a sum divided by a count, and a count can alternatively be expressed as a sum of a variable that takes the value 1 for all objects in O. Correlations and percentages can also be expressed in terms of other functions.

The following β-variables appear in figure 2:

β1: “income(t-1): the person’s income during t-1 according to taxation performed during t” in S1 and S3
β2: “age(t): the person’s age in whole years at the end of t” in S3
β3: “to_lower_tax(t): migration to commune with lower tax during t than the migrator’s home commune”, a Boolean variable in S6.

[Figure 2 is a matrix with one row per statistical characteristic, S = O.V.f, and columns for the reference time t (τ-dimension), the set of objects O, that is, the population (α-dimension), the summarizing function (β-dimension), consisting of variables V and statistical measure f, and the classification (γ-dimension). Its rows are listed below.]

S1: “Average income during the year t-1 for those persons who were registered in Sweden at the end of the year t: by commune, sex, and age.”
  Reference time (τ): year t = 1995, 1996, ...
  Population (α): persons registered in Sweden at the end of t
  Variables V (β): income(t-1): the person’s income during t-1 according to taxation performed during t
  Statistical measure f (β): average
  Classification (γ): commune(t): the commune where the person was registered at the end of t; sex(t): the person’s sex at the end of t; age(t): the person’s age in whole years at the end of t

S2: “Number of persons registered in Sweden at the end of the year t: by sex, age, and income bracket.”
  Reference time (τ): year t = 1995, 1996, ...
  Population (α): persons registered in Sweden at the end of t
  Variables V (β): none
  Statistical measure f (β): count
  Classification (γ): sex(t): see above; age(t): see above; income_bracket(t-1): the person’s income bracket according to classification xxx, based on the person’s income during t-1

S3: “The correlation between age at the end of the year t and income during the year t-1 for persons registered in Sweden at the end of the year t: by commune and sex.”
  Reference time (τ): year t = 1995, 1996, ...
  Population (α): persons registered in Sweden at the end of t
  Variables V (β): age(t): the person’s age in whole years at the end of t; income(t-1): the person’s income during the year t-1
  Statistical measure f (β): correlation
  Classification (γ): commune(t): see above; sex(t): see above

S4: “Domestic migrations during the year t: by sex and income bracket.”
  Reference time (τ): year t = 1995, 1996, ...
  Population (α): domestic migrations during t
  Variables V (β): none
  Statistical measure f (β): count
  Classification (γ): sex(t): the migrating person’s sex at the time of migration; income_bracket(t-1): the migrating person’s income bracket during t-1 according to classification xxx, based upon the person’s income during t-1

S5: “Domestic migrations during the year t from a commune with higher tax to a commune with lower tax: by sex and income bracket.”
  Reference time (τ): year t = 1995, 1996, ...
  Population (α): domestic migrations during t where the target commune of the migration has a lower tax rate during t than the home commune of the migrating person
  Variables V (β): none
  Statistical measure f (β): count
  Classification (γ): sex(t): see above; income_bracket(t-1): see above

S6: “Percentage of the domestic migrations during the year t that took place from a commune with higher tax to a commune with lower tax: by sex and income bracket.”
  Reference time (τ): year t = 1995, 1996, ...
  Population (α): domestic migrations during t
  Variables V (β): to_lower_tax: migration from commune with higher tax during t to commune with lower tax during t
  Statistical measure f (β): percentage. Alternatively: S6 = 100%*S5/S4
  Classification (γ): sex(t): see above; income_bracket(t-1): see above

Figure 2. αβγτ-matrix.

The following summarizing functions are formed by applying statistical measures to the β-variables:

S1: average(β1) or, with dot notation, β1.average
S2, S4, S5: count
S3: correlation(β1, β2) or (β1, β2).correlation
S6: percentage(β3) or β3.percentage

2.2.3 The γ-dimension

The γ-dimension contains variables that crossclassify the population into domains of interest, to which the statistical measures are applied in the same way as they are applied to the crossclassified population itself. In figure 2 the following crossclassifications occur:

S1: The population α1 is crossclassified by
  γ1: “commune(t): commune where the person was registered at the end of t”
  γ2: “sex(t): the person’s sex at the end of t”
  γ3: “age(t): the person’s age in whole years at the end of t”, that is, γ3 = β2
S2: The population α1 is crossclassified by γ2, γ3, and
  γ4: “income_bracket(t-1): the person’s income bracket according to the classification xxx, based upon the person’s income during t-1”
S3: The population α1 is crossclassified by γ1 and γ2.
S4: The population α2 is crossclassified by
  γ5: “sex(t): the migrating person’s sex at the time of migration”
  γ6: “income_bracket(t-1): the migrating person’s income bracket according to classification xxx, based upon the person’s income during t-1”
  Note that γ5 and γ6 are different from γ2 and γ4, since the former are (derived) variables of migrations, whereas the latter are variables of persons.
S5: The population α3 is crossclassified by γ5 and γ6.
S6: The population α2 is crossclassified by γ5 and γ6.

Note that the γ-variables form subdimensions of the γ-dimension.

2.2.4 The τ-dimension

The τ-dimension specifies reference times for the statistical characteristics. Time can be explicitly specified for all populations, variables, etc., but this is often impractical.
Instead a time parameter t is used, and all times are expressed as functions of t. The τ-dimension also states the value set of t. In figure 2 the τ-dimension specifies the parameter t with a value set consisting of the years 1995, 1996, and onwards.

[Figure 3: graph with the object types PERSON (person id, sex, age, income), COMMUNE (commune code, name, tax rate), MIGRATION (migration date, abroad), and EMPLOYER (organisation id, category), connected by the relations TO / RECEIVES, MIGRATOR / MAKES, REGISTRATOR / INHABITOR, LOCALIZATION / LOCALIZES, and MAIN EMPLOYER / MAIN EMPLOYER OF.]

Figure 3. OVR-graph: visualization of an object system.

[Figure 4: final observation register consisting of the following OV-matrixes:
  MIGRATIONS: migrant, migration date, abroad?, to_commune
  PERSONS: person id, sex, age, income, home commune, main employer
  COMMUNES: commune code, name, tax rate
  EMPLOYERS: organisation_id, category, commune]

Figure 4. OV-matrixes corresponding to the OVR-graph in figure 3.

2.3 Visualizing multidimensional structures of statistics

Figure 3 visualizes a universe of interest that covers the concepts needed to express, among other things, the statistical characteristics specified in the αβγτ-matrix in figure 2. The graph used for visualizing the objects, variables, and relationships in the universe of discourse is called an object graph or an ObjectVariableRelation (OVR) graph.

Figure 4 visualizes a hypothetical final observation register, accommodating a set of observations concerning the universe of discourse in figure 3. The observation register consists of a number of ObjectVariable (OV) matrixes, in principle one matrix per object type in the universe of discourse. This model of the final observation register could also be seen as a specification of a relational database implementation.

Figure 5 visualizes a two-dimensional and a three-dimensional structure, often called “box” or “cube”, corresponding to some statistical characteristics expressible in terms of the concepts of the universe of discourse in figure 3.
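
The step from an observation register like the one in figure 4 to the cells of a structure like those in figure 5 can be sketched in Python. The sketch below is an illustration under stated assumptions, not a specification: the register contents, the dictionary layout, and the helper names sex_of and to_lower_tax are hypothetical, a single classification variable (sex) is used to keep the example short, and the person's home commune is taken to be the commune recorded in the PERSONS matrix. It computes S4 and S5 as counts and S6 as a percentage, using the relation S6 = 100%*S5/S4 from figure 2.

```python
from collections import Counter

# Hypothetical OV-matrixes, one container per object type (cf. figure 4).
communes = {"A": {"tax_rate": 30.0}, "B": {"tax_rate": 32.0}}
persons = {
    "p1": {"sex": "male", "home_commune": "B"},
    "p2": {"sex": "female", "home_commune": "A"},
    "p3": {"sex": "male", "home_commune": "B"},
}
migrations = [
    {"migrant": "p1", "to_commune": "A", "abroad": False},
    {"migrant": "p2", "to_commune": "B", "abroad": False},
    {"migrant": "p3", "to_commune": "A", "abroad": False},
    {"migrant": "p3", "to_commune": None, "abroad": True},
]

def sex_of(migration):
    """Derived classification variable of a migration: the migrant's sex."""
    return persons[migration["migrant"]]["sex"]

def to_lower_tax(migration):
    """Derived Boolean variable: target commune has a lower tax rate than home."""
    home = persons[migration["migrant"]]["home_commune"]
    target = migration["to_commune"]
    return communes[target]["tax_rate"] < communes[home]["tax_rate"]

# Population: domestic migrations, i.e. migrations inside the country.
domestic = [m for m in migrations if not m["abroad"]]

# S4: count of domestic migrations, by sex.
S4 = Counter(sex_of(m) for m in domestic)

# S5: count of domestic migrations to a commune with lower tax, by sex.
S5 = Counter(sex_of(m) for m in domestic if to_lower_tax(m))

# S6: percentage of domestic migrations that go to lower tax, by sex.
S6 = {sex: 100.0 * S5[sex] / S4[sex] for sex in S4}
```

Adding a second classification variable would turn the keys into pairs, which gives the two-dimensional case.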

Figures 6 and 7 show two so-called star models that are also used for representing multidimensional structures. The stars are not limited to any particular number of dimensions.

[Figure 5: statistical characteristics in multidimensional structures. A two-dimensional structure with the dimensions AGE GROUP and INCOME BRACKET, where each cell holds a number of migrations, for “Domestic migrations during the year t: by age group and income bracket”; and a three-dimensional cube with the dimensions COMMUNE, SEX, and AGE GROUP, where each cell holds an average income, for “Average income during the year t-1 for those persons who were registered in Sweden at the end of the year t: by commune, sex, and age”.]

Figure 5. A two-dimensional and a three-dimensional cube accommodating estimated values of some statistical characteristics concerning the universe of discourse in figure 3.

[Figure 6: star with PERSON (age, income) at the centre and the points HOME COMMUNE (name, tax rate), SEX, AGE GROUP, INCOME BRACKET, and EMPLOYER (name, CATEGORY, COMMUNE with tax rate).]

Figure 6. Star model corresponding to figure 2, S1-S3.

[Figure 7: star with MIGRATION at the centre and the points DATE, TO COMMUNE (name, tax rate), ABROAD?, HOME COMMUNE (name, tax rate), SEX, AGE GROUP, INCOME BRACKET, PERSON (age, income), and EMPLOYER (name, CATEGORY, COMMUNE with tax rate).]

Figure 7. Star model corresponding to figure 2, S4-S6.

2.4 Statistical αβγτ-Query By Example (SQBE-αβγτ) and Statistical αβγτ-Query Specification Language (SQSL-αβγτ)

In figure 8 the final observation register of figure 4, corresponding to the universe of discourse in figure 3, has been used as the basis for a Statistical αβγτ-Query By Example (SQBE-αβγτ) expressing

1. requests for estimated values of the six statistical characteristics S1-S6
2. definitions of the derived variables “to_lower_tax”, a Boolean variable, and “incbr”, a classification of “income” called “cx”; both derived variables are needed in the specification of the requests for estimates of S1-S6

Figure 8 also contains equivalent specifications in a non-procedural language called Statistical αβγτ-Query Specification Language (SQSL-αβγτ), which goes back to the language INFOL as described in Sundgren (1973) and Sundgren (1992). The basic language elements and constructs can be defined as follows, with each language element or construct followed by its definition:

<statistical αβγτ-query>
    {<α-set of objects> by <vector of γ-variables>}.<vector of β-variables>.<statistical measure>
<α-set of objects>
    <object type> [<time reference>] [with <property>]
<property>
    [<path segment>]*<atomic property> | <valid expression of properties>
<path segment>
    <relation>.<object type> [<time reference>] [with <property>].
<atomic property>
    <variable><comparison operator><variable or value> | <vector of variables>.<aggregator><comparator><variable or value>
<variable>
    [<path segment>]*<atomic variable>[<time reference>] | <valid expression of variables>
γ-variable
    <variable that classifies the α-set of objects>
β-variable
    <variable of the α-objects, argument of the statistical measure>
<time reference>
    <expression in terms of time locators, atomic times, and time parameter t>
<time locator>
    at [the beginning of | the end of], during [the whole of], during [the former | the middle | the latter] part of
<atomic time>
    <point in time> | <time interval>
by, \
    <operator that uses the latter argument to (cross)classify the former>
with, |
    <operator that uses the latter argument to restrict the former to a subset>

[Figure 8, SQBE-αβγτ part: the final observation register with default time = t, shown as the OV-matrixes MIGRATION, PERSON, COMMUNE, and EMPLOYER with their relations (TO / RECEIVES, MIGRATOR / MAKES, REGISTRATOR / INHABITOR, LOCALIZATION / LOCALIZES, MAIN EMPLOYER / MAIN EMPLOYER OF). Each matrix has one query row per statistical characteristic S1-S6 and per definition def1-def2, with entries such as count, average, corr, %, false, the classification operator \, and example variables x, y, z, q, r, j, k.]

[Figure 8, SQSL-αβγτ part:]

    S1: {PERSON\commune,sex,age}.income(t-1).avg;
    S2: {PERSON\sex,age,incbr(t-1)}.count;
    S3: {PERSON\commune,sex}.age(t),income(t-1).corr;
    S4: {MIGRATION|abroad\sex,incbr(t-1)}.count;
    S5: {MIGRATION|abroad, to_lower_tax\sex,incbr(t-1)}.count;
    S6: {MIGRATION|abroad\sex,incbr(t-1)}.to_lower_tax.percentage;
    where
    incbr := income\cx;
    to_lower_tax := TO.COMMUNE.tax_rate < MIGRATOR.PERSON.REGISTRATOR.COMMUNE.tax_rate;

Figure 8.
Illustration of some queries in Statistical αβγτ-Query By Example (SQBE-αβγτ) and Statistical αβγτ-Query Specification Language (SQSL-αβγτ).

References

Sundgren, Bo (1973): “An Infological Approach to Data Bases”, Urval no. 7, Statistics Sweden and University of Stockholm, 1973, ISBN 91-38-01750-4.

Sundgren, Bo (1975): “Theory of Data Bases”, Mason/Charter Publishers, New York, 1975, ISBN 0-88405-307-5.

Sundgren, Bo (1990): “Conceptual Modelling and Related Methods and Tools for Computer-Aided Design of Information Systems”, The Second International Conference on Information Systems Developers Workbench, Gdansk, Poland, September 25-28, 1990.

Sundgren, Bo (1992): “Databasorienterad systemutveckling” (“Database-oriented systems development”), in Swedish, Studentlitteratur, Lund, 1992.

Sundgren, Bo (1999): “Information Systems Architecture for National and International Statistical Offices: Guidelines and Recommendations”, United Nations, Geneva, 1999.