The -model:
A theory of multidimensional structures of statistics
Bo Sundgren
Paper prepared for the MetaNet conference in Voorburg, the Netherlands, 2-4 April 2001
Statistical information lends itself naturally to multidimensional structuring. Thus statistical
tables can be seen as two-dimensional projections of multidimensional structures. From a logical
point of view, a statistical table often consists of more than two dimensions, but if one wants to
present the table on paper, or even on a computer monitor, one has to project the table onto the
two dimensions that are available on those media.
Statistical information – or “statistics” for short – is built up from structured sets of estimates of
statistical characteristics. Thus the concept of a statistical characteristic is very central to statistics
and statistics production.
This paper explains the αβγτ-model of multidimensional structures of statistics, starting from the
concept of statistical characteristics and some other basic concepts. The model was first
introduced in Sundgren (1973) and Sundgren (1975). It was further developed in several papers,
e.g. in Sundgren (1990). More recently it was briefly presented in Sundgren (1999).
1 Basic concepts
1.1 What is a statistical characteristic?
A statistical characteristic S can be defined as a statistical measure, f, applied to the values of a
variable, V, of the object instances in a set of objects, O, in order to summarize some aspect of
those values:
S = O.V.f
The value of a statistical characteristic, S, is on an aggregated level, or macro level, relative to the
values of the object characteristic
C = O.V
that it summarizes. The object characteristic C has a value of the variable V for each object
instance in O.
Example 1: “Average income during the year 1999 for those persons who were registered in
Stockholm at the end of the year 2000.”
Here the set of objects, O, usually called the population, is “the persons who were registered in
Stockholm at the end of the year 2000”, the variable, V, is “income during the year 1999”, and
the statistical measure, f, is “(arithmetical) average”.
The object characteristic in the example is “income during the year 1999 of persons registered in
Stockholm at the end of the year 2000”.
Example 2: “Number of persons registered in Stockholm at the end of the year 2000.”
This example can be viewed in two ways. According to both interpretations the population is “the
persons who were registered in Stockholm at the end of the year 2000”. One view is that the statistical measure
is a function, “count”, that summarizes “the frequency aspect” of the objects directly, not via any
particular variable. The other view is that the statistical measure summarizes the values of a
variable that takes the value “1” for all objects in the population.
Example 3: “The correlation between age at the end of the year 2000 and income during the year
1999 for persons registered in Stockholm at the end of the year 2000.”
In this example, it is not the values of a single variable that are summarized, but the values of a
vector of variables. Thus we have to generalize the concept of a statistical characteristic by
allowing the “V” in “S = O.V.f” to be interpreted as a vector of variables.
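The notation S = O.V.f can be read as a small computation. The following is a minimal sketch, assuming that object instances are represented as dictionaries; the function names and the toy figures are invented for illustration and are not part of the model itself.

```python
from statistics import mean

# Hypothetical toy population O (invented data, for illustration only).
persons = [
    {"id": "p1", "age": 30, "income": 200},
    {"id": "p2", "age": 40, "income": 300},
    {"id": "p3", "age": 50, "income": 400},
]

def characteristic(objects, variables, measure):
    """Compute S = O.V.f: apply the measure f to the values of V in O."""
    # One list of values per variable; V may be empty, one variable, or a vector.
    columns = [[obj[v] for obj in objects] for v in variables]
    return measure(objects, *columns)

def average(objects, values):
    return mean(values)

def count(objects):
    # Measure without variable arguments: the "frequency aspect" of O.
    return len(objects)

def correlation(objects, xs, ys):
    # Pearson correlation between two variables.
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

s1 = characteristic(persons, ["income"], average)             # cf. Example 1
s2 = characteristic(persons, [], count)                       # cf. Example 2
s3 = characteristic(persons, ["age", "income"], correlation)  # cf. Example 3
```

Passing an empty variable list for the count corresponds to the first of the two views in Example 2; the second view would instead sum a variable that is 1 for every object.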
1.2 Estimation of statistical characteristics
If it were possible to make perfectly correct observations of exactly those objects that are in the
population aimed for, the target population, then it would be possible to obtain perfectly correct
values of statistical characteristics for this population; the complete and true values of the object
characteristics would lead to the true values of the statistical characteristics; one would just have
to apply the appropriate statistical measures correctly. In practice this “ideal procedure” for
computing statistical characteristics is almost always impossible to implement. Some important
reasons for this are:
1. One cannot identify and localize exactly those objects that are in the target population.
Typically one uses some kind of list or register, called the frame, in order to find the objects
concerned. The set of objects that the frame leads to is called the frame population. The frame
population may differ from the target population by containing objects that are not part of the
target population, over-coverage, and by not containing objects that are part of the target
population, under-coverage.
2. One cannot afford to investigate/observe all objects in the target population. This can lead to
a sample survey instead of a complete survey.
3. Regardless of whether one goes for a sample survey or a complete survey, one usually will
not succeed in observing all objects and all variables aimed for (total or partial non-response).
4. The observations that are actually made will be subject to errors and uncertainties of different
kinds, measurement errors, processing errors, etc, that is, one will not always be able to
obtain the true values of the variables, the values of which are to be summarized by the
statistical measures.
5. Sometimes it is not even possible to observe the target population and/or the target variables
directly. Then one may observe them indirectly, either by using other sources, like other
surveys or administrative registers, or by observing related objects and/or related variables
and deriving the objects and variables aimed for. This procedure may also lead to over-coverage,
under-coverage, non-response, and measurement errors.
Thus instead of the target population O and the target variables V aimed at, one will in practice
obtain an actually observed set of objects O’, differing from O because of over-coverage,
under-coverage, sampling, and total non-response, and an actually observed variable V’, differing from
V because of total and partial non-response, measurement errors, and processing errors.
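The relation between the target population, the frame population, and the observed set of objects can be illustrated with set operations. This is only a sketch; the identifiers and the membership of each set are invented for illustration.

```python
# Hypothetical object identifiers (invented for illustration).
target_population = {"p1", "p2", "p3", "p4", "p5"}   # O, the objects aimed for
frame_population = {"p2", "p3", "p4", "p5", "p6"}    # objects the frame leads to
sample = {"p2", "p4", "p5", "p6"}                    # drawn from the frame
respondents = {"p2", "p4", "p6"}                     # p5 is a total non-response

# Objects in the frame that do not belong to the target population.
over_coverage = frame_population - target_population
# Objects in the target population that the frame does not reach.
under_coverage = target_population - frame_population
# O': the set of objects actually observed.
observed = sample & respondents
```

Note that the observed set here contains p6, which is not in the target population, and lacks p1, p3, and p5, each for a different reason.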
Hence one has to be satisfied with approximations, estimates, of the true values of the statistical
characteristics aimed for, the target characteristics. These estimations have to be based on the
incomplete, erroneous, and uncertain observations that one has been able to make, directly or
indirectly.
When estimating the true value of a statistical characteristic S = O.V.f on the basis of an actually
observed set of objects, O’, and actually obtained, processed, and finally registered values of a
variable, or variable vector, V’, one has to apply a function f’, the estimator, which is somehow
related to, but usually not identical with, the statistical measure f. The difference between f’ and f
is the result of an attempt to compensate for the deviations of O’ and V’ from the ideal O and V.
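As an illustration of how f’ may differ from f, the sketch below contrasts a plain average with a weighted average. The choice of a weighted average, the weights, and the values are assumptions made for this example; the paper does not prescribe any particular estimator.

```python
def average(values):
    """The statistical measure f: arithmetic mean of the (true) values."""
    return sum(values) / len(values)

def weighted_average(values, weights):
    """One possible estimator f': each observed value is weighted, for
    example to compensate for unequal sampling or for non-response."""
    return sum(w * v for v, w in zip(values, weights)) / sum(weights)

# Hypothetical observed values O'.V' and weights (invented for illustration).
observed_incomes = [100.0, 200.0, 400.0]
weights = [1.0, 1.0, 2.0]  # the third object "represents" two objects

unweighted = average(observed_incomes)
estimate = weighted_average(observed_incomes, weights)  # O'.V'.f'
```

With equal weights the two functions coincide, which corresponds to the ideal case where O’ and V’ do not deviate from O and V.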
In summary, the basic idea of a statistical survey, in a broad sense (including surveys based on
administrative registers), implemented by means of a statistical production system, is to
• estimate the true values of statistical characteristics, O.V.f,
• on the basis of observed values of object entities, O’.V’,
• by applying an estimator, f’, on the observed values,
• thus computing O’.V’.f’.
1.3 Estimates of the uncertainties of estimates
It follows from what has been said that estimates of statistical characteristics, and hence statistics
as such, are subject to uncertainties of different kinds. One can try to decrease these uncertainties
by improving the statistical production processes, including the estimation procedures, but some
uncertainties will always remain. At best the uncertainties can themselves be estimated, once
again with some uncertainty, of course. The numerical estimates, and the more verbal descriptions,
that one can produce in order to describe and quantify the uncertainties of estimates of statistical
characteristics may form the basis for quality declarations of the produced statistics.
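One simple instance of an estimated uncertainty is the standard error of an estimated average. The sketch below assumes a simple random sample without non-response and ignores the finite population correction; it therefore covers sampling uncertainty only, not coverage, measurement, or processing errors. The observed values are invented.

```python
from math import sqrt
from statistics import mean, stdev

# Hypothetical observed incomes from a simple random sample (invented data).
observed_incomes = [100.0, 200.0, 300.0, 400.0, 500.0]

estimate = mean(observed_incomes)
# Estimated standard error of the mean: sample standard deviation / sqrt(n).
standard_error = stdev(observed_incomes) / sqrt(len(observed_incomes))
```

A quality declaration could report the pair (estimate, standard error) together with verbal descriptions of the error sources that this figure does not capture.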
We have thus identified the three main tasks of the science of statistics production:
• how to design a statistical survey, in a broad sense, in order to be able to make optimal
estimations of statistical target characteristics within certain restrictions (time, costs, etc)
• designing an optimal process for estimating given statistical target characteristics on the basis
of a given statistical production system
• designing an optimal process for estimating the uncertainties of given estimation procedures
in a given statistical production system, thereby providing the basis for a quality declaration
of the produced statistics
Naturally there has to be a lot of interaction and feedback between the three main tasks.
1.4 Summary of the basic concepts
Figure 1 summarizes the discussion so far.
2 Multidimensional structures of statistics
As should be clear from the previous chapter of this paper, when we talk about “statistics” in
daily life, we actually mean “estimates of statistical characteristics”, or, even better,
“quality-declared estimates of statistical characteristics”.
[Figure 1: diagram relating the target population O with target variable V (persons p1-p9 with
incomes inc1-inc9), the frame population, the observed objects, and the observation register with
microdata and metadata (p2, p4, p6, p8). The statistical measure f computes the true value of the
statistical target characteristic S = O.V.f from the true values; the estimator f’ computes the
estimated value S’ = O’.V’.f’ from the registered and “cleaned” observed values O’.V’; further
expressions estimate and describe the uncertainty in the estimate, that is, the deviation of the
estimated value from the true value.]
Figure 1. Some basic concepts of statistics production: statistical characteristics, estimates of
statistical characteristics, and estimates of the uncertainties of estimates of statistical
characteristics.
Statistics are typically presented by means of tables and graphs. Traditionally paper was the
medium, but nowadays computer-supported media, like displays and CDs, provide powerful and
attractive alternatives. Among other things, computer support makes it possible for the user to
make the final decisions as to how the statistics should actually be presented. For example, pivot
functions enable the user to rearrange the dimensions of a statistical table, moving variables
between the stub and the heading, switching the order of variables, etc.
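A pivot operation of this kind can be sketched in a few lines. The example assumes a structure with exactly two dimensions whose cells are keyed by (sex, age group); the function name `project` and the cell values are invented for illustration.

```python
# Hypothetical estimates keyed by (sex, age group); the numbers are invented.
cells = {
    ("male", "young"): 10, ("male", "old"): 30,
    ("female", "young"): 20, ("female", "old"): 40,
}
dimensions = ("sex", "age_group")

def project(cells, dimensions, stub, heading):
    """Lay out a two-dimensional structure with `stub` as rows and
    `heading` as columns."""
    s, h = dimensions.index(stub), dimensions.index(heading)
    rows = sorted({key[s] for key in cells})
    cols = sorted({key[h] for key in cells})
    lookup = {(key[s], key[h]): value for key, value in cells.items()}
    table = [[lookup[(r, c)] for c in cols] for r in rows]
    return rows, cols, table

rows, cols, table = project(cells, dimensions, "sex", "age_group")
# Pivoting: the same cells, with stub and heading exchanged.
p_rows, p_cols, pivoted = project(cells, dimensions, "age_group", "sex")
```

The underlying cells are untouched by the pivot; only the projection onto rows and columns changes.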
Statistical tables can be perceived in two different ways, depending on whether you focus on how
they look when presented on a piece of paper, or on a computer-supported display, or whether you
focus on the logic behind these presentations. When looked upon in the first way, statistical
tables often seem to be very complex. When looked upon in the second way, many statistical
tables become quite simple.
The αβγτ-model focuses on the fundamental logic and information structure behind statistical
tables and other forms of presentation of statistics.
2.1 Structuring statistics by means of population crossclassification
A typical statistical table contains statistics concerning a family of statistical characteristics,
where the family members are related in a certain way.
Example 4. Consider the following statistics: “Average income during the year 1999 for those
persons who were registered in Stockholm at the end of the year 2000: by sex and age group.”
Suppose that there are two sexes, male and female, and three age groups, young, middle-aged,
old. Then Example 4 specifies at least six statistics in addition to the statistic in Example 1 earlier
in this paper. The six statistics are represented by the six cells in the following crossclassification:
O = Persons registered in Stockholm at the end of the year 2000

         | Young | Middle-aged | Old
Male     |       |             |
Female   |       |             |
Thus the original target population O, has been subdivided into six subpopulations, or domains of
interest, and the subpopulations have been formed by crossclassifying the original population by
means of the variables “sex” and “age group”.
When statistics are presented in statistical tables, some marginal sums are usually computed as
well. In the example above we could get:
O = Persons registered in Stockholm at the end of the year 2000

           | Young | Middle-aged | Old | All age-groups
Male       |       |             |     |
Female     |       |             |     |
All sexes  |       |             |     |
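The crossclassification with marginals can be sketched as a grouping of microdata. The code below is an illustration only: the person records are invented, only two age groups are used, and the marker `ALL` for a marginal is a naming assumption.

```python
from collections import defaultdict

# Hypothetical microdata (invented for illustration).
persons = [
    {"sex": "male", "age_group": "young", "income": 100},
    {"sex": "male", "age_group": "old", "income": 300},
    {"sex": "female", "age_group": "young", "income": 200},
    {"sex": "female", "age_group": "old", "income": 400},
    {"sex": "female", "age_group": "old", "income": 600},
]

ALL = "all"  # marks a marginal, e.g. ("male", ALL) = males of all age groups

def average_by(objects, classifiers, variable):
    """Average of `variable` for every domain of interest, margins included."""
    groups = defaultdict(list)
    for obj in objects:
        key = tuple(obj[c] for c in classifiers)
        # The object contributes to its own cell and to every margin:
        # each subset of key positions is replaced by ALL in turn.
        for mask in range(2 ** len(key)):
            cell = tuple(ALL if (mask >> i) & 1 else k for i, k in enumerate(key))
            groups[cell].append(obj[variable])
    return {cell: sum(v) / len(v) for cell, v in groups.items()}

table = average_by(persons, ["sex", "age_group"], "income")
```

For an average, the marginals are computed from the underlying values rather than by adding cells; the cell (ALL, ALL) corresponds to the uncrossclassified statistic of Example 1.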
This kind of structuring of statistics, and underlying statistical characteristics, corresponds to the
γ-dimension of the αβγτ-model.
2.2 Major dimensions of the αβγτ-model: the αβγτ-matrix
Figure 2 illustrates the following discussion of the four major dimensions in the αβγτ-model. It
contains a matrix, a so-called αβγτ-matrix, with different columns for the major dimensions, as
well as for the major components of the concept of a statistical characteristic.
2.2.1 The α-dimension
The α-dimension contains the populations of the statistical characteristics. In figure 2 there are
the following populations:
α1: “Persons registered in Sweden at the end of t” in S1, S2, and S3.
α2: “Domestic migrations during t” in S4 and S6.
α3: “Domestic migrations during t, where the target commune of the migration has a
lower tax rate during t than the home commune of the migrating person” in S5.
“Domestic migrations” are defined as “migrations inside Sweden”.
Note that “O” in “O.V.f” denotes alternatively the whole α-population, O_α, or each one of the
subdomains of interest, O_α by V_γ, or O_α \ V_γ, defined by the classification O_α \ V_γ, where V_γ is a
crossclassification of n variables, the γ-variables: V_γ = V_γ1 × V_γ2 × ... × V_γn.
2.2.2 The β-dimension
The β-dimension contains the summarizing functions of the statistical characteristics, that is,
statistical measures that are applied to zero, one, or more variables, the β-variables. Thus a
statistical measure may have zero, one, or more arguments. Some examples:
count: counts the number of object instances in O, a function with zero arguments
sum: sums the values of a variable V, a function with one argument
average: averages the values of a variable V, a function with one argument
correlation: computes the correlation between two variables, V1 and V2, thus two arguments
percentage: computes the percentage of object instances in O satisfying a Boolean variable V, a
function with one argument
The average function can alternatively be expressed as a sum divided by a count, and a count can
alternatively be expressed as a sum of a variable that takes the value 1 for all objects in O.
Correlations and percentages can also be expressed in terms of other functions.
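The reductions mentioned above can be written out directly. This is a minimal sketch with invented values, in which every measure is built from a single `total` function.

```python
def total(values):
    """sum: adds up the values of a variable."""
    return sum(values)

def count(objects):
    """count as the sum of a variable that is 1 for every object."""
    return total([1 for _ in objects])

def average(values):
    """average as a sum divided by a count."""
    return total(values) / count(values)

def percentage(flags):
    """percentage of objects for which a Boolean variable is true."""
    return 100 * total([1 if flag else 0 for flag in flags]) / count(flags)

# Hypothetical values of two variables for four objects (invented).
incomes = [100, 200, 300, 400]
to_lower_tax = [True, False, False, True]

n = count(incomes)
avg = average(incomes)
pct = percentage(to_lower_tax)
```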
The following -variables appear in figure 2:
1:
2:
3:
“income(t-1): the person’s income during t-1 according to taxation performed
during t” in S1 and S3
“age(t): the person’s age in whole years at the end of t” in S3
“to_lower_tax(t): migration to commune with lower tax during t than the migrator’s
home commune”, a Boolean variable in S6.
Figure 2. αβγτ-matrix.
The matrix has one row per statistical characteristic, S = O.V.f: by variables G, and the following
columns: reference time t (τ-dimension); set of objects O, the population (α-dimension);
summarizing function (β-dimension), consisting of variables V and statistical measure f;
classification (γ-dimension).

S1: “Average income during the year t-1 for those persons who were registered in Sweden at the end of
the year t: by commune, sex, and age.”
  τ, reference time: year t = 1995, 1996, ...
  α, population: persons registered in Sweden at the end of t
  β, variables: income(t-1): the person’s income during t-1 according to taxation performed during t
  β, statistical measure: average
  γ, classification: commune(t): the commune where the person was registered at the end of t;
    sex(t): the person’s sex at the end of t; age(t): the person’s age in whole years at the end of t

S2: “Number of persons registered in Sweden at the end of the year t: by sex, age, and income bracket.”
  τ, reference time: year t = 1995, 1996, ...
  α, population: persons registered in Sweden at the end of t
  β, variables: none
  β, statistical measure: count
  γ, classification: sex(t): see above; age(t): see above; income bracket: the person’s income
    bracket according to classification xxx, based on the person’s income during t-1

S3: “The correlation between age at the end of the year t and income during the year t-1 for persons
registered in Sweden at the end of the year t: by commune and sex.”
  τ, reference time: year t = 1995, 1996, ...
  α, population: persons registered in Sweden at the end of t
  β, variables: age(t): the person’s age in whole years at the end of t; income(t-1): the person’s
    income during the year t-1
  β, statistical measure: correlation
  γ, classification: commune(t): see above; sex(t): see above

S4: “Domestic migrations during the year t: by sex and income bracket.”
  τ, reference time: year t = 1995, 1996, ...
  α, population: domestic migrations during t
  β, variables: none
  β, statistical measure: count
  γ, classification: sex(t): the migrating person’s sex at the time of migration;
    income bracket(t-1): the migrating person’s income bracket during t-1 according to
    classification xxx based upon the person’s income during t-1

S5: “Domestic migrations during the year t from a commune with higher tax to a commune with lower
tax: by sex and income bracket.”
  τ, reference time: year t = 1995, 1996, ...
  α, population: domestic migrations during t where the target commune of the migration has a
    lower tax rate during t than the home commune of the migrating person
  β, variables: none
  β, statistical measure: count
  γ, classification: sex(t): see above; income bracket(t-1): see above

S6: “Percentage of the domestic migrations during the year t that took place from a commune with
higher tax to a commune with lower tax: by sex and income bracket.”
  τ, reference time: year t = 1995, 1996, ...
  α, population: domestic migrations during t
  β, variables: to_lower_tax: migration from commune with higher tax during t to commune with
    lower tax during t
  β, statistical measure: percentage. Alternatively: S6 = 100%*S5/S4
  γ, classification: sex(t): see above; income bracket(t-1): see above
The following summarizing functions are formed by applying statistical measures to the β-variables:
S1: average(β1) or, with dot notation, β1.average
S2, S4, S5: count
S3: correlation(β1, β2) or (β1, β2).correlation
S6: percentage(β3) or β3.percentage
2.2.3 The γ-dimension
The γ-dimension contains variables that crossclassify the population into domains of interest, to
which the statistical measures are applied in the same way as they are applied to the
crossclassified population itself. In figure 2 the following crossclassifications occur:
S1: The population α1 is crossclassified by
  γ1: “commune(t): commune where the person was registered at the end of t”
  γ2: “sex(t): the person’s sex at the end of t”
  γ3: “age(t): the person’s age in whole years at the end of t”, that is, γ3 = β2
S2: The population α1 is crossclassified by γ2, γ3, and
  γ4: “income_bracket(t-1): the person’s income bracket according to the
  classification xxx, based upon the person’s income during t-1”
S3: The population α1 is crossclassified by γ1 and γ2.
S4: The population α2 is crossclassified by
  γ5: “sex(t): the migrating person’s sex at the time of migration”
  γ6: “income_bracket(t-1): the migrating person’s income bracket
  according to classification xxx based upon the person’s income during t-1”
  Note that γ5 and γ6 are different from γ2 and γ4, since the former are (derived)
  variables of migrations, whereas the latter are variables of persons.
S5: The population α3 is crossclassified by γ5 and γ6.
S6: The population α2 is crossclassified by γ5 and γ6.
Note that the γ-variables form subdimensions of the γ-dimension.
2.2.4 The τ-dimension
The τ-dimension specifies reference times for the statistical characteristics. Time can be
explicitly specified for all populations, variables, etc, but this is often impractical. Instead a time
parameter t is used, and all times are expressed as functions of t. The τ-dimension also states the
value set of t.
In figure 2 the τ-dimension specifies the parameter t with a value set consisting of the years 1995,
1996, and onwards.
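One row of the matrix in figure 2 can be represented as a small record with one field per dimension. The sketch below is an assumption about how such a record might look; the class name, the field names, and the rendering function `to_query` are invented for illustration and are not defined by the model.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Characteristic:
    alpha: str             # population (set of objects O)
    beta_variables: tuple  # arguments of the statistical measure (may be empty)
    measure: str           # statistical measure f
    gamma: tuple           # classification variables
    tau: str               # value set of the time parameter t

S1 = Characteristic(
    alpha="persons registered in Sweden at the end of t",
    beta_variables=("income(t-1)",),
    measure="average",
    gamma=("commune(t)", "sex(t)", "age(t)"),
    tau="t = 1995, 1996, ...",
)
S2 = Characteristic(
    alpha="persons registered in Sweden at the end of t",
    beta_variables=(),
    measure="count",
    gamma=("sex(t)", "age(t)", "income_bracket(t-1)"),
    tau="t = 1995, 1996, ...",
)

def to_query(c):
    """Render a record in dot notation: {O by classification}.V.f"""
    parts = ["{" + c.alpha + " by " + ", ".join(c.gamma) + "}"]
    if c.beta_variables:
        parts.append(", ".join(c.beta_variables))
    parts.append(c.measure)
    return ".".join(parts)

q1 = to_query(S1)
q2 = to_query(S2)
```

Two records that share `alpha` and `tau`, like S1 and S2 here, describe statistics over the same population and reference times and differ only in the summarizing function and the classification.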
[Figure 3: graph of the object types PERSON (person id, sex, age, income), MIGRATION (person id,
migration date, abroad), COMMUNE (commune code, name, tax rate), and EMPLOYER (organisation id,
category), connected by relations with the role names TO, RECEIVES, MIGRATOR, MAKES, REGISTRATOR,
INHABITOR, LOCALIZES, LOCALIZATION, MAIN EMPLOYER, and MAIN EMPLOYER OF.]
Figure 3. OVR-graph: visualization of an object system.
FINAL OBSERVATION REGISTER, consisting of OV-matrixes
MIGRATIONS: migrant | migration date | abroad? | to_commune
PERSONS: person id | sex | age | income | home commune | main employer
COMMUNES: commune code | name | tax rate
EMPLOYERS: organisation_id | category | commune
Figure 4. OV-matrixes corresponding to the OVR-graph in figure 3.
2.3 Visualizing multidimensional structures of statistics
Figure 3 visualizes a universe of discourse that covers the concepts needed to express, among other
things, the statistical characteristics specified in the αβγτ-matrix in figure 2. The graph used for
visualizing the objects, variables, and relationships in the universe of discourse is called an object
graph or an Object-Variable-Relation (OVR) graph.
Figure 4 visualizes a hypothetical final observation register, accommodating a set of observations
concerning the universe of discourse in figure 3. The observation register consists of a number of
Object-Variable (OV) matrixes, in principle one matrix per object type in the universe of
discourse. This model of the final observation register could also be seen as a specification of a
relational database implementation.
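A relational reading of the OV-matrixes can be sketched with an in-memory SQLite database. The table and column names follow figure 4, but the column types, the sample rows, the omission of the EMPLOYERS matrix, and the assumption that the home commune holds the commune the person migrated from are all choices made for this illustration.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE communes (commune_code TEXT PRIMARY KEY, name TEXT, tax_rate REAL);
CREATE TABLE persons (person_id TEXT PRIMARY KEY, sex TEXT, age INTEGER,
                      income REAL, home_commune TEXT REFERENCES communes);
CREATE TABLE migrations (migrant TEXT REFERENCES persons, migration_date TEXT,
                         abroad INTEGER, to_commune TEXT REFERENCES communes);
-- Invented sample rows.
INSERT INTO communes VALUES ('A', 'Alpha', 32.0), ('B', 'Beta', 30.0);
INSERT INTO persons VALUES ('p1', 'female', 40, 300.0, 'A'),
                           ('p2', 'male', 30, 200.0, 'B');
INSERT INTO migrations VALUES ('p1', '1999-05-01', 0, 'B'),
                              ('p2', '1999-06-01', 0, 'A');
""")

# Number of domestic migrations to a commune with a lower tax rate, by sex.
rows = con.execute("""
SELECT p.sex, COUNT(*)
FROM migrations m
JOIN persons p ON p.person_id = m.migrant
JOIN communes target ON target.commune_code = m.to_commune
JOIN communes home ON home.commune_code = p.home_commune
WHERE m.abroad = 0 AND target.tax_rate < home.tax_rate
GROUP BY p.sex
""").fetchall()
```

The joins follow the relations of the object graph from a migration to its migrator and to the two communes involved, so the query is one possible relational counterpart of a count over a restricted population of migrations.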
Figure 5 visualizes a two-dimensional and a three-dimensional structure, often called “box” or
“cube”, corresponding to some statistical characteristics expressible in terms of the concepts of
the universe of discourse in figure 3.
Figures 6 and 7 show two so-called star models that are also used for representing multidimensional
structures. The stars are not limited to any particular number of dimensions.
[Figure 5: STATISTICAL CHARACTERISTICS IN MULTIDIMENSIONAL STRUCTURES. A two-dimensional structure,
“Domestic migrations during the year t: by age group and income bracket”, with the dimensions AGE
GROUP and INCOME BRACKET and the cell content “number of migrations”; and a three-dimensional
cube, “Average income during the year t-1 for those persons who were registered in Sweden at the
end of the year t: by commune, sex, and age”, with the dimensions COMMUNE, SEX, and AGE GROUP and
the cell content “average income”.]
Figure 5. A two-dimensional and a three-dimensional cube accommodating estimated values of
some statistical characteristics concerning the universe of discourse in figure 3.
[Figure 6: star with PERSON (age, income) at the centre and the dimensions HOME COMMUNE (name,
tax rate), SEX, AGE GROUP, INCOME BRACKET, and EMPLOYER (name), the latter with CATEGORY and
COMMUNE (tax rate).]
Figure 6. Star model corresponding to figure 2, S1-S3.
[Figure 7: star with MIGRATION at the centre and the dimensions DATE, TO COMMUNE (name, tax rate),
ABROAD?, and PERSON (age, income), the latter with HOME COMMUNE (name, tax rate), SEX, AGE GROUP,
INCOME BRACKET, and EMPLOYER (name) with CATEGORY and COMMUNE (tax rate).]
Figure 7. Star model corresponding to figure 2, S4-S6.
Statistical -Query By Example (SQBE-) and Statistical Query Specification Language (SQSL-)
2.4
In figure 8 the final observation register of figure 4, corresponding to the universe of discourse in
figure 3, has been used as the basis for a Statistical -Query By Example (SQBE-)
expressing


requests for estimated values of the six statistical characteristics S1-S6
definitions of the derived variables “to_lower_tax”, a Boolean variable, and “incbr”, a
classification of “income” called “cx”; both derived variables are needed in the specification
of the requests for estimates of S1-S6
Figure 8 also contains equivalent specifications in a non-procedural language called Statistical
-Query Specification Language (SQSL-), which goes back to the language INFOL as
described in Sundgren (1973) and Sundgren (1992).
The basic language elements and constructs can be defined as follows
(language element/construct, followed by its definition):
<statistical αβγτ-query>
    {<α-set of objects> by <vector of γ-variables>}.<vector of β-variables>.<statistical measure>
<α-set of objects>
    <object type> [<time reference>] [with <property>]
<property>
    [<path segment>]*<atomic property> | <valid expression of properties>
<path segment>
    <relation>.<object type> [<time reference>] [with <property>].
<atomic property>
    <variable><comparison operator><variable or value> |
    <vector of variables>.<aggregator><comparator><variable or value>
<variable>
    [<path segment>]*<atomic variable>[<time reference>] | <valid expression of variables>
γ-variable
    <variable that classifies the α-set of objects>
β-variable
    <variable of the α-objects, argument of the statistical measure>
<time reference>
    <expression in terms of time locators, atomic times, and the time parameter>
<time locator>
    at [the beginning of | the end of], during [the whole of],
    during [the former | the middle | the latter] part of
<atomic time>
    <point in time> | <time interval>
by, \
    <operator that uses the latter argument to (cross)classify the former>
with, |
    <operator that uses the latter argument to restrict the former to a subset>
[Figure 8, upper part: the SQBE-αβγτ form. The OV-matrixes of the final observation register
(default time = t) for MIGRATION, PERSON, COMMUNE, and EMPLOYER are shown with one row for each of
S1-S6 and for the definitions def1 and def2. The rows are filled in with classification markers (\),
statistical measures (count, average, corr, %), example variables (x, y, z, j, k, q, r), conditions
such as (q) < (r) and (j) < (k), and the derived variables to_lower_tax and incbr := inc\cx.
The equivalent SQSL-αβγτ specifications follow.]
S1: {PERSON\commune,sex,age}.income(t-1).avg;
S2: {PERSON\sex,age,incbr(t-1)}.count;
S3: {PERSON\commune,sex}.age(t),income(t-1).corr;
S4: {MIGRATION|abroad\sex,incbr(t-1)}.count;
S5: {MIGRATION|abroad, to_lower_tax\sex,incbr(t-1)}.count;
S6: {MIGRATION|abroad\sex,incbr(t-1)}.to_lower_tax.percentage;
where
incbr := income\cx;
to_lower_tax := TO.COMMUNE.tax_rate < MIGRATOR.PERSON.REGISTRATOR.COMMUNE.tax_rate;
Figure 8. Illustration of some queries in Statistical αβγτ-Query By Example (SQBE-αβγτ) and
Statistical αβγτ-Query Specification Language (SQSL-αβγτ).
References
Sundgren, Bo (1973): “An Infological Approach to Data Bases”, Urval no. 7, Statistics Sweden
and University of Stockholm 1973, ISBN 91-38-01750-4.
Sundgren, Bo (1975): “Theory of Data Bases”, Mason/Charter Publishers, New York 1975,
ISBN 0-88405-307-5.
Sundgren, Bo (1990): “Conceptual Modelling and Related Methods and Tools for Computer-Aided
Design of Information systems”, The Second International Conference on Information
Systems Developers Workbench, Gdansk, Poland, September 25 - 28, 1990.
Sundgren, Bo (1992): “Databasorienterad systemutveckling” (“Database-oriented systems
development”), In Swedish, Studentlitteratur, Lund, 1992.
Sundgren, Bo (1999): “Information Systems Architecture for National and International
Statistical Offices: Guidelines and Recommendations”, United Nations, Geneva, 1999.