Cube Dataset & Data Structure models

Examples of Cube Datasets
1. One single figure, e.g. the total population of the Netherlands on January 1st, 2012
2. The total population of the Netherlands, broken down by sex: two figures, no total.
3. One level only, e.g. the total population of the Netherlands on Jan 1st, 2012 by Province (but
without the total number for the whole of the Netherlands)
4. As 3, but including the total for the Netherlands
5. One of the cubes from the set of (60) cubes of the Census 2011. Ref Regulations (EC) No
763/2008 and 1201/2009
6. The Statline table “Population, key figures” as shown in the Annex
7. A cross-section or “slice” out of the Statline table, e.g. only the time series showing the “green
pressure” percentages by year.
In the IDP specification team we agreed that the whole set of 60 cubes of the Census 2011 is not
considered a Cube, but a set of (related) cubes. The specific relation between cubes we did not
model, but other constructs within GSIM may be used for that purpose.
Other examples
From the explanatory text of the information object Unit Data Point:
For example (1212123, 43) could be the age in years on the 1st of January 2012 of a person
(Unit) with the social security number 1212123. The social security number is an identifying
variable for the person whereas the age, in this example, is a variable measured on the 1st of
January 2012. The value can be obtained directly from the Unit or indirectly via a process of
some kind.
Where is the information “on the 1st of January 2012” held in the model? I.e. on which information
object, which attribute? Is it part of the definition of the (Instance, Represented) Variable? Or is it an
attribute of the Dataset or a DataAttribute of the UnitDataPoint itself? What if this DataPoint becomes
part of a larger Dataset that contains data about other moments in time, like a time series?
Some questions here are:

What happens (to the metadata) when a Dataset is split (by selection)?

What happens (to the metadata) when a Dataset is combined with another Dataset (join,
append, etc.) where DataPoints do not “overlap”?

What happens when two Datasets are combined that contain the same DataPoints (overlap),
but different quality or different source? (I.e. same id, same measure, different Data Attribute)

What happens (to the metadata) when data is aggregated?
Analysis of the examples
In example 1, there are no (apparent) dimensions, there’s just one measure, and one cell (Data Point):
the total number. This cell is “identified” by the label on the dataset, which suggests that in fact this
“cube” is a selection or cross-section from a larger cube having dimensions: region (selection: “The
Netherlands”, level: “Country”) and Time (selection: “Jan 1st, 2012”). It also suggests that it is an
aggregate (“total population”), which prompts the question about underlying details such as given by
the next examples.
Example 6 shows that the data may be percentages (e.g. “demographic pressure”) (so what …..?)
and that the various “topics” do not seem to have a common underlying level in the Dataset. This
makes the dataset just a convenient “packaging of data” that describe related phenomena rather than
a structurally and content-wise coherent set of data.
The Census cubes seem to be structured more or less as envisioned by the GSIM Cube Data
Structure. But the GSIM Data Structure contains nothing about the value domains of the data
(neither of dimensions nor of measures). Nor does it say anything about the leveling of the dimensions.
The GSIM Data Structure therefore is not sufficient for constructing a query other than “give me the
whole thing”, or best case: give me the marginals with this dimension “left out” (which may be
interpreted as “totaled” or rolled up using some other operation e.g. averaging).
Microdata vs Aggregated data
Using the following example (see footnote 1), it can be seen that from the column structure alone, one cannot
(always) make the distinction between microdata (describing units in a population) and aggregated
data (summarizing data about a population on some higher level of granularity).
First table:
Activity     | Size class | Turnover
Agriculture  | 10         | 12
Agriculture  | 10         | 24
Industry     | 10         | 72

Second table:
Activity     | Size class | Turnover
Agriculture  | 10         | 18
Industry     | 10         | 72

The first table is in fact a microdata table giving details about individual companies, whereas the
second table is aggregated data giving the average turnover per company by activity and by size
class.
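The relation between the two tables can be made explicit with a minimal sketch in Python. The tuple layout and the function name `average_turnover` are illustrative assumptions, not part of GSIM; the figures are those of the two tables above.

```python
from collections import defaultdict

# Microdata: one record per company (figures from the first table).
micro = [
    ("Agriculture", 10, 12),
    ("Agriculture", 10, 24),
    ("Industry", 10, 72),
]

def average_turnover(records):
    """Average turnover per (activity, size class) cell."""
    cells = defaultdict(list)
    for activity, size_class, turnover in records:
        cells[(activity, size_class)].append(turnover)
    return [(a, s, sum(v) / len(v)) for (a, s), v in sorted(cells.items())]

# Aggregated data: one record per (activity, size class) combination.
aggregate = average_turnover(micro)
```

Both `micro` and `aggregate` are lists of (activity, size class, turnover) tuples. Nothing in that column layout reveals that a row stands for a single company in one case and for a group of companies in the other.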
In order to say something about the nature of the data in the dataset, information is needed above and
beyond the column structure (i.e. the variables) of the dataset. One needs to know what is represented
by the rows (records). In other words, one needs to know the domain of the function that resulted in
the image represented by the dataset.
The question is whether this type of information is to be considered part of “structural” information (and
therefore needs to be part of the GSIM Data Structure). Since it is necessary for a correct
interpretation of the data, we would strongly argue that in fact it is. However, although we feel we are
close, at this point in time we do not yet have a satisfactory modeling solution for this problem.
Why alternative models?
The reason I believe there is a need for another way of modeling is that I am not satisfied with the
current information models. In my opinion, the current models (Cube Dataset and Data Structure)

Do not reflect the true nature of a Cube, which is structured for slicing and dicing, drill-down/up
and pivoting, let alone for (further) aggregation (roll-up). For instance, the current Data
Structure does not contain the information necessary for the construction of queries against
the Cube. It does not recognize the leveling of the dimensions and does not recognize the fact
that a cube usually contains multiple levels (aggregates and marginals) in one dataset.

Do not specify the “boundaries” of the data contained within a Cube Dataset. A selection out
of a Cube Dataset may have the same Data Structure as the cube that it was taken from. A
DataFlow that exposes a time series has the same structure as the individual Datasets that
contain the time slices that are produced one by one over time and gradually “added” to the
DataFlow (or maybe a new accumulative time series Dataset).

Do not show or explain the relation between micro-data and aggregate (by which I do not
mean the process model or even aggregation operations but the structural information such
as: this population figure was aggregated from this micro data. Conceptually, the relation is
there, even if the data is kept separated in different datasets).

Do not recognize the fact that an aggregate may be considered micro data by the user (each
of the examples above may be considered micro data or cross-sectional, partial cubes by
Eurostat).

Show, but do not explain the need for, a difference in structure between Cube and Unit data.

Footnote 1: Taken from: Tjalling Gelsema, The Organization of Information in a Statistical Office, Journal of
Official Statistics, Vol. 28, No. 3, 2012, p. 413-440.
The alternative models
Some of the shortfalls of the current model may be only due to the way we constructed the views.
There are relationships in the GSIM model that allow navigation to other types of meta information, but
this additional data is apparently not considered part of a Data Structure.
This leads me to the following questions
1. What is in fact a Data Structure?
2. What is the purpose of a Data Structure?
3. What does it need to contain in order to fulfill that purpose?
One of the things I believe a Data Structure must allow the user to do is to understand the data
sufficiently to be able to construct queries against it. Queries are specifications for sub-sets of the
data contained in the Dataset. For a user to be able to specify such subsets, he/she must know the
values (categories) of the various levels of the dimensions. If he/she is only interested in a part of the
data, e.g. a certain period or time slice or a certain regional cross-section, he/she must be able to indicate
the slice he/she wants by the correct values. These are categories in the underlying levels (sub-aggregates or micro data).
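As a minimal sketch of what this requirement implies, assume a Data Structure that lists the categories of each level of a dimension. The names `region_levels` and `validate_selection` are hypothetical, and the list of provinces is deliberately abbreviated.

```python
# Hypothetical description of one dimension: categories listed per level.
region_levels = {
    "Country": ["The Netherlands"],
    "Province": ["Groningen", "Utrecht", "Zeeland"],  # abbreviated, for illustration
}

def validate_selection(levels, level, categories):
    """Return the requested categories if they all exist at the given level."""
    known = levels.get(level)
    if known is None:
        raise KeyError(f"unknown level: {level}")
    unknown = [c for c in categories if c not in known]
    if unknown:
        raise ValueError(f"unknown categories at level {level}: {unknown}")
    return list(categories)

selection = validate_selection(region_levels, "Province", ["Utrecht", "Zeeland"])
```

Without the per-level category lists, a user has no way of knowing which values are valid in a selection, which is the gap in the current Data Structure described above.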
The use of aggregated, dimensional data as microdata by a user of a dataset prompts the question
whether that user is still considering the data in the same context as meant by the supplier. We tend
to believe that there is a change in population. Example: data given as the summary of population of
the Netherlands (a set of persons or households) may be used by Eurostat as information about the
Netherlands i.e. properties of a member of a set of countries. This is a more general observation:
users and suppliers may have different viewpoints with respect to the data and the question arises
whether that has an impact on the way the user perceives the structure of the data. We tend to
believe that such a change in view of necessity brings about a transformation of the structure and of
the data itself.
An attempt at definition
A Cube contains data pertaining to a Population. A Cube has dimensions. Each dimension is
associated with a Classification Scheme or possibly a Classification. A Classification has
Categories that are organized into Levels. Each Level contains a non-overlapping (disjoint) and
exhaustive set of Categories.
A Strict Cube is a Cube that contains data for a crossing of dimensions where each dimension is
restricted to just one Level of its Classification Scheme.
A Marginal is a Strict Cube that has at least one dimension whose Categories belong to a Level
higher than that of the underlying Strict Cube. A Marginal is therefore a role: it is a kind of
relationship between two Strict Cubes.
A Primary Marginal is a Marginal where there is just one dimension that is “rolled up” into its next
higher level. A Higher Level Marginal is a Marginal where one or more dimensions are rolled up to a
higher Level in the Classification.
A Strict Cube has as many Primary Marginals as it has dimensions.
A Strict Cube may be a Primary Marginal with respect to more than one underlying Strict Cube (see
figure).
A Strict Cube and all its Marginals are defined on one and the same Population, although described
(summarized) on different levels of granularity.
Question: If a dimension is rolled up to its highest level (“Total”), does it disappear from the Strict
Cube? Such a dimension cannot be rolled up any further.
A Cube Dataset may contain many Strict Cubes, that may or may not be related through Marginal
relationships.
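The definitions above can be illustrated with a small sketch in Python. It assumes a two-level region classification and an additive measure (a count), so that rolling up amounts to summing. The names `PROVINCE_TO_COUNTRY`, `SEX_TO_TOTAL` and `roll_up` are hypothetical and the cell values are made up.

```python
from collections import defaultdict

# Hypothetical parent relations: each Category maps to its Category one Level up.
PROVINCE_TO_COUNTRY = {"Utrecht": "The Netherlands", "Zeeland": "The Netherlands"}
SEX_TO_TOTAL = {"Male": "Total", "Female": "Total"}

# A Strict Cube: every cell key uses exactly one Level per dimension
# (here Province x Sex). The counts are made up for illustration.
strict_cube = {
    ("Utrecht", "Male"): 10,
    ("Utrecht", "Female"): 12,
    ("Zeeland", "Male"): 3,
    ("Zeeland", "Female"): 4,
}

def roll_up(cube, dim_index, parent_of):
    """Primary Marginal: roll one dimension up to its next higher Level by summing."""
    marginal = defaultdict(int)
    for key, value in cube.items():
        new_key = list(key)
        new_key[dim_index] = parent_of[key[dim_index]]
        marginal[tuple(new_key)] += value
    return dict(marginal)

# Two dimensions, hence two Primary Marginals.
by_sex = roll_up(strict_cube, 0, PROVINCE_TO_COUNTRY)
by_province = roll_up(strict_cube, 1, SEX_TO_TOTAL)
```

In this sketch a Marginal is simply the result of `roll_up` applied to another Strict Cube, which matches the reading of Marginal as a relationship between two Strict Cubes rather than as a separate kind of object.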
Aggregation in Cubes
[Figure: five Strict Cubes over the dimensions D1, D2 and D3. Annotations in the figure: (1) a Primary Marginal with relation to the big cube, with D1 rolled up one Level; (2) a secondary Marginal with respect to the big cube (with both D1 and D3 rolled up), which is a Primary Marginal with respect to both its two neighbours (with D1 or D3 rolled up, respectively).]
This figure shows 5 “Strict Cubes” that are related through aggregation.
The smaller cubes are (primary and secondary) Marginals of the big cube.
Question: Is there any reason to make a distinction between (a) Cube Datasets that contain Strict
Cubes that are all related through Marginal relationships (whether present in the Dataset or not) and
(b) Cube Datasets that contain Strict Cubes that do not (all) have such relationships? Is there any
reason to forbid the second type of Cube Datasets (a more or less “random” packaging of Cubes)?
Remark: The Statline Table shown in the annex shows a number of Marginals, but the underlying
Strict Cube is not part of the Dataset. So, although the Marginals are all defined on the same
Population, the relationship in the data is lost and may only be known through the metadata. This
example is still considered to be of type (a).
Assertion: A selection from a Cube Dataset is again a Cube Dataset. (Proof?)
Assertion: A selection from a Strict Cube such as a cross-section results in another Strict Cube, but
changes the Population that is being described. (Proof?)
Question: Does a selection across the Time-dimension constitute a “break” in the sense that the
Population is changed? If not, in what respect is the Time Dimension different from other types of
Dimensions?
Within one Cube, a dimension may refer to multiple Classification Schemes under one
Classification. It must be clear which Classification Scheme is valid for what part of the dimension.
Usually, the validity is related to the time dimension, which creates a relation between dimensions.
Remark: Contrary to the proposal above, it may be helpful to define a “supercube” as the crossing of
all dimensions, across all levels, and define the strict cubes and Marginals as selections from this
supercube.
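A minimal sketch of the supercube idea, assuming hypothetical Level names for two dimensions: each combination of one Level per dimension identifies one Strict Cube inside the supercube.

```python
from itertools import product

# Hypothetical Levels per dimension, ordered from lowest to highest.
levels = {
    "Region": ["Municipality", "Province", "Country"],
    "Sex": ["Sex", "Total"],
}

# Each combination of one Level per dimension identifies one Strict Cube
# inside the supercube.
strict_cubes = list(product(*levels.values()))

# The first combination uses the lowest Level of every dimension.
lowest = strict_cubes[0]
```

With three region Levels and two sex Levels the supercube contains 3 x 2 = 6 Strict Cubes; all but the lowest one are Marginals of some other Strict Cube in the list.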
Microdata
A micro data (Unit Data) record contains an identifying key (possibly consisting of more than one
element or variable), one or more classifying categories (dimensions) and zero or more characterizing
variables (measures).
Examples of data in the categorizing variables: date of birth, sex, address
Initial remarks:

Summary variables in the aggregates are different from the variables in the unit data. Income
of a person or household is different from the average or total income of a sub-population
(persons aggregated by region or by age class), even if their value domains are the same.

Each aggregation is the mapping of a variable on the lower level on a variable on the higher
level, a function.

Categorizing variables (dimensions) in the aggregates are different from the corresponding
variables in the micro data. Each level in a classification has its own value domain (the set of
categories for that Level). E.g. an address is not the same variable as the region in the various
levels (each with its own set of categories) of the region classification scheme. And a date of
birth (or even the age of a person) is not the same as an age (class)?

As a result, a dimension is not the role of a single (represented) variable, but in fact a set of
(represented) variables, one for each level in the classification scheme.

When counting units (number of) in an aggregate, in fact a new variable is introduced,
although it can be reasoned that this variable on the unit level is already implicitly present, as
a characteristic of the unit.

The identifying part of the unit data disappears in the aggregates. The unit is “assimilated” in
the sub-population. The unit-key is only relevant for the coupling of micro data (linking
distributed parts of the logical unit description). Note that distinguishing one unit from another
is not the purpose of a key; a key is only used for identification of a unit. A record separator
(e.g. CrLf) is sufficient for distinguishing one record from another.
Ref the question about top level dimensions that seem to disappear out of the Cube.
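Several of these remarks can be seen in one small sketch. The record layout, the age class boundaries and the names `MUNICIPALITY_TO_PROVINCE` and `age_class` are assumptions made for illustration only.

```python
from collections import Counter

# Hypothetical unit records: (person id, municipality, age in years).
persons = [
    (1, "Utrecht", 34),
    (2, "Utrecht", 71),
    (3, "Middelburg", 8),
]

# The region variable at Province level is a different variable than the
# municipality in the unit data; it is derived through a mapping.
MUNICIPALITY_TO_PROVINCE = {"Utrecht": "Utrecht", "Middelburg": "Zeeland"}

def age_class(age):
    """Map an age in years to a (hypothetical) age class."""
    return "0-19" if age < 20 else "20-64" if age < 65 else "65+"

# Aggregation: the identifying key is dropped and a new count variable appears.
counts = Counter(
    (MUNICIPALITY_TO_PROVINCE[municipality], age_class(age))
    for _person_id, municipality, age in persons
)
```

The person id plays no role in the result, the categorizing variables have been replaced by variables with their own value domains, and the count is a variable that did not exist explicitly at unit level.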
Consequences for the GSIM model

A Variable can be defined as a mapping, a function (ref “The Organization of information in a
Statistical Office”), and is therefore associated with both an input (domain) and an output
(codomain) (NL: resp. “Domein” and “Bereik”). For micro data, the domain is the (set of Units
within the) Population, the codomain is the Value Domain. For an aggregate, the domain is the
Value Domain of the Represented Variable on the lower level (either microdata or aggregate),
the codomain is the Value Domain of the aggregate Represented Variable itself. The
relationship with the underlying Population that the aggregate describes therefore is an
indirect one, but still part of the definition of the Variable.

A dimension is not the role of a single Represented Variable. A dimension is the application of
a Classification Scheme for the purpose of summarizing information about a Population. A
dimension has as many Represented Variables associated with it as the Classification
Scheme has Levels.

A measure is not the role of a single Represented Variable. A measure has as many
Represented Variables as there are separate Marginals associated with the lowest level Cube.
Each Marginal has its own unique set of Represented Variables. Not necessarily because it
has a different Value Domain (avg age in years of persons in the Population is still a positive
real or integer number), but because of the mapping.

The Represented Variables associated with a dimension or a measure are all based on the
same Variable, on the same Concept (and on the same Population). It therefore seems
unavoidable to include the Variable in the Data Structure. If the use of Represented Variable
for dimensions still is deemed necessary, their Value Domains must be defined on the sets of
Categories in the Classification Schemes (the Levels).

The scope of Data Structure must be extended to include all object types and relations that
are necessary for the interested parties to be able to understand (in detail) the structure and
content of a dataset. The Data Structure must be sufficient for someone interested to
construct a query against the data without having to rely on other kinds of information.

The Data Structure should be defined in terms of Strict Cubes and Marginals. For each
dimension, it should include Classifications, Classification Schemes (with indication of validity
of each Scheme in relation to the part of the Dataset for which it is valid).

Unit Data and Cube Data should be related in a more functional way. Also, as a structure, Unit
Data becomes a special case rather than a separate case.

UML diagrams still to be developed. Some of the concepts described before may turn out hard
to model (at least I do not see a simple solution) ….
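Pending the UML diagrams, the first consequence (a Variable as a function with a domain and a codomain) can be sketched as follows. The class `Variable` and its fields are hypothetical and do not correspond to any existing GSIM object definition; the person records are made up.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Any, Callable

@dataclass(frozen=True)
class Variable:
    """A variable seen as a function from a domain to a codomain (illustrative)."""
    name: str
    domain: str      # what the function is applied to
    codomain: str    # the Value Domain of the results
    mapping: Callable[[Any], Any]

# Micro level: the domain is the set of Units in the Population.
age = Variable(
    name="age",
    domain="Persons (Units in the Population)",
    codomain="Age in whole years",
    mapping=lambda person: person["age"],
)

# Aggregate level: the domain is the Value Domain of the lower-level variable.
average_age = Variable(
    name="average age",
    domain=age.codomain,
    codomain="Age in years (real number)",
    mapping=lambda ages: mean(ages),
)

persons = [{"id": 1, "age": 34}, {"id": 2, "age": 71}, {"id": 3, "age": 9}]
ages = [age.mapping(p) for p in persons]
result = average_age.mapping(ages)
```

The aggregate variable refers to the Population only indirectly, through the variable it is defined on, which is the indirect relationship described in the first point above.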
Additional remarks:

A Value Domain is associated with a Concept once it is applied to a Population Type.
Example: Sex of a person vs gender of a noun (NL both “geslacht”) or age of a mayfly (NL:
“eendagsvlieg”) vs age of a person vs age of the universe.
Open questions

How to deal with “ragged” classifications, i.e. classifications where certain branches have
“missing” levels. Example: countries with and without ‘state’ level.
A special case of this is the combination of cells due to statistical security (non-disclosure).

How to deal with ‘shared’ dimensions, i.e. dimensions that are used in multiple cubes and
datasets. Common examples are: Time and Region. These have been standardized in
Statline. Ref Wikipedia: ‘Conformed Dimension’

How to deal with Measures that cannot be aggregated for all dimensions? Ref Wikipedia: Non-additive or Semi-additive measures.

How to deal with parallel classifications (sometimes considered non-hierarchical
classifications) like region classifications where (at least in the Netherlands) water board
districts often cross province boundaries.
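The question about non-additive measures can be illustrated with the turnover figures from the microdata example earlier in this document. The sketch assumes that the number of companies behind each average is available; the variable names are illustrative.

```python
# Average turnover per activity with the number of companies behind each cell
# (derived from the microdata example: 12 and 24 for Agriculture, 72 for Industry).
cells = {"Agriculture": (18.0, 2), "Industry": (72.0, 1)}  # (average, count)

# Naive roll-up: averaging the averages ignores the cell sizes.
naive = sum(avg for avg, _ in cells.values()) / len(cells)

# Roll-up that respects the cell sizes: weight each average by its count.
total = sum(avg * n for avg, n in cells.values())
count = sum(n for _, n in cells.values())
weighted = total / count
```

Here `naive` is 45.0 while `weighted` is 36.0, which equals (12 + 24 + 72) / 3 computed directly from the microdata. An average can therefore only be rolled up if the underlying counts travel with it, something the Data Structure would have to express.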
On the nature of data
A number of review comments on GSIM V0.8 have to do with the distinction that has been made
between Unit data and Cube data, where the first has been associated with microdata and the second
with “aggregated” data. It is contended that non-aggregated (micro) data can be structured as Cube
data. It seems therefore that there is a distinction between the nature of data and the way that data is
structured or represented.
In this chapter we discuss the true nature of data and thereby try to discover the inherent similarities
and differences between aggregated and non-aggregated data.
Non-aggregated (micro) data

Is about individual units in the population

May even be about still smaller pieces of information than we are actually interested in, like
individual sales transactions where we want the turnover (in a certain period). Adding up or
otherwise calculating the observation value we are interested in is not considered
“aggregation” (but what do we call this, then?). Ref Wikipedia: Degenerate dimensions.
(Is the more detailed data about a different Population, different Unit or just multiple events or
properties of the unit under observation?)
Aggregated data

Is about clusters of units, sub-populations (defined by the Cartesian product of dimensions)

Is summarized data, calculated (estimated) from the unit observations in the cluster

Therefore, by necessity, needs dimensions and classifications. Only microdata that contains
categorical variables can be aggregated.
On populations
A population is defined by describing the properties that distinguish members of the population from
those outside. Usually, the definition includes properties that members must have, but non-members
must not have.
As all members of a population have these defining properties in common (otherwise they would not
be a member), it is normally not deemed necessary to include these properties in the description of
individual members. However, if a population is divided into sub-populations, the sub-populations are
defined using properties that the members of the original population have, and that are available in the
descriptions of each individual member. But as soon as the sub-population is formed, the particular
properties that are added (as additional discriminating properties) to the definition of the sub-population, are no longer relevant for describing the members. This means that these properties
move from the unit level to the population level and are included in the (meta)data of the population.
This “moving up” of properties is confusing. It makes things difficult, for instance when taking together
datasets that describe different populations to form a new population. Example: collecting data from
different regions (countries) into a dataset describing a bigger region. Now all of a sudden a property
of the original populations needs to “descend” from the definition of the population to the description of
the individual units.
Usually, the properties in question are part of a classification. In the sub-population, the classification
used is “partial” in that it ignores the higher levels, those describing the world outside of the population
of relevance.
In order to tackle this “problem”, maybe classifications need to be extendable, and the definition of
population based on the dimensions and their classifications?
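The “descending” of a population-level property when datasets are taken together can be sketched as follows. The dataset layout and the function name `combine` are assumptions made for illustration; the records are made up.

```python
# Two datasets, each describing one Population; the country is held as
# dataset-level metadata, not in the unit records (values are illustrative).
datasets = [
    {"population": {"country": "NL"}, "units": [{"id": 1, "age": 34}]},
    {"population": {"country": "BE"}, "units": [{"id": 1, "age": 52}]},
]

def combine(datasets, prop):
    """Merge datasets; the defining property descends to the unit level."""
    units = []
    for ds in datasets:
        for unit in ds["units"]:
            units.append({**unit, prop: ds["population"][prop]})
    return units

combined = combine(datasets, "country")
```

Note that the two unit records carry the same id. Only after the country has moved down to the unit level can they be told apart, which shows that the property is needed in the description of the units of the new, bigger population.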
On dimensions
The structural dimensions of a cube have to do with the nature and meaning of the data. In addition,
there may be other reasons to introduce additional dimensions, for instance Measure
dimensions and Attribute dimensions. These are for physical efficiency purposes only and have no
place in a conceptual model. They do not play a role in identifying the cells in the Cube. A Measure
dimension is often used in a relational star schema, in order to handle a varying number of Measures
in a fact record. It causes the different variables to be handled as key-value pairs. The drawback is the
duplication of the foreign keys (dimensions) for each value. An Attribute dimension is used for the
same reason and helps distinguish the attribute from the measure.
Measure dimensions
In SDMX, it is common practice to use a “Measure dimension” in cases where there is more than one
Measure in a dataset. This may be seen as a simple trick to handle a varying number of measures (a
problem in relational databases), by exchanging “columns” for “rows”. This has no impact on the
meaning and value of the data, but in fact it does change the structure of the dataset. The resulting
“observation value” column becomes a very awkward type of variable, since data type and unit of
measure may vary with the value of the category in the Measure dimension. Like SDMX, the current
GSIM does not dictate this kind of usage, but neither does it warn against it.
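The exchange of “columns” for “rows” can be sketched in a few lines of Python. The column names, the function `to_measure_dimension` and the values are made up for illustration and do not follow any particular SDMX structure definition.

```python
# Wide layout: one column per Measure (values are made up for illustration).
wide = [
    {"region": "Utrecht", "year": 2012, "population": 1000, "green_pressure_pct": 24.5},
]

MEASURES = ["population", "green_pressure_pct"]

def to_measure_dimension(rows, measures):
    """Exchange measure columns for rows keyed by a Measure dimension."""
    long_rows = []
    for row in rows:
        # The dimension values are repeated for every measure.
        keys = {k: v for k, v in row.items() if k not in measures}
        for m in measures:
            long_rows.append({**keys, "measure": m, "obs_value": row[m]})
    return long_rows

long_rows = to_measure_dimension(wide, MEASURES)
```

The single wide record becomes two records in which region and year are duplicated, and the `obs_value` column now holds a count of persons in one row and a percentage in the next. This is the awkward type of variable referred to above.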
On attributes
It is clear that there is a practical need to be able to attach additional information to the data in
datasets. The current GSIM deliberately handles this in a different way than SDMX does. But the
current model is too vague and leaves room for incorrect implementation and misuse. GSIM should be
modeled to promote a correct and standardized way of handling attribute information. This includes
being able to determine the correct type and meaning of each attribute. In particular, GSIM should
prevent Attributes from being used as Measures and vice versa. By not being strict about this,
GSIM may lead to different practices and thereby hinder instead of help. It should be made clear in which
areas GSIM is in danger of being obscure.
Annex: Example from Statline (Population, key figures)
The following screenshot shows an actual Statline table, as shown to the user on the screen.
This may be more of a presentation issue: the separation of “Topics” and “Periods” does not (necessarily) reflect the distinction between “measures” and “dimensions”. “Topics” are
not “measures”.
Surprisingly, this table does not give any regional breakdowns …