Unit vs Dimensional Data

Summary of Findings
The lines of thought and the explanations set out below are different in nature to those I expected
to present when I first set out to explore in more detail
• the underlying differences between Unit and Dimensional perspectives on data, and
• whether or not these differences are significant enough, in conceptual terms, to warrant
  retention of current divisions into “Unit” and “Dimensional” concrete classes in GSIM (eg for
  DataStructures, DataSets, Components).
Having undertaken the analysis, I am now more convinced, for conceptual and practical reasons, that
the current division is – on balance – appropriate.
On the other hand, I will not assert there is an absolutely watertight and incontestable case for the
division.
It is proposed within the analysis that, not surprisingly, Unit data relates to individual Units.
An early barrier to exploring whether Unit and Dimensional data should remain differentiated in
GSIM, however, was arriving at a similarly clear and unambiguous definition - for the purposes of
comparison and analysis - of what Dimensional data relates to.
My recollection is that during the GSIM V0.8 to GSIM V1.0 process there was a common view that
the distinction between Unit data and Dimensional data was not about microdata vs aggregate data.
Nevertheless, 75% of definitions in GSIM V1.0 related to DimensionalDataSets and
DimensionalDataStructures mention the word “Aggregate”.
In seeking to arrive at a distinction that could be used for comparison and analysis, I started
wondering “What is Dimensional data ‘about’?”. For the purposes of this analysis, in line with long
standing sources quoted below, I propose that Dimensional data refers to (sub) Populations, rather
than individual Units. In selecting this working definition, it is recognised that it is possible to
have subpopulations that consist of zero or one Unit – or even to engineer the design of a
DimensionalDataSet such that every subpopulation identified within the DimensionalDataSet is
guaranteed to correspond to an individual Unit.
While this line of thinking is not reflected consistently by definitions in GSIM V1.0, it is strongly
supported by the definition of DimensionalMeasureComponent.
A Represented Variable that has been given a role in a collection of aggregated data to hold
the summary values (means, mode, total, index, etc.) for a specific sub-population.
If this line of thinking is accepted then I believe it is reasonable to consider Dimensional data as
relating to “summary” (or aggregate) data for subpopulations.
A concept like the “population of Australia” could be considered as a summary (count) of the
persons (or person records) for the subpopulation identified as “All Persons”, “All Ages”, “All
Regions” from a dimensional dataset structured as Sex x Age x Region. It could equally well be
considered as the measure of a property of Australia as a top level administrative Unit (Country)
within the world. This means that a number given for “Population of Australia” might be considered
as a summary or might be considered as a Unit measure – depending on context/perspective.
Similarly my age can be considered a count of the number of years I’ve been alive (an
aggregate/summary) or simply an attribute of me as a Unit (of UnitType Person).
We can then note that Units and Populations are different (but related) objects within GSIM.
From the difference “about Units” compared to “about (sub) Populations”, as detailed in this report,
there appear to flow a number of differences in what can be done (from a mathematical/statistical
perspective) with Unit data compared with Dimensional data. Examples include
• “Finest grain” (sub) Populations limit the operations that are possible with Dimensional data
• The ability to interrelate data from different records/datasets is different for Units compared
  with (sub) Populations
• Relationships between Units and relationships between (sub) Populations are different in
  nature.
It may be possible to arrive at an alternative conceptual definition and model of “Dimensional” data
which sidesteps the above dichotomy between “about (sub) Populations” and “about Units” (for
Unit data). Even if that turns out to be the case, however, the differentiation appears likely to apply,
and be significant, in the majority of cases. In other words, the above distinction may be a
reasonable and useful basis for differentiation on heuristic grounds even if it is considered not to be
completely beyond doubt (or completely satisfying) from a pure conceptual perspective.
It remains essential to recognise that in many cases the same “physical” set of data (eg codes and
numbers stored in a relational database) could be viewed from both a Unit and a Dimensional
perspective. Para 94 of the GSIM V1.0 specification, for example, notes that “unit data” and
“dimensional data” are different perspectives on data and that although not typically the case, the
same set of data could be described both ways.
In fact, as illustrated in the section of this document titled “Edge Cases”
• DimensionalDataStructures can be used to describe data (eg the contents of a business
  register) that would typically be considered as relating to Units
• UnitDataStructures can be used to describe data that would typically be considered as
  relating to a Population
It is not proposed that it is “wrong” to make such choices – depending on the context – but it is
proposed that such edge cases
• constitute different (and atypical) ways of “looking at” the data concerned from a
  conceptual perspective, and
• the choice of perspective – and data description based on perspective - can influence what it
  is possible to do with the data in practice
In addition, under “Edge Cases”, it is highlighted
• that there can be multiple Unit perspectives on the same set of data and multiple
  Dimensional perspectives on the same set of data
• that, once again, the choice of perspective can influence what it is possible to do with the
  data in practice
Overall, these “edge cases” further highlight that the Unit vs Dimensional question is primarily about
how we choose to “think about” and characterise a particular set of data, and not so much about the
physical representation and implementation of the underlying data. For me this highlights that the
differentiation belongs in a conceptual (or, at a minimum, logical) characterisation of data rather
than at the physical implementation level.
If the “about Units” versus “about (sub) Populations” basis for differentiation is broadly accepted,
this may open the way for “tightening” some definitions in GSIM (eg related to Identifier, Measure
and Attribute Components for Unit and Dimensional data) and for providing additional guidance in
the User Guide on applying these concepts to design of Data Structures and Data Sets.
If the basis for differentiation is broadly accepted, the GSIM Implementation Group may wish to
consider whether the possible “tightening” of definitions for GSIM V1.1 can be pursued over coming
weeks and during the Sprint.
Structure of this report
The report starts by setting out the basic thesis. This includes referring back to discussions of
microdata and macrodata in the UNECE Guidelines For The Modeling Of Statistical Data And
Metadata from 1995 and suggesting that
• the definitions of “microdata” and Unit Data are close matches
• the definitions of “macrodata” and Dimensional Data are not necessarily as close a match,
  but the extent of difference is ambiguous based on definitions used in GSIM V1.0 and the
  majority of instances appear likely to match in practice.
While matching (with a possible degree of imprecision) the two sets of definitions appeared more
questionable to me at first, I gained greater confidence in the practical validity when the resulting
framework appeared to apply very naturally to the discussion and analysis of examples, including
edge cases.
The report then considers further the thesis that Dimensional Data is “about” sub Populations rather
than Units, including implications which explain some of the observed differences between what is
typically modelled, and what is typically possible, with Unit/microdata compared with
Dimensional/aggregate data.
Topics considered include
• Consideration of the way “finest grain” sub Populations limit the operations which are
  possible with Dimensional data
• Differences when combining data from multiple DimensionalDataSets compared with
  combining data from multiple UnitDataSets
• Differences in relationships between Units and relationships between sub Populations
• Edge cases
  o Using DimensionalDataStructures to describe data about Units
  o Using UnitDataStructures to describe “aggregate” data
  o Different Dimensional perspectives on the same data
  o Different Unit perspectives on the same data
Rather than documenting overall conclusions at the end, these have been presented at the start of
the report within the Summary of Findings.
Modelling statistical data, including identifying “what the data is about”
Why start here?
After analysing a number of examples, it seems possible the clearest “top down” explanation of the
distinction between Unit perspectives and Dimensional perspectives on data is rooted in the classics
– namely the work of Bo Sundgren on microdata and macrodata.
Source
For the purposes of this section of the report, I have drawn on Section 1.2 of the Guidelines For The
Modeling Of Statistical Data And Metadata
where 1.2.1 discusses microdata and 1.2.2 discusses macrodata.
These UNECE guidelines date from 1995 and are still referenced by Part B of the METIS Common
Metadata Framework. The preface records
Statistics Sweden has been responsible for preparing the material. The work was conducted
under the direction of Professor Bo Sundgren.
Bo’s modelling
What led me back to Bo’s work was trying to find a way of saying that a key difference between unit
and dimensional views of data (and varying unit views of the same data and varying dimensional
views of the same data) is what the data is considered to be “about”.
Microdata
Microdata are the result of observations or measurements of a set of object characteristics. An
object characteristic can be formalized as an ordered pair
Co = O(t).V(t)
where
(i) O is an object type;
(ii) V is a variable;
(iii) t is a time parameter.
Macrodata
Macrodata, in daily talk simply referred to as "statistics", are the result of estimations of a set of
statistical characteristics (statistical concepts).
A statistical characteristic can be formalised as a triple
Cs = O(t).V(t).f
where
(i) O(t).V(t) is an object characteristic;
(ii) f is a statistical measure, that is, an aggregation function (count, sum,
average, correlation, etc) summarizing the true values of V(t) for the objects in
O(t).
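As an illustrative sketch only (the object identifiers, income values and function name are hypothetical and are not drawn from the UNECE guidelines), the two formalisations can be contrasted in Python. Microdata holds a value of V(t) for each object in O(t), while macrodata applies an aggregation function f across those values.

```python
from statistics import mean

# Hypothetical microdata: one object characteristic O(t).V(t) per object,
# here V = income for objects of type Person at a single reference time t.
income_at_t = {"person_1": 50, "person_2": 70, "person_3": 90}


def statistical_characteristic(values_by_object, f):
    """Apply an aggregation function f to the values of V(t) for the objects in O(t)."""
    return f(list(values_by_object.values()))


# Macrodata: O(t).V(t).f for three different choices of f.
count_of_persons = statistical_characteristic(income_at_t, len)
total_income = statistical_characteristic(income_at_t, sum)
mean_income = statistical_characteristic(income_at_t, mean)
```

Each macrodata value here describes the set of persons as a whole, not any one person.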
Discussion
As an aside, an interesting semantic difference between the two definitions is that microdata is said
to refer to “observations” (or “measurements”) where macrodata is said to refer to “statistics” (or
“estimates”).
A key (and possibly related) difference is that the “object” [loosely O(t), but debatable] for
macrodata is described in terms of “objects” or “a population of objects existing at/during
time t1” whereas microdata (eg a GSIM Unit Data Record) relates to a specific object [Unit in the GSIM
definition of Unit Data Record].
Is there a significant conceptual difference between “Microdata” and “Unit Data”?
An early question is whether what Sundgren means by “microdata” corresponds with what GSIM
V1.0 means by “unit data”.
I would contend they are close enough.
For example, GSIM defines a Unit Data Point as a placeholder for the value of a particular Instance
Variable with respect to a given Unit.
For microdata, Sundgren talks about “object characteristics” where this is an ordered pair of an
object identifier and a variable at a particular point in time. Elsewhere in the UNECE guidelines
“statistical units” is used as a synonym for “objects”, so where Sundgren refers to “object” I think the
fit with (statistical) Unit in GSIM is reasonable.
Is there a significant conceptual difference between “Macrodata” and “Dimensional Data”?
I find this a less straightforward question to answer than the previous one.
A primary reason is that definitions related to “Dimensional” data in GSIM V1.0 appear somewhat
inconsistent with each other.
My recollection (which may be faulty) is that during the GSIM V0.8 to GSIM V1.0 process we agreed
the distinction between Unit data and Dimensional data was not about microdata vs aggregate data.
We agreed, instead, that it was about particular “perspectives” on data. After the analysis in this
report, however, I now wonder whether the underlying reason for choosing one perspective or the
other typically boils down to whether we wish to consider a particular set of data from a unit or
aggregate/summary perspective.
In any case, in GSIM V1.0 we have
Object: Dimensional Data Set
    Definition in Glossary: A collection of aggregated data that conforms to a known structure
    Definition in UML: A collection of aggregated data that conforms to a known structure
Object: Dimensional Data Structure
    Definition: Defines the structure of a collection of aggregated data by Represented Variables
    (in their respective roles as Dimensional Measure Components, Dimensional Attribute Component
    or Dimensional Identifier Components) and their Value Domains
75% of the definitions refer to “aggregated”. Even the definition that doesn’t is associated with
“aggregated data” as a synonym.
Even if the GSIM Implementation Group wished to consider that Dimensional Data is not “by
definition” aggregated, I expect that in a substantial majority of practical cases “Dimensional” data
corresponds to aggregated data/macrodata.
In addition, it is impossible to assess the exact extent and significance of any difference between
“macrodata” and any alternative definition of “Dimensional” data until a more detailed alternative
definition of “Dimensional” is tabled.
Considering Dimensional Data as about sub Populations rather than Units
I agree with Professor Sundgren that macrodata – and all dimensional data depending on definition
– can typically be considered to be “about” (sub) Populations - rather than Units (recognising that
some of the subpopulations may consist of 1 Unit (or 0 Units)).
The fact GSIM does not consider Population and Unit to be the same thing may suggest a possible
significant difference between microdata and macrodata (to use the 1995 terms). It remains
necessary, however, to demonstrate – as explored in the subsequent sections – that this makes a
(significant enough) difference in practice.
Firstly, it is interesting that GSIM does not define a Dimensional Data Record as a counterpart for the
Unit Data Record. I don’t think this is because such an entity cannot be defined, just that it is
typically “less interesting” than a Unit Data Record because a Dimensional Data Record does not
refer to an individual Unit (in the common sense). A Dimensional Data Record could, however, be
visualised as something like a single row in a database table holding dimensional data.
The GSIM V1.0 definition of Unit Data Point is
A placeholder in a Unit Data Record to contain the value (Datum) for an Instance Variable
with respect to a given Unit.
The GSIM V1.0 definition of Dimensional Data Point is
A placeholder or cell in a Dimensional Data Set determined by the crossing of (all) the values
for the Identifier Components to contain the value (Datum) for an Instance Variable (defined
by a Measure Component) with respect to a given Unit.
As an aside, I’d question what Unit is likely to be associated with most Dimensional Data Points.
More significantly, however, why does the Unit Data Point definition not refer to Identifier
Components?
A Unit Identifier Component is
The role that has been given to a Represented Variable, in a Unit Data Structure, to identify
the Unit
I suspect the reason Identifier Components are not mentioned in the definition of Unit Data Point is
because for Unit Data, once we know the Identifier Components we consider we have direct
identification of the Unit and we then feel comfortable talking in terms of the Unit rather than
Identifier Components.
In the case of Dimensional Data (at least if it is macrodata) the Identifier Components actually
identify a specific sub-Population rather than a Unit. The Dimensional Data Point then refers to the
value (Datum) for a Measure Component for the specific sub Population which is identified through
the specific combination of the values of the Identifier Components.
While there are edge cases that are explored further, in general the concept of “about a unit” vs
“about a sub population” seems to explain a lot of the difference in GSIM V1.0 between the
modelling of Unit Data and Dimensional Data.
It is recognised, however, that the concept “about a sub population” is not explicit in most of the
current definitions associated with Dimensional data. One exception is the definition of
DimensionalMeasureComponent.
A Represented Variable that has been given a role in a collection of aggregated data to hold
the summary values (means, mode, total, index, etc.) for a specific sub-population.
Observed differences between what is typically modelled, and what is typically possible, with Unit compared with Dimensional data.
“Finest grain” subpopulations limit the operations that are possible with dimensional data
Fundamentally, once you are dealing with the “finest grain” subpopulation identified by a particular
set of DimensionalIdentifierComponents then, unless you have an added ability to “drill down” to the
Unit Data Records associated with that subpopulation, you cannot further differentiate or analyse
the individual members of that subpopulation.
The idea of “finest grain” is important. In a DimensionalDataSet where the IdentifierComponents are
Occupation and Sex, it may be possible to further differentiate the subpopulation of “Medical
Practitioners” to consider “Male Medical Practitioners” or “Surgeons” (or “Male Surgeons”) by
“drilling down” on various dimensions. At some point, however, a “finest grain” will be reached.
If the measures available are a count of the subpopulation and a total income then it will be possible
to work out the mean income of “Male Surgeons” but not the median income (which would require
the associated Unit Data Records).
Similarly, even where Age is an IdentifierComponent, it will not be possible to identify the most
prevalent star sign for Male Surgeons – although this may be possible from Unit Data Records that
contain dates of birth.
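The mean versus median point can be illustrated with a minimal sketch, assuming a hypothetical finest grain cell for “Male Surgeons” that holds only a count and a total income. All figures are invented for illustration. The mean follows directly from the cell, whereas two different sets of Unit Data Records consistent with that same cell can have different medians.

```python
from statistics import median

# Hypothetical Dimensional Data Point for the finest grain subpopulation
# "Male Surgeons": only the summary measures are available.
male_surgeons_cell = {"count": 4, "total_income": 1000}

# The mean can be derived from the two measures held in the cell.
mean_income = male_surgeons_cell["total_income"] / male_surgeons_cell["count"]

# Two hypothetical sets of Unit Data Records that would both produce that
# same cell (count 4, total 1000) but have different medians.
unit_incomes_a = [250, 250, 250, 250]
unit_incomes_b = [100, 100, 300, 500]

same_cell = (
    len(unit_incomes_a) == len(unit_incomes_b) == male_surgeons_cell["count"]
    and sum(unit_incomes_a) == sum(unit_incomes_b) == male_surgeons_cell["total_income"]
)
median_a = median(unit_incomes_a)
median_b = median(unit_incomes_b)
```

Because both sets of records collapse to the same cell, nothing held in the cell can distinguish between their medians.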
It is not always the case, either, that measures for “coarser grain” populations can be derived from
finest grain populations, depending on the type of measure. For example, if I know the median
income for Male Surgeons and for Female Surgeons, I cannot derive the median income for Surgeons
as a whole. (If I only know the mean income of Male Surgeons and of Female Surgeons then I can’t
work out the mean income for Surgeons as a whole either, but if I have counts as well then I can – at
least if I set aside consideration of statistical error in estimation).
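A minimal sketch of this asymmetry follows, using invented figures and a hypothetical helper named pooled_mean, and setting aside estimation error as above. Means for finer grain subpopulations combine once counts are known, but medians do not.

```python
from statistics import median


def pooled_mean(cells):
    """Combine subpopulation means into the mean of their union, weighting by count."""
    total = sum(cell["mean"] * cell["count"] for cell in cells)
    return total / sum(cell["count"] for cell in cells)


# Hypothetical cells for Male Surgeons and Female Surgeons.
cells = [{"mean": 200.0, "count": 3}, {"mean": 300.0, "count": 1}]
surgeons_mean = pooled_mean(cells)  # weighted by count, not a simple average

# Medians do not combine this way. Two hypothetical sets of unit incomes with
# identical counts and identical medians for each sex still give different
# overall medians.
male_a, female_a = [100, 200, 300], [150, 250, 350]
male_b, female_b = [200, 200, 1000], [250, 250, 1000]
overall_median_a = median(male_a + female_a)
overall_median_b = median(male_b + female_b)
```

In both scenarios the male median is 200 and the female median is 250, yet the median for Surgeons as a whole differs between them.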
Combining data from multiple datasets
Linking (and then, typically, combining for the purposes of analysis) UnitData typically consists of
concluding that two different UnitDataRecords, in two different UnitDataSets, are referring to the
same Unit. This may be because the IdentifierComponents are the same - or because they can be
demonstrated to be equivalent (eg via a third source that correlates the IDs). Alternatively,
matching may be probabilistic, based (eg) on values of a number of MeasureComponents.
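As a rough sketch of the deterministic case only, the following assumes two hypothetical UnitDataSets with different identifier schemes and a hypothetical third source that correlates the IDs. The dataset names, fields and link_units function are all illustrative, and probabilistic matching is not attempted.

```python
# Hypothetical UnitDataSets keyed by different identifier schemes.
tax_records = {"T100": {"income": 52000}, "T200": {"income": 61000}}
health_records = {"H9": {"visits": 3}, "H7": {"visits": 1}}

# Hypothetical third source correlating the two sets of IDs.
concordance = {"T100": "H7", "T200": "H9"}


def link_units(left, right, id_map):
    """Combine records judged to refer to the same Unit via an ID concordance."""
    linked = {}
    for left_id, right_id in id_map.items():
        if left_id in left and right_id in right:
            linked[left_id] = {**left[left_id], **right[right_id]}
    return linked


linked_records = link_units(tax_records, health_records, concordance)
```

The result is one combined record per Unit, which is the sense in which linking Unit data is about identifying “the same” Unit.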
Arguably, even longitudinal data linking tends to be about relating corresponding observations for
“the same” Unit over time. (The point can get philosophical; a “river of life” perspective would
argue that the person you are surveying with one question at one moment is not the same person
you are surveying with the next question the next moment – that’s within a single study, let alone
longitudinally).
Linking DimensionalData sometimes consists of ensuring the reference is to the same subpopulation.
In many of these cases, in practice, IdentifierComponents will not match. For example, an Australian
Population Census DataSet for 2011 may not include Time as a dimensional component at all, and
will use the code “0” for Australia in the spatial dimension. Data from an international time series
dataset might refer to (“close enough to”) the same subpopulation but the combination of
IdentifierComponents will almost certainly be different (eg an explicit time dimension and use of the
code “au” for Australia).
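The mismatch can be sketched as follows. The cell layouts, the code mappings and the harmonise function are assumptions made for illustration, and deciding that the two subpopulations really are “close enough” remains a judgement the code cannot make.

```python
# Hypothetical cell from a 2011 census dataset: no Time dimension, Australia coded "0".
census_cell = {"region": "0", "sex": "Persons"}
# Hypothetical cell from an international time series: explicit time, Australia coded "au".
international_cell = {"time": "2011", "region": "au", "sex": "Persons"}

# Assumed mappings from each dataset's region codes to a shared label.
census_regions = {"0": "Australia"}
international_regions = {"au": "Australia"}


def harmonise(cell, region_codes, implicit_time=None):
    """Return a comparable (time, region, sex) key describing the subpopulation."""
    return (cell.get("time", implicit_time), region_codes[cell["region"]], cell["sex"])


# The raw IdentifierComponents do not match...
raw_match = census_cell == international_cell
# ...but the harmonised descriptions of the subpopulation do.
same_subpopulation = (
    harmonise(census_cell, census_regions, implicit_time="2011")
    == harmonise(international_cell, international_regions)
)
```

The implicit reference time and the code mappings have to be supplied from outside the data, which is part of what makes this kind of linking different from matching Unit identifiers.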
More typically, however, different DimensionalDataSets are combined to obtain information on
“related” subpopulations. A very common example, cited on Page 8 of the UNECE Guidelines,
relates to analysing “corresponding” subpopulations over time.
If I seek to combine data from a 1991 Population Census DimensionalDataSet and a 1996 Population
Census DimensionalDataSet, and I am interested in the characteristics of 15-19 Year Old Males in the
ACT (Australian Capital Territory), the subpopulation in 1991 and the subpopulation in 1996 should
consist entirely of different Units. (Any person who was 15 years old on 6 August 1991 should have
been 20 years old by 6 August 1996.)
Nevertheless, it would not be unusual to wish to explore characteristics of the two subpopulations
which might appear equivalent in terms of their IdentifierComponents. (Time is not usually an
explicit dimension in Australian Population Census Datasets – except for Time Series Datasets.)
The above example includes an added complication, however, because the ACT changed its
definition between 1991 (when Jervis Bay was included) and 1996 (when it wasn’t). Similar things
can happen in terms of other dimensions over time (eg different scopes for some “seemingly
equivalent” Industry Divisions over different versions of the Industry Classification). This indicates
some of the particular risks and issues with “combining” based on similarity of subpopulations.
The other common example is analysing subpopulations which correspond to each other in all but
(ideally) one regard - other than time. An example would be comparing the characteristics of 15-19
Year Old Males in Germany compared with Australia. If one DimensionalDataSet comes from
Germany and the other from Australia then, unless there is harmonisation in advance such as
reporting against an internationally agreed SDMX Data Structure Definition, gauging the exact extent
of similarity (and difference) in regard to the subpopulations is likely to be particularly exacting.
Relationships
There are significant relationships between subpopulations for DimensionalDataSets based on
dimensionality. For example
• Superset:subset (eg Medical Practitioners:Male Medical Practitioners or Medical
  Practitioners:Surgeons)
• Differing by one Dimension (eg Males 15-19 living in Germany:Males 15-19 living in
  Australia)
As Units are individual “things” (people, businesses, events etc), however, they tend (as illustrated
below) to have more specific relationships, including with Units of different Unit Types. These
relationships are typically described through RecordRelationship. RecordRelationship is a construct
which is not associated with DimensionalDataStructures in GSIM V1.0 (nor, as far as I am aware, in
modelling outside GSIM).
Typically for DimensionalDataSets, Superset:Subset relationships work basically the same way
regardless of which of the DimensionalIdentifierComponents you drill down on (or roll up on).
RecordRelationships for UnitDataSets, however, are not such a “generic” mechanism.
In Population Census data, I can have a subpopulation of Males 40-44 living in ACT in 2006 and
Males 15-19 living in ACT in 2006. These two subpopulations can be seen as having a “differing by
one Dimension” relationship with each other. If I had access to Unit Records underpinning this
dimensional data, however, I may discover additional relationships such as
• In some cases a member of the first subpopulation lives in the same dwelling as a member
  of the second subpopulation
• In a subset of these cases, the member of the first subpopulation is the father of the
  member of the second subpopulation
As per Pages 22-23 in
http://www.ausstats.abs.gov.au/Ausstats/subscriber.nsf/0/CACF387B87CE36F3CA2575B4001A2380/$File/20370_2006.pdf
Australian Population Census data has a comparatively simple structure of Dwelling Records, Family
Records and Person Records. Each record type is associated with a different set of Unit Measure
Components (Represented Variables with specific roles) that are relevant to the UnitType associated
with that record type.
To be able to relate an individual Unit of one Unit Type with the corresponding Unit of another Unit
Type, the records record, for example, the Family to which Persons “belong” and the Dwelling to
which Families “belong”. (Based on current definitions in GSIM, it seems the Family Record
Identifier on the Person Record might not be considered a UnitIdentifierComponent because it is not
identifying the Unit associated with the Person Record.)
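A minimal sketch of how such record relationships might be navigated follows. The record layouts, field names and values are hypothetical and are not taken from the ABS documentation cited above.

```python
# Hypothetical records for three record types; field names are illustrative.
dwellings = {"D1": {"structure": "Separate house"}}
families = {"F1": {"dwelling_id": "D1", "family_type": "Couple with children"}}
persons = {
    "P1": {"family_id": "F1", "age": 42, "sex": "Male"},
    "P2": {"family_id": "F1", "age": 16, "sex": "Male"},
}


def dwelling_of(person_id):
    """Follow Person -> Family -> Dwelling using the identifiers held on each record."""
    family = families[persons[person_id]["family_id"]]
    return family["dwelling_id"]


def same_dwelling(person_a, person_b):
    """True when both Persons resolve to the same Dwelling record."""
    return dwelling_of(person_a) == dwelling_of(person_b)


shares_dwelling = same_dwelling("P1", "P2")
```

Here family_id on a Person record plays the role of the Family Record Identifier discussed above: it relates the Person to another Unit without identifying the Person itself.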
The record relationships for Population Census data can be seen as relatively straightforward. The
Unit Data Structure for microdata from the Survey of Disability, Aging and Carers in 2003 consisted
of ten record types arranged in a complex hierarchy.
Edge Cases
Using DimensionalDataStructures to describe data about units
It is possible to use DimensionalDataStructure to describe (what would usually be considered)
microdata.
This might be interpreted as structuring a DimensionalDataSet in a manner that ensures every
combination of DimensionalIdentifierComponents identifies a subpopulation which consists of a
single unit.
An example is that a business register could record information about Local Units, Global Enterprises
and Global Enterprise Groups. The DimensionalIdentifierComponent might become simply the ID of
the Unit.
Everything else that was recorded about the unit (eg Main Industry, Number of Employees, Reported
Turnover Last Year, Geographic Location of Headquarters) would be either a
DimensionalMeasureComponent or a DimensionalAttributeComponent.
DimensionalIdentifierComponents are typically coded (“cross classifications”) although this is not a
requirement within GSIM, or within common implementation standards such as SDMX.
If the hierarchy is simple enough (eg no unit at one level belongs to more than one unit at the next
level up) the relationship between Local Units, Global Enterprises and Global Enterprise Groups
could be represented by – and accessed systematically via – a CodeList. This would have the
disadvantage, however, of needing to maintain a large and complex CodeList which records each
unit and its parent. It could require updating (and, potentially, versioning) the CodeList each time a
new unit is recognised (including “births” and mergers) and each time a relationship changes (eg
acquisitions). Maintaining unit information as a CodeList, therefore, while possible in some cases
may be a cumbersome option.
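The CodeList option can be sketched as a simple parent lookup. The codes, employee counts and function names below are hypothetical and do not represent any actual SDMX or GSIM structure. The sketch shows both the systematic roll-up that a simple hierarchy allows and the maintenance needed when a relationship changes.

```python
# Hypothetical CodeList recording each unit and its parent (None for the top level).
parent_of = {
    "GROUP_A": None,
    "ENT_1": "GROUP_A",
    "ENT_2": "GROUP_A",
    "LOCAL_1": "ENT_1",
    "LOCAL_2": "ENT_1",
    "LOCAL_3": "ENT_2",
}
employees = {"LOCAL_1": 10, "LOCAL_2": 5, "LOCAL_3": 20}


def top_level(code):
    """Walk up the hierarchy until a unit with no parent is reached."""
    while parent_of[code] is not None:
        code = parent_of[code]
    return code


def roll_up(measure, ancestor):
    """Sum a measure over units whose chain of parents includes the given ancestor."""
    total = 0
    for code, value in measure.items():
        current = code
        while current is not None:
            if current == ancestor:
                total += value
                break
            current = parent_of[current]
    return total


employees_before = roll_up(employees, "ENT_1")
# An acquisition changes a relationship, so the CodeList itself must be edited.
parent_of["LOCAL_3"] = "ENT_1"
employees_after = roll_up(employees, "ENT_1")
```

Every birth, merger or acquisition requires an edit of this kind (and potentially a new version of the CodeList).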
Another alternative might be to store the required information within one or more
DimensionalAttributeComponents (eg “ID of parent unit”). This potentially requires particular
relationships between different components (including between the
DimensionalIdentifierComponent and other components) to be
1. documented when describing the data, and
2. harnessed when analysing the data.
A generic application for working with dimensional data (eg as “data cubes”) would usually have
significant issues being able to recognise, and correctly utilise, this additional information. A generic
application for working with Unit Data, however, should not face the same issues were the data
described using a UnitDataStructure.
This can be seen as a separate consideration to the possibility of describing such data as dimensional
for the purpose of exchange (eg using infrastructure based on SDMX) as opposed to describing it as
dimensional to support more applied and operational uses.
It would be possible, for example, to exchange (and, eg, synchronise) data using
DimensionalDataSets on a generic basis between two systems which, when operating on the data
after exchange, apply the specialised concepts related to unit relationships, record types and
linkages associated with UnitDataStructures.
Using UnitDataStructures to describe “aggregate” data
This is a scenario which has been raised several times during discussions.
An example might be to take a range of data (eg counts of population and selected sub-populations,
measures of economic activity) and consider them as measures relating to a particular
administrative unit (eg a country, state/province or local government area). In these cases the
measures would typically be thought of as “aggregate” (eg counts of people, estimates of turnover
and other measures for industry sectors). In this scenario, however, they would be seen as
attributes/measures of a (larger scale) “unit”.
Such an example might be seen as not so different in concept to, eg, recording the number of
cars associated with a dwelling or the number of employees associated with an enterprise. In terms
of cars/employees as units these are aggregate counts but in terms of dwellings/enterprises as units
these are a measure/attribute associated with a particular unit.
In this example, it would be quite possible there could be differences in the data recorded/available
at, eg, country, state/province and local government area level. In other words, even if held in a
single set of records physically, there could be multiple record types in a logical sense depending on
the UnitType associated with each record.
This forms the basis of quite a common technique used in practice for “bringing together” data with
quite different dimensionality and, in fact, quite different underlying concepts. For example, it is
possible to bring together quite diverse social, economic and environmental data by relating it back
to a common administrative unit – and potentially a common reference time period – where it
would be inordinately complex – and of very limited practical use - to try to take the different
DimensionalDataStructures associated with each of the sources of data that is being brought
together and to seek to synthesise from those structures a coherent “hypercube”.
It is worth noting that although the data sources that are brought together this way might be quite
readily interpretable from a dimensional perspective (without giving particular precedence to the
dimension that identifies the administrative unit) once the data starts being interpreted from a Unit
perspective some of the other constructs change.
For example, a sample of records from a simple dimensional dataset might be
Region            Sex      Age   Count
New South Wales   Male     0-4   X
New South Wales   Female   0-4   X
New South Wales   Male     5-9   X
New South Wales   Female   5-9   X
Victoria          Male     0-4   X
Victoria          Female   0-4   X
Victoria          Male     5-9   X
Victoria          Female   5-9   X
If this data starts being considered as relating to each Region as a Unit, however, there is a tendency
(but not an absolute requirement in GSIM) to consider each Unit as having one record (at least for
any one reference period). Thus “Number of Males 0-4”, “Number of Females 0-4”, “Number of
Males 5-9”, “Number of Females 5-9” (at a particular point in time) would all become
attributes/measures associated with the Unit “New South Wales”. In other words, the physical
structure of “several short records associated with each Unit” would more often be considered
logically as a “single wide record for each unit”.
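The shift from “several short records” to a “single wide record” can be sketched as below. The counts are invented stand-ins for the X placeholders in the sample records, and the to_wide function is a hypothetical helper.

```python
# Hypothetical counts standing in for the X placeholders in the sample records.
long_records = [
    ("New South Wales", "Male", "0-4", 11),
    ("New South Wales", "Female", "0-4", 12),
    ("New South Wales", "Male", "5-9", 13),
    ("New South Wales", "Female", "5-9", 14),
    ("Victoria", "Male", "0-4", 21),
    ("Victoria", "Female", "0-4", 22),
    ("Victoria", "Male", "5-9", 23),
    ("Victoria", "Female", "5-9", 24),
]


def to_wide(records):
    """Turn (region, sex, age, count) rows into one record per Region as a Unit."""
    wide = {}
    for region, sex, age, count in records:
        wide.setdefault(region, {})[f"Number of {sex}s {age}"] = count
    return wide


wide_records = to_wide(long_records)
```

In the wide form, Sex and Age are no longer dimensions identifying a cell. They are absorbed into the names of measures attached to each Region.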
After the fact, when I looked for ABS data presenting population by age by sex by region, I found
that the first example I came across does present it on a “single wide record for each unit” basis.
Incidentally, this example may suggest that the complex, multi-parent / multi path hierarchies
commonly used to roll countries (as political units) up into various regional, economic and political
groupings may be more driven by seeking to describe complex sets of unit relationships and less by
seeking to describe the structure of a generic “classificatory” dimension within a
DimensionalDataStructure. This topic has received much discussion in regard to GSIM. Perhaps it is
appropriate that these relationships are described by additional metadata, which is useful if one
wants to consider the details of how the different “units” relate to each other, rather than within the
generic “dimensional” (and “classificatory”) definition of data. This could be seen, more or less, as
what Hierarchical Code Lists within SDMX do: they are not used in the direct dimensional definition
of a DimensionalDataStructure.
Using UnitDataStructures to describe “aggregate” data in this manner appears reasonable,
depending on how the designer of the DataStructure intends to present and use the data. It
seems arguable, however, that what would appear to be aggregate data from a different perspective
is, in such cases, actually being presented as microdata associated with a “larger scale” unit.
Different Dimensional perspectives on the same data
The following example provides an illustration, and brief exploration, of two different dimensional
characterisations of the same (or equivalent) data.
Dimensional DataSet with Structure 1 (DDS 1)
Month       State   Sex      Employed ('000)   Unemployed ('000)   In labour force ('000)   Not in labour force ('000)
July 2013   NSW     Male     X                 X                   X                        X
July 2013   NSW     Female   X                 X                   X                        X
July 2013   Vic     Male     X                 X                   X                        X
July 2013   Vic     Female   X                 X                   X                        X
…           …       …        …                 …                   …                        …
The overall population for Labour Force statistics in Australia is the Civilian population aged 15 years
and over.
The first row of data in DDS 1 relates to the subpopulation of males living in NSW.
Dimensional DataSet with Structure 2 (DDS 2)
Month       State   Sex    Labour Force Status   Estimate ('000)
July 2013   NSW     Male   Employed              X
July 2013   NSW     Male   Unemployed            X
July 2013   NSW     Male   In labour force       X
July 2013   NSW     Male   Not in labour force   X
…           …       …      …                     …
DDS 2 can be considered a simple “repackaging” of the data associated with DDS 1. In DDS 2 the
first row relates to the subpopulation of employed males living in NSW.
The difference in packaging/perspective can, however, be significant.
If we want to calculate “Unemployment Rate” then DDS 1 makes this straightforward: it is simply a
new measure calculated by expressing “Unemployed” as a percentage of “In labour force”.
For DDS 2, however, adding “Unemployment Rate” is not straightforward because it is not a
measure related to any of the subpopulations expressed in DDS 2 (it is actually based on a ratio
between two of the subpopulations).
In other words, how dimensional data is characterised can impact – at least in some cases – the
“functions” that can (readily) be applied to it.
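The contrast can be sketched in Python as follows. The figures are invented and the field and function names are hypothetical; the point is only that under DDS 1 the rate is a row-wise calculation, whereas under DDS 2 the two subpopulations involved must first be brought together.

```python
# DDS 1: one record per (month, state, sex) subpopulation. Figures are invented.
dds1 = [
    {"month": "July 2013", "state": "NSW", "sex": "Male",
     "employed": 1900.0, "unemployed": 100.0,
     "in_labour_force": 2000.0, "not_in_labour_force": 900.0},
]

# DDS 2: the same data, one record per (month, state, sex, labour force status).
dds2 = [
    {"month": "July 2013", "state": "NSW", "sex": "Male", "status": "Employed", "estimate": 1900.0},
    {"month": "July 2013", "state": "NSW", "sex": "Male", "status": "Unemployed", "estimate": 100.0},
    {"month": "July 2013", "state": "NSW", "sex": "Male", "status": "In labour force", "estimate": 2000.0},
    {"month": "July 2013", "state": "NSW", "sex": "Male", "status": "Not in labour force", "estimate": 900.0},
]


def rate_from_dds1(rows):
    """Row-wise: the rate is just another measure of the same subpopulation."""
    return {
        (r["month"], r["state"], r["sex"]): 100.0 * r["unemployed"] / r["in_labour_force"]
        for r in rows
    }


def rate_from_dds2(rows):
    """Cross-row: records for two different subpopulations must be paired up first."""
    grouped = {}
    for r in rows:
        key = (r["month"], r["state"], r["sex"])
        grouped.setdefault(key, {})[r["status"]] = r["estimate"]
    return {
        key: 100.0 * by_status["Unemployed"] / by_status["In labour force"]
        for key, by_status in grouped.items()
    }


rates1 = rate_from_dds1(dds1)
rates2 = rate_from_dds2(dds2)
```

Both routes give the same number, but the result of the second has no natural home in DDS 2: it is keyed by (month, state, sex), which is not the key of any DDS 2 record.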
DDS 1 and DDS 2 would be characterised by different DimensionalDataStructures under GSIM, even
though the same physical instance of data (given appropriate conceptual to physical mappings at a
level below GSIM) could be related to both structures. The difference in structure in this instance
depends on what is characterised as a Dimensional Identifier Component compared with a
Dimensional Measure Component.
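To make the difference concrete, the two structures can be written down as simple descriptions of which components identify a record and which are measures. The `Structure` class below is a hypothetical illustration, not a GSIM class; the component names are taken from the DDS 1 and DDS 2 tables above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Structure:
    """Hypothetical, simplified stand-in for a DimensionalDataStructure."""
    name: str
    identifiers: tuple  # components acting as Dimensional Identifier Components
    measures: tuple     # components acting as Dimensional Measure Components


DDS1 = Structure(
    "DDS 1",
    identifiers=("Month", "State", "Sex"),
    measures=("Employed", "Unemployed", "In labour force", "Not in labour force"),
)
DDS2 = Structure(
    "DDS 2",
    identifiers=("Month", "State", "Sex", "Labour Force Status"),
    measures=("Estimate",),
)


def dds1_to_dds2(row):
    """Re-express one DDS 1 record as DDS 2 records.

    Each DDS 1 measure name becomes a value of the extra identifier
    "Labour Force Status", and its value becomes the single "Estimate".
    """
    key = {name: row[name] for name in DDS1.identifiers}
    return [
        {**key, "Labour Force Status": measure, "Estimate": row[measure]}
        for measure in DDS1.measures
    ]


# Invented figures, for illustration only.
example = {"Month": "July 2013", "State": "NSW", "Sex": "Male",
           "Employed": 1900.0, "Unemployed": 100.0,
           "In labour force": 2000.0, "Not in labour force": 900.0}
converted = dds1_to_dds2(example)
```

What is a measure name in one characterisation is an identifier value in the other, which is exactly the kind of decision a translation to a standard dimensional format has to be told about.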
One use case for GSIM is to provide a common reference model at the conceptual level to which
different physical/technical implementations can be related. This allows GSIM to assist translations
between different physical implementations based on whether the concepts (about the structure of
data and metadata) associated with the different technical implementations are the same or not.
This may result in a translation that is simpler to design, and easier to assess the semantic quality of,
than simply asking technical staff to try mapping (eg) two sets of XML schema elements to each
other at a technical level.
If we were using GSIM in this manner in this example, and if we were seeking to translate data from
an in house physical instance to a standard “dimensional data” format such as SDMX, a DDI NCube,
Google Dataset Publishing Language or the RDF Data Cube Vocabulary, it would make a practical
difference in all of these cases whether DDS 1 or DDS 2 were selected as the way the in house
physical instance of data was characterised in conceptual terms.
(In this scenario, there would also need to be a layer of metadata that related the conceptual
characterisation based on GSIM to the in house physical instance of data. Metadata that connects
“logical to physical” is outside the scope of GSIM but is addressed in standards such as DDI which can
be used for implementing GSIM.)
In conclusion, the conceptual/logical characterisation of data in GSIM is important to supporting a
range of use cases for GSIM as a common reference model. These characterisations are not matters
that should be left undifferentiated at the GSIM level. Leaving these characterisations
undifferentiated at the GSIM level would require them to be worked out case by case,
repeatedly and potentially inconsistently, at the “physical implementation” level each time.
Different Unit perspectives on the same data
The following is an actual case that arose in the ABS.
Two different subject matter areas were considering the same physical data records. These data
records included a Person ID, the Person’s name and the Person’s address, as well as a range of
other data related to the person in question.
One subject matter area, possibly to facilitate record matching with other data, chose to identify
records through the combination of Name and Address. The other subject matter area chose to
identify records using Person ID.
The conclusion reached by analysts within the ABS was that the two areas were seeking to describe,
and work with, the data using two different UnitDataStructures at the conceptual level.
In the first case Person Name and Person Address were acting as UnitIdentifierComponents. Person
ID, arguably, was simply a UnitAttributeComponent which was not terribly relevant for the
purposes of most analysis by that user of the data.
In the second case, Person ID was being used as the UnitIdentifierComponent and potentially Person
Name and Person Address were UnitAttributeComponents. (Are there any circumstances under
which Person Address – or, eg, Person Sex - could be considered a UnitMeasureComponent or are
UnitMeasureComponents always numeric measures, even though this does not appear to be stated
explicitly in GSIM V1.0?)
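The two cases can be sketched as two ways of indexing the same physical records. The records, field names and the `index_by` helper below are all invented for illustration; whichever fields are chosen as identifiers, the remaining fields are treated here simply as attributes.

```python
# The same physical records, as seen by both subject matter areas (invented data).
records = [
    {"person_id": "P001", "name": "Alex Example", "address": "1 Sample St", "sex": "Female"},
    {"person_id": "P002", "name": "Sam Example", "address": "2 Sample St", "sex": "Male"},
]


def index_by(rows, identifier_fields):
    """Index records by the chosen identifier fields; all other fields become attributes."""
    indexed = {}
    for row in rows:
        key = tuple(row[field] for field in identifier_fields)
        if key in indexed:
            raise ValueError(f"identifier {key!r} does not uniquely identify a unit")
        indexed[key] = {k: v for k, v in row.items() if k not in identifier_fields}
    return indexed


# First case: Name and Address identify the unit; Person ID is just an attribute.
by_name_address = index_by(records, ("name", "address"))

# Second case: Person ID identifies the unit; Name and Address are attributes.
by_person_id = index_by(records, ("person_id",))
```

The physical records are untouched in both cases; only the conceptual role assigned to each field differs.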