RAIRD Information Model (RIM)
Version 1.0, 16 June 2014
RAIRD P 1.1 Information Model
Authors: Jenny Linnerud (SSB), Ørnulf Risnes (NSD), Arofan Gregory (MTNA)
Approvals

Type                 Name            Date
Control of v0.9      Terje Risberg   28.05.2014
                     Atle Alvheim    27.05.2014
Approval of v0.99    Rune Gløersen   11.06.2014
Change log

Version   Author           Reason
0.99      Jenny Linnerud   Incorporated feedback on v0.9 from Atle Alvheim,
                           Arofan Gregory, Johan Heldal, Anne Gro Hustoft,
                           Terje Risberg and Ørnulf Risnes.
1.0       Jenny Linnerud   Change to the attribution text for Creative
                           Commons on v0.99.
This work is licensed under the Creative Commons Attribution 4.0
International License. To view a copy of this license, visit
http://creativecommons.org/licenses/by/4.0/. If you re-use all or part of
this work, please attribute it jointly to Statistics Norway and the Norwegian
Social Science Data Services.
Contents

1. Introduction ................................................................ 4
2. RAIRD Information Model (RIM) ............................................... 4
   2.1 RIM Design Principles ................................................... 5
   2.2 Extending the model ..................................................... 6
   2.3 Notation ................................................................ 6
3. System Overview ............................................................. 6
4. Input Data and the Event History Data Store ................................ 10
   4.1 A Simple Example ....................................................... 10
   4.2 GSIM Unit Data Structure ............................................... 10
   4.3 Datum-Based Data Structure ............................................. 12
5. Analysis Data Sets ......................................................... 16
   5.1 GSIM and Unit Data ..................................................... 16
   5.2 Analysis Data in RIM ................................................... 17
6. Provisional and Final Outputs .............................................. 19
   6.1 RAIRD Requirements ..................................................... 20
   6.2 The GSIM Dimensional Data Model ........................................ 20
7. Topics and Concepts for Organization of Data ............................... 22
   7.1 Requirements for Topic and Variable Navigation ......................... 22
   7.2 GSIM – Mapping to RIM Requirements ..................................... 23
   7.3 Administrative Details ................................................. 24
8. Concepts, Classifications, Variables and Related Metadata .................. 25
   8.1 Concepts ............................................................... 25
   8.2 Units, Unit Types, and Populations ..................................... 25
   8.3 Variables .............................................................. 26
   8.4 Value Domains, Statistical Classifications, and Codelists .............. 29
9. Disclosure Control Rules and Risk Minimization Actions ..................... 31
   9.1 GSIM and Disclosure Control ............................................ 32
   9.2 RIM Model .............................................................. 32
10. Process Descriptions ...................................................... 34
   10.1 The GSIM Process Model ................................................ 35
   10.2 Load Event History Data Store Process ................................. 37
   10.3 Interact with the Virtual Research Environment Process ................ 38
   10.4 Control of Disclosure Risk Process .................................... 40
   10.5 Accredit Institutions and Manage Users Process ........................ 41
        i) Standard users ..................................................... 44
        ii) Specific, large users ............................................. 44
        iii) Institutions with special requirements within health registers ... 44
        iv) Students .......................................................... 44
11. References ................................................................ 45
List of Figures
Figure 1: System Overview .............................................................................................................................. 7
Figure 2: Unit Data Structure ........................................................ 11
Figure 3: Datum-based Data Structure.......................................................................................................... 13
Figure 4: Populations and Units .................................................................................................................... 15
Figure 5: Unit Relationships .......................................................................................................................... 15
Figure 6: Data Sets ......................................................................................................................................... 17
Figure 7: Concept .......................................................................................................................................... 24
Figure 8: Population ...................................................................................................................................... 26
Figure 9: Variable and Represented Variable ................................................................................................ 27
Figure 10: Instance Variable .......................................................................................................................... 28
Figure 11: RIM Represented Variable extension ........................................................................................... 29
Figure 12: Node ............................................................................................................................................. 31
Figure 13: Disclosure Risk Assessment .......................................................................................................... 33
Figure 14: Disclosure - Corrective Actions..................................................................................................... 34
Figure 15: Design Processes .......................................................................................................................... 36
Figure 16: Run Process .................................................................................................................................. 36
Figure 17: Accredit Institutions and Manage Users Process Flow ................................................................ 41
Figure 18: Accreditation ................................................................................................................................ 42
Figure 19: Researcher Access Request .......................................................................................................... 43
1. Introduction
There is increasing demand from researchers for access to microdata and for empirical research
based on register data held by public authorities. For register data, this demand has to be met in a
way that is in accordance with the principles of data protection and confidentiality for microdata.
These principles of confidentiality normally have a legal basis and are common to official statistics in
most countries.
In Norwegian official statistics there is wide use of administrative data. Statistics Norway also
cooperates closely with administrative registers such as the Population Register and the Central
Coordinating Register for Legal Entities. The common identifiers (for persons and economic units)
give Statistics Norway the possibility of linking information from different sources (administrative
registers and its own statistical surveys). Data with identifiers are stored and accumulated over time.
The data may be combined and presented as cross-sectional data, or combined in ways that yield
longitudinal data.
Statistics Norway and the Norwegian Social Science Data Services (NSD) aim to establish a national
research infrastructure providing easy access to large amounts of rich, high-quality statistical data for
scientific research, while at the same time managing statistical confidentiality and protecting the
integrity of the data subjects. The work is organized as a project, RAIRD – Remote Access
Infrastructure for Register Data1 – and is funded by the Research Council of Norway.
2. RAIRD Information Model (RIM)
One of the first deliverables in the RAIRD project is the RAIRD Information Model (RIM) v1.0.
An information model (IM) is an abstract, formal representation of objects that includes their properties,
relationships and the operations that can be performed on them. The main purpose of RAIRD IM (RIM) is
to model managed objects at a conceptual level, independent of any specific implementations or
protocols used to transport the data. The degree of specificity (or detail) of the abstractions defined in
the RIM depends on the modelling needs of its designers.
RIM is an implementation of the Generic Statistical Information Model (GSIM) v1.1. It extends GSIM
to include the users (producers, administrators and researchers), but does not include those parts of
GSIM that we consider more related to the details of official statistics production (e.g. Change
Definition) than to the secondary use of the microdata by researchers.
We have used GSBPM as a tool to discuss which parts of the statistical production process could also
be of relevance for researchers using RAIRD. For example, the Build and Collect phases will not be
necessary for the researchers, as the RAIRD system will be provided by Statistics Norway and the data
are already collected by us. RAIRD will need to heavily support the Specify Needs, Process and
Analyse phases for the researchers.

1 See: www.raird.no
This document is aimed at metadata specialists, information architects and solution architects.
There are also a number of annexes, which include a glossary (covering relevant GSIM information
objects), the relationship between RAIRD and business process models, a technical overview of
information flows, and a mapping to DDI.
2.1 RIM Design Principles
The design principles for RIM listed below are based on the design principles in GSIM v0.8 and the design
principles for DDI.
0. Change control: RAIRD IM (RIM) has change control, i.e. the following principles for designing RIM apply to every revision of RIM.
1. Complete coverage: RIM supports the whole business process resulting in access to products for researchers.
2. Supports production of products for researchers: RIM supports the design, documentation, production and maintenance of products for researchers.
3. Supports access to products for researchers: RIM supports access to products for researchers.
4. Separation of production and access: RIM enables explicit separation of the production and access phases.
5. Linking processes: RIM represents the information inputs and outputs to the production and access processes.
6. Common language: RIM provides a basis for a common understanding of information objects.
7. Agreed level of detail: RIM contains information objects only down to the level of agreement between key stakeholders.
8. Robustness, adaptability and extensibility: RIM is robust, but can be adapted and extended to meet users' needs.
9. Simple presentation: RIM objects and their relationships are presented as simply as possible.
10. Reuse: RIM makes reuse of existing terms, definitions and standards.
11. Platform independence: RIM does not refer to any specific IT setting or tool.
12. Formal specification: RIM defines and classifies its information objects appropriately, including specification of attributes and relations.
2.2 Extending the model
One of the RIM design principles is that RIM can easily be adapted and extended to meet users'
needs. It is expected that implementers may wish to extend RIM during the RAIRD project, by
adding detail and indicating which information objects are used, and exactly how.
RIM itself should be used as an example of how to document extensions and restrictions. This
means providing the information in the template found in the GSIM v1.0 Specification and
providing the definitions and descriptions/examples in this tabular form, as well as providing an
overall narrative of each UML diagram produced.
2.3 Notation
Throughout this document we have used the following notation:

- The names of GSIM information objects start with a capital letter and are written in italics, e.g. Unit Data Set.
- The names of RIM information objects that are not GSIM information objects start with a capital letter and are written in bold italics, e.g. Provisional Output.
- The names of systems and processes in RAIRD start with a capital letter and are written in bold, e.g. Event History Data Store and Interact with the Virtual Research Environment Process.
3. System Overview
The RAIRD Information Model (RIM) is intended to support all aspects of the RAIRD implementation. In
order to understand this scope, a basic understanding of the systems and processes in RAIRD is helpful. In
the illustration below, the major systems and information flows in RAIRD are visible:
Figure 1: System Overview
We will briefly describe each of the systems and information types presented above. We begin by
describing how the data is provided to RAIRD. The SSB Data Management System is external to RAIRD – all
other systems and information flows are part of the overall RAIRD application.
There will always be some system or systems for managing the collection, production, and dissemination
of statistical data and metadata within SSB (SSB Data Mgt. System). These systems are liable to change
over time, driven by the various needs within the organization. Consequently, an API will be created
specifying how the data and metadata used by RAIRD should be supplied (Load API).
Two types of information will be supplied through this interface: the data itself (Event History Input Data
Set), and the relevant metadata (Input Metadata Set). The data could be supplied as a delimited ASCII file
in a form roughly similar to how it will be stored in the receiving system. The metadata could be
transmitted in a standard format, such as DDI.2 This information would be loaded into the Event History
Data Store, where it would be placed into a system capable of handling the very large numbers of records
and serving them up in a performant way. It should be noted that the structures of the data inside the
store and in the data-loading format could be quite similar, consisting of a simple datum-based data
structure (see section 4.3).3

The store contains the information needed by the users to formulate meaningful requests, including some
calculated information, such as the time periods for which data are present for specific variables.

Having loaded the needed information into the storage system, the researchers now enter the picture.
Researchers will be associated with accredited institutions that have applied for access to RAIRD, and will
have been given the necessary training in how to use RAIRD properly. The users will be given access to the
system, but they and their institutions are held liable for any misuse of the system or of the confidential
data it contains. The entire process of institutional accreditation and the management of users and access
to the system are part of the system, and therefore of the RAIRD Information Model, but they are not
shown in this particular diagram.

Users will access RAIRD using a browser-based anonymizing interface, loaded from the Data Catalogue,
which exposes the contents of the Event History Data Store using the metadata held in it. The researcher
can browse the available variables (grouped topically according to a set of concepts), see possible unit
types of study, the time periods for which variables are available, and similar descriptive aspects of the
data. The researcher is thus able to become familiar with the data available to answer any specific
research question, but will not see microdata.

The researcher will look at the data within a conceptual framework that does not reflect the actual
structure of the Event History Data Store, but allows them to think about data selection using the format
in which it will be extracted (eventually, the Analysis Data Set). This data structure is one that is very
familiar to researchers, as it is the one most often used within statistical packages such as SAS, SPSS,
Stata, and R.

The RAIRD Virtual Research Environment (the browser-based interface) now allows the researchers to
perform their data selection, processing, and analysis. User Operations are performed through
browser-based interactions with the Virtual Statistical Machine. The operations specified by the user are
logged, so that there is a clear audit trail. (RAIRD will have the capacity to re-create any researcher action
accurately, by re-executing the function on the correct version of the data store, for which any changes
are also logged.)

2 If DDI Lifecycle is used, there would be two "types" of metadata being loaded: the structural description
and the "study-unit"-specific metadata (a direct description of the data being loaded); and the underlying
foundational metadata, giving such things as the list of variables in the register data, the representations
and classifications of variables, their labels, any associated concepts, their unit types and relationships,
etc. This latter set could be transmitted using a DDI "resource package".

3 Each of the many records documents a single datum, in a record structure which would contain an ID, a
reference to the value's variable, the value itself, and start and end dates for the event measured by the
variable (or the period of the observation [e.g., for yearly data, year = 2005], in the case of a snapshot of
the data). The variable reference points to a variable documented in the metadata package (part of the
Input Metadata Set) loaded at the same time. When loaded into the store, the classifications used by
different variables would each be stored in their own tables, as in a traditional star-schema-based data
warehouse. Other associated metadata, such as labels, could be attached in a similar fashion. More
documentary information could be held in the database at whatever level of granularity is desired.
Researchers are able to select data by specifying one or more variables they wish to process, and by
specifying a point in time for which they want the variables populated. If event-history processing is
desired, a time period may be specified for variables of interest, so that changes across time can be seen.
Note that including more than one event variable in an analysis data set will result in a non-uniform unit
identifier. Measures must be taken to communicate this complex structure to users; such communication
techniques are not yet defined. Note also that data sets with multiple event variables cannot be analyzed
using the same techniques as data sets with a well-defined unit identifier. Analysis techniques for working
with multiple event-based variables have been defined as outside the scope of the first versions of RAIRD.
Variables selected with a time period are termed event history variables, and the others are termed
static variables.
The Unit Type for each variable is known – the selected data is presented to the researcher for analysis
using the finest grained unit from the selected variables. The selected data is the Analysis Data Set.
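The rule above can be sketched in code. The unit types and the granularity ordering here are hypothetical illustrations, not defined by RIM:

```python
# Hypothetical granularity ranking: lower number = finer grained.
# The actual unit types and their ordering would come from RIM/GSIM,
# not from this sketch.
GRANULARITY = {"PERSON": 0, "FAMILY": 1, "HOUSEHOLD": 2, "ENTERPRISE": 3}

def finest_unit_type(selected_variables):
    """Return the finest-grained unit type among the selected variables.

    `selected_variables` maps a variable name to its known Unit Type.
    """
    return min(selected_variables.values(), key=GRANULARITY.__getitem__)

# A selection mixing person-level and household-level variables is
# presented to the researcher at the person level:
selection = {"MAR_STAT": "PERSON", "DWELLING_TYPE": "HOUSEHOLD"}
print(finest_unit_type(selection))  # -> PERSON
```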
Additional processing may now be performed on this data set, which can be supplemented by additional
data selection as needed. The researcher can generate new variables from the old ones, reshape the data
set, perform cleaning and validation, and other types of processes.
The data analysis is based on statistical functions (tabulation, regression, etc.) executed by the Virtual
Statistical Machine on the Analysis Data Set. The result of this is a Provisional Output. The researcher
does not see this Provisional Output – first, it is scrutinized by the Disclosure Control Engine, which
executes a set of operations to guarantee, to the extent possible, that it is not disclosive, according to the
criteria established by Disclosure Control Rules. Rules might cause corrective actions to be applied until
the Provisional Output meets all desired criteria. At this point the corrected Provisional Output is passed
to the researcher as a Final Output, which again appears in the browser interface. The researcher might
wish to re-run further statistical functions on the Analysis Data Set, or might go and try a different
selection of data altogether. This process continues until the researcher has the desired results.
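The scrutiny loop described above can be sketched in outline. The rule object and threshold below are placeholder illustrations; the actual Disclosure Control Rules and corrective actions are specified in section 9:

```python
def control_disclosure(provisional_output, rules):
    """Apply Disclosure Control Rules to a Provisional Output until it
    meets all criteria; the corrected output is released as the Final
    Output. (A sketch: assumes every corrective action converges.)"""
    output = provisional_output
    while True:
        violations = [rule for rule in rules if not rule.check(output)]
        if not violations:
            return output  # passes all criteria: this is the Final Output
        for rule in violations:
            output = rule.corrective_action(output)

# Illustrative placeholder rule: suppress table cells with fewer than
# 5 contributing units (the threshold is invented for this example).
class MinimumCellCount:
    def __init__(self, threshold=5):
        self.threshold = threshold
    def check(self, table):
        return all(cell is None or cell >= self.threshold for cell in table)
    def corrective_action(self, table):
        return [cell if cell is None or cell >= self.threshold else None
                for cell in table]

print(control_disclosure([12, 3, 40], [MinimumCellCount()]))
# -> [12, None, 40]
```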
GSIM Coverage
Note that there are three very different – but related – types of data structures in this process. The data
loaded into RAIRD has a very event-history-based structure, with a single datum recorded for each
available event. The Analysis Data Set has a different structure, organized into traditional rows and
columns – what GSIM calls a Unit Data Set. The results of statistical operations may be tabular, but are
more typically estimates of statistical model parameters and graphical displays – what GSIM describes as a
Presentation. The RAIRD Information Model describes all of these, using what GSIM offers, but extending
it to cover event-history data.
Further, we have many different examples of processing: the loading process, data catalogue browsing,
data selection, data processing, data analysis, and the application of disclosure control. These processes
will be modeled using the information objects supplied by GSIM.
Other functions (institutional accreditation, user management) are not as well-described in GSIM, and will
require RAIRD-specific extensions.
4. Input Data and the Event History Data Store
This section attempts to explore the datum-based modeling of data using a simple example, and contrasts
this with the similar way in which data can be described using the unit data model found in GSIM. The
goal is to provide a good basis for discussing and integrating the RAIRD datum-based approach with the
GSIM model, since RIM will – wherever it is fit for use – be based upon it.
The datum-based model is one which is similar to some styles of data warehouse design, in which each
single fact has a set of descriptors, using a star schema approach. It is not, however, necessary to use this
approach with the datum-based model, and there are potentially large differences from an
implementation perspective. Like the fact in the star schema approach, the individual datum is central to
how we conceptualize the data structure.
In GSIM unit data, the conceptual model is more similar to the structure of relational database tables:
each record is associated with a single unit, and provides a set of values for a known sequence of
variables.
4.1 A Simple Example
We will use a very simple example as the basis for comparing the two styles of data modeling (the GSIM
unit data approach and the RIM datum-based approach).
We have a variable, marital status, represented with the following codes:
N - single and never married
C - cohabiting and never married
M - married
S - separated
D - divorced
For our example, we have an identifier of 0937, which identifies an individual. This individual was married
on August 4, 2003, and remains married.
4.2 GSIM Unit Data Structure
In a Unit Data Set, the value for marital status would appear as a column in a table, typically alongside
several other variables describing the individual 0937. We can visualize this as shown in the table below.
CASE_ID   DOB          MAR_STAT   GENDER   DATE_MAR     DATE_SEP   DATE_DIV
0937      1971-05-03   M          1        2003-08-04   -          -
Here, we have first the case identifier (CASE_ID), then a date of birth variable (DOB), followed by our
marital status variable (MAR_STAT), gender (GENDER), the date married (DATE_MAR), date separated
(DATE_SEP), and date divorced (DATE_DIV). Note that the latter two variables have null values, as
individual 0937 is still married. We see this by the code found for the marital status variable, M. Of
course, any individual could have multiple marriage/separation/divorce events.
There are other ways of encoding data modeled as GSIM unit data, but this example is fairly typical – in
implementation, the data could be stored in relational database tables structured as shown, stored in
fixed-width or delimited ASCII files (such as CSV), or stored in this form in statistical packages such as SAS,
Stata, SPSS or R.
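As a concrete illustration of the delimited-ASCII option, the single record above could be serialized and read back as CSV. The column names are taken from the example; the in-memory buffer stands in for a file:

```python
import csv
import io

# The Logical Record fixes the sequence of variables; each data row is
# a Unit Data Record whose cells are Unit Data Points.
columns = ["CASE_ID", "DOB", "MAR_STAT", "GENDER",
           "DATE_MAR", "DATE_SEP", "DATE_DIV"]
record = ["0937", "1971-05-03", "M", "1", "2003-08-04", "", ""]

buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(columns)
writer.writerow(record)

# Reading the file back recovers each value by variable name:
buffer.seek(0)
row = next(csv.DictReader(buffer))
print(row["MAR_STAT"])  # -> M
```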
GSIM models this in a fashion that places emphasis on the Logical Record: the sequence of variables
associated with each Unit (as expressed by our Case ID above). In the GSIM model, we have the following
diagram:
Figure 2: Unit Data Structure
[Source: Part of GSIM Specification v1.1 Figure 19]
The Unit Data Set is structured by a Unit Data Structure, which has at least one Logical Record. This
Logical Record provides the structure for the Unit Data Record, which groups Unit Data Points. To
translate our example, each value in our table is a Unit Data Point. These are grouped into a row: the Unit
Data Record. Each Unit Data Record (row) – the sequence of variables repeated throughout the file - is
structured by the Logical Record, which describes a type of Unit Data Record for one Unit Type within a
Unit Data Set. Although not shown in this diagram, a Logical Record groups Data Structure Components
that group Represented Variables.
GSIM tells us that Represented Variables may play three roles as Data Structure Components: Identifier
Components, Measure Components, and Attribute Components.
In our example, each value (0937, 1971-05-03, M, 1, 2003-08-04,-,-) is a Datum – the GSIM structural
description of the position occupied by each Datum is a Unit Data Point. When we group these together,
they become a Unit Data Record. The Logical Record then tells us that the first value is an Identifier
Component and all the remaining variables are Measure Components.
4.3 Datum-Based Data Structure
In our datum-based model, we have only a single Datum (value) in our conceptual row (that is, there
would be a separate row for each variable value):
Case_ID   VAR_REF    VALUE   START_DATE   END_DATE
0937      MAR_STAT   M       2003-08-04   -
The case identifier (CASE_ID) is needed for each entry because we need to be able to associate the value
with the Unit (Individual 0937). The variable reference (VAR_REF) is needed because we need to know
which variable (MAR_STAT) is associated with the value. The value (VALUE) is obviously needed. The start
and end dates (START_DATE, END_DATE) are needed in the case of event history data – for snapshot data
we need only a single date, the time of observation. Because individual 0937 could potentially become
separated or divorced, etc., we would also need to be able to record the dates of any changes in marital
status.
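A minimal sketch of this record structure, using the field names from the example (the class itself is illustrative, not part of RIM):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class Datum:
    """One row of the datum-based structure: a single value, tied to a
    Unit, a Represented Variable, and an Event Period."""
    case_id: str    # associates the value with the Unit (Individual 0937)
    var_ref: str    # reference to the Represented Variable (e.g. MAR_STAT)
    value: str
    start_date: str                 # start of the Event Period
    end_date: Optional[str] = None  # None while the event is still ongoing
    # For snapshot data, a single observation date would suffice instead
    # of the start/end pair.

# Individual 0937 married on 2003-08-04 and remains married:
d = Datum("0937", "MAR_STAT", "M", "2003-08-04")
print(d.end_date is None)  # -> True
```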
Many of the conceptual objects we need to describe this structure exist in GSIM: we have the idea of a
case identifier variable (it is an Identifier Component); we have the Represented Variable object to which
we can make reference; we have the idea of a Data Point.
It is worthwhile at this point to illustrate what the entire row of our example data would look like when
modeled according to the datum-based approach:
Case ID   VAR_REF    VALUE        START_DATE   END_DATE
0937      DOB        1971-05-03   1971-05-03   -
0937      MAR_STAT   M            2003-08-04   -
0937      GENDER     1            1971-05-03   -
Notice that the date variables – DATE_MAR, DATE_SEP, and DATE_DIV are no longer needed, as their
contents are carried in the start date and end date columns, in combination with a change in marital
status. DOB is a bit strange, because the value is a fixed one which cannot change. Still, the event remains
consistent as shown. This is because every datum is given a relationship to time – the Event Period. GSIM
does not give us such an information object.
We could, of course, use the GSIM Unit Data Structure to model this data set: the Logical Record is made
up of the Represented Variables: case identifier, variable reference, value, start date, and end date. Case
identifier is a Measure Component in the datum-based approach, variable reference is an Attribute
Component, value is a Measure Component, as are both start date and end date.
An artificial Identifier Component could be manufactured, which would be necessary in GSIM, because
case ID no longer functions as an Identifier Component (it no longer gives a unique value for each row),
and we are required by GSIM to have an Identifier Component for our Unit Data Records.
Thus, the table would look like this:
Sequence ID   Case ID   VAR_REF    VALUE        START_DATE   END_DATE
001           0937      DOB        1971-05-03   1971-05-03   -
002           0937      MAR_STAT   M            2003-08-04   -
003           0937      GENDER     1            1971-05-03   -
The problem with using this approach is that we have lost the understanding of our data. Where is the
marital status variable? Where is the date of birth variable? The values we see in the VAR_REF column are
just Categories associated with Codes, according to the description of GSIM’s Unit Data Structure. These
Codes happen to correspond with Represented Variables, but that is not something we can model with
GSIM. In the datum-based approach, the columns of our table no longer contain variable values – only the
value column does.
Further, we have lost the date variables altogether, as there is no relationship in GSIM between the start
date and the marital status variable, since the marital status variable has become a Code and a Category.
Below is a diagram of how a datum-based model could be created from what is available in GSIM, by
adding an additional object and some relationships:
Figure 3: Datum-based Data Structure
Here, we see many things found in GSIM: the Datum, the Data Point, the Data Set, the Data Structure, the
Data Structure Component, the Represented Variable, the Unit and the Unit Type are all present exactly as
seen normally in GSIM, with the one major exception that in GSIM, the Unit Type is associated with the
Logical Record, whereas here it is associated directly with the Datum. In GSIM, all of these objects are
common to both Unit Data and Dimensional Data, except for the Unit Type and Logical Record, which do
not exist explicitly in the description of the Dimensional Data.
The only new information object in this diagram is the Event Period. There are, however, several new
relationships. GSIM has an information object for Unit, but here we have a direct relationship to it from
the Datum, which does not exist in any of the existing GSIM Data Structure models.
There is no single Identifier Component – each Datum is identified by the union of the Unit, Event Period,
and referenced Represented Variable. In the datum-based approach, there is no single value identifying
the case, as there is in GSIM Unit Data Sets.
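The composite identification just described can be sketched in code. This is an illustrative sketch only: the class and field names below are assumptions for the example, not GSIM or RIM objects.

```python
# Sketch: a Datum identified by the union of Unit, Represented Variable,
# and Event Period, rather than by a single Identifier Component.
# All names here are illustrative, not drawn from GSIM or the RIM.
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class EventPeriod:
    start: str                  # ISO date
    end: Optional[str] = None   # None = still current

@dataclass(frozen=True)
class Datum:
    unit_id: str                # the Unit (e.g. an individual)
    variable: str               # reference to a Represented Variable
    period: EventPeriod
    value: str

store = {}

def add(datum: Datum) -> None:
    # The key is the union of Unit, Variable and Event Period start;
    # no artificial Sequence ID is required.
    store[(datum.unit_id, datum.variable, datum.period.start)] = datum

add(Datum("0937", "MAR_STAT", EventPeriod("2003-08-04", "2010-02-02"), "M"))
add(Datum("0937", "MAR_STAT", EventPeriod("2010-02-02", "2012-02-14"), "D"))

print(store[("0937", "MAR_STAT", "2003-08-04")].value)  # M
```

Note that the dictionary key plays the role the union of identifying relationships plays in the model: two Datums for the same Unit and Variable are distinguished only by their Event Periods.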
The example below shows how our earlier data could be expressed using the datum-based approach.
Here, we have the marital status example again, but with changes across time as different events take
place (in this example, divorce and re-marriage):
Unit Identifier   Variable Reference   Value        Start Date   End Date
0937              DOB                  1971-05-03   1971-05-03
0937              GENDER               1            1971-05-03
0937              MAR_STAT             M            2003-08-04   2010-02-02
0937              MAR_STAT             D            2010-02-02   2012-02-14
0937              MAR_STAT             M            2012-02-14   -
Each event is represented by a row in our table: the individual’s birth (DOB), his gender (GENDER), his first
marriage (MAR_STAT, 3rd row), his divorce (MAR_STAT, 4th row), and his subsequent re-marriage
(MAR_STAT, 5th row).
To understand this example in terms of the datum-based model shown above, the first column gives us
our Unit (0937), which we know is an individual (the Unit Type). The second column is the relationship
that the Datum has to a Represented Variable. The third column holds our Data Points, which contain our
Datums. The last two columns are the Event Period.
We have a further issue in that relationships between Unit Types in GSIM hinge on the Logical Record,
which no longer exists in our model. In GSIM Unit Data, the Logical Record represents all the variables
related to a single Unit Type. Do we still notionally have the Logical Record in the datum-based model,
since we will need to express the relationships between Units? How do we solve this? In GSIM, the
relationship between two Units is based on the relationship between Unit Types, since any Units of those
Unit Types could have that relationship. This is represented by the GSIM Record Relationship information
object, but this is not how relationships work in reality, only how they are structurally represented in Unit
Data Structures.
GSIM gives us:
Figure 4: Populations and Units
[Source GSIM Specification v1. Figure 11]
This is fine as far as it goes, but it is not enough. GSIM portrays relationships between Units as in Figure 2.
What is not shown in that view is that a Logical Record has an “isDefinedBy” relationship to Unit Type,
where any given Unit Type can be used to define 0 to n Logical Records.
Essentially what we see here is that GSIM uses the Unit Data Structure to describe relationships between
Unit Types, using the Record Relationship construct. It does not model the real world in which the Units
have actual relationships.
A better approach might be to model the actual relationships between Units like this:
Figure 5: Unit Relationships
The Relationship information object has a type property (“isMarriedTo”, etc.). For the RIM, we will use
this approach, which does not rely on how we structure our data, but instead directly models how Units
relate. To extend our example again, we could show a relationship between Units as follows:
Unit Identifier   Variable Reference   Value        Start Date   End Date
0937              DOB                  1971-05-03   1971-05-03
0937              GENDER               1            1971-05-03
0937              MAR_STAT             M            2003-08-04   2010-02-02
0937              CIVIL_UNION_ID       4765         2003-08-04   2010-02-02
0937              MAR_STAT             D            2010-02-02   2012-02-02
0937              MAR_STAT             M            2012-02-14   -
0937              CIVIL_UNION_ID       5678         2012-02-14
2100              DOB                  1972-03-05   1972-03-05
2100              GENDER               2            1972-03-05
2100              MAR_STAT             M            2012-02-14
2100              CIVIL_UNION_ID       5678         2012-02-14
Here, we can see that two Units (0937 and 2100) have been married, because each Unit has a marital
status of M (married), and each Unit has a variable which makes reference to the civil union 5678, for the
period starting on 2012-02-14. The civil union is our Relationship information object, with a type of marriage.
Note that Units of different Unit Types could also have relationships: an individual could be related to a
business as an employee, etc.
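The way the civil union ties the two Units together can be sketched as a simple lookup. This is an illustrative sketch against the rows of the table above; the function name and row layout are assumptions, not RIM API.

```python
# Sketch: discovering that two Units are related by finding Datums that
# reference the same civil union (the Relationship information object,
# here typed as a marriage) for the same period.
rows = [
    # (unit, variable, value, start, end) -- end None = still current
    ("0937", "MAR_STAT",       "M",    "2012-02-14", None),
    ("0937", "CIVIL_UNION_ID", "5678", "2012-02-14", None),
    ("2100", "MAR_STAT",       "M",    "2012-02-14", None),
    ("2100", "CIVIL_UNION_ID", "5678", "2012-02-14", None),
]

def related_units(rows, relationship_id):
    """Return the Units whose CIVIL_UNION_ID variable references the
    given Relationship object (the civil union)."""
    return sorted({unit for unit, var, value, _, _ in rows
                   if var == "CIVIL_UNION_ID" and value == relationship_id})

print(related_units(rows, "5678"))  # ['0937', '2100']
```

The Units never reference each other directly; both make an external reference to the Relationship object, which is what makes the approach independent of how the data happens to be structured.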
5. Analysis Data Sets
This section describes the structure of the Analysis Data Sets produced within RAIRD by applying user
selections to the Event History Data Store. Although never delivered directly to users, the Analysis Data
Sets are the ones users operate on through the Virtual Statistical Machine, in order to process and
produce Final Outputs.
The structure of the Analysis Data Sets is different from that of the Event History Data Store, or any
other data structures found in RAIRD. The GSIM model gives us a reasonable set of information objects
for describing this data.
The structure of Unit Data is very familiar to most researchers, as it is a data description similar to those
used by the major statistical packages (e.g. Stata, SAS, SPSS, R), where each Unit is described in a row of
the table, and each column of the table represents the values of a single variable for all the units.
5.1 GSIM and Unit Data
As we have seen in the preceding section, GSIM provides a model of Unit Data Sets as well as a specific
description of the Unit Data Structure. All Data Sets have some information objects in common, and
these have their own diagram in GSIM:
Figure 6: Data Sets
[Source: GSIM Specification v1.1 Figure 18]
Here we can see that a Data Set has Data Points, which hold the individual Datums. All Data Sets are
structured by a Data Structure, which has Data Structure Components. These Data Structure Components
are of three subtypes: Identifier Components, Measure Components, and Attribute Components. Identifier
Components are values which form part or all of an identifier of the Unit for that record. The Measure
Component holds a value measuring or describing the observed phenomenon. The Attribute Component
supplies information other than identification or measures, for example the publication status of an
observation (e.g. provisional, final, revised). Note that Structure Components of all subtypes are defined
by Represented Variables.
GSIM also gives us a specialization of this generic model of data, for describing Unit Data as in Figure 2.
A Unit Data Set has Unit Data Points grouped into Unit Data Records. Each Unit Data Record is structured
by a Logical Record, which is a set of references to Represented Variables (not shown on this diagram). If
there are relationships between Unit Data Records, these are expressed at the Logical Record level using
Record Relationships. For example, if I have a “married to” variable in a record where the Unit is an
individual, the value of that variable might be the identifier of the individual who is the spouse or the
identifier of a civil contract which the spouses’ record also references.
5.2 Analysis Data in RIM
While GSIM is capable of describing Unit Data for many different Unit Types in the same Unit Data Set, for
RIM we have a stronger requirement – the Analysis Data Sets will only have records describing a single
Unit Type. There are some complexities however, depending on the type of selection a given user has
performed in order to create the Analysis Data Set.
The Analysis Data Set is created by the researcher as user operations are performed. Variables are added
at the researcher’s request, and are of two types. The simplest request is that a variable be added from
the Event History Data Store at a single point in time. An entire Analysis Data Set can be created using
only this approach, and these variables are termed static variables, as they show the status of the Units
for that point in time. This results in a snapshot of the Event History Data Store.
A second, more complicated type of variable addition is a request for the values over a period of time,
with a requested observational periodicity and behavior (value at the end of each observational period,
average over the observational period, etc.). This is termed an event history variable. The examples
below assume that one and only one such variable has been selected for any given Analysis Data Set.
Although this is not a fixed constraint in RIM, it ensures the uniform unit identifier needed for the
examples below.
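The two kinds of variable addition described above can be sketched against a toy event history store. The function names, store layout, and sample events are illustrative assumptions, not RIM operations.

```python
# Sketch: deriving a static variable (value at a single point in time)
# and an event history variable (last valid value per annual period)
# from event history rows. All names and data are illustrative.
events = [
    # (unit, variable, value, start, end) -- end None = still valid
    ("007", "RESIDENCE", "BRGN", "1999-01-01", "2002-01-15"),
    ("007", "RESIDENCE", "OSLO", "2002-01-15", "2003-06-30"),
    ("007", "RESIDENCE", "TRND", "2003-06-30", None),
]

def value_at(events, unit, variable, date):
    """Static variable: the value valid at a single point in time."""
    for u, v, value, start, end in events:
        if u == unit and v == variable and start <= date and (end is None or date < end):
            return value
    return None

def last_value_per_year(events, unit, variable, years):
    """Event history variable: last valid value for each annual period."""
    return {y: value_at(events, unit, variable, f"{y}-12-31") for y in years}

print(value_at(events, "007", "RESIDENCE", "2001-12-31"))
# BRGN
print(last_value_per_year(events, "007", "RESIDENCE", [2001, 2002, 2003]))
# {2001: 'BRGN', 2002: 'OSLO', 2003: 'TRND'}
```

Selecting only static variables at one date produces a snapshot of the Event History Data Store; selecting an event history variable produces one value per observational period, as in the examples that follow.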
As the selection of variables is made, and these are derived from the Event History Data Store, they are
arranged into records. The variables selected determine the Unit Type of the observation, such that the
variables associated with the finest level of granularity will determine the Unit Type for the entire
Analysis Data Set. In the example below, all variables are associated with a person, so this is the Unit
Type.
We could create a snapshot Analysis Data Set with records which might look like this:
Unit Identifier   Place of Residence   Date of Birth   Employment Status
007               BRGN                 1971-07-03      FT
It is easy to see how this record could be described by GSIM – each value is a Datum, filling a Unit Data
Point, and grouped into a Unit Data Record with a Logical Record referencing the Represented Variables
used for each Unit Data Point in the Unit Data Record (unit identifier, place of residence, etc.). The unit
identifier is our Identifier Component: because each unit will have only a single record, this also serves to
identify the record.
The complicating factor in RIM is time. Because the values of the variables change over time, our record
cannot be as simple as the one shown above, unless the user has asked for data at a single point in time.
This means that for any other type of selection, each record in our Unit Data Set will have to be qualified
by the time at which it is true.
If the researcher asks for an event history variable (place of residence) over the period 2001–2003, with
the values selected as the last valid value for each annual period, and some additional static variables for
2001, the researcher will get three records for each Unit:
Unit Identifier   Period of Observation   Place of Residence   Date of Birth (2001)   Employment Status (2001)
007               2001                    BRGN                 1971-07-03             FT
007               2002                    OSLO                 1971-07-03             FT
007               2003                    TRND                 1971-07-03             FT
The GSIM model for Unit Data is still applicable here, as long as we recognize that both unit identifier and
period of observation act as Identifier Components: the record cannot be uniquely identified unless we
use both values (007-2001, 007-2002, etc.).
Some variables will have constant values by their nature (date of birth), but others will have values
which might have changed over time. Regardless, the values for the time of the snapshot are repeated
in each row.
In the case where an event history variable is added to the Analysis Data Set, the unit identifier is no
longer sufficient to act as an identifier for the record – time is used to qualify the unit identifier. The
variable, time period, and unit identifier will combine to identify any specific value.
There are sometimes relationships between Units. In the example below, we see a relationship between
two individuals:
Unit ID   Period of Observation   Gender   Civil Status   Civil Union ID
007       2001                    0        M              2345
003       2001                    1        M              2345
Conceptually, the relationship is that the two individuals are both married (civil status of “M”), and also
married to each other (civil union ID is “2345” in each case). The documentation for the civil union ID
variable would describe the relationship between the two individuals. Note that the individuals do not
make a reference to each other to indicate that they are married – they are instead making an external
reference to a civil union. The model presented for relationships between Units in the datum-based data
structure section can represent this.
Researchers will often identify relationships in which they are interested as they process data – these
could be captured in the fashion shown above using generated variables.
6. Provisional and Final Outputs
This section describes the structures and metadata requirements for the data on which disclosure control
is exercised, and which is presented to the user after a statistical operation has been carried out. Prior to
disclosure control, the processed data is termed the Provisional Output. After disclosure control, and
once any needed Corrective Actions have been taken, the Final Output is produced and shown to the
user (GSIM Presentation).
The Provisional and Final Outputs may or may not be data sets in a traditional sense. They can be
multi-dimensional matrices, and in that sense are structured like data sets. Note that we are not
describing here those functions which produce metadata for the researcher to view – that metadata is
non-disclosive, and is not subject to disclosure control before being viewed. Summary statistics are not
considered to be this type of metadata – they are subject to disclosure control.
There is a lot of metadata regarding the structure of the matrices in RAIRD, which will be needed by the
Disclosure Control System to check the Provisional Outputs and to perform Corrective Actions on them to
produce the Final Outputs. There must be sufficient metadata regarding the structure of the Provisional
Outputs to allow further statistical operations to be performed on them, to produce a revised Provisional
Output.
6.1 RAIRD Requirements
In order to perform disclosure control and processing, there are some specific requirements in terms of
the metadata for Provisional and Final Outputs. These outputs are the result of known operations: the
user will have performed a tabulation, a regression, etc. This means that we will have information about
what analytical process acted upon the Analysis Data Set (or the Provisional Output of a prior operation)
to produce the Final Output.
One requirement is that we have sufficient structural metadata to check and process the Provisional
Output. Another requirement is that we understand how the multi-dimensional structure of the matrix
relates to the Analysis Data Set which was used to produce it. Ultimately, we need to be able to
understand which variables in the Analysis Data Set play which roles in the Provisional and Final Outputs,
how they are represented (classifications, numerical values, etc.), and how they have been processed to
populate the matrix (dependent and independent variables, algorithm, etc.). There are other
characteristics of the outputs which will need to be monitored: Population, Unit Type, skewness,
granularity, etc.
6.2 The GSIM Dimensional Data Model
GSIM gives us a useful way of describing the structures of matrices, the Dimensional Data Structure. Here
we see the common model for both the Unit Data Set and the Dimensional Data Set in GSIM (see Figure
6).
In a Dimensional Data Set, the Identifier Components are dimensions. These are related directly to a
Represented Variable (as are all Data Structure Components). The same Represented Variable will be used
in describing the structures of our Analysis Data Set, and Provisional and Final Outputs. For Measure
Components – the cells of our table – this becomes more problematic, because very often the values of
the cells in our table will be derived values, produced by some process of calculation.
To illustrate this, we will provide a simple tabulation example. Take the table below as our Analysis Data
Set:
Unit ID   Period of     Gender   Place of    Highest Education   Highest Education   Annual Income   Annual Income
          Observation   (2001)   Residence   Level (2001)        Level (2002)        (2001)          (2002)
007       2001          M        BRGN        PHD                 PHD                 55000           55000
007       2002          M        BRGN        PHD                 PHD                 55000           55000
009       2001          F        OSLO        MAS                 MAS                 35000           45000
009       2002          F        OSLO        MAS                 MAS                 35000           45000
003       2001          F        OSLO        MAS                 PHD                 25000           50000
003       2002          F        BRGN        MAS                 PHD                 25000           50000
004       2001          M        OSLO        PHD                 PHD                 75000           75000
004       2002          M        OSLO        PHD                 PHD                 75000           75000
005       2001          F        BRGN        MAS                 MAS                 35000           45000
005       2002          F        BRGN        MAS                 MAS                 35000           45000
Note that highest education level and annual income for 2001 and 2002 are static variables, so their
values are repeated for each row – the variable only holds the value for a single time. We want to have a
table which shows average annual income by highest education level and place of residence.
                  Place of Residence
                  Oslo      Bergen
Master's Degree   35000     40000
Doctorate         75000     53333
Some Represented Variables are being used as dimensions (that is, as Identifier Components) – in our
simple case, these are place of residence and highest education level. There are some Represented
Variables which are not being used in this tabulation at all: gender and period of observation. The annual
income has a calculation performed on it to produce the cell values – the relevant values are being
averaged to populate the table. This is, in effect, an implicit variable, as there is no variable in our Analysis
Data Set which contains these values – they are computed and used to populate the cells of our table.
This implicit derived variable serves as our Measure Component. We know exactly how it was calculated.
In order to perform our tabulation, we needed to know the possible values for each of our dimensions –
these would typically be either time or categorical variables (Represented Variables with Codelists or
Statistical Classifications for their representations). For each possible combination of these values, we
take all matching records for annual income, and then average them to populate our cell value, the
average annual income.
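The tabulation just described can be sketched as a grouping and averaging step. The sketch below pairs each unit-period row with that period's own education level and income (one reading of the example data); the names and layout are illustrative, not a RAIRD implementation.

```python
# Sketch: the implicit derived Measure Component -- average annual income
# for each (highest education level, place of residence) combination.
from collections import defaultdict

# (place of residence, highest education level, annual income),
# one tuple per unit-period row of the example Analysis Data Set
records = [
    ("BRGN", "PHD", 55000), ("BRGN", "PHD", 55000),   # 007: 2001, 2002
    ("OSLO", "MAS", 35000), ("OSLO", "MAS", 45000),   # 009
    ("OSLO", "MAS", 25000), ("BRGN", "PHD", 50000),   # 003
    ("OSLO", "PHD", 75000), ("OSLO", "PHD", 75000),   # 004
    ("BRGN", "MAS", 35000), ("BRGN", "MAS", 45000),   # 005
]

# Group matching records by the dimension values (Identifier Components)
cells = defaultdict(list)
for residence, education, income in records:
    cells[(education, residence)].append(income)

# Average to populate each cell -- the implicit derived variable
table = {dims: round(sum(v) / len(v)) for dims, v in cells.items()}

print(table[("MAS", "OSLO")])  # 35000
print(table[("PHD", "BRGN")])  # 53333
```

Under this reading, the computed cells reproduce the example table above: the dimension values enumerate the possible combinations, and the averaging step is the calculation that populates each cell.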
GSIM gives us the basic information about the structure of both our Analysis Data Set and our outputs
(the table) – we know which variables were used, and how they are represented (from the links back to
the Represented Variables in the Analysis Data Set structure). The data sets and data structures part of
GSIM does not cover information about the process itself: that it was an average using specific variables,
involving the implicit derivation of a new variable to act as a Measure Component.
The natural place to look to describe a tabulation in GSIM is the part relating to Process. In this case, we
have a set of variables (the ones taken from the Analysis Data Set) as inputs, the process itself (averaging) and
the derived variable (both as a Represented Variable and the set of Instance Variables making up the
resulting Dimensional Data Set) as outputs. Although the Process Step itself would be a tabulation
process, all the information about what was to be tabulated would need to be recorded as inputs. The
same modeling will be performed for each type of statistical operation supported by RAIRD – correlation,
summation, etc.
7. Topics and Concepts for Organization of Data
This section proposes a structure in the RIM, based on GSIM, which will provide the
information needed by the Data Catalogue. A structure is proposed for exposing the available variables to
researchers, grouping them according to themes (and possible sub-themes), and using conceptual
references to enable more relevant lists of variables to be presented to the user.
In examining the organization of FD-Trygd, the input data warehouse for social insurance data at SSB, it
was identified that a structure much like this one does in fact exist, so that the metadata could be
exported for loading into the RAIRD Event History Data Store without a huge amount of additional effort.
While the official structure within SSB’s FD-Trygd gives us one practical organization of social insurance
data, other systems (such as keyword- and synonym-based searches) might also be useful to provide
access to the data. Where this information would come from is an open question, but the RIM
model should provide a way to express this additional navigational information.
What is needed is to model the topic hierarchies and the variable links to them, so that we have a good
basis for the RIM that will support the information needed by this portion of the Data Catalogue. Further,
much of the in-depth documentation held within SSB regarding the FD-Trygd data is attached at the
higher levels of this structure – at the topic and sub-topic level. The RIM model would need to be able to
show not only a hierarchy of topics and variables, but also to link higher-level documentation to nodes at
different levels within this hierarchy.
7.1 Requirements for Topic and Variable Navigation
The following structure is proposed for FD-Trygd:
Social Insurance
    Topics
        (Possible) Sub-Topics
            Variables
Connected to this hierarchy are the variables themselves – these are what need to be selected by users
and managed by any data processing. In order to do this we will need to have lots of information about
each variable: name, description, codelist, dates on which data are available, etc.
Key to the Data Catalogue is the ability to see the available Unit Types with which variables are associated
(persons, businesses, households, etc.), and the relationships between different Unit Types. Units – and
the relationships between them – are important for the selection of variables, with Unit Type dictating the
initial display, showing known links to other Unit Types. Thus, if we select person as a Unit Type, then we
may also be able to access a variable associated with a different Unit Type (business) through a variable
such as employer. An example of this is the industry variable associated with the business employing the
person. Businesses have industry classifications – people don’t. Researchers could request an industry
variable for data using persons as a Unit Type, where the industry variable is available through the link
individuals have with their employers.
As a last requirement, we may wish to use ELSST or a similar classification to provide keyword/synonym
searches to users who are unfamiliar with the official organization of the Social Insurance data. Such
keywords are typically hierarchical in nature, and allow for multiple assignments to any specific variable.
ELSST is a good example of such a keyword structure, as it has been translated into many European
languages, to support cross-language searches.
7.2 GSIM – Mapping to RIM Requirements
GSIM has a solid model around Concepts, and this is the typical approach for describing such a system of
organization, i.e. as a Concept System. It feels quite natural, therefore, to start by looking at the part of
GSIM which describes Concepts and how they relate to other needed constructs.
GSIM gives us the following diagram, which proves to be very useful for modeling the interface needed
for browsing the Data Catalogue:
[Figure: UML class diagram. A Subject Field groups Concept Systems; a Concept System has Concepts, each with a Definition. Population, Category, Unit Type (with optional Geography and Reference Period attributes), Variable, Represented Variable and Instance Variable all appear as kinds of Concept, with parent/child and isSpecificationOf relations among them.]
Figure 7: Concept
[Source: GSIM v1.1 Enterprise Architect file]
Here, we see that there is something called a Subject Field. This could be understood as the top level of
our structure, with our data set equating to a Subject Field of Social Insurance.
Under this, we have Concept Systems. GSIM defines a Concept System as a "Set of Concepts structured by
the relations among them."
This allows us to do several things. First, the Concept System can cover our topics and sub-topics – these
are simply a set of groupings, which can be understood as views into our pool of concepts.
When we look at how Concepts themselves function, we can see that they are used in many ways: as
Variables of all types, as Unit Types, as Populations (and subtypes of Populations), and as Categories.
The immediate application of this is to allow us to relate our social insurance variables (Represented
Variables in GSIM terms) to the topics and sub-topics required. This works as follows: the Concept which
is a Variable can be related to the Concept which is the sub-topic.
Further, Unit Types are also Concepts, so these can be used as a way of grouping Variables as well, in a
second Concept System.
A third type of Concept System would be used for ELSST or other keyword and synonym system used to
help researchers search or browse the social insurance data.
Thus, we might have several types of Concept Systems in RIM: one for describing topics and sub-topics as
a way of grouping Variables; another for attaching Unit Types to Variables; and a third for describing
keywords and synonyms.
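The three kinds of Concept System proposed above can be sketched as alternative groupings over the same pool of variables. The variable and topic names below are hypothetical examples, not drawn from FD-Trygd or ELSST.

```python
# Sketch: three Concept Systems grouping the same Represented Variables
# in different ways (topics, Unit Types, keywords). All names invented.
concept_systems = {
    "topics": {                 # topics / sub-topics
        "Pensions": ["OLD_AGE_PENSION", "DISABILITY_PENSION"],
        "Employment": ["EMPLOYMENT_STATUS"],
    },
    "unit_types": {             # Unit Type attachment
        "Person": ["OLD_AGE_PENSION", "DISABILITY_PENSION", "EMPLOYMENT_STATUS"],
    },
    "keywords": {               # ELSST-style keywords / synonyms
        "RETIREMENT": ["OLD_AGE_PENSION"],
    },
}

def variables_for(system, concept):
    """Return the Represented Variables a given Concept groups."""
    return concept_systems.get(system, {}).get(concept, [])

print(variables_for("topics", "Pensions"))
# ['OLD_AGE_PENSION', 'DISABILITY_PENSION']
```

Because each system is just another view over the same variables, a variable can be reached through its topic, its Unit Type, or a keyword without being duplicated.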
If we put this set of Concept Systems in place, we then have the ability to associate documentation with
them as well – GSIM allows for the attachment of documentation to any information object, using
administrative attributes. Here, our Concepts would provide attachment points as needed.
7.3 Administrative Details
In the GSIM specification, an object is provided so that implementing organizations can provide a set of
administrative details needed to identify, version, and otherwise manage their information. The way this
is done is for the implementer to decide what the set of needed administrative information is, and to
extend the Administrative Details information object with the desired set of attributes.
This set of administrative information has not yet been identified for the RIM. When it is, a RIM object
called Administrative Details will be added to the model, with the appropriate set of attributes. Any of
the identifiable GSIM objects will then carry these details.
A suggested set of administrative information is contained in the GSIM 1.1 Specification document, and
this could be used as the basis for defining what administrative information is needed for RIM.
One needed attribute for some objects will be a link to external documentation (similar to the Annotation
attribute suggested in the GSIM Specification document), in order to connect the Concepts used for
browsing the Event History Data Store to high-level documentation about the data.
8. Concepts, Classifications, Variables and Related Metadata
This section defines the approach used in RIM for describing the reusable metadata which appears in the
many types of information described in preceding sections. GSIM provides a good model for these types
of metadata, and so it can be directly used with only minor extensions for RIM. Here, we divide these
types of metadata into a set of standard types, and describe how RIM will use the GSIM model. Our types
include Concepts, Classifications, Variables, Populations and Unit Types. Those areas where RIM will
extend the GSIM model are indicated.
As described above, almost all of these types of basic metadata involve the application of Concepts into
concept roles, expressed in GSIM as extensions to the base Concept information object.
8.1 Concepts
In GSIM, Concepts are defined as a “Unit of thought differentiated by characteristics.” A Concept always
has a formal definition as a property. Concepts play many different roles in GSIM, expressed as subclasses of the Concept information object. See Figure 9.
Thus, we have Categories (found in Statistical Classifications, Category Sets and Codelists), Populations
and Unit Types, and three kinds of Variables: Instance Variables (those actually holding data values),
Represented Variables (those with a fixed representation, to be re-used across different data sets), and
Variables (those which are the application of a Concept as a characteristic of a Population, but which have
no fixed representation or value; these are good for comparison across data sets).
For the RIM, this set of objects is very nearly sufficient for describing all the types of data which we have,
according to the models set out in earlier sections.
8.2 Units, Unit Types, and Populations
Units, Unit Types, and Populations are one area where we will need to extend the GSIM model for the
RIM. In the diagram above, we see that we are given Units, Unit Types and Populations, which are all the
application of a specific Concept in a role. Unit Type is defined in GSIM: “A Unit Type is a class of objects of
interest”. We are further told that a “Unit Type is used to describe a class or group of Units based on a
single characteristic, but with no specification of time and geography.”
This makes sense when we look at how Unit Types correspond to Populations (see diagram below) which
are explained in GSIM: “A Population is used to describe the total membership of a group of people,
objects or events based on characteristics, e.g. time and geographic boundaries.”
Units in GSIM are “The object of interest in a Business Process.” These include the specific instances of the
Unit Types making up Populations: persons, businesses, households, etc.
Figure 8: Population
[Source: GSIM v1.1 Enterprise Architect file]
For the RIM, we will only need to extend this model to better describe the relations between Units, as
these are not part of this model in GSIM. The proposal we saw above was to have an explicit Relationship
information object between Units. The Relationship information object and its relationships to Units
would be extensions of the GSIM model. They would be typed “hasRelationshipWith” or
“hasRelationshipTo”. The Relationship object itself is also typed, describing what the relationship is – this
would be a property (i.e. “employedBy”, “marriedTo”, etc.). See the earlier discussion of relationships in
the Datum-Based Data Structure section 4.3.
8.3 Variables
As we have seen, there are three types of variables in GSIM, which will all be useful in the RIM.
Figure 9: Variable and Represented Variable
[Source: GSIM v1.1 Enterprise Architect file]
Here, we see how Variables (relating to a Conceptual Domain) are sub-classed from Concepts and
Represented Variables take their meaning from Variables, but have Value Domains corresponding to the
Variable’s Conceptual Domain. An Enumerated Value Domain is represented by a Code List information
object, which could be based on a Statistical Classification (see the next section).
Instance Variables are shown in the diagram below:
Figure 10: Instance Variable
[Source: GSIM v1.1 Enterprise Architect file]
Here, we see how an Instance Variable takes its meaning from a Represented Variable, which specifies the
value it may hold as a Data Point (taken from the Value Domain of the Represented Variable). We can also
see how an Instance Variable measures a Population, whose members are specified by a Unit Type. What
we do not find in this diagram is that each Instance Variable would contain a value pertinent to a specific
Unit coming from within that Population. This is implicit in GSIM, as you could have Populations made up
of individual Units – in the RIM, we make this explicit. Further, we see how Instance Variables hold the
specific values used in describing data corresponding to our Data Structure Components (Identifier
Components, Measure Components, and Attribute Components).
For the RIM, we will use all of these types of variables. Represented Variables will be reused across the
time-specific views in our Analysis Data Sets (i.e. the employment status variable is a Represented
Variable, which will have an Instance Variable for a specific individual Unit across each observational
period in the Analysis Data Set). The Variable itself is useful when two Represented Variables have
different representations, and need to be re-coded, etc. They share the same Concept, but have different
representations over time, which can be aligned only through processing.
For RIM, one or more attributes will need to be added to the GSIM Variable information objects to
support disclosure control. Examples of these could be discoverability, sensitivity, etc. Represented
Variable is extended to add a sensitivity attribute.
Figure 11: RIM Represented Variable extension
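The extension can be sketched as a subclass that carries the additional attribute. The three-value sensitivity scale here is purely illustrative; the actual attribute values are a methodological decision:

```python
from dataclasses import dataclass


@dataclass
class RepresentedVariable:
    """GSIM Represented Variable, reduced to a name for this sketch."""
    name: str


@dataclass
class RIMRepresentedVariable(RepresentedVariable):
    """RIM extension: adds a sensitivity attribute for disclosure control."""
    sensitivity: str = "low"  # e.g. "low" | "medium" | "high" (illustrative scale)


income = RIMRepresentedVariable("Income after tax", sensitivity="high")
assert income.sensitivity == "high"
```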
8.4 Value Domains, Statistical Classifications, and Codelists
As visible in the diagrams above for Variables, we have an association between the Represented Variable
and the definition of its representation (the Value Domain object). There are two distinct cases – all
described by GSIM – in the RIM:
(1) Described Value Domains
(2) Enumerated Value Domains, Code Lists and Statistical Classifications
The first case is the simplest: the Described Value Domain describes a typed value (a string, a specific
numeric type, a time and/or date value, etc.) These are not enumerated, but are structured according to
some known system describing their allowed values.
The information objects in the second case are closely related, because what is actually implemented is a
Code List information object (in GSIM terms). In GSIM, Code Lists, Statistical Classifications and Category
Sets share a common basic structure, which allows them to be implemented almost identically, as seen in
the diagram below.
The key object in GSIM is the Node Set, which provides the basic structure for Statistical Classifications,
Category Sets and Code Lists. Although there is a large amount of detail in GSIM regarding Statistical
Classifications, this level of detail is not needed for all information objects in RIM. For the values of
enumerated Represented Variables, we will need to know the Code List and each of the Code Items which
make it up. The Code Items are Nodes, linking a Category (a meaning) with its Designation (the code
itself). This same structure is seen for Statistical Classifications and Classification Items, with the
difference that Statistical Classifications can have hierarchical relationships among their Classification
Items. This is important in the RIM for handling disclosure control (it is possible to aggregate up on the
hierarchies within a Statistical Classification). Further, it is necessary to understand the Level information
objects, which correspond to the hierarchies within Statistical Classifications. These allow for the
understanding of how individual Classification Items correspond to real-world constructs. A good example
here is geography – the Levels could describe Countries, Counties, Municipalities, etc.
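The hierarchy of Nodes and Levels, and the aggregation it enables, can be sketched as follows. The geographic codes used here are made up for illustration and do not come from any real classification:

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Node:
    """Links a Category (a meaning) with its Designation (the code itself)."""
    code: str                       # Designation
    category: str                   # meaning
    level: int = 1                  # which Level of the classification it sits on
    parent: Optional["Node"] = None


@dataclass
class StatisticalClassification:
    """A Node Set whose Nodes may have hierarchical relationships."""
    name: str
    levels: list                    # Level names, highest first
    nodes: list = field(default_factory=list)


geo = StatisticalClassification(
    "Geography", levels=["Country", "County", "Municipality"])
norway = Node("NO", "Norway", level=1)
county = Node("NO-03", "Oslo county", level=2, parent=norway)
city = Node("NO-0301", "Oslo municipality", level=3, parent=county)
geo.nodes += [norway, county, city]


def aggregate_up(node: Node, to_level: int) -> Node:
    """Walk up the hierarchy -- the basis for aggregation in disclosure control."""
    while node.level > to_level and node.parent is not None:
        node = node.parent
    return node


assert aggregate_up(city, 1).code == "NO"
```

A disclosure-control step that finds a municipality-level value too revealing can use `aggregate_up` to replace it with its county- or country-level parent.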
Figure 12: Node
[Source: GSIM v1.1 Enterprise Architect file]
Further, the RIM will use the Map construct from GSIM, in cases where two Represented Variables have
Nodes which correspond, but are not identical. Maps are organized into Correspondence Tables, and can
show where two Nodes (whether from Code Lists, Category Sets or Statistical Classifications) are
equivalent.
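A Correspondence Table and its Maps can be sketched as a simple lookup. The two employment-status codings below are hypothetical, invented only to show re-coding between representations:

```python
from dataclasses import dataclass


@dataclass
class Map:
    """States that a Node in one Node Set corresponds to a Node in another."""
    source_code: str
    target_code: str


@dataclass
class CorrespondenceTable:
    """Organizes Maps between two Node Sets (Code Lists, Classifications, ...)."""
    source: str
    target: str
    maps: list


def translate(table: CorrespondenceTable, code: str) -> str:
    """Re-code a value from the source representation to the target one."""
    for m in table.maps:
        if m.source_code == code:
            return m.target_code
    raise KeyError(f"No correspondence for {code!r}")


# Hypothetical: two employment-status codings used in different periods
table = CorrespondenceTable(
    source="EmpStatus-1990", target="EmpStatus-2000",
    maps=[Map("1", "EMP"), Map("2", "UNEMP"), Map("3", "INACT")])

assert translate(table, "2") == "UNEMP"
```

This is the mechanism alluded to in section 8.3: two Represented Variables sharing a Concept but differing in representation can be aligned through such a table.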
9. Disclosure Control Rules and Risk Minimization Actions
This section presents the information model to support the application of disclosure control and risk
minimization in RAIRD. Disclosure control operates on the Provisional Output, before a Final Output can
be shown to the researcher. This model makes no attempt to venture into the domain of methodology in
this area. It is based on the assumption that a useful methodology can be expressed as a series of Rules,
operating on the data and metadata available for the Provisional Output.
9.1 GSIM and Disclosure Control
GSIM views disclosure control as it does any other statistical process – it is a Business Process which is
implemented via a Business Service, Process Steps and related information objects. The application of the
GSIM model for this process in RAIRD is presented in the Process Descriptions in section 10.
For the RIM, a specific model of the information objects required to support this process is needed at a
level of detail which goes beyond what GSIM offers us.
There are two relevant information objects in GSIM: Rules and Process Methods. These objects are found
in the Process Design portion of GSIM (GSIM Specification v1.1 Figure 4). The Process Method has a set of
Rules through which it will be implemented when the process is executed. The RIM model builds on these
objects, but adds more specificity to the model.
The RAIRD process involves more than just the application of Rules – it also associates Corrective Actions
with the Rules. In order to be meaningful, these Corrective Actions must operate on identified parts of the
data being checked.
9.2 RIM Model
The diagrams below show the RIM model for disclosure control information objects. Some of these
objects are extensions of GSIM information objects: Disclosure Control Rules are extensions of GSIM
Rules; Disclosive Data Point and Non-Disclosive Data Point are extensions of GSIM Data Point; and the
Provisional Output and Final Output are extensions of GSIM Data Set.
In the first diagram, we see the identification of potentially disclosive data:
Figure 13: Disclosure Risk Assessment
Here, the application of a Disclosure Control Rule to a Provisional Output produces a Disclosure Risk
Assessment by identifying Disclosive Data Points and Disclosive Data Point Combinations. The
Provisional Output has Non-Disclosive Data Points, Disclosive Data Points, and Disclosive Data Point
Combinations. The Disclosive Data Point Combinations may be composed of Disclosive and/or Non-Disclosive Data Points.
There is the possibility of an up-front Disclosure Risk Assessment, in which some attributes may be
assigned to variables (i.e. discoverability, sensitivity) while other aspects of Disclosure Risk Assessment
can only be known at the time of disclosure control (e.g. granularity, skewness), because they depend on
understanding the context within which a variable is being used, and on having Disclosure Control Rules
which guide the Disclosure Risk Assessment.
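The identification step can be sketched as a rule applied over the Data Points of a Provisional Output. The minimum-frequency threshold used here (and the value 5) is a placeholder, not RAIRD methodology:

```python
from dataclasses import dataclass


@dataclass
class DataPoint:
    cell: str    # e.g. a table-cell identifier
    count: int   # number of Units contributing to the cell


def apply_rule(provisional_output, min_count=5):
    """Placeholder Disclosure Control Rule: flag cells with few contributors.

    Partitions the Data Points of a Provisional Output into disclosive and
    non-disclosive sets, i.e. a minimal Disclosure Risk Assessment.
    """
    disclosive = [dp for dp in provisional_output if dp.count < min_count]
    safe = [dp for dp in provisional_output if dp.count >= min_count]
    return {"disclosive": disclosive, "non_disclosive": safe}


output = [DataPoint("A1", 120), DataPoint("A2", 3), DataPoint("A3", 47)]
assessment = apply_rule(output)
assert [dp.cell for dp in assessment["disclosive"]] == ["A2"]
```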
The second diagram shows the information objects and their relationships after the potentially disclosive
data has been identified:
Figure 14: Disclosure - Corrective Actions
The Disclosure Control Rule identifies one or more Corrective Actions which are informed by the
Disclosure Risk Assessment. The Provisional Output is transformed by the Corrective Actions, through
operations on Disclosive and/or Non-Disclosive Data Points. (These operations could be aggregations
made in reference to the classifications associated with the Data Points; they could be cell suppressions,
or the application of any other formal techniques.) The result is a Final Output deemed to be at an
acceptable disclosure risk level. As Corrective Actions are applied, it is possible that interactions with the
researcher occur, so that the Final Output is maximally useful for their purposes.
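One of the Corrective Actions mentioned above, cell suppression, can be sketched as a transformation from Provisional to Final Output. Aggregation up a classification hierarchy would be an equally valid action; both are only illustrations:

```python
def suppress_cells(provisional_output, disclosive_cells):
    """Corrective Action sketch: primary cell suppression.

    Replaces the value of each flagged Data Point with None, leaving
    Non-Disclosive Data Points untouched.
    """
    return [
        (cell, None if cell in disclosive_cells else value)
        for cell, value in provisional_output
    ]


provisional = [("A1", 120), ("A2", 3), ("A3", 47)]
final_output = suppress_cells(provisional, disclosive_cells={"A2"})
assert final_output == [("A1", 120), ("A2", None), ("A3", 47)]
```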
10. Process Descriptions
This section attempts to describe the processes specific to RAIRD using the GSIM version 1.1 model. There
are several areas in which the processes to be used by RAIRD are fairly specific, notably the processing
and transformation of data, and the execution of tabulations, regressions, and similar analysis functions.
These are not covered in this document. The RAIRD-specific functions – including accreditation of
organizations and researchers – are included. We will provide extensions to the GSIM model for the RIM
because these are areas which are not covered in GSIM.
The GSIM Process model is very generic, and allows for many different types of processes to be described.
However, it is a model designed to describe statistical processes, and such processes as organizational
accreditation do not fall within this category. Thus, there is a difference between these two sections of
this document.
GSIM 1.1 offers a great deal of detail about Process Design, much of which is potentially relevant to the
RAIRD. This document describes both the design-time and run-time parts of the GSIM Process
descriptions.
The processes covered here include the Load the Event History Data Store Process, the Interact with the
Virtual Research Environment Process, and the Control Disclosure Risk Process, using the GSIM run-time
Process model. The Accredit Institution Process and Manage Users Process are described, but not using
the GSIM information objects intended only to describe statistical processes.
10.1 The GSIM Process Model
Below are the parts of the GSIM model which will be most important for describing RAIRD processes.
Note that this model can be used at different levels: RAIRD could be described as a single process using
GSIM, with several sub-processes. That is interesting from the perspective of a statistics organization
using GSIM to manage its production lifecycle, where RAIRD itself could become a dissemination service.
For the purposes of describing and supporting the implementation of RAIRD, however – the goal of the
RIM – this very high-level view is less useful. Instead, we will look at a level of detail which is slightly
greater.
There are two diagrams in GSIM which are of greatest interest: the design-time process model, and the
run-time process model. Note that in GSIM, a process can be automated, manual, or a combination of the
two – all kinds of process execution are modeled in the same way.
The design-time model for processes has already been mentioned, as it contains the Rule and Process
Method information objects which are used in the RIM for modeling information used in disclosure risk
control. This diagram is shown below:
Figure 15: Design Processes
[Source: GSIM Specification v1.1 Figure 4]
The entire RAIRD project plan and related materials could be seen as a Statistical Program Design in this
diagram – it specifies the Business Functions to be supported, and ultimately creates the Process Designs
through which this will be achieved. Note that GSIM Process Designs may utilize Process Designs for sub-processes. Because there is often conditional logic in the way that sub-processes are executed, this logic
must be specified. This is the Process Control Design. Another important aspect of the Process Design is
the specification of what types of inputs and outputs the Process has. We find these in the Process Input
Specification and Process Output Specification information objects. The Process Design itself is informed
by some Process Methods, which are the basis for the formulation of Rules which drive the Process
Control Design of the process.
Here is the GSIM diagram describing the running of processes:
Figure 16: Run Process
[Source: GSIM Specification v1.1 Figure 6]
In each case, we have a Business Process – the overall goal, described not in technical terms, but in terms
of the business goal. A Business Process can execute a Process Step, or execute a re-usable Business Service. The latter information object is more appropriate here, because RAIRD has the goal of implementing re-usable functionality. Business Services also execute Process Steps, but indirectly.
All Process Steps are specified by a Process Design – this is the work needed to determine exactly how a
Process Step will be executed, and this typically exists in some documentary format. A Process Control is
the flow logic associated with a Process Step, among its sub-Process Steps. The Process Control
corresponds to its Process Control Design.
When executed, specific Process Inputs are specified, as well as specific Process Outputs. These must
agree with the types of inputs and outputs specified in the design phase. The information about the
process execution itself (logs, etc.) is seen in the Process Step Instance, which holds all the information
about the execution (e.g. success, failure, error).
10.2 Load Event History Data Store Process
In this process, we have the Business Process of taking the needed data and metadata and loading it into
the Event History Data Store. This overall process is executed as a Business Service, which is implemented
as an API (Application Program Interface) taking specified inputs (see below). The Load Event History Data Store process uses a top-level Process Step, which is composed of several sub-Process Steps:
1. Validate
2. Transform
3. Load
Each sub-Process Step is described below.
Process Control:
If the information is validated successfully, it passes to the preparation/transformation stage (sub-Process
Step). Once prepared/transformed, it is passed to the load stage.
Normal logging will be performed to populate the Process Step Instance.
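The Process Control described above can be sketched as a short pipeline. The callables `validate`, `transform`, and `load` stand in for the three sub-Process Steps, and `log` populates the Process Step Instance; all names are illustrative:

```python
def run_load_process(data, metadata, validate, transform, load, log):
    """Sketch of the top-level Process Control for the load process.

    Validation failures stop the process before the transform and load
    stages; every stage is logged to the Process Step Instance.
    """
    ok, messages = validate(data, metadata)
    log("validate", ok, messages)
    if not ok:
        return {"status": "failed", "errors": messages}
    prepared = transform(data, metadata)
    log("transform", True, [])
    result = load(prepared)
    log("load", result, [])
    return {"status": "loaded" if result else "failed"}


# Minimal stand-ins to exercise the control flow
log_entries = []
outcome = run_load_process(
    data=[{"person": "P001"}], metadata={"vars": ["person"]},
    validate=lambda d, m: (True, []),
    transform=lambda d, m: d,
    load=lambda d: True,
    log=lambda step, ok, msgs: log_entries.append((step, ok)))

assert outcome["status"] == "loaded"
```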
Sub-Process Step 1: Validate
This sub-Process Step takes the Event History Input Data Set and the Input Metadata Set and validates
that the data is completely and validly described according to its structure. This includes checking not only
the event history file(s) against its immediate structural description, but also checking that any references
external to the data file(s) (references to variables described in the base metadata package, and any
Statistical Classifications/Codelists used by these) are correct. Outputs include an indication of success or
failure, and any error messages needed to correct fatal errors for load, or warnings to spot possible errors
which should be investigated by a staff member.
Sub-Process Step 2: Transform
This sub-Process Step performs any needed calculations/transformations on the input data, using
whatever metadata is needed, so that the Event History Data Store can be populated with any additional
information which can be derived from that supplied, including search indexes, calculated values, etc.
Sub-Process Step 3: Load
This process takes all prepared information and loads it into the Event History Data Store. Outputs
include an indication of success or failure and any errors or warnings generated by the load process.
10.3 Interact with the Virtual Research Environment Process
This process serves the purpose of allowing researchers to become familiar with the contents of the Event
History Data Store, to create Analysis Data Sets, and to explore, process and analyze the data to create
Provisional Outputs. The functionality is executed by several GSIM Business Services: those offered for browsing the Data Catalogue, and for executing various operations on the data through the Virtual Statistical Engine.
The overall Process Step is composed of several sub-Process Steps:
(1) Browse the Data Catalogue
(2) Select Variables for the Analysis Data Set
(3) Process Analysis Data Set
(4) Create Provisional Outputs
Each sub-Process Step is described below.
Process Control:
Each of the sub-Process Steps may be performed in any order, and repeated as often as needed. The
primary interactions are determined by the researcher, who acts as the decision-maker in determining
which sub-Process Step to execute next. The process execution is done when a Provisional Output is
submitted for finalization. At this point another process is invoked: Control Disclosure Risk Process.
Sub-Process Step 1: Browse the Data Catalogue
This sub-process allows access to the metadata contained within the Event History Data Store, through
the VRE browser interface. The Process Inputs to this process are the metadata themselves, which are
largely based on the Input Metadata Set, but enhanced with additional information calculated during the
load process, and including presentational content. (Note that in GSIM terms, the presentational content
is a Presentation object, and the Data Catalogue is a Product.)
There are no Process Outputs, since the goal of the sub-process is to inform the researcher about the data
available in the Event History Data Store.
Sub-Process Control 1:
The researcher drives this process by navigating through the metadata. This may involve searches on
keywords, viewing according to topical structure (all functionality based on Concept Systems), and
exploring Units and Represented Variables. There is also additional information about which data are
available for which time periods, and how they are associated with Units.
Sub-Process Step 2: Select Variables for the Analysis Data Set
This process allows the researcher to create or add variables to the Analysis Data Set by requesting
variables one at a time, and either specifying a point in time (for static variables) or a time period (for
event history variables). When selecting variables, the observation periodicity may need to be specified.
The researcher may continue to add data to an existing Analysis Data Set, or may clear the existing one
and start to create a new one.
The inputs to this process are the Datum-Based Data Set held in the Event History Data Store, and the
outputs are Analysis Data Sets.
Every operation in this sub-process is logged, and is available to the researcher, so that Analysis Data Sets
may be accessed at any stage of their evolution. This allows researchers to come back to any iteration of
any Analysis Data Set, even after having invoked another sub-process within the overall process.
Sub-Process Control 2:
The researcher drives this process by making selections through the browser-based VRE. Because every
iteration of every Analysis Data Set is available to the researcher at any point during the overall process,
even after executing other sub-processes, the flow of this sub-process is very flexible.
Sub-Process Step 3: Process Analysis Data Set
This process allows the researcher to manipulate the Analysis Data Set in ways that do not involve the
addition of variables from the Event History Data Store. Several operations are possible, such as
collapsing the data set to aggregate some variables, reshaping the data (transposing rows and columns),
recoding variables, generating new variables by performing computations on existing ones, etc. The
Business Service being invoked is the Virtual Statistical Engine.
As these operations are performed, they are logged, and the iterations of the Analysis Data Set are
accessible to the researcher, so that it is possible to re-create the Analysis Data Set at any point, even
after invoking other sub-process steps in the overall process.
The Process Inputs to this sub-process are the Analysis Data Set, and the Process Outputs are modified
Analysis Data Sets and the log of what operations have been performed.
Sub-Process Control 3:
The researcher drives this process by executing processing operations through the browser-based VRE.
Because every operation made on any Analysis Data Set is available to the researcher at any point during
the overall process, even after executing other sub-processes, the flow of this sub-process is very flexible.
Sub-Process Step 4: Create Provisional Outputs
This sub-process allows researchers to perform analytical operations on the Analysis Data Set to produce
Provisional Outputs, which may then be submitted for finalization. These operations include the running
of tabulations, regressions, correlations, and similar analytical functions. The Business Service being
invoked is the Virtual Statistical Engine. As for other user operations in the VRE, analytical operations are
logged and made available to the researcher, so that all iterations of the Provisional Outputs may be
retrieved at any time, including those which were not submitted for finalization.
The Process Inputs to this sub-process are the Analysis Data Set, and the Process Outputs are Provisional
Outputs.
Sub-Process Control 4:
This sub-process has the potential for ending an iteration of the overall process if the researcher requests
that a Provisional Output be finalized, to produce a Final Output. A request for finalization always invokes
the execution of the Control Disclosure Risk Process. If the researcher does not attempt to finalize a
Provisional Output, then this sub-Process may be combined with the other sub-Processes as flexibly as
the others.
10.4 Control Disclosure Risk Process
This process performs the Business Purpose of checking requested outputs of the analysis process against
a set of rules designed to minimize risk of disclosure. The Business Service is the Disclosure Control
System, and the inputs are the Provisional Output, relevant metadata describing the Provisional Output
(including classifications to be used for aggregation), the set of Disclosure Control Rules for checking the
Provisional Output against, the Disclosure Risk Assessment, and the set of Corrective Actions to be taken
in case of failure. The output is an acceptable Final Output.
The overall Process Step is composed of several sub-Process Steps:
(1) Apply Rules and Identify Risk
(2) Negotiate Corrective Actions
(3) Correct the Provisional Output
Each sub-Process Step is described below.
Process Control:
If the first sub-Process Step fails, then the second sub-Process Step is invoked. Otherwise, no
transformation is required and the Provisional Output becomes the Final Output because it is already
safe. It is possible that the first two sub-Process Steps will iterate several times until an acceptable Final
Output has been created.
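The iteration described in this Process Control can be sketched as a loop. The callables `check`, `negotiate`, and `correct` stand in for the three sub-Process Steps, and the round limit is an illustrative safeguard, not part of the model:

```python
def control_disclosure_risk(provisional, check, negotiate, correct, max_rounds=10):
    """Sketch of the Process Control: iterate checks and corrections until safe.

    `check` applies the Disclosure Control Rules, `negotiate` obtains the
    researcher's priorities for Corrective Actions, and `correct` applies them.
    """
    for _ in range(max_rounds):
        safe, failures = check(provisional)
        if safe:
            return provisional  # promoted to Final Output
        actions = negotiate(failures)
        provisional = correct(provisional, actions)
    raise RuntimeError("No acceptable Final Output could be produced")


# Stand-ins: a single flagged cell is suppressed on the first round
final = control_disclosure_risk(
    provisional={"A2": 3, "A3": 47},
    check=lambda p: (3 not in p.values(), [k for k, v in p.items() if v == 3]),
    negotiate=lambda failures: failures,
    correct=lambda p, cells: {k: (None if k in cells else v) for k, v in p.items()})

assert final == {"A2": None, "A3": 47}
```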
Sub-Process Step 1: Apply Rules and Identify Risk
The Disclosure Control Rules are applied to the Provisional Output, and if any of the checks fail, the
points of failure within the data are identified. The inputs to this sub-Process Step are the Disclosure
Control Rules, any Disclosure Risk Assessments, and the Provisional Output. The outputs from this step
are an indicator of success or failure, along with details about which checks failed and what the points of
failure are within the Provisional Output, as well as a modified Disclosure Risk Assessment. If the
Disclosure Risk Assessment deems the Provisional Output acceptable, then this is promoted to being a
Final Output, and shown to the researcher.
Sub-Process Step 2: Negotiate Corrective Actions
The results of the preceding sub-Process Step are presented to the user for input regarding priorities for
possible Corrective Actions. The resulting output is passed to the next sub-Process Step.
Sub-Process Step 3: Correct the Provisional Output
In the case that there have been failures in checking the Provisional Output against the Disclosure Control
Rules, transformations will be performed to create an acceptable Final Output (eg, cell suppression,
aggregation, etc.) Inputs are the outputs of the preceding sub-Process Step, plus the Disclosure Risk
Assessment, and the Corrective Actions. The output is the Final Output to be passed to the researcher.
All information about rules, actions, input and output tables, etc. is logged along with user details for re-use and audit-trail purposes.
10.5 Accredit Institutions and Manage Users Process
As mentioned in the Introduction, this process is not modeled using GSIM because it is not a statistical
production process. Viewing it as a set of inputs and outputs to a transformative process is not
particularly useful. Despite this, it is important to have some information objects in the RIM which will
cover the needed information.
There are two aspects to this process, which are separated in time. Before any researcher can gain access
to RAIRD, the institution with which they are affiliated must become accredited. Accreditation is
performed by the agency which operates RAIRD and maintains the event-history data. Once accredited,
researchers may request access to RAIRD, at which point they become managed users who can be given
access to the system. Note that this process does not describe run-time user authentication, as that is a
generic process which presumably exists within the IT infrastructure of any organization. Rather, this
process is simply the creation of the information to be managed so that run-time authentication and
other needed generic processes can be performed. This process can be seen in the following diagram:
Figure 17: Accredit Institutions and Manage Users Process Flow
Here, we see the overall process of exchange. This diagram does not show all of the information objects
needed to support the process, but indicates the activities which are taking place.
The Research Institution wishing to be accredited makes an Accreditation Request to the Accrediting
Organization (the statistical agency operating RAIRD), which will either be approved or denied. If the
Accreditation Request is denied, then a Refusal Rationale is provided, which may form the basis for
further Accreditation Requests. If the Accreditation Request is approved, then the Accrediting
Organization will record the accreditation, and provide the Access Agreement to the requester for
acceptance.
Once accredited, users and projects wishing to access the RAIRD will send Access Requests to the
Accrediting Organization, which – if approved – will result in the setting up of User and Project Profiles,
provision of username and password, project start- and end-dates, and other needed information.
If the Access Request is denied, then a Refusal Rationale will be provided so that a more appropriate
Access Request can be re-submitted, if desired.
The following diagram shows the information objects involved in the Accredit Institutions part of this
process:
Figure 18: Accreditation
In this diagram, we can see that there will be information about the Research Institution submitting the
Accreditation Request (it is assumed that the Accrediting Organization is operating RAIRD, and therefore
we do not model the internal records a statistical agency maintains about its own operations.) The
Research Institution will be linked to the Accreditation Request, and this request will further be related
either to an Access Agreement, a Refusal Rationale, or both (if accreditation is granted after one or more
earlier refusals).
The part of the process concerned with Access Requests has a set of information objects, building on the
information objects we see in the accreditation-related part of the model:
Figure 19: Researcher Access Request
We still have the information objects representing the Research Institution and the Access Agreement,
but these are now placed in the context of the actual Access Request from a research Project.
The Project is an information object with several properties, among which are the start and end-dates
granted for data access. These result in an Access Request, submitted by the Project (which may be as
small as a single Researcher). Access will be granted based on the accreditation status of the Research
Institution, plus a review of the Researchers (eg, training certificates). If refused, a Refusal Rationale
would be associated with the Access Request so that it could be corrected and re-submitted.
There are two types of Researchers – primary Researchers, who have special legal responsibility for the
Project, and other Researchers. Some of the properties for Researchers are their degree status, training
certification, and contact details. The User Profile is the information which RAIRD manages about users
who are allowed access to the system: username and password, information about different activities for
re-use and audit trail, saved data of any sort, preferences, etc. Similarly Projects have their own Project
Profiles, as there is a set of similar information for data shared among all Researchers who are part of a
Project.
The output of a Project is ultimately a Research Publication (this may be done by an individual Researcher
or by the Project as a whole). Tracking the Research Publications for which RAIRD data was used is a
requirement for the purposes of auditing should any prove disclosive in the future.
It should be noted that there will be several different levels of users for the overall system, and this will
be a factor in controlling disclosure risk, as some levels of users are deemed riskier than others.
Depending on the level of the users, the disclosure control would be more or less restrictive in terms of
what Provisional Outputs could be finalized.
Users may be classified in 4 categories based on data needs and institutional infrastructure:
i) Standard users: National and international institutions needing large quantities of data for single projects.
ii) Specific, large users: National and international institutions with continuous projects and/or many projects which are topically related and which often request long time series.
iii) Institutions with special requirements concerning health registers.
iv) Students.
i) Standard users
This category contains the majority of potential users. It is a very diverse group, but often with a need for
updates of data. It is expected that a remote access solution will function well for this user category, in
particular because of lessened bureaucracy.
ii) Specific, large users
This group is smaller in numbers, but contains more sophisticated large-scale data users. In particular it
represents more long-term projects that also include substantial amounts of added value to their
research data. This category of users is not as easy to satisfy through a remote access solution. An
investigation into possibilities for offering more automated services also to this group of users is planned
as part of the application.
iii) Institutions with special requirements within health registers
Within the health register system there are some specific regulations that have led to a specific mode of
operation for merging of data. Only the researcher is allowed to keep the merged data. The present
procedure is difficult to automate, but an investigation into possibilities for procedures giving more
efficiency and security is planned, based on experiences built up from other areas.
iv) Students
Education is in need of data that facilitates more relevant work and training. Such data are generally
easier to standardise, aggregate or anonymize. A secure data system is the most relevant solution here.
11. References
The following sources of information have been used:
RAIRD: http://www.raird.no
GSIM: http://www1.unece.org/stat/platform/display/gsim/Generic+Statistical+Information+Model
Particularly Specification v1.0 incl. Annex C: Glossary and Annex D UML class diagrams and object
descriptions, but also GSIM v0.8 Annex D Expanded Set of Principles.
GSBPM:
http://www1.unece.org/stat/platform/display/GSBPM/Generic+Statistical+Business+Process+Model
DDI Specification: http://www.ddialliance.org/Specification/
ELSST: http://elsst.esds.ac.uk/Home.aspx