Data warehousing – defining outputs and functions

advertisement
Data warehousing – defining outputs and functions
Consider first data appearing from nowhere:
Dataset
Dataset
DWH
Dataset
Aggregate
Statistics
Aggregate
Statistics
Microdata
Business
register
Input data
Storage,
combination
Outputs
Datasets go into the data warehouse (DWH) as does the business register (BR). The DWH in
this picture holds data and allows it to be extracted. The DWH in this model is a technical
facility. The DWH is used to produce aggregate statistics and microdatasets, which are used
for further analysis.
Now consider the structure of the DWH in more detail:
Working data
Dataset
Staging area
Dataset
Dataset
Business
register
Input data
Aggregate
Statistics
Aggregate
Statistics
Microdata
BR
snapshots
Storage,
combination
Outputs
For the purposes of this discussion we can summarise the DWH as having
 a staging area where inputs are cleaned/integrated
 a working area where outputs can be produced
The DWH also includes snapshots of the business register (possibly several of them) but
these are datasets in the DWH like any other; that is, they can be analysed, and can be
combined with other datasets to produce outputs. Hence they are coloured the same, above.
the snapshots held in the DWH for analysis are not updated in the DWH.
Now consider the outputs from the DWH in more detail:
Data
extracts
Data
extracts
Working data
Dataset
Staging area
Dataset
Dataset
Business
register
Input data
Aggregate
Statistics
Aggregate
Statistics
Microdata
BR
snapshots
Data
extracts
Storage,
combination
Outputs
In this picture we note that one of the outputs from the DWH is data extracts. These are
technically identical as the ‘microdata’ we noted might be produced for further analysis, but
conceptually we call these something different because we want to identify these types of
output as providing information to help the statistical system.
Take the Labour Force Survey as an example. By ‘microdata’ we could mean a de-identified
dataset intended for release to researchers under licence. A ‘data extract’ could be identified
microdata used for a follow up survey.
As another example, look at the register snapshops. ‘Microdata’ would be anonymised until
record information for analysis. ‘Data extracts’ could be a specific identified sample to collect
data from. However, before we take that leap forward, consider where the datasets come
from:
Selected
sample
Dataset
Admin data
source
Dataset
Admin data
source
Business
register
Input
reference
frame
Input data
Working data
Dataset
Staging area
Selected
sample
Data
extracts
Data
extracts
Aggregate
Statistics
Aggregate
Statistics
Microdata
BR
snapshots
Data
extracts
Storage,
combination
Outputs
Data could either come from (a) a sample, selected and then surveyed (b) administrative data
(c) both.
For arguments sake here we have assumed that the business register is fed entirely by admin
data. But the business register is also potentially updated from survey responses. How does
this relate to the DWH – should there not be a two-way link?
The answer is no, because we’ve already agreed that one of the outputs from the DWH is a
data extract – which could be information from a survey. What we then need are rules on how
to update the register:
Selected
sample
Dataset
Admin data
source
Dataset
Admin data
source
Business
register
Working data
Dataset
Staging area
Selected
sample
Data
extracts
Data
extracts
Aggregate
Statistics
Aggregate
Statistics
Microdata
BR
snapshots
Data
extracts
Rules for updating BR
Input
reference
frame
Input data
Storage,
combination
Outputs
So, the BR does not have a two-way relationship with the DWH. The BR generates snapshots
which go into the DWH like any other dataset. The DWH generates data extracts, which might
be useful for updating the BR, but might not. The question of how the extracts are used is not
part of the functionality of this technical facility called the DWH. The BR updating process is
independent of the functioning of the DWH – all that matter is that the necessary extracts can
be produced.
So we now have an almost complete view of the statistical process – but only almost. Where
do the samples come from? Again, the answer is to consider the use of DWH outputs:
Rules for generating samples etc
Selected
sample
Dataset
Admin data
source
Dataset
Admin data
source
Business
register
Working data
Dataset
Staging area
Selected
sample
Data
extracts
Data
extracts
Aggregate
Statistics
Aggregate
Statistics
Microdata
BR
snapshots
Data
extracts
Rules for updating BR
Input
reference
frame
Input data
Storage,
combination
Outputs
As for the BR, consider that the DWH produces data extracts. The fact that this extract is an
identified sampling frame rather than a dataset for more analysis is irrelevant to the running of
the DWH. The DWH just needs to be able to produce an extract; and the survey managers
just need an extract with relevant information, not any idea of how the data is produced.
Where does this leave the ‘data warehouse’? The DWH above was described as a ‘technical’
facility. Now let us consider what might be the domain of the ‘statistical data warehouse’ and
how this fits into the scope of the ESSNet:
Rules for generating samples etc
Selected
sample
Dataset
Admin data
source
Dataset
Admin data
source
Business
register
Working data
Dataset
Staging area
Selected
sample
Data
extracts
Data
extracts
Aggregate
Statistics
Aggregate
Statistics
Microdata
BR
snapshots
Data
extracts
Rules for updating BR
Input
reference
frame
Input data
Storage,
combination
Outputs
The shaded areas on the above diagram could be consider the ‘statistical data warehouse’
(SDWH): that is, the SDWH comprises
 technical facilities for storing and processing data, receiving data in and producing
outputs in a flexible way
 rules for updating the sources for the DWH
 rules for generating samples
 definitions necessary to achieve those sample/source generation
 the data flow model
Finally, what is the difference between ‘statistical data warehouse’ and the ‘data warehouse
architecture? I would argue that this is the difference between how and what:
 statistical data warehouse: what are we doing and why?
 data warehouse architecture: how do we do this?
but that both of these cover the same domain.
Download