Statistical Service Specification

Statistical Service Specification: Linear Error Localisation
Protocol for Invoking the Service
The service can be invoked both as a command-line executable and as a REST web service.
Input Messages
All data is passed to the service by reference. Each dataset consists of three parts: the type
describing the type of data
data: the dataset (a UnitDataSet) in which the errors will be localised. The input data consists
of two parts:


data.data: a URL to the physical data set, a rectangular file in International Comma
Separated Value (CSV) format, i.e. with a comma as field separator and a period as
decimal separator. The header of the CSV file must contain the names of the variables.
The columns of the data set are the variables to be checked.
data.meta: a URL to the description of the physical data set.
rules: the set of rules against which the dataset will be checked.

rules.data: in text format. Each line is a Linear Restriction Rule. Each variable used in
the Restriction Rules should correspond to the name of a numeric column in the input
data set.
weights (optional): a set of weights used in localising the errors. Fields with a low weight
are more likely to be localised as errors.


weights.data: The input should be a rectangular file in International Comma Separated
Value format. The header of the CSV file must contain the names of the variables.
weights.meta: a URL to the description of the weights data set.
There are three different methods for specifying the weights:

All variables have weight 1.
This is the default when no weights are specified.

For each variable a separate weight is specified; these weights are equal for each
record in the input dataset.
The input should be a rectangular file in International Comma Separated
Value format. The header of the CSV file must contain the names of the
variables. The number of columns must equal the number of columns of the
input dataset. There must be only one row of weights in the file.

For each variable and for each record a separate weight is specified.
The input should be a rectangular file in International Comma Separated
Value format. The header of the CSV file must contain the names of the
variables. The number of columns must equal the number of columns of the
input dataset and the number of rows must equal the number of records
in the input dataset; i.e. for each cell (UnitDataPoint) in the input dataset a
separate weight is specified.
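The three ways of specifying weights can be reduced to a single table with one weight per cell. The following is a minimal Python sketch of that normalisation, assuming the CSV contents have already been retrieved from their URLs. The function name `expand_weights` and its arguments are hypothetical and not part of the service.

```python
import csv
import io


def expand_weights(data_csv, weights_csv=None):
    """Return (variable names, one list of weights per input record)."""
    data_rows = list(csv.reader(io.StringIO(data_csv)))
    header, records = data_rows[0], data_rows[1:]

    # Method 1: no weights given, so every cell gets weight 1.
    if weights_csv is None:
        return header, [[1.0] * len(header) for _ in records]

    reader = csv.DictReader(io.StringIO(weights_csv))
    weight_rows = list(reader)
    if sorted(reader.fieldnames or []) != sorted(header):
        raise ValueError("weights columns must match the data columns")

    # Order the weights by the column order of the input data set.
    table = [[float(row[name]) for name in header] for row in weight_rows]

    # Method 2: one row of weights, reused for every record.
    if len(table) == 1:
        return header, [list(table[0]) for _ in records]

    # Method 3: one row of weights per record.
    if len(table) == len(records):
        return header, table

    raise ValueError("weights must have one row or one row per record")
```

Matching weights to variables by header name rather than by column position is a design choice of this sketch; the specification only requires that the header contains the variable names.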
Output Message
The service generates three output data sets:
adapt: describes for each field in each record whether the corresponding field in the input
data set should be adapted to conform to the rule set.


adapt.data: Rectangular file in International Comma Separated Value format. The
header of the CSV file contains the names of the variables. The number of rows is
equal to the number of rows of the input dataset and the number of columns is equal
to the number of columns in the input dataset. Each value in a record is either 0
(FALSE) or 1 (TRUE). A value of 1 (TRUE) signifies that the corresponding value in
the input dataset should be adapted to conform to the rule set.
adapt.meta: a URL to the description of the adapt data set.
status: indicates for each record the time the algorithm took to locate the errors and the
number of errors found in the record.


status.data: Rectangular file in International Comma Separated Value format. The
number of rows is equal to the number of rows of the input dataset. The file contains
two numerical columns: one with the time in seconds and one with the number of
errors localised.
status.meta: a URL to the description of the status data set.
log: a text file in human-readable format. In particular, error messages resulting in a
non-zero exit value are reported here.
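Because adapt.data has the same shape as the input data set, the two files can be read side by side to list the cells that need adapting. The following Python sketch illustrates this, assuming both CSV files have already been retrieved. The function name `cells_to_adapt` is hypothetical, and accepting the literal text TRUE in addition to 1 is an assumption about how the flag may be written.

```python
import csv
import io


def cells_to_adapt(data_csv, adapt_csv):
    """Return (record index, variable name, current value) for flagged cells."""
    data = list(csv.DictReader(io.StringIO(data_csv)))
    adapt = list(csv.DictReader(io.StringIO(adapt_csv)))
    if len(data) != len(adapt):
        raise ValueError("adapt.data must have one row per input record")

    flagged = []
    for index, (record, flags) in enumerate(zip(data, adapt)):
        for name, value in record.items():
            # Assumption: a flag is written as 1 or as the text TRUE.
            if flags[name].strip().upper() in ("1", "TRUE"):
                flagged.append((index, name, value))
    return flagged


# Illustrative data only: the third field of the first record is flagged.
example_data = "turnover,cost,profit\n100,60,50\n80,50,30\n"
example_adapt = "turnover,cost,profit\n0,0,1\n0,0,0\n"
flagged_cells = cells_to_adapt(example_data, example_adapt)
```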
If an error occurs, the command-line executable returns a non-zero exit value and the REST
service returns a status of ‘error’.
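A caller of the command-line executable can detect failure from the exit value alone. The sketch below shows one way to do this in Python. The specification does not name the executable or its arguments, so the command is supplied by the caller, and the commands used here are stand-ins that only set an exit value.

```python
import subprocess
import sys


def run_and_check(command):
    """Run a command and return (succeeded, exit value)."""
    completed = subprocess.run(command, capture_output=True, text=True)
    return completed.returncode == 0, completed.returncode


# Stand-in commands; replace them with the real executable and its arguments.
ok, code = run_and_check([sys.executable, "-c", "raise SystemExit(0)"])
failed_ok, failed_code = run_and_check([sys.executable, "-c", "raise SystemExit(3)"])
```

When the exit value is non-zero, the log output described above is the place to look for the error message.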
Applicable Methodologies
Error Localisation using the Fellegi and Holt principle, see for example:
Jonge, E. de and M. van der Loo (2014). ‘Error localization as a mixed integer problem with
the editrules package’, Discussion Paper 2014|07, CBS.
Fellegi, I.P. and D. Holt (1976). A systematic approach to automatic edit and imputation,
Journal of the American Statistical Association 71, 17-35.
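The Fellegi and Holt principle asks for the set of fields with the smallest total weight that, once changed, allows the record to satisfy all rules. The sketch below illustrates the principle by brute-force enumeration in Python. It is not the mixed integer formulation described by De Jonge and Van der Loo, and it is not the service's implementation. The function `localise_errors`, the `can_be_repaired` callback, the example record and the weights are all hypothetical.

```python
from itertools import combinations


def localise_errors(variables, weights, can_be_repaired):
    """Return the minimum-weight set of variables to adapt (brute force)."""
    best, best_weight = None, float("inf")
    for size in range(len(variables) + 1):
        for subset in combinations(variables, size):
            weight = sum(weights[name] for name in subset)
            # Keep the subset only if it is cheaper and makes a repair possible.
            if weight < best_weight and can_be_repaired(set(subset)):
                best, best_weight = set(subset), weight
    return best


# Illustrative record that violates the rule turnover == cost + profit.
record = {"turnover": 100.0, "cost": 60.0, "profit": 50.0}
weights = {"turnover": 2.0, "cost": 1.0, "profit": 1.5}


def can_be_repaired(free):
    # Hypothetical check for this single rule: with no free variable the
    # record must already satisfy it; with at least one free variable the
    # equation can always be solved for that variable.
    if free:
        return True
    return record["turnover"] == record["cost"] + record["profit"]


solution = localise_errors(list(record), weights, can_be_repaired)
```

Here `solution` is `{"cost"}`, because cost has the lowest weight among the fields whose change can restore the balance. Enumeration grows exponentially with the number of variables, which is why practical implementations use an optimisation formulation instead.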