Statistical Service Specification

Statistical Service Specification: Linear Error Localisation

Protocol for Invoking the Service

The service can be invoked both as a command line executable and as a REST web service.

Input Messages

All data is passed to the service by reference. Each dataset consists of three parts, one of which is the type, describing the type of data.

• data: the dataset (a UnitDataSet) in which the errors will be localised. The input data consists of two parts:
    o data.data: a URL to the physical data set. This is a rectangular file in International Comma Separated Value (CSV) format, i.e. with a comma as field separator and a period as decimal separator. The header of the CSV file must contain the names of the variables. The columns of the data set are the variables to be checked.
    o data.meta: a URL to the description of the physical data set.
• rules: the set of rules against which the dataset will be checked.
    o rules.data: a file in text format. Each line is a Linear Restriction Rule. Each variable used in the Restriction Rules should correspond to the name of a numeric column in the input data set.
• weights (optional): a set of weights used in localising the errors. Fields with a low weight are more likely to be localised as errors.
    o weights.data: a rectangular file in International CSV format. The header of the CSV file must contain the names of the variables.
    o weights.meta: a URL to the description of the weights data set.

There are three different methods for specifying the weights:

• All variables have weight 1.
    o This is the default when no weights are specified.
• For each variable a separate weight is specified; these weights are equal for each record in the input dataset.
    o The input should be a rectangular file in International CSV format. The header of the CSV file must contain the names of the variables. The number of columns must equal the number of columns of the input dataset. There must be only one row of weights in the file.
• For each variable and for each record a separate weight is specified.
    o The input should be a rectangular file in International CSV format. The header of the CSV file must contain the names of the variables. The number of columns must equal the number of columns of the input dataset, and the number of rows must equal the number of records in the input dataset; i.e. a separate weight is specified for each cell (UnitDataPoint) in the input dataset.

Output Message

The service generates three output data sets:

• adapt: describes for each field in each record whether the corresponding field in the input dataset should be adapted to conform to the rule set.
    o adapt.data: a rectangular file in International CSV format. The header of the CSV file contains the names of the variables. The number of rows and the number of columns are equal to those of the input dataset. Each value is either 0 (FALSE) or 1 (TRUE). A value of 1 (TRUE) signifies that the corresponding value in the input dataset should be adapted to conform to the rule set.
    o adapt.meta: a URL to the description of the adapt data set.
• status: indicates for each record the time the algorithm took to locate the errors and the number of errors found in the record.
    o status.data: a rectangular file in International CSV format. The number of rows is equal to the number of rows of the input dataset. The file contains two numerical columns: one with the time in seconds and one with the number of errors localised.
    o status.meta: a URL to the description of the status data set.
• log: a text file in human-readable format. In particular, error messages that result in a non-zero exit value are reported here.

In case of errors, the command line executable will generate a non-zero exit value. In case of the REST service, an error will result in a status of ‘error’.
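The three weighting methods and the layout of adapt.data can be illustrated with a minimal Python sketch. It is not part of the service: the function names (read_csv, expand_weights, flagged_fields), the variable names and the sample values are all hypothetical. It also assumes that the weights header lists the variables in the same order as the input data, and that adapt.data flags appear as the text 1 or TRUE; the specification does not state either point explicitly.

```python
import csv
import io


def read_csv(text):
    """Parse International CSV text (comma separator, period decimal point).

    Returns (header, rows) with every data value converted to float.
    """
    rows = [r for r in csv.reader(io.StringIO(text)) if r]
    return rows[0], [[float(v) for v in row] for row in rows[1:]]


def expand_weights(header, n_records, weights_text=None):
    """Return one weight per cell, as n_records rows of len(header) weights."""
    if weights_text is None:
        # Method 1: no weights specified, so every variable has weight 1.
        return [[1.0] * len(header) for _ in range(n_records)]
    w_header, w_rows = read_csv(weights_text)
    # Assumption: same variable names in the same order as the input data.
    if w_header != header:
        raise ValueError("weights header must match the input data header")
    if any(len(row) != len(header) for row in w_rows):
        raise ValueError("each weights row needs one value per variable")
    if len(w_rows) == 1:
        # Method 2: a single row of weights, shared by all records.
        return [list(w_rows[0]) for _ in range(n_records)]
    if len(w_rows) == n_records:
        # Method 3: a separate weight for each cell of the input dataset.
        return w_rows
    raise ValueError("weights need 1 row or one row per input record")


def flagged_fields(adapt_text):
    """List, per record, the variables marked 1 (TRUE) in adapt.data."""
    rows = [r for r in csv.reader(io.StringIO(adapt_text)) if r]
    header = rows[0]
    return [
        [name for name, flag in zip(header, row)
         if flag.strip().upper() in {"1", "TRUE"}]
        for row in rows[1:]
    ]


# Hypothetical sample files, inlined as strings for illustration only.
data_text = "turnover,cost,profit\n100,60,40\n80,50,10\n"
weights_text = "turnover,cost,profit\n2,1,1\n"
adapt_text = "turnover,cost,profit\n0,0,0\n0,0,1\n"

header, records = read_csv(data_text)
weights = expand_weights(header, len(records), weights_text)
print(weights)                     # [[2.0, 1.0, 1.0], [2.0, 1.0, 1.0]]
print(flagged_fields(adapt_text))  # [[], ['profit']]
```

Expanding every method to one weight per cell keeps later processing uniform: a client can treat the default, per-variable and per-cell cases identically once the weights have been read.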
Applicable Methodologies

Error localisation using the Fellegi and Holt principle; see for example:

• Jonge, E. de and M. van der Loo (2014). ‘Error localization as a mixed integer problem with the editrules package’, Discussion Paper 2014|07, CBS.
• Fellegi, I.P. and D. Holt (1976). ‘A systematic approach to automatic edit and imputation’, Journal of the American Statistical Association 71, 17-35.
