Statistical Service Specification

This document specifies three services that implement the core functionality of the confidentialization-on-the-fly process. The following assumptions apply to all three services:

  A micro dataset must be prepared for external analysis before it is used by these services. Preparation may include, to name a few cases, removing existing variables from the dataset, adding new variables that will not be visible to the user, or discretizing existing continuous variables;
  The dataset never leaves the service boundary: it exists only on the server side;
  All metadata associated with a dataset is stored on the back end and is used by these services. This document refers to it as dataset metadata;
  The methodology of the underlying confidentialization is driven by an extensive set of configuration parameters. These parameters are carefully chosen by those who are familiar with both the methodology and the data being confidentialized. It is critical to keep these configuration parameters confidential.

Links to XSD types are provided for those service parameters that have comparable types in DDI.

1) Statistical Service Specification: ConfidentializedModelService

This service builds a confidentialized statistical model. Not all properties of the model are returned to the user. For example, returning both model parameters and residuals would allow one to reproduce the actual microdata values (within the level of perturbation that may have been applied before building the model). The service also allows the user to get a set of diagnostic plots for a specified model.

Protocol for Invoking the Service

The service can be invoked as a SOAP web service. The service has two methods, and all service parameters are passed by value.

a) Service method: getStatisticalModel

This method builds and returns a confidentialized statistical model.
Input Messages

  userID - user ID;
  datasetID - ID of the micro dataset on which to build the model; and
  modelDescriptor - an object of ModelDescriptor type.

Output Message

The output of the service method is:

  StatisticalModel - confidentialized statistical model.

b) Service method: getModelPlots

This method builds and returns one or more diagnostic plots for the specified model.

Input Messages

  userID - user ID;
  datasetID - ID of the micro dataset on which to build the model;
  modelDescriptor - an object of ModelDescriptor type;
  width - plot width; and
  height - plot height.

Output Message

The output of the service method is:

  Plot - diagnostic plot of the model.

2) Statistical Service Specification: ConfidentializedEDAService

Exploratory Data Analysis (EDA) is a standard step in any statistical data analysis. It allows a user to look at microdata and identify patterns that can then be explained more formally in statistical models. However, in confidentialization-on-the-fly settings, the user is not allowed to look at the microdata. The Confidentialized EDA Service provides three methods that allow a user to become familiar with the data, at a level specified by configuration parameters, without any respondent being identified.

Protocol for Invoking the Service

The service can be invoked as a SOAP web service. The service has three methods, and all service parameters are passed by value.

a) Service method: getDataSummary

This method returns a summary table.

Input Messages

  userID - user ID;
  datasetID - ID of the micro dataset to summarize;
  byVar - list of IDs of (categorical) variables by which to group other (cell) variables (references dataset metadata); and
  cellVars - list of IDs of the variables to display in the table (references dataset metadata).

Output Message

The output of the service method is:

  a UnitDataSet containing the actual table data.

b) Service method: getWeightedDataSummary

This method returns a weighted summary table.
Input Messages

  userID - user ID;
  datasetID - ID of the micro dataset to summarize;
  byVar - list of IDs of (categorical) variables by which to group other (cell) variables (references dataset metadata);
  cellVars - list of IDs of the variables to display in the table (references dataset metadata); and
  weightVarID - ID of the weight variable (references dataset metadata).

Output Message

The output of the service method is:

  a UnitDataSet containing the actual table data.

c) Service method: getHexBinPlot

This method creates a hex-bin plot for two continuous variables.

Input Messages

  userID - user ID;
  datasetID - ID of the micro dataset to plot; and
  plotDescriptor - an object of BoxPlotDescriptor type.

Output Message

The output of the service method is:

  a plot.

3) Statistical Service Specification: ConfidentializedDataService

This service allows a user to customize an existing dataset. For example, certain variables may need to be added to the dataset to accommodate building a particular statistical model. It may seem that the service provides functionality that is independent of the underlying confidentialization process, but it does not. For example, adding a variable to the dataset also involves adding certain metadata associated with that variable, as specified by the underlying configuration settings, and this metadata is a key ingredient in preventing attacks (as described in the methodology paper). This metadata is added to the dataset metadata.

Protocol for Invoking the Service

The service can be invoked as a SOAP web service. All method parameters are passed by value. The service methods are as follows:

a) Service method: createNewDataSet

Assigns one of the available datasets to the user. A user may modify his/her dataset through one or more of the methods below. To maintain service statelessness, the dataset is serialized after every modification of its structure.
Input Messages

  userID - user ID;
  surveyID - survey ID;
  datasetID - ID of the new micro dataset; and
  variables - the list of variables with their existing and new names. The user may rename variables through this parameter.

Output Message

The output of the service method is just the status of the operation.

b) Service method: deleteDataSet

This method deletes the user's dataset.

Input Messages

  userID - user ID; and
  datasetID - dataset ID.

Output Message

The output of the service method is just the status of the operation.

c) Service method: keepRecords

Keeps only those records in the dataset that meet the specified criteria (expressed in terms of categories of an existing variable). Records cannot currently be filtered based on continuous criteria.

Input Messages

  userID - user ID;
  datasetID - dataset ID; and
  keepCriteria - a logical expression specifying the criteria for keeping records (for example: state=10,20,30,40).

Output Message

The output of the service method is just the status of the operation.

d) Service method: dropVariable

Drops a single variable from the dataset.

Input Messages

  userID - user ID;
  datasetID - dataset ID; and
  variableID - ID of the variable to drop.

Output Message

The output of the service method is just the status of the operation.

e) Service method: dropVariables

Drops one or more variables from the dataset.

Input Messages

  userID - user ID;
  datasetID - dataset ID; and
  variableIDs - IDs of variables to drop (reference dataset metadata).

Output Message

The output of the service method is just the status of the operation.

f) Service method: addContinuousVariable

Adds a new continuous variable to the dataset.

Input Messages

  userID - user ID;
  fromDatasetID - ID of the dataset to which to add a variable;
  toDatasetID - ID of the new dataset;
  variableID - ID of the new variable to add; and
  expression - formula expression of the new variable (a function of existing variables that reference dataset metadata).
Output Message

The output of the service method is just the status of the operation.

g) Service method: addCategoricalVariable

Adds a new categorical variable to the dataset.

Input Messages

  userID - user ID;
  fromDatasetID - ID of the dataset to which to add a variable;
  toDatasetID - ID of the new dataset;
  variableID - ID of the new variable; and
  variableCodes - the list of variable codes.

Output Message

The output of the service method is just the status of the operation.

Applicable Methodologies

Gwenda Thompson, Stephen Broadfoot and Daniel Elazar: “Methodology for the Automatic Confidentialisation of Statistical Outputs from Remote Servers at the Australian Bureau of Statistics”, Australian Bureau of Statistics.
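
Illustrative Request Sketch

All three services are invoked as SOAP web services with parameters passed by value. The following Python sketch shows one way a client might assemble such a request, using keepRecords and dropVariables as examples. It is only an illustration under stated assumptions: the service namespace, the use of parameter names as element names, the encoding of a list as repeated elements, and the sample IDs are all hypothetical and are not defined by this specification. Only the SOAP 1.1 envelope namespace is standard.

```python
import xml.etree.ElementTree as ET

# Standard SOAP 1.1 envelope namespace.
SOAP_ENV_NS = "http://schemas.xmlsoap.org/soap/envelope/"
# Hypothetical service namespace; the specification does not define one.
SERVICE_NS = "urn:example:confidentialized-data-service"


def build_soap_request(method, params, service_ns=SERVICE_NS):
    """Return a SOAP 1.1 request envelope (as a string) for one service method.

    Each parameter is passed by value as a child element of the method
    element. A list value is written as repeated elements, which is an
    assumed encoding rather than one taken from the specification.
    """
    envelope = ET.Element(f"{{{SOAP_ENV_NS}}}Envelope")
    body = ET.SubElement(envelope, f"{{{SOAP_ENV_NS}}}Body")
    call = ET.SubElement(body, f"{{{service_ns}}}{method}")
    for name, value in params.items():
        values = value if isinstance(value, (list, tuple)) else [value]
        for item in values:
            child = ET.SubElement(call, f"{{{service_ns}}}{name}")
            child.text = str(item)
    return ET.tostring(envelope, encoding="unicode")


# Sample IDs below are made up for illustration.
keep_request = build_soap_request(
    "keepRecords",
    {"userID": "user-1", "datasetID": "ds-42", "keepCriteria": "state=10,20,30,40"},
)
drop_request = build_soap_request(
    "dropVariables",
    {"userID": "user-1", "datasetID": "ds-42", "variableIDs": ["age", "income"]},
)
print(keep_request)
```

In practice a client would normally be generated from the service's WSDL instead of building envelopes by hand. The sketch only makes the by-value parameter passing concrete.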