Title: Guidelines on detecting and treating outliers for the S-DWH
WP: 2
Deliverable: 2.8
Version: 1.1
Date: 11-3-2013
Author: NSI: ONS (UK), Gary Brown

ESSNET ON MICRO DATA LINKING AND DATA WAREHOUSING IN PRODUCTION OF BUSINESS STATISTICS

1. Introduction

The European Statistical System Network project on Micro Data Linking and Data Warehousing (ESSnet DWH) is a joint venture between seven countries: Estonia, Italy, Lithuania, the Netherlands, Portugal, Sweden and the UK. From October 2009 to October 2011, the first stage of the ESSnet DWH gathered data on the use of data warehouses, and on the integration between data warehouses and business registers, across Europe. From October 2011 to October 2013, the second stage of the ESSnet DWH aims to identify best practice and write guidelines. The second stage is split into three separate work packages (WPs): WP1 on Metadata (led by Estonia); WP2 on Methodology (led by the UK); WP3 on Architecture (led by Italy). This report constitutes one of the eight deliverables of WP2:

2.1 Metadata
2.2 Business Register
2.3 Architecture
2.4 Data Linking
2.5 Confidentiality
2.6 Selective Editing
2.7 Mapping
2.8 Outliers

2. Definition of an outlier

An outlier is defined in the Eurostat “Statistics Explained Glossary” and the Organisation for Economic Co-operation and Development (OECD) “Glossary of Statistical Terms” as:

“A data value that lies in the tail of the statistical distribution of a set of data values”

The UK Office for National Statistics “ONS Glossary” defines an outlier as:

“A correct response, usually an extreme value isolated from the bulk of the responses, or has a large sample weight that would have an undue influence on the estimate”

These two definitions illustrate that whether a data value is an outlier depends on the context. In the context of a data warehouse there are three types of outliers – outliers in survey data, outliers in administrative data, and outliers in modelling.
Each of these types is explained below, but first a clear distinction is made between outliers and errors.

2.1 Outliers and errors

Making a clear distinction between an outlier and an error is vital. Errors are erroneous values, and are dealt with via editing rules. The OECD Glossary defines an edit rule as:

“A logical condition or a restriction which must be met if the data is to be considered correct”

whereas the ONS Glossary definition is:

“A rule designed to detect specific errors in data for potential subsequent correction”

The common theme here is the requirement ‘to correct’ – errors are assumed to be incorrect, outliers are assumed to be correct. This assumption is the foundation for the rest of this report.

2.2 Outliers in survey data

In the survey context, an outlier is an unrepresentative value. A survey selects a sample of a target population, collects information from this sample, and uses weighting to estimate population values. The quality of the population estimates depends on: the representativeness of the sample – ie whether it includes all types of respondents found in the population; the size of the sample – ie whether enough units are sampled for an accurate estimate; and the appropriateness of the weighting – ie whether the weighted sample represents the sampled and unsampled population. Common sampling methods are: systematic random sampling for individuals; probability proportional to size for households; simple random sampling for businesses.

Using simple random sampling as an example, an outlier occurs when a sampled unit is assumed to represent N/n unsampled values – where N is the number of units of this type assumed to be in the population, n is the number sampled, and N/n is large – but the sampled unit is actually unique (or nearly unique) in the population. This type of non-representative outlier “would have an undue influence on the estimate” (ONS Glossary).
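The N/n weighting argument can be sketched numerically. The following is an illustrative example only (all figures are made up, not taken from the report): a single (nearly) unique large unit is sampled under simple random sampling and weighted up as if it represented N/n similar units.

```python
# Illustrative sketch (made-up figures): a unique large business sampled under
# simple random sampling is weighted up by N/n and distorts the estimated total.
import numpy as np

N, n = 1000, 50                       # population and sample sizes, so N/n = 20
rng = np.random.default_rng(0)

population = rng.gamma(shape=2.0, scale=10.0, size=N)   # typical unit ~ 20
population[0] = 5000.0                # one (nearly) unique unit in the population

true_total = population.sum()

# A sample that happens to include the unique unit
sample = np.concatenate([[population[0]],
                         rng.choice(population[1:], size=n - 1, replace=False)])
estimate = (N / n) * sample.sum()     # expansion estimate under SRS

# The unique unit is treated as representing N/n = 20 similar units,
# so the weighted estimate overshoots the true total badly.
print(f"true total ≈ {true_total:,.0f}, estimate ≈ {estimate:,.0f}")
```

Because the unique unit is weighted up by 20, the estimated total is several times the true total, which is exactly the “undue influence on the estimate” the ONS definition describes.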
A “data value that lies in the tail” (OECD Glossary) will only be an outlier in the survey context if the population weight is different from 1.

2.3 Outliers in administrative data

In the context of administrative data, an outlier is an atypical value. Administrative data represent a census of a target population, so weighting to estimate population values is not required: each unit is treated as unique in the population. This type of outlier is simply extreme: a “data value that lies in the tail” (OECD Glossary); and “an extreme value isolated from the bulk of the responses” (ONS Glossary). An outlier in administrative data would not be treated, as it would be assumed to be a genuine correct value.

2.4 Outliers in modelling1

1 Modelling includes setting processing rules (for example, editing/imputation), as well as statistical modelling

In the context of modelling, an outlier is an influential value. In the ONS Glossary, influence is defined as:

“The amount of effect a particular point has on the parameters of a regression equation”

An atypical “data value that lies in the tail” (OECD Glossary) is nearly always an outlier in modelling. For example, Figure 1 shows a value outlying on the horizontal ‘x’ axis – note the same unit is not an outlier on the vertical ‘y’ axis. Figure 2 shows the same dataset with an OLS2 regression line fitted. Whether the outlier is influential (on the regression line) depends on which regression line (Figure 3) would be fitted if the outlier was removed: red = no, green = yes. The reality is likely to be less extreme – red is unlikely by chance, and green is unlikely if the outlier is not an error, as it should exhibit the same relationship between y and x as the other points. The expected influence of the outlier simply comes down to its high ‘leverage’ on the regression line, as this is directly proportional to the size of its x value.
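The leverage argument can be illustrated with a small simulation (an illustrative sketch with made-up data, not from the report): an outlier in x has high leverage, but if it follows the same y–x relationship as the other points, removing it barely moves the OLS slope.

```python
# Illustrative sketch (made-up data): leverage of an outlier in x, and its
# (lack of) influence on the OLS slope when it follows the same relationship.
import numpy as np

rng = np.random.default_rng(42)
x = np.concatenate([rng.uniform(0, 10, 30), [50.0]])   # last point: outlier in x
y = 2.0 * x + rng.normal(0, 1, 31)                     # outlier is not an error

def ols_slope(x, y):
    """Slope of the ordinary-least-squares fit of y on x."""
    return np.cov(x, y, bias=True)[0, 1] / np.var(x)

# Leverage in simple linear regression:
# h_i = 1/n + (x_i - xbar)^2 / sum_j (x_j - xbar)^2
# -- it grows with the squared distance of x_i from the mean of x.
n = len(x)
h = 1.0 / n + (x - x.mean()) ** 2 / ((x - x.mean()) ** 2).sum()

slope_all = ols_slope(x, y)            # fit with the outlier (cf. Figure 2)
slope_wo = ols_slope(x[:-1], y[:-1])   # fit without it (cf. Figure 3)

print(f"leverage of outlier: {h[-1]:.2f} (max of other points: {h[:-1].max():.2f})")
print(f"slope with outlier: {slope_all:.2f}, without: {slope_wo:.2f}")
```

The outlier's leverage is close to 1 while every other point's is near zero, yet the two fitted slopes are nearly identical – the "red" scenario in the text: high leverage, little influence, because the point is correct.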
Figure 1: an outlier in x
Figure 2: regression line with outlier
Figure 3: regression line without outlier

2 Ordinary Least Squares

If WLS3 regression was fitted to the dataset in Figure 1, then even data values in the cloud with “a large sample weight” (ONS Glossary) could be influential.

3. Identification of outliers in a data warehouse

A data warehouse stores survey and administrative data for repeated use in estimation and modelling. The data may not be pre-cleaned before storage, so editing might be required as part of the pre-processing of the data. Similarly, the data may not be complete, so imputation of missing data may also be required as part of pre-processing. However, both editing and imputation will be complete before outliers are identified and dealt with, so whether these processes take place inside or outside the data warehouse is immaterial for this report.

Before outliers are discussed, units in a data warehouse need to be defined:
- each unit represents an individual (a person or a business) or an aggregate (a household, an enterprise group, or a domain)
- each unit has multiple data fields (variables) linked to metadata fields (for example, dates)

The permutations are challenging:
- each unit answers Q survey questions for P periods and has A administrative records for T periods – ie QP + AT entries
- each entry can be used in estimation at M multiple domains at N periods – ie (QP + AT)MN uses
- for modelling, any combination of fields can be used – ie QPATM combinations

As any entry could be an outlier in the context of its use in estimation, and any combination of fields may result in outliers influential in modelling, it is impossible to attach a meaningful outlier designation to any unit.
The only statement that can be made with certainty is: every unit in a data warehouse is a potential outlier.

It is not even possible to attach an outlier designation to any entry – as it would have to record the use – ie the domain and period for estimation, and the fields combined and the model used. Given that neither the units in a data warehouse, nor the specific entries of units, can be identified as outliers per se, identification is domain- and context-dependent. This means that outliers are identified during processing of source data, and reported as a quality indicator of the output – if the output itself is stored in the data warehouse, the outliers will become part of the metadata accompanying the output, but will not be identified as outliers at source. The expected uses of a data warehouse and proposed identification methods are listed in Table 1.

3 Weighted Least Squares

Table 1: Identification methods for outliers in a data warehouse

Data use: Processing (edit rules and imputation models)
Identification method: Plotting/sorting/testing/setting cut-offs
Examples: checking observed against expected numbers of edit failures of a rule; sorting ratios to identify extremes when calculating an imputation link

Data use: Updating the business register
Identification method: Edit rules
Detail: Business registers are updated from multiple sources – for example surveys, and administrative data. Priority rules decide between sources – for example, use administrative data unless 3 years older than the survey.
If the difference between sources is above a limit (an edit rule), an outlier is identified.

Data use: Estimation of survey variables
Identification method: One-sided winsorisation
Detail: Returned survey values yi > Ki are identified as outliers, where Ki = ŷi + M/(wi − 1): ŷi is the fitted value for yi – often the stratum mean; wi is the population weight for yi; M minimises the mean squared error – based on previous survey data

Data use: Estimating calibration weights
Identification method: Set an acceptable range for calibration weights (Hedlin et al, 2001)
Detail: Any calibration weights outside the acceptable range are identified as ‘outliers’

Data use: Modelling the relationship between survey and administrative variables
Identification method: Plotting/sorting/setting cut-offs
Examples: Cook’s distance (Cook, 1977) calculates the influence of each datum on the regression parameters – plotting these (against index) identifies outliers; least trimmed squares (Rousseeuw, 1984) sums the smallest k absolute residuals/errors – a sudden increase identifies the (n−k) largest as outliers

Data use: Estimating missing survey variables from administrative variables
Identification method: One-sided winsorisation (see above) of the estimated variables
Detail: Treat the model-based survey estimates as returned survey values (as above)

4. Treatment of outliers in a data warehouse

All identification methods fundamentally do the same thing: they set a limit above which any value is an outlier. Once identified, outliers need to be treated to prevent distortion of estimates/models. Treatment fundamentally consists of weight adjustment:
- an adjustment to 0 percent (of original) equates to deleting the outlier
- an adjustment to P percent (of original) equates to reducing the impact of the outlier
- an adjustment to 100 percent (of original) equates to no adjustment of the outlier

Treatment will reduce variance but introduce bias, so the challenge is to find the P that minimises the mean squared error. Table 2 lists treatment methods against expected uses of a data warehouse.
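The one-sided winsorisation described above can be sketched as follows. This is a minimal illustrative sketch, assuming the standard Kokic–Bell form of the method: the threshold Ki = ŷi + M/(wi − 1) with the stratum mean as fitted value, and the winsorised replacement Ki + (yi − Ki)/wi; the data and the value of M are arbitrary, not optimised from previous survey data as the report prescribes.

```python
# Minimal sketch of one-sided winsorisation (assumed Kokic–Bell form):
# threshold K_i = yhat_i + M/(w_i - 1), with the stratum mean as fitted value.
# M is an arbitrary illustrative constant, not a mean-squared-error-optimised one.
import numpy as np

def winsorise_one_sided(y, w, M):
    """Flag yi > Ki as outliers and replace them by Ki + (yi - Ki)/wi."""
    y, w = np.asarray(y, float), np.asarray(w, float)
    yhat = np.full_like(y, y.mean())      # fitted value: the stratum mean
    K = yhat + M / (w - 1.0)              # per-unit outlier threshold
    outlier = y > K                       # identification: value above threshold
    y_star = np.where(outlier, K + (y - K) / w, y)   # treatment: pull back to K + residual/w
    return y_star, outlier

# A stratum of returned values with one extreme response, all design weights w = 5
y = [12.0, 15.0, 14.0, 13.0, 200.0]
w = [5.0] * 5
y_star, flags = winsorise_one_sided(y, w, M=100.0)
print(flags)    # only the extreme response is flagged
print(y_star)   # the flagged value is reduced; the rest are untouched
```

Note how the treatment is a weight adjustment in disguise: the outlier keeps only 1/wi of its excess over Ki, so its contribution to the weighted estimate is damped rather than deleted.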
Table 2: Treatment methods for identified outliers in a data warehouse

Data use: Processing (edit rules and imputation models)
Treatment method: Delete outliers, impute, or robustify rules or models (see examples)
Examples: use medians rather than means; use quartiles/quantiles rather than maxima/minima; use ratio of means rather than mean of ratios

Data use: Updating the business register
Treatment method: Decide which source is an outlier, and update the register based on the other
Examples: check coherence of both sources with similar units on the register; recontact data sources (if possible) to gather extra information; prioritise future collection of repeat data from the survey source

Data use: Estimation of survey variables
Treatment method: Winsorisation automatically replaces outlier yi by yi* = Ki + (yi − Ki)/wi

Data use: Estimating calibration weights
Treatment method: Restrict outliers to the upper or lower limits of the acceptable range of calibration weights, dependent on whether they are above the upper or below the lower, respectively

Data use: Modelling the relationship between survey and administrative variables
Treatment method: Exclude outliers from the modelling process (ie deletion)

Data use: Estimating missing survey variables from administrative variables
Treatment method: Winsorisation (as above)

5. Recommendations

1. Neither data units nor their entries in a data warehouse should be labelled as outliers
2. Identification and treatment of outliers should be unique to each instance data are used
3. Metadata on outliers, including which domains were searched for outliers and which were not, should only be included in a data warehouse alongside outputs
4. Refer to Tables 1 and 2 for some guidance – not necessarily exhaustive or exclusive – on use-specific identification and treatment of outliers

References

Cook, R. D. (1977). Detection of Influential Observations in Linear Regression. Technometrics, 19, 15-18.
Hedlin, D., Falvey, H., Chambers, R., Kokic, P. (2001). Does the Model Matter for GREG Estimation? A Business Survey Example. Journal of Official Statistics, 17, 527-544.
Rousseeuw, P.J. (1984). Least Median of Squares Regression.
Journal of the American Statistical Association, 79, 871-880.