Title: Guidelines on detecting and treating outliers for the S-DWH
WP: 2
Deliverable: 2.8
Version: 2.0 – final
Date: September 2013
Author: Gary Brown
NSI: ONS (UK)

ESS-NET ON MICRO DATA LINKING AND DATA WAREHOUSING IN PRODUCTION OF BUSINESS STATISTICS

Index

1. Introduction
2. Definition of an outlier
2.1 Outliers and errors
2.2 Outliers in survey data
2.3 Outliers in administrative data
2.4 Outliers in modelling
3. Identification of outliers in a data warehouse
4. Treatment of outliers in a data warehouse
5. Conclusions & Recommendations
References

1. Introduction

The European Statistical System Network project on Micro Data Linking and Data Warehousing (ESSnet DWH) is a joint venture between seven countries: Estonia, Italy, Lithuania, the Netherlands, Portugal, Sweden and the UK.
From October 2009 to October 2011, the first stage of the ESSnet DWH gathered data on the use of data warehouses, and on the integration between data warehouses and business registers, across Europe. From October 2011 to October 2013, the second stage of the ESSnet DWH aims to identify best practice and write guidelines. The second stage is split into three separate work packages (WPs): WP1 on Metadata (led by Estonia); WP2 on Methodology (led by the UK); WP3 on Architecture (led by Italy). This report constitutes one of the eight deliverables of WP2:

- 2.1 Metadata
- 2.2 Business Register
- 2.3 Architecture
- 2.4 Data Linking
- 2.5 Confidentiality
- 2.6 Selective Editing
- 2.7 Mapping
- 2.8 Outliers

2. Definition of an outlier

An outlier is defined in the Eurostat “Statistics Explained Glossary” and the Organisation for Economic Co-operation and Development (OECD) “Glossary of Statistical Terms” as:

“A data value that lies in the tail of the statistical distribution of a set of data values”

The UK Office for National Statistics “ONS Glossary” defines an outlier as:

“A correct response, usually an extreme value isolated from the bulk of the responses, or has a large sample weight that would have an undue influence on the estimate”

These two definitions illustrate that whether a data value is an outlier depends on the context. In the context of a data warehouse there are three types of outliers: outliers in survey data, outliers in administrative data, and outliers in modelling. Each of these types is explained below, but first a clear distinction is made between outliers and errors.

2.1 Outliers and errors

Making a clear distinction between an outlier and an error is vital. Errors are incorrect values, and are dealt with via editing rules.
The OECD Glossary defines an edit rule as:

“A logical condition or a restriction which must be met if the data is to be considered correct”

whereas the ONS Glossary definition is:

“A rule designed to detect specific errors in data for potential subsequent correction”

The common theme here is the requirement ‘to correct’: errors are assumed to be incorrect, whereas outliers are assumed to be correct. This assumption is the foundation for the rest of this report.

2.2 Outliers in survey data

In the survey context, an outlier is an unrepresentative value. A survey selects a sample from a target population, collects information from this sample, and uses weighting to estimate population values. The quality of the population estimates depends on:

- the representativeness of the sample, i.e. whether it includes all types of respondents found in the population
- the size of the sample, i.e. whether enough units are sampled for an accurate estimate
- the appropriateness of the weighting, i.e. whether the weighted sample represents the sampled and unsampled population

Common sampling methods are: systematic random sampling for individuals; probability proportional to size for households; simple random sampling for businesses.

Using simple random sampling as an example, an outlier occurs when a sampled unit is assumed to represent N/n units in the population (where N is the number of units of this type assumed to be in the population, n is the number sampled, and N/n is large) but the sampled unit is actually unique (or nearly unique) in the population. This type of non-representative outlier “would have an undue influence on the estimate” (ONS Glossary). A “data value that lies in the tail” (OECD Glossary) will only be an outlier in the survey context if the population weight is different from 1.

2.3 Outliers in administrative data

In the context of administrative data, an outlier is an atypical value.
Administrative data represent a census of a target population, so weighting to estimate population values is not required. Each unit is treated as unique in the population. This type of outlier is simply extreme: a “data value that lies in the tail” (OECD Glossary); and “an extreme value isolated from the bulk of the responses” (ONS Glossary). An outlier in administrative data would not be treated, as it would be assumed to be a genuine, correct value.

2.4 Outliers in modelling [1]

In the context of modelling, an outlier is an influential value. In the ONS Glossary, influence is defined as:

“The amount of effect a particular point has on the parameters of a regression equation”

An atypical “data value that lies in the tail” (OECD Glossary) is nearly always an outlier in modelling. For example, Figure 1 shows a value that is outlying on the horizontal ‘x’ axis; note that the same unit is not an outlier on the vertical ‘y’ axis.

[Figure 1: an outlier in x]

Figure 2 shows the same dataset with an OLS [2] regression line fitted.

[Figure 2: regression line with outlier]

Whether the outlier is influential (on the regression line) depends on which regression line (Figure 3) would be fitted if the outlier were removed: red = no, green = yes. The reality is likely to be less extreme: red is unlikely by chance, and green is unlikely if the outlier is not an error, as it should exhibit the same relationship between y and x as the other points. The expected influence of the outlier simply comes down to its high ‘leverage’ on the regression line, which increases with the distance of its x value from the mean of x.

[Figure 3: regression line without outlier]

If WLS [3] regression were fitted to the dataset in Figure 1, then even data values in the cloud with “a large sample weight” (ONS Glossary) could be influential.

[1] Modelling includes setting processing rules (for example: editing, imputation), as well as statistical modelling.
[2] Ordinary Least Squares.
[3] Weighted Least Squares.

3. Identification of outliers in a data warehouse

A data warehouse stores survey and administrative data for repeated use in estimation and modelling. The data may not be pre-cleaned before storage, so editing might be required as part of the pre-processing of data. Similarly, the data may not be complete, so imputation of missing data may also be required as part of pre-processing. However, both editing and imputation will be complete before outliers are identified and dealt with, so whether these processes take place inside or outside the data warehouse is immaterial for this report.

Before outliers are discussed, units in a data warehouse need to be defined:

- each unit represents an individual (a person or a business) or an aggregate (a household, an enterprise group, or a domain)
- each unit has multiple data fields (variables) linked to metadata fields (for example, dates)

The permutations are challenging:

- each unit answers Q survey questions for P periods and has A administrative records for T periods, i.e. QP + AT entries
- each entry can be used in estimation for M domains and N periods, i.e. (QP + AT)MN uses
- for modelling, any combination of fields can be used, i.e. QPATM combinations

As any entry could be an outlier in the context of its use in estimation, and any combination of fields may result in outliers that are influential in modelling, it is impossible to attach a meaningful outlier designation to any unit. The only statement that can be made with certainty is:

‘Every unit in a data warehouse is a potential outlier’

It is not even possible to attach an outlier designation to any entry, as it would have to record the use: the domain and period for estimation, and the fields combined and model used for modelling. Given that neither the units in a data warehouse, nor the specific entries of units, can be identified as outliers per se, identification is domain- and context-dependent.
This means that outliers are identified during processing of source data, and reported as a quality indicator of the output. If the output itself is stored in the data warehouse, the outliers will become part of the metadata accompanying the output, but will not be identified as outliers at source. The expected uses of a data warehouse and proposed identification methods are listed in Table 1.

Table 1: Identification methods for outliers in a data warehouse

Data use: Processing (edit rules and imputation models)
Outlier identification methods: Plotting/sorting/testing/setting cut-offs
Examples:
- checking observed against expected numbers of edit failures of a rule
- sorting ratios to identify extremes when calculating an imputation link

Data use: Updating the business register
Outlier identification methods: Edit rules
Detail: Business registers are updated from multiple sources (for example, surveys and administrative data). Priority rules decide between sources (for example, use administrative data unless 3 years older than the survey). If the difference between sources is above a limit (an edit rule), an outlier is identified.

Data use: Estimation of survey variables
Outlier identification methods: One-sided winsorisation
Detail: Returned survey values $y_i > K_i$ are identified as outliers, where

$$K_i = \hat{\mu}_i + \frac{M}{w_i - 1}$$

and:
- $\hat{\mu}_i$ is the fitted value for $y_i$ (often the stratum mean)
- $w_i$ is the population weight for $y_i$
- $M$ is chosen to minimise the mean squared error, based on previous survey data

Data use: Estimating calibration weights
Outlier identification methods: Set an acceptable range for calibration weights (Hedlin et al, 2001)
Detail: Any calibration weights outside the acceptable range are identified as ‘outliers’.

Data use: Modelling the relationship between survey and administrative variables
Outlier identification methods: Plotting/sorting/setting cut-offs
Examples:
- Cook’s distance (Cook, 1977) calculates the influence of each datum on regression parameters; plotting these (against index) identifies outliers
- Least trimmed squares (Rousseeuw, 1984) sums the smallest k squared residuals; a sudden increase identifies the (n-k) largest as outliers

Data use: Estimating missing survey variables from administrative variables
Outlier identification methods: One-sided winsorisation (see above) of the estimated variables
Detail: Treat the model-based survey estimates as returned survey values (as above).

4. Treatment of outliers in a data warehouse

All identification methods fundamentally do the same thing: they set a limit above which any value is an outlier. Once identified, outliers need to be treated to prevent distortion of estimates/models. Treatment fundamentally consists of weight adjustment:

- an adjustment to 0 percent (of original) equates to deleting the outlier
- an adjustment to P percent (of original) equates to reducing the impact of the outlier
- an adjustment to 100 percent (of original) equates to no adjustment of the outlier

Treatment will reduce variance but introduce bias, so the challenge is to find the P that minimises the mean squared error. Table 2 lists treatment methods against the expected uses of a data warehouse.
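
As a minimal sketch of how one-sided winsorisation combines identification and treatment, the Python function below flags a returned value y_i as an outlier when it exceeds the cut-off K_i = μ̂_i + M/(w_i − 1) and, if so, replaces it by (y_i + (w_i − 1)K_i)/w_i. It assumes that reading of the cut-off and replacement formulas in Tables 1 and 2. The function names, the example numbers, and the handling of units with weight 1 (left untreated, since they represent only themselves) are illustrative assumptions rather than part of these guidelines.

```python
def winsorise(y, mu_hat, w, M):
    """Return (treated_value, is_outlier) for one returned survey value.

    y      -- returned survey value y_i
    mu_hat -- fitted value for y_i (for example, the stratum mean)
    w      -- population weight w_i
    M      -- tuning constant, assumed to have been chosen beforehand
    """
    if w <= 1:
        # Assumption: a unit with weight 1 represents only itself,
        # so it is never treated as a survey outlier.
        return y, False
    cutoff = mu_hat + M / (w - 1)           # K_i
    if y <= cutoff:
        return y, False                      # not an outlier, left unchanged
    treated = (y + (w - 1) * cutoff) / w     # winsorised replacement value
    return treated, True


def percent_of_original(y, treated):
    """Express the treatment as an adjustment to P percent of the original."""
    return 100.0 * treated / y


# Hypothetical inputs, chosen only to illustrate the arithmetic.
value, flagged = winsorise(y=1000.0, mu_hat=100.0, w=11.0, M=500.0)
print(flagged, round(value, 2), round(percent_of_original(1000.0, value), 1))
```

With these hypothetical inputs the cut-off is 150, so the value 1000 is flagged and replaced by roughly 227.27, which is equivalent to keeping about 22.7 percent of its original contribution to the estimate.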

Table 2: Treatment methods for identified outliers in a data warehouse

Data use: Processing (edit rules and imputation models)
Outlier treatment methods: Delete outliers, impute, or robustify rules or models (see examples)
Examples:
- Use medians rather than means
- Use quartiles/quantiles rather than maxima/minima
- Use ratio of means rather than mean of ratios

Data use: Updating the business register
Outlier treatment methods: Decide which source is an outlier, and update the register based on the other
Examples:
- Check coherence of both sources with similar units on the register
- Recontact data sources (if possible) to gather extra information
- Prioritise future collection of repeat data from the survey source

Data use: Estimation of survey variables
Outlier treatment methods: Winsorisation automatically replaces an outlier $y_i$ by

$$\frac{y_i + (w_i - 1)K_i}{w_i}$$

Data use: Estimating calibration weights
Outlier treatment methods: Restrict outliers to the upper or lower limit of the acceptable range of calibration weights, dependent on whether they are above the upper or below the lower limit, respectively

Data use: Modelling the relationship between survey and administrative variables
Outlier treatment methods: Exclude outliers from the modelling process (i.e. deletion)

Data use: Estimating missing survey variables from administrative variables
Outlier treatment methods: Winsorisation (as above)

5. Conclusions & Recommendations

Whatever the format of a data warehouse, the inputs are data and the outputs are results. Both the inputs and outputs can be adversely affected by unusual values. The difference is that unusual values in inputs are assumed to be errors (and are mostly edited; see also deliverable 2.6), whilst unusual values in outputs are assumed to be correct values and are treated as outliers. Outliering changes the weight of an unusual value to a proportion of its original weight: 0 (deletion), 1 (no treatment), or something in between. An outlier is always identified as such relative to the other data used in that particular output. But as the same data in an S-DWH can be re-used any number of times in different outputs, it is impossible to say a priori which values will be outliers.
For example, someone 7 feet tall would be an outlier in a sample of the general population, but not in a population of basketball players; conversely, a person of average height would be an outlier in a target population of short people. Hence all values are potential outliers. So, in general, outliering is (almost) impossible in a statistical data warehouse.

However, the reality may not be so bleak. When considering possible outliers in an S-DWH, the focus should be on the key outputs from the S-DWH (for example, GDP and unemployment). Also, for any particular output, metadata recording whether or not each value has been treated as an outlier should be associated with the output. If the output is also stored in the data warehouse, these metadata will automatically accompany it and provide onward guidance on outliers.

In summary:

1. Neither data units nor their entries in a statistical data warehouse can or should be labelled as outliers up front.
2. The identification and treatment of outliers must be specific to each output for which the data are used.
3. Metadata on outliers, including which domains were searched for outliers and which were not, should only be included in a data warehouse alongside outputs.
4. Refer to Tables 1 and 2 for some guidance (not necessarily exhaustive or exclusive) on use-specific identification and treatment of outliers.

References

Cook, R. D. (1977). Detection of Influential Observations in Linear Regression. Technometrics, 19, 15-18.

Hedlin, D., Falvey, H., Chambers, R. and Kokic, P. (2001). Does the Model Matter for GREG Estimation? A Business Survey Example. Journal of Official Statistics, 17, 527-544.

Rousseeuw, P. J. (1984). Least Median of Squares Regression. Journal of the American Statistical Association, 79, 871-880.