
Title: Guidelines on detecting and treating outliers for the S-DWH
WP: 2
Deliverable: 2.8
Version: 1.1
Date: 11-3-2013
Author: Gary Brown
NSI: ONS (UK)
ESS-NET ON MICRO DATA LINKING AND DATA WAREHOUSING IN PRODUCTION OF BUSINESS STATISTICS
1. Introduction
The European Statistical System Network project on Micro Data Linking and Data
Warehousing (ESSnet DWH) is a joint venture between seven countries: Estonia, Italy,
Lithuania, Netherlands, Portugal, Sweden and the UK.
From October 2009 to October 2011, the first stage of the ESSnet DWH gathered data on use
of data warehouses and integration between data warehouses and business registers across
Europe. From October 2011 to October 2013, the second stage of the ESSnet DWH aims to
identify best practice and write guidelines. The second stage is split into three separate work
packages (WPs): WP1 on Metadata (led by Estonia); WP2 on Methodology (led by the UK);
WP3 on Architecture (led by Italy).
This report constitutes deliverable 2.8, one of the eight deliverables of WP2:
2.1 Metadata
2.2 Business Register
2.3 Architecture
2.4 Data Linking
2.5 Confidentiality
2.6 Selective Editing
2.7 Mapping
2.8 Outliers
2. Definition of an outlier
An outlier is defined in the Eurostat “Statistics Explained Glossary” and the Organisation for
Economic Co-operation and Development (OECD) “Glossary of Statistical Terms” as:
“A data value that lies in the tail of the statistical distribution of a set of data values”
The UK Office for National Statistics “ONS Glossary” defines an outlier as:
“A correct response, usually an extreme value isolated from the bulk of the responses,
or has a large sample weight that would have an undue influence on the estimate”
These two definitions illustrate that whether a data value is an outlier depends on the context.
In the context of a data warehouse there are three types of outliers – outliers in survey data,
outliers in administrative data, and outliers in modelling. Each of these types is explained
below, but first a clear distinction is made between outliers and errors.
2.1 Outliers and errors
Making a clear distinction between an outlier and an error is vital. Errors are erroneous
values, and are dealt with via editing rules. The OECD Glossary defines an edit rule as:
“A logical condition or a restriction which must be met if the data is to be considered correct”
whereas the ONS Glossary definition is:
“A rule designed to detect specific errors in data for potential subsequent correction”
The common theme here is the requirement ‘to correct’ – errors are assumed to be incorrect,
outliers are assumed to be correct. This assumption is the foundation for the rest of this report.
2.2 Outliers in survey data
In the survey context, an outlier is an unrepresentative value.
A survey selects a sample of a target population, collects information from this sample, and
uses weighting to estimate population values. The quality of the population estimates depends
on three things: the representativeness of the sample (ie whether it includes all types of
respondents found in the population); the size of the sample (ie whether enough units are
sampled for an accurate estimate); and the appropriateness of the weighting (ie whether the
weighted sample represents the sampled and unsampled population). Common sampling methods are:
systematic random sampling for individuals; probability proportional to size for households;
simple random sampling for businesses.
Using simple random sampling as an example, an outlier occurs when a sampled unit is
assumed to represent N/n population units (where N is the number of units of this type
assumed to be in the population, n is the number sampled, and N/n is large) but the sampled
unit is actually unique (or nearly unique) in the population.
This type of non-representative outlier “would have an undue influence on the estimate”
(ONS Glossary). A “data value that lies in the tail” (OECD Glossary) will only be an outlier
in the survey context if the population weight is different from 1.
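As a rough numerical illustration of the survey case, the sketch below uses an invented stratum (N = 1000, n = 10, and made-up turnover values; none of these figures come from the report). It compares the usual expansion estimate with the estimate obtained if the unique unit represented only itself, ie had a weight of 1.

```python
# Hypothetical stratum: N businesses in the population, n sampled by
# simple random sampling, so every sampled unit has weight N / n.
N = 1000
n = 10
weight = N / n  # each sampled unit stands for 100 population units

# Invented values; the last sampled unit is correct but unique.
typical = [50, 52, 48, 51, 49, 50, 53, 47, 50]
unique_value = 5000
sample = typical + [unique_value]

# Expansion estimate of the stratum total: weight times the sample sum.
estimate_with = weight * sum(sample)

# Alternative: the unique unit represents only itself (weight 1) and the
# nine typical units share the remaining N - 1 population units.
estimate_alt = (N - 1) / len(typical) * sum(typical) + 1 * unique_value

print(estimate_with)
print(round(estimate_alt))
```

Under these assumptions the single unique unit inflates the estimated total roughly tenfold, which is the kind of undue influence the ONS definition refers to.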
2.3 Outliers in administrative data
In the context of administrative data, an outlier is an atypical value.
Administrative data represent a census of a target population, so weighting to estimate
population values is not required. Each unit is treated as unique in the population.
This type of outlier is simply extreme: a “data value that lies in the tail” (OECD Glossary);
and “an extreme value isolated from the bulk of the responses” (ONS Glossary). An outlier in
administrative data would not be treated as it would be assumed to be a genuine correct value.
2.4 Outliers in modelling¹
In the context of modelling, an outlier is an influential value.
¹ Modelling includes setting processing rules (for example, editing/imputation), as well as statistical modelling.
In the ONS Glossary, influence is defined as:
“The amount of effect a particular point has on the parameters of a regression equation”
An atypical “data value that lies in the tail” (OECD Glossary) is nearly always an outlier in
modelling. For example, Figure 1 shows a value that is outlying on the horizontal ‘x’ axis; note
that the same unit is not an outlier on the vertical ‘y’ axis. Figure 2 shows the same dataset with an
OLS² regression line fitted. Whether the outlier is influential (on the regression line) depends
on which regression line in Figure 3 would be fitted if the outlier were removed: the red line
means the outlier is not influential, the green line means it is. The reality is likely to be less
extreme: red is unlikely by chance, and green is unlikely if the outlier is not an error, as it
should exhibit the same relationship between y and x as the other points. The expected influence
of the outlier simply comes down to its high ‘leverage’ on the regression line, as leverage
increases with the distance of its x value from the mean of x.
Figure 1: an outlier in x
Figure 2: regression line with outlier
Figure 3: regression line without outlier
² Ordinary Least Squares
If WLS³ regression were fitted to the dataset in Figure 1, then even data values in the cloud
with “a large sample weight” (ONS Glossary) could be influential.
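The distinction between leverage and actual influence in the OLS case can be sketched numerically. The example below uses invented data (a small cloud plus one unit that is extreme in x but follows the same relationship) and plain NumPy. It computes each point's leverage and Cook's distance (Cook, 1977), then refits without the extreme unit. It is an illustration under these assumptions, not a reconstruction of the data behind Figures 1 to 3.

```python
import numpy as np

# Invented data: a cloud of points plus one unit that is extreme in x
# but follows roughly the same y = 2x relationship as the rest.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 30.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1, 12.0, 60.0])

# Design matrix with an intercept column, fitted by OLS.
X = np.column_stack([np.ones_like(x), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)

# Leverage h_i: diagonal of the hat matrix X (X'X)^-1 X'.
H = X @ np.linalg.inv(X.T @ X) @ X.T
leverage = np.diag(H)

# Cook's distance: combines each residual with its leverage.
residuals = y - X @ beta
p = X.shape[1]
s2 = residuals @ residuals / (len(y) - p)
cooks_d = residuals**2 / (p * s2) * leverage / (1 - leverage) ** 2

# Refit without the extreme-x unit to see how far the slope moves.
beta_without, *_ = np.linalg.lstsq(X[:-1], y[:-1], rcond=None)

print(leverage.round(3))
print(cooks_d.round(3))
print(beta[1], beta_without[1])
```

With these invented values the extreme-x unit has by far the highest leverage, yet removing it barely changes the fitted slope, because it lies on the same line as the cloud.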
3. Identification of outliers in a data warehouse
A data warehouse stores survey and administrative data for repeated use in estimation and
modelling. The data may not be pre-cleaned before storage, so editing might be required as
part of the pre-processing of data. Similarly, the data may not be complete, so imputation of
missing data may also be required as part of pre-processing. However, both editing and
imputation will be complete before outliers are identified and dealt with, so whether these
processes take place inside or outside the data warehouse is immaterial for this report.
Before outliers are discussed, units in a data warehouse need to be defined:
• each unit represents an individual (a person or a business) or an aggregate (a household, an enterprise group, or a domain)
• each unit has multiple data fields (variables) linked to metadata fields (for example, dates)
The permutations are challenging:
• each unit answers Q survey questions for P periods and has A administrative records for T periods – ie QP + AT entries
• each entry can be used in estimation at M multiple domains at N periods – ie (QP + AT)MN uses
• for modelling, any combination of fields can be used – ie QPATM combinations
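To give a feel for the scale of the first two counts, the short sketch below plugs in invented values for Q, P, A, T, M and N. The numbers are purely hypothetical and chosen only to make the arithmetic concrete.

```python
# Hypothetical sizes for a single unit; all values are invented.
Q, P = 20, 12   # survey questions, survey periods
A, T = 5, 8     # administrative records, administrative periods
M, N = 10, 4    # domains and periods in which an entry can be used

entries = Q * P + A * T            # QP + AT
estimation_uses = entries * M * N  # (QP + AT)MN

print(entries)          # 280
print(estimation_uses)  # 11200
```

Even with these modest sizes, one unit has hundreds of entries and thousands of potential uses in estimation alone.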
As any entry could be an outlier, in the context of its use in estimation, and any combination
of fields may result in outliers influential in modelling, it is impossible to attach a meaningful
outlier designation to any unit. The only statement that can be made with certainty is:
every unit in a data warehouse is a potential outlier
It is not even possible to attach an outlier designation to any entry – as it would have to record
the use – ie the domain and period for estimation, and the fields combined and model used.
Given that neither the units in a data warehouse, nor the specific entries of units, can be
identified as outliers per se, identification is domain- and context-dependent. This means that
outliers are identified during processing of source data, and reported as a quality indicator of
the output – if the output itself is stored in the data warehouse, the outliers will become part of
the metadata accompanying the output, but will not be identified as outliers at source. The
expected uses of a data warehouse and proposed identification methods are listed in Table 1.
³ Weighted Least Squares
Table 1: Identification methods for outliers in a data warehouse

Data use: Processing (edit rules and imputation models)
Outlier identification methods: Plotting/sorting/testing/setting cut-offs
Examples:
• checking observed against expected numbers of edit failures of a rule
• sorting ratios to identify extremes when calculating an imputation link

Data use: Updating the business register
Outlier identification methods: Edit rules
Detail: Business registers are updated from multiple sources, for example surveys and administrative data. Priority rules decide between sources, for example: use administrative data unless they are 3 years older than the survey data. If the difference between sources is above a limit (an edit rule), an outlier is identified.

Data use: Estimation of survey variables
Outlier identification methods: One-sided winsorisation
Detail: Returned survey values y_i > K_i are identified as outliers, with K_i = μ̂_i + M/(w_i − 1), where:
• μ̂_i is the fitted value for y_i – often the stratum mean
• w_i is the population weight for y_i
• M minimises the mean squared error – based on previous survey data

Data use: Estimating calibration weights
Outlier identification methods: Set an acceptable range for calibration weights (Hedlin et al, 2001)
Detail: Any calibration weights outside the acceptable range are identified as ‘outliers’

Data use: Modelling the relationship between survey and administrative variables
Outlier identification methods: Plotting/sorting/setting cut-offs
Examples:
• Cook’s distance (Cook, 1977) calculates the influence of each datum on regression parameters – plotting these (against index) identifies outliers
• Least trimmed squares (Rousseeuw, 1984) sums the smallest k squared residuals/errors – a sudden increase identifies the (n-k) largest as outliers

Data use: Estimating missing survey variables from administrative variables
Outlier identification methods: One-sided winsorisation (see above) of the estimated variables
Detail: Treat the model-based survey estimates as returned survey values (as above)
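The one-sided winsorisation check in Table 1 can be sketched as follows. The cut-off form K_i = fitted value + M/(w_i − 1) is an assumption here (it is a form commonly used in the winsorisation literature), the fitted value is taken as the stratum mean, and the data and the value of M are invented. The function names are hypothetical.

```python
from statistics import mean

def winsorisation_cutoffs(values, weights, M):
    """Return the cut-off K_i for each unit.

    Assumed form: K_i = fitted_i + M / (w_i - 1), with the fitted value
    taken as the stratum mean. Units with weight 1 get an infinite
    cut-off, so they can never be flagged.
    """
    fitted = mean(values)
    return [
        fitted + M / (w - 1) if w > 1 else float("inf")
        for w in weights
    ]

def flag_outliers(values, weights, M):
    """Flag returned values y_i that exceed their cut-off K_i."""
    cutoffs = winsorisation_cutoffs(values, weights, M)
    return [y > k for y, k in zip(values, cutoffs)]

# Invented stratum: five returned values, each with weight 20.
values = [100.0, 120.0, 90.0, 110.0, 900.0]
weights = [20.0] * 5
M = 5000.0  # hypothetical; in practice tuned on previous survey data

print(flag_outliers(values, weights, M))
```

In this sketch only the last value exceeds its cut-off. If every weight were 1, no value would be flagged, in line with the point in section 2.2 that a tail value is only a survey outlier when its weight differs from 1.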
4. Treatment of outliers in a data warehouse
All identification methods fundamentally do the same thing: they set a limit above which any
value is an outlier. Once identified, outliers need to be treated to prevent distortion of
estimates/models.
Treatment fundamentally consists of weight adjustment:
• an adjustment to 0 percent (of original) equates to deleting the outlier
• an adjustment to P percent (of original) equates to reducing the impact of the outlier
• an adjustment to 100 percent (of original) equates to no adjustment of the outlier
Treatment will reduce variance but introduce bias, so the challenge is to find the P that minimises
the mean squared error. Table 2 lists methods against expected uses of a data warehouse.
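The idea of treatment as weight adjustment can be sketched with a weighted total. The values, weights and function names below are invented for illustration. The outlier's weight is set to 0, 50 and 100 percent of its original value in turn.

```python
def weighted_total(values, weights):
    """Weighted estimate of a total."""
    return sum(w * y for w, y in zip(weights, values))

def adjust_weight(weights, index, percent):
    """Return a copy of weights with one unit's weight set to
    `percent` percent of its original value."""
    adjusted = list(weights)
    adjusted[index] = weights[index] * percent / 100
    return adjusted

# Invented sample: the last unit has been identified as an outlier.
values = [100.0, 120.0, 90.0, 110.0, 900.0]
weights = [20.0] * 5

for percent in (0, 50, 100):
    total = weighted_total(values, adjust_weight(weights, 4, percent))
    print(percent, total)
```

Choosing P in practice means trading the bias introduced by down-weighting against the variance removed. This sketch only shows how the estimate moves as P changes.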
Table 2: Treatment methods for identified outliers in a data warehouse

Data use: Processing (edit rules and imputation models)
Outlier treatment methods: Delete outliers, impute or robustify rules or models (see examples)
Examples:
• Use medians rather than means
• Use quartiles/quantiles rather than maxima/minima
• Use ratio of means rather than mean of ratios

Data use: Updating the business register
Outlier treatment methods: Decide which source is an outlier, and update the register based on the other
Examples:
• Check coherence of both sources with similar units on register
• Recontact data sources (if possible) to gather extra information
• Prioritise future collection of repeat data from survey source

Data use: Estimation of survey variables
Outlier treatment methods: Winsorisation automatically replaces an outlier y_i by a value pulled back towards its cut-off K_i

Data use: Estimating calibration weights
Outlier treatment methods: Restrict outliers to the upper or lower limit of the acceptable range of calibration weights, dependent on whether they are above the upper or below the lower limit, respectively

Data use: Modelling the relationship between survey and administrative variables
Outlier treatment methods: Exclude outliers from modelling process (ie deletion)

Data use: Estimating missing survey variables from administrative variables
Outlier treatment methods: Winsorisation (as above)
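Two of the treatments in Table 2 can be sketched in a few lines. The winsorised value below assumes the commonly used "type II" form, in which an outlier keeps its own value for itself and contributes the cut-off for the other w − 1 units it represents. That form, the function names and the numbers are assumptions for illustration rather than the report's own specification. The second function restricts a calibration weight to an acceptable range.

```python
def winsorise(y, w, cutoff):
    """One-sided winsorisation of a returned value y with weight w.

    Assumed (type II) form: y* = cutoff + (y - cutoff) / w, which equals
    y / w + (1 - 1 / w) * cutoff. Values at or below the cut-off, and
    any value with weight 1, are left unchanged.
    """
    if y <= cutoff:
        return y
    return cutoff + (y - cutoff) / w

def restrict_weight(g, lower, upper):
    """Restrict a calibration weight to the acceptable range."""
    return min(max(g, lower), upper)

print(winsorise(900.0, 20.0, 500.0))   # outlier pulled back towards 500
print(winsorise(400.0, 20.0, 500.0))   # below the cut-off, unchanged
print(restrict_weight(3.2, 0.5, 2.0))  # above the range, set to upper limit
```

Under this assumed form, a unit with weight 1 is never changed, which matches the view that a value representing only itself is not a survey outlier.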
5. Recommendations
1. Neither data units nor their entries in a data warehouse should be labelled as outliers.
2. Identification and treatment of outliers should be unique to each instance in which data are used.
3. Metadata on outliers, including which domains were searched for outliers and which were not, should only be included in a data warehouse alongside outputs.
4. Refer to Tables 1 and 2 for some guidance – not necessarily exhaustive or exclusive – on use-specific identification and treatment of outliers.
References
Cook, R. D. (1977). Detection of Influential Observations in Linear Regression. Technometrics, 19, 15-18.
Hedlin, D., Falvey, H., Chambers, R. and Kokic, P. (2001). Does the Model Matter for GREG Estimation? A Business Survey Example. Journal of Official Statistics, 17, 527-544.
Rousseeuw, P. J. (1984). Least Median of Squares Regression. Journal of the American Statistical Association, 79, 871-880.