SBS Workshop: *Structural Business Statistics on Enterprise Groups*

in partnership with
Title: Guidelines on detecting and treating outliers for the S-DWH
WP: 2
Deliverable: 2.8
Version: 2.0 – final
Date: September 2013
Author: Gary Brown
NSI: ONS (UK)
ESS - NET ON MICRO DATA LINKING AND DATA WAREHOUSING IN PRODUCTION OF BUSINESS STATISTICS
Index
1. Introduction
2. Definition of an outlier
2.1 Outliers and errors
2.2 Outliers in survey data
2.3 Outliers in administrative data
2.4 Outliers in modelling
3. Identification of outliers in a data warehouse
4. Treatment of outliers in a data warehouse
5. Conclusions & Recommendations
References
1. Introduction
The European Statistical System Network project on Micro Data Linking and Data
Warehousing (ESSnet DWH) is a joint venture between seven countries: Estonia, Italy,
Lithuania, Netherlands, Portugal, Sweden and the UK.
From October 2009 to October 2011, the first stage of the ESSnet DWH gathered data on
use of data warehouses and integration between data warehouses and business
registers across Europe. From October 2011 to October 2013, the second stage of the
ESSnet DWH aims to identify best practice and write guidelines.
The second stage is split into three separate work packages (WPs): WP1 on Metadata (led
by Estonia); WP2 on Methodology (led by the UK); WP3 on Architecture (led by Italy).
This report constitutes one of the eight deliverables of WP2.
2.1 Metadata
2.2 Business Register
2.3 Architecture
2.4 Data Linking
2.5 Confidentiality
2.6 Selective Editing
2.7 Mapping
2.8 Outliers
2. Definition of an outlier
An outlier is defined in the Eurostat “Statistics Explained Glossary” and the Organisation
for Economic Co-operation and Development (OECD) “Glossary of Statistical Terms” as:
“A data value that lies in the tail of the statistical distribution of a set of data values”
The UK Office for National Statistics “ONS Glossary” defines an outlier as:
“A correct response, usually an extreme value isolated from the bulk of the responses,
or has a large sample weight that would have an undue influence on the estimate”
These two definitions illustrate that whether a data value is an outlier depends on the
context. In the context of a data warehouse there are three types of outliers – outliers in
survey data, outliers in administrative data, and outliers in modelling. Each of these
types is explained below, but first a clear distinction is made between outliers and
errors.
2.1 Outliers and errors
Making a clear distinction between an outlier and an error is vital. Errors are erroneous
values, and are dealt with via editing rules. The OECD Glossary defines an edit rule as:
“A logical condition or a restriction which must be met if the data is to
be considered correct”
whereas the ONS Glossary definition is:
“A rule designed to detect specific errors in data for potential
subsequent correction”
The common theme here is the requirement ‘to correct’ – errors are assumed to be
incorrect, outliers are assumed to be correct.
This assumption is the foundation for the rest of this report.
2.2 Outliers in survey data
In the survey context, an outlier is an unrepresentative value.
A survey selects a sample of a target population, collects information from this sample,
and uses weighting to estimate population values. The quality of the population
estimates depends on three things: the representativeness of the sample (whether it
includes all types of respondents found in the population); the size of the sample
(whether enough units are sampled for an accurate estimate); and the appropriateness of
the weighting (whether the weighted sample represents both the sampled and the
unsampled population).
Common sampling methods are: systematic random sampling for individuals;
probability proportional to size for households; simple random sampling for businesses.
Using simple random sampling as an example, an outlier occurs when a sampled unit is
assumed to represent N/n population units – where N is the number of units of this type
assumed to be in the population, n is the number sampled, and N/n is large – but the
sampled unit is actually unique (or nearly unique) in the population.
This type of non-representative outlier “would have an undue influence on the estimate”
(ONS Glossary). A “data value that lies in the tail” (OECD Glossary) will only be an outlier
in the survey context if the population weight is different from 1.
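The effect of a unique unit carrying a large weight can be sketched with a small example. This is only an illustration under simple random sampling: the function name, the stratum sizes and the turnover values are all hypothetical and do not come from any survey.

```python
def weighted_total(values, weight):
    """Expansion estimate of a population total under simple random sampling.

    Every sampled value is assumed to represent `weight` = N/n population units.
    """
    return weight * sum(values)


# Hypothetical stratum: N = 1000 businesses, n = 10 sampled, so N/n = 100.
N, n = 1000, 10
weight = N / n

# Made-up turnover values in arbitrary units.
typical_sample = [12, 9, 11, 10, 8, 13, 10, 9, 11, 10]
# Same sample, but the last unit is replaced by one unique, very large business.
sample_with_outlier = typical_sample[:-1] + [500]

total_typical = weighted_total(typical_sample, weight)
total_with_outlier = weighted_total(sample_with_outlier, weight)

print(total_typical)        # every sampled unit is representative
print(total_with_outlier)   # the unique unit is counted as if there were 100 like it
```

In this sketch the single unrepresentative value is multiplied by N/n along with every other value, which is why it dominates the estimate.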
2.3 Outliers in administrative data
In the context of administrative data, an outlier is an atypical value.
Administrative data represent a census of a target population, so weighting to estimate
population values is not required. Each unit is treated as unique in the population.
This type of outlier is simply extreme: a “data value that lies in the tail” (OECD Glossary);
and “an extreme value isolated from the bulk of the responses” (ONS Glossary). An
outlier in administrative data would not be treated as it would be assumed to be a
genuine correct value.
2.4 Outliers in modelling [1]
In the context of modelling, an outlier is an influential value.
In the ONS Glossary, influence is defined as:
“The amount of effect a particular point has on the parameters of a regression equation”
An atypical “data value that lies in the tail” (OECD Glossary) is nearly always an outlier
in modelling. For example, Figure 1 shows an outlying value in the horizontal ‘x’ axis –
note the same unit is not an outlier in the vertical ‘y’ axis.
Figure 1: an outlier in x
Figure 2 shows the same dataset with an OLS [2] regression line fitted.
Figure 2: regression line with outlier
[1] Modelling includes setting processing rules (for example: editing, imputation), as well as
statistical modelling.
[2] Ordinary Least Squares
Whether the outlier is influential (on the regression line) depends on which regression
line (Figure 3) would be fitted if the outlier were removed: red = not influential,
green = influential. The reality is likely to be less extreme – red is unlikely by chance,
and green is unlikely if the outlier is not an error, as it should exhibit the same
relationship between y and x as the other points. The expected influence of the outlier
simply comes down to its high ‘leverage’ on the regression line, which increases with the
distance of its x value from the mean of the x values.
Figure 3: regression line without outlier
If WLS [3] regression was fitted to the dataset in Figure 1, then even data values in the
cloud with “a large sample weight” (ONS Glossary) could be influential.
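The comparison behind Figures 2 and 3 can be sketched by fitting an OLS line with and without the outlying point and computing each point's leverage. This is a minimal illustration only: the data are invented, the helper names are hypothetical, and the leverage formula is the standard one for simple linear regression rather than anything specified in this report.

```python
def ols_fit(x, y):
    """Return (intercept, slope) of the ordinary least squares line y = a + b*x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


def leverage(x):
    """Leverage h_i = 1/n + (x_i - mean(x))^2 / Sxx for each point."""
    n = len(x)
    x_bar = sum(x) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    return [1 / n + (xi - x_bar) ** 2 / sxx for xi in x]


# Invented data: a cloud of five points plus one point far out in x only.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 20.0]
y = [2.0, 1.0, 3.0, 2.0, 3.0, 2.5]

_, slope_with = ols_fit(x, y)                 # line fitted to all six points
_, slope_without = ols_fit(x[:-1], y[:-1])    # line fitted without the outlier
h = leverage(x)

print(slope_with, slope_without)
print(h[-1])   # leverage of the outlying point
```

In this invented dataset the fitted slope changes markedly when the outlying point is dropped, and that point carries most of the total leverage, so it would count as influential.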
[3] Weighted Least Squares
3. Identification of outliers in a data warehouse
A data warehouse stores survey and administrative data for repeated use in estimation
and modelling. The data may not be pre-cleaned before storage, so editing might be
required as part of the pre-processing of data. Similarly, the data may not be complete,
so imputation of missing data may also be required as part of pre-processing. However,
both editing and imputation will be complete before outliers are identified and dealt
with, so whether these processes take place inside or outside the data warehouse is
immaterial for this report.
Before outliers are discussed, units in a data warehouse need to be defined:
• each unit represents an individual (a person or a business) or an aggregate (a
household, an enterprise group, or a domain)
• each unit has multiple data fields (variables) linked to metadata fields (for example,
dates)
The permutations are challenging:
• each unit answers Q survey questions for P periods and has A administrative records
for T periods – i.e. QP + AT entries
• each entry can be used in estimation at M multiple domains at N periods – i.e. (QP +
AT)MN uses
• for modelling, any combination of fields can be used – i.e. QPATM combinations
As any entry could be an outlier, in the context of its use in estimation, and any
combination of fields may result in outliers influential in modelling, it is impossible to
attach a meaningful outlier designation to any unit. The only statement that can be made
with certainty is:
‘Every unit in a data warehouse is a potential outlier’
It is not even possible to attach an outlier designation to any entry – as it would have to
record the use – i.e. the domain and period for estimation, and the fields combined and
model used.
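To give a feel for the scale of the counts quoted above, the following sketch evaluates QP + AT and (QP + AT)MN for one unit. The function names and all the input values are hypothetical and are chosen purely for illustration.

```python
def entry_count(q, p, a, t):
    """Entries held for one unit: Q survey questions over P periods
    plus A administrative records over T periods."""
    return q * p + a * t


def estimation_use_count(q, p, a, t, m, n):
    """Possible estimation uses of one unit's entries: every entry can be
    used in M domains at N periods."""
    return entry_count(q, p, a, t) * m * n


# Hypothetical unit: 20 questions for 12 periods, 5 admin records for 4 periods,
# usable in 10 domains at 12 periods.
entries = entry_count(q=20, p=12, a=5, t=4)
uses = estimation_use_count(q=20, p=12, a=5, t=4, m=10, n=12)

print(entries)
print(uses)
```

Even with these modest made-up values, a single unit has tens of thousands of possible uses, each of which would need its own outlier designation.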
Given that neither the units in a data warehouse, nor the specific entries of units, can be
identified as outliers per se, identification is domain- and context-dependent. This means
that outliers are identified during processing of source data, and reported as a quality
indicator of the output – if the output itself is stored in the data warehouse, the outliers
will become part of the metadata accompanying the output, but will not be identified as
outliers at source. The expected uses of a data warehouse and proposed identification
methods are listed in table 1.
Table 1: Identification methods for outliers in a data warehouse

Data use: Processing (edit rules and imputation models)
Outlier identification methods: Plotting/sorting/testing/setting cut-offs
Examples:
• checking observed against expected numbers of edit failures of a rule
• sorting ratios to identify extremes when calculating an imputation link

Data use: Updating the business register
Outlier identification methods: Edit rules
Detail: Business registers are updated from multiple sources – for example surveys, and
administrative data. Priority rules decide between sources – for example, use
administrative data unless 3 years older than the survey. If the difference between
sources is above a limit (an edit rule) an outlier is identified

Data use: Estimation of survey variables
Outlier identification methods: One-sided winsorisation
Detail: Returned survey values $y_i > K_i$ are identified as outliers:
$$K_i = \hat{\mu}_i + \frac{M}{w_i - 1}$$
where:
• $\hat{\mu}_i$ is the fitted value for $y_i$ – often the stratum mean
• $w_i$ is the population weight for $y_i$
• $M$ minimises the mean squared error – based on previous survey data

Data use: Estimating calibration weights
Outlier identification methods: Set an acceptable range for calibration weights (Hedlin et
al, 2001)
Detail: Any calibration weights outside acceptable range are identified as ‘outliers’

Data use: Modelling the relationship between survey and administrative variables
Outlier identification methods: Plotting/sorting/setting cut-offs
Examples:
• Cook’s distance (Cook, 1977) calculates the influence of each datum on regression
parameters – plotting these (against index) identifies outliers
• Least trimmed squares (Rousseeuw, 1984) sums the smallest k absolute
residuals/errors – a sudden increase identifies the (n-k) largest as outliers

Data use: Estimating missing survey variables from administrative variables
Outlier identification methods: One-sided winsorisation (see above) of the estimated
variables
Detail: Treat the model-based survey estimates as returned survey values (as above)
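As an illustration of the Cook’s distance entry in Table 1, the sketch below computes the distance for every point of a simple linear regression using the standard textbook formula. The function name and the data are hypothetical, and the cut-off of 1 used at the end is a commonly quoted rule of thumb, assumed here for illustration rather than taken from this report.

```python
def cooks_distance(x, y):
    """Cook's distance for each point of a simple linear regression y = a + b*x."""
    n, p = len(x), 2                       # p = number of fitted parameters
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    intercept = y_bar - slope * x_bar
    residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    s2 = sum(e ** 2 for e in residuals) / (n - p)      # residual variance
    leverages = [1 / n + (xi - x_bar) ** 2 / sxx for xi in x]
    # D_i = e_i^2 / (p * s^2) * h_i / (1 - h_i)^2
    return [
        (e ** 2 / (p * s2)) * h / (1 - h) ** 2
        for e, h in zip(residuals, leverages)
    ]


# Invented data: seven points close to a line and one point (the last) well off it.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
y = [1.1, 1.9, 3.2, 3.9, 5.1, 6.0, 7.1, 2.0]

distances = cooks_distance(x, y)

# Assumed cut-off of 1 (a rule of thumb, not a recommendation of this report).
flagged = [i for i, d in enumerate(distances) if d > 1.0]
print(flagged)
```

Sorting or plotting `distances` against the index, as Table 1 suggests, would show the same point standing apart from the rest.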
4. Treatment of outliers in a data warehouse
All identification methods fundamentally do the same thing: they set a limit above which
any value is an outlier. Once identified, outliers need to be treated to prevent distortion
of estimates/models.
Treatment fundamentally consists of weight adjustment:
• an adjustment to 0 percent (of original) equates to deleting the outlier
• an adjustment to P percent (of original) equates to reducing the impact of the outlier
• an adjustment to 100 percent (of original) equates to no adjustment of the outlier
Treatment will reduce variance but introduce bias, so the challenge is to find P that
minimises the mean squared error. Table 2 lists methods against expected uses of a data
warehouse.
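The three cases of weight adjustment can be sketched as a single function that scales one unit's weight to P percent of its original value before computing a weighted total. The function name, the values and the weights are hypothetical; choosing the P that minimises the mean squared error is not attempted here.

```python
def adjusted_total(values, weights, outlier_index, p_percent):
    """Weighted total after scaling one unit's weight to p_percent of its original."""
    total = 0.0
    for i, (value, weight) in enumerate(zip(values, weights)):
        if i == outlier_index:
            weight = weight * p_percent / 100.0   # the weight adjustment
        total += weight * value
    return total


# Made-up sample of four units, each with weight 100; the last is the outlier.
values = [10, 12, 9, 500]
weights = [100, 100, 100, 100]

no_adjustment = adjusted_total(values, weights, outlier_index=3, p_percent=100)
reduced = adjusted_total(values, weights, outlier_index=3, p_percent=50)
deleted = adjusted_total(values, weights, outlier_index=3, p_percent=0)

print(no_adjustment, reduced, deleted)
```

Values of P between 0 and 100 trace out the range between deleting the outlier and leaving it untreated.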
Table 2: Treatment methods for identified outliers in a data warehouse

Data use: Processing (edit rules and imputation models)
Outlier treatment methods: Delete outliers, impute or robustify rules or models (see
examples)
Examples:
• Use medians rather than means
• Use quartiles/quantiles rather than maxima/minima
• Use ratio of means rather than mean of ratios

Data use: Updating the business register
Outlier treatment methods: Decide which source is an outlier, and update the register
based on the other
Examples:
• Check coherence of both sources with similar units on register
• Recontact data sources (if possible) to gather extra information
• Prioritise future collection of repeat data from survey source

Data use: Estimation of survey variables
Outlier treatment methods: Winsorisation automatically replaces outlier $y_i$ by
$$\frac{y_i + (w_i - 1) K_i}{w_i}$$

Data use: Estimating calibration weights
Outlier treatment methods: Restrict outliers to upper or lower limits of acceptable range
of calibration weights, dependent on whether they are above the upper or below the
lower, respectively

Data use: Modelling the relationship between survey and administrative variables
Outlier treatment methods: Exclude outliers from modelling process (i.e. deletion)

Data use: Estimating missing survey variables from administrative variables
Outlier treatment methods: Winsorisation (as above)
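One-sided winsorisation, as described in Tables 1 and 2, can be sketched as follows. The cut-off and the replacement value follow the formulas in the tables; the function names, the example numbers and the handling of units with weight 1 (returned untreated, since the cut-off is undefined for them) are assumptions made for this illustration. In practice M would be estimated from previous survey data, whereas here it is simply supplied.

```python
def winsorisation_cutoff(fitted, weight, m):
    """Cut-off K_i = fitted_i + M / (w_i - 1); only defined for weight > 1."""
    return fitted + m / (weight - 1)


def winsorise(y, fitted, weight, m):
    """Return y unchanged if it is not above K_i, else (y + (w_i - 1) * K_i) / w_i."""
    if weight <= 1:
        return y   # assumption: a unit representing only itself is left untreated
    k = winsorisation_cutoff(fitted, weight, m)
    if y <= k:
        return y
    return (y + (weight - 1) * k) / weight


# Hypothetical unit: fitted value (stratum mean) 10, weight 5, M = 40.
cutoff = winsorisation_cutoff(fitted=10, weight=5, m=40)

ordinary = winsorise(y=15, fitted=10, weight=5, m=40)    # below the cut-off
treated = winsorise(y=120, fitted=10, weight=5, m=40)    # above the cut-off

print(cutoff, ordinary, treated)
```

Multiplying the treated value by its weight gives y_i + (w_i - 1)K_i, so the unit contributes its own returned value once and the cut-off value for the other w_i - 1 units it represents.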
5. Conclusions & Recommendations
Whatever the format of a data warehouse, the inputs are data and outputs are results.
Both the inputs and outputs can be adversely affected by unusual values. The difference
is that unusual values in inputs are assumed to be errors (and mostly are edited, see also
deliverable 2.6) whilst unusual values in outputs are assumed to be correct values and
are treated as outliers.
Outliering changes the weight of unusual values to 0 (deletion), 1 (no treatment), or
something in between. An outlier is always identified as such relative to the other data
used in that particular output. But as the same data in an S-DWH can be re-used any
number of times in different outputs, it is impossible to say a priori which values will be
outliers. For example, someone 7 feet tall would be an outlier in a sample of the general
population, but not in a population of basketball players. Conversely, a person of average
height would be an outlier in a target population of short people. Hence all values are
potential outliers.
So in general it could be stated that outliering is (almost) impossible in a statistical data
warehouse.
However, the reality may not be so bleak. In the context of possible outliers in an S-DWH
the focus should be on the key outputs from the S-DWH, for example GDP and
unemployment. Also, for any particular output, metadata recording whether or not each
value has been treated as an outlier should be associated with the output. If the
output is also stored in the data warehouse, these metadata will automatically
accompany it and provide onwards guidance on outliers.
In summary:
1. Neither data units nor their entries in a statistical data warehouse should/can be
labelled as outliers up front.
2. The identification and treatment of outliers must be specific to each use of the data
(each specific output).
3. Metadata on outliers, including which domains were searched for outliers and which
were not, should only be included in a data warehouse alongside outputs.
4. Refer to Tables 1 and 2 for some guidance – not necessarily exhaustive or exclusive –
on use-specific identification and treatment of outliers.
References
Cook, R. D. (1977). Detection of Influential Observations in Linear Regression.
Technometrics, 19, 15-18.
Hedlin, D., Falvey, H., Chambers, R. and Kokic, P. (2001). Does the Model Matter for GREG
Estimation? A Business Survey Example. Journal of Official Statistics, 17, 527-544.
Rousseeuw, P. J. (1984). Least Median of Squares Regression. Journal of the American
Statistical Association, 79, 871-880.