
Data cleaning

Table of Contents
1. Overview of data cleaning
  1.1. Rationale
  1.2. Definition
    1.2.1. Data cleaning as a process
    1.2.2. Screening
    1.2.3. Diagnosing phase
    1.2.4. Editing or Treatment phase
    1.2.5. Documenting Changes
  1.3. Purpose
  1.4. Importance
2. Data cleaning in the statistical analysis process (or workflow or pipeline or cycle)
  2.1. The statistical value chain
  2.2. From raw data to technically correct data
    2.2.1. Type conversion
    2.2.2. Character manipulation
    2.2.3. Character encoding issues
  2.3. From technically correct data to consistent data
1. Overview of data cleaning
1.1. Rationale
When collecting data, errors or discrepancies occur in spite of careful design, conduct and implementation
of error-prevention strategies. In practice, it is very rare that the raw data one works with are in the
correct format (data structure), are without errors, are complete and have all the labels and codes that
are needed for analysis. Data cleaning aims to identify and correct these errors, or at least to minimize
their impact on study results.
1.2. Definition
1.2.1. Data cleaning as a process
Data cleaning is the process of transforming raw data into consistent data that can be analyzed. More
precisely, data cleaning is the process of detecting, diagnosing, and editing faulty data, and documenting
this process. As part of the data cleaning process, data editing consists in changing the value of data
shown to be incorrect.
Data cleaning is a three-stage process involving repeated cycles of screening, diagnosing and editing of
suspected data abnormalities. Figure 1 shows these three steps, which can be initiated at three different
stages of a study.
Many times, what is detected is a suspected data point or pattern that needs careful examination.
Similarly, missing values require further investigation (structural vs true missing value).
Good practice
Rules for dealing with errors and with true missing and extreme values should be defined in advance.
The diagnostic and treatment phases of data cleaning require insight into the sources and types of
errors at all stages of the study, during as well as after measurement. The concept of data flow is crucial
in this respect. After measurement, research data undergo repeated steps of being entered into
information carriers, extracted, transferred to other carriers, edited, selected, transformed,
summarised, and presented. It is important to realize that errors can occur at any stage of the data flow,
including during data cleaning itself. Table 1 illustrates some of the sources and types of errors possible in
a large questionnaire survey.
Table 1: Issues to be considered during Data Collection, Management and Analysis of a Questionnaire Survey
Inaccuracy of a single measurement or data point may be acceptable and related to the inherent technical
error of the measurement instrument. Hence, data cleaning should focus on those errors that are
beyond small technical variations and that constitute a major shift within and beyond the population
distribution. In turn, data cleaning must be based on knowledge of technical errors and expected ranges
of normal values. Some errors deserve priority, but which ones are most important is highly study-specific.
1.2.2. Screening
To prepare data for screening, tidy the dataset by transforming the data into an easy-to-use format.
Within a tidied dataset:
 Fonts have been harmonized;
 Text is aligned to the left, numbers to the right;
 Each variable has been turned into a column and each observation into a row;
 There are no blank rows;
 Column headers are clear and visually distinct;
 Leading spaces have been deleted.
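As a minimal sketch of the last two points in R (the file name and columns are hypothetical), blank rows can be dropped and stray spaces removed before screening:
# Read the raw file (survey_raw.csv is a placeholder name)
raw <- read.csv("survey_raw.csv", header = TRUE, stringsAsFactors = FALSE)
# Drop rows that are completely blank
blank <- rowSums(is.na(raw) | raw == "") == ncol(raw)
raw <- raw[!blank, , drop = FALSE]
# Remove leading and trailing spaces from all text columns
chr <- sapply(raw, is.character)
raw[chr] <- lapply(raw[chr], trimws)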
When screening data it is convenient to distinguish four basic types of oddities: lack (missing) or excess
(duplicates) of data; outliers, including inconsistencies; strange patterns in (joint) distributions; and
unexpected analysis results and other types of inferences and abstractions. Examine data for the
following possible errors:
 Spelling and formatting irregularities: are categorical variables written incorrectly? Is the date
format consistent? For numeric fields, are all of the values numbers?
 Lack of data: do some variables have far fewer non-missing values compared to others?
 Excess data: are there duplicate entries or more answers than originally allowed?
 Outliers/inconsistencies: are there values that are so far beyond the typical distribution that
they seem potentially erroneous?
 Remarkable patterns: are there patterns that suggest that the respondent or enumerator has
not answered or recorded questions honestly?
 Suspect analysis results: do the answers to some questions seem counterintuitive or extremely
unlikely?
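A few simple checks in R cover most of these points; the sketch below assumes the data frame raw from above, with a hypothetical column marital_status:
# Lack of data: non-missing values per variable
colSums(!is.na(raw))
# Excess data: duplicate entries
sum(duplicated(raw))
# Spelling/formatting irregularities and outliers: inspect distributions
summary(raw)
table(raw$marital_status, useNA = "ifany")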
Screening methods need not only be statistical. Many outliers are detected by perceived nonconformity
with prior expectation, based on the researcher’s experience, pilot studies, evidence in the literature, or
common sense.
What can be done to make screening objective and systematic? To allow the researcher to understand
the data better, it should be examined using simple descriptive statistics.
For identifying suspect data, one can first predefine expectations about normal ranges, distribution
shapes, and strength of relationships. Second, the application of these criteria can be planned
beforehand, to be carried out during or shortly after data collection, during data entry, and regularly
afterward. Third, comparison of the data with the screening criteria can be partly automated and lead to
flagging of dubious data, patterns, and results.
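For example, the comparison with a predefined normal range can be partly automated with a simple flag; the cut-offs 0 and 110 below are illustrative assumptions, not fixed rules:
# Toy ages; in practice this would be a column of the dataset
age <- c(34, 150, 27, -2, NA)
# Predefined expectation: plausible ages lie between 0 and 110 years
suspect <- !is.na(age) & (age < 0 | age > 110)
which(suspect)   # positions 2 and 4 are flagged for diagnosis, not yet changed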
A special problem is that of erroneous inliers, i.e., data points generated by error but falling within the
expected univariate ranges. Erroneous inliers will often escape detection. Sometimes, inliers
are discovered to be suspect when viewed in relation to other variables, for example as multivariate
outliers, using scatter plots, regression analysis, or consistency checks.
1.2.3. Diagnosing phase
The purpose is to clarify the true nature of the worrisome data points, patterns, and statistics. Possible
diagnoses for each data point are as follows:
a. Error: a typo, or an answer that indicates the question was misunderstood;
b. True extreme: an answer that seems high but can be justified by other answers;
c. True normal (i.e., the prior expectation was incorrect): a valid record;
d. Idiopathic (i.e., no explanation found, but still suspect);
e. Missing data: an answer omitted by the respondent (nonresponse), questions skipped by the
enumerator, or dropout.
Some data points are clearly logically or biologically impossible. Hence, one may predefine not only
screening cut-offs to flag suspect data points (soft cut-offs), but also cut-offs for immediate diagnosis
of error (hard cut-offs). Figure 2 illustrates this method.
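A sketch of this double cut-off idea in R, using made-up limits for adult height in centimetres (soft: 140-210, hard: 50-250):
height <- c(172, 165, 251, 135, NA)   # toy measurements
status <- ifelse(height < 50 | height > 250, "error",     # beyond hard cut-off: impossible
          ifelse(height < 140 | height > 210, "suspect",  # beyond soft cut-off: needs diagnosis
                 "ok"))
status   # "ok" "ok" "error" "suspect" NA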
Some procedures for diagnosing the nature of suspected data points involve the following: one
procedure is to go to previous stages of the data flow to see whether a value is consistently the same.
This requires access to well-archived and documented data with justification for any changes made at
any stage. A second procedure is to look for information that could confirm the true extreme status of
an outlying data point. A third procedure is to collect additional information.
1.2.4. Editing or Treatment phase
After identification of errors, missing values, true (extreme or normal) values, the researcher has to
decide what to do with problematic observations. The options are limited to correcting, deleting, or
leaving unchanged. There are some general rules for which option to choose. Impossible values are
never left unchanged, but should be corrected if a correct value can be found, otherwise they should be
deleted. For true extreme values, the investigator may wish to further examine the influence of such
data points, individually and as a group, on analysis results before deciding whether or not to leave the
data unchanged.
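As a small illustration of these rules in R (assuming the data frame raw with hypothetical columns age and income): impossible values are blanked when no correction is available, and the influence of true extremes is examined before any decision:
# Impossible value with no known correct value: delete (set to missing)
raw$age[which(raw$age < 0)] <- NA
# True extreme values: compare results with and without them before deciding
extreme <- !is.na(raw$income) & raw$income > 1e6   # 1e6 is an assumed review threshold
c(mean_all = mean(raw$income, na.rm = TRUE),
  mean_without_extremes = mean(raw$income[!extreme], na.rm = TRUE))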
1.2.5. Documenting Changes
Documentation of errors, alterations, additions and error checking is essential to:
 Maintain data quality
 Avoid duplication of error checking by different data cleaners
 Recover data cleaning errors
 Determine the fitness of the data for use
 Inform users who may have used the data of what changes have been made since they
last accessed the data
Create a change log in which all information related to modified fields is recorded. This will serve as an
audit trail showing any modifications, and will allow a return to the original value if required. Within the
change log, store the following information (a minimal sketch follows the list below):
 Table (if multiple tables are implemented)
 Column, row
 Date changed
 Changed by
 Old value
 New value
 Comments
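A change log can be kept as a small table of its own; the base R sketch below records one hypothetical correction (the table, column, row, author and values are invented for illustration):
change <- data.frame(
  table        = "survey",
  column       = "age",
  row          = 17,
  date_changed = as.character(Sys.Date()),
  changed_by   = "data_cleaner_1",
  old_value    = "-2",
  new_value    = "NA",
  comments     = "Impossible value; no correct value could be recovered",
  stringsAsFactors = FALSE)
# Append to the existing log (or create it on first use)
change_log <- if (exists("change_log")) rbind(change_log, change) else change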
Make sure to document which data cleaning steps and procedures were implemented or followed, by
whom, how many responses were affected and for which questions.
ALWAYS make this information available when sharing the dataset internally or externally.
1.3. Purpose
Data cleaning aims at improving the content of statistical statements based on the data, as well as their
reliability.
1.4. Importance
Data cleaning may profoundly influence statistical statements based on the data. Typical actions like
imputation or outlier handling obviously influence the results of a statistical analysis.
2. Data cleaning in the statistical analysis process (or workflow or pipeline or cycle)
Statistical analysis is the result of a number of data processing steps, including data cleaning, where
each step increases the “value” of the data.
2.1. The statistical value chain
Figure 1 shows an overview of a typical data analysis project. Each rectangle represents data in a certain
state while each arrow represents the activities needed to get from one state to another. The first state
(Raw data) is the data as it comes in. Raw data files may lack headers, contain wrong data types (e.g.
numbers stored as strings), wrong category labels, unknown or unexpected character encoding and so
on. In short, reading such files into a statistical software package can be either difficult or impossible
without some sort of pre-processing.
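In R, a defensive first read keeps everything as text, so that unexpected values are not silently converted or dropped; the file name is a placeholder:
# Read the raw file as character data only; nothing is coerced yet
raw <- read.csv("survey_raw.csv", header = TRUE,
                colClasses = "character", stringsAsFactors = FALSE)
# Inspect names and content before deciding on types, labels and encodings
str(raw)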
Once this pre-processing has taken place, data can be deemed Technically correct. That is, in this state
data can be read into software, with correct names, types and labels, without further trouble. However,
this does not mean that the values are error-free or complete. For example, an age variable may be
reported negative, an under-aged person may be registered to possess a driver’s licence, or data may
simply be missing. Such inconsistencies obviously depend on the subject matter that the data pertain
to, and they should be ironed out before valid statistical analyses can be produced from such data.
Consistent data is the stage where data are ready for statistical analysis: this is the analytical data set. It is
the data that most statistical theories use as a starting point.
Once Statistical results have been produced they can be stored for reuse and finally, results can be
Formatted to include in statistical reports or publications.
Best practice
Store the input data for each stage (raw, technically correct, consistent, aggregated and formatted)
separately for reuse. Each step between the stages may be performed by a separate script, for
reproducibility.
Summarizing, a statistical analysis can be separated into five stages, from raw data to formatted output,
where the quality of the data improves in every step towards the final result. Data cleaning
encompasses two of the five stages of a statistical analysis.
2.2. From raw data to technically correct data
A data set is a collection of data that describes characteristics of a number of real-world objects (units).
By technically correct data, we mean a data set where each value:
a. Can be directly recognized as belonging to a certain variable
b. Is stored in a data type that represents the value domain of the real-world variable
In other words, for each unit, a text variable should be stored as text, a numeric variable as a number,
and so on, and all this in a format that is consistent across the data set.
The following operations can be needed in order to obtain technically correct data from raw data.
2.2.1. Type conversion
Converting a variable from one type to another is called coercion.
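In R, coercion is done with the as.* family of functions; values that cannot be converted become NA with a warning rather than causing an error:
as.numeric("7.3")     # 7.3
as.integer("42")      # 42
as.character(3.14)    # "3.14"
as.numeric("seven")   # NA, with the warning "NAs introduced by coercion"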
2.2.1.1. Recoding factors
Categorical variables can require some pre-processing in languages such as R as their values are stored
in specific data types. Indeed, in R the value of categorical variables is stored in factor variables. A factor
in R is an integer vector endowed with a table specifying what integer value corresponds to what level.
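A short illustration of how factors store and recode categorical values in R:
# A factor is an integer vector plus a table of levels
f <- factor(c("low", "high", "low", "medium"),
            levels = c("low", "medium", "high"))
as.integer(f)   # underlying codes: 1 3 1 2
levels(f)       # "low" "medium" "high"
# Recoding: rename and merge levels in one step
levels(f) <- list(low = "low", high = c("medium", "high"))
table(f)        # low: 2, high: 2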
2.2.1.2. Converting dates
Dates can require special attention when reading raw files into statistical software packages in general.
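In base R, a date read as text stays a character string until it is converted with an explicit format:
txt <- c("15/02/2023", "03/11/2022")
as.Date(txt, format = "%d/%m/%Y")   # "2023-02-15" "2022-11-03"
# An ambiguous string such as "03/11/2022" is only interpreted correctly
# because the day/month order is stated in the format argument.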
2.2.2. Character manipulation
Because of the many ways people can write the same things down, character data can be difficult to
process. For example, character manipulation can be used in coding categorical variables, i.e. classifying
(“messy”) text strings into a number of fixed categories. Two complementary approaches to string
coding are string normalization and approximate text matching. These activities can involve the
following tasks:
 Remove leading or trailing white space.
 Pad strings to a certain width.
 Transform to upper/lower case.
 Search for strings containing simple patterns (substrings).
 Approximate matching procedures based on string distances.
2.2.2.1. String normalization (standardization)
String normalization techniques are aimed at transforming a variety of strings to a smaller set of string
values which are more easily processed.
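A minimal normalization sketch in base R, collapsing several spellings of the same category into one value:
x <- c("  Male", "male ", "MALE.", " female", "Female")
x <- trimws(x)               # strip leading/trailing white space
x <- tolower(x)              # harmonize case
x <- gsub("[^a-z]", "", x)   # drop punctuation and stray characters
table(x)                     # female: 2, male: 3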
2.2.2.2. Approximate string matching
There are two forms of string matching. The first consists of determining whether a (range of)
substring(s) occurs within another string. In this case one needs to specify a range of substrings (called a
pattern) to search for in another string. In the second form one defines a distance metric between
strings that measures how "different" two strings are.
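Both forms are available in base R: grepl() searches for a pattern inside strings, and adist() computes an edit distance between strings (the example values are made up):
# Form 1: does the substring occur?
grepl("univ", c("University", "universite", "college"), ignore.case = TRUE)
# TRUE TRUE FALSE
# Form 2: how different are two strings?
adist("university", c("universty", "uinversity", "college"))
# distances 1 and 2 suggest misspellings of the same value; 9 does not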
2.2.3. Character encoding issues
A character encoding system is a system that defines how to translate each character of a given alphabet
into a computer byte or sequence of bytes (e.g. ASCII, ...). For languages such as R and Python to
correctly read a text file, they must know which character encoding scheme was used to store it.
Some actions can be taken to set the appropriate character encoding and fix related issues.
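In R, the encoding can be declared when reading a file or fixed afterwards with iconv(); "latin1" below is only an example of a source encoding:
# Declare the encoding while reading
lines <- readLines("survey_raw.csv", encoding = "latin1")
# Or convert explicitly to UTF-8 after reading
lines <- iconv(lines, from = "latin1", to = "UTF-8")
# read.csv(..., fileEncoding = "latin1") does the same when reading a table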
2.3. From technically correct data to consistent data
Consistent data are technically correct data that are fit for statistical analysis. They are data in which
missing values, special values, (obvious) errors and outliers are removed, corrected or imputed. The data
are consistent with constraints based on real-world knowledge about the subject that the data describe.
Consistent data can be understood to include in-record consistency, meaning that no contradictory
information is stored in a single record, and cross-record consistency, meaning that statistical summaries of
different variables do not conflict with each other. Finally, one can include cross-dataset consistency,
meaning that the dataset that is currently analysed is consistent with other datasets pertaining to the
same subject matter. In this tutorial, we mainly focus on in-record consistency, with the exception of
outlier handling, which can be considered a cross-record consistency issue.
The process towards consistent data always involves the following three steps.
i. Detection of an inconsistency. That is, one establishes which constraints are violated.
For example, an age variable is constrained to non-negative values.
ii. Selection of the field or fields causing the inconsistency. This is trivial in the case of a
univariate constraint, as in the previous step, but may be more cumbersome when cross-variable
relations are expected to hold. For example, the marital status of a child must be
unmarried. In the case of a violation it is not immediately clear whether age, marital
status or both are wrong.
iii. Correction of the fields that are deemed erroneous by the selection method. This may be
done through deterministic (model-based) or stochastic methods.
For many data correction methods these steps are not necessarily separated. Nevertheless, it is useful
to be able to recognize these sub-steps in order to make clear what assumptions are made during the
data cleaning process.
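A minimal sketch of these three steps in base R, using the age and marital status example above (the record, the rules and the correction are all invented for illustration):
# A single record containing an inconsistency
rec <- data.frame(age = 7, marital_status = "married", stringsAsFactors = FALSE)
# (i) Detection: which constraints are violated?
violated <- c(age_non_negative = rec$age < 0,
              child_unmarried  = rec$age < 15 & rec$marital_status == "married")
names(violated)[violated]   # "child_unmarried"
# (ii) Selection: this rule involves two fields, so age, marital_status or both may be wrong.
# (iii) Correction: here a deterministic choice is made, purely for illustration.
if (violated["child_unmarried"]) rec$marital_status <- NA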
2.3.1. Detection and localization of errors
This section describes a number of techniques to detect univariate and multivariate constraint
violations.
2.3.1.1. Missing values
A missing value is a placeholder for a datum of which the type is known but the value isn't. Therefore, it is
impossible to perform statistical analysis on data where one or more values in the data are missing. One
may choose to either omit elements from a dataset that contain missing values or impute a value, but
missingness is something to be dealt with prior to any analysis.
Note: it may happen that a missing value in a data set means 0 or "Not applicable". If that is the case, it
should be explicitly imputed with that value, because it is not unknown, but was coded as empty.
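A short base R sketch, assuming the data frame raw with a hypothetical column n_children for which an empty cell is known to mean zero:
# How many values are missing per variable?
colSums(is.na(raw))
# Omit incomplete records (only rows without any missing value remain)
complete <- raw[complete.cases(raw), ]
# Where NA is known to mean 0, impute that value explicitly
raw$n_children[is.na(raw$n_children)] <- 0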
2.3.1.2. Special values
2.3.1.3. Outliers
An outlier in a dataset is an observation (or set of observations) which appears to be inconsistent with
the remainder of that set of data.
Note: Outliers do not equal errors. They should be detected, but not necessarily removed. Their
inclusion in the analysis is a statistical decision.
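For a numeric variable, one common screening device is the boxplot rule; the sketch below flags, rather than removes, values beyond the whiskers (the income figures are made up):
income <- c(1200, 1350, 1100, 1500, 98000)
out <- boxplot.stats(income)$out   # values beyond 1.5 times the hinge spread
flag <- income %in% out            # keep the values, record the flag
income[flag]                       # 98000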
2.3.1.4. Obvious inconsistencies
An obvious inconsistency occurs when a record contains a value or combination of values that cannot
correspond to a real-world situation. For example, a person’s age cannot be negative, a man cannot be
pregnant and an under-aged person cannot possess a driver license.
Such knowledge can be expressed as rules or constraints. In the data editing literature these rules are
referred to as edit rules, or edits for short.
However, as the number of variables increases, the number of rules may increase rapidly and it may be
beneficial to manage the rules separate from the data. Moreover, since multivariate rules may be
interconnected by common variables, deciding which variable or variables in a record cause an
inconsistency may not be straightforward.
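One way to manage rules separately from the data is to store them as text that can be versioned and re-read; the sketch below evaluates each rule against a data frame raw with hypothetical columns age, marital_status and has_licence:
# Edit rules kept apart from the data
rules <- c(age_non_negative = "age >= 0",
           child_unmarried  = "!(age < 15 & marital_status == 'married')",
           minor_no_licence = "!(age < 18 & has_licence)")
# Evaluate every rule for every record; FALSE marks a violation
check <- sapply(rules, function(r) eval(parse(text = r), envir = raw))
colSums(!check, na.rm = TRUE)   # number of records violating each rule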
2.3.1.5. Error localization
The interconnectivity of edits is what makes error localization difficult.
2.3.2. Correction
Correction methods aim to fix inconsistent observations by altering invalid values in a record based on
information from valid data. Depending on the method, this is either a single-step procedure or a two-step
procedure where first an error localization method is used to empty certain fields, followed by an
imputation step.
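As a small illustration in R (assuming the data frame raw with a numeric column age), the two-step variant first empties the fields selected as erroneous and then imputes them:
# Step 1: error localization has selected negative ages; empty those fields
raw$age[which(raw$age < 0)] <- NA
# Step 2: impute the emptied (and otherwise missing) ages, here with the median
raw$age[is.na(raw$age)] <- median(raw$age, na.rm = TRUE)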
Data flow: Passage of recorded information through successive information carriers.
An information carrier is a means to keep (store) information. It exists in the form of a card file, form, or
computer file, for example.