Table of contents

1. Overview of data cleaning
   1.1. Rationale
   1.2. Definition
      1.2.1. Data cleaning as a process
      1.2.2. Screening
      1.2.3. Diagnosing phase
      1.2.4. Editing or Treatment phase
      1.2.5. Documenting Changes
   1.3. Purpose
   1.4. Importance
2. Data cleaning in the statistical analysis process (or workflow, pipeline or cycle)
   2.1. The statistical value chain
   2.2. From raw data to technically correct data
      2.2.1. Type conversion
      2.2.2. Character manipulation
      2.2.3. Character encoding issues
   2.3. From technically correct data to consistent data

1. Overview of data cleaning

1.1. Rationale

In collecting data, errors or discrepancies occur in spite of careful design, conduct and implementation of error-prevention strategies. In practice, it is very rare that the raw data one works with are in the correct format (data structure), free of errors, complete, and provided with all the labels and codes needed for analysis. Data cleaning aims to identify and correct these errors, or at least to minimize their impact on study results.

1.2. Definition

1.2.1. Data cleaning as a process

Data cleaning is the process of transforming raw data into consistent data that can be analyzed. More precisely, it is the process of detecting, diagnosing, and editing faulty data, and of documenting this process. As part of data cleaning, data editing consists in changing the value of data shown to be incorrect. Data cleaning is a three-stage process involving repeated cycles of screening, diagnosing and editing of suspected data abnormalities. Figure 1 shows these three steps, which can be initiated at three different stages of a study. Often, what is detected is a suspected data point or pattern that needs careful examination. Similarly, missing values require further investigation (structural vs. true missing value).

Good practice: predefined rules for dealing with errors and with true missing and extreme values must be defined.
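Such predefined rules can be written down as executable checks, so that they are applied the same way every time. The following is a minimal sketch in Python; the column names, rule labels and cut-offs are hypothetical, not prescribed by any standard.

```python
# Hypothetical predefined rules for one survey record: each rule maps a label
# to a check that returns True when the value is acceptable.
RULES = {
    "age_in_range": lambda r: r["age"] is not None and 0 <= r["age"] <= 120,
    # A missing height is tolerated here; an implausible one is not.
    "height_plausible": lambda r: r["height_cm"] is None or 40 <= r["height_cm"] <= 230,
}

def screen_record(record):
    """Return the labels of all rules that the record violates."""
    return [label for label, check in RULES.items() if not check(record)]

print(screen_record({"age": -3, "height_cm": 175}))  # flags the negative age
```

Keeping the rules in one place (rather than scattered through analysis scripts) also makes it easy to document which checks were applied, which matters for the documentation phase discussed below.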
The diagnostic and treatment phases of data cleaning require insight into the sources and types of errors at all stages of the study, during as well as after measurement. The concept of data flow is crucial in this respect. After measurement, research data undergo repeated steps of being entered into information carriers, extracted, transferred to other carriers, edited, selected, transformed, summarised, and presented. It is important to realize that errors can occur at any stage of the data flow, including during data cleaning itself. Table 1 illustrates some of the sources and types of errors possible in a large questionnaire survey.

Table 1: Issues to be considered during Data Collection, Management and Analysis of a Questionnaire Survey

Inaccuracy of a single measurement or data point may be acceptable, and related to the inherent technical error of the measurement instrument. Hence data cleaning should focus on those errors that go beyond small technical variations and that constitute a major shift within and beyond the population distribution. In turn, data cleaning must be based on knowledge of technical errors and of the expected ranges of normal values. Some errors deserve priority, but which ones are most important is highly study-specific.

1.2.2. Screening

To prepare data for screening, tidy the dataset by transforming the data into an easy-to-use format. Within a tidied dataset:
- fonts have been harmonized;
- text is aligned to the left, numbers to the right;
- each variable has been turned into a column and each observation into a row;
- there are no blank rows;
- column headers are clear and visually distinct;
- leading spaces have been deleted.

When screening data it is convenient to distinguish four basic types of oddities: lack (missing values) or excess (duplicates) of data; outliers, including inconsistencies; strange patterns in (joint) distributions; and unexpected analysis results and other types of inferences and abstractions.
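A first automated pass over these oddities can be sketched in plain Python; the records and field names below are hypothetical, and in practice a data-frame library such as pandas would do the same work more conveniently.

```python
from collections import Counter

# Hypothetical raw records from a questionnaire.
records = [
    {"id": 1, "age": 34},
    {"id": 2, "age": None},  # lack of data: a missing value
    {"id": 2, "age": 27},    # excess of data: a duplicate id
    {"id": 3, "age": 430},   # outlier: far beyond the expected range
]

# Lack: count missing values for a variable.
n_missing = sum(r["age"] is None for r in records)

# Excess: ids that occur more than once.
duplicates = [i for i, n in Counter(r["id"] for r in records).items() if n > 1]

# Outliers: values outside a predefined normal range (a soft cut-off).
outliers = [r["id"] for r in records
            if r["age"] is not None and not 0 <= r["age"] <= 120]

print(n_missing, duplicates, outliers)
```

Checks like these only flag suspect data; deciding what each flagged point actually is belongs to the diagnosing phase.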
Examine data for the following possible errors:
- Spelling and formatting irregularities: are categorical variables written incorrectly? Is the date format consistent? For numeric fields, are all of the values numbers?
- Lack of data: do some variables have far fewer non-missing values than others?
- Excess data: are there duplicate entries or more answers than originally allowed?
- Outliers/inconsistencies: are there values so far beyond the typical distribution that they seem potentially erroneous?
- Remarkable patterns: are there patterns suggesting that the respondent or enumerator has not answered or recorded questions honestly?
- Suspect analysis results: do the answers to some questions seem counterintuitive or extremely unlikely?

Screening methods need not only be statistical. Many outliers are detected by perceived nonconformity with prior expectation, based on the researcher's experience, pilot studies, evidence in the literature, or common sense. What can be done to make screening objective and systematic? To allow the researcher to understand the data better, they should be examined using simple descriptive statistics. For identifying suspect data, one can first predefine expectations about normal ranges, distribution shapes, and strengths of relationships. Second, the application of these criteria can be planned beforehand, to be carried out during or shortly after data collection, during data entry, and regularly afterwards. Third, comparison of the data with the screening criteria can be partly automated and lead to the flagging of dubious data, patterns, and results.

A special problem is that of erroneous inliers, i.e., data points generated by error but falling within the expected ranges (multivariate outliers). Erroneous inliers will often escape detection. Sometimes, inliers are discovered to be suspect when viewed in relation to other variables, using scatter plots, regression analysis, or consistency checks.

1.2.3. Diagnosing phase

The purpose of this phase is to clarify the true nature of the worrisome data points, patterns, and statistics. Possible diagnoses for each data point are as follows:
a. Error: a typo, or an answer that indicates the question was misunderstood;
b. True extreme: an answer that seems high but can be justified by other answers;
c. True normal (i.e., the prior expectation was incorrect): a valid record;
d. Idiopathic (i.e., no explanation found, but still suspect);
e. Missing data: an answer omitted by the respondent (nonresponse), questions skipped by the enumerator, or dropout.

Some data points are clearly logically or biologically impossible. Hence, one may predefine not only screening cut-offs to flag suspect data points (soft cut-offs), but also cut-offs for immediate diagnosis of error (hard cut-offs). Figure 2 illustrates this method. Procedures for diagnosing the nature of suspected data points include the following. One procedure is to go back to previous stages of the data flow to see whether a value is consistently the same; this requires access to well-archived and documented data, with justification for any changes made at any stage. A second procedure is to look for information that could confirm the true extreme status of an outlying data point. A third procedure is to collect additional information.

1.2.4. Editing or Treatment phase

After identification of errors, missing values, and true (extreme or normal) values, the researcher has to decide what to do with problematic observations. The options are limited to correcting, deleting, or leaving unchanged. There are some general rules for which option to choose. Impossible values are never left unchanged: they should be corrected if a correct value can be found, and deleted otherwise. For true extreme values, the investigator may wish to examine the influence of such data points, individually and as a group, on analysis results before deciding whether or not to leave the data unchanged.

1.2.5. Documenting Changes

Documentation of errors, alterations, additions and error checking is essential to:
- maintain data quality;
- avoid duplication of error checking by different data cleaners;
- recover from data cleaning errors;
- determine the fitness of the data for use;
- inform users who may have used the data of the changes made since they last accessed it.

Create a change log in which all information related to modified fields is recorded. This log will serve as an audit trail showing all modifications, and will allow a return to the original value if required. Within the change log, store the following information:
- table (if multiple tables are implemented);
- column and row;
- date changed;
- changed by;
- old value;
- new value;
- comments.

Make sure to document what data cleaning steps and procedures were implemented or followed, by whom, how many responses were affected and for which questions. ALWAYS make this information available when sharing the dataset internally or externally.

1.3. Purpose

Data cleaning aims at improving the content of statistical statements based on the data, as well as their reliability.

1.4. Importance

Data cleaning may profoundly influence the statistical statements based on the data. Typical actions such as imputation or outlier handling obviously influence the results of a statistical analysis.

2. Data cleaning in the statistical analysis process (or workflow, pipeline or cycle)

A statistical analysis is the result of a number of data processing steps, including data cleaning, in which each step increases the "value" of the data.

2.1. The statistical value chain

Figure 1 shows an overview of a typical data analysis project. Each rectangle represents data in a certain state, while each arrow represents the activities needed to get from one state to another. The first state (Raw data) is the data as it comes in. Raw data files may lack headers, contain wrong data types (e.g. numbers stored as strings), wrong category labels, unknown or unexpected character encodings, and so on. In short, reading such files into a statistical software package can be difficult or impossible without some sort of pre-processing. Once this pre-processing has taken place, data can be deemed Technically correct. That is, in this state data can be read into software, with correct names, types and labels, without further trouble. However, this does not mean that the values are error-free or complete. For example, an age variable may be reported negative, an under-aged person may be registered as possessing a driver's licence, or data may simply be missing. Such inconsistencies obviously depend on the subject matter that the data pertain to, and they should be ironed out before valid statistical analysis of such data can be produced. Consistent data is the stage where data are ready for statistical analysis: this is the analytical data set, the data that most statistical theories use as a starting point. Once Statistical results have been produced, they can be stored for reuse, and finally the results can be Formatted for inclusion in statistical reports or publications.

Best practice: store the input data for each stage (raw, technically correct, consistent, aggregated and formatted) separately for reuse. Each step between the stages may be performed by a separate script, for reproducibility.

Summarizing, a statistical analysis can be separated into five stages, from raw data to formatted output, where the quality of the data improves at every step towards the final result. Data cleaning encompasses two of the five stages of a statistical analysis.

2.2. From raw data to technically correct data

A data set is a collection of data that describes characteristics of a number of real-world objects (units). By technically correct data we understand a data set in which each value:
a. can be directly recognized as belonging to a certain variable;
b. is stored in a data type that represents the value domain of the real-world variable.

In other words, for each unit, a text variable should be stored as text, a numeric variable as a number, and so on, all in a format that is consistent across the data set. The following operations can be needed in order to obtain technically correct data from raw data.

2.2.1. Type conversion

Converting a variable from one type to another is called coercion.

2.2.1.1. Recoding factors

Categorical variables can require some pre-processing in languages such as R, as their values are stored in specific data types. Indeed, in R the values of a categorical variable are stored in a factor variable. A factor in R is an integer vector endowed with a table specifying what integer value corresponds to what level.

2.2.1.2. Converting dates

Dates can require special attention when reading raw files into a statistical software package.

2.2.2. Character manipulation

Because of the many ways people can write the same thing down, character data can be difficult to process. For example, character manipulation can be used in coding categorical variables, i.e. classifying ("messy") text strings into a number of fixed categories. Two complementary approaches to string coding are string normalization and approximate text matching. These activities can involve the following tasks:
- remove prepending (leading) or trailing white spaces;
- pad strings to a certain width;
- transform to upper/lower case;
- search for strings containing simple patterns (substrings);
- approximate matching procedures based on string distances.

2.2.2.1. String normalization (standardization)

String normalization techniques aim at transforming a variety of strings into a smaller set of string values which are more easily processed.

2.2.2.2. Approximate string matching

There are two forms of string matching. The first consists of determining whether a (range of) substring(s) occurs within another string.
In this case one needs to specify a range of substrings (called a pattern) to search for in another string. In the second form, one defines a distance metric between strings that measures how "different" two strings are.

2.2.3. Character encoding issues

A character encoding system defines how each character of a given alphabet is translated into a computer byte or sequence of bytes (e.g. ASCII). For languages such as R and Python to correctly read a text file, they must know which character encoding scheme was used to store it. Some actions can be taken to set the appropriate character encoding and fix related issues.

2.3. From technically correct data to consistent data

Consistent data are technically correct data that are fit for statistical analysis. They are data in which missing values, special values, (obvious) errors and outliers have been removed, corrected or imputed. The data are consistent with constraints based on real-world knowledge about the subject that the data describe. Consistent data can be understood to include in-record consistency, meaning that no contradictory information is stored in a single record, and cross-record consistency, meaning that statistical summaries of different variables do not conflict with each other. Finally, one can include cross-dataset consistency, meaning that the dataset currently analysed is consistent with other datasets pertaining to the same subject matter. In this tutorial, we mainly focus on in-record consistency, with the exception of outlier handling, which can be considered a matter of cross-record consistency.

The process towards consistent data always involves the following three steps:
i. Detection of an inconsistency; that is, establishing which constraints are violated. For example, an age variable is constrained to non-negative values.
ii. Selection of the field or fields causing the inconsistency. This is trivial in the case of a univariate constraint, as in the previous example, but may be more cumbersome when cross-variable relations are expected to hold. For example, the marital status of a child must be unmarried; in the case of a violation it is not immediately clear whether age, marital status or both are wrong.
iii. Correction of the fields that are deemed erroneous by the selection method. This may be done through deterministic (model-based) or stochastic methods.

For many data correction methods these steps are not necessarily separated. Nevertheless, it is useful to be able to recognize these sub-steps in order to make clear what assumptions are made during the data cleaning process.

2.3.1. Detection and localization of errors

This section describes a number of techniques to detect univariate and multivariate constraint violations.

2.3.1.1. Missing values

A missing value is a placeholder for a datum of which the type is known but the value isn't. Therefore, it is impossible to perform statistical analysis on data in which one or more values are missing. One may choose either to omit elements from a dataset that contain missing values or to impute a value, but missingness is something to be dealt with prior to any analysis.

Note: it may happen that a missing value in a data set actually means 0 or "not applicable". If that is the case, it should be explicitly imputed with that value, because it is not unknown, but was simply coded as empty.

2.3.1.2. Special values

2.3.1.3. Outliers

An outlier in a dataset is an observation (or set of observations) which appears to be inconsistent with that set of data.

Note: outliers do not equal errors. They should be detected, but not necessarily removed; their inclusion in the analysis is a statistical decision.

2.3.1.4. Obvious inconsistencies

An obvious inconsistency occurs when a record contains a value or combination of values that cannot correspond to a real-world situation.
For example, a person's age cannot be negative, a man cannot be pregnant, and an under-aged person cannot possess a driver's licence. Such knowledge can be expressed as rules or constraints. In the data editing literature these rules are referred to as edit rules, or edits in short. However, as the number of variables increases, the number of rules may grow rapidly, and it may be beneficial to manage the rules separately from the data. Moreover, since multivariate rules may be interconnected by common variables, deciding which variable or variables in a record cause an inconsistency may not be straightforward.

2.3.1.5. Error localization

The interconnectivity of edits is what makes error localization difficult.

2.3.2. Correction

Correction methods aim to fix inconsistent observations by altering invalid values in a record based on information from valid data. Depending on the method, this is either a single-step procedure or a two-step procedure in which an error localization method is first used to empty certain fields, followed by an imputation step.

Data flow: passage of recorded information through successive information carriers. An information carrier is a means to keep (store) information; it exists, for example, in the form of a card file, a form, or a computer file.
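The detection, selection and correction steps of section 2.3 can be sketched end to end in Python. This is a minimal illustration only: the records, the single edit rule (age must be non-negative) and the choice of median imputation are all hypothetical simplifications, not a recommended correction method.

```python
from statistics import median

# Hypothetical records; the edit rule constrains age to non-negative values.
records = [{"age": 25}, {"age": -1}, {"age": 40}, {"age": 31}]

# Step i: detection -- establish which records violate the constraint.
violations = [r for r in records if r["age"] is not None and r["age"] < 0]

# Step ii: selection -- for this univariate rule the faulty field is simply
# "age"; empty the fields deemed erroneous (error localization).
for r in violations:
    r["age"] = None

# Step iii: correction -- impute the emptied fields, here with the median
# of the remaining valid values (a deterministic choice).
valid = [r["age"] for r in records if r["age"] is not None]
for r in records:
    if r["age"] is None:
        r["age"] = median(valid)

print([r["age"] for r in records])
```

With a multivariate rule (e.g. a child recorded as married), step ii is no longer trivial, which is exactly the error localization problem described above.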