USER'S MANUAL
STATISTICAL DATA VALIDATION TOOLKIT
June 2015
African Centre for Statistics

Table of Contents

1 Introduction to the Statistical Data Validation Toolkit
2 Programming Platform Selection
3 Data Requirement
   3.1 Indicator Coding
   3.2 Country Coding
   3.3 Input Data Format
   3.4 Rules Data Format
   3.5 Creating New Rule
4 Solution Architecture
   4.1 Architecture Diagram
   4.2 Functions
5 Step by Step Validation Process
6 Output Files
7 Annex I: Indicators coding list
8 Annex II: R Colors

1 Introduction to the Statistical Data Validation Toolkit

The African Centre for Statistics (ACS) has developed software that validates statistical data automatically. The toolkit is a computerized application that validates datasets compiled from national and other sources against predefined rules and scientific standards. It is expected to identify inconsistencies in the datasets and report them to users, so that users can take measures to correct them.

2 Programming Platform Selection

After an in-depth analysis of several programming platforms, the toolkit was developed in R. R is a programming language and software environment for statistical computing: an integrated suite of software facilities geared to statistical data analysis and manipulation, calculation and graphical display. R was selected mainly for the following reasons:

- it has an effective data storage and access facility that can handle bulk statistical data
- it is equipped with a large, integrated collection of modules for statistical calculation and data manipulation
- it has pre-built graphics facilities for presenting statistical data
- its commands are easy to use and intuitive, which facilitates and expedites software development, and
- ACS staff have expertise in R programming.
3 Data Requirement

3.1 Indicator Coding

The validation algorithm is based on the standard data collection questionnaire developed by ACS. To facilitate programming and improve the performance of the algorithm, every indicator is assigned a specific code, and the toolkit accesses indicators through these codes. The codes used in the data files must match those used in the rules repository, because the repository is built on the defined coding scheme.

The complete indicator coding list is attached in Annex I: Indicators coding list.

3.2 Country Coding

To avoid errors caused by misspelled country names, countries are also identified by codes. The toolkit uses the ISO ALPHA-3 country codes as adopted by the UN Statistics Division. The codes of African countries are listed below for reference.

Table 1. Country code

Ser. No. | Country | Numerical code | ISO ALPHA-3 code
-------- | ------- | -------------- | ----------------
1 | Algeria | 12 | DZA
2 | Angola | 24 | AGO
3 | Benin | 204 | BEN
4 | Botswana | 72 | BWA
5 | Burkina Faso | 854 | BFA
6 | Burundi | 108 | BDI
7 | Cabo Verde | 132 | CPV
8 | Cameroon | 120 | CMR
9 | Central African Republic | 140 | CAF
10 | Chad | 148 | TCD
11 | Comoros | 174 | COM
12 | Congo | 178 | COG
13 | Côte d'Ivoire | 384 | CIV
14 | Democratic Republic of the Congo | 180 | COD
15 | Djibouti | 262 | DJI
16 | Egypt | 818 | EGY
17 | Equatorial Guinea | 226 | GNQ
18 | Eritrea | 232 | ERI
19 | Ethiopia | 231 | ETH
20 | Gabon | 266 | GAB
21 | Gambia | 270 | GMB
22 | Ghana | 288 | GHA
23 | Guinea | 324 | GIN
24 | Guinea-Bissau | 624 | GNB
25 | Kenya | 404 | KEN
26 | Lesotho | 426 | LSO
27 | Liberia | 430 | LBR
28 | Libya | 434 | LBY
29 | Madagascar | 450 | MDG
30 | Malawi | 454 | MWI
31 | Mali | 466 | MLI
32 | Mauritania | 478 | MRT
33 | Mauritius | 480 | MUS
34 | Morocco | 504 | MAR
35 | Mozambique | 508 | MOZ
36 | Namibia | 516 | NAM
37 | Niger | 562 | NER
38 | Nigeria | 566 | NGA
39 | Rwanda | 646 | RWA
40 | Sao Tome and Principe | 678 | STP
41 | Senegal | 686 | SEN
42 | Seychelles | 690 | SYC
43 | Sierra Leone | 694 | SLE
44 | Somalia | 706 | SOM
45 | South Africa | 710 | ZAF
46 | South Sudan | 728 | SSD
47 | Sudan | 729 | SDN
48 | Swaziland | 748 | SWZ
49 | Togo | 768 | TGO
50 | Tunisia | 788 | TUN
51 | Uganda | 800 | UGA
52 | United Republic of Tanzania | 834 | TZA
53 | Zambia | 894 | ZMB
54 | Zimbabwe | 716 | ZWE

3.3 Input Data Format

The data validation toolkit adheres strictly to the format of the data collection questionnaire used for ASYB data collection, with minor modifications. The format is shown below.

cCode | iCode | Indicators | Values in the year columns (2000 to 2014), in order
----- | ----- | ---------- | ---------------------------------------------------
 | | Population and Demography |
AGO | 1 | Mid-year population - Total (Million) | 13.9, 14.4, 14.9, 15.4, 16.0, 16.5, 17.1, 17.7, 18.3, 18.9, 19.5, 20.2, 20.8, 21.5
AGO | 2 | Female (%) | 50.7, 50.7, 50.6, 50.6, 50.6, 50.6, 50.5, 50.5, 50.5, 50.5, 50.5, 50.4, 50.4, 50.4
AGO | | Population by Age Group (000) |
AGO | 5 | 0-14 years | 6 633, 6 948, 6 869, 7 116, 7 375, 7 639, 7 904, 8 202, 8 496, 8 786, 9 070, 9 346, 9 647, 9 930, 10 199
AGO | 6 | 15-64 years | 7 161, 7 402, 7 665, 7 944, 8 234, 8 500, 8 783, 9 081, 9 396, 9 730, 10 046, 10 389, 10 757
AGO | 7 | 65+ years | 344, 356, 368, 381, 394, 406, 420, 434, 447, 461, 473, 487, 501, 516
AGO | 10 | Active Population ('000) | 5 033.1, 5 185.4, 5 334.6, 5 522.6, 5 704.5, 5 870.8, 6 017.7, 6 173.6, 6 365.9, 6 621.1, 6 886.0, 7 131.9, 7 374.0, 7 646.2
AGO | 11 | Female (%) | 48.4, 48.3, 47.8, 47.8, 47.5, 46.8, 46.2, 45.6, 45.3, 45.6, 45.9, 45.9
AGO | 13 | Crude birth rate Total (per 1000) | 50.5, 50.3, 50.0, 49.8, 49.5, 49.2, 48.8, 48.3, 47.7, 47.1, 46.3, 45.6, 44.8, 44.1
AGO | 18 | Econ. active pop. in agric. (as % of Total) | 86.170, 86.184, 86.549, 86.299, 86.283, 86.495, 86.943, 87.469, 87.324, 86.360, 85.361, 84.634, 83.984, 83.048
AGO | 20 | Prevalence of undernourishment | 47.5, 45.6, 43.4, 40.6, 37.6, 35.1, 33.3, 32.1, 30.7, 29.2, 28, 27.4
AGO | | Education |
AGO | 23 | Teaching staff at first level - Total (thousands) | 42.31, 43.3, 53.96, 72.92, 73.0, 74.5, 75.6, 79.9, 103.0, 108.3
AGO | 24 | Teaching staff at second level - Total (thousands) | 8.75, 8.1, 25.26, 34.14, 34.6, 30.4, 31.7, 39.9, 19.1, 53.0
AGO | 25 | Teaching staff at third level - Total (thousands) | 5.0, 5.0, 3.8, 5.2, 5.3, 8.5, 8.5, 19.5, 7.7, 25.6
AGO | 26 | First level student enrollment - Total (thousands) | 3 559, 3 930, 4 046, 4 273, 5 027
AGO | 27 | First level student enrollment - Female (thousands) | 1 813, 1 855, 1 912, 1 956

Empty cells (missing values) are not shown in this extract, so rows with fewer than 15 values are not aligned to specific years.

Table 2. Extracts of the data file

Columns: The first two column headers (cCode and iCode) are critical: the system expects them to be spelled exactly as shown here. cCode and iCode refer to the country code and the indicator code respectively. In most cases the validation reads the data row by row, so the country code is repeated in every row to simplify the code. The first three columns (cCode, iCode and Indicators) are mandatory, although the system does not use the data in the indicator column.
The indicator column must nevertheless be present in the file as a placeholder. The remaining data columns can vary depending on the available data, but the years are expected to be in sequential order. The toolkit treats the last column as the current year.

Acceptable Format: The toolkit accepts data from Microsoft Excel files. Its user interface lets the user select as many files as are available; the system reads the first worksheet of each file and generates a consolidated R data frame in memory for validation. All files are expected to have the same number of columns. An empty cell is treated as a missing value; a cell containing zero (0) is not.

3.4 Rules Data Format

Table 3. Rules format

iCode | rName | Warning message | Color | active | Function
----- | ----- | --------------- | ----- | ------ | --------
 | Missing Values | Significant number of missing values for this indicator | cyan4 | Y | missingValues ( 30% )
 | Outlier | Potential outlier | Red | Y | markOutliers ( )
 | Compare Countries | Gabon is greater than Ethiopia | Cadetblue | Y | isCountryGreater ( 19 20 26 68 )
26 | Enrollment growth Primary education | Enrollment in primary education is decreasing | darkgoldenrod | Y | Growth ( 26 2% )
27 | Enrollment growth Primary education female | Enrollment in primary education (female) is not growing | darkgoldenrod | Y | Growth ( 27 2% )
68 | GNI per capita | GNI per capita is not growing | darkgoldenrod | Y | Growth ( 68 2% )
71 | Richest income | The Richest income is less than the poorest income | Cadetblue | Y | Greater ( 71 72 )
34 | Literacy rate - male | Male literacy rate is out of range | deepskyblue4 | Y | minMax ( 34 0 100 )
35 | Literacy rate - female | Female literacy rate is out of range | deepskyblue4 | Y | minMax ( 35 0 100 )
1 | Population | Suspicious population growth | cyan4 | Y | Growth ( 1 1% )
1 | Population growth | Population size does not match with the estimate (population growth rate) | Chocolate | Y | populationGrowth ( 1 16 2 )
2 | percentage of female | Unacceptable percentage of population (female) | Grey | Y | minMax ( 2 40 60 )
1 | population total | Population size is different from sum of age groups | Brown | Y | Equality ( getSumOfIndicatorValues ( 5 6 7 ) getIndicatorValue ( 1 ) )
3 | Urban population percentage | Unacceptable percentage of population (urban) | Grey | Y | minMax ( 3 5 80 )
5 | Population growth age 1 | Suspicious population growth age 0-14 | cyan4 | Y | Growth ( 5 1% )
6 | Population growth age 2 | Suspicious population growth age 15-64 | cyan4 | Y | Growth ( 5 1% )
7 | Population growth age 3 | Suspicious population growth age 65+ | cyan4 | Y | Growth ( 5 1% )
10 | Active Population | Active population size is greater than total population size | Cadetblue | Y | Greater ( 1 10 )
11 | Active female | Active female population out of range | Grey | Y | minMax ( 11 20 70 )
13 | Birth rate greater than death rate | Death rate is greater than birth rate | Cadetblue | Y | Greater ( 13 14 )
14 | Crude death rate | Crude death rate out of range | deepskyblue4 | Y | minMax ( 13 0 50 )
15 | Fertility rate | Fertility rate out of range | deepskyblue4 | Y | minMax ( 15 0 10 )
17 | Dependency ratio | Dependency ratio out of range | deepskyblue4 | Y | minMax ( 17 0 100 )

Columns:

iCode: The first column holds the indicator code. There are two types of rules: generic and indicator-based. Generic rules apply to all indicators, and the system applies them to every cell in the data file; Missing Values and Outlier in the table above are examples. Indicator-based rules, such as GNI per capita and Richest income, are applied only when the toolkit encounters the specified indicator, that is, only to the rows of that indicator in the data file.
In the rules file, a rule whose iCode cell is empty is generic; otherwise the rule is specific to the indicator whose code is given.

rName: A descriptive name for the rule, for the benefit of anyone reading the rules repository. The toolkit does not use this column in its validation process.

Warning message: The message shown in the report files when the data fails the rule. It warns the user that a specific cell or row of data has a validation issue.

Color: The background color used to mark invalid cells in the Excel report file. When a cell fails the rule, the system reads this column and sets the background of the cell to the specified color. A complete list of acceptable colors is attached in Annex II: R Colors.

Active: The status of the rule, which is either active or inactive. Put Y in this column to activate a rule and leave the column empty to deactivate it. The toolkit considers only active rules, so users can switch rules off for a specific validation run.

Function: This column and the columns after it define the actual rule. Rules are constructed from the functions defined in Section 4.2, following a predefined syntax. The following section explains how to construct a rule from those functions.

3.5 Creating New Rule

Users can create new rules that suit the requirements of their data without writing a single line of R code. The toolkit is designed so that even a novice R user can create powerful validation rules in an intuitive manner. New rules can be created only in the rules repository file. A user can insert a new rule into an existing rules file or create a new rules file with any filename.
However, the rules file must follow the format specified in Section 3.4 above: the mandatory columns must be present, with the expected column names. Once the format is in place, new rules can be inserted in the same way as shown in Table 3.

This involves entering the following values for a rule:

- Indicator code in the iCode column, if the rule depends on an indicator code
- Rule name in the rName column, to give the rule a descriptive name
- Warning message in the warning message column; this is the message displayed in the output files
- Color in the color column; this is the background color used in the Excel output file. Use the predefined color names specified in Annex II: R Colors
- Status of the rule in the Active column
- The function(s) that form the logic of the validation rule in the columns that follow.

Constructing the validating function

The validating rule is constructed from the predefined functions in Section 4.2. The whole construct instructs the system to work on a specific portion of the data and returns either TRUE or FALSE. The toolkit decides on the validity of that portion of data from the response: TRUE means the data is valid, and FALSE fires the warning message of the rule under consideration.

All functions have a similar format. The major components of a function are:

- Name: all functions have a name
- Opening parenthesis '(': in all functions the name is followed by an opening parenthesis
- Parameters: most functions have parameters. The allowed type and number of parameters of each predefined function are specified in Section 4.2 below. Parameters can also be functions. Some functions have no parameters, some have one, and others have more than one.
- Closing parenthesis ')': once the parameters are listed, a closing parenthesis tells the system that the parameter list is complete.

The opening and closing parentheses are required even when the function has no parameters. Each token (component) of the function must be placed in its own cell, and there must be no empty cell between tokens.

Example Rules

Example 1: Mark outliers

iCode | rName | Warning Message | color | active | Function
----- | ----- | --------------- | ----- | ------ | --------
 | Outlier | Potential outlier | red | Y | markOutliers ( )

This is a very simple rule with several notable characteristics. It is a generic rule that does not depend on any specific indicator, which is why the iCode column is empty. It has a name, a warning message and a color, and it is active. The function is markOutliers, which has no parameters. Note how the opening and closing parentheses are presented.

Example 2: Range or minMax

iCode | rName | Warning Message | color | active | Function
----- | ----- | --------------- | ----- | ------ | --------
32 | Public expenditure on education | Percentage of public expenditure on education is out of range | deepskyblue4 | Y | minMax ( 32 5 50 )

The above rule depends on the indicator 'public expenditure on education', whose code is 32. The function is minMax, which has three parameters: the indicator code, followed by the minimum allowed value, followed by the maximum allowed value. When running the validation with this rule, the toolkit picks each value of 'public expenditure on education' and checks whether it lies between 5 and 50. If the value is out of range, the toolkit sends the warning message to the output files.
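The way the Example 2 rule is read and applied can be traced with a short sketch. It is written in Python purely for illustration (the toolkit itself is implemented in R). The names parse_rule and min_max, the dictionary of yearly values, and the treatment of both bounds as inclusive are assumptions made for this sketch, not the toolkit's actual code.

```python
def parse_rule(cells):
    """Split the Function cells of a rule row into (name, parameters).

    `cells` holds one token per spreadsheet cell, for example
    ["minMax", "(", "32", "5", "50", ")"].
    """
    if len(cells) < 3 or cells[1] != "(" or cells[-1] != ")":
        raise ValueError("expected: name ( parameters... )")
    return cells[0], cells[2:-1]


def min_max(value, this_min, this_max):
    """Return True when value lies in [this_min, this_max].

    Inclusive bounds are an assumption of this sketch.
    """
    return this_min <= value <= this_max


# The Function cells of the Example 2 rule, one token per cell.
name, params = parse_rule(["minMax", "(", "32", "5", "50", ")"])
indicator, low, high = int(params[0]), float(params[1]), float(params[2])

# Hypothetical values of indicator 32 for one country, keyed by year.
row = {2012: 12.4, 2013: 61.0}

# Years whose value fails the rule and would receive the warning message.
warnings = [year for year, value in row.items()
            if not min_max(value, low, high)]
print(name, indicator, warnings)
```

A function without parameters, such as markOutliers ( ), passes through the same parser and yields an empty parameter list, which is why the parentheses are required even then.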
Example 3: Equality

iCode | rName | Warning Message | color | active | Function
----- | ----- | --------------- | ----- | ------ | --------
1 | population total | Population size is different from sum of age groups | brown | Y | equality ( getSumOfIndicatorValues ( 5 6 7 ) getIndicatorValue ( 1 ) )

The above rule is more complex, and is a typical example of how more powerful rules can be designed by combining multiple functions. The equality function checks that the total population size equals the sum of the population sizes of the three age groups. The function equality has two parameters. In this case they are computed by two other functions, getSumOfIndicatorValues and getIndicatorValue. The function getSumOfIndicatorValues has a variable number of parameters; here it has three: 5, 6 and 7 are the codes of the indicators whose values are summed. The function getIndicatorValue has one parameter: it takes an indicator code and returns its value. Once these two functions have returned their values from the data file, the equality function checks whether the two values are equal. Please note how the opening and closing parentheses are used.

4 Solution Architecture

4.1 Architecture Diagram

[Figure 1 components: Data, Rules Repository, Data Loader, Rule Parser, User, User Interface, Data Validation Engine, Validation Report]

Figure 1. Overall architecture of the validation toolkit.

The overall architecture of the software is depicted in Fig. 1. The toolkit expects two types of input files: the data file and the rules repository. After validation, two output files (.xlsx and .txt) are generated to report the result.

4.2 Functions

The toolkit has predefined functions, each of which performs an atomic data extraction or validation task. Rules are constructed from these functions. There are two types of function, depending on their return value.

1) Validation functions: return Boolean values (TRUE or FALSE).
   These functions determine the validity of a specific cell value in the data file. Validation functions can be further classified into two categories:
   a. Indicator based: functions that are executed when the toolkit encounters the indicator defined in the iCode column of the specific rule
   b. Generic: functions that are executed at every data cell
2) Helper functions: return a data value, either by extracting it or by applying some statistical computation. Such functions extract and compute data and feed it to the validation functions.

Below is the list of all functions defined in the toolkit.

Numeric getSumOfIndicatorValues (indicatorList):
  o used to sum the values of indicators
  o has a variable number of parameters, all of which must be indicator codes
  o returns the sum of the values of the indicators in the argument list
  o helper function

Boolean isCountryGreater (countryG, countryL, indicatorList):
  o used to compare the indicator values of two countries
  o has a variable number of parameters. The first two parameters are country codes, followed by the indicators to be compared. The first country is expected to have the greater value for each of the listed indicators.
  o validation function
  o generic function

Boolean minMax (indicator, thisMin, thisMax):
  o used to validate that an indicator value falls within a range
  o has three parameters
      indicator: indicator code to be validated
      thisMin: minimum allowed value of the range
      thisMax: maximum allowed value of the range
  o validation function
  o indicator based function

Numeric getIndicatorByYear (indicator, year):
  o used to retrieve the value of an indicator for a given year
  o has two parameters
      indicator: code of the indicator whose value is to be retrieved
      year: year of the value to be retrieved
  o helper function

Boolean equality (x, y):
  o validates the equality of two values x and y
  o has two parameters, both numeric
  o validation function
  o indicator based

Boolean populationGrowth (baseIndicator, growthIndicator, threshold):
  o used to validate the growth of an indicator value against a growth rate indicator. A typical example is validating whether the population is growing as expected, given the population size of the previous year and the average annual population growth rate
  o has three parameters
      baseIndicator: indicator whose growth is validated
      growthIndicator: indicator that measures the growth rate of baseIndicator
      threshold: controls how closely the data must match the calculated value
  o validation function
  o indicator based
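The populationGrowth description above leaves the exact arithmetic open. One plausible reading, sketched here in Python for illustration only (the toolkit itself is written in R), projects the previous year's value forward by the growth rate and accepts the current value if it lies within threshold percent of that projection. The function name population_growth_ok, the growth formula, the reading of threshold as a percentage tolerance, and all data values are assumptions of this sketch.

```python
def population_growth_ok(previous, current, growth_rate_pct, threshold_pct):
    """Check `current` against `previous` grown by `growth_rate_pct` percent.

    Returns True when `current` deviates from the projected value by at
    most `threshold_pct` percent (an assumed meaning of the threshold).
    """
    expected = previous * (1 + growth_rate_pct / 100.0)
    if expected == 0:
        # Avoid dividing by zero; only an unchanged zero is accepted.
        return current == 0
    deviation_pct = abs(current - expected) / abs(expected) * 100.0
    return deviation_pct <= threshold_pct


# Illustrative values: population in millions, a hypothetical 3.5% annual
# growth rate, and a threshold of 2 as in populationGrowth ( 1 16 2 ).
print(population_growth_ok(13.9, 14.4, 3.5, 2))
print(population_growth_ok(13.9, 16.0, 3.5, 2))
```

Under these assumptions the first call passes, because 14.4 is within 2% of the projected 14.39, while the second fails and would fire the warning message of the rule.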
Boolean greater (indicator1, indicator2):
  o used to validate that indicator1 is greater than indicator2
  o has two parameters
      indicator1: indicator code
      indicator2: indicator code
  o validation function
  o indicator based

Boolean growth (indicator, rate):
  o used to validate that the value of the indicator grows from year to year by at least the rate provided
  o has two parameters
      indicator: code of the indicator under validation
      rate: the expected growth rate
  o validation function
  o indicator based

Boolean markOutliers ():
  o searches the data for outliers (uses the three sigma method)
  o no parameter
  o validation function
  o generic function

Boolean missingValues (rate):
  o used to validate that the number of missing values of an indicator is acceptable
  o has one parameter
      rate: control parameter that defines the acceptable percentage of missing values
  o validation function
  o generic function

Note that the above list presents only the functions that end users can use to formulate new rules. The system implements many more helper functions for activities such as data loading, data cleansing, data formatting, data saving and the user interface.

5 Step by Step Validation Process

- Prepare the data to be validated. Format it as required by the Data Validation Toolkit, following the guidelines in Section 3.3 above.
- Edit the rules file to suit the current validation requirements, if needed. Section 3.4 describes the modifications a user can make in the rules repository.
- Prepare the output folder where the validation reports will be saved.
- Run the Data Validation Toolkit by double-clicking its icon on the desktop or by selecting the program from the list of Programs.
- The system first opens a command window, as shown below. This window controls the kernel of the Data Validation Toolkit and shows that the toolkit is running.
  Once this window is closed, the toolkit stops functioning, even though the toolkit screen is still showing in the web browser. (Please note that the content of the window may vary depending on the installation environment.)

Figure 2. Command window

- After displaying the command window, the system automatically opens the default browser and runs the toolkit. In a typical browser, the toolkit page looks like the following figure.

Figure 3. Validation page

- Like any web-based application, the toolkit is accessed at a specific URL, shown at the top of the screen: 127.0.0.1:7373. If the toolkit page is closed accidentally, opening a browser and entering this URL reopens the page.
- Click the Data Files button to select the data to be validated. Multiple files can be selected at a time for validation.
  o The validation process loops through the data many times, so it can take a long time. It is therefore recommended to validate data files separately if no inter-file validation is required.
  o Inter-file validation refers to the comparison of indicator values in different files, such as comparing the population size of two countries.
- Click the Rule File button to select the rules repository to be applied in the data validation process.
- Click the Working Folder button to specify the output folder where the report files will be saved.
- All the selected files and folders are displayed in the display fields below the respective buttons.
- If something is wrong with the selection of files and folders, click the respective button again to correct the selection.
- Once the selection is as expected, click the Load button to load the selected files into the system.
  o Depending on the size of the files, the loading process might take a while.
    This is because the data loading module performs various data cleansing and rearranging activities.
  o The border lines of the display field are dimmed to show that the loading process is running.
  o Once loading is finished, the system displays the text “Data Successfully Loaded” in the display field below the button.
- Click the Validate button to start the validation process.
  o The border lines of the display field are dimmed to show that the validation process is running.
  o Once validation is completed, the toolkit displays the status of the validation and the output files in the display field below the button.
  o The output file names are time-stamped so that earlier results are not replaced.

6 Output Files

There are two output files, in .txt and .xlsx format. The text file contains only the warning messages, each with reference to the specific indicator, year and country code. The Excel file presents the data as it is and marks warning cells with colors and comments. Comments can be viewed by hovering the mouse over a colored cell. Both files have the same name except for the filename extension. The filenames are:

Text file:  valid_result ddmmyyyy_HHMMSS.txt
Excel file: valid_result ddmmyyyy_HHMMSS.xlsx

The file name is time-stamped to avoid overwriting the result of a previous validation run. ddmmyyyy_HHMMSS represents the time when the validation run started, where:

dd: day
mm: month
yyyy: year
HH: hour
MM: minute
SS: second

Figure 4. Extract of validation result output (text file)

The following figure shows an extract of the validation output in Excel format. The Excel file is a reproduction of the validated data file with warning colors (as specified in the rules repository) and comments. In addition, the output file inserts a column (Column A) to track the data during processing.

Figure 5. Extract of validation output (Excel file)

7 Annex I: Indicators coding list

8 Annex II: R Colors

Please refer to the next page for R colors.