Users manual - Validation Toolkit

USER’S MANUAL
STATISTICAL DATA VALIDATION TOOLKIT
June 2015
1 | Data Validation Toolkit
African Centre for Statistics
Table of Contents
1 Introduction to the Statistical Data Validation Toolkit
2 Programming Platform Selection
3 Data Requirement
3.1 Indicator Coding
3.2 Country Coding
3.3 Input Data Format
3.4 Rules Data Format
3.5 Creating New Rule
4 Solution Architecture
4.1 Architecture Diagram
4.2 Functions
5 Step by Step Validation Process
6 Output Files
7 Annex I: Indicators coding list
8 Annex II: R Colors
1 Introduction to the Statistical Data Validation Toolkit
The African Centre for Statistics (ACS) developed software to perform automatic validation of
statistical data. The toolkit is a computerized application that validates datasets compiled from
national and other sources against predefined rules and scientific standards. It is designed to
identify inconsistencies in the datasets and report them to users, so that users can take
measures to correct them.
2 Programming Platform Selection
After an in-depth analysis of several programming platforms, the toolkit was developed in R.
R is a programming language and software environment for statistical computing. It is an
integrated suite of software facilities specifically geared to statistical data analysis and
manipulation, calculation and graphics display.
R was selected mainly for the following reasons:
• it has an effective data storage and access facility which can handle bulk statistical data
• it is equipped with a large, integrated collection of statistical calculation and data
manipulation modules
• it has built-in graphics facilities for statistical data presentation
• it is built on easy-to-use, intuitive commands, which facilitates and expedites the process of
software development, and
• ACS staff have expertise in R programming.
3 Data Requirement
3.1 Indicator Coding
The validation algorithm is based on the standard data collection questionnaire developed by ACS.
To facilitate programming and enhance the performance of the algorithm, every indicator is assigned
a specific code. Indicators in the toolkit are accessed through their respective codes.
The codes assigned to the indicators in the data files affect the rules repository, because the
rules are written against the defined coding scheme.
The complete set of indicator codes is attached in Annex I: Indicators coding list.
3.2 Country Coding
To avoid errors caused by misspelled country names, codes are assigned to countries. For this
purpose the ISO ALPHA-3 country codes, as adopted by the UN Statistics Division, are used. The
codes of African countries are listed below for reference.
Table 1. Country code
Ser. No. | Country | Numerical | ISO ALPHA-3 code
1 | Algeria | 12 | DZA
2 | Angola | 24 | AGO
3 | Benin | 204 | BEN
4 | Botswana | 72 | BWA
5 | Burkina Faso | 854 | BFA
6 | Burundi | 108 | BDI
7 | Cabo Verde | 132 | CPV
8 | Cameroon | 120 | CMR
9 | Central African Republic | 140 | CAF
10 | Chad | 148 | TCD
11 | Comoros | 174 | COM
12 | Congo | 178 | COG
13 | Côte d'Ivoire | 384 | CIV
14 | Democratic Republic of the Congo | 180 | COD
15 | Djibouti | 262 | DJI
16 | Egypt | 818 | EGY
17 | Equatorial Guinea | 226 | GNQ
18 | Eritrea | 232 | ERI
19 | Ethiopia | 231 | ETH
20 | Gabon | 266 | GAB
21 | Gambia | 270 | GMB
22 | Ghana | 288 | GHA
23 | Guinea | 324 | GIN
24 | Guinea-Bissau | 624 | GNB
25 | Kenya | 404 | KEN
26 | Lesotho | 426 | LSO
27 | Liberia | 430 | LBR
28 | Libya | 434 | LBY
29 | Madagascar | 450 | MDG
30 | Malawi | 454 | MWI
31 | Mali | 466 | MLI
32 | Mauritania | 478 | MRT
33 | Mauritius | 480 | MUS
34 | Morocco | 504 | MAR
35 | Mozambique | 508 | MOZ
36 | Namibia | 516 | NAM
37 | Niger | 562 | NER
38 | Nigeria | 566 | NGA
39 | Rwanda | 646 | RWA
40 | Sao Tome and Principe | 678 | STP
41 | Senegal | 686 | SEN
42 | Seychelles | 690 | SYC
43 | Sierra Leone | 694 | SLE
44 | Somalia | 706 | SOM
45 | South Africa | 710 | ZAF
46 | South Sudan | 728 | SSD
47 | Sudan | 729 | SDN
48 | Swaziland | 748 | SWZ
49 | Togo | 768 | TGO
50 | Tunisia | 788 | TUN
51 | Uganda | 800 | UGA
52 | United Republic of Tanzania | 834 | TZA
53 | Zambia | 894 | ZMB
54 | Zimbabwe | 716 | ZWE
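To illustrate why codes are more robust than country names, the following Python sketch checks a list of cCode values against a small subset of Table 1. It is an illustration only and not part of the toolkit, which is written in R; the names AFRICAN_ISO3 and unknown_country_codes are hypothetical.

```python
# Subset of Table 1 (ISO ALPHA-3 code -> country name); extend with the full table as needed.
AFRICAN_ISO3 = {
    "DZA": "Algeria",
    "AGO": "Angola",
    "ETH": "Ethiopia",
    "GAB": "Gabon",
    "ZWE": "Zimbabwe",
}


def unknown_country_codes(codes):
    """Return the codes (upper-cased, in input order, no duplicates) missing from the lookup."""
    seen = set()
    unknown = []
    for code in codes:
        key = code.strip().upper()
        if key not in AFRICAN_ISO3 and key not in seen:
            seen.add(key)
            unknown.append(key)
    return unknown


# A country name typed instead of a code is reported; case and stray spaces are tolerated here.
print(unknown_country_codes(["AGO", "ago ", "Angola", "ETH"]))
```

Whether the toolkit itself tolerates lower-case codes or surrounding spaces is not stated in this manual; the normalization above is an assumption of the sketch.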
3.3 Input Data Format
The data validation toolkit strictly follows the format of the data collection questionnaire used
for ASYB data collection, with minor modifications. The format is shown below.
Year columns: 2000, 2001, 2002, 2004, 2005, 2006, 2007, 2008, 2009, 2010, 2011, 2012, 2013, 2014
cCode | iCode | Indicators | Values (in year-column order; empty cells not shown)
 |  | Population and Demography |
AGO | 1 | Mid-year population - Total (Million) | 13.9, 14.4, 14.9, 15.4, 16.0, 16.5, 17.1, 17.7, 18.3, 18.9, 19.5, 20.2, 20.8, 21.5
AGO | 2 | Female (%) | 50.7, 50.7, 50.6, 50.6, 50.6, 50.6, 50.5, 50.5, 50.5, 50.5, 50.5, 50.4, 50.4, 50.4
AGO |  | Population by Age Group (000) |
AGO | 5 | 0-14 years | 6 633, 6 869, 7 116, 7 375, 7 639, 7 904, 8 202, 8 496, 8 786, 9 070, 9 346, 9 647, 9 930, 10 199
AGO | 6 | 15-64 years | 6 948, 7 161, 7 402, 7 665, 7 944, 8 234, 8 500, 8 783, 9 081, 9 396, 9 730, 10 046, 10 389, 10 757
AGO | 7 | 65+ years | 344, 356, 368, 381, 394, 406, 420, 434, 447, 461, 473, 487, 501, 516
AGO | 10 | Active Population ('000) | 5 033.1, 5 185.4, 5 334.6, 5 522.6, 5 704.5, 5 870.8, 6 017.7, 6 173.6, 6 365.9, 6 621.1, 6 886.0, 7 131.9, 7 374.0, 7 646.2
AGO | 11 | Female (%) | 48.4, 48.3, 47.8, 47.8, 47.5, 46.8, 46.2, 45.6, 45.3, 45.6, 45.9, 45.9
AGO | 13 | Crude birth rate - Total (per 1000) | 50.5, 50.3, 50.0, 49.8, 49.5, 49.2, 48.8, 48.3, 47.7, 47.1, 46.3, 45.6, 44.8, 44.1
AGO | 18 | Econ. active pop. in agric. (as % of Total) | 86.170, 86.184, 86.549, 86.299, 86.283, 86.495, 86.943, 87.469, 87.324, 86.360, 85.361, 84.634, 83.984, 83.048
AGO | 20 | Prevalence of undernourishment | 47.5, 45.6, 43.4, 40.6, 37.6, 35.1, 33.3, 32.1, 30.7, 29.2, 28, 27.4
AGO |  | Education |
AGO | 23 | Teaching staff at first level - Total (thousands) | 42.31, 43.3, 53.96, 72.92, 73.0, 74.5, 75.6, 79.9, 103.0, 108.3
AGO | 24 | Teaching staff at second level - Total (thousands) | 8.75, 8.1, 25.26, 34.14, 34.6, 30.4, 31.7, 39.9, 19.1, 53.0
AGO | 25 | Teaching staff at third level - Total (thousands) | 5.0, 5.0, 3.8, 5.2, 5.3, 8.5, 8.5, 19.5, 7.7, 25.6
AGO | 26 | First level student enrollment - Total (thousands) | 3 559, 3 930, 4 046, 4 273, 5 027
AGO | 27 | First level student enrollment - Female (thousands) | 1 813, 1 855, 1 912, 1 956
Table 2. Extracts of the data file
Columns:
The first two column headers (cCode and iCode) are critical: the system expects them to be spelled
exactly as shown here. cCode and iCode refer to the country code and the indicator code respectively.
Because the validation in most cases reads the data row by row, the country code is repeated in
each row to simplify the code.
In the above format the first three columns (cCode, iCode and Indicators) are mandatory. The system
does not use the data in the Indicators column, but the column must still be present as a
placeholder. The remaining data columns can vary depending on the available data. However, the
years are expected to be in sequential order.
The toolkit considers the last column as the current year.
Acceptable Format:
The toolkit accepts data from Microsoft Excel files. Its user interface allows as many files as are
available to be selected; the system reads the first worksheet from each file and generates a
consolidated R data frame in memory to assess validity.
The toolkit also expects all the files to have the same number of columns.
An empty cell is considered a missing value. Cells with the value zero (0) are not considered
missing values.
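The format requirements above can be sketched as follows. This is a Python illustration under stated assumptions, not the toolkit's R code: worksheets are assumed to be already read into lists of rows (reading the Excel files is omitted), the function names consolidate and is_missing are hypothetical, and the sample values for ETH are made up.

```python
def is_missing(cell):
    """An empty cell counts as missing; a zero is a real value."""
    return cell is None or (isinstance(cell, str) and cell.strip() == "")


def consolidate(sheets):
    """Merge worksheets (each a list of rows, header row first) into one header and data rows."""
    header = None
    rows = []
    for sheet in sheets:
        if not sheet:
            continue
        sheet_header, data = sheet[0], sheet[1:]
        if list(sheet_header[:2]) != ["cCode", "iCode"]:
            raise ValueError("first two headers must be spelled cCode and iCode")
        if header is None:
            header = list(sheet_header)
        elif len(sheet_header) != len(header):
            raise ValueError("all files must have the same number of columns")
        rows.extend(data)
    return header, rows


sheet_a = [["cCode", "iCode", "Indicators", 2012, 2013],
           ["AGO", 1, "Mid-year population - Total (Million)", 20.8, 21.5]]
sheet_b = [["cCode", "iCode", "Indicators", 2012, 2013],
           ["ETH", 1, "Mid-year population - Total (Million)", None, 0]]  # made-up cells

header, rows = consolidate([sheet_a, sheet_b])
current_year = header[-1]  # the last column is treated as the current year
missing = [(row[0], header[i])
           for row in rows
           for i in range(3, len(header))
           if is_missing(row[i])]
print(current_year, missing)
```

Only the empty ETH cell for 2012 is reported as missing; the zero in the 2013 column is kept as a value.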
3.4 Rules Data Format
Table 3. Rules format
iCode | rName | Warning message | Color | active | Function (one token per cell)
 | Missing Values | Significant number of missing values for this indicator | cyan4 | Y | missingValues | ( | 30% | )
 | Outlier | Potential outlier | Red | Y | markOutliers | ( | )
 | Compare Countries | Gabon is greater than Ethiopia | Cadetblue | Y | isCountryGreater | ( | 19 | 20 | 26 | 68 | 2 | 39 | )
26 | Enrollment growth Primary education | Enrollment in primary education is decreasing | darkgoldenrod | Y | Growth | ( | 26 | 2% | )
27 | Enrollment growth Primary education female | Enrollment in primary education (female) is not growing | darkgoldenrod | Y | Growth | ( | 27 | 2% | )
68 | GNI per capita | GNI per capita is not growing | darkgoldenrod | Y | Growth | ( | 68 | 2% | )
71 | Richest income | The Richest income is less than the poorest income | Cadetblue | Y | Greater | ( | 71 | 72 | )
34 | Literacy rate – male | Male literacy rate is out of range | deepskyblue4 | Y | minMax | ( | 34 | 0 | 100 | )
35 | Literacy rate – female | Female literacy rate is out of range | deepskyblue4 | Y | minMax | ( | 35 | 0 | 100 | )
1 | Population | Suspicious population growth | cyan4 | Y | Growth | ( | 1 | 1% | )
1 | Population growth | Population size does not match with the estimate (population growth rate) | Chocolate | Y | populationGrowth | ( | 1 | 16 | 2 | )
2 | percentage of female | Unacceptable percentage of population (female) | Grey | Y | minMax | ( | 2 | 40 | 60 | )
1 | population total | Population size is different from sum of age groups | Brown | Y | Equality | ( | getSumOfIndicatorValues | ( | 5 | 6 | 7 | ) | getIndicatorValue | ( | 1 | ) | )
3 | Urban population percentage | Unacceptable percentage of population (urban) | Grey | Y | minMax | ( | 3 | 5 | 80 | )
5 | Population growth age 1 | Suspicious population growth age 0 - 14 | cyan4 | Y | Growth | ( | 5 | 1% | )
6 | Population growth age 2 | Suspicious population growth age 15-64 | cyan4 | Y | Growth | ( | 5 | 1% | )
7 | Population growth age 3 | Suspicious population growth age 65+ | cyan4 | Y | Growth | ( | 5 | 1% | )
10 | Active Population | Active population size is greater than total population size | Cadetblue | Y | Greater | ( | 1 | 10 | )
11 | Active female | Active female population out of range | Grey | Y | minMax | ( | 11 | 20 | 70 | )
13 | Birth rate greater than death rate | Death rate is greater than birth rate | Cadetblue | Y | Greater | ( | 13 | 14 | )
14 | Crude death rate | Crude death rate out of range | deepskyblue4 | Y | minMax | ( | 13 | 0 | 50 | )
15 | Fertility rate | Fertility rate out of range | deepskyblue4 | Y | minMax | ( | 15 | 0 | 10 | )
17 | Dependency ratio | Dependency ratio out of range | deepskyblue4 | Y | minMax | ( | 17 | 0 | 100 | )
Columns:
iCode: The first column holds the indicator code. There are two types of rules: generic and
indicator based. Generic rules apply to all indicators, whereas the toolkit applies
indicator-based rules only when it encounters the specified indicator in the data file.
In the above table, rules such as Missing Values and Outlier are examples of generic rules. The
system applies these rules to every cell in the data file.
Other rules in the above table, such as GNI per capita and Richest income, are indicator based.
The system applies such rules only to the rows of the specified indicator in the data file.
In the rules file, if the iCode column is empty the rule is generic; otherwise the rule is
indicator based and specific to the given indicator.
rName: This column gives the rule a descriptive name for anyone reading the rule repository. The
toolkit does not use this column in its validation process.
Warning message: This column stores the warning message to be displayed in the report files if
the data fails to fulfil the rule. It warns the user that a specific cell or row of data has
validation issues.
Color: This column specifies the background color used to mark invalid cells in the Excel report
file. If a cell is invalid as per the rule, the system reads the color column and changes the
background of the cell to the specified color in the Excel report file. A complete list of
acceptable colors is attached in Annex II: R Colors.
Active: This column specifies the status of the rule, which can be either ACTIVE or INACTIVE. To
activate a rule put Y in this column; leave the column empty to deactivate the rule. The toolkit
considers only active rules, so users can deactivate rules they want to disable in a specific
validation run.
Function: Function and the rest of the columns are used to define the actual rule. Rules are
constructed from the functions defined in Section 4.2 following a predefined syntax. The following
section explains how a rule can be constructed from those functions.
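The roles of the iCode and Active columns described above can be sketched in Python as follows. This is a hedged illustration, not the toolkit's R implementation: the dictionary layout, the "tokens" key and the function name split_rules are assumptions made for the example, and it assumes the Active cell must contain exactly Y.

```python
def split_rules(rule_rows):
    """Keep active rules only and split them into generic and indicator-based groups."""
    generic = []
    by_indicator = {}
    for rule in rule_rows:
        if str(rule.get("active") or "").strip() != "Y":
            continue  # an empty Active cell deactivates the rule
        icode = rule.get("iCode")
        if icode is None or str(icode).strip() == "":
            generic.append(rule)  # empty iCode: the rule applies to all indicators
        else:
            by_indicator.setdefault(int(icode), []).append(rule)
    return generic, by_indicator


rules = [
    {"iCode": None, "rName": "Outlier", "active": "Y",
     "tokens": ["markOutliers", "(", ")"]},
    {"iCode": 34, "rName": "Literacy rate - male", "active": "Y",
     "tokens": ["minMax", "(", "34", "0", "100", ")"]},
    {"iCode": 68, "rName": "GNI per capita", "active": "",
     "tokens": ["Growth", "(", "68", "2%", ")"]},
]

generic, by_indicator = split_rules(rules)
print([r["rName"] for r in generic], sorted(by_indicator))
```

In this sample the GNI per capita rule is skipped because its Active cell is empty, which mirrors how a user would disable a rule for one validation run.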
3.5 Creating New Rule
Users can create new rules that suit the requirements of their data without writing a single line
of R code. The toolkit is designed so that novice R users can create powerful validation rules in
an intuitive manner.
New rules can be created only in the rules repository file. A user can insert a new rule into an
existing rules file or create a new file with any preferred filename. However, the rules file has
to follow the format specified in Section 3.4 above, including the mandatory columns and column
names.
Once the format is in place, new rules can be inserted in the same way as displayed in Table 3.
This includes inserting the following values for a rule:
• Indicator code under the iCode column, if the rule depends on an indicator code
• Rule name under the rName column, to give a descriptive name to the rule
• Warning message under the Warning message column; this is the message to be displayed in the
output files
• Color in the Color column; this is the color to be displayed as background in the Excel output
file. Use predefined color names as specified in Annex II: R Colors
• Status of the rule in the Active column
The next columns present the function(s) that form the logic of the validation rule.
Constructing the validating function
A validation rule is constructed from the predefined functions in Section 4.2. The whole construct
of a validation function instructs the system to work on a specific portion of the data to be
validated and returns either TRUE or FALSE. The toolkit decides on the validity of that portion of
data depending on the response of the validation function: TRUE means the data is valid, and FALSE
fires the warning message of the rule under consideration.
All functions have a similar format. The major components of a function are:
• Name – all functions have a name
• Opening parenthesis ‘(‘ – in all functions the name is followed by an opening parenthesis
• Parameters – most functions have parameters. The allowed type and number of parameters of the
predefined functions are specified in Section 4.2 below. Parameters can also be functions. As
specified in Section 4.2, some functions have no parameters, some have one, and others have
more than one.
• Closing parenthesis ‘)’ – once the parameters are listed, a closing parenthesis is required so
that the system knows that the list of parameters is complete.
The opening and closing parentheses must be present even when the function has no parameters.
Each token (component) of the function must be put in its own cell, and there must not be an
empty cell in between.
Example Rules
Example 1: Mark outliers
iCode | rName | Warning Message | color | active | Function
 | Outlier | Potential outlier | red | Y | markOutliers | ( | )
This is a very simple rule with several characteristics worth noting. First of all, it is a generic
rule which does not depend on any specific indicator; that is why the iCode column is empty. Note
that it has a name, a warning message and a color, and that it is active. The function is
markOutliers, which has no parameters. Also note how the opening and closing parentheses are
presented.
Example 2: Range or minMax
iCode | rName | Warning Message | color | active | Function
32 | Public expenditure on education | Percentage of public expenditure on education is out of range | deepskyblue4 | Y | minMax | ( | 32 | 5 | 50 | )
The above rule depends on the indicator ‘public expenditure on education’, whose code is 32. The
function is minMax, which has three parameters: the indicator code, followed by the minimum allowed
value, followed by the maximum allowed value. Hence, while running the validation with this rule,
the toolkit picks the value of ‘public expenditure on education’ and checks whether it is between 5
and 50. If the value is out of range, the toolkit sends the warning message to the output files.
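The range check in Example 2 can be sketched in Python as follows. This is an illustration rather than the toolkit's R code: the bounds are assumed to be inclusive, empty cells are assumed to be skipped, the helper names min_max and check_row are hypothetical, and the values in row_32 are made up.

```python
WARNING = "Percentage of public expenditure on education is out of range"


def min_max(value, this_min, this_max):
    """Return True when value lies within [this_min, this_max] (inclusive bounds assumed)."""
    return this_min <= value <= this_max


def check_row(values_by_year, this_min, this_max):
    """Return the (year, value) pairs that fail the range check; empty cells (None) are skipped."""
    return [(year, value)
            for year, value in values_by_year.items()
            if value is not None and not min_max(value, this_min, this_max)]


row_32 = {2011: 12.5, 2012: 4.0, 2013: None, 2014: 61.2}  # made-up values for indicator 32
for year, value in check_row(row_32, 5, 50):
    print(f"{year}: {value} - {WARNING}")
```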
Example 3: Equality
iCode | rName | Warning Message | color | active | Function
1 | population total | Population size is different from sum of age groups | brown | Y | equality | ( | getSumOfIndicatorValues | ( | 5 | 6 | 7 | ) | getIndicatorValue | ( | 1 | ) | )
The above rule is more complex. It is a typical example of how more powerful rules can be designed
by combining multiple functions. The equality function validates that the total population size
equals the sum of the population sizes of the three age groups.
The function equality has two parameters. In this case the parameters are computed by two
functions, namely getSumOfIndicatorValues and getIndicatorValue. The function
getSumOfIndicatorValues has a variable number of parameters; in this case it has three: 5, 6 and 7
are the codes of the indicators whose values are summed. The function getIndicatorValue has one
parameter: it takes an indicator code and returns its value. Once these two functions return their
values as per the data file, the equality function checks whether the two values are equal.
Please note how the opening and closing parentheses are used.
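To show how a rule laid out one token per cell can be read as nested function calls, here is a minimal Python sketch of a recursive parser and evaluator for the rule in Example 3. It is an assumption-laden illustration, not the toolkit's R rule parser: the names parse and evaluate are hypothetical, only the three functions used in this example are supported, and the sample values are expressed in a single unit (thousands) so that they can be compared directly.

```python
def parse(tokens, pos=0):
    """Parse one expression starting at tokens[pos]; return (node, next_position)."""
    token = tokens[pos]
    if pos + 1 < len(tokens) and tokens[pos + 1] == "(":
        args = []
        pos += 2  # skip the function name and the opening parenthesis
        while pos < len(tokens) and tokens[pos] != ")":
            arg, pos = parse(tokens, pos)
            args.append(arg)
        if pos >= len(tokens):
            raise ValueError("missing closing parenthesis")
        return (token, args), pos + 1
    return float(token), pos + 1  # a plain number, e.g. an indicator code


def evaluate(node, row_values):
    """Evaluate a parsed node against a mapping of indicator code -> value."""
    if not isinstance(node, tuple):
        return node
    name, args = node
    values = [evaluate(arg, row_values) for arg in args]
    if name == "getIndicatorValue":
        return row_values[int(values[0])]
    if name == "getSumOfIndicatorValues":
        return sum(row_values[int(code)] for code in values)
    if name == "equality":
        return values[0] == values[1]
    raise ValueError(f"unknown function: {name}")


tokens = ["equality", "(", "getSumOfIndicatorValues", "(", "5", "6", "7", ")",
          "getIndicatorValue", "(", "1", ")", ")"]
tree, end = parse(tokens)

consistent = {1: 13925, 5: 6633, 6: 6948, 7: 344}    # total equals the sum of age groups
inconsistent = {1: 14000, 5: 6633, 6: 6948, 7: 344}  # total does not match
print(evaluate(tree, consistent), evaluate(tree, inconsistent))
```

The second data set makes the rule return FALSE, which is the case in which the toolkit would fire the warning message of the rule.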
4 Solution Architecture
4.1 Architecture Diagram
[Figure: architecture diagram with the components Data, Rules Repository, Data Loader, Rule Parser, User, User Interface, Data Validation Engine and Validation Report]
Figure 1. Overall architecture of the validation toolkit.
The overall architecture of the software is depicted in Figure 1 above. The toolkit expects two
types of input files, namely the data file and the rules repository. After validation, two output
files (.xlsx and .txt) are generated to report the result.
4.2 Functions
The toolkit has predefined functions, each of which executes an atomic data extraction or
validation task. Rules are constructed from these functions.
There are two types of functions, depending on their return value.
1) Validation functions: return Boolean values (TRUE or FALSE). These functions determine the
validity of a specific cell value in the data file. Validation functions can be further classified
into two categories:
a. Indicator based: functions which are executed when the toolkit encounters the indicator
defined in the iCode column of the specific rule
b. Generic: functions which are executed on any data cell
2) Helper functions: return a data value by extracting data or applying some statistical
computation. Such functions are used to extract and compute data and feed the result to the
validation functions to determine validity.
Below is the list of all functions defined in the toolkit.
• Numeric getSumOfIndicatorValues (indicatorList):
o used to sum the values of indicators
o has a variable number of parameters; all parameters should be indicator codes
o returns the sum of the values of the indicators in the argument list
o helper function
• Boolean isCountryGreater (countryG, countryL, indicatorList):
o used to compare indicator values of two countries
o has a variable number of parameters. The first two parameters are country codes, followed by
the indicators to be compared. Note that the first country is expected to have the greater
value for the indicators listed in the parameters
o validation function
o generic function
• Boolean minMax (indicator, thisMin, thisMax):
o used to validate whether an indicator value falls within a range
o has three parameters
▪ indicator – indicator code to be validated
▪ thisMin – minimum allowed value of the range
▪ thisMax – maximum allowed value of the range
o validation function
o indicator based function
• Numeric getIndicatorByYear (indicator, year):
o used to retrieve the indicator value of a given year
o has two parameters
▪ indicator – code of the indicator whose value is to be retrieved
▪ year – year of the value to be retrieved
o helper function
• Boolean equality (x, y):
o validates the equality of two values x and y
o has two parameters, both numeric
o validation function
o indicator based
• Boolean populationGrowth (baseIndicator, growthIndicator, threshold):
o used to validate the growth of an indicator value against a growth rate indicator. A typical
example is validating whether the population is growing as expected, considering the
population size of the previous year and the average annual population growth rate
o has three parameters
▪ baseIndicator – indicator whose growth is validated
▪ growthIndicator – indicator which measures the growth rate of baseIndicator
▪ threshold – controls how closely the reported value is expected to match the calculation
o validation function
o indicator based
• Boolean greater (indicator1, indicator2):
o used to validate whether indicator1 is greater than indicator2
o has two parameters
▪ indicator1 – indicator code
▪ indicator2 – indicator code
o validation function
o indicator based
• Boolean growth (indicator, rate):
o used to validate whether the value of the indicator grows from year to year by at least the
rate provided
o has two parameters
▪ indicator – code of the indicator under validation
▪ rate – the expected growth rate
o validation function
o indicator based
• Boolean markOutliers ():
o searches the data for outliers (uses the three sigma method)
o no parameters
o validation function
o generic function
• Boolean missingValues (rate):
o used to validate whether the number of missing values of an indicator is acceptable
o has one parameter
▪ rate – control parameter defining the acceptable percentage of missing values
o validation function
o generic function
Note that the above list presents only those functions which the end user can use to formulate
new rules. The system implements many more helper functions to perform activities such as data
loading, data cleansing, data formatting, data saving and the user interface.
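Three of the validation functions listed above can be sketched in Python as follows. These are simplified illustrations under stated assumptions, not the toolkit's R implementations: rates are passed as fractions (0.01 for 1%), the three sigma check uses the population standard deviation, each function works on a plain list of yearly values, and the function names are hypothetical Python spellings.

```python
from statistics import mean, pstdev


def growth(values, rate):
    """True when every year-to-year increase is at least `rate` (e.g. 0.01 for 1%)."""
    return all(cur >= prev * (1 + rate) for prev, cur in zip(values, values[1:]))


def mark_outliers(values, sigmas=3):
    """Return the positions of values more than `sigmas` standard deviations from the mean."""
    mu, sd = mean(values), pstdev(values)
    if sd == 0:
        return []
    return [i for i, v in enumerate(values) if abs(v - mu) > sigmas * sd]


def missing_values(cells, rate):
    """True when the share of empty cells (None) does not exceed `rate`."""
    if not cells:
        return True
    return sum(cell is None for cell in cells) / len(cells) <= rate


population = [13.9, 14.4, 14.9, 15.4]  # first values of indicator 1 in Table 2
print(growth(population, 0.01), growth(population, 0.05))
print(mark_outliers([10] * 19 + [100]))
print(missing_values([1, 2, 3, None], 0.30), missing_values([1, None, 3, None], 0.30))
```

Note that with the three sigma method a single extreme value can only be flagged when the series is long enough; in short series the extreme value inflates the standard deviation too much to exceed three sigma.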
5 Step by Step Validation Process
• Prepare the data to be validated. Format the data as required by the Data Validation Toolkit,
following the guidelines outlined in Section 3.3 above.
• Edit the rules file to suit the current validation requirements, if needed. Section 3.4
presents the modifications a user can make in the rules repository.
• Prepare the output folder where the validation reports will be saved.
• Run the Data Validation Toolkit by double-clicking its icon on the desktop or by selecting the
program from the list of Programs.
• The system first opens a command window as shown below. This window controls the kernel of the
Data Validation Toolkit and shows that the Toolkit is running. Once this window is closed the
toolkit stops functioning, even though the toolkit screen is still showing in the web browser.
(Please note that the content of the window might vary depending on the installation
environment.)
Figure 2. Command window
• After displaying the command window, the system automatically opens the default browser and
runs the Toolkit. In a typical browser the toolkit page looks like the following figure.
Figure 3. Validation page
• Like any web-based application, the Toolkit can be accessed at a specific URL, shown at the top
of the screen: 127.0.0.1:7373
• If the toolkit page is closed accidentally, opening a browser and entering the above URL
reopens the page.
• Click the Data Files button to select the data to be validated. Multiple files can be selected
at a time for validation.
o Note that because the validation process loops through the data many times, it can take a
long time. Hence, it is recommended to validate data files separately if no inter-file
validation is required.
o Inter-file validation refers to the comparison of indicator values in different files, such
as comparing the population size of two countries.
• Click the Rule File button to select the rules repository to be applied in the data validation
process.
• Click the Working Folder button to specify the output folder where the report files will be
saved.
• All the selected files and folders are displayed in the display fields below the respective
buttons.
• If there is something wrong with the selection of files and folders, click the respective
buttons again to correct the selection.
• Once the selection is as expected, click the Load button to load the selected files into the
system.
o Depending on the size of the files, the loading process might take a while, because the data
loading module performs various data cleansing and rearranging activities.
o The border lines of the display field are dimmed to show that the loading process is running.
o Once the loading is finished, the system displays the text “Data Successfully Loaded” in the
display field below the button.
• Click the Validate button to start the validation process.
o The border lines of the display field are dimmed to show that the validation process is
running.
o Once validation is completed, the Toolkit displays the status of the validation and the
output files in the display field below the button.
o The output file names are time stamped in order to avoid overwriting earlier results.
6 Output Files
There are two output files, in .txt and .xlsx format. The text file lists only the warning
messages, with reference to the specific indicator, year and country code, whereas the Excel file
presents the data as it is and marks the cells that have warnings with colors and comments.
Comments can be viewed by hovering the mouse over a colored cell.
Both files have the same name except for the filename extension. The filenames are:
Text file: valid_result ddmmyyyy_HHMMSS.txt
Excel file: valid_result ddmmyyyy_HHMMSS.xlsx
The file name is time stamped to avoid overwriting the result of a previous validation run. Here
ddmmyyyy_HHMMSS represents the time when that specific validation run started, where:
dd – day
mm – month
yyyy – year
HH – hour
MM – minute
SS – second
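The naming scheme above can be reproduced with a few lines of Python. This is only an illustration of the timestamp pattern, not the toolkit's R code; the function name output_names is hypothetical, and the single space between valid_result and the timestamp follows the filenames as printed above.

```python
from datetime import datetime


def output_names(started):
    """Build the .txt and .xlsx report names for a validation run that began at `started`."""
    stamp = started.strftime("%d%m%Y_%H%M%S")  # ddmmyyyy_HHMMSS
    base = f"valid_result {stamp}"
    return base + ".txt", base + ".xlsx"


# A run started on 3 June 2015 at 14:05:09.
print(output_names(datetime(2015, 6, 3, 14, 5, 9)))
```

Because the timestamp has a resolution of one second, two runs started in different seconds never share a file name.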
Figure 4. Extract of validation result output – text file
The following figure shows an extract of the validation output in Excel format. The Excel output
is a reproduction of the validated data file with warning colors (as specified in the rules
repository) and comments. In addition, the output file inserts a column (Column A) to track the
data while processing.
Figure 5. Extract of validation output – excel file
7 Annex I: Indicators coding list
8 Annex II: R Colors
Please refer to the next page for R colors.