UIS Data Transformation and Validations

advertisement
UIS Data Transformation and
Validations
As it pertains to the SDMX TWG EXL
Initiative
Gathering Data
• Each data point to be collected is described with
dimensions prior to collection
• Unique identifier is assigned to each data
point/dimensional grouping
• Data is collected via surveys
• Data is inserted into the database by
country/year for each survey returned
• Data goes through a cleaning process that
involves both human and automated validation
(ERS)
Data Encoding
EMC_ID
ACTIVE
EC_PRIO
EC_UNIT
EC_SECTO
EC_PRGDS
EC_GRADE
EC_AGE
EC_FIELD
EC_FORGN
EC_ISCED
EC_SEX
EC_PRGDU
EC_DGPOS
EC_PRGLO
EC_ADULT
20062
1
2
210
100000
100000
100000
100000
10000000
100000
10
100000
100000
100000
100000
10
EMC_ID: Internal unique identifier used to store data. Each EMC_ID
summarizes a set of dimension for data that we collect.
In this case, the data point refers to ENROLLMENT (EC_UNIT=210) in
ISCED 1 (EC_ISCED = 10). Labels for each dimensional value are stored
in separate dimension tables.
For a more human legible format, each EMC_ID used in indicator
definitions is also given an alphanumeric code that summarizes the
dimensions. In this case, “E.1” is used, for ENROLLMENT in ISCED 1.
Raw Data Validation (ERS)
• Database (T-SQL) implementation.
• Stored procedures and reporting services
• Based on CONCEPTS
Example:
Concept: Redundant Data Check
Description: UIS Surveys often have cells that are redundant in order to verify that the
value entered in one cell is accurate and not the victim of a human input error
Purpose: Verify that one cell equals another, redundant, cell
Method: Validates that a specific “MASTER” cell is equal to any other redundant cell.
Redundant cells are identified by having all dimensional values equal to the master cell
with the exception of the PRIORITY dimension.
Preparing Indicators (transformations)
• Indicators are encoded in XML using extended
MathML
• Resulting XML file can render in a friendly
manner in any browser, providing immediate
documentation
• Indicator XML file is “parsed” to convert the XML
into database records
• Indicator definitions are validated when parsed to
ensure completeness as well as the existence of
any needed indicators
Indicator Definition
Indicators are defined using MathML, with custom tags implemented by the UIS.
<indicator name="GAP">
<label>
<en>Graduation age population</en>
<fr>Population d age de graduation</fr>
</label>
<formulas>
<formula>
<roll wildcard="isc" list="1,2.GPV,2"/>
<roll wildcard="sex" list="F,T"/>
<suffix>(isc).(sex)</suffix>
<offset wildcard="age" low="Ag1" up="Ag25">
<sum>
<d>thAge.(isc)</d>
<d>thDur.(isc)</d>
</sum>
<c>1</c>
<d src="POP" >P.(age).(sex)</d>
</offset>
<synonym wildcard="isc" use="2.A.GPV" for="2.GPV" />
<synonym wildcard="isc" use="2.A.GPV" for="2" />
</formula>
</formulas>
</indicator>
Indicator Definition (cont.)
• When loaded into a MathML enabled browser, the
indicator definition becomes human readable and self
documenting.
• Rendering the XML in a browser also helps to validate that
the XML indicator specification is well formed.
Indicator Definition (cont.)
• A parser is then used to convert the XML
indicator specification to a database structure
for use in processing the transformations
IndicCode
GAP.1
GAP.1
GAP.1
GAP.1
GAP.1
GAP.1
GAP.1
GAP.1
GAP.1
GAP.1
GAP.1
GAP.1
GAP.1
GAP.1
GAP.1
GAP.1
GAP.1
GAP.1
GAP.1
GAP.1
GAP.1
GAP.1
GAP.1
GAP.1
GAP.1
GAP.1
GAP.1
GAP.1
GAP.1
GAP.1
Term
Parent
1
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
1002
1003
1004
1005
Action
0 offset
1d
1d
1d
1d
1d
1d
1d
1d
1d
1d
1d
1d
1d
1d
1d
1d
1d
1d
1d
1d
1d
1d
1d
1d
1d
1 sum
1002 d
1002 d
1c
parentAction
root
offset
offset
offset
offset
offset
offset
offset
offset
offset
offset
offset
offset
offset
offset
offset
offset
offset
offset
offset
offset
offset
offset
offset
offset
offset
offset
sum
sum
offset
Nterms
Value
Sequence Source
27
1
0 P.Ag1
3 POP
0 P.Ag2
4 POP
0 P.Ag3
5 POP
0 P.Ag4
6 POP
0 P.Ag5
7 POP
0 P.Ag6
8 POP
0 P.Ag7
9 POP
0 P.Ag8
10 POP
0 P.Ag9
11 POP
0 P.Ag10
12 POP
0 P.Ag11
13 POP
0 P.Ag12
14 POP
0 P.Ag13
15 POP
0 P.Ag14
16 POP
0 P.Ag15
17 POP
0 P.Ag16
18 POP
0 P.Ag17
19 POP
0 P.Ag18
20 POP
0 P.Ag19
21 POP
0 P.Ag20
22 POP
0 P.Ag21
23 POP
0 P.Ag22
24 POP
0 P.Ag23
25 POP
0 P.Ag24
26 POP
0 P.Ag25
27 POP
2
1
0 thAge.1
1 EDU
0 thDur.1
2 EDU
0
1
2
calcIndic
• Seasoned for 7 years
• Currently on 4th version
• Entirely developed using database stored
procedures and T-SQL
• Leverages well seasoned database
functionality
• Data, indicator definitions and transformation
code all in a single database. Fast.
calcIndic (part 2)
- Indicator definitions are read
- Each <d> (data) or <i> (indicator) tag is resolved
by joining the required data point to the indicator
definition for each country and year involved in
the transformation
- The steps for performing the calculation are
performed based on the indicator definition
- Data is written to domain-specific tables
- Indicator validations are performed and
problematic results are flagged. The reasons for
each flag are logged to permit easy auditing.
User Defined Indicator Validation
(DIVA)
(in development)
• XML based. Validation rules for a particular
indicator are defined alongside the indicator
definition.
• MathML based with extended custom tags
• Validation process is SQL based
• As with the indicator definition, browser
plugin makes the XML definition selfdocumenting
User Defined Indicator Validation
(DIVA)
<diva>
(in development)
<!-- relative change is not greater than 10< -->
<formula cid="12" cids="1" range="0.1">
<roll wildcard="sex" list="M,F,T" />
<roll wildcard="isc" list="0,1,23,4,56" />
<suffix>(isc).(sex)</suffix>
<test>
<i>SAP.(isc).(sex)</i>
<i relyear="-1">SAP.(isc).(sex)</i>
<filter>
<i>SAP.(isc)</i><i>SAP.(isc).M</i><i>SAP.(isc).F</i>
</filter>
</test>
</formula>
</diva>
Dealing with missing/special data
• Both ERS and calcIndic allow for special
processing of missing data
• Rules coding allow for custom treatment of
special data
• Normal rule for formulas: “Special data”
properties are viral. If you add a list of numbers
together, and one value is “missing”, the sum will
be “missing”.
• Normal rule for comparisons: Special data is only
equal to similar special data (missing = missing).
Dealing with missing/special data
(cont.)
• Specifying alternate processing rules possible
on a case-by-case basis.
• When defining an indicator, each data point
can have a rule specified to enable an
alternate way of dealing with special data
• When defining a validation concept in ERS,
each concept can have an alternate rule
specified for comparisons
ERS: Example of special data rules for
comparisons
Default Comparison
EQUAL 1 - Direct comparison between 2 cells
Result
Master
missing
inclusion
missing
TRUE
FALSE
inclusion
FALSE
TRUE
nil
FALSE
FALSE
not applicable
FALSE
FALSE
value
FALSE
FALSE
nil
FALSE
FALSE
TRUE
FALSE
FALSE
not applicable
FALSE
FALSE
FALSE
TRUE
FALSE
value
FALSE
FALSE
FALSE
FALSE
numeric
Alternate Comparison for INCLUSION
(when the data is included in the master cell)
EQUAL 2 -Comparison between one cell (master) and a sum: The sum might be “inclusion”,
because all is included in the master.
Result
Master
missing
inclusion
nil
not applicable
value
missing
TRUE
FALSE
FALSE
FALSE
FALSE
inclusion
FALSE
TRUE
FALSE
FALSE
FALSE
nil
FALSE
FALSE
TRUE
FALSE
FALSE
not applicable
FALSE
FALSE
FALSE
TRUE
FALSE
value
FALSE
TRUE
FALSE
FALSE
numeric
calcIndic: Example of special data rules
for calculations
<d>E.(isc).(age).(sex)</d>
By default, if the above data point is missing, the indicator calculated will also be labeled as missing.
<d MG=“2”>E.(isc).(age).(sex)</d>
The MG=“2” code above alters the behavior of the data point. Missing data for this data point will now
be considered ‘nil’ or 0
Future Development
• DIVA
• Ability to launch “on command”, instancing
• Ability to calculate only the indicators that are
affected by an underlying data change
Download