UIS Data Transformation and Validations As it pertains to the SDMX TWG EXL Initiative Gathering Data • Each data point to be collected is described with dimensions prior to collection • Unique identifier is assigned to each data point/dimensional grouping • Data is collected via surveys • Data is inserted into the database by country/year for each survey returned • Data goes through a cleaning process that involves both human and automated validation (ERS) Data Encoding EMC_ID ACTIVE EC_PRIO EC_UNIT EC_SECTO EC_PRGDS EC_GRADE EC_AGE EC_FIELD EC_FORGN EC_ISCED EC_SEX EC_PRGDU EC_DGPOS EC_PRGLO EC_ADULT 20062 1 2 210 100000 100000 100000 100000 10000000 100000 10 100000 100000 100000 100000 10 EMC_ID: Internal unique identifier used to store data. Each EMC_ID summarizes a set of dimension for data that we collect. In this case, the data point refers to ENROLLMENT (EC_UNIT=210) in ISCED 1 (EC_ISCED = 10). Labels for each dimensional value are stored in separate dimension tables. For a more human legible format, each EMC_ID used in indicator definitions is also given an alphanumeric code that summarizes the dimensions. In this case, “E.1” is used, for ENROLLMENT in ISCED 1. Raw Data Validation (ERS) • Database (T-SQL) implementation. • Stored procedures and reporting services • Based on CONCEPTS Example: Concept: Redundant Data Check Description: UIS Surveys often have cells that are redundant in order to verify that the value entered in one cell is accurate and not the victim of a human input error Purpose: Verify that one cell equals another, redundant, cell Method: Validates that a specific “MASTER” cell is equal to any other redundant cell. Redundant cells are identified by having all dimensional values equal to the master cell with the exception of the PRIORITY dimension. Preparing Indicators (transformations) • Indicators are encoded in XML using extended MathML • Resulting XML file can render in a friendly manner in any browser, providing immediate documentation • Indicator XML file is “parsed” to convert the XML into database records • Indicator definitions are validated when parsed to ensure completeness as well as the existence of any needed indicators Indicator Definition Indicators are defined using MathML, with custom tags implemented by the UIS. <indicator name="GAP"> <label> <en>Graduation age population</en> <fr>Population d age de graduation</fr> </label> <formulas> <formula> <roll wildcard="isc" list="1,2.GPV,2"/> <roll wildcard="sex" list="F,T"/> <suffix>(isc).(sex)</suffix> <offset wildcard="age" low="Ag1" up="Ag25"> <sum> <d>thAge.(isc)</d> <d>thDur.(isc)</d> </sum> <c>1</c> <d src="POP" >P.(age).(sex)</d> </offset> <synonym wildcard="isc" use="2.A.GPV" for="2.GPV" /> <synonym wildcard="isc" use="2.A.GPV" for="2" /> </formula> </formulas> </indicator> Indicator Definition (cont.) • When loaded into a MathML enabled browser, the indicator definition becomes human readable and self documenting. • Rendering the XML in a browser also helps to validate that the XML indicator specification is well formed. Indicator Definition (cont.) • A parser is then used to convert the XML indicator specification to a database structure for use in processing the transformations IndicCode GAP.1 GAP.1 GAP.1 GAP.1 GAP.1 GAP.1 GAP.1 GAP.1 GAP.1 GAP.1 GAP.1 GAP.1 GAP.1 GAP.1 GAP.1 GAP.1 GAP.1 GAP.1 GAP.1 GAP.1 GAP.1 GAP.1 GAP.1 GAP.1 GAP.1 GAP.1 GAP.1 GAP.1 GAP.1 GAP.1 Term Parent 1 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 1002 1003 1004 1005 Action 0 offset 1d 1d 1d 1d 1d 1d 1d 1d 1d 1d 1d 1d 1d 1d 1d 1d 1d 1d 1d 1d 1d 1d 1d 1d 1d 1 sum 1002 d 1002 d 1c parentAction root offset offset offset offset offset offset offset offset offset offset offset offset offset offset offset offset offset offset offset offset offset offset offset offset offset offset sum sum offset Nterms Value Sequence Source 27 1 0 P.Ag1 3 POP 0 P.Ag2 4 POP 0 P.Ag3 5 POP 0 P.Ag4 6 POP 0 P.Ag5 7 POP 0 P.Ag6 8 POP 0 P.Ag7 9 POP 0 P.Ag8 10 POP 0 P.Ag9 11 POP 0 P.Ag10 12 POP 0 P.Ag11 13 POP 0 P.Ag12 14 POP 0 P.Ag13 15 POP 0 P.Ag14 16 POP 0 P.Ag15 17 POP 0 P.Ag16 18 POP 0 P.Ag17 19 POP 0 P.Ag18 20 POP 0 P.Ag19 21 POP 0 P.Ag20 22 POP 0 P.Ag21 23 POP 0 P.Ag22 24 POP 0 P.Ag23 25 POP 0 P.Ag24 26 POP 0 P.Ag25 27 POP 2 1 0 thAge.1 1 EDU 0 thDur.1 2 EDU 0 1 2 calcIndic • Seasoned for 7 years • Currently on 4th version • Entirely developed using database stored procedures and T-SQL • Leverages well seasoned database functionality • Data, indicator definitions and transformation code all in a single database. Fast. calcIndic (part 2) - Indicator definitions are read - Each <d> (data) or <i> (indicator) tag is resolved by joining the required data point to the indicator definition for each country and year involved in the transformation - The steps for performing the calculation are performed based on the indicator definition - Data is written to domain-specific tables - Indicator validations are performed and problematic results are flagged. The reasons for each flag are logged to permit easy auditing. User Defined Indicator Validation (DIVA) (in development) • XML based. Validation rules for a particular indicator are defined alongside the indicator definition. • MathML based with extended custom tags • Validation process is SQL based • As with the indicator definition, browser plugin makes the XML definition selfdocumenting User Defined Indicator Validation (DIVA) <diva> (in development) <!-- relative change is not greater than 10< --> <formula cid="12" cids="1" range="0.1"> <roll wildcard="sex" list="M,F,T" /> <roll wildcard="isc" list="0,1,23,4,56" /> <suffix>(isc).(sex)</suffix> <test> <i>SAP.(isc).(sex)</i> <i relyear="-1">SAP.(isc).(sex)</i> <filter> <i>SAP.(isc)</i><i>SAP.(isc).M</i><i>SAP.(isc).F</i> </filter> </test> </formula> </diva> Dealing with missing/special data • Both ERS and calcIndic allow for special processing of missing data • Rules coding allow for custom treatment of special data • Normal rule for formulas: “Special data” properties are viral. If you add a list of numbers together, and one value is “missing”, the sum will be “missing”. • Normal rule for comparisons: Special data is only equal to similar special data (missing = missing). Dealing with missing/special data (cont.) • Specifying alternate processing rules possible on a case-by-case basis. • When defining an indicator, each data point can have a rule specified to enable an alternate way of dealing with special data • When defining a validation concept in ERS, each concept can have an alternate rule specified for comparisons ERS: Example of special data rules for comparisons Default Comparison EQUAL 1 - Direct comparison between 2 cells Result Master missing inclusion missing TRUE FALSE inclusion FALSE TRUE nil FALSE FALSE not applicable FALSE FALSE value FALSE FALSE nil FALSE FALSE TRUE FALSE FALSE not applicable FALSE FALSE FALSE TRUE FALSE value FALSE FALSE FALSE FALSE numeric Alternate Comparison for INCLUSION (when the data is included in the master cell) EQUAL 2 -Comparison between one cell (master) and a sum: The sum might be “inclusion”, because all is included in the master. Result Master missing inclusion nil not applicable value missing TRUE FALSE FALSE FALSE FALSE inclusion FALSE TRUE FALSE FALSE FALSE nil FALSE FALSE TRUE FALSE FALSE not applicable FALSE FALSE FALSE TRUE FALSE value FALSE TRUE FALSE FALSE numeric calcIndic: Example of special data rules for calculations <d>E.(isc).(age).(sex)</d> By default, if the above data point is missing, the indicator calculated will also be labeled as missing. <d MG=“2”>E.(isc).(age).(sex)</d> The MG=“2” code above alters the behavior of the data point. Missing data for this data point will now be considered ‘nil’ or 0 Future Development • DIVA • Ability to launch “on command”, instancing • Ability to calculate only the indicators that are affected by an underlying data change