Canada’s Updated Case Study and the Benefits and Challenges of Implementing the Generic Statistical Information Model

Flavio Rizzolo
Joint work with Tim Dunstan and Kathryn Stevenson
Work Session on Statistical Metadata
Geneva, 6-8 May 2013
Statistics Canada • Statistique Canada

What we do

All these services use metadata in one form or another.
Challenge 1: to make metadata available to all of them in an efficient, effective and controlled way.
Challenge 2: to exchange metadata in a common format with minimum overhead.

Statistical Production Services

Business Processing Services
• Common processing platform for all business / microeconomic surveys

Social Processing Services
• Common processing platform for all socio-economic, labour, health surveys

Collection Services
• Survey Planning
• Instrument Generation
• ICOS Governance
• Sample Management
• Training
• Collected Data Management
• Response Collection
• Workload Management
• HR Management
• Logistics
• Case Management
• Response Processing
• Collection Systems Operation
• Survey Pre-Production Testing
• Survey Progress Monitoring
• Pre-Collection Respondent Communication
• Respondent Support
• Internal Communication

Generalized Systems and Macro-economic Services
• Analysis & Tabulation
• Modeling
• Confidentiality
• Sampling
• Edit & Imputation
• To be captured

Dissemination Services
• “New Dissemination Model”
• Single Output Database
• CLF compliance
• OpenData portal
• Syndication
• Social Media
• Web Services

Census

Statistical Infrastructure Services
• Metadata Management
• Data Service Centre(s)
• Business Register Services
• Address Register Services

Infrastructure service items
• EAIP web services
• Steady-state dataset management
• Search / discover
• Statistical Metadata Management Strategy implementation
• Metadata Portal
• Stewardship
• Model repository?
• Metadata search

Census 2016 Platform

Classification Services
• Classification Management
• Concordance Management

Common services

Metadata management building blocks
• Integrated Metadata Base (IMDB)
• Integrated Business Surveys Project (IBSP)
  • Centralized metadata repository
  • Integration with IMDB
  • Metadata + processing environment
• Common Tools Project
  • Centralized Metadata repository
  • Integration with IMDB
• Social Survey Metadata Environment (SSME)
• Social Survey Processing Environment (SSPE)
• Data Service Centres (DCS)
  • Integrated
• Service Oriented Architecture (SOA)
Legend: Completed, Underway, Planned

GSBPM and GSIM
• The Generic Statistical Business Process Model (GSBPM) has been a StatCan reference model since 2010, and is also being used to harmonize StatCan’s statistical processing infrastructure
• The Generic Statistical Information Model (GSIM) is being adopted to specify, design, and implement components for integration into “plug’n’play” architectures and link to standard formats (e.g. DDI, SDMX)
• GSIM’s Concepts and Structures Groups will be the main classifiers of metadata and function as inputs/outputs of GSBPM statistical business sub-processes

Data models for input and outputs
• Information has to be consistent across all relevant business units
• However, the same abstract information object (e.g., survey, questionnaire, classification) can be physically implemented by different data producers (and consumers) in different ways
• This “impedance mismatch” between producers’ and consumers’ views (and understanding) of data can be addressed either:
  • by forcing them to conform to each other’s data models (point-to-point data integration)
  • by creating canonical information models to which producers’ and consumers’ models will map (SOA data integration)

Canonical models
• Canonical information models are enterprise- or segment-wide, common representations of information objects: a “lingua franca” for data exchange
• Within a SOA framework, they are implemented as object models that are
serialized into XML Schema Definition (XSD) types
• XSD types of canonical models will be maintained in a repository that can be referenced and reused by multiple service contracts (WSDL)
• XSD types will be maintained by the service developers within a governance framework
• Producer and consumer schemas need to be mapped to the canonical metadata models via schema mappings: object-relational (ORMs) or object-XML (OXMs)

SOA (meta)data exchange model
[Diagram. Labels:]
• complex data transformations via custom SQL queries (possibly recursive) when source model is far from canonical
• automatically generated when producer model is close to canonical
• automatic serialization and typing
• inventory of canonical metadata XSD types to be imported into WSDLs for reusability across services
• customized XSLTs for transforming XML structure for composite services
• automatic deserialization (may include customized XSLTs when consumer model is far from canonical)
• applications can access consumer’s models directly

StatCan and GSIM synergy
• Active groups: Plug & Play, Implementation, Mapping GSIM to DDI and SDMX
• Information objects are being aligned with GSIM at the implementation level
• A two-way convergence
  • GSIM to StatCan
    • Survey instrument, questionnaire and classification canonical models
    • Semantic work between Enterprise Architecture, SNA, IBSP and ICOS influenced by GSIM model
  • StatCan to GSIM
    • Separation of Flow Decision (Rule) and Flow Action (Control Transition) in GSIM version 1.0
    • Participation in GSIM Production Group

Canonical questionnaire model and GSIM
[Class diagram: EQGS mapping to GSIM]
Note on the diagram: These entities may not be relevant for all GSBPM phases

GSIM information object mapping (Parameter Mapping)
• Production - Rule
• Production - Process Control
• Business - Acquisition Activity
• Business - Instrument
• Business - Instance Question Block (Instrument Control)
• Business - Instance Question
• Business - Control Transition

Classes and attributes
• Survey Instance: ID, Version, SDDS, Effective Time Period
• Questionnaire: ID, Version, QType, Questionnaire Title, Questionnaire Instruction, Questionnaire Help, Effective Time Period
• Edit Specification: ID, Version, Time Period Type, Edit Purpose, Response Type, Follow-up Priority, Confirmable Indicator, Failure Condition, Error Message
• Edit Action: ID, Version, Edit Action Type
• Module: ID, Version, Module Title, Module Label, Module Instruction, Module Help, Sort Order, Effective Time Period
• Flow Decision: ID, Version, Flow Origin Object ID, Flow Decision
• Flow Action: ID, Version, Flow Target Object ID
• Question: ID, Version, Question Label, Question Instruction, Question Help, Sort Order, Effective Time Period
• Cell: ID, Version, Cell Type, Value Domain, Default Value, Cell Edit Status
• Response: ID, Version, Response Label, Value, Effective Time Period

Association labels: determines choice of; is component and sub-component of; has sub-modules; validates; is used to determine; applied to; may change edit status of; determines; to answer; records

Canonical classification model: object level
[Class diagram: Classification Object Model]

Classes and attributes
• ClassificationScheme «content»: id, nameEnglish, nameFrench, abbreviationEn, abbreviationFr, descriptionEn, descriptionFr (all String)
• ClassificationLevel «content»: id, nameEnglish, nameFrench, abbreviationEn, abbreviationFr, descriptionEn, descriptionFr (all String)
• ClassificationItem «content»: code :String, nameEnglish :String, nameFrench :String, startDate :Date, endDate :Date, sortOrder :int
• AdminType: adminStatus, registrationLevel, registrationStatus
• PropertyType: property :String, value :String

Association roles: scheme (1), properties (0..*), levels (1..*), items (1..*), level (1), parent (0..1), children (0..*), child (0..1)

Notes on the diagram
• AdminType and PropertyType: additional entities to handle registration and more flexible formats
• The item hierarchy is a tree: every item may have zero or more children
• The level hierarchy is linear: every level has at most one child
• Two items are in a parent-child relationship only if their respective levels are in a parent-child relationship as well

IMDB DDI service proof-of-concept
• Expose the IMDB repository using a Service Oriented Architecture (SOA) approach instead of point-to-point
• Provide IMDB metadata content in a standard format (DDI v3.x)
• Support applications that focus on different types of metadata (e.g., surveys, variables, classifications, concepts)
• Support the Data Liberation Initiative (DLI) and the Canadian Research Data Centre Network (CRDCN) Metadata projects

IMDB DDI service architecture
[Diagram. Labels:]
• XSLT Transforms: from DDI to HTML, CSV and other internal data formats (key for interoperability and SOA)
• Proof-of-concept clients developed internally: JSP/Servlet, web client, standalone Java client
• Mapping between the IMDB physical model and the DDI XML schemas, implemented with SQL queries
• Potential clients: .NET, SAS, Excel, Reports, DW Integration Services

(Near) future work
• How to deal with change management in GSIM (not trivial once it has been implemented)
• What is the best possible implementation of GSIM for SOA data exchange?
  • Need to handle the complexity of data exchange across dozens of statistical production and infrastructure systems
  • Canonical models need to be simple and intuitive (easy to use by clients), and create little overhead
  • Need to consider light-weight alternatives to XML (e.g., JSON)
• Does a single implementation “fit” the entire GSBPM?
  • Need to look at GSBPM and identify what level of detail is necessary for each information object within process/subprocess
  • Example: a canonical questionnaire model should include edits, flows and cells at the design and collect phases, but not at the process or disseminate phases.
    Similarly, a data quality metric may not be necessary before the process phase.
  • Will GSIM “level 2” take phases into account?

Farther ahead…
• Metadata: investigate large-scale entity resolution (entity identifiers should not be multiplied beyond necessity)
  • Every DB has a different id for the same information object
  • Do we keep a centralized mapping between them?
  • Do we keep a centralized DNS-like system that assigns ids to entities? (OKKAM project approach)
• Architecture: explore alternative paradigms, e.g., event-driven architecture (EDA), to complement SOA
  • Subscription-based rather than request-based (e.g., RSS, Atom, etc.)
  • Loose coupling and scalability
  • SOA service composition vs. EDA syndication/aggregation
  • EDA subscribers need to be more sophisticated than SOA clients (e.g., need to be ready to store/handle event responses whenever they happen)
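
The "centralized mapping" option raised under entity resolution can be pictured as a small registry that links each system's local identifier to one shared identifier. The Python sketch below is only an illustration under stated assumptions: the class `IdRegistry`, its methods, and the sample identifiers are hypothetical, and it does not reproduce the OKKAM project's design or any StatCan system.

```python
from __future__ import annotations

import uuid


class IdRegistry:
    """Hypothetical central registry: (system, local id) -> one canonical id."""

    def __init__(self) -> None:
        # Forward index: (system, local_id) -> canonical id.
        self._to_canonical: dict[tuple[str, str], str] = {}
        # Reverse index: canonical id -> {system: local_id}.
        self._to_local: dict[str, dict[str, str]] = {}

    def register(
        self,
        system: str,
        local_id: str,
        same_as: tuple[str, str] | None = None,
    ) -> str:
        """Register a local id; `same_as` names an already registered
        (system, local_id) pair that denotes the same information object."""
        key = (system, local_id)
        if key in self._to_canonical:
            return self._to_canonical[key]
        if same_as is not None:
            # Raises KeyError if the referenced pair was never registered.
            canonical = self._to_canonical[same_as]
        else:
            canonical = uuid.uuid4().hex
        self._to_canonical[key] = canonical
        self._to_local.setdefault(canonical, {})[system] = local_id
        return canonical

    def translate(self, system: str, local_id: str, target_system: str) -> str | None:
        """Return the id used by `target_system` for the same object, if known."""
        canonical = self._to_canonical.get((system, local_id))
        if canonical is None:
            return None
        return self._to_local[canonical].get(target_system)
```

With such a registry, a consumer holding an id from one repository could ask for the id of the same object in another repository without either system changing its own keys. Deciding that two local ids denote the same object (the `same_as` argument) is the entity-resolution problem itself and is left outside the sketch.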
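
The last point, that EDA subscribers must be ready to store and handle events whenever they happen, can be illustrated with a toy in-process event bus. This is a minimal Python sketch under stated assumptions: `EventBus`, `BufferingSubscriber`, the topic name and the event fields are all hypothetical, and a real deployment would use a message broker or feed rather than direct function calls.

```python
from collections import defaultdict, deque
from typing import Callable, Deque, Dict, List

Event = dict
Handler = Callable[[Event], None]


class EventBus:
    """Toy publish/subscribe hub (hypothetical, in-process only)."""

    def __init__(self) -> None:
        self._subscribers: Dict[str, List[Handler]] = defaultdict(list)

    def subscribe(self, topic: str, handler: Handler) -> None:
        self._subscribers[topic].append(handler)

    def publish(self, topic: str, event: Event) -> int:
        """Push the event to every subscriber of the topic; return how many got it."""
        handlers = self._subscribers.get(topic, [])
        for handler in handlers:
            handler(event)
        return len(handlers)


class BufferingSubscriber:
    """Subscriber that stores events on arrival and processes them later."""

    def __init__(self) -> None:
        self.inbox: Deque[Event] = deque()
        self.applied: List[str] = []

    def on_event(self, event: Event) -> None:
        # Events arrive at the publisher's pace, so only store them here.
        self.inbox.append(event)

    def drain(self) -> int:
        """Process all stored events in arrival order; return how many were handled."""
        handled = 0
        while self.inbox:
            event = self.inbox.popleft()
            self.applied.append(event["classification_id"])
            handled += 1
        return handled


if __name__ == "__main__":
    bus = EventBus()
    subscriber = BufferingSubscriber()
    bus.subscribe("classification.updated", subscriber.on_event)
    bus.publish("classification.updated", {"classification_id": "example-1"})
    print(subscriber.drain(), subscriber.applied)
```

In a request-based SOA call the client decides when to ask and receives the answer immediately. Here the publisher decides when events arrive, which is why the subscriber needs its own inbox and a separate processing step.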