Statistical Production Services

Canada’s Updated Case Study and
the Benefits and Challenges of
Implementing the Generic Statistical
Information Model
Flavio Rizzolo
Joint work with Tim Dunstan and Kathryn Stevenson
Work Session on Statistical Metadata
Geneva, 6-8 May 2013
What we do
All these services use metadata in one form or another
Challenge 1: to make metadata available to all of them in an efficient, effective and controlled way
Challenge 2: to exchange metadata in a common format with minimum overhead
Statistical Production Services
Collection Services
• Survey Planning
• Instrument Generation
• ICOS Governance
• Sample Management
• Training
• Collected Data Management
• Response Collection
• Workload Management
• HR Management
• Logistics
• Case Management
• Response Processing
• Collection Systems Operation
• Survey Pre-Production Testing
• Survey Progress Monitoring
• Pre-Collection Respondent Communication
• Respondent Support
• Internal Communication
Business Processing Services
• Common processing platform for all business / microeconomic surveys
Social Processing Services
• Common processing platform for all socio-economic, labour, health surveys
Generalized Systems Services
• Confidentiality
• Sampling
• Edit & Imputation
Other boxes on the slide: Macro-economic; Analysis & Tabulation; Modeling; To be captured; Census (Census 2016 Platform)
Dissemination Services
• “New Dissemination Model”
• Single Output Database
• CLF compliance
• OpenData portal
• Syndication
• Social Media
• Web Services
Statistical Infrastructure Services (common services)
Data Service Centre(s)
• EAIP web services
• Steady-state dataset management
• Search / discover
Metadata Management
• Statistical Metadata Management Strategy implementation
• Metadata Portal
• Stewardship
• Model repository?
• Metadata search
Business Register Services
Address Register Services
Classification Services
• Classification Management
• Concordance Management
Statistics Canada • Statistique Canada
Metadata management building blocks
 Integrated Metadata Base (IMDB)
 Integrated Business Surveys Project (IBSP)
• Centralized metadata repository
• Integration with IMDB
• Metadata + processing environment
 Common Tools Project
• Centralized metadata repository
• Integration with IMDB
• Social Survey Metadata Environment (SSME)
• Social Survey Processing Environment (SSPE)
 Data Service Centres (DSC)
 Integrated Service Oriented Architecture (SOA)
Status legend: Completed, Underway, Planned
GSBPM and GSIM
 The Generic Statistical Business Process Model (GSBPM)
has been a StatCan reference model since 2010 and is also being
used to harmonize StatCan’s statistical processing infrastructure
 The Generic Statistical Information Model (GSIM) is being
adopted to specify, design, and implement components for
integration into “plug’n’play” architectures and link to standard
formats (e.g. DDI, SDMX)
 GSIM’s Concepts and Structures Groups will be the main
classifiers of metadata and function as inputs/outputs of
GSBPM statistical business sub-processes
Data models for input and outputs
 Information has to be consistent across all relevant
business units
 However, the same abstract information object (e.g.,
survey, questionnaire, classification) can be physically
implemented by different data producers (and
consumers) in different ways
 This “impedance mismatch” between producers’ and
consumers’ views (and understanding) of data can be
addressed either:
• by forcing them to conform to each other’s data models
(point-to-point data integration)
• by creating canonical information models to which
producer and consumer models will map (SOA data
integration)
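The two integration styles can be compared with a minimal Python sketch. Everything in it is an assumption made for illustration: the `CanonicalSurvey` fields, the producer and consumer record layouts, and the sample values are hypothetical and are not taken from any StatCan system. It shows one producer mapping into a canonical object, one consumer mapping out of it, and how the number of mappings grows under each approach.

```python
from dataclasses import dataclass


# Hypothetical canonical representation of a survey (illustrative fields only).
@dataclass
class CanonicalSurvey:
    survey_id: str
    title_en: str
    title_fr: str


def from_producer_a(record: dict) -> CanonicalSurvey:
    """Map an assumed producer layout onto the canonical model."""
    return CanonicalSurvey(
        survey_id=str(record["SRV_NUM"]),
        title_en=record["NAME_E"],
        title_fr=record["NAME_F"],
    )


def to_consumer_b(survey: CanonicalSurvey) -> dict:
    """Map the canonical model onto an assumed consumer layout."""
    return {
        "id": survey.survey_id,
        "title": {"en": survey.title_en, "fr": survey.title_fr},
    }


def point_to_point_mappings(n: int) -> int:
    """Directed mappings needed when each of n systems maps to every other."""
    return n * (n - 1)


def canonical_mappings(n: int) -> int:
    """Mappings needed when each system maps to and from one canonical model."""
    return 2 * n


# Made-up sample record.
record = {"SRV_NUM": 1234, "NAME_E": "Example Survey", "NAME_F": "Enquête exemple"}
print(to_consumer_b(from_producer_a(record)))
print(point_to_point_mappings(10), canonical_mappings(10))
```

With ten systems, the point-to-point style needs 90 directed mappings against 20 for the canonical style, which is the main argument for paying the cost of designing a shared model.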
Canonical models
 Canonical information models are enterprise- or
segment-wide, common representations of information
objects – a “lingua franca” for data exchange
 Within a SOA framework, they are implemented as
object models that are serialized into XML Schema
Definition (XSD) types
• XSD types of canonical models will be maintained in a
repository that can be referenced and reused by multiple
service contracts (WSDL)
• XSD types will be maintained by the service developers
within a governance framework
• Producer and consumer schemas need to be mapped to
the canonical metadata models via schema mappings –
object-relational (ORMs) or object-XML (OXMs)
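An object-XML mapping of this kind can be sketched with the Python standard library. This is an illustration under stated assumptions, not the StatCan implementation: the class, the element and attribute names, and the sample values are hypothetical, and no XSD validation is performed. It only shows an object being serialized to XML and read back.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass


# Hypothetical canonical object; names are illustrative, not an actual XSD type.
@dataclass
class ClassificationItem:
    code: str
    name_english: str
    name_french: str


def to_xml(item: ClassificationItem) -> ET.Element:
    """Serialize the object into an XML element (object-to-XML direction)."""
    elem = ET.Element("ClassificationItem", attrib={"code": item.code})
    ET.SubElement(elem, "nameEnglish").text = item.name_english
    ET.SubElement(elem, "nameFrench").text = item.name_french
    return elem


def from_xml(elem: ET.Element) -> ClassificationItem:
    """Rebuild the object from an XML element (XML-to-object direction)."""
    return ClassificationItem(
        code=elem.get("code"),
        name_english=elem.findtext("nameEnglish"),
        name_french=elem.findtext("nameFrench"),
    )


item = ClassificationItem("A1", "Example item", "Exemple")  # made-up values
xml_text = ET.tostring(to_xml(item), encoding="unicode")
print(xml_text)
round_tripped = from_xml(ET.fromstring(xml_text))
```

In a service setting the element layout would be dictated by the shared XSD types rather than chosen in code, so that every service contract importing those types agrees on the same structure.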
SOA (meta)data exchange model
Diagram annotations:
• complex data transformations via custom SQL queries (possibly recursive) when source model is far from canonical
• automatically generated when producer model is close to canonical
• automatic serialization and typing
• inventory of canonical metadata XSD types to be imported into WSDLs for reusability across services
• customized XSLTs for transforming XML structure for composite services
• automatic deserialization (may include customized XSLTs when consumer’s model is far from canonical)
• applications can access consumer models directly
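As an illustration of a recursive SQL transformation, the sketch below flattens a parent-child table into hierarchy paths, the kind of reshaping that may be needed when a source model is far from the canonical one. The `item` table, its columns, and the sample rows are hypothetical; SQLite is used only because it ships with Python.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical source table storing a hierarchy as parent pointers.
conn.executescript("""
    CREATE TABLE item (code TEXT PRIMARY KEY, name TEXT, parent_code TEXT);
    INSERT INTO item VALUES ('A', 'Goods', NULL);
    INSERT INTO item VALUES ('A1', 'Food', 'A');
    INSERT INTO item VALUES ('A11', 'Bread', 'A1');
    INSERT INTO item VALUES ('B', 'Services', NULL);
""")

# Recursive common table expression: start from the roots, then repeatedly
# join children onto the rows found so far, extending depth and path.
rows = conn.execute("""
    WITH RECURSIVE tree(code, name, depth, path) AS (
        SELECT code, name, 1, code FROM item WHERE parent_code IS NULL
        UNION ALL
        SELECT i.code, i.name, t.depth + 1, t.path || '/' || i.code
        FROM item AS i JOIN tree AS t ON i.parent_code = t.code
    )
    SELECT code, depth, path FROM tree ORDER BY path
""").fetchall()

for code, depth, path in rows:
    print(code, depth, path)
```

The resulting rows carry an explicit depth and path for each item, which is easier to serialize into a nested canonical structure than raw parent pointers.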
StatCan and GSIM synergy
 Active groups
• Plug & Play
• Implementation
• Mapping GSIM to DDI and SDMX
 Information objects are being aligned with GSIM at the
implementation level
 A two-way convergence
• GSIM to StatCan: survey instrument, questionnaire and
classification canonical models; semantic work between
Enterprise Architecture, SNA, IBSP and ICOS influenced by
the GSIM model
• StatCan to GSIM: separation of Flow Decision (Rule) and
Flow Action (Control Transition) in GSIM version 1.0;
participation in the GSIM Production Group
Canonical questionnaire model and GSIM
Class diagram: EQGS mapping to GSIM
GSIM information object mapping (diagram legend):
• Production - Rule
• Production - Process Control
• Business - Acquisition Activity
• Business - Instrument
• Business - Instance Question Block (Instrument Control)
• Business - Instance Question
• Business - Control Transition
Diagram note: these entities may not be relevant for all GSBPM phases
Classes and attributes:
• Survey Instance: ID, Version, SDDS, Effective Time Period
• Questionnaire: ID, Version, QType, Questionnaire Title, Questionnaire Instruction, Questionnaire Help, Effective Time Period
• Module (has sub-modules): ID, Version, Module Title, Module Label, Module Instruction, Module Help, Sort Order, Effective Time Period
• Question: ID, Version, Question Label, Question Instruction, Question Help, Sort Order, Effective Time Period
• Cell: ID, Version, Cell Type, Value Domain, Default Value, Cell Edit Status
• Response: ID, Version, Response Label, Value, Effective Time Period
• Edit Specification: ID, Version, Time Period Type, Edit Purpose, Response Type, Follow-up Priority, Confirmable Indicator, Failure Condition, Error Message
• Edit Action: ID, Version, Edit Action Type
• Flow Decision: ID, Version, Flow Origin Object ID, Flow Decision
• Flow Action: ID, Version, Flow Target Object ID
• Parameter Mapping
Association labels: determines choice of; is component and sub-component of; has sub-modules; validates; is used to determine; applied to; may change edit status of; determines; to answer; records
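The separation between Flow Decision (the rule) and Flow Action (the control transition) can be sketched as two small Python classes. This is one possible reading of the attribute names in the diagram, not the actual canonical model: representing the rule as a Python callable, the `next_targets` helper, and the sample question IDs are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

Responses = Dict[str, str]  # assumed shape: question ID -> recorded value


@dataclass
class FlowAction:
    """Control transition: identifies the object to move to next."""
    id: str
    flow_target_object_id: str


@dataclass
class FlowDecision:
    """Rule evaluated when leaving the origin object."""
    id: str
    flow_origin_object_id: str
    condition: Callable[[Responses], bool]  # hypothetical encoding of the rule
    actions: List[FlowAction] = field(default_factory=list)


def next_targets(decisions: List[FlowDecision], origin_id: str,
                 responses: Responses) -> List[str]:
    """Return the targets of the first decision at origin_id whose rule holds."""
    for decision in decisions:
        if decision.flow_origin_object_id == origin_id and decision.condition(responses):
            return [action.flow_target_object_id for action in decision.actions]
    return []


# Made-up skip pattern: answering "no" to Q1 skips ahead to Q5.
decisions = [
    FlowDecision("FD1", "Q1", lambda r: r.get("Q1") == "yes", [FlowAction("FA1", "Q2")]),
    FlowDecision("FD2", "Q1", lambda r: r.get("Q1") == "no", [FlowAction("FA2", "Q5")]),
]
print(next_targets(decisions, "Q1", {"Q1": "no"}))
```

Keeping the rule and the transition in separate objects means the same transition can be reused by several rules, and a rule can be changed without touching the targets it leads to.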
Canonical classification model: object level
Class diagram: Classification Object Model
Classes and attributes:
• ClassificationScheme «content»: id, nameEnglish, nameFrench, abbreviationEn, abbreviationFr, descriptionEn, descriptionFr (all String)
• ClassificationLevel: nameEnglish : String, nameFrench : String, startDate : Date, endDate : Date, sortOrder : int
• ClassificationItem «content»: code : String
• A second «content» attribute block (class label missing): id, nameEnglish, nameFrench, abbreviationEn, abbreviationFr, descriptionEn, descriptionFr (all String)
• AdminType: adminStatus, registrationLevel, registrationStatus
• PropertyType: property : String, value : String
Association roles and multiplicities: +scheme 1; +properties 0..*; +items 1..*; -levels 1..*; -level 1; -parent 0..1; -children 0..*; -child 0..1
Diagram notes:
• AdminType and PropertyType: additional entities to handle scheme registration and more flexible formats
• The item hierarchy is a tree: every item may have zero or more children
• The level hierarchy is linear: every level has at most one child
• Two items are in a parent-child relationship only if their respective levels are in a parent-child relationship as well
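The three hierarchy rules stated on this slide (linear levels, tree-shaped items, and item parentage following level parentage) can be sketched as follows. This is a simplified illustration, not the canonical model itself: only a few attributes are kept, the `add_child` methods are hypothetical helpers, and the level names and codes are made up.

```python
from dataclasses import dataclass, field
from typing import List, Optional


# eq=False keeps identity comparison, avoiding recursion through parent/child links.
@dataclass(eq=False)
class ClassificationLevel:
    name_english: str
    parent: Optional["ClassificationLevel"] = None
    child: Optional["ClassificationLevel"] = None

    def add_child(self, name_english: str) -> "ClassificationLevel":
        # Linear level hierarchy: every level has at most one child.
        if self.child is not None:
            raise ValueError("a level has at most one child")
        self.child = ClassificationLevel(name_english, parent=self)
        return self.child


@dataclass(eq=False)
class ClassificationItem:
    code: str
    level: ClassificationLevel
    parent: Optional["ClassificationItem"] = None
    children: List["ClassificationItem"] = field(default_factory=list)

    def add_child(self, code: str, level: ClassificationLevel) -> "ClassificationItem":
        # Items may be parent and child only if their levels are parent and child.
        if level.parent is not self.level:
            raise ValueError("child item's level must be the child of the parent's level")
        item = ClassificationItem(code, level, parent=self)
        self.children.append(item)  # tree: zero or more children per item
        return item


# Made-up two-level example.
sector = ClassificationLevel("Sector")
subsector = sector.add_child("Subsector")
root = ClassificationItem("1", sector)
first = root.add_child("11", subsector)
second = root.add_child("12", subsector)
print([child.code for child in root.children])
```

Enforcing the constraints at construction time means an invalid hierarchy cannot be built in the first place, which is simpler than validating a finished structure.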
IMDB DDI service proof-of-concept
 Expose the IMDB repository using a Service Oriented
Architecture (SOA) approach instead of point-to-point
 Provide IMDB metadata content in a standard format (DDI
v3.x)
 Support applications that focus on different types of metadata
(e.g., surveys, variables, classifications, concepts)
 Support the Data Liberation Initiative (DLI) and the Canadian
Research Data Centre Network (CRDCN) Metadata projects
IMDB DDI service architecture
Diagram annotations:
• XSLT transforms from DDI to HTML, CSV and other internal data formats (key for interoperability and SOA)
• Mapping between the IMDB physical model and the DDI XML schemas, implemented with SQL queries
• Proof-of-concept clients developed internally: JSP/Servlet, web client, standalone Java client
• Potential clients: .NET, SAS, Excel, Reports, DW Integration Services
(Near) future work
 How to deal with change management in GSIM (not trivial once it
has been implemented)
 What is the best possible implementation of GSIM for SOA data
exchange?
 Need to handle the complexity of data exchange across dozens
of statistical production and infrastructure systems
 Canonical models need to be simple and intuitive (easy to use
by clients), and create little overhead
 Need to consider light-weight alternatives to XML (e.g., JSON).
 Does a single implementation “fit” the entire GSBPM?
• Need to look at GSBPM and identify what level of detail is
necessary for each information object within each
process/sub-process
• Example: a canonical questionnaire model should include edits,
flows and cells at the design and collect phases, but not at the
process or disseminate phases. Similarly, a data quality metric
may not be necessary before the process phase
 Will GSIM “level 2” take phases into account?
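The questionnaire example can be made concrete with a small sketch of phase-specific views over one canonical object, serialized as JSON as a light-weight alternative to XML. The dictionary layout, the `view_for_phase` helper, and the sample content are assumptions for illustration; only the rule itself (edits, flows and cells at the design and collect phases, but not at the process or disseminate phases) comes from the slide.

```python
import json
from typing import Any, Dict

# Components exposed only in some phases, per the questionnaire example.
DETAIL_COMPONENTS = {"edits", "flows", "cells"}
DETAIL_PHASES = {"design", "collect"}


def view_for_phase(questionnaire: Dict[str, Any], phase: str) -> Dict[str, Any]:
    """Return the part of the canonical questionnaire relevant to a phase."""
    if phase in DETAIL_PHASES:
        return dict(questionnaire)
    return {key: value for key, value in questionnaire.items()
            if key not in DETAIL_COMPONENTS}


# Made-up canonical questionnaire.
questionnaire = {
    "id": "Q-EX-1",
    "title": "Example questionnaire",
    "questions": ["Q1", "Q2"],
    "cells": ["C1", "C2"],
    "flows": [{"origin": "Q1", "target": "Q2"}],
    "edits": [{"id": "E1", "failure_condition": "C1 is blank"}],
}

process_view = view_for_phase(questionnaire, "process")
print(json.dumps(process_view))
```

A single canonical model with per-phase projections is only one option; separate models per phase would be another, at the cost of more mappings to maintain.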
Farther ahead…
 Metadata: investigate large-scale entity resolution; entity
identifiers should not be multiplied beyond necessity
• Every DB has a different id for the same information object
• Do we keep a centralized mapping between them?
• Do we keep a centralized DNS-like system that assigns ids to
entities? (OKKAM project approach)
 Architecture: explore alternative paradigms, e.g., event-driven
architecture (EDA), to complement SOA
• Subscription-based rather than request-based (e.g., RSS, Atom,
etc.)
• Loose coupling and scalability
• SOA service composition vs. EDA syndication/aggregation
• EDA subscribers need to be more sophisticated than SOA clients
(e.g., need to be ready to store/handle event responses whenever
they happen)
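The subscription-based style can be sketched with a tiny in-process publish/subscribe example. It is only meant to show why a subscriber needs its own storage for events that arrive at times it does not control. The `EventBus` and `Subscriber` classes, the topic name, and the event fields are hypothetical, and a real deployment would rely on messaging middleware or feeds rather than direct method calls.

```python
from collections import deque
from typing import Callable, Deque, Dict, List

Handler = Callable[[dict], None]


class EventBus:
    """Minimal publish/subscribe hub (illustrative only)."""

    def __init__(self) -> None:
        self._handlers: Dict[str, List[Handler]] = {}

    def subscribe(self, topic: str, handler: Handler) -> None:
        self._handlers.setdefault(topic, []).append(handler)

    def publish(self, topic: str, event: dict) -> None:
        # The publisher does not know who is listening: loose coupling.
        for handler in self._handlers.get(topic, []):
            handler(event)


class Subscriber:
    """Stores events as they arrive so they can be handled later."""

    def __init__(self) -> None:
        self.inbox: Deque[dict] = deque()

    def on_event(self, event: dict) -> None:
        self.inbox.append(event)

    def process_all(self) -> List[str]:
        handled = []
        while self.inbox:
            handled.append(self.inbox.popleft()["object_id"])
        return handled


bus = EventBus()
portal = Subscriber()
bus.subscribe("classification.updated", portal.on_event)

# Made-up events published whenever the source system changes something.
bus.publish("classification.updated", {"object_id": "CLS-1"})
bus.publish("classification.updated", {"object_id": "CLS-2"})
pending = len(portal.inbox)
handled_ids = portal.process_all()
print(pending, handled_ids)
```

By contrast, a request-based SOA client would ask for the current state only when it needs it and would not have to keep an inbox at all.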