Basic Requirements on Data Analysis System of DRG Restart Project
Table of contents
Document summary
    Project goal
    Project submitter
    Document purpose
Product description
    System assumptions
    Fundamental data sources
    Logical components of the system
    Basic requirements on designed solution
    Possible solution examples
Use cases
Requirement specification
Appendices
    Appendix A: Definitions, Acronyms and Abbreviations
Document summary
Project goal
The project goal is to define a comprehensive platform architecture, i.e. a solution that will allow data in the DRG Restart project to be processed, integrated, validated, analyzed and reported. The result of this project will be a set of system proposals provided by the interested vendors, describing the selected architecture, its benefits and limitations, as well as pricing and schedule offers. This system specification project should be completed by the end of 2015.
The goal of this project is not a complete design of the whole data warehouse, the particular data integration processes, data models, reports, etc., but the overall architecture, components, modules and tools. The DWH process design itself will be implemented by the submitter on its own.
Project submitter
This project is submitted by the Institute of Health Information and Statistics of the Czech Republic (IHIS).
Contact person: Milan Blaha, PhD (CIO, project management), Milan.Blaha@uzis.cz,
+420 601 392 841.
Document purpose
The purpose of this document is to define the basic concepts and requirements on the data analysis and data mining system for the DRG Restart project. A detailed specification of the system will arise during the system specification project.
The audience of this document is the wide community of IT vendors developing public healthcare DWH solutions worldwide. It should be used as a base document for discussing the submitter's expectations and the vendors' possibilities, and as a specification for the vendors' solution offers.
Product description
System assumptions
The central data warehouse (DWH) will predominantly process structured administrative medical data originating from public healthcare providers and from medical insurance companies. Its yearly increment will be hundreds of millions of records (approx. 300-400 million), which corresponds to tens of gigabytes of input data at the lowest granularity. Together with historical and aggregated data, we expect on the order of 10^12 records and terabytes of data to be processed and analyzed.
The system will be focused on a small group of advanced users, on the order of units or low tens of them. Deployment for hundreds or more inexperienced users is not expected, at least for the first few years.
Another specific assumption is the expectation of batch imports: new data will arrive on a quarterly basis at the beginning; the update period may later be shortened to monthly. Online updates are not expected for now because of the character of the data. Also, high availability of the system is not required, as the system will not be used for online operation management but rather for long-term quality assurance, cost-effectiveness evaluation and optimization analyses.
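The volume figures above can be cross-checked with simple arithmetic. The following Python sketch is only a back-of-envelope illustration; the assumed average record size is a value chosen for this example, not a figure given in this document.

    # Back-of-envelope volume estimate for the assumptions above.
    # The average record size is a hypothetical figure, not taken from this document.
    YEARLY_RECORDS = 350_000_000      # approx. 300-400 million new records per year
    AVG_RECORD_BYTES = 100            # assumed average size of one raw input record

    yearly_input_gb = YEARLY_RECORDS * AVG_RECORD_BYTES / 1e9
    print(f"yearly raw input: ~{yearly_input_gb:.0f} GB")   # tens of gigabytes

    # Historical data, derived tables and aggregations multiply this base volume,
    # which is why terabytes of data are expected overall.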
Fundamental data sources
The following data sources will form the backbone of the data processed in the designed system:
1) National registry of healthcare services covered by public health insurance
 - Production data of all healthcare providers on services covered by public health insurance
 - Data on the organizational, technical and personnel assurance of these services
 - Data on payments for these services addressed to individual providers
2) Registry of data from the network of reference hospital healthcare providers
 - Production data on provided healthcare services submitted to the public health insurance companies
 - Data on the organizational, technical and personnel assurance of these services
 - Data on expenses spent on the provided care
3) Developed classification systems and rules (classification system of hospital procedures, hierarchical classification of healthcare production items, rules for compiling hospitalization confinements, rules defining the DRG classification, etc.)
4) National and international lists and classifications (ICD, TNM, DASTA, VZP, SUKL, ...)
5) Other registries maintained by IHIS and other sources
Logical components of the system
The designed system has to be composed of the following logical components, ensuring the required capabilities and functions:
 - Data integration subsystem: visual transform design; processing data from files, databases, web services and other components of this system
 - DWH store: data warehouse database for effective storing and retrieving of data and metadata
 - BI OLAP: dimensional modelling, OLAP cubes, pivot tables, multidimensional querying, self-service analyses
 - Advanced DM: data mining tools for advanced analyses (statistical analyses, business rules engine, regression models and classification, clustering, associative models and pattern matching, ...)
 - Reporting: online predefined or self-service analyses; export to text, PDF, Word, Excel, PowerPoint, images, ...
 - Specialized components: system management tools, metadata management tools, QA tools, ...
The real technical solution does not need to correspond exactly to these components, but it has to fulfil their capabilities.
Picture 1: Schema of the logical components of DWH
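Purely as an illustration of how these logical components could cooperate in one batch run, the following Python sketch wires them together behind hypothetical interfaces. All class, method and parameter names are placeholders invented for this example; the document does not prescribe any concrete API.

    # Hypothetical interfaces for the logical components above; a sketch only,
    # not a prescribed design. Names and signatures are illustrative placeholders.
    from abc import ABC, abstractmethod
    from typing import Any, Iterable, Mapping

    Record = Mapping[str, Any]

    class DataIntegration(ABC):
        @abstractmethod
        def run_batch(self, sources: Iterable[str]) -> Iterable[Record]:
            """Read, transform and validate one batch of source data."""

    class DwhStore(ABC):
        @abstractmethod
        def load(self, records: Iterable[Record]) -> None:
            """Persist integrated records and their metadata in the warehouse."""

    class OlapEngine(ABC):
        @abstractmethod
        def build_cubes(self) -> None:
            """Refresh dimensional models and OLAP cubes from the warehouse."""

    class ReportingService(ABC):
        @abstractmethod
        def publish(self) -> None:
            """Regenerate predefined reports and self-service exports."""

    def quarterly_run(di: DataIntegration, dwh: DwhStore, olap: OlapEngine,
                      reporting: ReportingService, sources: Iterable[str]) -> None:
        # One batch cycle: integrate -> store -> rebuild cubes -> refresh reports.
        dwh.load(di.run_batch(sources))
        olap.build_cubes()
        reporting.publish()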
Basic requirements on designed solution
The following list summarizes the general requirements on the system without an exact specification.
1) Complexity: All the basic elements of the BI & DM process are covered: acquisition, processing and integration of data, its storage, basic as well as advanced analyses, and report generation.
2) Scalability: The system can be simply scaled up for a performance boost by adding new (preferably independent) hardware and licenses at a reasonable price.
3) Modularity, openness and interoperability: Particular process components may be covered by various specialized tools, which may be implemented by different vendors. These components will be integrated via specified interfaces according to exact requirements. They have to cooperate on several levels; e.g., the results of some advanced data mining task could be used as an input for the data integration process, metadata describing the data integration process will be stored in the DWH and further analyzed, etc.
4) Exchangeability: Thanks to the previous requirements, some of the solution components could be exchanged for alternatives during the system's lifecycle, if those address changing expectations better. The expenses of the change itself (migration of data and metadata, integration with other components, etc.) must not be too high. One special requirement is to allow a solution based on open-source and/or free tools for non-profit and educational purposes, which will be as compatible with the standard DWH system as possible.
5) Extensibility: The system can be easily extended with additional tools and components, providing users with new functions and capabilities not included at the beginning.
6) Quality Assurance: A tool for designing, validating and improving the data and metadata quality process has to be supported. It is necessary to support validation and to ensure the completeness, consistency and recency of the processed data, as well as to enable validation of the whole process (metadata) of data transformations, from their acquisition up to the final reporting (an illustrative validation sketch follows after this list).
7) Security: The system will be used for processing very sensitive personal health data, so it has to be secured against all external and internal threats. It has to support secure authentication and authorization, secure storage and secure communication. It has to allow logging and auditing of execution and read operations. As this system will be available only to a limited number of advanced users, user access rights will be set at the database, table or column level; row- or cell-level access rights do not have to be supported. The system has to be operated on the submitter's local servers; cloud or outsourced solutions are forbidden due to legal limitations.
8) Simplicity: Particular components, as well as the whole system, have to be very simple to use and manage. Stability of the solution is crucial.
9) Metadata and data versioning, archiving and backup: Tools that allow version control, a development cycle (Git or others) and parallel team collaboration on all processes, jobs (data flows), DB schemas, etc. have to be supported. The system has to allow committing and reverting changes in tasks, comparison of changes in data and metadata, etc.
10) Performance requirements: The system should be designed for a few concurrent users, batch processing of source data and complex data mining analyses. The complete data integration process for a quarterly data increment should take no more than tens or a few hundreds of minutes. OLAP analysis reports have to respond within at most a few seconds. Enhanced parallelization of the process is expected; the hardware architecture will initially be based on a single multi-threaded server (up to many tens of threads). Further enhancement to a distributed platform is also expected.
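To make the Quality Assurance requirement (item 6) more concrete, the following Python sketch shows what simple completeness and consistency checks on one batch of records could look like. The field names and rules are hypothetical examples chosen for this sketch, not requirements of this document.

    # Illustrative data-quality checks (completeness and consistency) on one batch.
    # Field names and rules are hypothetical examples only.
    from datetime import date

    def validate_record(rec: dict) -> list[str]:
        """Return a list of quality issues found in a single record."""
        issues = []
        for field in ("provider_id", "drg_code", "admission_date"):
            if not rec.get(field):
                issues.append(f"missing {field}")        # completeness check
        adm, dis = rec.get("admission_date"), rec.get("discharge_date")
        if adm and dis and dis < adm:
            issues.append("discharge before admission")  # consistency check
        return issues

    batch = [
        {"provider_id": "H001", "drg_code": "0511",
         "admission_date": date(2015, 3, 1), "discharge_date": date(2015, 3, 5)},
        {"provider_id": "", "drg_code": "0511",
         "admission_date": date(2015, 3, 7), "discharge_date": date(2015, 3, 2)},
    ]

    for i, rec in enumerate(batch):
        for issue in validate_record(rec):
            print(f"record {i}: {issue}")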
Possible solution examples
The following table describes several possibilities of the component set, which the designed system
could be composed from. These solutions are compiled without detailed knowledge of the
commercial products. The table is far from being complete and is introduced here to illustrate and
demonstrate basic system concepts. Particular components could be combined or exchanged to
others according to functional, performance and financial expectations and limits.
Solution: DI / DB / BI OLAP / Advanced DM / Reporting ("?" = no example given)
Pentaho community: Data Integration (Kettle) / MonetDB / BA Platform OLAP (Mondrian) / DM (Weka) / Reporting
Mixed community: Talend Open Studio / InfoBright ICE / Palo / R / JasperReports Server
IBM: InfoSphere DataStage / DB2 / Cognos TM1 / SPSS Modeler / Cognos BI Reporting
Oracle: Data Integrator / Database Standard / BI Server / ? / BI Publisher
Hewlett-Packard: ? / Vertica / ? / ? / ?
Microsoft: SSIS / SQL Server BI ed. / SSAS BI ed. / ? / SSRS
SAS: Data Management / ? / Enterprise BI server / Enterprise Miner / Visual Analytics
Use cases
The use case scenarios will be defined during the course of the project.
Requirement specification
The detailed requirement specification will be defined during the course of the project.
Appendices
Appendix A: Definitions, Acronyms and Abbreviations
The following list explains the terms and abbreviations mentioned in the text.
BI: Business Intelligence
DASTA: DAta exchange STAndard in the Czech healthcare system
DB: Database system (RDBMS, columnar store, in-memory, distributed, ...)
DI: Data Integration process
DM: Data Mining
DRG: Diagnosis Related Group
DRG Restart: Long-term complex project of restructuring the payment methodology for inpatient care in the Czech healthcare system
DWH: Data Warehouse
ETL: Extract, Transform, Load process
GIT: Open-source version control system
IHIS: Institute of Health Information and Statistics of the Czech Republic
ICD: International Classification of Diseases, 10th revision
OLAP: OnLine Analytical Processing
QA: Quality Assurance
SUKL: State Institute for Drug Control of the Czech Republic
TNM: Tumor staging classification
VZP: General public health insurance company of the Czech Republic