The SAFTINet Black Box
Functional Requirements
Version 1.0
22-July-2011

1 Purpose of the SAFTINet Black Box

The SAFTINet Black Box (SBB) is responsible for accepting data from multiple sources to create linked output data that is compliant with the SAFTINet grid node (see Figure 1). In its initial configuration, the SBB will support two linked sources: clinical data from clinical data partners and Medicaid data for patients associated with clinical data partners.

Figure 1: High-level view of SAFTINet Black Box inputs and outputs.

Key capabilities to be provided by the SBB are:
• Mapping site-specific codes and values into uniform codes and values based on the OMOP CONCEPT_IDs.
• Performing record linkage across two data sources, supporting both clear-text and encrypted identifiers.
• Creating random identifiers while leaving dates unaltered, resulting in a HIPAA limited data set.
• Supporting role-based administrative functions and pre-defined reports for evaluating data quality, source_to_concept_ID mapping performance, and record linkage performance.
• Enabling ad-hoc data queries and reporting using clinical data, source-to-concept-ID mapping data, and record linkage data.

The SBB will be developed in multiple releases. Section 7 provides an initial view of a proposed (subject-to-change) functional roadmap for SBB releases.

2 High Level System Design

The internal functional modules and processing workflows are illustrated in Figure 2. Data from clinical sources “flow” through the modules in the upper half of the figure. Medicaid data “flow” through the modules in the lower half of the figure. By design, Medicaid data are matched to patients that are present in the clinical data set; that is, Medicaid identifiers are matched to clinical identifiers. For this matching to work, clinical data must be fully processed PRIOR to processing Medicaid data. The execution of the Medicaid data processing is optional.
In other settings (e.g., DARTNet) that do not have a requirement to link a second data source, the clinical data flow can be used to create a SAFTINet-compliant grid XML data file from an import file that conforms to the SAFTINet ETL XML schema.

Figure 2: Key functional modules within the SAFTINet Black Box.

Description of key functional components

The tasks to be performed in each module are:

• XML Importer
   o Accepts data in SAFTINet-specified XML, checks for conformance to the SAFTINet ETL XML Schema, and inserts data into the SBB MySQL database. All import errors are logged for review, reporting, and analysis.
   o As additional methods for data import are developed, the XML Importer module will be generalized to be a more flexible Data Importer module.
• Map Source Codes to OMOP Concept_IDs
   o Maps local site codes into OMOP Concept_IDs using a SAFTINet-specific Source-To-Concept_ID mapping table.
   o Assigns Concept_ID = 0 to all mapping failures, per OMOP convention.
   o Supports data quality and mapping failure reporting.
• PPRL Encryption of Clear Text PHI
   o Performs PPRL data encryption using SAFTINet-developed encryption algorithms on all clear text HIPAA identifiers.
   o Inserts clear-text and encrypted identifiers into the SBB database (ID Mapping Table).
• GUID generation
   o Generates a large random GUID for each patient and each encounter ID.
   o Inserts GUID identifiers into the ID Mapping Table.
• Clear Text Record linkage
   o Performs clear text record linkage using a code base from Regenstrief or OpenEMPI (TBD).
   o Provides record linkage measures for non-matches, near-matches, and matches.
• Privacy Protecting Record Linkage
   o Performs PPRL record linkage using a code base developed by Vijay Thurimella based on methods developed by EA Durham at Vanderbilt University.
   o Provides record linkage measures for non-matches, near-matches, and matches.
• Map to GUID
   o Maps clear text identifiers (if present) to GUID identifiers in the ID Mapping Table.
   o Maps encrypted identifiers (if present) to GUID identifiers in the ID Mapping Table.
• JDBC Exporter
   o Creates output data in the SAFTINet-defined grid node JDBC database schema.
   o Appends data quality, mapping performance, and record-linkage performance measures as defined in the grid node JDBC database schema.
   o Appends site-specific source-to-concept_ID mappings and source mapping failures as defined in the grid node JDBC database schema.

3 Architectural Considerations

To meet AHRQ funding obligations and to support the wide distribution of technical work products, the SAFTINet team is committed to creating a system that can be widely deployed by other data contributors interested in participating in an OMOP-based distributed research network. To achieve these goals, the following architectural principles will be applied:

1. As much as possible, development will be done using open-source, open-license software. An exception will be made if open source development significantly increases the development cost or development time. The use of proprietary software must be limited to software that is widely used and easily available, such as Microsoft Windows and Microsoft SQL Server. Since many components to be incorporated into the Black Box have been written in Java, at least some components will require Java unless the time and cost of redeveloping these tools in another language can be justified.
2. Existing work by others that exists in the public domain will be leveraged as much as possible. For each component derived from the work of others, explicit permission to incorporate their work will be obtained. Acknowledgement of their contributions within the Black Box will be prominently displayed in the documentation to ensure proper intellectual and technical credits.
3. All development source code will be hosted in a readily available open-access source management system such as SourceForge or GForge. Development, technical, and user documentation will also be available from the same site.
4. Detailed technical and user documentation will be required to support end-user acceptance of the Black Box. SAFTINet may enlist client partners to assist in creating documentation that meets the needs of a broad community.
5. To minimize installation and configuration barriers, the SAFTINet Black Box will be distributed as a complete virtual machine. Detailed instructions for installation and configuration of the virtual machine will be provided with the software distribution. Based on the deployment plans for the grid node, the VM will be deployed using VMware and VMplayer.
4 Functional Specifications

4.1 System Administration
4.1.1 Users must log into the system using a unique user name and password.
4.1.2 User accounts are assigned to one and only one role.
4.1.3 Roles include root, SBB administrator, database administrator, data manager, and standard user.
4.1.4 Functional capabilities by role:
   Capabilities (columns): Full system access | Add users; Assign roles | Install S/W updates | Alter DBMS structures | Initiate data upload | Review log files | Create reports | View reports
   Roles (rows): Root | SBB Admin | Database Admin | Data Manager | Standard User
   [Role-by-capability matrix: the positions of the "x" marks are not recoverable from this copy.]
4.1.5 System software updates will be accomplished by installing a new virtual machine image.
4.1.5.1 Future functionality: System software updates will allow saving and restoring Black Box data and settings across upgrades.

4.2 Data upload
4.2.1 A role-based qualified user may initiate a data upload.
4.2.1.1 Future functionality: Data uploads can be scheduled with pre-configuration of processing options.
4.2.2 A qualified user may select a data file for uploading using standard file navigation dialogs.
4.2.3 The system validates the selected file for conformance with the SAFTINet XML schema.
4.2.4 A file that fails XML conformance testing will not be allowed to proceed. The system will provide information on the nature of the XML non-conformance.
4.2.5 The system will check the XML input file for a Data Source attribute. If Data Source attribute = “clinical” then all data tables will be cleared of existing data. The user will be warned prior to data tables being cleared.
This requirement implements a full data dump input model (to be replaced in later releases).
4.2.5.1 Future functionality: The system supports incremental data uploads.
4.2.6 The system will insert all data contained in the input file into the internal SBB DBMS.
4.2.7 The system will inform the user of a successful upload into the internal SBB DBMS, including the number of new records loaded into each table.
4.2.8 A qualified user will be able to clear all data tables. The user will be warned twice prior to executing this function.

4.3 Terminology Management
4.3.1 A role-based qualified user can add mappings from new site-specific source codes to existing Concept_IDs to the Source Code To Concept_ID file. New source code mappings will be marked for easy identification/querying.
4.3.2 Because Concept_IDs must be common across all members of the DRN, local users cannot enter new Concept_IDs into their local copy of the Source-To-Concept_ID mappings. Only SAFTINet central administration can add new Concept_IDs to the master list of Concept_IDs. SAFTINet central administration will work with the national OMOP terminology team to ensure new SAFTINet Concept_IDs are incorporated into the OMOP national terminology.
4.3.3 A qualified user can change an existing Source Code to Concept_ID mapping to point a local source code to a different existing Concept_ID. Requirement 4.3.2 prevents users from pointing an existing source code to a locally created Concept_ID. All changes will be marked for easy identification/querying. New local mappings will be marked for easy identification/querying.
4.3.4 New site-specific source code mappings to existing Concept_IDs (Requirements 4.3.1 and 4.3.3) will be incorporated into a master Source-To-Concept_ID centralized table maintained by SAFTINet.
4.3.5 A qualified user can upload a new set of Source Code to Concept_ID mappings provided by SAFTINet. Uploading a new set of mappings overwrites all existing mappings, including any local changes.
Requirement 4.3.4 will incorporate local mappings into the new master file / upload.
4.3.5.1 Future functionality: A centralized web service will enable terminology additions to be managed and distributed to all participating nodes automatically.
4.3.6 Source codes that do not match any existing source code in the current Source Code to Concept_ID mappings must be assigned Concept_ID = 0 to conform to the national OMOP terminology convention.
4.3.7 The current Source Code to Concept_ID mappings and markings (Requirements 4.3.1, 4.3.3) will be included in the Export XML and be available for querying on the grid. Local changes/additions to the current mappings will be detectable via a grid query.

4.4 ETL Processing Rules
Information on specific ETL processing rules and conventions will be added to this section as they are discovered. These rules cover the internal processing of input data. The format of the input ETL XML is described in a separate ETL specifications document. Rules on the construction of output XML are discussed in Section 4.8 of this document.
4.4.1 Provider table: Provider records with a Source_Care_Site_Identifier value that does not match an existing value in the Care_Site table will be assigned a value of 0 in the Provider.Source_Care_Site_Identifier attribute.
4.4.2 Care_Site table: Care site records with a Source_Organization_Identifier that does not match an existing value in the Organization table will be assigned a value of 0 in the Care_Site.Source_Organization_Identifier attribute.
4.4.2.1 Future functionality: Black Box will process address information in Location records to calculate geocoding longitude and latitude information and insert it into Location records.

4.5 PPRL Encryption
4.5.1 All clear text PHI data fields will be processed by domain-specific data normalization routines. Both original and normalized PHI fields will be maintained in the SBB database.
4.5.2 Normalized first name and last name clear text data fields will be encoded with the Double Metaphone and New York State Identification and Intelligence System (NYSIIS) Soundex systems, and the encoded values will be maintained in the SBB database.
4.5.3 Clear text PHI data fields, normalized first/last names, and all Soundex results will be individually encrypted using the PPRL Encryption routine developed by Vijay Thurimella based on the Bloom methodology described by EA Durham. Individual encrypted PHI data fields will be included in the SBB database along with the associated clear text fields.

4.6 GUIDs
4.6.1 Each patient and each visit encounter will be associated with a GUID that is guaranteed to be unique across all SAFTINet participating sites. A GUID is also guaranteed to not be reused for another patient or encounter within the current database. GUIDs may be reused across complete database reloads (Requirement 4.2.5).
4.6.1.1 Future functionality: With incremental data loads, GUIDs will remain stable across data loads.
4.6.2 All database tables that contain either clear text or encrypted patient or encounter identifiers will have the associated GUID inserted into the SBB database.

4.7 Record Linkage
4.7.1 The type of record linkage algorithm to be applied will be determined by the Identifiers Format XML attribute, a required XML element in the input data file. If Identifiers Format = “cleartext” then clear text record linkage will be used. If Identifiers Format = “PPRL”, then PPRL record linkage will be used. Any other value in this attribute is an error.
4.7.2 The system will check the XML input file for a Data Source attribute, a required XML element in the input data file. If Data Source attribute = “medicaid” then the appropriate record linkage algorithm as determined in Requirement 4.7.1 will be applied to each record.
4.7.3 Blocking will be used to minimize processing time. Records will be blocked using the month and day of the birthdate provided in the Medicaid data set.
Medicaid records without birthdates will be processed without blocking (MGK: Vijay – what do you think about this specification?)
4.7.4 For each record to be linked, link scores will be calculated using all identifiers that are non-null in both records being linked. Because identifier fields may be missing, both absolute and relative link scores will be calculated. The absolute link score is the weighted sum of all available identifying fields. The maximum link score is the weighted sum of all available identifying fields assuming a perfect match for all fields. The relative link score is the absolute link score divided by the maximum link score.
4.7.5 A minimal relative link score will be part of a Black Box configuration file.
4.7.6 Record pairs whose link scores do not exceed the minimal relative link score (Requirement 4.7.5) will not be matched.
4.7.7 All record pairs that exceed the minimal relative link score (Requirement 4.7.5) will be marked as “potential links.”
4.7.8 The single record pair with the highest relative link score that exceeds the pre-established minimal relative link score (Requirements 4.7.4 & 4.7.5) will be determined to be the link match for that record.
4.7.9 A log of all matched and unmatched Medicaid records will be created.
4.7.10 Record linkage using the Fellegi-Sunter algorithm requires pre-computation of linkage weights for each field that could be compared during linkage. Linkage weights are unique to each population and thus must be calculated using data from each participating site. A separate independent utility application will be created to calculate site-specific record weights. The requirements of this application will be described in a separate document. Linkage weights calculated by this application will be part of a Black Box configuration file.
4.7.10.1 Future functionality: Linkage weights can be calculated using the identifying data elements present in the current Black Box database.
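
Requirements 4.7.3 through 4.7.8 can be sketched in code as follows. This is a minimal illustration under stated assumptions, not the SBB implementation: the field names, the values in WEIGHTS, the threshold value, and the exact-equality comparison are all hypothetical, and a field that disagrees simply contributes zero (a full Fellegi-Sunter scorer would typically also apply disagreement weights). Python is used only for brevity.

```python
from collections import defaultdict
from datetime import date

# Hypothetical agreement weights per identifier; in the SBB these would come
# from the site-specific weight utility (Requirement 4.7.10).
WEIGHTS = {"first_name": 4.0, "last_name": 5.0, "birthdate": 6.0, "zip": 2.0}

# Hypothetical minimal relative link score from the configuration file (4.7.5).
MIN_RELATIVE_SCORE = 0.8


def block_key(record):
    """Blocking key: (month, day) of the birthdate, or None if it is missing."""
    birthdate = record.get("birthdate")
    return (birthdate.month, birthdate.day) if birthdate else None


def build_blocks(clinical_records):
    """Group clinical records by blocking key so each lookup is a dict access."""
    blocks = defaultdict(list)
    for record in clinical_records:
        key = block_key(record)
        if key is not None:
            blocks[key].append(record)
    return blocks


def link_scores(medicaid, clinical, weights=WEIGHTS):
    """Return (absolute, maximum, relative) link scores for one record pair."""
    # Only identifiers that are non-null in BOTH records are scored (4.7.4).
    shared = [f for f in weights
              if medicaid.get(f) is not None and clinical.get(f) is not None]
    maximum = sum(weights[f] for f in shared)
    absolute = sum(weights[f] for f in shared if medicaid[f] == clinical[f])
    relative = absolute / maximum if maximum else 0.0
    return absolute, maximum, relative


def link_record(medicaid, clinical_records, blocks, threshold=MIN_RELATIVE_SCORE):
    """Return (potential_links, best_match) for one Medicaid record."""
    key = block_key(medicaid)
    # No birthdate: compare against every clinical record (4.7.3).
    pool = clinical_records if key is None else blocks.get(key, [])
    potential = []
    for clinical in pool:
        relative = link_scores(medicaid, clinical)[2]
        if relative > threshold:  # must exceed the threshold (4.7.6, 4.7.7)
            potential.append((relative, clinical["id"]))
    # Highest relative score wins (4.7.8); None means the record is unmatched.
    best = max(potential, key=lambda pair: pair[0], default=None)
    return potential, best


# Made-up sample records, for illustration only.
clinical = [
    {"id": "C1", "first_name": "ANNA", "last_name": "LEE",
     "birthdate": date(1980, 3, 14), "zip": "80204"},
    {"id": "C2", "first_name": "ANNE", "last_name": "LEE",
     "birthdate": date(1975, 3, 14), "zip": None},
]
medicaid = {"id": "M1", "first_name": "ANNA", "last_name": "LEE",
            "birthdate": date(1980, 3, 14), "zip": None}

potential, best = link_record(medicaid, clinical, build_blocks(clinical))
print(potential, best)
```

In this sample both clinical records fall into the same block (14 March), but only C1 exceeds the threshold: the zip field is ignored because it is null in the Medicaid record, so the maximum score is 15 and C1 agrees on all three remaining fields, while C2 agrees only on last name (5 of 15). Dividing by the maximum is what keeps a record with missing identifiers comparable to a fully populated one.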
4.8 Data Output
4.8.1 Data output will be XML that follows the SAFTINet output XML schema. The output XML schema will be based on the SAFTINet-extended OMOP data model.
4.8.2 Location records linked to patient records will have all PHI attribute fields set to NULL in output XML. Location records linked to objects other than patient records (e.g., providers, organizations, care-sites) will include all PHI attribute values in output XML. All location records will include Zip_Code3 values.
4.8.3 Data output will include the current Source_to_Concept_ID mappings as specified in the XML schema.
4.8.4 Data output will include failed Source to Concept ID mappings as specified in the XML schema.
4.8.5 Data output will include information on record linkage results as specified in the XML schema.
4.8.6 Data output will include information on data quality as specified in the XML schema.

4.9 Reporting
4.9.1 Role-based authorized users will be able to execute pre-defined reports.
4.9.2 Reports may be viewed on the screen, printed to a PDF file, or printed to a user-configured networked printer.
4.9.3 All available reports will be visible and executable by user roles with reporting enabled.
4.9.3.1 Future functionality: Reports will be associated with roles. Only reports associated with the current user’s role will be visible and executable by the current user.

4.10 Report Creation
4.10.1 A role-based qualified user will be able to create an ad-hoc report using a graphical report writing software (such as iReport/JasperReports).
4.10.1.1 Future functionality: New reports will be assigned one or more roles that can view and execute the report.
4.10.2 User-defined reports can be exported and imported to preserve report definitions across system upgrades and new virtual machines.

5 Development Considerations

Rapid applications development (RAD) methods will be used with 2-week development sprints. Initial development will focus on processing clinical data from end-to-end.
Implementation of Medicaid data processing will occur after clinical data processing reaches an initial alpha state. Depending on resources, Medicaid data processing can be developed in parallel with clinical data processing. Both clinical and Medicaid data processing workflows will use a number of shared modules, including XML data import, Source_to_Concept-ID mapping, PHI encryption, and XML data export. Modules unique to clinical data processing include GUID generation. Modules unique to Medicaid data processing include clear-text record linkage, PPRL linkage, and GUID mapping. When possible, common functions should use common modules.

A proposed sequence of 2-week development sprints:

5.1 Clinical data processing
5.1.1 Import XML for a single table (patient). Generate an error for non-conforming XML.
5.1.2 Import XML for a second table (visit). Map one data element using the Source_to_Concept-ID file.
5.1.3 Import XML for the provider and location tables. Map all data elements using the Source_to_Concept-ID file.
5.1.4 Implement a method to add new Source_to_Concept-ID mappings. Implement a method that captures unmapped source codes. Assign Concept-ID = 0 to unmapped source codes.
5.1.5 Import XML for all remaining input tables; generate export XML from all input tables. Generate export XML for Source_to_Concept-ID mappings and unmapped source codes.
5.1.6 Implement PPRL encryption for all clear-text PHI. Implement the ID mappings table.
5.1.7 Implement GUID generation. Add GUIDs to the ID mappings table.
5.1.8 Implement JDBC data export. Implement JDBC export of mapping performance data. Implement JDBC export of local Source-To-Concept_ID mappings.
5.1.9 Unit test the clinical data processing end-to-end workflow. Write installation, configuration, and end-user documentation.

5.2 Medicaid data processing
5.2.1 Import XML for Medicaid data elements.
5.2.2 Implement Source_to_Concept-ID mappings for Medicaid data.
Implement PPRL encryption for all clear-text PHI (if present).
5.2.3 Implement clear-text record linkage and its scoring algorithm. Implement the linkage performance log.
5.2.4 Implement PPRL linkage and its scoring algorithm. Implement the linkage performance log.
5.2.5 Implement GUID matching. Generate the JDBC data export with Medicaid data.
5.2.6 Implement JDBC data export, including data quality, mapping performance, and record linkage performance data.
5.2.7 Unit test the Medicaid data processing end-to-end workflow. Write installation, configuration, and end-user documentation.

5.3 Administrative functions
5.3.1 Implement role-based login.
5.3.2 Implement role-based access to system functions.
5.3.3 Implement the report generation system. Implement initial pre-specified reports.
5.3.4 Implement additional pre-specified reports.
5.3.5 Implement ad-hoc report generation.
5.3.6 System integration testing.
5.3.7 Write installation, configuration, and end-user documentation.

6 Other considerations

What open source licensing arrangement do we want to use? Should we talk to the technology transfer office to see if they have a preference?

7 Proposed Development Roadmap

Release 1   Clinical data processing end-to-end with source code mapping, encryption, GUID assignment, and ID mappings table
Release 2   Medicaid data processing end-to-end with source code mapping, encryption, record linkage, GUID mapping
Release 3   Pre-defined reports; ad-hoc reporting system
Release 4   Record linkage management & review
Release 5   Centralized terminology management and distribution
Release 6   Incremental data loads; save system data/state information between system upgrades
Release 7   Automated / scheduled data import/export