NJ CDISC UG 2013-09-19 Merck -- Ganesh

Experience and process for
collaborating with an outsource
company to create the define file.
Ganesh Sankaran
TAKE Solutions
• Typical work flow when sponsors create the SDTM /
ADaM in-house and collaborate with vendors for
the Define files
• Define.xml Sections
• Define.xml Process - How do we go about
extracting the information from the data &
documents provided ..?
• Validating Define.xml & the typical Checks
• Common Issues
• Conclusion – How soon should the sponsor start..?
Typical Work flow collaborating with a Vendor for creating
Define files
1. Sponsor provides the documents & draft data
2. Run the compliance / structure checks on the data
3. Generate draft Define.xml & run the compliance checks
4. Summarize the issues/findings and deliver the draft define for review
5. Sponsor reviews the findings, updates the specification / dataset, and sends the updated specification / XPTs back to the vendor for a final delivery (Pass II)
6. Run the compliance checks and regenerate the final version of the Define (Pass II)
Inputs that are provided..
• Annotated Case Report Form
• Mapping Specification documents
• SAS Datasets / XPTs
• Sponsor Controlled Terminology Documents, if
• Protocol, if Trial Design domains are to be produced
• Data Guide / Supplemental Document
Define.XML Section
TOC – Metadata of Datasets
blankcrf (Annotated CRF)
Variable Level Metadata
Value Level Metadata
Controlled Terminology
Computational Algorithms
Supplemental Data Definition Document
Define.XML Section (Not visible
through the Style Sheet)
• xmlns - Identifies the default namespace for
this document
• ODMVersion - Identifies the ODM version
that underlies the schema for the Define.xml
• FileOID - unique identifier for this file.
• CreationDateTime - When the specific
version of the define.xml file was created.
• StudyName, StudyDescription,
ProtocolName – Study level Information
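To show where these attributes sit, here is a minimal Python sketch that builds a bare ODM root element with the standard library. The namespace URI and ODMVersion value are the ones commonly seen in ODM 1.2 based define files and are assumptions here. The function name `build_odm_root`, the FileOID and the study values are made-up placeholders.

```python
import xml.etree.ElementTree as ET
from datetime import datetime, timezone

ODM_NS = "http://www.cdisc.org/ns/odm/v1.2"  # assumed: ODM 1.2 namespace


def build_odm_root(file_oid, study_oid, study_name, description, protocol):
    """Build the ODM root with the attributes the style sheet does not display."""
    ET.register_namespace("", ODM_NS)  # serialize ODM as the default namespace
    root = ET.Element(f"{{{ODM_NS}}}ODM", {
        "ODMVersion": "1.2",                  # assumed ODM version
        "FileOID": file_oid,                  # unique identifier for this file
        "FileType": "Snapshot",
        "CreationDateTime": datetime.now(timezone.utc).isoformat(timespec="seconds"),
    })
    study = ET.SubElement(root, f"{{{ODM_NS}}}Study", {"OID": study_oid})
    global_vars = ET.SubElement(study, f"{{{ODM_NS}}}GlobalVariables")
    for tag, text in (("StudyName", study_name),
                      ("StudyDescription", description),
                      ("ProtocolName", protocol)):
        ET.SubElement(global_vars, f"{{{ODM_NS}}}{tag}").text = text
    return root


# Placeholder values, for illustration only
root = build_odm_root("DEFINE.STUDY001", "STUDY001", "STUDY001",
                      "Example study", "STUDY001")
xml_text = ET.tostring(root, encoding="unicode")
```

The dataset, variable and codelist metadata would then be added under a MetaDataVersion element inside Study.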
Define.XML Components and how do
we generate them…
• MetaData Generation –
• DOMAIN Level
• VALUE Level
• ORIGIN, CODELIST, Comments and Computational Algorithms
• blankcrf, Data Guide / Supplemental Docs
• Generate Define.xml
• Validate Define files
Define.XML process
Input Sheet for Define.XML Generation
DOMAIN Level Input – A SAS-based macro utility creates the inputs for this sheet
based on the datasets provided.
VARIABLE METADATA – The variable-level metadata input sheet is populated by
reading the metadata of the SAS datasets provided.
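The deck describes a SAS macro utility for this step. The Python sketch below only illustrates the idea, assuming the dataset has already been loaded into memory as a list of row dictionaries plus a dictionary of variable labels. The function name `variable_metadata` and the sheet column names are hypothetical.

```python
def variable_metadata(domain, rows, labels):
    """Return one variable-level metadata record per column, in dataset order."""
    if not rows:
        return []
    sheet = []
    for order, name in enumerate(rows[0], start=1):
        values = [r[name] for r in rows if r[name] is not None]
        is_num = bool(values) and all(isinstance(v, (int, float)) for v in values)
        # SAS stores numerics in 8 bytes by default; character length is the widest value
        length = 8 if is_num else max((len(str(v)) for v in values), default=1)
        sheet.append({
            "DOMAIN": domain,
            "ORDER": order,
            "VARIABLE": name,
            "LABEL": labels.get(name, ""),
            "TYPE": "Num" if is_num else "Char",
            "LENGTH": length,
        })
    return sheet


# Tiny made-up DM extract, for illustration only
dm_rows = [
    {"STUDYID": "S1", "USUBJID": "S1-001", "AGE": 34},
    {"STUDYID": "S1", "USUBJID": "S1-002", "AGE": 51},
]
dm_labels = {"STUDYID": "Study Identifier",
             "USUBJID": "Unique Subject Identifier",
             "AGE": "Age"}
dm_sheet = variable_metadata("DM", dm_rows, dm_labels)
```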
Input Sheet for Define.XML Generation (contd)
ORIGIN information is extracted from the annotations and mapping specification
provided. OIDs are assigned here for the variables that need a CODELIST,
COMPUTATION ALGORITHM or VALUELIST.
Based on the OIDs assigned in the VARIABLE LEVEL sheet, the VALUE LEVEL and
CODELIST input sheets are generated by reading the data and the associated
codelist files.
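As a rough sketch of the data-driven part of this step, the Python below collects the distinct values found in the data for every variable that carries a codelist OID. It assumes a hypothetical `CODELIST_OID` column in the variable-level sheet and leaves out the merge with the associated codelist files.

```python
def codelist_inputs(variable_sheet, rows):
    """Collect distinct data values for each variable that has a codelist OID."""
    codelists = {}
    for var in variable_sheet:
        oid = var.get("CODELIST_OID")
        if not oid:
            continue  # no codelist assigned to this variable
        seen = codelists.setdefault(oid, [])
        for row in rows:
            value = row.get(var["VARIABLE"])
            if value not in (None, "") and value not in seen:
                seen.append(value)
    return {oid: sorted(values) for oid, values in codelists.items()}


# Made-up inputs, for illustration only
sheet = [
    {"VARIABLE": "SEX", "CODELIST_OID": "CL.SEX"},
    {"VARIABLE": "AGE", "CODELIST_OID": None},
]
data = [{"SEX": "M", "AGE": 34}, {"SEX": "F", "AGE": 51}, {"SEX": "M", "AGE": 47}]
codelist_sheet = codelist_inputs(sheet, data)
```

Values found in the data but absent from the sponsor's controlled terminology would be a natural finding to report back at this point.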
Input Sheet for Define.XML Generation (contd)
• Value Level Input
• Codelist / Computation Methods Input
External Documents – blankcrf & Data Guide
• The Annotated Case Report Form and supplemental documents such as the
Data Guide are linked to the define.xml
• ORIGIN page numbers presented as part of the variable-level
metadata must be hyperlinked to the corresponding CRF pages
attached to the Define file.
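One way to picture the page hyperlinks is sketched below in Python. It assumes the ORIGIN text looks like "CRF Page 12" or "CRF Pages 12, 14" and that the annotated CRF is delivered as blankcrf.pdf. The function name `crf_page_links` is hypothetical, and `#page=N` is the usual PDF open parameter for jumping to a page.

```python
import re


def crf_page_links(origin, crf_file="blankcrf.pdf"):
    """Turn an ORIGIN such as 'CRF Pages 12, 14' into one link target per page."""
    if not origin.upper().startswith("CRF"):
        return []  # derived / assigned origins have no CRF page to link to
    return [f"{crf_file}#page={page}" for page in re.findall(r"\d+", origin)]


links = crf_page_links("CRF Pages 12, 14")
```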
Input Sheet for Define.XML Generation (contd)
• Once the Domain Level, Variable Level, Value Level and Codelist sheets
are created, the external documents are linked, the ORIGIN,
COMPUTATIONAL ALGORITHM & External Dictionary information is
updated, and the inputs are reviewed, the Define.xml can be generated
Validation Checks
• Structural Checks:
Type of Checks on the Metadata
1. Domain Label mismatch
2. Variable Label mismatch
3. Data type mismatch
4. Missing Expected & Required variables
5. Required / Expected variables with NULL values for all records
6. Non-standard SDTM variables
7. Variable names in lower case
8. Variable order mismatch
9. Variables with formats
10. Permissible variables present with NULL values for all records
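A few of the metadata checks listed above can be sketched in Python as follows. This is only an illustration: the `standard` dictionary is a small hand-made stand-in for the implementation guide metadata, and the function name `check_metadata` and the message wording are hypothetical.

```python
def check_metadata(rows, labels, standard):
    """Run a handful of metadata checks on one dataset and return the findings."""
    findings = []
    present = list(rows[0]) if rows else []
    for name in present:
        if name != name.upper():
            findings.append(f"{name}: variable name in lower case")
        spec = standard.get(name.upper())
        if spec is None:
            findings.append(f"{name}: non-standard variable")
            continue
        if labels.get(name) != spec["label"]:
            findings.append(f"{name}: variable label mismatch")
        if all(r[name] in (None, "") for r in rows):
            findings.append(f"{name}: {spec['core']} variable is NULL for all records")
    upper_present = {n.upper() for n in present}
    for name, spec in standard.items():
        if spec["core"] in ("Req", "Exp") and name not in upper_present:
            findings.append(f"{name}: missing {spec['core']} variable")
    return findings


# Hand-made stand-in for the standard metadata, for illustration only
standard = {
    "STUDYID": {"label": "Study Identifier", "core": "Req"},
    "USUBJID": {"label": "Unique Subject Identifier", "core": "Req"},
    "ARMCD": {"label": "Planned Arm Code", "core": "Req"},
    "DMDTC": {"label": "Date/Time of Collection", "core": "Perm"},
}
rows = [
    {"STUDYID": "S1", "usubjid": "S1-001", "DMDTC": ""},
    {"STUDYID": "S1", "usubjid": "S1-002", "DMDTC": None},
]
labels = {"STUDYID": "Study ID",
          "usubjid": "Unique Subject Identifier",
          "DMDTC": "Date/Time of Collection"}
findings = check_metadata(rows, labels, standard)
```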
Validate Define.XML
• A valid Define.xml should be well formed and conform to the XML
schemas. It should also reference the correct versions of the CDISC standards.
Sample Validation Checks
1. XML is well formed
2. All required elements are included and not empty
3. The OID attribute must be unique within a single MetaDataVersion – no duplicate def:leaf or def:ComputationMethod elements
4. No duplicates in ItemGroupDef, ItemDef, ItemRef, Study, CodeList elements etc.
5. Invalid data type value for CodeList elements
6. CodedValue must be unique within a single CodeList
7. Invalid codelist for variable, non-extensible CT
8. Invalid data type value for ItemDef elements
9. Invalid ‘FileType’, ‘MedDRA’ values
10. Invalid ‘Repeating’, ‘Mandatory’ values
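Three of these checks (well-formedness, duplicate OIDs, duplicate coded values within a codelist) can be sketched with the Python standard library as below. This is a minimal illustration, not a replacement for a schema validator. The function name `validate_define` and the message wording are hypothetical, OIDs are compared per element type, and `CodedValue` is the ODM attribute name on CodeListItem.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def local(tag):
    """Strip the '{namespace}' prefix from an element tag."""
    return tag.rsplit("}", 1)[-1]


def validate_define(xml_text):
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError as exc:
        return [f"XML is not well formed: {exc}"]
    findings = []
    # Duplicate OIDs, compared per element type
    oid_counts = Counter((local(el.tag), el.get("OID"))
                         for el in root.iter() if el.get("OID"))
    for (tag, oid), count in oid_counts.items():
        if count > 1:
            findings.append(f"duplicate OID {oid} on {tag}")
    # Duplicate coded values within a single CodeList
    for codelist in root.iter():
        if local(codelist.tag) != "CodeList":
            continue
        values = Counter(item.get("CodedValue") for item in codelist
                         if local(item.tag) == "CodeListItem")
        for value, count in values.items():
            if count > 1:
                findings.append(
                    f"duplicate CodedValue {value} in CodeList {codelist.get('OID')}")
    return findings


# Deliberately faulty fragment, for illustration only
SAMPLE = """<ODM xmlns="http://www.cdisc.org/ns/odm/v1.2">
  <Study OID="S1"><MetaDataVersion OID="MDV.1" Name="x">
    <ItemDef OID="IT.DM.SEX" Name="SEX"/>
    <ItemDef OID="IT.DM.SEX" Name="SEX"/>
    <CodeList OID="CL.SEX" Name="SEX" DataType="text">
      <CodeListItem CodedValue="M"/>
      <CodeListItem CodedValue="M"/>
    </CodeList>
  </MetaDataVersion></Study>
</ODM>"""
sample_findings = validate_define(SAMPLE)
```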
Common Issues
• Origin is ‘CRF’, but not annotated. ORIGIN ‘Derived’ but annotated in the
• Key variables not properly defined.
• When presenting custom domains, the domain assumptions should be followed.
Sometimes custom domains are derived without a TOPIC variable.
• Subjects present in external data (LB/EG) but not populated in the DM
domain. All subjects must be present in the DM domain.
• One-to-one relationship missing across some of the paired variables like
• Common variables across different domains have different ORIGIN /
derivation. If the derivation is the same across domains, “Copied from ADSL.XX” can be used.
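The rule that every subject must be present in DM lends itself to a simple automated check. A minimal Python sketch, assuming each domain is available as a list of row dictionaries keyed by USUBJID (the function name `subjects_missing_from_dm` is hypothetical):

```python
def subjects_missing_from_dm(dm_rows, other_domains):
    """Return, per domain, the subjects that appear there but not in DM."""
    dm_subjects = {r["USUBJID"] for r in dm_rows}
    missing = {}
    for domain, rows in other_domains.items():
        absent = sorted({r["USUBJID"] for r in rows} - dm_subjects)
        if absent:
            missing[domain] = absent
    return missing


# Made-up data, for illustration only
dm = [{"USUBJID": "S1-001"}, {"USUBJID": "S1-002"}]
others = {
    "LB": [{"USUBJID": "S1-001"}, {"USUBJID": "S1-003"}],
    "EG": [{"USUBJID": "S1-002"}],
}
missing = subjects_missing_from_dm(dm, others)
```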
Common Issues (contd)
• Generally, XPT files up to 1 GB in size are fine. If an XPT file exceeds
1 GB, it must be split into smaller datasets, none exceeding 1 GB (Study
Data Specifications).
• Split files should have the same metadata structure so that
concatenation / merging of the split datasets is feasible. Both the
smaller split files & the larger (non-split) file should be included.
• Split datasets and the method applied should be documented in the
data guide.
• If a linear approach is not followed, consistency between the
ADaM and SDTM sources must be ensured.
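The splitting rule for oversized XPT files can be pictured with the rough Python sketch below. It assumes a fixed record size in bytes, ignores file header overhead, and takes 1 GB as 1024**3 bytes, all of which are simplifications. The function names are hypothetical.

```python
def split_rows(rows, record_bytes, max_bytes=1024 ** 3):
    """Split rows into consecutive parts whose estimated size stays within max_bytes."""
    per_file = max(1, max_bytes // record_bytes)
    return [rows[i:i + per_file] for i in range(0, len(rows), per_file)]


def same_structure(parts):
    """Check that every non-empty part has the same variables in the same order."""
    columns = [list(part[0]) for part in parts if part]
    return all(c == columns[0] for c in columns)


# Small limit so the example actually splits, for illustration only
lb_rows = [{"USUBJID": f"S1-{i:03d}", "LBSEQ": i} for i in range(1, 8)]
parts = split_rows(lb_rows, record_bytes=100, max_bytes=300)
rejoined = [row for part in parts for row in part]
```

Concatenating the parts in order should reproduce the non-split dataset, which is a useful check before delivery.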
Common Issues (contd)
• ADaM, when derived in a parallel stream, might require
extra effort to ensure traceability & data lineage.
Conclusion
• Finalize the scope of the work being outsourced / to be performed by the
• Explain the process being followed and agree on a common format for the
exchange of documents that could expedite generation of the Define files.
• When working across a family of similar studies within the same
indication, look for better efficiency after a couple of
iterations/studies.
• Identify the Vendor(s) at least three months before you expect the first
Define.XML to be published. If possible, do a pilot or DEMO define.
Thank You