Sizing of SDTM Variables and Datasets

Two interrelated sizing issues have been identified for further clarification from the FDA:

   o Sizing of Variables/Data Elements/Columns
   o Overall File/Dataset Size

Because these two issues are interrelated, a single discussion document addresses both of them. The following references have contributed to this document:

   1. SDTMIG 3.1.3: SDTM IG (V3.1.3/2012-07-16)
   2. SDTM-MSG: SDTM Metadata Submission Guideline (V1.0/2011-12-31)
   3. SDS: Study Data Specifications (V2.0/July 18, 2012)
   4. CDER-CDS: CDER Common Data Standards Issues Document (V1.1/December 2011)
   5. PBO: 2013PhUSEBreakoutCABVP-CBER-CDER, FDA/PhUSE CSC (March 2013)
   6. PWG1: Workgroup1-OpenCDISC_Updates, FDA/PhUSE CSC (March 2013?)

These documents discuss these issues in various ways, sometimes in more than one location in a given document. To provide full context, each document was reviewed and the relevant text was extracted into the appendices of this document. To keep the discussion concise, the main body summarizes the details and highlights the differences.

1. Sizing of Variables/Data Elements/Columns

See Appendix A for a full accounting of the text from various documents related to this issue. There are three main categories related to variable length to consider. For ease of review, each of these categories is summarized individually:

1. Column size constraints due to general CDISC and SAS Transport v5 requirements

Various columns are identified as having a maximum length. This is mostly due to SAS Transport V5 requirements or, in the case of vertical data structures, to values that could have constraints if the dataset were transformed to a horizontal structure (e.g., --TESTCD, --TEST).
For values that could become variable names and labels, the references imply that setting these to the maximum length is appropriate (e.g., SDTMIG states “--TESTCD and IDVAR will never be more than 8, so length can always be set to 8”).

For domains split into multiple datasets, sizing decisions should be made prior to implementing the split, and variables should be consistently sized in each dataset.

   o This holds true for SUPPQUAL datasets within a domain (e.g., SUPPLBHE and SUPPLBCH).
   o This holds true for all FA-related datasets (i.e., all FA-related datasets, regardless of parent domain, must have consistent variable lengths, FA being its own domain).

NB: My (C. Radovsky) recollection is that PhUSE minutes from the meeting in March indicated that FDA was OK with resizing per dataset, but I am not able to find these minutes.

Possible Discussion Topics:

   o Should any of these variables be subject to length evaluations? All or some (e.g., --TESTCD length of 8 OK, but not --TEST length of 40)?

2. Column size for variables associated with a Codelist

The references related to Codelist length are recent developments, few, and consistent:

   o Codelist variable length should not be greater than the longest term in the Codelist.
   o FDA material doesn’t call out this issue explicitly. It seems bundled under the general statement that variable length should not exceed value length.

Possible Discussion Topics:

   o The approach used to represent a Codelist for submission can impact decisions on length. Consolidating to a minimum set of codelists may reduce development overhead within sponsor organizations, but increases the size of the codelist variables. (Both consolidated and unconsolidated approaches are technically compliant.)

3. Column size issues not otherwise constrained by the prior two categories

The references related to variable length previously constrained only by the 200-character limit are also recent developments, few, and consistent:

   o Length should be based on data values and not set to 200 characters by default.
   o A small amount of padding is acceptable.
   o Working Group 1 discussions addressed the issue as follows: Severity based on “excessive” chars in var length
        o 2-5 Chars – Information
        o 6-20 Chars – Warning
        o >20 Chars – Error

Possible Discussion Topics:

   o While the WG1 discussions came up with a reasonable interpretation of FDA language, it was “arbitrarily” reasonable, and met with resistance in the industry. Is this more a communication challenge than an issue with the logic rule?
   o Given that sponsors need to address all rule violations with a severity of Warning and Error, is there benefit in distinguishing the severity between the two?
   o Consider phased implementation (e.g., warning/error when above 20 for the first year, warning/error when above 5 after that)?

2. Overall File/Dataset Size

See Appendix B for a full accounting of the text from various documents related to this issue. References to file/dataset size and splitting decisions can be summarized as a whole:

   o One GB is the target for maximum file size, with some flexibility.
   o As noted under the variable length section, variable sizing issues need to be addressed prior to decisions on splitting datasets, and variable length should be consistent across split datasets.
   o As with variable padding, a scale of severity is identified in some material, and is likely subject to the same considerations.
   o CDISC references address splitting based on categorical variables (e.g., --CAT/--SCAT). FDA references note this, as well as other mechanisms such as grouping by subject number, but these other mechanisms are identified as inferior.
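The order of operations summarized above (size the variables first, then split by category, keeping lengths identical across the splits) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: records are held as plain dictionaries rather than SAS datasets, the function names and the sample LB rows are hypothetical, and the treatment of 0-1 excess characters as "no finding" is an assumption, since the WG1 scale quoted in this document starts at 2 characters.

```python
from collections import defaultdict


def max_used_lengths(records, char_vars):
    """Longest value actually present in each character variable, over the whole domain."""
    return {
        var: max((len(rec.get(var) or "") for rec in records), default=0)
        for var in char_vars
    }


def padding_severity(allotted, used):
    """Classify excess variable length on the WG1 scale quoted in this document."""
    excess = allotted - used
    if excess > 20:
        return "Error"
    if excess >= 6:
        return "Warning"
    if excess >= 2:
        return "Information"
    return None  # assumption: 0-1 excess characters produce no finding


def split_by_category(records, cat_var):
    """Group records by a category variable (e.g., LBCAT), which must not be null."""
    splits = defaultdict(list)
    for rec in records:
        category = rec.get(cat_var)
        if not category:
            raise ValueError(f"{cat_var} must not be null when splitting")
        splits[category].append(rec)
    return dict(splits)


# Hypothetical LB records, for illustration only.
lb = [
    {"LBTESTCD": "ALT", "LBTEST": "Alanine Aminotransferase", "LBCAT": "CHEMISTRY"},
    {"LBTESTCD": "HGB", "LBTEST": "Hemoglobin", "LBCAT": "HEMATOLOGY"},
]

# Size once on the unsplit domain, then split; every split reuses `lengths`,
# so variables of the same name keep the same length in each dataset.
lengths = max_used_lengths(lb, ["LBTESTCD", "LBTEST", "LBCAT"])
splits = split_by_category(lb, "LBCAT")
```

Computing the lengths before the split is the point of the sketch: if each split were sized on its own, LBTEST could end up with a different length in the chemistry and hematology datasets.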
Possible Discussion Topics:

   o Is there a rationale/benefit to maintain references to splitting by less desirable methods?
   o Per above, a scale of severity is identified in some material, and is likely subject to the same considerations as variable length severity.
   o The FDA CDER Common Data Standards Issues Document mentions the use of a /SPLIT subdirectory, which is not identified in other material. That said, it is a critical piece of information: it not only provides a clear mechanism for submitting both split and non-split datasets at the same time, but also addresses procedural challenges with some software (e.g., to my recollection, WebSDM doesn’t handle having both versions of the data in a single folder).

Appendix A

Following the main document, details related to variable length are addressed in three main categories:

1. Column size constraints due to general CDISC and SAS Transport v5 requirements
   a. SDTMIG 3.1.3 (p. 28, section 4.1.2.1)
      i. Values of --TESTCD must be limited to 8 characters . . .
      ii. QNAM serves the same purpose as --TESTCD within supplemental qualifier datasets, and so values of QNAM are subject to the same restrictions as values of --TESTCD.
      iii. ETCD (the companion to ELEMENT) and TSPARMCD (the companion to TSPARM) are limited to 8 characters. . . .
      iv. ARMCD is limited to 20 characters and does not have special character restrictions.
   b. SDTMIG 3.1.3 (p. 35, linked section 4.1.2.9)
      i. The maximum SAS Version 5 character variable length of 200 characters should not be used unless necessary.
      ii. Sponsors should consider the nature of the data, and apply reasonable, appropriate lengths to variables. For example:
         o The length of flags will always be 1
         o --TESTCD and IDVAR will never be more than 8, so length can always be set to 8
         o The length for variables which use controlled terminology can be set to the length of the longest term.
   c. SDTMIG 3.1.3 (p. 50, section 4.1.5.3.1)
      i. . . . the length of --TEST is normally limited to 40 characters to conform to the limitations of the SAS V5 Transport format currently used for submission datasets.
      ii. The convention above should also be applied to the Qualifier Value Label (QLABEL) in Supplemental Qualifiers (SUPP--) datasets. IETEST values in IE and TI are exceptions to the above 40-character rule and are limited to 200 characters since they are not expected to be transformed to a column labels.
   d. SDTM-MSG (p. 20, section 5.1.2)
      i. ARMCD can be up to 20 characters in length with no special character restrictions. The arm name (ARM) variable length must not exceed 200 characters due to SAS version 5 transport requirements.
   e. SDTM-MSG (p. 20, section 5.1.3)
      i. The full inclusion/exclusion criteria text should be stored in IETEST and corresponding metadata [up to 200 characters].
   f. SDTMIG 3.1.3 (p. 22, section 4.1.1.7)
      i. Variables of the same name in separate datasets should have the same SAS Length attribute to avoid any difficulties if the sponsor or FDA should decide to append datasets together.
   g. SDTM-MSG (p. 31, section 6.3)
      i. Each of the value-level QNAMs (defined by the “Name” attribute) has a different length, a label, and associated controlled terminology (CodeLists).
   h. SDS (p. 5, section 2.3)
      i. Variable Name has a Maximum Length of 8.
      ii. Variable Descriptive Label has a Maximum Length of 40.
      iii. Dataset Label has a Maximum Length of 40.
   i. SDS (p. 5, section 2.4)
      i. For all datasets, in order to significantly reduce dataset file sizes, the allotted character column length/size for each column should be the maximum length used. Lengths/sizes of columns should not arbitrarily be set to 200.
      ii. An inclusion of a small amount of padding to column width may be acceptable as long as this doesn’t result in significant increases in file size.
   j. CDER-CDS (p. 12)
      i. For both CDISC and non-CDISC datasets, in order to significantly reduce dataset file sizes, the allotted character variable length/size for each column in a dataset should be the maximum length used. Lengths/Sizes of columns should not arbitrarily be set to 200.
      ii. Alternative solutions to this problem that involve some inclusion of a small amount of padding to column width may be acceptable as long as they don’t result in significant increases in file size due to the padding.

2. Column size for variables associated with a Codelist
   a. SDTMIG 3.1.3 (p. 35, linked section 4.1.2.9)
      i. The length for variables which use controlled terminology can be set to the length of the longest term.

3. Column size issues not otherwise constrained by the prior two categories
   a. SDTMIG 3.1.3 (p. 95, section 6.4.2)
      i. [For split FA datasets]. . . Variables of the same name in multiple datasets should have the same SAS Length attribute.
   b. SDS (p. 5)
      . . . the allotted character column length/size for each column should be the maximum length used. Lengths/sizes of columns should not arbitrarily be set to 200.
      . . . An inclusion of a small amount of padding to column width may be acceptable as long as this doesn’t result in significant increases in file size.
   c. CDER-CDS (p. 12)
      For both CDISC and non-CDISC datasets, in order to significantly reduce dataset file sizes, the allotted character variable length/size for each column in a dataset should be the maximum length used.
      . . . Alternative solutions to this problem that involve some inclusion of a small amount of padding to column width may be acceptable as long as they don’t result in significant increases in file size due to the padding.
   d. PBO (slide 2)
      Ideally, column widths should only be the maximum length used in that column across records
      Solution: Should we use extra characters used or %? Exceptions?
   e. PBO (slide 32)
      SD0017 (others): Variable lengths
      Description:
         o The value of xxxx should be no more than yy characters in length --TEST, --TESTCD, TSPARM, ARMCD,
         o Are there extenuating circumstances to these rules where character limits may not be enough? Adjust OpenCDISC checks to accommodate?
         o What’s the difference between SD0017 and SD1058? Duplicate rules?
   f. PWG1 (slide 9)
      Variable length is too long for actual data
      i. Severity based on “excessive” chars in var length
      ii. 2-5 Chars – Information
      iii. 6-20 Chars – Warning
      iv. >20 Chars – Error

Appendix B

Following the main document, details related to Overall File/Dataset Size and resulting split domains are addressed as a whole:

1. SDTMIG 3.1.3 (p. 22, section 4.1.1.7)
   a. Sponsors may choose to split a domain of topically related information into physically separate datasets. In such cases, one of two approaches should be implemented:
      1) For a domain based on a general observation class, splitting should be according to values in --CAT (which must not be null).
      2) The Findings About (FA) domain (Section 6.4) can be split either by --CAT values (per the bullet above) or relative to the parent domain of the value in --OBJ. For example, FACM would store Findings About CM records. See Section 6.4.2 for more details.
2. SDTM-MSG (p. 24, section 5.5.1.3)
   a. If the decision had been made to split a submission dataset, it is recommended that the sponsor communicate with their review division regarding exactly what needs to be included in the submission, i.e. the split datasets or both the split datasets and the unsplit datasets.
3. SDS (p. 5, section 2.2)
   a. Each dataset is provided in a single transport file. The maximum size of an individual dataset is dependent on many factors. In general, datasets greater than 1 GB in size should be split into smaller datasets, each no larger than 1GB in size.
      Datasets divided to meet the maximum size restrictions should contain the same variable presentation so they can be easily concatenated.
   b. Datasets which are divided should be clearly named to aid the reviewer in reconstructing the original dataset, e.g., xxx1, xxx2, xxx3, etc. The files that have been divided and need to be concatenated should be noted in the data definition document. This documentation should identify the range of subject numbers (or other criteria used for division) in the label for each of the divided datasets. . . .
4. CDER-CDS (p. 9)
   a. LB Domain (Laboratory): The size of the LB domain is often quite large and can exceed the reviewers’ ability to open the file using standard-issue computers. This size issue can be addressed by splitting the large LB dataset into smaller data sets according to LBCAT and LBSCAT, using LBCAT for initial splitting. If the size is still too large, then use LBSCAT for further splitting. For example: use the dataset name lbc.xpt for chemistry, lbh.xpt for hematology, and lbu.xpt for urinalysis. Splitting it other ways (by subject or file size, etc) makes the data less useable. Sponsors should submit these smaller files in addition to the larger non-split standard LB domain file. The smaller split files should be submitted in a separate Sub-directory /SPLIT which is clearly documented in addition to the larger non-split standard LB domain file in the CRT directory. Please see File Size section for information about file size limits.
5. CDER-CDS (p. 12)
   a. Dataset Splitting
      If datasets are greater than 1 gb in size, please split the datasets into smaller datasets no larger than 1 gb in size. Datasets should be resized to the maximum length used prior to splitting. This will ensure split datasets have matching variable lengths for future merges. Split data should be noted in the data definition document, clearly identifying the method used for the dataset splitting.
      For additional information or clarification on file size limitations or splitting of datasets submitted to CDER, contact eData@fda.hhs.gov.
6. PBO (slide 3)
   Dataset Sizes
   Description:
      o Based on FDA’s SDS and CID
   Solution:
      o Limit set to 1 gb
      o Warning acceptable for this?
      o Should warning/error be based on certain limits? Warning 1 gb to 1.25 gb Error > 1.5?
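
To make the open question on the PBO slide concrete, a graded file-size check might look like the sketch below. It is not a rule taken from any of the references. The slide only floats the thresholds as questions and leaves the 1.25 to 1.5 range unassigned, so the sketch assumes that everything above 1 GB up to 1.5 GB is a Warning and anything larger is an Error. It also assumes that "1 gb" means 2**30 bytes, which the source does not define. The function name and parameters are hypothetical.

```python
import os

GIB = 1024 ** 3  # assumption: "1 gb" is read as 2**30 bytes; the source does not say


def classify_dataset_size(size_bytes, warn_above=1.0 * GIB, error_above=1.5 * GIB):
    """Return None, 'Warning', or 'Error' for a transport file of the given size.

    Thresholds are placeholders for the limits still under discussion.
    """
    if size_bytes > error_above:
        return "Error"
    if size_bytes > warn_above:
        return "Warning"
    return None


def classify_transport_file(path):
    """Convenience wrapper that reads the size of an existing file from disk."""
    return classify_dataset_size(os.path.getsize(path))
```

Keeping both thresholds as parameters allows a phased tightening of the limits without changing the check itself.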