Sizing of SDTM Variables and Datasets
Two interrelated issues involving sizing considerations that have been identified for further
clarification from the FDA are:


Sizing of Variables/Data Elements/Columns
Overall File/Dataset Size
Because of how these two issues are interrelated, a single discussion document addresses both
of them. The following references have contributed to this document:
1. SDTMIG 3.1.3: SDTM IG (V3.1.3/2012-07-16)
2. SDTM-MSG: SDTM Metadata Submission Guideline (V1.0/2011-12-31)
3. SDS: Study Data Specifications (V2.0/July 18, 2012)
4. CDER-CDS: CDER Common Data Standards Issues Document (V1.1/December 2011)
5. PBO: 2013PhUSEBreakoutCABVP-CBER-CDER, FDA/PhUSE CSC (March 2013)
6. PWG1: Workgroup1-OpenCDISC_Updates, FDA/PhUSE CSC (March 2013?)
These documents discuss these issues in various ways, sometimes in more than one location in
a given document. To provide full context for the issues, a review of each document was
performed, and relevant text has been extracted into appendices of this document. To provide
a more concise discussion, however, details have been summarized, with differences
highlighted.
1. Sizing of Variables/Data Elements/Columns
See Appendix A for a full accounting of the text from various documents related to this issue.
There are three main categories related to variable length to consider. For ease of review, each
of these categories is summarized individually:
1. Column size constraints due to general CDISC and SAS Transport v5 requirements
 Various columns are identified as having a maximum length. This is mostly due to SAS
Transport V5 requirements or, in the case of vertical data structures, values that could
have constraints if the dataset were transformed to a horizontal structure (e.g., --TESTCD, --TEST).


For values that could become variable names and labels, there are some indications
that setting these to the maximum length is appropriate (e.g., SDTMIG states “--TESTCD
and IDVAR will never be more than 8, so length can always be set to 8”).
For domains split into multiple datasets, sizing decisions should be made prior to
implementing the split, and variables should be consistently sized in each dataset.
 This holds true for SUPPQUAL datasets within a domain (e.g., SUPPLBHE and
SUPPLBCH).
 This holds true for all FA-related datasets (i.e., all FA-related datasets, regardless
of parent domain, must have consistent variable lengths, FA being its own
domain).
 NB: My (C. Radovsky) recollection is that PhUSE minutes from the meeting in
March indicated that FDA was OK with resizing per dataset, but I am not able to
find these minutes.
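The fixed limits in this category can be collected into a small lookup. The Python sketch below is illustrative only: the function names and the suffix-matching approach are assumptions made for this example, and the limits are the ones quoted in Appendix A (8 for --TESTCD, QNAM, ETCD, and TSPARMCD; 20 for ARMCD; 40 for --TEST and QLABEL; 200 for IETEST and ARM). It assumes single-byte characters, so that character count equals SAS length.

```python
# Maximum lengths quoted in Appendix A. Names and structure are illustrative.
EXACT_LIMITS = {
    "QNAM": 8, "ETCD": 8, "TSPARMCD": 8,
    "ARMCD": 20,
    "QLABEL": 40,
    "IETEST": 200,  # exception to the 40-character --TEST rule
    "ARM": 200,
}
# Limits for domain-prefixed variables such as LBTESTCD or VSTEST.
# TESTCD is listed first so that it is matched before the shorter TEST suffix.
SUFFIX_LIMITS = [("TESTCD", 8), ("TEST", 40)]
DEFAULT_LIMIT = 200  # SAS Transport V5 character variable maximum


def max_allowed_length(variable):
    """Return the maximum length for a variable name (hypothetical helper)."""
    name = variable.upper()
    if name in EXACT_LIMITS:
        return EXACT_LIMITS[name]
    for suffix, limit in SUFFIX_LIMITS:
        if name.endswith(suffix):
            return limit
    return DEFAULT_LIMIT


def values_over_limit(variable, values):
    """List the values that are longer than the limit for the variable."""
    limit = max_allowed_length(variable)
    return [value for value in values if len(value) > limit]
```

For example, values_over_limit("LBTESTCD", ["ALT", "HEMOGLOBIN"]) flags only the second value, which is longer than 8 characters.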
Possible Discussion Topics:


Should any of these variables be subject to length evaluations? All or some (e.g.,
--TESTCD length of 8 OK, but not --TEST length of 40)?
?
2. Column size for variables associated with a Codelist
 The references related to Codelist length are recent developments, few, and consistent:
Codelist variable length should not be greater than the longest term in the Codelist.
 FDA material doesn’t call out this issue explicitly. It seems bundled under the general
statement that variable length should not exceed value length.
Possible Discussion Topics:


Approach used to represent a Codelist for submission can impact decisions on
length. Consolidating to a minimum set of codelists may reduce development
overhead within sponsor organizations, but increases the size of the codelist
variables. (Both consolidated and not are technically compliant approaches).
?
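The effect of codelist consolidation on variable length can be illustrated with a short Python sketch. The terms below are hypothetical and chosen only for illustration (they are not taken from any published codelist), and codelist_length is a helper name assumed for this example. It applies the rule that the variable length follows the longest term in the codelist.

```python
def codelist_length(terms):
    """Length of the longest term in a codelist."""
    return max(len(term) for term in terms)


# Hypothetical terms, for illustration only.
vital_signs_units = ["kg", "cm"]            # subset codelist for one variable
all_units = ["kg", "cm", "mmHg", "beats/min", "10^9/L"]  # consolidated codelist

subset_length = codelist_length(vital_signs_units)
consolidated_length = codelist_length(all_units)
```

With these terms, the variable tied to the subset codelist needs a length of 2, while the same variable tied to the consolidated codelist would be sized at 9, even though both approaches are technically compliant.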
3. Column size issues not otherwise constrained by the prior two categories
 The references related to the length of variables previously constrained only by the
200-character limit are also recent developments, few, and consistent:
 Length should be based on data values and not set to 200 characters by default.
 A small amount of padding is acceptable.
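As a minimal sketch of what "based on data values" could mean in practice, the Python below derives the maximum length used for each character variable and compares it with a declared length. The record layout (a list of dictionaries), the function names, and the sample values are assumptions for illustration. Lengths are counted in characters, which matches SAS byte lengths only for single-byte encodings.

```python
def lengths_used(records):
    """Maximum value length observed for each character variable."""
    used = {}
    for record in records:
        for name, value in record.items():
            if isinstance(value, str):
                used[name] = max(used.get(name, 0), len(value))
    return used


def excess_lengths(declared, records):
    """Declared length minus the maximum length used, per variable."""
    used = lengths_used(records)
    return {name: length - used.get(name, 0)
            for name, length in declared.items()}


# Hypothetical records and declared lengths, for illustration only.
records = [
    {"USUBJID": "S1-001", "AETERM": "HEADACHE"},
    {"USUBJID": "S1-002", "AETERM": "NAUSEA"},
]
declared = {"USUBJID": 20, "AETERM": 200}
excess = excess_lengths(declared, records)
```

Here AETERM carries 192 characters of padding per record because it was left at the 200-character default, which is the situation the references ask sponsors to avoid.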

Working Group 1 discussions addressed the issue as follows:
 Severity based on “excessive” chars in var length
o 2-5 Chars – Information
o 6-20 Chars – Warning
o >20 Chars – Error
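The Working Group 1 scale above can be written as a simple classification. This Python sketch is not the OpenCDISC implementation; the function name is hypothetical, and treating 0 or 1 excess characters as no finding is an assumption drawn from the scale starting at 2.

```python
def padding_severity(excess_chars):
    """Map excess characters in a variable length to the WG1 severity."""
    if excess_chars > 20:
        return "Error"
    if excess_chars >= 6:
        return "Warning"
    if excess_chars >= 2:
        return "Information"
    return None  # assumption: fewer than 2 excess characters is not reported
```

Under this reading, a variable declared at 200 whose longest value is 8 characters (192 excess) is an Error, while one declared at 10 for the same data (2 excess) is Information only.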
Possible Discussion Topics:




While the WG1 discussions came up with a reasonable interpretation of FDA
language, it was “arbitrarily” reasonable, and met with resistance in the industry.
Is this more a communication challenge than an issue with the logic rule?
Given that sponsors need to address all rule violations with a severity of Warning
and Error, is there benefit in distinguishing the severity between the two?
Consider phased implementation (e.g., warning/error when above 20 for the first
year, warning/error when above 5 after that)?
?
2. Overall File/Dataset Size
See Appendix B for a full accounting of the text from various documents related to this issue.
References to file/dataset size and splitting decisions can be summarized as a whole:




One GB is the target for maximum file size, with some flexibility.
As noted under the variable length section, variable sizing issues need to be addressed
prior to decisions on splitting datasets, and variable length should be consistent across
split datasets.
As with variable padding, a scale of severity is identified in some material, and is likely
subject to the same considerations.
CDISC references address splitting based on categorical variables (e.g., --CAT/--SCAT).
FDA references note this, as well as other mechanisms such as grouping by subject
number, but these other mechanisms are identified as inferior.
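The ordering described above (size first, then split by category) can be sketched in Python as follows. The record layout, the function name, and the sample LB records are assumptions for illustration; the check that the category is populated reflects the SDTMIG statement that --CAT must not be null when it is used for splitting.

```python
from collections import defaultdict


def size_then_split(records, category_variable):
    """Compute shared lengths on the full domain, then split by category."""
    # Step 1: lengths come from the unsplit data, so every split dataset
    # can be written with the same length for each variable.
    lengths = {}
    for record in records:
        for name, value in record.items():
            if isinstance(value, str):
                lengths[name] = max(lengths.get(name, 0), len(value))

    # Step 2: split on the categorical variable (e.g., LBCAT).
    splits = defaultdict(list)
    for record in records:
        category = record.get(category_variable)
        if not category:
            raise ValueError(category_variable + " must not be null when splitting")
        splits[category].append(record)
    return lengths, dict(splits)


# Hypothetical LB records, for illustration only.
lb = [
    {"LBCAT": "CHEMISTRY", "LBTEST": "Alanine Aminotransferase"},
    {"LBCAT": "HEMATOLOGY", "LBTEST": "Hemoglobin"},
]
lengths, splits = size_then_split(lb, "LBCAT")
```

Because the lengths are taken from the whole domain, LBTEST in the hematology split is sized for the longest chemistry test name as well, which keeps the split datasets consistent if they are later appended.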
Possible Discussion Topics:

Is there a rationale/benefit to maintain references to splitting by less desirable
methods?



Per above, a scale of severity is identified in some material, and is likely subject
to the same considerations as variable length severity.
The FDA CDER Common Data Standards Issues Document mentions the use of a
/SPLIT subdirectory, which is not identified in other material. That said, it is a
critical piece of information: it provides a clear mechanism for submitting both
split and non-split datasets at the same time, and it addresses procedural
challenges with some software (e.g., to my recollection, WebSDM
doesn’t handle having both versions of the data in a single folder).
?
Appendix A
Following the main document, details related to variable length are addressed in three main
categories:
1. Column size constraints due to general CDISC and SAS Transport v5 requirements
a. SDTMIG 3.1.3 (p. 28, section 4.1.2.1)
i. Values of --TESTCD must be limited to 8 characters . . .
ii. QNAM serves the same purpose as --TESTCD within supplemental qualifier
datasets, and so values of QNAM are subject to the same restrictions as values of
--TESTCD.
iii. ETCD (the companion to ELEMENT) and TSPARMCD (the companion to TSPARM)
are limited to 8 characters. . . .
iv. ARMCD is limited to 20 characters and does not have special character
restrictions.
b. SDTMIG 3.1.3 (p. 35, linked section 4.1.2.9)
i. The maximum SAS Version 5 character variable length of 200 characters should
not be used unless necessary.
ii. Sponsors should consider the nature of the data, and apply reasonable,
appropriate lengths to variables. For example:
 The length of flags will always be 1
 --TESTCD and IDVAR will never be more than 8, so length can always be set to 8
 The length for variables which use controlled terminology can be set to the
length of the longest term.
c. SDTMIG 3.1.3 (p. 50, section 4.1.5.3.1)
i. . . . the length of --TEST is normally limited to 40 characters to conform to the
limitations of the SAS V5 Transport format currently used for submission
datasets.
ii. The convention above should also be applied to the Qualifier Value Label
(QLABEL) in Supplemental Qualifiers (SUPP--) datasets. IETEST values in IE and TI
are exceptions to the above 40-character rule and are limited to 200 characters
since they are not expected to be transformed to a column labels.
d. SDTM-MSG (p. 20, section 5.1.2)
i. ARMCD can be up to 20 characters in length with no special character
restrictions. The arm name (ARM) variable length must not exceed 200
characters due to SAS version 5 transport requirements.
e. SDTM-MSG (p. 20, section 5.1.3)
i. The full inclusion/exclusion criteria text should be stored in IETEST and
corresponding metadata [up to 200 characters].
f. SDTMIG 3.1.3 (p. 22, section 4.1.1.7)
i. Variables of the same name in separate datasets should have the same SAS
Length attribute to avoid any difficulties if the sponsor or FDA should decide to
append datasets together.
g. SDTM-MSG (p. 31, section 6.3)
i. Each of the value-level QNAMs (defined by the “Name” attribute) has a different
length, a label, and associated controlled terminology (CodeLists).
h. SDS (p. 5, section 2.3)
i. Variable Name has a Maximum Length of 8.
ii. Variable Descriptive Label has a Maximum Length of 40.
iii. Dataset Label has a Maximum Length of 40.
i. SDS (p. 5, section 2.4)
i. For all datasets, in order to significantly reduce dataset file sizes, the allotted
character column length/size for each column should be the maximum length
used. Lengths/sizes of columns should not arbitrarily be set to 200.
ii. An inclusion of a small amount of padding to column width may be acceptable as
long as this doesn’t result in significant increases in file size.
j. CDER-CDS (p. 12)
i. For both CDISC and non-CDISC datasets, in order to significantly reduce dataset
file sizes, the allotted character variable length/size for each column in a dataset
should be the maximum length used. Lengths/Sizes of columns should not
arbitrarily be set to 200.
ii. Alternative solutions to this problem that involve some inclusion of a small
amount of padding to column width may be acceptable as long as they don’t
result in significant increases in file size due to the padding.
2. Column size for variables associated with a Codelist
a. SDTMIG 3.1.3 (p. 35, linked section 4.1.2.9)
i. The length for variables which use controlled terminology can be set to the
length of the longest term.
3. Column size issues not otherwise constrained by the prior two categories
a. SDTMIG 3.1.3 (p. 95, section 6.4.2)
i. [For split FA datasets]. . . Variables of the same name in multiple datasets should
have the same SAS Length attribute.
b. SDS (p. 5)
. . . the allotted character column length/size for each column should be the
maximum length used. Lengths/sizes of columns should not arbitrarily be set to 200.
. . . An inclusion of a small amount of padding to column width may be acceptable as
long as this doesn’t result in significant increases in file size.
c. CDER-CDS (p. 12)
For both CDISC and non-CDISC datasets, in order to significantly reduce dataset file
sizes, the allotted character variable length/size for each column in a dataset should
be the maximum length used. . . . Alternative solutions to this problem that involve
some inclusion of a small amount of padding to column width may be acceptable as
long as they don’t result in significant increases in file size due to the padding.
d. PBO (slide 2)
Ideally, column widths should only be the maximum length used in that column
across records
Solution:
 Should we use extra characters used or %?
 Exceptions?
e. PBO (slide 32)
 SD0017(others): Variable lengths
 Description:
o The value of xxxx should be no more than yy characters in length
 --TEST, --TESTCD, TSPARM, ARMCD,
o Are there extenuating circumstances to these rules where character
limits may not be enough?
 Adjust OpenCDISC checks to accommodate?
o What’s the difference between SD0017 and SD1058? Duplicate rules?
f. PWG1 (slide 9)
Variable length is too long for actual data
i. Severity based on “excessive” chars in var length
ii. 2-5 Chars – Information
iii. 6-20 Chars – Warning
iv. >20 Chars – Error
Appendix B
Following the main document, details related to Overall File/Dataset Size and resulting split
domains are addressed as a whole:
1. SDTMIG 3.1.3 (p. 22, section 4.1.1.7)
a. Sponsors may choose to split a domain of topically related information into
physically separate datasets. In such cases, one of two approaches should be
implemented:
1) For a domain based on a general observation class, splitting should be
according to values in --CAT (which must not be null).
2) The Findings About (FA) domain (Section 6.4) can be split either by --CAT
values (per the bullet above) or relative to the parent domain of the
value in --OBJ. For example, FACM would store Findings About CM
records. See Section 6.4.2 for more details.
2. SDTM-MSG (p. 24, section 5.5.1.3)
a. If the decision had been made to split a submission dataset, it is recommended
that the sponsor communicate with their review division regarding exactly what
needs to be included in the submission, i.e. the split datasets or both the split
datasets and the unsplit datasets.
3. SDS (p. 5, section 2.2)
a. Each dataset is provided in a single transport file. The maximum size of an
individual dataset is dependent on many factors. In general, datasets greater
than 1 GB in size should be split into smaller datasets, each no larger than 1GB in
size. Datasets divided to meet the maximum size restrictions should contain the
same variable presentation so they can be easily concatenated.
b. Datasets which are divided should be clearly named to aid the reviewer in
reconstructing the original dataset, e.g., xxx1, xxx2, xxx3, etc. The files that have
been divided and need to be concatenated should be noted in the data
definition document. This documentation should identify the range of subject
numbers (or other criteria used for division) in the label for each of the divided
datasets. . . .
4. CDER-CDS (p. 9)
a. LB Domain (Laboratory):
The size of the LB domain is often quite large and can exceed the reviewers’
ability to open the file using standard-issue computers. This size issue can be
addressed by splitting the large LB dataset into smaller data sets according to
LBCAT and LBSCAT, using LBCAT for initial splitting. If the size is still too large,
then use LBSCAT for further splitting. For example: use the dataset name lbc.xpt
for chemistry lbh.xpt for hematology and lbu.xpt for urinalysis. Splitting it other
ways (by subject or file size, etc) makes the data less useable. Sponsors should
submit these smaller files in addition to the larger non-split standard LB domain
file. The smaller split files should be submitted in a separate Sub-directory /SPLIT
which is clearly documented in addition to the larger non-split standard LB
domain file in the CRT directory. Please see File Size section for information
about file size limits.
5. CDER-CDS (p. 12)
a. Dataset Splitting
If datasets are greater than 1 gb in size, please split the datasets into smaller
datasets no larger than 1 gb in size. Datasets should be resized to the maximum
length used prior to splitting. This will ensure split datasets have matching
variable lengths for future merges. Split data should be noted in the data
definition document, clearly identifying the method used for the dataset
splitting. For additional information or clarification on file size limitations or
splitting of datasets submitted to CDER, contact eData@fda.hhs.gov.
6. PBO (slide 3)
 Dataset Sizes
 Description:
o Based on FDA’s SDS and CID
 Solution:
o Limit set to 1 gb
o Warning acceptable for this?
o Should warning/error be based on certain limits?
 Warning 1 gb to 1.25 gb
 Error > 1.5?