GPEC Variable Convention for Statistical Analysis S/W

Need for data annotation convention

As the amount of GPEC data increases, it is important to establish a convention for data annotation. Some of the reasons are listed below:

• Organize data. Standardized annotation allows researchers to know what the data represent. It enhances a researcher’s ability to
    o determine the suitability of the data for his/her analysis
    o search for particular data
• Knowledge transfer and data sharing. Interpretations of poorly annotated data are often lost when the original user (the one who originally analyzed the data) is away. In some cases, data could be rendered useless without the guidance of the original user. Standardized annotations capture important information about the data and preserve it so that other/future GPEC employees or collaborators can use GPEC data with confidence.

Definition of variable for statistical analysis S/W

Currently, most data analysis in GPEC is done in SPSS. An SPSS variable consists of the following:

• Variable name
• An array of values (one for each case). The type can be either text or numeric.
• Data value label. For each value, there may be an associated text description, e.g. 0 = “negative”.
• Data label. A text description of the variable.

Because the above data format has limited transferability (e.g. it cannot be exported to Excel), the data annotation convention is restricted to the following data format:

• Variable name
• An array of values (one for each case). The type can be either text or numeric.

Essentially, this is a column in an Excel spreadsheet.

Variable name

Biomarker name

GPEC should try to establish an internal preferred name for each biomarker that would be easy to understand for most researchers. For example, “her2” could be used as the internal preferred name for “Her-2”, which has synonyms “erbb2”, “neu” … etc. A central, easily accessible glossary (e.g.
a web page) will help researchers on different studies establish uniform data definitions, which can add value by allowing later meta-analyses of multiple datasets.

In addition to the above considerations, variable names should abide by a restricted version of the SPSS variable naming convention. (Note: this naming convention is compatible with R. Please refer to the R manual: http://cran.r-project.org/doc/manuals/R-intro.html)

• Variable names are case sensitive.
• Each variable name must be unique; duplication is not allowed.
• Variable names can be up to 64 bytes long, and the first character must be a letter {A-Z, a-z}. Subsequent characters can be any combination of letters, numbers {0-9}, a period “.”, and an underscore “_”.
• Variable names cannot contain spaces.
• The period and underscore can be used within variable names. For example, A.B_1 is a valid variable name.
• Variable names ending with a period should be avoided, since the period may be interpreted as an SPSS command terminator.
• Variable names ending in underscores should be avoided.
• SPSS reserved keywords cannot be used as variable names. Reserved keywords are: ALL, AND, BY, EQ, GE, GT, LE, LT, NE, NOT, OR, TO, WITH.
• Variable names can be defined with any mixture of uppercase and lowercase characters, and case is preserved for display purposes.

Variable type

[biomarker]
    • raw IHC scores for biomarker
        o e.g. -1 = missing; 0 = negative; 1, 2, 3 … etc. = some notion of positivity
    • biomarker names must follow the restrictions on variable names as well as the following: the first character must be a letter {A-Z, a-z}; subsequent characters can be any combination of letters, numbers {0-9}, and a period “.” only, i.e. biomarker names cannot contain “_”.
    • e.g. gata3

[biomarker]_b[negative score(s)]v[positive score(s)]
    • IHC binary scores
    • e.g. gata3_b0v12
        o binarized gata3 data {0} vs {1,2}
    • for missing vs. interpretable, use
        o [biomarker]_bXvS
        o e.g. gata3_bXvS

[biomarker]_c
    • raw continuous IHC scores for biomarker
    • e.g.
      p66_c

[biomarker]_c[number of cut-off pt’s]v[cut-off version # for different cut-off’s]
    • cut-off continuous IHC scores for biomarker
    • version # starts with 1; always present even when there is only one version
    • e.g. p66_c2v1
    • for missing vs. interpretable for continuous IHC, use
        o [biomarker]_cXvS
        o e.g. p66_cXvS

[biomarker]_fish
    • raw continuous FISH amp ratio for biomarker
    • e.g. her2_fish

[biomarker]_fish[number of cut-off pt’s]v[cut-off version # for different cut-off’s]
    • cut-off of FISH amp ratio
    • e.g. her2_fish1v1

[biomarker]_cish
    • raw continuous CISH amp ratio for biomarker
    • e.g. her2_cish

[biomarker]_cish[number of cut-off pt’s]v[cut-off version # for different cut-off’s]
    • cut-off of CISH amp ratio
    • e.g. her2_cish1v1

[biomarker]_qrtpcr
    • raw continuous qRT-PCR normalized expression data for biomarker
    • e.g. her2_qrtpcr

[biomarker]_qrtpcr[number of cut-off pt’s]v[cut-off version # for different cut-off’s]
    • cut-off of qRT-PCR normalized expression data
    • e.g. her2_qrtpcr1v1

The naming convention for cluster membership (i.e. the cross product of two or more IHC variables) is outside the scope of this document.

Variable version number

The following are some occasions where one would need a version number:

• Same TMA stained by different institutions
• IHC visual scores done by different researchers

Note: version numbers start with ‘0’ except at the top level. Trailing 0’s are omitted. For example, her2_v1 is the first version of Her-2 IHC staining+score, and her2_v1.1 is the second scoring of the her2_v1.x staining, where the first version is her2_v1.0 (which should be written as her2_v1). her2_fv1_v1.1 should be labeled as her2_fv1_v1

[biomarker + type]_v[version #]
    • version # starts with 1; always present even when there is only one version
    • raw FISH amp ratio for biomarker version 2
    • e.g. her2_fv1_v2

New version vs. new biomarker name

One needs to standardize when to use a new version of a biomarker and when it is more appropriate to use a new biomarker name.
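The naming rules in the “Variable name” and “Variable type” sections can be checked mechanically before a variable is added to a data file. The following is a minimal Python sketch, not part of the GPEC convention itself: the helper names name_problems and parse_binary are hypothetical, reserved keywords are assumed to be matched case-insensitively, and each digit in a binary suffix is assumed to be one score (so “12” means {1,2}). Names that carry a version suffix are not handled by parse_binary.

```python
import re

# Reserved SPSS keywords listed in the convention. Matching them
# case-insensitively is an assumption of this sketch.
RESERVED = {"ALL", "AND", "BY", "EQ", "GE", "GT", "LE", "LT",
            "NE", "NOT", "OR", "TO", "WITH"}

# First character is a letter; later characters are letters, digits, "." or "_".
NAME_RE = re.compile(r"[A-Za-z][A-Za-z0-9._]*")

# [biomarker]_b[negative score(s)]v[positive score(s)] or [biomarker]_bXvS.
# The biomarker part may not contain "_".
BINARY_RE = re.compile(r"([A-Za-z][A-Za-z0-9.]*)_b(?:([0-9]+)v([0-9]+)|XvS)")


def name_problems(name):
    """Return a list of rule violations; an empty list means the name passes."""
    problems = []
    if len(name.encode("utf-8")) > 64:
        problems.append("longer than 64 bytes")
    if not NAME_RE.fullmatch(name):
        problems.append("invalid first character or disallowed character")
    if name.endswith((".", "_")):
        problems.append("ends with a period or underscore")
    if name.upper() in RESERVED:
        problems.append("reserved SPSS keyword")
    return problems


def parse_binary(name):
    """Split a binary IHC variable name into (biomarker, negative, positive).

    Returns None when the name is not of the binary form.
    """
    match = BINARY_RE.fullmatch(name)
    if match is None:
        return None
    biomarker, negative, positive = match.groups()
    if negative is None:  # the bXvS form: missing vs. interpretable
        return biomarker, "missing", "interpretable"
    # Assumption: every digit is a separate single-digit score.
    return biomarker, [int(d) for d in negative], [int(d) for d in positive]


print(name_problems("A.B_1"))       # []
print(name_problems("WITH"))        # ['reserved SPSS keyword']
print(parse_binary("gata3_b0v12"))  # ('gata3', [0], [1, 2])
```

Uniqueness of names is not covered here, since it depends on the whole data file rather than on a single name.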
Scenario                                                      New biomarker name   New biomarker version
different detection method …
    e.g. IHC, FISH, qrtPCR …                                  YES                  -
different antigen …
    different isoforms / splice variants / phosphorylation
    state of gene product of same gene(s)                     YES                  -
    different gene(s)                                         YES                  -
same above and different staining attempt (same or
different antibodies) …                                       -                    YES x
same above and different scoring mechanism/media …
    different pathologist (e.g. different humans or
    different machine)                                        -                    YES A.x
same above and different scoring approach …
    different session                                         -                    YES A.B.x
    different scoring system                                  -                    YES A.B.x

Naming convention for different data types

Different types of data can be generated from the same scan/analysis, for example by non-IHC methods and automated scoring machines. The type of the data is indicated as part of the version number. The version number codes are as follows:

IHC-related

Version   Description                                                     Example
Number
Code
cx        IHC score for core x in multiplicate cores, e.g. ki67_v1.c2     ki67_v1.c2
          indicates the ki67_v1 score for core #2 of the duplicate core.
avg       average IHC score among the multiplicate cores. Please note,    ki67_v1.avg
          the default consolidation scheme is
              categorical score: maximum
              continuous score: average
max       maximum IHC score among the multiplicate cores. Please note,    ki67_c_v1.max
          the default consolidation scheme is
              categorical score: maximum
              continuous score: average
cmt       additional comments for individual cores                        ki67_v1.cmt
int       IHC score for intensity (used when more than one score is       bcl2_v2.int
          given to a core in the same scoring session, e.g. scores for
          cores in the format {percent positive, intensity})
pp        IHC score for percent of positive cells (used when more than    bcl2_v2.pp
          one score is given to a core in the same scoring session,
          e.g. scores for cores in the format {percent positive,
          intensity})
org       original score as entered by scoring pathologist (free text)    bcl2_v2.org, bcl2_v2.c2.org

FISH

Version   Description             Example
Number
Code
as        average signal          emsy_fish_v1.as
cep6.as   CEP 6 average signal    emsy_fish_v1.cep6.as

Automated Scores

Version   Description                                                       Example
Number
Code
hs        H-score; definition varies between biomarkers, please refer to    yb1_c_v1.1.hs
          the specific biomarker description page
mod       median optical density of a color (e.g. brown), ImageJ specific   yb1_c_v1.1.mod
ncn       negative cytoplasm number; Ariol specific                         yb1_c_v1.1.ncn
nni       negative nuclear intensity; Ariol specific                        yb1_c_v1.1.nni
nnn       number of negative nuclei                                         yb1_c_v1.1.nnn
pa        percent color (e.g. brown) area; ImageJ specific                  yb1_c_v1.2.pa
pci       positive cytoplasm intensity; Ariol specific                      yb1_c_v1.1.pci
pcn       positive cytoplasm number; Ariol specific                         yb1_c_v1.1.pcn
pcp       positive cytoplasm percent; Ariol specific                        yb1_c_v1.1.pcp
pn        percent color (e.g. brown) number; ImageJ specific                yb1_c_v1.2.pn
pni       positive nuclear intensity; Ariol specific                        yb1_c_v1.1.pni
pnn       number of positive nuclei                                         yb1_c_v1.1.pnn
rs        region score; Ariol specific                                      yb1_c_v1.1.rs
tnn       total nuclei number (~ cell count)                                yb1_c_v1.1.tnn
uod       mean optical density of a color (e.g. brown), ImageJ specific     yb1_c_v1.1.uod

Note: The version number is the only indication that the scoring was done by a particular machine. “_c_” indicates that the variable is continuous; its presence does not imply automated or non-automated scores. Automated scores generated by different machines receive different version numbers. One would need a lookup table to map version numbers to specific information about the automated scoring machine (e.g. Ariol, Image …)

Exceptional variable indicator

Some variables do not follow the general variable naming conventions. They are indicated by a special tag at the end of the variable name, after the version number.

Exception: No negative value.
In this case, missing values will be coded as a special positive value. Since the ranges of the variables differ, the positive missing value code will differ between variables. The following are the conventions for the positive missing value codes:

• IHC:
    o 999: missing with no specified reasons
• IHC (continuous):
    o 999
• FISH (amp ratio):
    o 999

Tag: [variable name + version #]_po
e.g. her2_fv1_v2_po

Variable value considerations

Accurate recording of data may take extra time initially, but it will yield time savings at analysis time and allow deeper scientific understanding of the processes under investigation.

• Record extra levels of detail. For example, avoid a general “Missing” code; review why data might be missing and establish codes for each reason. Multiple codes can easily be collapsed into a general code category at analysis time, but if a general code is used initially, it is difficult to redefine sub-categories after the fact.
• Clearly note the units of measurement for each variable to avoid problems such as Imperial/Metric substitutions.

Immunohistochemistry (IHC) value convention

The general IHC value convention is as follows:

• {0} refers to some form of negative staining
• {1,2,3, …} refers to some form of positive staining
• {-1} refers to a missing value with unspecified reason(s)
• {-2,-3,-4, …} refers to a missing value with some specified reason(s)

Missing values denoted by negative integers shall conform to the following convention.

Missing Value Code   Missing Value Description
-1                   missing with unspecified reason(s)
-2                   very few cancer cells
-3                   in situ
-4                   core drop off / missing core
-5                   did not score
-6                   value missing due to number of cells in core less than some number (e.g. 50). This
                     minimum number of cells is specific to tumor type. Please refer to comments in data file.
-7                   no viable cancer
-8                   marker
-9                   no experiment done for this case
-10                  exclude core on v3 big series core ID = 1499 due to questionable array construction
-11                  non-specific staining
-12                  core folded
-50 to -100          user-defined missing value codes, i.e. they may refer to different reasons for a missing
                     value for different variables

IHC values have associated descriptions. E.g. for one biomarker, 0 refers to < 5% nuclei positive and 1 refers to 5-50% nuclei positive. This can be captured very naturally in an SPSS data file. However, the more transferable data format considered here cannot capture these value descriptions. Therefore, GPEC would need to document these value descriptions in an easily accessible place, e.g. an internal website.

Update log …

Date        Modifications/Updates
9/26/2006   Added missing value code table; Added more automated score version codes.
10/17/2006  Added missing value for “no viable cancer” and “marker”
10/18/2006  Added missing value for “user-defined missing value …”
10/20/2006  Added name for FISH “average signal”
1/4/2007    Added missing value for “no experiment done for this case”
10/16/2007  Added missing value for “exclude core on v3 big series core ID = 1499 due to questionable
            array construction”
12/05/2007  Added the “IHC-related” table under “Naming convention for different data types”.
            Added the “cx” and “avg” entry in the “IHC-related” table
12/10/2007  Added “-11” missing value code
1/16/2008   Added “cmt” in “IHC-related” table
01/06/2009  Added “int” and “pp” in “IHC-related” table
03/28/2011  Added -12 “core folded” to IHC value convention
10/20/2014  Added “max” and “org” in “IHC-related” table
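
As an illustration of the IHC value convention and the advice to collapse detailed missing codes only at analysis time, the following is a minimal Python sketch. It is not part of the GPEC convention: the names MISSING_CODES, describe and collapse_missing are hypothetical, the description for -6 is abbreviated, and the pairing of codes -7 to -11 with their descriptions follows the order in which they appear in the missing value table.

```python
# Missing value codes for IHC scores, transcribed from the convention.
# The text for -6 is abbreviated; -7 to -11 follow the table order.
MISSING_CODES = {
    -1: "missing with unspecified reason(s)",
    -2: "very few cancer cells",
    -3: "in situ",
    -4: "core drop off / missing core",
    -5: "did not score",
    -6: "too few cells in core (minimum is tumor-type specific)",
    -7: "no viable cancer",
    -8: "marker",
    -9: "no experiment done for this case",
    -10: "exclude core on v3 big series core ID = 1499",
    -11: "non-specific staining",
    -12: "core folded",
}


def describe(value):
    """Return a short description of an integer IHC score."""
    if value in MISSING_CODES:
        return MISSING_CODES[value]
    if -100 <= value <= -50:
        return "user-defined missing value code"
    if value == 0:
        return "negative staining"
    if value > 0:
        return "positive staining"
    # Codes -13 to -49 and below -100 are not assigned by the convention.
    raise ValueError(f"{value} is not defined by the convention")


def collapse_missing(values):
    """Collapse every specific missing code into the general code -1."""
    return [-1 if v < 0 else v for v in values]


scores = [0, 2, -4, -75, 1]
print([describe(v) for v in scores])
print(collapse_missing(scores))  # [0, 2, -1, -1, 1]
```

Keeping the detailed codes in the data file and collapsing them in the analysis script preserves the reason for each missing value, which cannot be recovered once only a general code is stored. Variables tagged with _po use positive missing codes such as 999 and would need separate handling.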