GPEC Variable Convention for Statistical Analysis S/W

Need for data annotation convention
As the amount of GPEC data increases, it is important to establish a convention for data
annotation. Some of the reasons are listed below:
• Organize data. Standardized annotation allows researchers to know what the data
represent. It enhances a researcher's ability to
  o determine the suitability of the data for his/her analysis
  o search for particular data
• Knowledge transfer and data sharing. Interpretations of poorly annotated data are
often lost when the original user (the one who originally analyzed the data) is
away. In some cases, data could be rendered useless without the guidance of the
original user. Standardized annotations capture important information about the data
and preserve it so that other or future GPEC employees or collaborators can use
GPEC data with confidence.
Definition of variable for statistical analysis S/W
Currently, most data analysis in GPEC is done in SPSS. An SPSS variable consists of
the following:
• Variable name
• An array of values (one for each case). The type can be either text or numeric.
• Data value label. For each value, there may be an associated text description,
e.g. 0 = “negative”
• Data label. A text description of this variable
Because the above data format has limited transferability (e.g. it cannot be
exported to Excel), the data annotation convention is restricted to the following
data format:
• Variable name
• An array of values (one for each case). The type can be either text or numeric.
Essentially, this is a column in an Excel spreadsheet.
Variable name
Biomarker name
GPEC should try to establish an internal preferred name for each biomarker that is
easy to understand for most researchers. For example, “her2” could be used as the
internal preferred name for “Her-2”, which has synonyms such as “erbb2”, “neu” … etc. An
easily accessible central glossary (e.g. a web page) will help researchers on different
studies establish uniform data definitions, which can add value by allowing later
meta-analyses of multiple datasets.
In addition to the above considerations, variable names should abide by a restricted
version of the SPSS variable naming convention (note that this naming convention is compatible
with R; please refer to the R manual: http://cran.r-project.org/doc/manuals/R-intro.html):
• Variable names are case sensitive.
• Each variable name must be unique; duplication is not allowed.
• Variable names can be up to 64 bytes long, and the first character must be a letter
{A-Z, a-z}. Subsequent characters can be any combination of letters, numbers
{0-9}, a period “.”, and an underscore “_”.
• Variable names cannot contain spaces.
• The period and underscore can be used within variable names. For example,
A.B_1 is a valid variable name.
• Variable names ending with a period should be avoided, since the period may be
interpreted as an SPSS command terminator.
• Variable names ending in underscores should be avoided.
• SPSS reserved keywords cannot be used as variable names. Reserved keywords
are: ALL, AND, BY, EQ, GE, GT, LE, LT, NE, NOT, OR, TO, WITH.
• Variable names can be defined with any mixture of uppercase and lowercase
characters, and case is preserved for display purposes.
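The naming rules above can be checked mechanically before a data file is shared. The following is a minimal Python sketch, not an official GPEC tool: the function names check_variable_name and find_duplicates are hypothetical, and matching reserved keywords regardless of case is an assumption made here for safety.

```python
import re

# Reserved SPSS keywords listed in the convention.
RESERVED_KEYWORDS = {
    "ALL", "AND", "BY", "EQ", "GE", "GT", "LE",
    "LT", "NE", "NOT", "OR", "TO", "WITH",
}

# First character is a letter; the rest are letters, digits, "." or "_".
_NAME_PATTERN = re.compile(r"[A-Za-z][A-Za-z0-9._]*")


def check_variable_name(name):
    """Return a list of rule violations (empty if the name is acceptable)."""
    problems = []
    if len(name.encode("utf-8")) > 64:
        problems.append("longer than 64 bytes")
    if not _NAME_PATTERN.fullmatch(name):
        problems.append("must start with a letter and use only letters, digits, '.' and '_'")
    if name.endswith("."):
        problems.append("ends with a period")
    if name.endswith("_"):
        problems.append("ends with an underscore")
    # Assumption: keywords are rejected regardless of case.
    if name.upper() in RESERVED_KEYWORDS:
        problems.append("is an SPSS reserved keyword")
    return problems


def find_duplicates(names):
    """Return names that occur more than once (comparison is case sensitive)."""
    seen, duplicates = set(), []
    for name in names:
        if name in seen and name not in duplicates:
            duplicates.append(name)
        seen.add(name)
    return duplicates
```

For instance, check_variable_name("A.B_1") returns an empty list, while a name containing a space or ending in an underscore returns the corresponding messages.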
Variable type
[biomarker]
• raw IHC scores for biomarker
  o e.g. -1 = missing; 0 = negative; 1, 2, 3 … etc = some notion of positivity
• biomarker names must follow the restrictions on variable names as well as the
following: the first character must be a letter {A-Z, a-z}; subsequent characters
can be any combination of letters, numbers {0-9}, and a period “.” only, i.e.
biomarker names cannot contain “_”.
• e.g. gata3
[biomarker]_b[negative score(s)]v[positive score(s)]
• IHC binary scores
• e.g. gata3_b0v12
  o binarized gata3 data {0} vs {1,2}
• For missing vs. interpretable, use
  o [biomarker]_bXvS
  o e.g. gata3_bXvS
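To illustrate how a binary variable name encodes its own recipe, here is a small Python sketch that derives gata3_b0v12 style values from raw scores. It rests on several assumptions that the convention does not state: the function name binarize is hypothetical, each digit in the name is taken as one raw score level, the output is coded 0 = negative group and 1 = positive group, and negative (missing) codes are passed through unchanged.

```python
import re


def binarize(raw_scores, variable_name):
    """Binarize raw IHC scores following a [biomarker]_b[neg]v[pos] name."""
    match = re.fullmatch(r"[A-Za-z][A-Za-z0-9.]*_b(\d+)v(\d+)", variable_name)
    if match is None:
        raise ValueError(f"not a binary variable name: {variable_name!r}")
    # Assumption: every digit stands for one raw score level.
    negative = {int(digit) for digit in match.group(1)}
    positive = {int(digit) for digit in match.group(2)}
    binary = []
    for score in raw_scores:
        if score < 0:
            binary.append(score)  # keep missing value codes as they are
        elif score in negative:
            binary.append(0)
        elif score in positive:
            binary.append(1)
        else:
            raise ValueError(f"score {score} is not covered by {variable_name!r}")
    return binary


print(binarize([0, 1, 2, -1, 0], "gata3_b0v12"))
```

Raising an error for a score that belongs to neither group is a deliberate choice in this sketch, so that an incomplete name such as gata3_b0v1 applied to data containing a 2 is noticed rather than silently recoded.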
[biomarker]_c
• raw continuous IHC scores for biomarker
• e.g. p66_c
[biomarker]_c[number of cut-off points]v[cut-off version # for different cut-offs]
• cut-off continuous IHC scores for biomarker
• version # starts with 1; always present even when there is only one version
• e.g. p66_c2v1
• for missing vs. interpretable for continuous IHC, use
  o [biomarker]_cXvS
  o e.g. p66_cXvS
[biomarker]_fish
• raw continuous FISH amp ratio for biomarker
• e.g. her2_fish
[biomarker]_fish[number of cut-off points]v[cut-off version # for different cut-offs]
• cut-off of FISH amp ratio
• e.g. her2_fish1v1
[biomarker]_cish
• raw continuous CISH amp ratio for biomarker
• e.g. her2_cish
[biomarker]_cish[number of cut-off points]v[cut-off version # for different cut-offs]
• cut-off of CISH amp ratio
• e.g. her2_cish1v1
[biomarker]_qrtpcr
• raw continuous qRT-PCR normalized expression data for biomarker
• e.g. her2_qrtpcr
[biomarker]_qrtpcr[number of cut-off points]v[cut-off version # for different cut-offs]
• cut-off of qRT-PCR normalized expression data
• e.g. her2_qrtpcr1v1
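The variable type patterns in this section are regular enough to be parsed automatically. The Python sketch below is one possible reading of them; the function name parse_variable_type and the wording of the returned labels are hypothetical, and the sketch only handles names without a version suffix.

```python
import re

# Labels for the continuous data types named in this section.
_CONTINUOUS = {
    "c": "continuous IHC score",
    "fish": "FISH amp ratio",
    "cish": "CISH amp ratio",
    "qrtpcr": "qRT-PCR normalized expression",
}

# Longer suffixes come first so that "cish" is not read as "c" + "ish".
_NAME = re.compile(r"([A-Za-z][A-Za-z0-9.]*)(?:_(fish|cish|qrtpcr|b|c)(.*))?")


def parse_variable_type(name):
    """Describe a variable name of the form [biomarker][_type suffix]."""
    match = _NAME.fullmatch(name)
    if match is None:
        raise ValueError(f"not a recognised variable name: {name!r}")
    biomarker, kind, detail = match.groups()
    if kind is None:
        return {"biomarker": biomarker, "type": "raw IHC score"}
    if detail == "XvS" and kind in ("b", "c"):
        return {"biomarker": biomarker, "type": "missing vs. interpretable"}
    groups = re.fullmatch(r"(\d+)v(\d+)", detail)
    if kind == "b":
        if groups is None:
            raise ValueError(f"malformed binary suffix in {name!r}")
        return {"biomarker": biomarker, "type": "binary IHC score",
                "negative": groups.group(1), "positive": groups.group(2)}
    if detail == "":
        return {"biomarker": biomarker, "type": "raw " + _CONTINUOUS[kind]}
    if groups is None:
        raise ValueError(f"malformed cut-off suffix in {name!r}")
    return {"biomarker": biomarker, "type": "cut-off " + _CONTINUOUS[kind],
            "cutoff_points": int(groups.group(1)),
            "cutoff_version": int(groups.group(2))}


for example in ("gata3", "gata3_b0v12", "p66_c2v1", "her2_fish1v1"):
    print(example, parse_variable_type(example))
```

Because biomarker names cannot contain “_”, the first underscore always separates the biomarker from its type suffix, which is what makes this simple pattern sufficient.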
The naming convention for cluster membership (i.e. the cross product of two or more
IHC variables) is outside the scope of this document.
Variable version number
A version number is needed in situations such as the following:
• Same TMA stained by different institutions
• IHC visual scores done by different researchers
Note: version numbers start with ‘0’ except at the top level, which starts with 1. Trailing 0’s are omitted.
For example,
• her2_v1 is the first version of Her-2 IHC staining+score, and her2_v1.1 is
the second scoring of the her2_v1.x staining, where the first scoring is
her2_v1.0 (which should be written as her2_v1).
• her2_fv1_v1.1 should be labeled as her2_fv1_v1
[biomarker + type]_v[version #]
• version # starts with 1; always present even when there is only one version
• raw FISH amp ratio for biomarker version 2
• e.g. her2_fv1_v2
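The rule that trailing 0’s are omitted can be applied programmatically when variable names are generated. A minimal Python sketch follows; the helper names normalize_version and with_version are hypothetical and are not part of the convention.

```python
def normalize_version(version):
    """Drop trailing ".0" levels, e.g. "1.0" -> "1", but keep "1.0.1" as is."""
    parts = version.split(".")
    # The top level is always kept, even if everything below it is 0.
    while len(parts) > 1 and parts[-1] == "0":
        parts.pop()
    return ".".join(parts)


def with_version(base, version):
    """Build [biomarker + type]_v[version #] with the version normalized."""
    return f"{base}_v{normalize_version(version)}"


print(with_version("her2", "1.0"))
print(with_version("her2", "1.1"))
```

Only trailing zeros are removed, so an intermediate level such as the 0 in 1.0.1 is preserved.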
New version vs. new biomarker name
One needs to standardize when to use a new version of a biomarker and when it is more
appropriate to use a new biomarker.
Scenario | Detail | New Biomarker Name | New Biomarker Version
different detection method … | e.g. IHC, FISH, qrtPCR … | YES | -
different antigen … | different isoforms / splice variants / phosphorylation state of gene product of same gene(s) | YES | -
different antigen … | different gene(s) | YES | -
same as above and different staining attempt (same or different antibodies) … | - | - | YES (x)
same as above and different scoring mechanism/media … | different pathologist (e.g. different humans or different machine) | - | YES (A.x)
same as above and different scoring approach … | different session | - | YES (A.B.x)
same as above and different scoring approach … | different scoring system | - | YES (A.B.x)
Naming convention for different data types
Non-IHC methods and automated scoring machines, for example, generate different types
of data from the same scan/analysis. The type of the data is indicated as part of the version
number. The version number codes are as follows:
IHC-related
Version Number Code | Description | Example
cx | IHC score for core x in multiplicate core. E.g. ki67_v1.c2 indicates ki67_v1 score for core #2 of the duplicate core. | ki67_v1.c2
avg | average IHC score among the multiplicate cores. Please note, the default consolidation scheme is categorical score: maximum; continuous score: average | ki67_v1.avg
max | maximum IHC score among the multiplicate cores. Please note, the default consolidation scheme is categorical score: maximum; continuous score: average | ki67_c_v1.max
cmt | additional comments for individual cores | ki67_v1.cmt
int | IHC score for intensity (used when more than one score is given to a core in the same scoring session, e.g. score for cores in the format of {percent positive, intensity}) | bcl2_v2.int
pp | IHC score for percent of positive cells (used when more than one score is given to a core in the same scoring session, e.g. score for cores in the format of {percent positive, intensity}) | bcl2_v2.pp
org | Original score as entered by scoring pathologist (free text) | bcl2_v2.org, bcl2_v2.c2.org
FISH
Version Number Code | Description | Example
as | average signal | emsy_fish_v1.as
cep6.as | CEP 6 average signal | emsy_fish_v1.cep6.as
Automated Scores
Version Number Code | Description | Example
hs | H-score; definition varies between biomarkers, please refer to the specific biomarker description page | yb1_c_v1.1.hs
mod | median optical density of a color (e.g. brown), ImageJ specific | yb1_c_v1.1.mod
ncn | negative cytoplasm number; Ariol specific | yb1_c_v1.1.ncn
nni | negative nuclear intensity; Ariol specific | yb1_c_v1.1.nni
nnn | number of negative nuclei | yb1_c_v1.1.nnn
pa | percent color (e.g. brown) area; ImageJ specific | yb1_c_v1.2.pa
pci | positive cytoplasm intensity; Ariol specific | yb1_c_v1.1.pci
pcn | positive cytoplasm number; Ariol specific | yb1_c_v1.1.pcn
pcp | positive cytoplasm percent; Ariol specific | yb1_c_v1.1.pcp
pn | percent color (e.g. brown) number; ImageJ specific | yb1_c_v1.2.pn
pni | positive nuclear intensity; Ariol specific | yb1_c_v1.1.pni
pnn | number of positive nuclei | yb1_c_v1.1.pnn
rs | region score; Ariol specific | yb1_c_v1.1.rs
tnn | total nuclei number (~ cell count) | yb1_c_v1.1.tnn
uod | mean optical density of a color (e.g. brown), ImageJ specific | yb1_c_v1.1.uod
Note:
• The version number is the only indication that the scoring was done by a
particular machine.
• “_c_” indicates that the variable is continuous. Its presence does not imply
automated or non-automated scores.
• Automated scores generated by different machines receive different version numbers.
A lookup table is needed to map version numbers to specific information
about the automated scoring machine (e.g. Ariol, Image …)
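Since the data type codes are appended to the version number, a variable name can be split back into its base name, numeric version and codes. The Python sketch below shows one way to do this; the function name split_version_codes is hypothetical, and names carrying an exception tag after the version number are not handled.

```python
import re

# [base]_v[numeric version][.code[.code ...]], e.g. yb1_c_v1.1.hs
_VERSIONED = re.compile(r"(.+)_v(\d+(?:\.\d+)*)((?:\.[A-Za-z][A-Za-z0-9]*)*)")


def split_version_codes(name):
    """Return (base name, numeric version, list of version number codes)."""
    match = _VERSIONED.fullmatch(name)
    if match is None:
        raise ValueError(f"no version number found in {name!r}")
    base, version, tail = match.groups()
    # The tail looks like ".c2.org"; drop the empty piece before the first dot.
    codes = tail.split(".")[1:] if tail else []
    return base, version, codes


for example in ("yb1_c_v1.1.hs", "bcl2_v2.c2.org", "emsy_fish_v1.cep6.as"):
    print(split_version_codes(example))
```

The numeric levels and the codes can be told apart because version levels are digits only, while every code in the tables above starts with a letter.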
Exceptional variable indicator
Some variables do not follow the general variable naming conventions. They are
indicated by a special tag at the end of the variable after the version number.
Exception: No negative value. In this case, missing values will be coded as a special
positive value. Since ranges of the variables differ, the positive missing value code
will differ between different variables. The following are the conventions for positive
error codes.
• IHC:
  o 999: missing with no specified reasons
• IHC (continuous):
  o 999
• FISH (amp ratio):
  o 999
Tag: [variable name + version #]_po
• e.g. her2_fv1_v2_po
Variable value considerations
Accurate recording of data may take extra time initially, but it will yield time savings at
analysis time and allow deeper scientific understanding of the processes under
investigation.
• Record extra levels of detail. For example, avoid a general “Missing” code;
review why data might be missing and establish codes for each reason. Multiple
codes can easily be collapsed into a general code category at analysis time, but if
a general coding is used initially it is difficult to redefine sub-categories after the
fact.
• Clearly note the units of measurement for each variable to avoid problems such as
Imperial/Metric substitutions.
Immunohistochemistry (IHC) value convention
The general IHC value convention is as follows:
• {0} refers to some form of negative staining
• {1,2,3, …} refers to some form of positive staining
• {-1} refers to a missing value with unspecified reason(s)
• {-2,-3,-4, …} refers to a missing value with some specified reason(s)
Missing values denoted by negative integers shall conform to the following convention.
Missing Value Code | Missing Value Description
-1 | missing with unspecified reason(s)
-2 | very few cancer cells
-3 | in situ
-4 | core drop off / missing core
-5 | did not score
-6 | value missing due to number of cells in core less than some number (e.g. 50). This minimum number of cells is specific to tumor type. Please refer to comments in data file.
-7 | no viable cancer
-8 | marker
-9 | no experiment done for this case
-10 | exclude core on v3 big series core ID = 1499 due to questionable array construction
-11 | non-specific staining
-12 | core folded
-50 to -100 | user-defined missing value codes, i.e. they may refer to different reasons for missing value for different variables.
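The missing value codes above lend themselves to a simple lookup at analysis time. The following Python sketch is illustrative only: the names IHC_MISSING_CODES, describe_ihc_value and collapse_missing are hypothetical, the descriptions are abbreviated from the table, and collapsing every negative code to -1 is just one possible way to form a general “missing” category.

```python
# Codes and (abbreviated) descriptions taken from the missing value table.
IHC_MISSING_CODES = {
    -1: "missing with unspecified reason(s)",
    -2: "very few cancer cells",
    -3: "in situ",
    -4: "core drop off / missing core",
    -5: "did not score",
    -6: "too few cells in core",
    -7: "no viable cancer",
    -8: "marker",
    -9: "no experiment done for this case",
    -10: "excluded core (questionable array construction)",
    -11: "non-specific staining",
    -12: "core folded",
}


def describe_ihc_value(value):
    """Classify an integer IHC value according to the general convention."""
    if value in IHC_MISSING_CODES:
        return "missing: " + IHC_MISSING_CODES[value]
    if -100 <= value <= -50:
        return "missing: user-defined code"
    if value == 0:
        return "negative"
    if value > 0:
        return "positive"
    raise ValueError(f"unassigned missing value code: {value}")


def collapse_missing(values, general_code=-1):
    """Collapse all specific missing codes into one general code."""
    return [general_code if value < 0 else value for value in values]


print(describe_ihc_value(-12))
print(collapse_missing([0, 2, -4, -60]))
```

Collapsing is done on a copy at analysis time, so the detailed codes recorded in the data file are left intact.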
IHC values have associated descriptions. E.g. for one biomarker, 0 refers to < 5% nuclei
positive and 1 refers to 5-50% nuclei positive. This can be captured very naturally in
an SPSS data file. However, the more transferable data format considered here cannot
capture these value descriptions. Therefore, GPEC needs to document these value
descriptions in an easily accessible place, e.g. an internal website.
Update log …
Date | Modifications/Updates
9/26/2006 | Added missing value code table; Added more automated score version codes.
10/17/2006 | Added missing value for “no viable cancer” and “marker”
10/18/2006 | Added missing value for “user-defined missing value …”
10/20/2006 | Added name for FISH “average signal”
1/4/2007 | Added missing value for “no experiment done for this case”
10/16/2007 | Added missing value for “exclude core on v3 big series core ID = 1499 due to questionable array construction”
12/05/2007 | Added the “IHC-related” table under “Naming convention for different data types”. Added the “cx” and “avg” entry in the “IHC-related” table
12/10/2007 | Added “-11” missing value code
1/16/2008 | Added “cmt” in “IHC-related” table
01/06/2009 | Added “int” and “pp” in “IHC-related” table
03/28/2011 | Added -12 “core folded” to IHC value convention
10/20/2014 | Added “max” and “org” in “IHC-related” table