CREATING A SUMMARY TABLE OF NORMALIZED (Z) SCORES

Walter W. Owen
The Biostatistics Center
The George Washington University
ABSTRACT
Data from the behavioral sciences are often analyzed by normalizing the scores for individuals in experimental subgroups to a reference population. Normalized scores, called Z-scores, may then be used to compare performance relative to the reference group either across the experimental subgroups or among different variables. Summary procedures allow group statistics to be output to SAS data sets. These data sets may be reshaped using the MATRIX and TRANSPOSE procedures before being brought together via SET and MERGE statements. The result is a compact table of normalized scores with SAS variable labels identifying the tests presented.

Addition of the reference values to the table allows the reader to extrapolate information about the experimental subgroup means and to compare the reference group to other populations reported in the literature. Further, the percentage of each subgroup having absolute Z-scores greater than an arbitrary cutoff could be added, yielding an even better definition of the experimental subgroups. For example, absolute scores greater than 1.64 indicate that an individual is performing at a level different from 90% of a normally distributed reference group.

INTRODUCTION
The use of normalized (Z) scores is widespread in the behavioral sciences. The process of normalization involves the transformation of data from experimental subgroups using the performance of a standard, or control, group as the initial point of reference. The resultant Z-scores allow researchers a common ground on which to compare a wide variety of tests that may be scored on different scales or are essentially objective in nature. The formulas for computing individual Z-scores based on reference populations and samples are listed below.

Population:
\[ Z = \frac{X_i - \mu}{\sigma} \]

Sample:
\[ Z = \frac{X_i - \bar{X}_{ref}}{s_{ref}} \]

Where:
    $Z$ = Normalized Score
    $X_i$ = Individualized Raw Score
    $\mu$, $\bar{X}_{ref}$ = Reference Mean Scores
    $\sigma$, $s_{ref}$ = Reference Standard Deviation

The resultant Z values are unitless scores indicating the number of standard deviations by which the corresponding raw score lies above or below the mean of the reference distribution. The reference group is characterized by a Z-score mean of zero and standard deviation of one. If an individual has an absolute Z-score of 1.64 or greater, that individual is performing at a level different from 90% of a normally distributed reference population. Similarly, an absolute Z-score greater than 1.96 puts the individual outside the range of 95% of that reference group.
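As a worked illustration of the sample formula, assume a hypothetical raw score of 130 on a test whose reference group has a mean of 107.38 and a standard deviation of 12.66 (reference values of the kind listed in Table 2; the raw score is invented for this example):

```latex
\[
Z = \frac{X_i - \bar{X}_{ref}}{s_{ref}}
  = \frac{130 - 107.38}{12.66}
  = \frac{22.62}{12.66}
  \approx 1.79
\]
```

Because the absolute value exceeds 1.64 but not 1.96, such a score would be counted as outside the range at the 90% cutoff and inside it at the 95% cutoff.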

When reporting the scores of several different subgroups for a battery of tests, it is desirable to present the results in tabular form. SAS provides several paths by which to create such a table. This paper will focus on gathering the information and the usefulness of different table layouts rather than elaborate methods for putting the information on paper. Accordingly, PROC PRINT, with LINESIZE options, was used to output the tables shown.

METHODS
Several requirements for an incoming data set should be established before elaborating on other methodology. Though the techniques described below work equally well for any number of subgroups, the groups must be classified by a single variable (perhaps GROUP) that identifies the experimental groupings as well as the reference group, in a mutually exclusive format. The appropriate PROC FORMAT statement should include values for all groups and labels suitable as SAS variable names (i.e., eight characters or fewer with no spaces). This format should be permanently assigned to the GROUPing variable when creating the groups. Global macro variables should be established to give the number of variables (called by &N in the programming segments that follow) and the number of groups used (called by &G, not including the reference group), thus allowing much of the remaining programming to be generalized to accept variations in these values (see Program Segment 1). Also, a macro (called by %VARS) listing the actual variables to be normalized is fundamental if the program is to be easily adapted for various purposes.

The number of observations per group is output from PROC FREQ, TRANSPOSEd into a single observation, and saved for later use (see Program Segment 2). It is convenient to create a permanent length of 40 for the variable labels at this point. This will allow any label up to 40 characters to be printed in the final table without worrying about truncation in a subsequent MERGE statement. PROC PRINT will adjust spacing if no label requires this much space.

The next step in the process is to create two data sets, one for the reference group and the other for all of the subgroup data. The mean and standard deviation for each variable in the reference group are output as a single observation using PROC MEANS. A copy of this data set is reshaped by PROC MATRIX and output for use in the final tables as reference parameters (see Program Segment 3). PROC SUMMARY could also be used, but requires a separate PRINT statement to look at the data. For smaller data sets, PROC MEANS is preferred even if it is slightly less efficient.

Appending the reference statistics to each observation of the subgroup data set allows the calculation of individual Z-scores (see Program Segment 4). The scores should replace the original raw values, thus retaining the variable labels for future use. A word of caution: be aware that the variables must be able to accommodate the decimal portion of the newly created Z-score. A series of counting variables may be created to record which Z-scores are outside a desired range (perhaps 1.64 or 1.96 as described previously). If a value of 100 is used in these counting variables to mark a score as deviant and a value of zero if it is not, the mean of the values will automatically yield the percentage of individuals outside the specified range.

PROC MEANS (or SUMMARY) is used again, this time BY the GROUPing variable, to output a data set of mean Z-scores with an observation for each subgroup (see Program Segment 5). PROC TRANSPOSE, using GROUP as the ID variable, will produce data ready for the final table. The same general process is used to prepare the percentages of outliers for the table (see Program Segment 6).

Data manipulation is completed by match MERGEing the reference statistics with the Z-score and percentage means for each subgroup. The group sizes may now be SET with the information collected for the variables in the previous step (see Program Segment 7).

THE TABLES
Now that all of the necessary information is together in a single data set having one observation for each variable plus one observation containing the group sizes, the tables may be PRINTed. The SPLIT option for labeling columns of PROC PRINT should be used to give better definition to the table. The simplest output (see Table 1) gives only the mean Z-scores for each subgroup.

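The effect of the SPLIT option on column headings can be sketched with a small stand-alone example. The data set FINALDEMO, its two test labels, and the values in it are hypothetical and only mimic the layout of the final data set; the syntax is that of current SAS releases, where the split character is quoted.

```sas
/* Hypothetical two-row stand-in for the final data set */
data finaldemo;
  length varlabel $40;
  input varlabel $ 1-20 group1 group2;
  /* An asterisk in a label marks where PROC PRINT breaks the heading */
  label varlabel = 'VARIABLE*DESCRIPTION'
        group1   = 'GROUP 1*Mean Z'
        group2   = 'GROUP 2*Mean Z';
datalines;
Reading test          -0.49 -0.42
Memory test            0.57  0.59
;
run;

/* SPLIT= names the break character and causes labels to be used as headings */
proc print data=finaldemo split='*';
  id varlabel;
  var group1 group2;
  format group1 group2 8.2;
run;
```

Each heading should then print on two lines, for example GROUP 1 above Mean Z, which keeps the columns narrow.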
Addition of the reference group means and standard deviations (see Table 2) defines where the values are centered and allows the reader to determine the means of the experimental subgroups. This is done by multiplying the reference standard deviation by the subgroup mean Z-score and then adding this value to the reference mean.

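The back-calculation follows from averaging the sample formula over a subgroup. A worked sketch, assuming the Table 2 values of reference mean 107.38, reference standard deviation 12.66, and Group 1 mean Z-score of -0.49 belong to the same variable; the symbols $\bar{X}_g$ and $\bar{Z}_g$ for the subgroup mean raw score and mean Z-score are introduced here for illustration only:

```latex
\[
\bar{Z}_g = \frac{\bar{X}_g - \bar{X}_{ref}}{s_{ref}}
\quad\Longrightarrow\quad
\bar{X}_g = \bar{X}_{ref} + s_{ref}\,\bar{Z}_g
          = 107.38 + 12.66 \times (-0.49)
          \approx 101.18
\]
```

The result is approximate because the tabled mean Z-score is rounded to two decimals.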
The final bit of information to add is the percentage of each subgroup which lies outside of the specified range. These values are based on the number of subjects in each group who actually took the test and can enhance the information already listed by indicating the possible skewness of the subgroups. See Program Segment 8 for the PROC PRINT used to produce Table 3. The statements to produce Tables 1 and 2 are comparable.

SAS is the registered trademark of SAS Institute, Inc., Cary, NC, USA.

Address Correspondence To:
Walter W. Owen
The Biostatistics Center
7979 Old Georgetown Road, Suite 500
Bethesda, MD 20814
SUMMARY
The use of the global macro variables G and N, defining the number of subgroups and variables respectively, allows flexibility in the programming. Simply by varying the value of N, along with the appropriate modifications in the VARS macro containing the list of variables used, the table may reflect different subsets of test items.

The format of any of these tables may, of course, be changed to reflect the desired number of significant digits. If measurement units for the variables are needed, they should be included in the SAS variable labels. Units apply only for the reference group as Z-scores and percentages are unitless values.

The SAS macro language offers some intriguing possibilities for the ambitious programmer. If further generalizations were added, a procedure-style macro could be set up with defining parameters to cover many of the requirements set forth for the incoming data mentioned earlier in this paper. It has proven to be a formidable challenge to put group sizes into macro variables for use in labeling the output, but SAS capabilities should make this possible. Also, there is the possibility of using PUT statements to print the tables, although more information is generally needed to allow for varying column lengths, particularly for the variable labels.
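
One way the group sizes might be placed into macro variables in current SAS releases is with CALL SYMPUTX in a DATA step. This is a sketch under assumptions, not the method used in this paper: the data set SCORES, its values, and the macro variable names N_GROUP1 and N_GROUP2 are hypothetical.

```sas
/* Hypothetical input: one observation per subject */
data scores;
  input group $ x;
datalines;
GROUP1 5
GROUP1 7
GROUP2 6
GROUP2 8
GROUP2 9
;
run;

/* One observation per group, with the group size in COUNT */
proc freq data=scores noprint;
  tables group / out=freqset;
run;

/* Create one macro variable per group, named N_<group value> */
data _null_;
  set freqset;
  call symputx(cats('N_', group), count);
run;

%put Group 1 N=&N_GROUP1  Group 2 N=&N_GROUP2;
```

With these made-up observations the log line should report 2 and 3. The macro variables could then be referenced inside a double-quoted label such as "GROUP 1*(N=&N_GROUP1)*Mean Z".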

Table 1
TABLE OF MEAN Z-SCORES FOR GROUPS 1-2

VARIABLE                     GROUP 1    GROUP 2
DESCRIPTION                   MEAN Z     MEAN Z

N=                            125.00     212.00
SAS LABEL FOR VARIABLE 1       -0.49      -0.42
SAS LABEL FOR VARIABLE 2       -0.49      -0.54
SAS LABEL FOR VARIABLE 3        0.57       0.59
SAS LABEL FOR VARIABLE 4        0.52       0.45
SAS LABEL FOR VARIABLE 5        0.72       0.80

Table 2
TABLE OF MEAN Z-SCORES FOR GROUPS 1-2
NORMALIZED TO REFERENCE PARAMETERS

VARIABLE                   REFERENCE  REFERENCE    GROUP 1    GROUP 2
DESCRIPTION                     MEAN    STD DEV     MEAN Z     MEAN Z

N=                             85.00                125.00     212.00
SAS LABEL FOR VARIABLE 1      107.38      12.66      -0.49      -0.42
SAS LABEL FOR VARIABLE 2       61.54      14.84      -0.49      -0.54
SAS LABEL FOR VARIABLE 3        3.05       3.29       0.57       0.59
SAS LABEL FOR VARIABLE 4        0.35       0.07       0.52       0.45
SAS LABEL FOR VARIABLE 5        0.75       0.49       0.72       0.80

Table 3
TABLE OF MEAN Z-SCORES FOR GROUPS 1-2
NORMALIZED TO REFERENCE PARAMETERS
PERCENTAGE OF GROUP WITH |Z| > 1.64 SHOWN

VARIABLE                   REFERENCE  REFERENCE    GROUP 1            GROUP 2
DESCRIPTION                     MEAN    STD DEV     MEAN Z   PCT 1     MEAN Z   PCT 2

N=                             85.00                125.00             212.00
SAS LABEL FOR VARIABLE 1      107.38      12.66      -0.49      15      -0.42      16
SAS LABEL FOR VARIABLE 2       61.54      14.84      -0.49      18      -0.54      19
SAS LABEL FOR VARIABLE 3        3.05       3.29       0.57      12       0.59      15
SAS LABEL FOR VARIABLE 4        0.35       0.07       0.52      14       0.45      16
SAS LABEL FOR VARIABLE 5        0.75       0.49       0.72      19       0.80      25

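The PCT columns of Table 3 are group means of 0/100 indicators, as described under METHODS. A minimal sketch in current SAS syntax, assuming a hypothetical data set ZDEMO with a single Z-score variable Z1 and a cutoff of 1.64; none of the values come from the paper. Missing scores are left missing so that untested subjects drop out of the denominator.

```sas
/* Hypothetical Z-scores for two groups; '.' marks a subject who did not take the test */
data zdemo;
  input group $ z1;
  if z1 = . then cnt1 = .;                  /* stays out of the mean */
  else if abs(z1) > 1.64 then cnt1 = 100;   /* outside the range */
  else cnt1 = 0;                            /* inside the range */
datalines;
GROUP1 0.50
GROUP1 -2.10
GROUP1 1.70
GROUP1 .
GROUP2 0.10
GROUP2 1.00
;
run;

/* The mean of a 0/100 indicator is the percentage flagged */
proc means data=zdemo noprint nway;
  class group;
  var cnt1;
  output out=pctdemo mean=pct n=ntested;
run;

proc print data=pctdemo noobs;
  var group ntested pct;
  format pct 5.1;
run;
```

With these made-up values, GROUP1 should show 66.7 (two of three tested subjects flagged) and GROUP2 should show 0.0.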
PROGRAMMING SEGMENTS

Program Segment 1
Macro Definitions

    Assignment            Call     Meaning
    %LET G = 2;           &G       number of groups
    %LET N = 5;           &N       number of variables
    %LET CUT = 1.64;      &CUT     Z score cutoff
    %MACRO VARS;          %VARS    list of raw score variables
      variable list
    %MEND VARS;

Program Segment 2
Obtaining Group N's

    PROC FREQ;
      TABLES GROUP / OUT=FREQSET NOPRINT;
    /* One observation, with one column per group */
    PROC TRANSPOSE DATA=FREQSET OUT=FREQSET;
      ID GROUP;
      VAR COUNT;
    DATA FREQSET;
      LENGTH VARLABEL $40;
      SET FREQSET (RENAME=(_NAME_=VARNAME
                           REFGRP=REFMEAN));
      VARLABEL='N=';

Program Segment 3
Obtaining Reference Statistics

    PROC MEANS DATA=REFGRPS NOPRINT;
      VAR %VARS;
      OUTPUT OUT=REFMEANS MEAN=MEAN1-MEAN&N
                          STD=STD1-STD&N;
    /* Reshape to one row per variable: mean, std dev */
    PROC MATRIX;
      FETCH X DATA=REFMEANS;
      Y = SHAPE(X,&N);
      Z = Y';
      OUTPUT Z OUT=REFSET(RENAME=
                          (COL1=REFMEAN COL2=REFSTD));

Program Segment 4
Calculate Individual Z-scores

    DATA ZSCORES;
      IF _N_=1 THEN SET REFMEANS;
      SET GROUPS;
      ARRAY Z (H) %VARS;
      ARRAY V (H) %VARS;
      ARRAY M (H) MEAN1-MEAN&N;
      ARRAY S (H) STD1-STD&N;
      /* Z and V name the same variables, so each Z-score replaces its raw score */
      DO OVER S;
        IF S NE 0 THEN DO;
          Z = (V-M)/S;
        END;
        ELSE Z = .;
      END;

Program Segment 5
Calculate Mean Z-scores

    PROC SORT DATA=ZSCORES;
      BY GROUP;
    PROC MEANS DATA=ZSCORES NOPRINT;
      BY GROUP;
      VAR %VARS;
      OUTPUT OUT=ZMEANS MEAN=%VARS;
    PROC TRANSPOSE DATA=ZMEANS OUT=ZMEANS;
      ID GROUP;

Program Segment 6
Obtain the Percentage of Deviate Z-scores

    DATA PCTZ;
      SET ZSCORES;
      ARRAY Z (H) %VARS;
      ARRAY CNT (H) CNT1-CNT&N;
      DO OVER Z;
        IF ABS(Z) GT &CUT THEN CNT=100;
        ELSE IF Z NE . THEN CNT=0;
      END;
    PROC SORT DATA=PCTZ;
      BY GROUP;
    PROC MEANS DATA=PCTZ NOPRINT;
      BY GROUP;
      VAR CNT1-CNT&N;
      OUTPUT OUT=PCTZ MEAN=%VARS;
    PROC TRANSPOSE DATA=PCTZ OUT=PCTZ
                   PREFIX=PCT;
      VAR %VARS;

Program Segment 7
Combine and Concatenate Data

    DATA COMBINE;
      MERGE ZMEANS
            PCTZ
            REFSET (DROP=ROW);
      RENAME _NAME_ =VARNAME
             _LABEL_=VARLABEL;
    /* Group sizes first, then one observation per variable */
    DATA FINAL;
      SET FREQSET
          COMBINE;
      LABEL REFMEAN ='REFERENCE* Mean'
            REFSTD  ='REFERENCE* Std Dev'
            VARNAME ='VARIABLE'
            VARLABEL='VARIABLE*DESCRIPTION'
            GROUP1  ='GROUP 1*Mean Z'
            GROUP2  ='GROUP 2*Mean Z'
            PCT1    ='PCT 1'
            PCT2    ='PCT 2';

Program Segment 8
Printing Table 3

    PROC PRINT SPLIT='*';
      ID VARLABEL;
      VAR REFMEAN REFSTD GROUP1 PCT1
          GROUP2 PCT2;
      FORMAT REFMEAN REFSTD
             GROUP1-GROUP&G 8.2
             PCT1-PCT&G 3.0;
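
PROC MATRIX is not part of current SAS releases, so the reshaping step in Program Segment 3 needs a substitute there. A minimal sketch, assuming a hypothetical reference data set REFGRPS with two variables TEST1 and TEST2; it swaps in a DATA step with arrays for the PROC MATRIX call and produces one observation per variable with REFMEAN and REFSTD. The index variable VARNUM is an illustrative addition.

```sas
%let n = 2;                        /* number of variables (hypothetical) */

/* Hypothetical reference-group scores */
data refgrps;
  input test1 test2;
datalines;
10 100
12 110
14 120
;
run;

/* One observation holding all means and standard deviations */
proc means data=refgrps noprint;
  var test1 test2;
  output out=refmeans mean=mean1-mean&n std=std1-std&n;
run;

/* Turn the single wide observation into one row per variable */
data refset (keep=varnum refmean refstd);
  set refmeans;
  array m {&n} mean1-mean&n;
  array s {&n} std1-std&n;
  do varnum = 1 to &n;
    refmean = m{varnum};
    refstd  = s{varnum};
    output;
  end;
run;

proc print data=refset noobs;
run;
```

For these made-up scores the two rows should show a mean of 12 with standard deviation 2, and a mean of 110 with standard deviation 10.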