CREATING A SUMMARY TABLE OF NORMALIZED (Z) SCORES

Walter W. Owen
The Biostatistics Center
The George Washington University

SAS is the registered trademark of SAS Institute, Inc., Cary, NC, USA.

ABSTRACT

Data from the behavioral sciences are often analyzed by normalizing the scores for individuals in experimental subgroups to a reference population. Normalized scores, called Z-scores, may then be used to compare performance relative to the reference group either across the experimental subgroups or among different variables. Summary procedures allow group statistics to be output to SAS data sets. These data sets may be reshaped using the MATRIX and TRANSPOSE procedures before being brought together via SET and MERGE statements. The result is a compact table of normalized scores with SAS variable labels identifying the tests presented.

Addition of the reference values to the table allows the reader to extrapolate information about the experimental subgroup means and to compare the reference group to other populations reported in the literature. Further, the percentage of each subgroup having absolute Z-scores greater than an arbitrary cutoff could be added, yielding an even better definition of the experimental subgroups. For example, absolute scores greater than 1.64 indicate that an individual is performing at a level different from 90% of a normally distributed reference group.

INTRODUCTION

The use of normalized (Z) scores is widespread in the behavioral sciences. The process of normalization involves the transformation of data from experimental subgroups using the performance of a standard, or control, group as the initial point of reference. The resultant Z-scores allow researchers a common ground on which to compare a wide variety of tests that may be scored on different scales or are essentially objective in nature. The formulas for computing individual Z-scores based on reference populations and samples are listed below.

Population:  $Z = (X_i - \mu) / \sigma$
Sample:      $Z = (X_i - \bar{X}_{ref}) / s_{ref}$

Where:
$Z$ = Normalized Score
$X_i$ = Individualized Raw Score
$\mu$, $\bar{X}_{ref}$ = Reference Mean Scores
$\sigma$, $s_{ref}$ = Reference Standard Deviation

The resultant Z values are unitless scores indicating the number of standard deviations by which the corresponding raw score lies above or below the mean of the reference distribution. The reference group is characterized by a Z-score mean of zero and standard deviation of one. If an individual has an absolute Z-score of 1.64 or greater, he is performing at a level different from 90% of a normally distributed reference population. Similarly, an absolute Z-score greater than 1.96 puts the individual outside the range of 95% of that reference group.

When reporting the scores of several different subgroups for a battery of tests, it is desirable to present the results in tabular form. SAS provides several paths by which to create such a table. This paper will focus on gathering the information and the usefulness of different table layouts rather than elaborate methods for putting the information on paper. Accordingly, PROC PRINT, with LINESIZE options, was used to output the tables shown.

METHODS

Several requirements for an incoming data set should be established before elaborating on other methodology. Though the techniques described below work equally well for any number of subgroups, the groups must be classified by a single variable (perhaps GROUP) that identifies the experimental groupings as well as the reference group, in a mutually exclusive format. The appropriate PROC FORMAT statement should include values for all groups and labels suitable as SAS variable names (i.e., eight characters or fewer with no spaces). This format should be permanently assigned to the GROUPing variable when creating the groups. Global macro variables should be established to give the number of variables (called by &N in the programming segments that follow) and the number of groups used (called by &G, not including the reference group), thus allowing much of the remaining programming to be generalized to accept variations in these values (see Program Segment 1). Also, a macro (called by %VARS) listing the actual variables to be normalized is fundamental if the program is to be easily adapted for various purposes.

The number of observations per group is output from PROC FREQuency, TRANSPOSEd into a single observation, and saved for later use (see Program Segment 2). It is convenient to create a permanent length of 40 for the variable labels at this point. This will allow any label up to 40 characters to be printed in the final table without worrying about truncation in a subsequent MERGE statement. PROC PRINT will adjust spacing if no label requires this much space.

The next step in the process is to create two data sets, one for the reference group and the other for all of the subgroup data. The mean and standard deviation for each variable in the reference group is output as a single observation using PROC MEANS. A copy of this data set is reshaped by PROC MATRIX and output for use in the final tables as reference parameters (see Program Segment 3). PROC SUMMARY could also be used, but requires a separate PRINT statement to look at the data. For smaller data sets, PROC MEANS is preferred even if it is slightly less efficient.

Appending the reference statistics to each observation of the subgroup data set allows the calculation of individual Z-scores (see Program Segment 4). The scores should replace the original raw values, thus retaining the variable labels for future use. A word of caution: be aware that the variables must be able to accommodate the decimal portion of the newly created Z-score. A series of counting variables may be created to record which Z-scores are outside a desired range (perhaps 1.64 or 1.96 as described previously). If a value of 100 is used in these counting variables to mark a score as deviant and a value of zero if it is not, the mean of the values will automatically yield the percentage of individuals outside the specified range.

PROC MEANS (or SUMMARY) is used again, this time BY the GROUPing variable to output a data set of mean Z-scores with an observation for each subgroup (see Program Segment 5). PROC TRANSPOSE, using GROUP as the ID variable, will produce data ready for the final table. The same general process is used to prepare the percentages of outliers for the table (see Program Segment 6).

Data manipulation is completed by match MERGEing the reference statistics with the Z-score and percentage means for each subgroup. The group sizes may now be SET with the information collected for the variables in the previous step (see Program Segment 7).

THE TABLES

Now that all of the necessary information is together in a single data set having one observation for each variable plus one observation containing the group sizes, the tables may be PRINTed. The SPLIT option for labeling columns of PROC PRINT should be used to give better definition to the table. The most simplistic output (see Table 1) gives only the mean Z-scores for each subgroup.

Addition of the reference group means and standard deviations (see Table 2) will define where the values are centered and allows the reader to determine the means of the experimental subgroups. This is done by multiplying the reference standard deviation by the subgroup mean Z-score and then adding this value to the reference mean.

The final bit of information to add is the percentage of each subgroup which lies outside of the specified range. These values are based on the number of subjects in each group who actually took the test and can enhance the information already listed by indicating the possible skewness of the subgroups.
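As a worked illustration of the back-calculation described above for Table 2, take the first variable, reading its row as reference mean 107.38, reference standard deviation 12.66, and Group 1 mean Z-score -0.49. The symbols $\bar{X}_{group}$ and $\bar{Z}_{group}$ are notation assumed here for the subgroup mean and the subgroup mean Z-score; they are not the paper's own. Because the tabled values are rounded to two decimals, the result is only approximate:

```latex
\[
\begin{aligned}
\bar{X}_{group} &= \bar{X}_{ref} + s_{ref}\,\bar{Z}_{group} \\
                &= 107.38 + (12.66)(-0.49) \\
                &= 107.38 - 6.2034 \approx 101.18
\end{aligned}
\]
```

The same arithmetic for Group 2 (mean Z-score -0.42) gives roughly 102.06.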
See Program Segment 8 for the PROC PRINT used to produce Table 3. The statements to produce Tables 1 and 2 are comparable.

SUMMARY

The use of the global macro variables G and N, defining the number of subgroups and variables respectively, allows flexibility in the programming. Simply by varying the value of N, along with the appropriate modifications in the VARS macro containing the list of variables used, the table may reflect different subsets of test items. The format of any of these tables may, of course, be changed to reflect the desired number of significant digits. If measurement units for the variables are needed, they should be included in the SAS variable labels. Units apply only for the reference group, as Z-scores and percentages are unitless values.

The SAS macro language offers some intriguing possibilities for the ambitious programmer. If further generalizations were added, a procedure-style macro could be set up with defining parameters to cover many of the requirements set forth for the incoming data mentioned earlier in this paper. It has proven to be a formidable challenge to put group sizes into macro variables for use in labeling the output, but SAS capabilities should make this possible. Also, there is the possibility of using PUT statements to print the tables, although more information is generally needed to allow for varying column lengths, particularly for the variable labels.

Address Correspondence To:
Walter W. Owen
The Biostatistics Center
7979 Old Georgetown Road, Suite 500
Bethesda, MD 20814
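The difficulty noted in the SUMMARY, placing group sizes into macro variables for use in column labels, might be approached as in the following sketch. It is not the author's method. It relies on the CALL SYMPUTX routine and the VVALUE function of later SAS releases, and it assumes that the format on GROUP yields the labels GROUP1 and GROUP2, as the labels in Program Segment 7 suggest. The data set name GRPSIZE and the macro variable prefix N_ are hypothetical names chosen for this illustration.

```sas
/* Sketch only. GRPSIZE and the N_ prefix are names assumed for this   */
/* illustration; GROUPS, FINAL and the GROUP format come from the      */
/* paper's program segments.                                           */
proc freq data=groups noprint;
   tables group / out=grpsize;   /* one row per group: GROUP, COUNT, PERCENT */
run;

data _null_;
   set grpsize;
   /* VVALUE returns the formatted value of GROUP (e.g. GROUP1), so    */
   /* each count lands in a macro variable such as N_GROUP1            */
   call symputx(cats('N_', vvalue(group)), count);
run;

/* Double quotes let the macro variables resolve inside the labels */
proc print data=final split='*';
   id varlabel;
   var group1 group2;
   label group1 = "GROUP 1*(N=&N_GROUP1)*Mean Z"
         group2 = "GROUP 2*(N=&N_GROUP2)*Mean Z";
run;
```

With the group sizes carried in the column headings this way, the separate N= row of the tables would become optional.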
Table 1

TABLE OF MEAN Z-SCORES FOR GROUPS 1-2

VARIABLE                     GROUP 1   GROUP 2
DESCRIPTION                   MEAN Z    MEAN Z

N=                            125.00    212.00
SAS LABEL FOR VARIABLE 1       -0.49     -0.42
SAS LABEL FOR VARIABLE 2       -0.49     -0.54
SAS LABEL FOR VARIABLE 3        0.57      0.59
SAS LABEL FOR VARIABLE 4        0.52      0.45
SAS LABEL FOR VARIABLE 5        0.72      0.80

Table 2

TABLE OF MEAN Z-SCORES FOR GROUPS 1-2
NORMALIZED TO REFERENCE PARAMETERS

VARIABLE                   REFERENCE REFERENCE   GROUP 1   GROUP 2
DESCRIPTION                     MEAN   STD DEV    MEAN Z    MEAN Z

N=                             85.00              125.00    212.00
SAS LABEL FOR VARIABLE 1      107.38     12.66     -0.49     -0.42
SAS LABEL FOR VARIABLE 2       61.54     14.84     -0.49     -0.54
SAS LABEL FOR VARIABLE 3        3.05      3.29      0.57      0.59
SAS LABEL FOR VARIABLE 4        0.35      0.07      0.52      0.45
SAS LABEL FOR VARIABLE 5        0.75      0.49      0.72      0.80

Table 3

TABLE OF MEAN Z-SCORES FOR GROUPS 1-2
NORMALIZED TO REFERENCE PARAMETERS
PERCENTAGE OF GROUP WITH |Z| > 1.64 SHOWN

VARIABLE                   REFERENCE REFERENCE   GROUP 1          GROUP 2
DESCRIPTION                     MEAN   STD DEV    MEAN Z  PCT 1    MEAN Z  PCT 2

N=                             85.00              125.00           212.00
SAS LABEL FOR VARIABLE 1      107.38     12.66     -0.49     15     -0.42     16
SAS LABEL FOR VARIABLE 2       61.54     14.84     -0.49     18     -0.54     19
SAS LABEL FOR VARIABLE 3        3.05      3.29      0.57     12      0.59     15
SAS LABEL FOR VARIABLE 4        0.35      0.07      0.52     14      0.45     16
SAS LABEL FOR VARIABLE 5        0.75      0.49      0.72     19      0.80     25

PROGRAMMING SEGMENTS

Program Segment 1
Macro Definitions

    Call     Meaning                        Assignment
    &G       number of groups               %LET G = 2;
    &N       number of variables            %LET N = 5;
    &CUT     Z score cutoff                 %LET CUT = 1.64;
    %VARS    list of raw score variables    %MACRO VARS; variable list %MEND VARS;

Program Segment 2
Obtaining Group N's

    PROC FREQ;
      TABLES GROUP / OUT=FREQSET NOPRINT;
    PROC TRANSPOSE DATA=FREQSET OUT=FREQSET;
      ID GROUP;
      VAR COUNT;
    DATA FREQSET;
      LENGTH VARLABEL $40;
      SET FREQSET (RENAME=(_NAME_=VARNAME REFGRP=REFMEAN));
      VARLABEL='N=';

Program Segment 3
Obtaining Reference Statistics

    PROC MEANS DATA=REFGRPS NOPRINT;
      VAR %VARS;
      OUTPUT OUT=REFMEANS MEAN=MEAN1-MEAN&N STD=STD1-STD&N;
    PROC MATRIX;
      FETCH X DATA=REFMEANS;
      Y = SHAPE(X,&N);
      Z = Y';
      OUTPUT Z OUT=REFSET(RENAME=(COL1=REFMEAN COL2=REFSTD));

Program Segment 4
Calculate Individual Z-scores

    DATA ZSCORES;
      IF _N_=1 THEN SET REFMEANS;
      SET GROUPS;
      ARRAY Z (H) %VARS;
      ARRAY V (H) %VARS;
      ARRAY M (H) MEAN1-MEAN&N;
      ARRAY S (H) STD1-STD&N;
      DO OVER S;
        IF S NE 0 THEN DO;
          Z = (V-M)/S;
        END;
        ELSE Z = .;
      END;

Program Segment 5
Calculate Mean Z-scores

    PROC SORT DATA=ZSCORES;
      BY GROUP;
    PROC MEANS DATA=ZSCORES NOPRINT;
      BY GROUP;
      VAR %VARS;
      OUTPUT OUT=ZMEANS MEAN=%VARS;
    PROC TRANSPOSE DATA=ZMEANS OUT=ZMEANS;
      ID GROUP;

Program Segment 6
Obtain the Percentage of Deviate Z-scores

    DATA PCTZ;
      SET ZSCORES;
      ARRAY Z (H) %VARS;
      ARRAY CNT (H) CNT1-CNT&N;
      DO OVER Z;
        IF ABS(Z) GT &CUT THEN CNT=100;
        ELSE IF Z NE . THEN CNT=0;
      END;
    PROC SORT DATA=PCTZ;
      BY GROUP;
    PROC MEANS DATA=PCTZ NOPRINT;
      BY GROUP;
      VAR CNT1-CNT&N;
      OUTPUT OUT=PCTZ MEAN=%VARS;
    PROC TRANSPOSE DATA=PCTZ OUT=PCTZ PREFIX=PCT;
      VAR %VARS;

Program Segment 7
Combine and Concatenate Data

    DATA COMBINE;
      MERGE ZMEANS PCTZ REFSET (DROP=ROW);
      RENAME _NAME_=VARNAME _LABEL_=VARLABEL;
    DATA FINAL;
      SET FREQSET COMBINE;
      LABEL REFMEAN =REFERENCE*Mean
            REFSTD  =REFERENCE*Std Dev
            VARNAME =VARIABLE
            VARLABEL=VARIABLE*DESCRIPTION
            GROUP1  =GROUP 1*Mean Z
            GROUP2  =GROUP 2*Mean Z
            PCT1    =PCT 1
            PCT2    =PCT 2;

Program Segment 8
Printing Table 3

    PROC PRINT SPLIT=*;
      ID VARLABEL;
      VAR REFMEAN REFSTD GROUP1 PCT1 GROUP2 PCT2;
      FORMAT REFMEAN REFSTD GROUP1-GROUP&G 8.2 PCT1-PCT&G 3.0;
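For readers working in a later SAS release, the Z-score and counting-variable steps of Program Segments 4 and 6 could be combined into a single DATA step written with explicit array subscripts. This is a sketch under stated assumptions, not a replacement for the segments above: it assumes &N, &CUT, %VARS, REFMEANS and GROUPS exist as defined there, and the index variable I is a name chosen for this illustration.

```sas
/* Sketch only: Segments 4 and 6 in one step with explicit subscripts. */
/* &N, &CUT, %VARS, REFMEANS and GROUPS are assumed to exist as in the */
/* program segments; I is an index name chosen for this illustration.  */
data zscores;
   if _n_ = 1 then set refmeans;   /* reference means and SDs, read once */
   set groups;
   array v{&n}   %vars;            /* raw scores, overwritten by Z-scores */
   array m{&n}   mean1-mean&n;
   array s{&n}   std1-std&n;
   array cnt{&n} cnt1-cnt&n;
   do i = 1 to &n;
      /* normalize only when a score is present and the SD is nonzero */
      if s{i} ne 0 and v{i} ne . then v{i} = (v{i} - m{i}) / s{i};
      else v{i} = .;
      /* 100 marks a score beyond the cutoff, 0 one within it; a missing */
      /* score leaves CNT missing so it drops out of the group mean      */
      if v{i} = . then cnt{i} = .;
      else cnt{i} = 100 * (abs(v{i}) gt &cut);
   end;
   drop i mean1-mean&n std1-std&n;
run;
```

Because each counting variable holds 0 or 100, its group mean from PROC MEANS is directly the percentage of non-missing scores beyond the cutoff, which is the property the METHODS section relies on.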