PROC TABULATE: CONTROLLING TABLE APPEARANCE: August V. Treff, NationsBank appearance of a table by displaying the values for questions, QUES, versus responses, RESP. The SAS® code is: ABSTRACT PROC TABULATE provides one, two, and three dimensional tables. Tables may be constructed from class variables, analysis variables, and statistics keywords. Descriptive statistics can be displayed in table format with a minimum of code. This paper will use a five-response survey to demonstrate how proc tabulate features translate to a printed report. We will examine table dimensions, crossing and concatenating variables, and percent calculations. Controlling table appearance through formats and customized column headings also will be discussed. Many programmers do not use PROC TABULATE because they find that there is too much to learn about the procedure. PROC TABULATE has many more features than other procedures. We will examine the effect that these features have on the report display. TABLE APPEARANCE Tabular reports are defined by the table statement. Table page dimension definition, row dimension definition, column dimension definition; Commas are used to separate dimensions. If two dimensions are specified, they are interpreted as row and column definitions. One dimension is interpreted as a column definition. An asterisk is used to nest variables within a definition while a space concatenates variables. Our task is to prepare a tabular report of responses to a ten question survey. Responses have been entered into a sequential file. Elements include GENDER (M F), AGE group (1 to 3), GROUP code (1 to 3), LOCATION, and RESP (responses A through E). A blank in a field indicates no response while an asterisk indicates multiple responses. The item number, QUES, is determined from the position of the response. We begin our examination of how to control the - 1 - Proc Tabulate Data=SURVEY; Class QUES RESP; Table QUES, RESP; Variables QUES and RESP must be included in a CLASS statement if they are to be used in a TABLE statement. QUES serves as the row dimension while RESP serves as the column dimension. Our efforts are rewarded with a tabular display of question numbers crossed with responses. --------------------------------------------| | RESP | |------------------------| | A | B | |------------+-----------| | N | N |------------------+------------+-----------|QUES | | |------------------| | |1 | 167.00| 144.00 |------------------+------------+-----------|2 | 157.00| 144.00 The display of frequencies as two place decimals is not what is desired. The default format of BEST12. may be overridden with: Proc Tabulate FORMAT=5.; and we get: ------------------------------------------| | RESP | |----------------------| | A | B | C | D | |-----+-----+-----+----| | N | N | N | N |------------------+-----+-----+-----+----|QUES | | | | |------------------| | | | |1 | 167| 144| 281| 245 |------------------+-----+-----+-----+----|2 | 157| 144| 266| 263 In addition to suppressing the decimal places, we have reduced the width of each cell to five spaces. Our next area for improving the appearance of the table is the size of the row title space. By default, this space is one-fourth of the LINESIZE= value. Its size can be changed with an option on the TABLE statement. Table QUES, RESP / rts=6; RESP*SUM*_FREQ_ / rts=6; which gives us the following result: ----------------------------| | RESP | |----------------------| | A | B | C | D | |-----+-----+-----+----| | N | N | N | N |----+-----+-----+-----+----|QUES| | | | |----| | | | |1 | 167| 144| 281| 245 |----+-----+-----+-----+----|2 | 157| 144| 266| 263 and our report becomes: -------------------------| | Resp | |-------------------| | * | A | B | |------+------+-----| |Count |Count |Count | |------+------+-----| |_FREQ_|_FREQ_|_FREQ_ |----+------+------+-----|Item| | | |----| | | |1 | 10| 166| 145 |----+------+------+-----|2 | 7| 157| 144 The RTS= option changes the size of the row title space. The vertical lines count as spaces in RTS. This was not the case when we changed the cell width with FORMAT=. The word "Count" replaces "SUM" because of the keylabel statement. The analysis variable name _FREQ_ may be suppressed by adding the following: The name of a statistic, such as N, may be relabeled with a KEYLABEL statement. Class variables may be relabeled with a LABEL statement. Keylabel Label RESP*SUM*_FREQ_=' ' / rts=6; N='Count'; QUES='Item' RESP='Response'; The display becomes: -------------------------| | Resp | |-------------------| | * | A | B | |------+------+-----| |Count |Count |Count |----+------+------+-----|Item| | | |----| | | |1 | 10| 166| 145 |----+------+------+-----|2 | 7| 157| 144 The display becomes: -----------------------------------| | Response | | |-----------------------------| | | A | B | C | D | E | | |-----+-----+-----+-----+-----| | |Count|Count|Count|Count|Count| |----+-----+-----+-----+-----+-----| |Item| | | | | | |----| | | | | | |1 | 167| 144| 281| 245| 142| |----+-----+-----+-----+-----+-----| |2 | 157| 144| 266| 263| 154| This feature allows us to suppress the printing of a column definition variable or statistic. Adding _FREQ_=' ' to a label statement would suppress the word _FREQ_ but the blank cell would remain. Statistic names may also be suppressed in this manner. Data may be summarized in a PROC SUMMARY step before being reported with Proc Tabulate. This uses fewer resources and results in a shorter run time. In the PROC TABULATE step, an analysis variable is needed along with the statistic SUM. The PROC SUMMARY step is as follows: Labels may be included between the single quotes. This allows greater control over labels than do the KEYLABEL and LABEL statements. Proc Summary NWAY; Class QUES RESP; Output out=SURSUMM; Total responses to an item may be displayed by adding the keyword "all" to the column dimension definition. The TABLE statement is: The default variable, _FREQ_, provides the frequency of each crossing of QUES and RESP. _FREQ_ is used as the analysis variable for PROC TABULATE. Our new invocation is: Table QUES, All*SUM*_FREQ_=' ' RESP*SUM*_FREQ_=' ' / rts=6; or Proc Tabulate data=SURSUMM format=6.; Class QUES RESP; Var REQ_; Keylabel sum='Count'; Label QUES='Item' RESP='Response'; Table QUES, Table QUES, (All RESP)*SUM*_FREQ_=' ' / rts=6; - Either statement produces: 2 - --------------------------------------| | GENDER | | |--------------------| | | * | F | M | |----------------+------+------+------| |Item |Response| | | | |-------+--------| | | | |1 |A | 4| 92| 67| | |--------+------+------+------| | |B | 1| 83| 60| -------------------------| | | | | |------------| | ALL | * | A |----+------+------+-----|Item| | | |----| | | |1 | 989| 10| 166 |----+------+------+-----|2 | 991| 7| 157 The row title space is divided evenly between QUES and RESP columns. Three of the eighteen allotted spaces are used for delimiter characters "|". The first column is one character smaller than the second column. Variable ALL may be relabeled with the KEYLABEL statement or by using ='text' in the table statement. The table statement method has the advantage of allowing different labels for ALL in different contexts, as in: Values for GENDER are displayed in ASCII sort sequence by default. The order that variable values are displayed may be changed by using the ORDER option in the PROC TABULATE statement. Available options are DATA, FORMATTED, FREQ, and INTERNAL. DATA displays values in the order in which they are encountered in the data set. FORMATTED sorts by the formatted values. FREQ orders by descending frequency. Table QUES ALL='Column Total', (ALL='Row Total' RESP)* SUM*_FREQ_=' '; ALL is separated from RESP by a space. This concatenates the sums for ALL and each value of RESP. The use of parentheses allows us to avoid repeating the code *SUM*_FREQ_=' '. None of the options will result in the desired display of Male-Female followed by Invalid. Also, the option is applied to all variables in the table statement. Using ORDER=FREQ produces: Sums for ALL differ because of values of missing for RESP. Missing values are suppressed unless the missing option is included in the PROC TABULATE statement. Proc Tabulate data=SURSUMM format=6. missing; --------------------------------------| | GENDER | | |--------------------| | | F | M | * | |----------------+------+------+------| |Item |Response| | | | |-------+--------| | | | |1 |C | 152| 122| 2| | |--------+------+------+------| | |D | 126| 114| 4| | |--------+------+------+------| The missing statement must also be included in any preceding PROC SUMMARY. Our display becomes: --------------------------------| | | | | Row |-------------------| |Total | | * | A |----+------+------+------+-----|Item| | | | |----| | | | |1 | 1000| 11| 10| 166 |----+------+------+------+-----|2 | 1000| 9| 7| 157 The order in which GENDER values are displayed can best be controlled by recoding "M" as "0", "F" as "1", and invalid responses as "3". The display of these values may then be modified with a format. We code a PROC FORMAT step as follows: More complicated tables may be produced using the asterisk to nest variables within a dimension definition. We may want to display the item numbers along with the responses along the side. Gender will be displayed across the top of the table. We will use only responses of A through E in our analysis by including the statement Proc Format; Value $sex 'M'='0' ‘F'='1' other='3'; Value $gen '0'='Male' '1'='Female' '3'='Invalid'; If 'A'<=RESP<='E' then output; In the DATA-step. we code: in our DATA-step. GENDER = put(GENDER,$sex.); The TABLE statement: Our PROC TABULATE step then becomes: Table QUES*RESP, GENDER*SUM*_FREQ_=' ' /rts=18; Proc Tabulate format=6.; Class QUES RESP GENDER; Var FREQ_; Keylabel SUM='Count'; produces this display: - 3 - Class QUES RESP GENDER AGE; Format GENDER $gen. AGE $age.; Label QUES='Item' RESP='Response'; Table QUES*RESP, GENDER*SUM=' '*_FREQ_=' ' / rts=18; Format GENDER $gen.; The TABLE-Statement becomes: Table QUES*RESP, (GENDER AGE)*SUM*_FREQ_=' ' / rts=18; The printed report becomes: --------------------------------------| | GENDER | | |--------------------| | | | |Inval-| | | Male |Female| id | |----------------+------+------+------| |Item |Response| | | | |-------+--------| | | | |1 |A | 67| 92| 7| | |--------+------+------+------| | |B | 60| 83| 2| | |--------+------+------+------| | |C | 122| 152| 3| Our display becomes: ---------------------------------------| | AGE | GENDER |----------------------|---------------| Under | 21 to | | Male |Female | 21 | 50 |Over 50 +-------+-------+-------+-------+------| | | | | | | | | | | 67| 92| 12| 90| 57 +-------+-------+-------+-------+------| 60| 83| 6| 93| 44 The desired order is achieved. Because of the way in which we reassigned values for GENDER, both space and asterisk are tallied as "Invalid". Inserting an asterisk between GENDER and AGE in the table statement will nest AGE within GENDER. The display will be a cross of GENDER and AGE. The text "Invalid" did not fit into the six spaces allowed. PROC TABULATE will automatically divide headers where spaces or hyphens occur. It will not suppress the hyphen if the label fits in the allotted space. Changing the value in the PROC FORMAT step to '3'='In-valid' produces: Table QUES*RESP, GENDER*AGE*SUM*_FREQ_=' ' / rts=18; which then gives the result below: ------------------------------------------| GENDER |-----------------------------------------| Male | Female |-----------------------+-----------------| AGE | AGE |-----------------------+-----------------| Under | 21 to | | Under | 21 to | | 21 | 50 |Over 50| 21 | 50 |Ov +-------+-------+-------+-------+-------+-| | | | | | | | | | | | | 3| 33| 31| 9| 57| +-------+-------+-------+-------+-------+-| 2| 40| 18| 4| 53| ---------------------| GENDER | |--------------------| | | | In- | | Male |Female|valid | +------+------+------| Although this is an improved division for the word "Invalid", a better solution is to increase the default format by one space as in: Proc Tabulate format=7.; Subtotals can be produced by adding the keyword ALL after the appropriate class variable. This gives us: ------------------------| GENDER | |-----------------------| | Male |Female |Invalid| +-------+-------+-------| Table QUES*RESP, GENDER*(AGE all='All Ages')* SUM*_FREQ_=' ' / rts=18; Variables may be concatenated in the column dimension definition. We might want to display response distributions by GENDER and then by AGE. The values of AGE, one through three, are associated with a format “$age.”. which produces table below. ---------------------------------------| GENDER |--------------------------------------| Male | |-------------------------------+------| AGE | | |-----------------------| |------| Under | 21 to | | All | Under | 21 | 50 |Over 50| Ages | 21 +-------+-------+-------+-------+------| | | | | | | | | | Value $age '1'='Under 21' '2'='21 to 50' '3'='Over 50'; AGE is added to the CLASS and FORMAT statements. - 4 - Table LOC='Location', QUES*RESP, GENDER=' '*sum=' '*_FREQ_=' ' / rts=18 box='July ''97 Survey'; | 3| 33| 31| 67| 9 +-------+-------+-------+-------+------| 2| 40| 18| 60| 4 The variable names GENDER and AGE do not contribute to the readability of the report. Their printing may be suppressed with the addition of =' ' in the TABLE statement. Our display becomes: The resulting report is: Location 701 ---------------------------------|July '97 Survey | Male |Female | |----------------+-------+-------| |Item Response| | | |1 A | 26| 33| | B | 15| 19| | C | 44| 52| | D | 36| 50| | E | 25| 34| ---------------------------------------| Male | |-------------------------------+------| Under | 21 to | | All | Under | 21 | 50 |Over 50| Ages | 21 +-------+-------+-------+-------+------| | | | | | | | | | | 3| 33| 31| 67| 9 +-------+-------+-------+-------+------| 2| 40| 18| 60| 4 Text which does not fit in the box is truncated. When box=_page_ is used, the page dimension text is printed in the box. Adding the NOSEPS option to the PROC TABULATE statement will suppress the horizontal lines below the headers. This option allows more lines to be printed on one page. The revised report becomes: ---------------------------------|Location 701 | Male |Female | |----------------+-------+-------| |Item Response| | | |1 A | 26| 33| | B | 15| 19| | C | 44| 52| -------------------------------------------| | Male | |-------------------------| | Under | 21 to | | | | 21 | 50 |Over 50| A |----------------+-------+-------+-------+-|Item Response| | | | |1 A | 3| 33| 31| | B | 2| 40| 18| | C | 4| 83| 35| A fourth table dimension became available when version 6 of SAS allowed the printing of by variables in titles. More than one table may be printed on a physical page by adding the CONDENSE option to the table statement. Invoking the option NOBYLINE and using the physical page variable in the TITLE and the BY statement will result in multiple short tables being printed on a physical page. The code is: Also suppressed is the vertical separator in the row title space. A third dimension is available with PROC TABULATE.. This dimension, the page dimension, is the first definition when three are included in the TABLE statement. We could provide reports by location with the following TABLE statement: Options nobyline; Title 'Population at Location ##byval(LOC)'; Proc Tabulate noseps format=6.; Class GENDER GROUP AGE; Var _FREQ_; By LOC; Table GENDER, AGE, GROUP*SUM=' '*_FREQ_=' ' / rts=18 box=_page_ condense; Format GENDER $gen. GROUP $grp. AGE $age.; Table LOC='Location', QUES*RESP, GENDER=' '*sum=' '*_FREQ_=' ' / rts=18; LOC is also added to the CLASS statement. Our report becomes: Location 701 ---------------------------------| | Male |Female | |----------------+-------+-------| |Item Response| | | |1 A | 26| 33| | B | 15| 19| A page of the resulting report is: Population at Location #701 -----------------------------------------|GENDER Male | GROUP | | |-----------------------| | | Sales | R & D | Other | |----------------+-------+-------+-------| |AGE | | | | The box at the upper left corner of the table may be used in two ways. Text may be inserted using the TABLE state- ment. - 5 - QUES, RESP*sum=' '*_FREQ_=' ' / rts=6 printmiss; |Under 21 | 10| 39| .| |21 to 50 | 315| 356| 254| |Over 50 | 208| 227| 47| ----------------------------------------------------------------------------------|GENDER Female | GROUP | | |-----------------------| | | Sales | R & D | Other | |----------------+-------+-------+-------| |AGE | | | | |Under 21 | 39| 30| 10| |21 to 50 | 479| 493| 244| |Over 50 | 216| 234| 145| ------------------------------------------ The resulting table displays values of missing under response B for location #703. Location 703 ---------------------------------------------| | Response | | |--------------------------------------| | | A | B | C | D | E | |----+-------+-------+-------+-------+-------| |Item| | | | | | |----| | | | | | |1 | 6| .| 13| 4| 2| |----+-------+-------+-------+-------+-------| |2 | 9| .| 4| 9| 3| Location #702 has no males represented in survey responses. The physical page for #702 would be as follows: Population at Location #702 -----------------------------------------|GENDER Female | GROUP | | |-----------------------| | | Sales | R & D | Other | |----------------+-------+-------+-------| |AGE | | | | |Under 21 | 88| 108| 30| |21 to 50 | 879| 976| 461| |Over 50 | 379| 357| 266| ------------------------------------------ It is possible that no one in the population selected B for any question. We would like to have a column printed for B with missing values. To accomplish this, we need to inform PROC TABULATE that B is a valid answer. One way to accomplish this is to read the text of questions and responses from a keyed file. This file of valid responses would be merged with the data set created by the PROC SUMMARY step. Data should be sorted by the by variable. Alternatively, NOTSORTED may be added to the by statement. Data VALID; Merge SURSUMM QUESTS; By QUES RESP; If _FREQ_ = . then LOC='701'; When a value is not represented in a table, PROC TABULATE does not include the value in the table. At location #703, no one selected response B for any question. When we report the cross of LOC, QUES, and RESP with the following code: Any crossing of QUES and RESP which was not represented in the data set SURSUMM would have a value of missing for _FREQ_. Assigning a valid value for LOC avoids producing a table for missing LOC. We produce the desired tables with a combination of the MISSING and PRINTMISS options: Proc Tabulate format=7.; Class LOC QUES RESP; Var _FREQ_; Label QUES='Item' RESP='Response'; Table LOC='Location', QUES, RESP*sum=' '*_FREQ_=' ' / rts=6; Proc Tabulate format=7. missing; Class LOC QUES RESP; Var _FREQ_; Label QUES='Item' RESP='Response'; Table LOC='Location', QUES, RESP*sum=' '*_FREQ_=' ' / rts=6 printmiss; this table is produced for location #703: Location 703 -------------------------------------| | Response | | |-------------------------------| | | A | C | D | E | |----+-------+-------+-------+-------| |Item| | | | | |----| | | | | |1 | 7| 14| 4| 2| |----+-------+-------+-------+-------| |2 | 9| 4| 9| 3| The table for location #703 is: Location 703 ---------------------------------------------| | Response | | |--------------------------------------| | | A | B | C | D | E | |----+-------+-------+-------+-------+-------| |Item| | | | | | |----| | | | | | |1 | 6| .| 13| 4| 2| |----+-------+-------+-------+-------+-------| |2 | 9| .| 4| 9| 3| Including the PRINTMISS option in the table statement specifies that row and column headings are to be the same for all logical pages of the report. The TABLE statement becomes: The MERGE with valid responses can be used to screen for inappropriate responses. Item 1 may have had Table LOC='Location', - 6 - | | SUM |PCTSUM| SUM |PCTSUM |----+------+------+------+-----|Item| | | | |----| | | | |1 | 167| 2| 144| 1 |----+------+------+------+-----|2 | 157| 2| 144| 1 responses of A through D. Two people at location #703 selected the inappropriate response E. Our data step becomes: Data VALID; Merge SURSUMM QUESTS (in=A); By QUES RESP; If _FREQ_ = . then LOC='701'; If ^A then RESP='Z'; The format=6. does not allow the display of decimal places. This default may be overridden by adding *f=6.1 after PCTSUM: Table QUES, RESP*(sum pctsum*f=6.1)* _FREQ_=' ' / rts 6; Any inappropriate responses have been recoded as "Z". A format for RESP may be written which prints "Invalid" instead of "Z". Proc Format; Value $rsp 'Z'='Invalid'; which gives us something a bit better: --------------------------------| | | |--------------------------| | A | B | |-------------+------------| | SUM |PCTSUM| SUM |PCTSUM |----+------+------+------+-----|Item| | | | |----| | | | |1 | 167| 1.7| 144| 1.5 |----+------+------+------+-----|2 | 157| 1.6| 144| 1.5 The resulting table becomes: --------------------------------------Response | --------------------------------------| B | C | D | E |Invalid| ------+-------+-------+-------+-------| | | | | | | | | | | .| 13| 4| .| 2| ------+-------+-------+-------+-------| .| 4| 9| 3| .| Decimals are displayed but the percents are much smaller that we might anticipate. By default, PCTSUM calculates the percent that each sum is of all sums in the table. The sum of all of the percents is 100%. The two inappropriate responses are reported under the heading "Invalid" rather than "E". PERCENT CALCULATIONS We would like to know what percent each sum is We would like to know what percent each sum is of the sum of the row. To calculate this, we must specify the denominator definition to be used. The TABLEstatement is: In addition to reporting the frequency of responses, we may be asked to include the percent responding a particular way. Returning to our report of questions crossed with responses, the PROC TABULATE step is: Table QUES, RESP*(sum pctsum<RESP>*f=6.1)* _FREQ_=' ' / rts 6; Proc Tabulate format=6.; Class QUES RESP; Var _FREQ_; Label QUES='Item' RESP='Response'; Table QUES, RESP*sum=' '*_FREQ_=' ' / rts 6; Our table becomes: --------------------------------| | | |--------------------------| | A | B | |-------------+------------| | SUM |PCTSUM| SUM |PCTSUM |----+------+------+------+-----|Item| | | | |----| | | | |1 | 167| 17.1| 144| 14.7 |----+------+------+------+-----|2 | 157| 16.0| 144| 14.6 We can request that percents of the sums be printed by adding PCTSUM to the TABLE statement. Table QUES, RESP*(sum pctsum)*_FREQ_=' ' / rts 6; The resulting table is not exactly what is desired. RESP is the denominator definition because we would add all of the sums across all values of RESP to calculate the percent that each value is of the total. Our percents now sum to 100% across each row. --------------------------------| | | |--------------------------| | A | B | |-------------+------------- 7 - Had we used QUES as a denominator definition, we would have calculated the percent that a response value was associated with each item number. The percents would sum to 100% down each column. ***** CONCLUSION PROC TABULATE can best be learned by experience. The code on the next page (attachment) will generate a test file of survey responses. SAS is a registered trademark in the U.S.A. of the SAS Institute Inc., Cary NC Use the test file to create one and two dimensional tables. Add features to the code as you become more confident with the Procedure. Have a clear picture of the desired table before beginning to code. If you have any questions or comments, you may contact the author at: NationsBank 10 Light Street MD4-302-33-01 Baltimore, MD 21202 REFERENCES / ACKNOWLEDGMENTS PROC TABULATE is covered in the manual “ SAS Guide to TABULATE Processing “, Second Edition (410) 605-8673 PHONE (410) 605-1199 FAX0 - 8 - - 9 - PROC TABULATE: CONTROLLING TABLE APPEARANCE: August V. Treff, NationsBank - 1 - ATTACHMENT: SAS CODE USED TO GENERATE SAMPLE DATA Proc Format; Value rsp 0-15 ='A' 16-30 ='B' 31-60 ='C' 61-85 ='D' 86-100='E'; Value gen 0-40 ='M' 41-100='F'; Value age 0-15 ='1' 6-70 ='2' 71-100='3'; Value grp 0-35 ='1' 36-72 ='2' 73-100='3'; Value loc 0-35 ='701' 36-71 ='702' 72-75 ='703' 76-100='704'; Data RESP; File DISK notitles; Do J=1 to 1000; Do I=1 to 10; If I=1 then do; GENDER = put(int(100*ranuni(10)),gen.); AGE = put(int(100*ranuni(0)),age.); LOC = put(int(100*ranuni(5)),loc.); GROUP = put(int(100*ranuni(1)),grp.); put @1 GENDER @2 AGE @3 GROUP @6 LOC @; end; RESP = put(int(100*ranuni(0)),rsp.); put @I+10 RESP @; end; put; end; - 2 -