Public Health 144A, Sections 1 and 2
Spring 2009

Using Statistical PROCs to Create Output Data Sets

So far in our use of SAS we have used statistical PROCs -- e.g., PROC MEANS -- to produce statistical results from previously created (temporary) SAS Data Sets. These SAS Data Sets are input into the desired SAS PROC either explicitly (with the DATA= option) or implicitly (based upon the convention that SAS PROCs operate, by default, on the most recently created SAS Data Set). Almost all of the statistical PROCs, however, also allow for the output of SAS Data Sets. These output SAS Data Sets have many applications and are commonly used by SAS programmers. In this introduction, we will look at the output SAS Data Set options provided by two of the procedures that we have already used, PROC MEANS and PROC UNIVARIATE.

The PROC MEANS Scenario

Consider the following SAS code:

    PROC SORT DATA=MYFILE; BY WOMAN;
    PROC MEANS DATA=MYFILE; VAR GRBMI; BY WOMAN;
    OUTPUT OUT=BMISTATS N=N_BMI MEAN=MEAN_BMI STD=STD_BMI;

This code -- assumed to exist somewhere in a complete SAS program -- operates on the temporary SAS Data Set MYFILE.

In the first line, MYFILE is sorted by the variable WOMAN so that WOMAN can be used as a BY-variable in the succeeding SAS Procedure.

In the second line, we have specified a PROC MEANS as we have in previous examples and exercises. This PROC MEANS operates on the SAS Data Set MYFILE, creating a report of summary statistics for the variable GRBMI for each value of the variable WOMAN. Recall that in the CHDS Data Set a woman may have multiple records due to the possibility of multiple pregnancies and multiple births.

The SAS code in the third line is new. This line -- often referred to as a "SAS Output Statement" -- is a single SAS Statement that specifies the name and contents of a new output SAS Data Set.
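To see what an output data set such as BMISTATS contains, the same three statements can be tried on a very small data set. The following is only a minimal sketch: the data set name TOYFILE and all of the data values are made up for illustration and do not come from the CHDS file. A PROC PRINT has been added so that the output data set can be inspected.

```sas
/* Hypothetical toy data: three women with 2, 1, and 3 records. */
/* All values are invented for illustration only.               */
DATA TOYFILE;
  INPUT WOMAN GRBMI;
  DATALINES;
101 22.5
101 23.1
102 27.0
103 19.8
103 20.4
103 21.0
;
RUN;

/* Sort so that WOMAN can be used as a BY-variable. */
PROC SORT DATA=TOYFILE; BY WOMAN;
RUN;

/* Same statements as in the example, applied to the toy data. */
PROC MEANS DATA=TOYFILE;
  VAR GRBMI;
  BY WOMAN;
  OUTPUT OUT=BMISTATS N=N_BMI MEAN=MEAN_BMI STD=STD_BMI;
RUN;

/* List the output data set to see its records and variables. */
PROC PRINT DATA=BMISTATS;
RUN;
```

With this toy input, BMISTATS should contain three observations, one for each value of WOMAN. For woman 102, who has only one record, STD_BMI is missing, because a standard deviation cannot be computed from a single value.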
The SAS Output Statement in the PROC MEANS example is a rather simple one, producing a single SAS Data Set with three output statistical variables, based upon one BY-variable. Note these specific features of SAS Output Statements in general and of this SAS Output Statement in particular:

* The SAS Output Statement begins with the (required) SAS keyword OUTPUT.

* The SAS Output Statement should contain the keyword OUT= followed by the name of the output SAS Data Set (if OUT= is omitted, SAS assigns a default name of the form DATAn). In this case, we have named the output SAS Data Set BMISTATS. Note that this SAS Data Set is similar to a SAS Data Set that has been created in a SAS Data Step. It is a temporary SAS Data Set that is available to you while SAS is running and vanishes when you exit SAS.

* The rest of the SAS Output Statement contains assignments for the variables to be contained in the output SAS Data Set (BMISTATS). Specific statistics are requested by using the official SAS keyword for the statistic (e.g., MEAN), followed by an equal sign (=), followed by the (arbitrary) variable name of your choice. In this example, we are asking for three statistics: the sample size (official SAS name: N), the mean (official SAS name: MEAN), and the standard deviation (official SAS name: STD). For the variable named in the VAR Statement, each requested summary statistic will be computed and added to the output SAS Data Set (BMISTATS), one record for each value of the specified BY-variable.

* It is important to note that this output SAS Data Set will contain, in all:

    * a single record for each unique value of the variables named in the BY Statement. In this case, the output SAS Data Set, BMISTATS, will contain a single record for each woman.

    * the requested statistics, computed for each unique value of the BY-variable. In this case, for each woman in the study, the requested summary statistics -- n, mean, and standard deviation -- will be computed.

    * two "automatic variables", _TYPE_ and _FREQ_.
For now, we won't be concerned with _TYPE_ and _FREQ_ except to note that _TYPE_ can be useful when we use the OUTPUT Statement in association with subsets that are indicated by the SAS CLASS Statement, and that _FREQ_ is merely the number of records associated with each BY-variable value (including records for which the analysis variable is missing).

You may elect to add other "identifying" variables to the output SAS Data Set with the use of a so-called "ID Variable." Variables listed in the ID Statement are also included in the output SAS Data Sets created by SAS Procedures.

* SAS allows for several summary statistics to be output; in this example we have only asked for three of them. The available statistic keywords include:

    N      MAX  RANGE  SUMWGT  USS  VAR  STDERR  SKEWNESS  T
    NMISS  MIN  SUM    MEAN    CSS  STD  CV      KURTOSIS  PRT

Note that the syntax for SAS Output Statements varies among different SAS PROCs. The SAS reference documentation provides all of the details for producing output SAS Data Sets for each of the statistical procedures.

EXERCISE 1

Edit one of your SAS Programs to create the following SAS program; when you are done, save it as MLTBRTH1.SAS. This program accomplishes the following:

* Reads in the full CHDS Basic Data Set.

* Removes all records with a missing BIRTHWT variable. (This affects singletons as well as children within a multiple pregnancy.)

* Uses a combination of the "subsetting IF Statement" and the FIRST-LAST convention to create a SAS Data Set comprised of all multiple births (twins, triplets, etc.).

* Uses the multiple births SAS Data Set in a PROC MEANS with an OUTPUT Statement to create another SAS Data Set comprised of selected summary statistics for the BIRTHWT variable for each pregnancy associated with a multiple birth.

* Prints the first 20 observations of this new, output SAS Data Set.
    /* your name */
    OPTIONS LS=80;
    FILENAME INDATA 'C:\PH144\BASIC.DAT';

    DATA ALLPREGS;
    INFILE INDATA;
    INPUT WOMAN 1-5 PREG 1-6 BIRTHWT 50-52;

    /*SET MISSING VALUES FOR SELECTED VARIABLES*/
    IF BIRTHWT = 999 THEN BIRTHWT = .;

    /*REMOVE RECORDS WITH MISSING BIRTH WEIGHT VARIABLE*/
    IF BIRTHWT NE .;                      /* <--- note the subsetting IF */

    /*CREATE SAS DATA SET WITH ALL MULTIPLE BIRTHS*/
    PROC SORT; BY PREG;
    DATA ALLMULTS; SET ALLPREGS; BY PREG;
    IF FIRST.PREG EQ 0 OR LAST.PREG EQ 0; /* <--- this code selects multiple births */

    /*CREATE SAS DATA SET: MEAN BIRTH WEIGHTS FOR EACH MULTIPLE BIRTH*/
    PROC MEANS DATA=ALLMULTS NOPRINT;     /* <--- note the DATA= and NOPRINT options */
    BY PREG; VAR BIRTHWT;
    OUTPUT OUT=MULTWTS N=N_BWT MEAN=MEAN_BWT STD=STD_BWT;

    PROC PRINT DATA=MULTWTS (OBS=20);
    RUN;

Save this SAS program, submit it, and examine the output. Note some of the special features of this program and its output:

1. Since this program uses multiple SAS Data Sets (ALLPREGS, ALLMULTS, and MULTWTS), we have used the DATA= option in PROC MEANS and PROC PRINT to explicitly specify the SAS Data Set to be operated on. (PROC SORT has no DATA= option here, so it operates on the most recently created SAS Data Set, ALLPREGS.)

2. The PROC MEANS Statement contains a new option: NOPRINT. This option suppresses the normal printed output that we would expect to see from PROC MEANS and is often used when creating an output SAS Data Set. If this option were not included, SAS would produce a separate PROC MEANS statistical report for each value of the PREG variable. That would amount to hundreds of lines of output, none of which is needed for our goal of producing an output SAS Data Set.

3. Examine the first twenty observations of the output SAS Data Set, MULTWTS. As discussed above, note that this SAS Data Set contains the following variables:

    * the BY-variable: PREG.

    * each of the requested summary statistics, using the (arbitrary) names that we have chosen: N_BWT, MEAN_BWT, and STD_BWT.

    * the two "automatic variables": _TYPE_ and _FREQ_.

As a second exercise . . .
EXERCISE 2

Edit MLTBRTH1.SAS to accomplish the following:

* Read in the full CHDS Basic Data Set (same as above).

* Remove all records with a missing BIRTHWT variable (same as above).

* Use the FIRST-LAST convention to create an indicator variable -- call it BIRTH -- to separately identify or flag singleton births and multiple births (twins, triplets, etc.). That is, create a two-valued variable, BIRTH, which takes on one value (perhaps: 1) for records associated with a singleton birth and another value (perhaps: 2) for records associated with a multiple birth.

* Use an OUTPUT Statement in PROC MEANS to create a new SAS Data Set that includes the mean of the BIRTHWT variable for each pregnancy -- both singleton and multiple births. As in the previous example, name the variable MEAN_BWT. Obviously, each singleton birth will contribute only one record to its mean; each multiple birth will contribute multiple records.

* Print the first 20 observations of this new, output SAS Data Set.

* Using the new PROC MEANS output SAS Data Set, perform a (group) t-test on the mean birth weight variable (MEAN_BWT); use the indicator variable BIRTH as the classification variable. This should ensure that birth weights for multiple births are represented by the mean of the weights of all of the infants within each pregnancy.

* Use SAS COMMENT Statements appropriately.

Save this SAS program as MLTBRTH2.SAS.

The PROC UNIVARIATE Scenario

PROC UNIVARIATE's OUTPUT Statement is similar to that of PROC MEANS; however, it allows for a somewhat different set of statistics to be output. Of particular interest is the availability of any percentile for output -- e.g., the 1st percentile, the 21st percentile, the 87th percentile. In order to specify which percentiles you want to be output, you must (1) indicate the "Percentile Prefix" and (2) indicate the specific percentiles.
The "prefix" becomes the first part of each output variable name; the requested percentile is appended to it as the suffix of the variable name. The prefix is specified in the OUTPUT Statement with PCTLPRE= and the percentiles are specified with PCTLPTS=. So, for example, the following Output Statement

    OUTPUT OUT=PERCOUT PCTLPRE=P_ PCTLPTS=33 66;

creates an output data set named PERCOUT with variables P_33 and P_66, along with any BY-Variables and ID-Variables.

For class discussion, consider the following SAS Program that uses the SAS OUTPUT Statement within PROC UNIVARIATE.

    OPTIONS LINESIZE=80;
    FILENAME INDATA 'C:\PH144\BASIC.DAT';

    DATA ALLPREGS;
    INFILE INDATA;
    INPUT WOMAN 1-5 PREG 1-6 BIRTHWT 50-52;

    /*SET MISSING VALUES FOR SELECTED VARIABLES*/
    IF BIRTHWT = 999 THEN BIRTHWT = .;

    /*REMOVE RECORDS WITH MISSING BIRTH WEIGHT VARIABLE*/
    IF BIRTHWT NE .;

    /*CREATE MULTIPLE BIRTH INDICATOR VARIABLE*/
    PROC SORT; BY PREG;
    DATA ALLPREGS; SET ALLPREGS; BY PREG;
    IF FIRST.PREG EQ 1 AND LAST.PREG EQ 1 THEN BIRTH=1;
    IF FIRST.PREG EQ 0 OR LAST.PREG EQ 0 THEN BIRTH=2;

    /*CREATE SAS DATA SET: BIRTH WEIGHT PERCENTILES FOR SINGLETON AND MULTIPLE BIRTHS*/
    PROC SORT; BY BIRTH;
    PROC UNIVARIATE DATA=ALLPREGS NOPRINT;
    BY BIRTH; VAR BIRTHWT;
    OUTPUT OUT=BW_PCTLS PCTLPRE=P_ PCTLPTS= 0 TO 100 BY 5;

    PROC FORMAT;
    VALUE BIRTHF 1='SINGLE' 2='MULTIPLE';

    PROC PRINT DATA=BW_PCTLS NOOBS;
    FORMAT BIRTH BIRTHF.;
    RUN;

This program produces the following output from PROC PRINT. Because the listing is wider than the line size, it is printed in two blocks; the rows of the second block are in the same order as those of the first (SINGLE, then MULTIPLE).

                              The SAS System

    BIRTH      P_0  P_5  P_10  P_15  P_20  P_25  P_30  P_35  P_40  P_45
    SINGLE       1   84    94    99   102   105   108   110   113   115
    MULTIPLE     2   30    47    57    63    68    73    77    82    83

    P_50  P_55  P_60  P_65  P_70  P_75  P_80  P_85  P_90  P_95  P_100
     117   119   122   124   126   129   132   136   140   147    225
      86    88    90    93    96    99   103   106   112   116    138
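The way PCTLPRE= and PCTLPTS= combine to form variable names can also be checked on a tiny data set without reading the CHDS file. The following is only a minimal sketch: the data set names TOYVALS and TOYPCTLS, the variable X, and the data values are all made up for illustration.

```sas
/* Hypothetical toy data: the values 1 through 9, invented for illustration. */
DATA TOYVALS;
  INPUT X @@;
  DATALINES;
1 2 3 4 5 6 7 8 9
;
RUN;

/* Request the 25th, 50th, and 75th percentiles with the prefix P_. */
/* NOPRINT suppresses the usual PROC UNIVARIATE report.             */
PROC UNIVARIATE DATA=TOYVALS NOPRINT;
  VAR X;
  OUTPUT OUT=TOYPCTLS PCTLPRE=P_ PCTLPTS=25 50 75;
RUN;

/* List the output data set to see the generated variable names. */
PROC PRINT DATA=TOYPCTLS NOOBS;
RUN;
```

Since there is no BY Statement here, TOYPCTLS should contain a single observation, with the variables P_25, P_50, and P_75.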