The PROC MEANS Scenario

Public Health 144A, Sections 1 and 2
Spring 2009
Using Statistical PROCs to Create Output Data Sets
So far in our use of SAS we have used statistical PROCs -- e.g., PROC MEANS -- to produce statistical results from previously created (temporary) SAS Data Sets. These SAS Data Sets are input into the desired SAS PROC either explicitly (with
the use of the DATA= option) or implicitly (based upon the convention that SAS PROCs operate, by default, on the most
recently created SAS Data Set).
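The two ways of supplying input can be seen side by side in a minimal sketch. The data set name DEMO, its variables, and its values are made up for illustration and are not part of the CHDS data.

```sas
/* Toy data: DEMO and its values are hypothetical. */
DATA DEMO;
  INPUT ID WEIGHT;
  DATALINES;
1 120
2 135
3 150
;
RUN;

/* Explicit: the DATA= option names the input SAS Data Set. */
PROC MEANS DATA=DEMO;
  VAR WEIGHT;
RUN;

/* Implicit: with no DATA= option, the PROC uses the most   */
/* recently created SAS Data Set (here, DEMO again).        */
PROC MEANS;
  VAR WEIGHT;
RUN;
```

Both PROC MEANS steps should report on the same three records. The explicit form is generally safer once a program creates more than one SAS Data Set.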
Almost all of the statistical PROCs, however, can also write their results to output SAS Data Sets. These output SAS Data Sets have
many applications and are commonly used by SAS programmers. In this introduction, we will look at the output SAS Data
Sets produced by two of the procedures that we have already used, PROC MEANS and PROC UNIVARIATE.
The PROC MEANS Scenario
Consider the following SAS code:
PROC SORT DATA=MYFILE; BY WOMAN;
PROC MEANS DATA=MYFILE; VAR GRBMI; BY WOMAN;
OUTPUT OUT=BMISTATS N=N_BMI MEAN=MEAN_BMI STD=STD_BMI;
This code -- assumed to exist somewhere in a complete SAS program -- operates on the temporary SAS Data Set
MYFILE. In the first line, MYFILE is sorted by the variable WOMAN so that WOMAN can be used as a BY-variable
in the succeeding SAS Procedure. In the second line, we have specified a PROC MEANS as
we have in previous examples and exercises. This PROC MEANS operates on the SAS Data Set MYFILE, creating a
report of summary statistics for the variable GRBMI for each value of the variable WOMAN. Recall that in the CHDS Data
Set a woman may have multiple records due to the possibility of multiple pregnancies and multiple births.
The SAS code in the third line is new. This line -- often referred to as a "SAS Output Statement" -- represents a single
SAS Statement and specifies the name and contents of a new output SAS Data Set. This is a rather simple version of a
SAS Output Statement, producing a single SAS Data Set with three output statistical variables, based upon one BY-variable.
Note these specific features of SAS Output Statements in general and this SAS Output Statement in particular:
* The SAS Output Statement begins with the (required) SAS word OUTPUT.
* The SAS Output Statement names the output SAS Data Set with OUT= followed by the data set name
-- in this case, we have named the output SAS Data Set BMISTATS. (If OUT= is omitted, SAS assigns a default
name of the form DATAn.) Note that this SAS Data Set is similar to
a SAS Data Set that has been created in a SAS Data Step. It is a temporary SAS Data Set that is available to
you while SAS is running and vanishes when SAS is exited.
* The rest of the SAS Output Statement contains assignments for variables to be contained in the output SAS
Data Set (BMISTATS). Specific statistics are requested by using the official SAS name for the statistics (e.g.
MEAN) followed by an equal sign (=), followed by the (arbitrary) name of your choice. In this example, we are
asking for three statistics: the sample size (official SAS name: N), the mean (official SAS name: MEAN), and
the Standard Deviation (official SAS name: STD). For each variable that we've indicated, the summary
statistic we have specified will be computed and added to the output SAS Data Set (BMISTATS), one record
for each value within the specified BY-variable.
* It is important to note that this output SAS Data Set will contain, in all:
    * a single record for each unique value of the variables named in the BY Statement. In this case, the
      output SAS Data Set, BMISTATS, will contain a single record for each woman.
    * the requested statistics, computed for each unique value of the BY-variable. In this case, for each
      woman in the study, the requested summary statistics -- n, mean, and standard deviation -- will be
      computed.
    * two "automatic variables", _TYPE_ and _FREQ_. For now, we won't be concerned with these last
      two variables except to note that _TYPE_ can be useful when the OUTPUT Statement is used in
      association with subsets that are indicated by the SAS CLASS Statement, and that _FREQ_ is
      merely the number of records associated with each BY-variable value.
  You may elect to add other "identifying" variables to the output SAS Data Set with the ID Statement. Variables
  listed in the ID Statement are also included in the SAS Output Data Sets created by SAS Procedures.
* SAS allows for many summary statistics to be output; in this example we have asked for only three of them.
  The statistic keywords are:
    N         NMISS     MIN       MAX       RANGE     SUM
    SUMWGT    MEAN      USS       CSS       VAR       STD
    STDERR    CV        SKEWNESS  KURTOSIS  T         PRT
Note that the syntax for SAS Output Statements varies among different SAS PROCs. The SAS reference documentation
will provide all of the details for producing output SAS Data Sets for each of the statistical procedures.
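The three-line example above can be tried end to end on a small toy file. This is only a sketch: the six records below are invented for illustration (they are not CHDS values), and NOPRINT is added to suppress the printed report so that only the output SAS Data Set is shown.

```sas
/* Hypothetical toy data: three women, one record per birth. */
DATA MYFILE;
  INPUT WOMAN GRBMI;
  DATALINES;
101 22.5
101 23.1
102 27.0
103 19.8
103 20.4
103 21.0
;
RUN;

/* BY-group processing requires the data to be sorted by WOMAN. */
PROC SORT DATA=MYFILE; BY WOMAN;
RUN;

/* One output record per woman, holding N, mean and std of GRBMI. */
PROC MEANS DATA=MYFILE NOPRINT;
  BY WOMAN;
  VAR GRBMI;
  OUTPUT OUT=BMISTATS N=N_BMI MEAN=MEAN_BMI STD=STD_BMI;
RUN;

/* Inspect WOMAN, _TYPE_, _FREQ_ and the three named statistics. */
PROC PRINT DATA=BMISTATS;
RUN;
```

The printed BMISTATS should contain three records, one per value of WOMAN. A woman with a single record (102 in this toy file) has a missing standard deviation, since a standard deviation cannot be computed from one value.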
EXERCISE 1
Edit one of your SAS Programs to create the following SAS program; when you are done, save it as MLTBRTH1.SAS .
This program accomplishes the following:
* Reads in the full CHDS Basic Data Set.
* Removes all records with a missing BIRTHWT variable. (Affects singleton and all children within a multiple
pregnancy.)
* Uses a combination of the "subsetting IF Statement" and the FIRST-LAST convention to create a SAS Data
Set comprised of all multiple births (twins, triplets, etc.).
* Uses the multiple births SAS Data Set with a PROC MEANS OUTPUT Statement to create
another SAS Data Set comprised of selected summary statistics for the BIRTHWT variable
for each pregnancy associated with a multiple birth.
* Prints the first 20 observations of this new, output SAS Data Set.
/* your name */
OPTIONS LS=80;
FILENAME INDATA 'C:\PH144\BASIC.DAT';
DATA ALLPREGS;
INFILE INDATA;
INPUT WOMAN 1-5 PREG 1-6 BIRTHWT 50-52;
/*SET MISSING VALUES FOR SELECTED VARIABLES*/
IF BIRTHWT = 999 THEN BIRTHWT = .;
/*REMOVE RECORDS WITH MISSING BIRTH WEIGHT VARIABLE*/
/*NOTE THE SUBSETTING IF*/
IF BIRTHWT NE .;
/*CREATE SAS DATA SET WITH ALL MULTIPLE BIRTHS*/
PROC SORT; BY PREG;
DATA ALLMULTS; SET ALLPREGS; BY PREG;
/*THIS CODE SELECTS MULTIPLE BIRTHS*/
IF FIRST.PREG EQ 0 OR LAST.PREG EQ 0;
/*CREATE SAS DATA SET: MEAN BIRTH WEIGHTS FOR EACH MULTIPLE BIRTH*/
/*NOTE THE DATA= AND NOPRINT OPTIONS*/
PROC MEANS DATA=ALLMULTS NOPRINT; BY PREG; VAR BIRTHWT;
OUTPUT OUT=MULTWTS N=N_BWT MEAN=MEAN_BWT STD=STD_BWT;
PROC PRINT DATA=MULTWTS (OBS=20);
RUN;
Save this SAS program, submit it and examine the output.
Note some of the special features of this program and its output:
1. Since this program uses multiple SAS Data Sets (ALLPREGS, ALLMULTS, and MULTWTS), we have used the
DATA= option in PROC MEANS and PROC PRINT to explicitly specify the SAS Data Set to be operated on.
2. The PROC MEANS Statement contains a new option: NOPRINT. This option suppresses the printed output that
we would normally expect from PROC MEANS and is often used when creating an output SAS Data Set.
If this option were not included, SAS would print a separate PROC MEANS report for each value of the
PREG variable -- this would produce hundreds of lines of output that are not needed for our goal of
producing an output SAS Data Set.
3. Examine the first twenty observations of the output SAS Data Set: MULTWTS. As discussed above, note that this
SAS Data Set contains the following variables:
* the BY-variable: PREG.
* each of the requested summary statistics, using the (arbitrary) names that we have chosen:
N_BWT, MEAN_BWT, and STD_BWT.
* the two "automatic variables": _TYPE_ and _FREQ_.
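The FIRST-LAST convention used in the program above can be seen on a tiny example. This is a sketch under stated assumptions: the data set TOY, its three records, and the variable names IS_FIRST and IS_LAST are invented for illustration.

```sas
/* Hypothetical toy data: PREG 1 is a singleton, PREG 2 is a twin pair. */
DATA TOY;
  INPUT PREG BIRTHWT;
  DATALINES;
1 120
2 95
2 88
;
RUN;

PROC SORT DATA=TOY; BY PREG;
RUN;

DATA FLAGS;
  SET TOY; BY PREG;
  /* FIRST.PREG and LAST.PREG exist only during the Data Step and  */
  /* are not written to the output SAS Data Set, so copy them into */
  /* ordinary variables in order to print them.                    */
  IS_FIRST = FIRST.PREG;
  IS_LAST  = LAST.PREG;
RUN;

PROC PRINT DATA=FLAGS;
RUN;
```

The singleton record is both the first and the last record of its BY-group, so both flags equal 1. Each twin record has at least one flag equal to 0, which is why the condition FIRST.PREG EQ 0 OR LAST.PREG EQ 0 keeps only records from multiple births.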
As a second exercise . . .
EXERCISE 2:
Edit MLTBRTH1.SAS to accomplish the following:
* Read in the full CHDS Basic Data Set (same as above).
* Remove all records with a missing BIRTHWT variable (same as above).
* Use the FIRST-LAST convention to create an indicator variable -- call it BIRTH -- to separately
identify or flag singleton births and multiple births (twins, triplets, etc.). That is, create a two-valued variable,
BIRTH, which takes on one value (perhaps: 1) for records associated with singleton births and another value
(perhaps: 2) for records associated with a multiple birth.
* Use an OUTPUT Statement in PROC MEANS to create a new SAS Data Set that includes the mean for the
BIRTHWT variable for each pregnancy -- both singleton and multiple births. As in the previous example,
name the variable MEAN_BWT. Obviously, the singleton births will only contribute one record to each of the
means; the multiple births will contribute multiple records.
* Print the first 20 observations of this new, output SAS Data Set.
* Using the new PROC MEANS output SAS Data Set, perform a (group) t-test on the mean birth weight variable
(MEAN_BWT); use the indicator variable BIRTH as the classification variable. This should ensure that birth
weights for multiple births will be represented by the mean of the weights for all of the infants within each
pregnancy.
* Use SAS COMMENT Statements appropriately.
Save this SAS program as MLTBRTH2.SAS .
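For reference, a two-group t-test is commonly requested with PROC TTEST, using a CLASS Statement for the grouping variable and a VAR Statement for the analysis variable. The sketch below shows only that general pattern on made-up data; the names TOYMEANS, GROUP, and SCORE are hypothetical, and it is not a solution to the exercise.

```sas
/* Hypothetical toy data: two groups of three values each. */
DATA TOYMEANS;
  INPUT GROUP SCORE;
  DATALINES;
1 118
1 122
1 115
2 90
2 97
2 85
;
RUN;

/* The CLASS variable must have exactly two levels. */
PROC TTEST DATA=TOYMEANS;
  CLASS GROUP;
  VAR SCORE;
RUN;
```

In the exercise, the classification variable must be present in the PROC MEANS output SAS Data Set before it can be used this way; the ID Statement described earlier is one way to carry an identifying variable into that data set.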
The PROC UNIVARIATE Scenario
PROC UNIVARIATE's OUTPUT Statement is similar to that of PROC MEANS; however, it allows for a somewhat different
set of statistics to be output. Of particular interest is the availability of any percentile for output -- e.g., 1st percentile, 21st
percentile, 87th percentile. In order to specify which percentiles you want to be output, you must (1) indicate the "Percentile
Prefix" and (2) indicate the specific percentiles. The "prefix" becomes the first part of the output variable name; it is
then joined with the specific percentile requested, which becomes the suffix of the variable name. The "prefix" is indicated in
the OUTPUT Statement as PCTLPRE= and the percentiles are indicated as PCTLPTS=. So, for example, the following
Output Statement
OUTPUT OUT=PERCOUT PCTLPRE=P_ PCTLPTS=33 66;
creates an output data set named PERCOUT with variables P_33 and P_66, along with any BY-Variables and ID-Variables.
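To see the variable naming in action, the Output Statement above can be run on a small toy file. This is a sketch only: the data set TOYWT and its twelve values are invented for illustration.

```sas
/* Hypothetical toy data; the trailing @@ reads several values per line. */
DATA TOYWT;
  INPUT BIRTHWT @@;
  DATALINES;
84 96 101 107 110 113 117 120 124 129 133 140
;
RUN;

/* Request the 33rd and 66th percentiles; the prefix P_ plus the */
/* percentile number gives the output variable names.            */
PROC UNIVARIATE DATA=TOYWT NOPRINT;
  VAR BIRTHWT;
  OUTPUT OUT=PERCOUT PCTLPRE=P_ PCTLPTS=33 66;
RUN;

PROC PRINT DATA=PERCOUT;
RUN;
```

With no BY Statement, PERCOUT should contain a single record with the two variables P_33 and P_66.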
Consider the following SAS Program that uses the SAS OUTPUT Statement within PROC UNIVARIATE for class
discussion.
OPTIONS LINESIZE=80;
FILENAME INDATA 'C:\PH144\BASIC.DAT';
DATA ALLPREGS;
INFILE INDATA;
INPUT WOMAN 1-5 PREG 1-6 BIRTHWT 50-52;
/*SET MISSING VALUES FOR SELECTED VARIABLES*/
IF BIRTHWT = 999 THEN BIRTHWT = .;
/*REMOVE RECORDS WITH MISSING BIRTH WEIGHT VARIABLE*/
IF BIRTHWT NE .;
/*CREATE MULTIPLE BIRTH INDICATOR VARIABLE*/
PROC SORT; BY PREG;
DATA ALLPREGS; SET ALLPREGS; BY PREG;
IF FIRST.PREG EQ 1 AND LAST.PREG EQ 1 THEN BIRTH=1;
IF FIRST.PREG EQ 0 OR LAST.PREG EQ 0 THEN BIRTH=2;
/*CREATE SAS DATA SET: BIRTH WEIGHT PERCENTILES FOR SINGLETON AND MULTIPLE BIRTHS*/
PROC SORT; BY BIRTH;
PROC UNIVARIATE DATA=ALLPREGS NOPRINT; BY BIRTH; VAR BIRTHWT;
OUTPUT OUT=BW_PCTLS PCTLPRE=P_ PCTLPTS= 0 TO 100 BY 5;
PROC FORMAT; VALUE BIRTHF 1='SINGLE' 2='MULTIPLE';
PROC PRINT DATA=BW_PCTLS NOOBS; FORMAT BIRTH BIRTHF.;
RUN;
This program produces the following output from PROC PRINT.
                              The SAS System

BIRTH       P_0   P_5  P_10  P_15  P_20  P_25  P_30  P_35  P_40  P_45
SINGLE        1    84    94    99   102   105   108   110   113   115
MULTIPLE      2    30    47    57    63    68    73    77    82    83

           P_50  P_55  P_60  P_65  P_70  P_75  P_80  P_85  P_90  P_95  P_100
            117   119   122   124   126   129   132   136   140   147    225
             86    88    90    93    96    99   103   106   112   116    138