Chapter 2 - Describing Data

1. Summary Statistics - Proc Means
2. More Statistics and Plots - Univariate
   Note that SAS is not recommended for good plots. Other packages such as Mathematica, Matlab, Maple and R are much better.
3. Proc Sort

This covers sections 2.A-H. You should also read section 19I.

Creating a SAS data set: Example

    /* Population, population density, births and deaths
       for Western European countries, 1995 */
    DATA EUROPE_W;   /* this creates a SAS data set called EUROPE_W */
    /* Source: Organisation for Economic Co-op. and Devel.
       Labour Force Stat., 1976-1996, Paris, 1997 Ed. */
    /* :$12. keeps country names longer than the default 8 characters */
    INPUT COUNTRY :$12. POP DENSITY BRATE DRATE;
    /* POP = population in 1000s,
       DENSITY = residents per km^2,
       BRATE, DRATE = birth, death rate per 1000 */
    DATALINES;
    Austria       8047   95.9     .     .
    Belgium      10137  332.4     .     .
    Denmark       5228  121.3  13.4  12.0
    Finland       5108   15.1  12.3   9.6
    France       58143  105.9  12.5   9.1
    Germany      81661  228.8   9.4  10.8
    Greece       10454   79.2   9.7   9.6
    Iceland        267   2.61   7.2   6.0
    Ireland       3598   51.2     .     .
    Italy        57283  190.2     .     .
    Luxembourg     413  158.8  13.2   9.3
    Netherlands  15459 378.91   2.3   8.8
    Norway        4348   13.4  13.8  10.3
    Portugal      9918  107.3  10.8  10.5
    Spain        39210   77.7   9.2   8.7
    Sweden        8847   19.7  11.6  10.6
    Switzerland   7062  171.0  11.6   8.9
    UK           58606  239.4  12.5  11.0
    ;

Questions of interest:
1. How many missing birth rates are in the sample?
2. What is the mean population density?
3. How variable is population density from country to country?
4. What is the distribution of population? Of population density?

Another SAS Data Set: Infile and Input

The file snails.txt contains data from an experiment in which groups of 20 snails were held for periods of 1, 2, 3 or 4 weeks in carefully controlled conditions of temperature and relative humidity. There were two species of snail: A and B. At the end of the exposure time the snails were tested to see if they had survived; the process itself is fatal for the animals.
Using the INFILE and INPUT statements, the data can be read into a SAS data set called SNAILS.

    Species  Time  Humidity  Temperature  Fatalities  N
    A        1     60.0      10           0           20
    A        1     60.0      15           0           20
    ...........................................
    B        4     75.8      20           7           20

Questions of interest:
1. What are the mean and standard deviation of the number of fatalities of species B for each level of exposure (TIME)?
2. What is the distribution of the number of fatalities?
3. What is an approximate 95% confidence interval for the mean number of fatalities?
4. How many times did 0 fatalities occur?

Proc Means

Syntax:

    PROC MEANS DATA = SASdata options;
    (optional statements)

Explanation: the DATA option specifies a SAS data set. If this option is omitted, SAS uses the most recently created SAS data set.

Examples:

    PROC MEANS DATA = EUROPE_W;
    PROC MEANS DATA = SNAILS;
    PROC MEANS;

Europe Sample Example

    PROC MEANS DATA = EUROPE_W;

    The SAS System
    The MEANS Procedure

    Variable    N          Mean       Std Dev       Minimum       Maximum
    POP        18      21321.61      25376.41   267.0000000      81661.00
    DENSITY    18   132.7122222   108.3853853     2.6100000   378.9100000
    BRATE      14    10.6785714     3.0579477     2.3000000    13.8000000
    DRATE      14     9.6571429     1.4318972     6.0000000    12.0000000

Optional Statements for Proc Means

To request specific statistics, list the corresponding keywords as options in the PROC MEANS statement, e.g. N, NMISS, MEAN, STD, STDERR, CLM, MIN, MAX, SUM, VAR, CV, SKEWNESS, KURTOSIS and T. The MAXDEC=n option limits the number of decimal places printed. An additional option is NOPRINT, which suppresses printing of output in the Output window.

    PROC MEANS DATA = EUROPE_W NMISS MEAN STD VAR MAXDEC=4;

gives the number of missing observations for each variable in the SAS data set EUROPE_W, as well as the mean, standard deviation and variance. The MAXDEC option restricts the number of decimal places to 4.

Several optional statements can also be used, including the TITLE, VAR, CLASS, BY and OUTPUT statements.
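As a minimal sketch of how these keywords combine, the step below asks for the counts, mean, standard deviation and 95% confidence limits for two of the EUROPE_W variables in one call. It assumes the EUROPE_W data set from the earlier DATA step exists; the choice of variables and of MAXDEC=2 is only for illustration.

```sas
/* Sketch only: assumes the EUROPE_W data set created earlier.      */
/* N and NMISS count non-missing and missing values; CLM requests   */
/* confidence limits for the mean (95% unless ALPHA= is changed).   */
PROC MEANS DATA = EUROPE_W N NMISS MEAN STD CLM MAXDEC=2;
  VAR DENSITY BRATE;   /* restrict the summary to these variables */
RUN;
```

A single call like this speaks to the first three questions of interest for the Europe sample: NMISS counts the missing birth rates, MEAN gives the mean density, and STD describes its variability.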
Europe Sample Example

    PROC MEANS DATA = EUROPE_W NMISS MEAN STD VAR MAXDEC=4;

    The SAS System
    The MEANS Procedure

    Variable   N Miss         Mean      Std Dev       Variance
    POP             0   21321.6111   25376.4144   643962409.08
    DENSITY         0     132.7122     108.3854     11747.3917
    BRATE           4      10.6786       3.0579         9.3510
    DRATE           4       9.6571       1.4319         2.0503

Subcommand statements for Proc Means

The TITLE statement is useful for preparing reports.
The VAR statement specifies which variables the summary statistics should be computed for.

Example:

    PROC MEANS DATA = EUROPE_W NMISS MEAN STD VAR;
    TITLE 'Demographic Statistics for Western Europe';
    VAR DENSITY BRATE DRATE;

Subgrouping with the Class Statement

The CLASS statement is used when we require computation of the various summary statistics for different subgroups or classes. For example, to estimate the mean number of fatalities for each of the two species of snail, we use SPECIES as a class variable.

Example (try this one yourself using the data file from the web):

    DATA SNAILS;
    INFILE 'snails.txt';
    INPUT SPECIES $ TIME HUMIDITY TEMP FATALITY N;
    PROC MEANS DATA=SNAILS MEAN;
    TITLE 'Mean Fatalities For Each Species of Snail';
    VAR FATALITY;
    CLASS SPECIES;
    RUN; QUIT;

After execution, the Output window contains the two averages:

    Mean Fatalities For Each Species of Snail
    A    0.708333
    B    4.020833

We are actually interested in the mean number of fatalities for each type of snail at each level of exposure (TIME). Thus, TIME is a second classification variable, nested within the first classification variable SPECIES.
We can obtain all of the required averages, as well as 95% confidence limits for the true mean in each case, with the following:

    PROC MEANS DATA=SNAILS MEAN CLM;
    TITLE 'Mean Fatalities For Each Species of Snail';
    VAR FATALITY;
    CLASS SPECIES TIME;
    RUN;

Subgrouping with the By Statement

The BY statement is almost interchangeable with the CLASS statement. However, it only works when the data set is sorted by the BY variables. The CLASS statement does not have this restriction.

Example:

    PROC MEANS DATA=SNAILS MEAN CLM;
    TITLE 'Mean Fatalities For Each Species of Snail';
    VAR FATALITY;
    BY SPECIES TIME;
    RUN;

This works because SNAILS is already sorted by SPECIES and, within each value of SPECIES, by TIME.

The CLASS statement uses more memory than BY, but BY tends to be slower overall when the data must first be sorted, since sorting is a slow operation. These differences are only noticeable for large data sets.

Using Output from Proc Means

The OUTPUT statement is used to create a new SAS data set consisting of the summary statistics computed by PROC MEANS.

Example: The following creates a new SAS data set called SNAILSUM which will contain 2 observations (one for each species) holding the statistics in the 3 variables M_FATAL, S_FATAL and V_FATAL, alongside SPECIES and the automatic variables _TYPE_ and _FREQ_. The NWAY option keeps only the per-species rows; without it, the data set also contains a row for all snails combined (_TYPE_ = 0).

    PROC MEANS DATA=SNAILS MEAN STD VAR NWAY NOPRINT;
    VAR FATALITY;
    CLASS SPECIES;
    OUTPUT OUT=SNAILSUM MEAN=M_FATAL
                        STD =S_FATAL
                        VAR =V_FATAL;
    RUN;

Output: Another Example

The following creates a SAS data set consisting of a single observation on the two variables M_BRATE and M_DRATE. For each statistic in the OUTPUT statement, list one new variable name per variable in the VAR statement, in the same order.

    PROC MEANS DATA=EUROPE_W MEAN;
    VAR BRATE DRATE;
    OUTPUT OUT=EUROPSUM MEAN=M_BRATE M_DRATE;
    RUN;

These new SAS data sets can later be used by other SAS procedures, if desired.

Proc Means: Example

Here we plot a histogram of the averages of the numbers of fatalities.
Note that we have used the NOPRINT option here to suppress output to the Output window. The NWAY option restricts SNAILSUM to the averages for each SPECIES and TIME combination; without it, the overall and marginal averages would be charted as well.

    PROC MEANS DATA=SNAILS MEAN NWAY NOPRINT;
    TITLE 'Mean Fatalities For Each Species of Snail';
    VAR FATALITY;
    CLASS SPECIES TIME;
    OUTPUT OUT=SNAILSUM MEAN=M_FATAL;   /* one OUTPUT statement, no semicolon before MEAN= */
    PROC CHART DATA=SNAILSUM;
    VBAR M_FATAL;
    RUN;

PROC UNIVARIATE

Syntax:

    PROC UNIVARIATE DATA = SASdata options;
    statements;

Many of the options are the same as for PROC MEANS. Some additional ones are available: see page 29 of the textbook.

The default output is quite extensive and includes the median and quartiles, the extreme percentiles, and the lowest and highest 5 observations. These last are useful for checking that the data have been read in sensibly.

PROC UNIVARIATE options

The NORMAL option requests formal tests of normality. Adding the PLOT option gives a crude normal QQ plot, an informal, yet useful, check of normality. It is a plot of the ordered observations versus the expected values of ordered normal observations. If the plot is close to a straight line, then the data are approximately normally distributed. Otherwise, the data are likely non-normal.

Normal QQ Plot: Example

This checks whether the distribution of Western European population densities is approximately normal.

    PROC UNIVARIATE DATA=EUROPE_W NORMAL PLOT;
    VAR DENSITY;
    RUN;

To train your eye to recognize typical departures from normality, it helps to simulate normal and non-normal data sets of various sample sizes:

    DATA _NULL_;
    FILE 'normal.dat';
    N = 20;
    DO I=1 TO N;
      X = RANNOR(0);   /* standard normal variate; seed 0 uses the clock */
      PUT X;
    END;
    RUN; QUIT;

Normal QQ Plotting

Now, construct the normal QQ plot:

    DATA NORTEST;
    INFILE 'normal.dat';
    INPUT X;
    PROC UNIVARIATE NORMAL PLOT;
    VAR X;
    RUN; QUIT;

Repeating this for a number of different simulation runs will give you a good notion of what the normal QQ plot should look like.
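The simulation can also be run without the intermediate file, which makes it quicker to repeat. The following is a sketch under stated assumptions: the data set name NORSIM and the sample size of 50 are made up for illustration, and the PLOT option is included on the assumption that it is what draws the probability plot in your SAS release.

```sas
/* Sketch only: NORSIM and N = 50 are illustrative choices.        */
DATA NORSIM;
  N = 50;
  DO I = 1 TO N;
    X = RANNOR(0);   /* standard normal variate; seed 0 uses the clock */
    OUTPUT;          /* write one observation per loop iteration */
  END;
  KEEP X;            /* drop the loop counter and N from the data set */
RUN;

PROC UNIVARIATE DATA=NORSIM NORMAL PLOT;
  VAR X;
RUN;
```

Because the seed is 0, each submission draws a fresh sample, so rerunning the two steps with different values of N shows how much the plot wanders around a straight line at small sample sizes.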
Normal QQ Plotting of Non-Normal Data

To see what a normal QQ plot shouldn't look like, try something like the following:

    DATA _NULL_;
    FILE 'normal.dat';
    N = 20;
    DO I=1 TO N;
      U = UNIFORM(0);
      /* mixture: about 80% standard normal, 20% normal with 5 times the spread */
      IF U < .8 THEN X = RANNOR(0);
      ELSE X = 5*RANNOR(0);
      PUT X;
    END;
    RUN; QUIT;

or

    DATA _NULL_;
    FILE 'normal.dat';
    N = 20;
    DO I=1 TO N;
      X = RANEXP(0);   /* exponential variate: strongly right-skewed */
      PUT X;
    END;
    RUN; QUIT;

In each case, create the normal QQ plot to see what happens when the data are really not normally distributed.

The Plot option and Proc Means

Crude stem-and-leaf plots, boxplots and normal probability plots can be produced using the PLOT option.

Most of the statements that can be used with PROC MEANS can be used with PROC UNIVARIATE. The exception is the CLASS statement. You must make sure the data are sorted properly and use the BY statement instead. (Newer SAS releases also accept a CLASS statement in PROC UNIVARIATE.)

PROC SORT

Syntax:

    PROC SORT DATA=SASdata;
    BY var1 var2 ... ;

Example 1:

    PROC SORT DATA = EUROPE_W;
    BY DENSITY;

The SAS data set then becomes

    Country        POP  DENSITY  BRATE  DRATE
    Iceland        267     2.61    7.2    6.0
    Norway        4348    13.40   13.8   10.3
    Finland       5108    15.10   12.3    9.6
    ................................
    Netherlands  15459   378.91    2.3    8.8

The following sorts the data set so that DENSITY appears in descending order.

    PROC SORT DATA = EUROPE_W;
    BY DESCENDING DENSITY;
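Sorting and BY-group processing fit together as in the sketch below, which sorts the snail data by species and then runs PROC UNIVARIATE separately for each species. It assumes the SNAILS data set from the earlier example; the output name SNAILS_SORTED is a hypothetical choice, used with the OUT= option so that the original data set is left in its original order.

```sas
/* Sketch only: assumes the SNAILS data set created earlier;       */
/* SNAILS_SORTED is an illustrative name.                          */
PROC SORT DATA=SNAILS OUT=SNAILS_SORTED;
  BY SPECIES;        /* sort key must match the BY statement below */
RUN;

PROC UNIVARIATE DATA=SNAILS_SORTED;
  VAR FATALITY;
  BY SPECIES;        /* one full set of output per species */
RUN;
```

Without OUT=, PROC SORT replaces the input data set with the sorted version, which is what the EUROPE_W examples above do.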