Introduction to the UNIVARIATE Procedure

Kim L. Kolbe Ritzow of Systems Seminar Consultants, Kalamazoo, MI

Abstract

PROC UNIVARIATE is a powerful BASE SAS PROC that combines many of the features found in other analytical PROCs such as FREQ, MEANS, SUMMARY, and TABULATE into a single PROC step. PROC UNIVARIATE is an excellent exploratory data analysis tool. It provides more information, both descriptively and graphically, in a single pass of the data than any other BASE SAS PROC. In some cases it provides information that cannot be found in any other BASE SAS PROC, such as the data's median, mode, quartiles, and percentiles.

This paper discusses how to interpret some of the results generated by PROC UNIVARIATE, reviews its syntax, and provides efficiency tips and techniques.

A Simple PROC UNIVARIATE

PROC UNIVARIATE without any options or statements will produce a UNIVARIATE report for all numeric variables on the data set, which may give you more information than you desire.

    PROC UNIVARIATE DATA=SASDATA.WGTDATA;
    RUN;

It is more efficient to limit the scope of the analysis to the numeric variables you are interested in by naming them on the VAR statement.

    PROC UNIVARIATE DATA=SASDATA.WGTDATA;
       VAR WEIGHT;
    RUN;

(see Table 1 for resulting output)

The ID Statement

The ID statement on PROC UNIVARIATE names a variable that identifies the highest and lowest values in the EXTREMES section of the default report by the value of the identifying variable rather than by the observation number. The ID statement does not affect any part of the report other than the EXTREMES section.

    PROC UNIVARIATE DATA=SASDATA.WGTDATA;
       VAR WEIGHT;
       ID GENDER;
    RUN;

(see Table 2 for resulting output)

FREQ, PLOT, and NORMAL Options

Further information can be derived from PROC UNIVARIATE by using the FREQ, PLOT, or NORMAL options on the PROC UNIVARIATE statement. The FREQ option generates a frequency table, much like that of PROC FREQ, except that PROC FREQ generates cumulative counts, which the FREQ option does not. The PLOT option creates a histogram (or a stem-and-leaf plot) and a box plot of the values to check their distribution. The NORMAL option provides another way to check the distribution by generating a normal probability plot, which plots the data values against a normal distribution.

    PROC UNIVARIATE DATA=SASDATA.WGTDATA FREQ PLOT NORMAL;
       VAR WEIGHT;
    RUN;

(see Table 3 for resulting output and an explanation of the statistics)

Other Useful Options

Two other useful options on the PROC UNIVARIATE statement are the NOPRINT and ROUND= options.

The NOPRINT option suppresses the default report when generating an output SAS data set with PROC UNIVARIATE (we'll see later on how an output SAS data set can be built).

The ROUND= option, which is new on UNIVARIATE starting with Version 6.06, specifies a level of precision for the statistics. The ROUND= option can improve efficiency by reducing the amount of memory required (it does not have to store as many unique values for each variable).

    PROC UNIVARIATE DATA=SASDATA.WGTDATA ROUND=1 NOPRINT;
       VAR WEIGHT;
    RUN;

(This example shows how the NOPRINT option is specified, but it only really makes sense to use it when the OUTPUT OUT= statement is being used to build an output SAS data set.)

The ROUND= option defines how the values will be internally rounded prior to the calculation of the statistics. It does not affect the display of the values on the report. When the ROUND= option is used, a message appears next to the VARIABLE= text at the top of the report which reads "Rounded to the nearest multiple of X", where X is the value specified in the ROUND= option.

If the ROUND= option contains a single value, it applies to all specified variables. If the ROUND= option specifies more than one value, a VAR statement must be used, and the values correspond to the order of the variables specified in the VAR statement.

    PROC UNIVARIATE DATA=SASDATA.WGTDATA ROUND=1 .5 1;
       VAR WEIGHT HEIGHT AGE;
    RUN;

The value specified on the ROUND= option must be greater than or equal to zero. If the value is less than or equal to zero, it has no effect on the rounding.

More information regarding the specifics of the ROUND= option is available in the SAS Procedures Guide under PROC UNIVARIATE.

The BY Statement

The BY statement on PROC UNIVARIATE allows us to obtain separate sub-group analyses for each value of the BY variable. The BY statement requires that the data be in BY order. If they are not, the data must be sorted prior to the PROC step unless the data set is indexed on the BY variable, or the NOTSORTED or DESCENDING options are used on the BY statement.

When the BY statement is used with the PLOT option on the PROC statement, an additional graph appears labeled Schematic Plots, which contains side-by-side box plots for each BY value.

    PROC UNIVARIATE DATA=SASDATA.WGTDATA PLOT;
       VAR WEIGHT;
       BY GENDER;
    RUN;

(see Table 4 for resulting output)

Creating Output SAS Data Sets

PROC UNIVARIATE has the unique ability to create multiple output SAS data sets in a single pass of the data. When creating output SAS data sets with PROC UNIVARIATE, the VAR statement must be used. It is also a good idea to use the NOPRINT option on the PROC UNIVARIATE statement to suppress the default report when building an output SAS data set.

    PROC UNIVARIATE DATA=SASDATA.WGTDATA NOPRINT;
       VAR WEIGHT HEIGHT;
       OUTPUT OUT=AVGS MEAN=AVGWGT AVGHGT MAX=MAXWGT;
       OUTPUT OUT=NEWD MEDIAN=MEDWGT Q1=Q1WGT Q3=Q3WGT;
    RUN;

    PROC PRINT DATA=AVGS;
       TITLE 'AVGS DATA SET';
    RUN;

    PROC PRINT DATA=NEWD;
       TITLE 'NEWD DATA SET';
    RUN;

Other Statements Available

Other statements available on PROC UNIVARIATE are the FREQ and WEIGHT statements. They, like the other statements we have seen (BY, VAR, and ID), come after the PROC statement and before the RUN statement.

There is a subtle difference between the FREQ and WEIGHT statements. The FREQ statement identifies a variable which contains the number of observations each observation is to represent. For instance, let's say we had a variable on our data set called HOWMANY and our data set looked something like this:

    GENDER     WEIGHT     HOWMANY
    FEMALE         98           5
    FEMALE        110           2
    ..etc..    ..etc..    ..etc..

    PROC UNIVARIATE DATA=SASDATA.WGTDATA;
       VAR WEIGHT;
       FREQ HOWMANY;
    RUN;

In the case of our first observation, a 98 pound female, the FREQ statement produces the same result as if that same observation appeared on the data set five separate times. Without the FREQ statement, UNIVARIATE assumes that each observation represents itself (1 observation). Therefore, with this data, the FREQ statement produces dramatically different statistics than would be produced without it.

With the FREQ statement, only the integer portion of its value is used. If its value is 3.5, it is considered to be 3. If the value is less than 1 or missing, the observation is not used in the analysis.

The WEIGHT statement, on the other hand, specifies a variable whose values are used to weight each observation. WEIGHTing values affects only the mean, variance, and sum (they become weighted statistics), whereas the FREQ statement changes the meaning of all the statistics reported.

Changes and Enhancements

Version 6.06 of PROC UNIVARIATE offers some new features.

The functionality of PROC PCTL from the Version 5 supplemental library has been incorporated into PROC UNIVARIATE under Version 6.06 (the PCTLNAME=, PCTLPTS=, and PCTLPRE= options can be used on the OUTPUT statement to specify user-defined percentiles).

Also new are the PROBS and PROBN statistics used on the OUTPUT statement. PROBS gives the probability of a greater absolute value for the centered, signed rank statistic. PROBN gives the probability for testing the hypothesis that the data are from a normal distribution.

The ROUND= option, as seen in an earlier example, is also new. It specifies the level of precision for the variable's values. Using the ROUND= option can improve efficiency by reducing the amount of memory required.

Another option specified on the PROC statement, PCTLDEF=, has changed its default value from 4 to 5 (the tables in this paper show the default as QUANTILES(DEF=5)).

There have been no enhancements to PROC UNIVARIATE with Version 6.07 or 6.08 of SAS Software.

In Summary

Other PROCs like MEANS, SUMMARY, and FREQ can give us similar information with a few key differences. PROC UNIVARIATE provides both statistical and graphical information that can be used to analyze data. It offers statistics that cannot be found on any other PROC (quartiles, median, and user-defined percentiles), details on outlying or extreme values, graphical information to analyze the distribution of the data, and the ability to build multiple output SAS data sets. MEANS and SUMMARY can only create one output SAS data set at a time. They can, however, summarize statistics at various levels by the use of the CLASS statement, which PROC UNIVARIATE cannot.

While SUMMARY and MEANS are a bit faster and require less memory than UNIVARIATE, PROC UNIVARIATE provides more descriptive information in a single pass of the data than any other SAS PROC available.

Trademark Notice

SAS is a registered trademark of SAS Institute Inc., Cary, NC, USA and other countries.

Useful Publications

SAS Institute Inc. (1990), SAS Procedures Guide, Version 6, Third Edition, Cary, NC: SAS Institute Inc.

SAS Institute Inc. (1987) (written by Sandra D. Schlotzhauer and Dr. Ramon Littell), SAS System for Elementary Statistical Analysis, Cary, NC: SAS Institute Inc.

Cody, Ronald P. and Smith, Jeffery K. (1991), Applied Statistics and the SAS Programming Language, Third Edition, North Holland, New York.

Hartwig, Frederick and Dearing, Brian E. (1979), Exploratory Data Analysis, Third Edition, Sage University Papers, Beverly Hills, CA.

Tukey, J.W. (1977), Exploratory Data Analysis, Addison-Wesley, Reading, MA.

Any questions or comments regarding the paper may be directed to the author:

    Kim L. Kolbe Ritzow
    Systems Seminar Consultants
    Kalamazoo Office
    927 Lakeway Drive
    Kalamazoo, MI 49001
    Phone: (616) 345-6636
    Fax: (616) 345-5793

Table 1 (result of using the VAR statement)

    THE SAS SYSTEM
    UNIVARIATE PROCEDURE

    VARIABLE=WEIGHT

    MOMENTS

    N            1017          SUM WGTS     1017
    MEAN         168.5261      SUM          171391
    STD DEV      51.19324      VARIANCE     2620.748
    SKEWNESS     0.600273      KURTOSIS     -0.61305
    USS          31546529      CSS          2662680
    CV           30.37705      STD MEAN     1.605285
    T:MEAN=0     104.982       PR>|T|       0.0001
    NUM ^= 0     1017          NUM > 0      1017
    M(SIGN)      508.5         PR>=|M|      0.0001
    SGN RANK     258826.5      PR>=|S|      0.0001

    QUANTILES(DEF=5)

    100% MAX     316           99%          290
     75% Q3      204           95%          265
     50% MED     156           90%          246
     25% Q1      127           10%          110
      0% MIN      89            5%          102
                                1%           92
    RANGE        227
    Q3-Q1         77
    MODE         120

    EXTREMES

    LOWEST    OBS          HIGHEST    OBS
        89(     4)             295(  1017)
        89(     3)             297(  1018)
        90(     9)             300(   553)
        90(     8)             300(  1019)
        90(     7)             316(  1020)

    MISSING VALUE      .
    COUNT              3
    % COUNT/NOBS       0.29

Table 2 (result of using the ID statement)

    THE SAS SYSTEM
    UNIVARIATE PROCEDURE

    VARIABLE=WEIGHT

    MOMENTS

    N            1017          SUM WGTS     1017
    MEAN         168.5261      SUM          171391
    STD DEV      51.19324      VARIANCE     2620.748
    SKEWNESS     0.600273      KURTOSIS     -0.61305
    USS          31546529      CSS          2662680
    CV           30.37705      STD MEAN     1.605285
    T:MEAN=0     104.982       PR>|T|       0.0001
    NUM ^= 0     1017          NUM > 0      1017
    M(SIGN)      508.5         PR>=|M|      0.0001
    SGN RANK     258826.5      PR>=|S|      0.0001
    W:NORMAL     0.92491       PR<W         0.0001

    QUANTILES(DEF=5)

    100% MAX     316           99%          290
     75% Q3      204           95%          265
     50% MED     156           90%          246
     25% Q1      127           10%          110
      0% MIN      89            5%          102
                                1%           92
    RANGE        227
    Q3-Q1         77
    MODE         120

    MISSING VALUE      .
    COUNT              3
    % COUNT/NOBS       0.29

    EXTREMES

    LOWEST    ID             HIGHEST    ID
        89(FEMALE  )             295(MALE    )
        89(FEMALE  )             297(MALE    )
        90(FEMALE  )             300(FEMALE  )
        90(FEMALE  )             300(MALE    )
        90(FEMALE  )             316(MALE    )

Table 2 Cont'd: Definition of the Statistics

Definition of the Moments Statistics:

    CSS         Corrected sum of squares (the sum of squares about the mean).
    CV          Coefficient of variation.
    Kurtosis    Measures the shape of the distribution. Large values indicate heavy tails (values are distant from the mean).
    Mean        Arithmetic average. Describes the center of a distribution of values for a variable.
    M(sign)     The sign statistic.
    N           Number of nonmissing values.
    Num ^= 0    Number of nonmissing values not equal to zero (i.e., how many values are nonzero).
    Num > 0     The number of positive observations.
    PR>=|M|     The probability of a greater absolute value for the sign statistic under the hypothesis that the population median is 0.
    PROB>|T|    P-value for the t-statistic (2-tailed). Small values indicate that the mean differs significantly from zero.
    PROB>|S|    P-value for the signed rank test. If the value is less than the predetermined significance level, conclude that the average difference is significantly different from zero.
    SGN RANK    Value of the Wilcoxon signed rank statistic (the nonparametric equivalent of the paired-difference t-test).
    SKEWNESS    A measure to determine if values are more spread out on one side of the mean than the other. Positive skewness indicates values to the right of the mean are more spread out than the values to the left of the mean. It is a measure of symmetry.
    STD DEV     Standard deviation is the square root of the variance. It measures dispersion about the mean. It is easier to interpret than the variance since its values are in the same units as the data.
    STD MEAN    Standard error of the mean. It puts a "confidence interval" around the mean. Useful when analyzing sampled data. It tells how far off our sample mean may be from the population mean.
    SUM         The sum of values for all observations.
    SUM WGTS    Sum of the observations' weights.
    T:MEAN=0    Student's t-statistic value for testing the hypothesis that the population mean is 0.
    VARIANCE    Most common way to measure dispersion, or variability about the mean. Variance is small when all the values are close to the mean and large when the values are scattered widely about the mean.
    USS         Uncorrected sum of squares.

Definition of the Quantile Statistics:

    100% Max    The maximum, or highest value found (the heaviest person in the survey was 316 pounds).
    75% Q3      The third quartile, the value that is larger than 75% of the values (a weight of 204 is larger than 75% of the values in the data set).
    50% Med     The halfway point (half of the values are larger than 156 pounds, half of the values are below 156 pounds). Helps describe the center of the distribution (look for a pictorial representation in the Box Plot). Not as sensitive as the mean with skewed data.
    25% Q1      The first quartile, the value that is larger than 25% of the values (a weight of 127 is larger than 25% of the values in the data set).
    0% Min      The minimum, or lowest value found (the lightest person in the survey was 89 pounds).
    99% - 1%    The 99th, 95th, 90th, 10th, 5th, and 1st percentiles. The values that are larger than 99%, 95%, 90%, 10%, 5%, and 1% of the values.
    Range       The difference between the largest and smallest values. There was a difference of 227 pounds between our heaviest and lightest person.
    Q3-Q1       The interquartile range (difference between Q3 and Q1) was 77 pounds. It is used as a measure of dispersion. The smaller the value, the closer your values are to one another. The larger the value, the more spread out the values are.
    Mode        The most popular value. The value that has the most observations. More people weighed 120 pounds than any other value. If the data has more than one mode, UNIVARIATE lists the mode with the smallest value.

Extremes Section:

This section lists the five highest and five lowest values found on the data set. It does not list five unique lowest and highest values; the same value may appear more than once in this section. In addition to displaying the values, it also shows the corresponding observation number. When the ID statement is used, the values are identified by the value of the ID variable rather than the observation number (note the difference by comparing the Extremes section of Table 1 with that of Table 2).

Normal Probability Plot

The plus signs form a straight line based on the sample mean and standard deviation. The asterisks represent the actual data. If the sample is from a normal distribution, the asterisks form a straight line and cover most of the plus signs. A large number of visible plus signs indicates a nonnormal distribution, as ours does.

Frequency Table

Similar to the output generated by PROC FREQ, except PROC FREQ generates cumulative counts in addition to the statistics shown here. The frequency table is yet another way to see how the data are distributed.

Missing Value Section:

Missing Value shows how missing values are represented on the data set. Count refers to how many observations were found with that missing value, and % Count/Nobs refers to the percent of total observations that were missing.

Checking for Data Errors

By using the Quantiles, Extremes, Histogram, Box Plot, and Frequency Table, one can quickly check for data errors.
Using the Quantiles section, check the maximum and minimum values to make sure they make sense (a maximum value of 990, or a minimum value of 1, may indicate a data error). You could also use the Frequency Table to check for data errors. If you find something unusual, use the Extremes section to determine which observation number it is (if an ID statement is not being used). The Histogram and Box Plot can be used to determine outliers in the data.

Explanation of the Histogram and Box Plot:

If no more than 48 observations fall into a single interval, a stem-and-leaf plot will be generated rather than a histogram (a horizontal bar chart). If you look at the plot sideways, it should roughly look like a bell curve if the data are normally distributed. Given our data, the histogram suggests the sample data are not normally distributed; they are skewed.

The Box Plot can further describe the distribution of the data. The lower and upper ends of the box represent the 25th and 75th percentiles. The line inside the box is the median. The '+' indicates the mean. Our mean and median are not the same. The lines coming out of the box represent "whiskers" and extend to a maximum of 1.5 times the interquartile range. Data beyond the whiskers, up to 3 times the interquartile range, are represented by 0's and show more extreme values; outliers beyond that are represented by an '*'.

The histogram (or stem-and-leaf plot) and the Box Plot should be used to determine if symmetry and smoothness (i.e., no heavy tails) exist in the data.

Table 3 (result of using FREQ, PLOT, and NORMAL options)

    THE SAS SYSTEM
    UNIVARIATE PROCEDURE

    VARIABLE=WEIGHT

    MOMENTS

    N            1017          SUM WGTS     1017
    MEAN         168.5261      SUM          171391
    STD DEV      51.19324      VARIANCE     2620.748
    SKEWNESS     0.600273      KURTOSIS     -0.61305
    USS          31546529      CSS          2662680
    CV           30.37705      STD MEAN     1.605285
    T:MEAN=0     104.982       PR>|T|       0.0001
    NUM ^= 0     1017          NUM > 0      1017
    M(SIGN)      508.5         PR>=|M|      0.0001
    SGN RANK     258826.5      PR>=|S|      0.0001

    QUANTILES(DEF=5)

    100% MAX     316           99%          290
     75% Q3      204           95%          265
     50% MED     156           90%          246
     25% Q1      127           10%          110
      0% MIN      89            5%          102
                                1%           92
    RANGE        227
    Q3-Q1         77
    MODE         120

    EXTREMES

    LOWEST    OBS          HIGHEST    OBS
        89(     4)             295(  1017)
        89(     3)             297(  1018)
        90(     9)             300(   553)
        90(     8)             300(  1019)
        90(     7)             316(  1020)

    MISSING VALUE      .
    COUNT              3
    % COUNT/NOBS       0.29

    HISTOGRAM and BOXPLOT
    [Horizontal bar chart of WEIGHT in intervals from 85 to 315, with the count for each bar in the # column and the box plot drawn alongside. * MAY REPRESENT UP TO 3 COUNTS]

    NORMAL PROBABILITY PLOT
    [WEIGHT, from 85 to 315, plotted against normal quantiles from -2 to +2; asterisks mark the data values and plus signs mark the reference line.]

    FREQUENCY TABLE

                   PERCENTS                       PERCENTS
    VALUE  COUNT  CELL   CUM       VALUE  COUNT  CELL   CUM
       89      2   0.2   0.2         139     14   1.4  37.5
       90      5   0.5   0.7         140     18   1.8  39.2
       91      3   0.3   1.0         141      5   0.5  39.7
      etc                            etc

                   PERCENTS                       PERCENTS
    VALUE  COUNT  CELL   CUM       VALUE  COUNT  CELL   CUM
      189      3   0.3  67.8         239      1   0.1  87.3
      190     11   1.1  68.9         240     11   1.1  88.4
      191      5   0.5  69.4         241      1   0.1  88.5
      etc                            etc

Table 4 (partial output, result of using the BY statement)

    THE SAS SYSTEM
    UNIVARIATE PROCEDURE
    SCHEMATIC PLOTS

    VARIABLE=WEIGHT

    [Side-by-side box plots of WEIGHT, on a vertical axis from 80 to 320, one for GENDER=FEMALE and one for GENDER=MALE.]
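The checks described under Checking for Data Errors, and the user-defined percentiles mentioned under Changes and Enhancements, can also be captured in an output SAS data set rather than read off the printed report. What follows is only a minimal sketch under stated assumptions, not a recommended standard: the data set name LIMITS, the variable names MINWGT, MAXWGT, NWGT, and MISSWGT, and the prefix WGT_ are hypothetical, and the cutoffs of 50 and 500 pounds are arbitrary values chosen for illustration rather than taken from the survey.

```sas
/* Build a one-observation data set of summary values for WEIGHT.        */
/* NOPRINT suppresses the default report. LIMITS and all of the new      */
/* variable names are hypothetical names chosen for this sketch.         */
PROC UNIVARIATE DATA=SASDATA.WGTDATA NOPRINT;
   VAR WEIGHT;
   OUTPUT OUT=LIMITS
          MIN=MINWGT MAX=MAXWGT      /* extremes to sanity-check          */
          N=NWGT NMISS=MISSWGT       /* nonmissing and missing counts     */
          PCTLPTS=2.5 97.5           /* user-defined percentiles          */
          PCTLPRE=WGT_;              /* prefix for the percentile names   */
RUN;

/* Write a note to the SAS log when the extremes look implausible.       */
/* The cutoffs 50 and 500 are assumptions made only for illustration.    */
DATA _NULL_;
   SET LIMITS;
   IF MINWGT < 50 OR MAXWGT > 500 THEN
      PUT 'CHECK DATA: ' MINWGT= MAXWGT= MISSWGT=;
RUN;

/* Print the summary values for review.                                  */
PROC PRINT DATA=LIMITS;
   TITLE 'SUMMARY VALUES FOR WEIGHT';
RUN;
```

If the percentile options behave as documented, LIMITS should also hold the two requested percentiles under names built from the WGT_ prefix. When a cutoff is tripped, the Extremes section of the default report (or an ID statement) is still the way to find which observation is responsible.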