Descriptive Statistics In SAS
Exploring Your Data

Summary Statistics
Before you begin any analyses, you will want to get a feel for your data. What are the mean and standard deviation of certain variables? Are the data normally distributed? You can use summary statistics and visual aids, such as histograms and box plots, to help you see the distribution of your data.

Proc Univariate
PROC UNIVARIATE is a procedure in SAS that provides summary statistics on any quantitative variable. We will create a data set called “demo”, which contains the weights of 57 day care children. Copy the values from the file http://www.biostat.umn.edu/~susant/PH6415DATA/demo.txt and paste them into your SAS Editor window with the following SAS code:

Example of Proc Univariate

    DATA demo;
      INPUT weight;
      DATALINES;
    68
    63
    …
    12
    ;
    RUN;

    TITLE 'Proc Univariate';
    PROC UNIVARIATE DATA = demo normal plot;
      VAR weight;
      histogram weight / midpoints = 10 to 80 by 5 normal;
    RUN;

Notes about code:
• Include every observation in your code; the example uses (…) to stand in for the observations that are not shown.
• normal requests tests for normality.
• plot requests a stem-and-leaf plot, a box plot, and a normal probability plot.
• VAR weight requests that SAS perform PROC UNIVARIATE only on the variable weight. This statement is useful when you have numerous variables; if you do not specify which variables to analyze, SAS will perform PROC UNIVARIATE on every numeric variable in the data set, which generates a lot of unnecessary output.
• histogram weight requests a histogram for the variable weight. The options after the / specify that the midpoints of the bars in the histogram should range from 10 to 80 and be 5 lbs. apart. This is not necessary, but it gives you more control over the appearance of your histogram.
• normal, in the histogram statement, requests that a normal curve be drawn over the histogram, which gives a visual comparison with what the data would look like if they were normally distributed.

Run the PROC Univariate!
• Check your Log; are there any warnings or errors? Do you have all 57 observations?
• Notice that a new window, GRAPH1, has opened. This contains your histogram with the normal curve.
• By looking at the histogram, you see that the largest number of observations falls to the left of the center of the normal curve. This indicates that the data may be skewed to the right.

Histogram in SAS Output Window

• Look at the Output. Can you find the mean, variance, and standard deviation of weight? How about the median?
• Notice that the mean (36.68) is larger than the median (32.0); this is another indication that the data may be skewed to the right.
• Under the heading “Tests for Normality” you will see four tests of normality (Shapiro-Wilk, Kolmogorov-Smirnov, Cramer-von Mises, and Anderson-Darling). All have a p-value < 0.05, so we reject Ho and conclude that the data are NOT normally distributed. (Ho: data are from a normal distribution; Ha: data are not from a normal distribution.)

Plots
• The output also includes a stem-and-leaf plot and a box plot. The stem-and-leaf plot resembles the histogram in shape. The box plot shows the mean (+) greater than the median (the middle *------*), and it also shows an outlier (0), weight 79, which could be the extreme observation responsible for the skewed data.

Plots, continued
• The output also contains a normal probability plot. This plots the values we would expect under normality (+) along the x-axis against the values we actually observe (*) along the y-axis. If the data are normally distributed, the observed points should fall along a y=x line at a 45° angle.
• The plot shows that some points do not fall on the y=x line, again indicating possibly skewed data.

Normal Probability Plot

Conclusions
• The weights of the 57 day care children do not appear to come from a normal distribution.
• Proc Univariate is a valuable tool for creating summary statistics for the data.
• Proc Univariate can be used to generate plots and graphs of the data, in order to determine whether the data come from a normal distribution. It also provides formal tests of normality when you add the “normal” option to the PROC UNIVARIATE statement.
• Many analyses assume that the data are normally distributed, so it is often necessary to check this assumption before analyzing them.
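
The summary statistics and normality tests discussed above can also be requested on their own, without the rest of the PROC UNIVARIATE output. The following is a minimal sketch, not part of the original example. It assumes the data set demo (with the variable weight) has already been created as shown earlier, and that the ODS table names Moments, BasicMeasures, and TestsForNormality match your SAS release. The title text is only illustrative.

```sas
/* Sketch: print only the summary statistics and the normality tests. */
/* Assumes the data set demo (variable weight) already exists.        */
TITLE 'Summary statistics and normality tests for weight';

/* Keep only these output tables from the next procedure step. */
ODS SELECT Moments BasicMeasures TestsForNormality;

PROC UNIVARIATE DATA = demo NORMAL;
  VAR weight;   /* analyze weight only */
RUN;
```

The ODS SELECT statement should apply only to the PROC UNIVARIATE step that follows it, so later procedures print their full output again.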