Introduction to the UNIVARIATE Procedure

Kim L. Kolbe Ritzow of Systems Seminar Consultants, Kalamazoo, MI
Abstract
PROC UNIVARIATE is a powerful BASE SAS PROC that combines many of the
features found in other analytical PROCs such as FREQ, MEANS, SUMMARY,
and TABULATE into a single PROC step.

PROC UNIVARIATE is an excellent exploratory data analysis tool. It
provides more information, both descriptively and graphically, in a
single pass of the data than any other BASE SAS PROC. In some cases it
provides information that cannot be found in any other BASE SAS PROC,
such as the data's median, mode, quartiles, and percentiles.

This paper discusses how to interpret some of the results generated by
PROC UNIVARIATE, reviews its syntax, and provides efficiency tips and
techniques.

FREQ, PLOT, and NORMAL Options
Further information can be derived from PROC UNIVARIATE by using the
FREQ, PLOT, or NORMAL options on the PROC UNIVARIATE statement. The
FREQ option generates a frequency table, much like that of PROC FREQ,
except that PROC FREQ generates cumulative counts, which the FREQ
option does not. The PLOT option creates a histogram (or a
stem-and-leaf plot) and a box plot of the values to check their
distribution. The NORMAL option provides another way to check the
distribution by generating a normal probability plot, which plots the
data values against a normal distribution.

PROC UNIVARIATE DATA=SASDATA.WGTDATA
               FREQ PLOT NORMAL;
   VAR WEIGHT;
RUN;
(see Table 3 for resulting output and an explanation of the statistics)
A Simple PROC UNIVARIATE
PROC UNIVARIATE without any options or statements will produce a
UNIVARIATE report for all numeric variables on the data set, which may
give you more information than you desire.

PROC UNIVARIATE DATA=SASDATA.WGTDATA;
RUN;

It is more efficient to limit the scope of the analysis by using the
VAR statement on PROC UNIVARIATE to request only the numeric variables
you are interested in analyzing.

PROC UNIVARIATE DATA=SASDATA.WGTDATA;
   VAR WEIGHT;
RUN;
(see Table 1 for resulting output)

The ID Statement
The ID statement on PROC UNIVARIATE names a variable that identifies
the highest and lowest values in the EXTREMES section of the default
report by the value of the identifying variable rather than by the
observation number. The ID statement does not affect any part of the
report other than the EXTREMES section.

PROC UNIVARIATE DATA=SASDATA.WGTDATA;
   VAR WEIGHT;
   ID GENDER;
RUN;
(see Table 2 for resulting output)

Other Useful Options
Two other useful options on the PROC UNIVARIATE statement are the
NOPRINT and ROUND= options.

The NOPRINT option suppresses the default report when PROC UNIVARIATE
is used to generate an output SAS data set (we'll see later how an
output SAS data set can be built).

The ROUND= option, which is new on UNIVARIATE starting with Version
6.06, specifies a level of precision for the statistics. The ROUND=
option can improve efficiency by reducing the amount of memory
required (the procedure does not have to store as many unique values
for each variable).

PROC UNIVARIATE DATA=SASDATA.WGTDATA
               ROUND=1 NOPRINT;
   VAR WEIGHT;
RUN;
(This example shows how the NOPRINT option is specified, but it only
really makes sense to use it when the OUTPUT statement with OUT= is
being used to build an output SAS data set.)

The ROUND= option defines how the values will be internally rounded
prior to the calculation of the statistics. It does not affect the
display of the values on the report. When the ROUND= option is used, a
message appears next to the VARIABLE= text at the top of the report
which reads "Rounded to the nearest multiple of X", where X is the
value specified in the ROUND= option.
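The rounding that ROUND= performs, and the reason it can save memory, can be sketched outside of SAS. The Python function below is a hypothetical illustration, not SAS's implementation: `round_to_unit` and the sample weights are invented for the example, ties are assumed to round upward, and a unit of zero or less is assumed to leave the values unchanged.

```python
import math

def round_to_unit(values, unit):
    """Round each value to the nearest multiple of unit (assumption: ties round up)."""
    if unit <= 0:
        return list(values)  # assumed: a unit of zero or less has no effect
    return [math.floor(v / unit + 0.5) * unit for v in values]

# Hypothetical weights, not taken from SASDATA.WGTDATA
weights = [98.2, 98.4, 110.6, 110.9, 127.3]

rounded = round_to_unit(weights, 1)
print(rounded)                                      # [98, 98, 111, 111, 127]
print(len(set(weights)), "->", len(set(rounded)))   # 5 -> 3
```

Five distinct values collapse to three after rounding, which is the sense in which fewer unique values have to be held in memory.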
If the ROUND= option contains a single value, it applies to all
specified variables. If the ROUND= option specifies more than one
value, a VAR statement must be used, and the values correspond to the
order of the variables specified in the VAR statement.

PROC UNIVARIATE DATA=SASDATA.WGTDATA
               ROUND=1 .5 1;
   VAR WEIGHT HEIGHT AGE;
RUN;

The value specified on the ROUND= option must be greater than or equal
to zero. If the value is less than or equal to zero, it has no effect
on the rounding.

More information regarding the specifics of the ROUND= option is
available in the SAS Procedures Guide under PROC UNIVARIATE.

The BY Statement
The BY statement on PROC UNIVARIATE allows us to obtain separate
sub-group analyses for each value of the BY variable.

The BY statement requires that the data be in BY order. If they are
not, sorting will be required prior to the PROC step unless the data
set is indexed on the BY variable, or the NOTSORTED or DESCENDING
options are used on the BY statement.

When the BY statement is used with the PLOT option on the PROC
statement, an additional graph labeled Schematic Plots will appear,
containing side-by-side box plots for each BY value.

PROC UNIVARIATE DATA=SASDATA.WGTDATA PLOT;
   VAR WEIGHT;
   BY GENDER;
RUN;
(see Table 4 for resulting output)

Creating Output SAS Data Sets
PROC UNIVARIATE has the unique ability to create multiple output SAS
data sets in a single pass of the data. When creating output SAS data
sets on PROC UNIVARIATE, the VAR statement must be used. It is also a
good idea to use the NOPRINT option on the PROC UNIVARIATE statement
to suppress the default report when building an output SAS data set.

PROC UNIVARIATE DATA=SASDATA.WGTDATA NOPRINT;
   VAR WEIGHT HEIGHT;
   OUTPUT OUT=AVGS
          MEAN=AVGWGT AVGHGT MAX=MAXWGT;
   OUTPUT OUT=NEWD
          MEDIAN=MEDWGT Q1=Q1WGT Q3=Q3WGT;
RUN;

PROC PRINT DATA=AVGS;
   TITLE 'AVGS DATA SET';
RUN;

PROC PRINT DATA=NEWD;
   TITLE 'NEWD DATA SET';
RUN;

Other Statements Available
Other statements available on PROC UNIVARIATE are the FREQ and WEIGHT
statements. They, like the other statements we have seen (BY, VAR, and
ID), come after the PROC statement and before the RUN statement.

There is a subtle difference between the FREQ and WEIGHT statements.
The FREQ statement identifies a variable which contains the number of
observations each observation is to represent. For instance, let's say
we had a variable on our data set called HOWMANY and our data set
looked something like this:

GENDER      WEIGHT      HOWMANY
FEMALE          98            5
FEMALE         110            2
.. etc ..   .. etc ..   .. etc ..

PROC UNIVARIATE DATA=SASDATA.WGTDATA;
   VAR WEIGHT;
   FREQ HOWMANY;
RUN;

In the case of our first observation, a 98 pound female, the FREQ
statement produces the same result as if that same observation
appeared on the data set five separate times. Without the FREQ
statement, UNIVARIATE assumes that each observation represents itself
(1 observation). Therefore, with this data, using the FREQ statement
will produce dramatically different statistics than not using it.

With the FREQ statement, only the integer portion of its value is
used. If its value is 3.5, it is considered to be 3. If the value is
less than 1 or missing, the observation is not used in the analysis.
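The effect of the FREQ statement can be illustrated with a short sketch, written here in Python rather than SAS purely to make the arithmetic visible. It follows the rules stated above (integer portion only; values below 1 or missing are skipped). The function name `expand_by_freq` is hypothetical, and only the first two rows come from the example data; the remaining rows are invented to exercise the rules.

```python
import math

def expand_by_freq(rows):
    """Repeat each value by the integer portion of its frequency."""
    expanded = []
    for value, freq in rows:
        if freq is None or math.isnan(freq) or freq < 1:
            continue  # missing or less than 1: observation is not used
        expanded.extend([value] * int(freq))  # 3.5 is treated as 3
    return expanded

rows = [
    (98, 5),      # from the example: counts as five observations
    (110, 2),     # from the example: counts as two observations
    (127, 3.5),   # hypothetical: integer portion 3 is used
    (156, 0.4),   # hypothetical: less than 1, not used
    (204, None),  # hypothetical: missing, not used
]

with_freq = expand_by_freq(rows)
without_freq = [value for value, _ in rows]

print(len(with_freq), sum(with_freq) / len(with_freq))           # 10 109.1
print(len(without_freq), sum(without_freq) / len(without_freq))  # 5 139.0
```

The same five rows yield a mean of 109.1 with the frequencies applied and 139.0 without them, which shows how strongly a frequency variable can change the reported statistics.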
Unlike the FREQ statement, the WEIGHT statement specifies a variable
whose values are used to weight each observation. Weighting affects
only the mean, variance, and sum (they become weighted statistics),
whereas the FREQ statement changes the meaning of all the statistics
reported.

Changes and Enhancements
Version 6.06 of PROC UNIVARIATE offers some new features. The
functionality of PROC PCTL from the Version 5 supplemental library has
been incorporated into PROC UNIVARIATE under Version 6.06 (the
PCTLNAME=, PCTLPTS=, and PCTLPRE= options can be used on the OUTPUT
statement to specify user-defined percentiles).

Also new are the PROBS and PROBN statistics used on the OUTPUT
statement. PROBS gives the probability of a greater absolute value for
the centered, signed rank statistic. PROBN gives the probability for
testing the hypothesis that the data are from a normal distribution.

The ROUND= option, as seen in an earlier example, is also new. It
specifies the level of precision for the variable's values. Using the
ROUND= option can improve efficiency by reducing the amount of memory
required.

Another option specified on the PROC statement, PCTLDEF=, has changed
its default value from 4 to 5 (the default reports in Tables 1 through
3 are labeled QUANTILES(DEF=5)).

There have been no enhancements to PROC UNIVARIATE with Version 6.07
or 6.08 of SAS Software.

In Summary
Other PROCs like MEANS, SUMMARY, and FREQ can give us similar
information with a few key differences. PROC UNIVARIATE provides both
statistical and graphical information that can be used to analyze
data. It offers statistics that cannot be found on any other PROC
(quartiles, median, and user-defined percentiles), details on outlying
or extreme values, graphical information to analyze the distribution
of the data, and the ability to build multiple output SAS data sets.
MEANS and SUMMARY can only create one output SAS data set at a time.
They can, however, summarize statistics at various levels by the use
of the CLASS statement, which PROC UNIVARIATE cannot.

While SUMMARY and MEANS are a bit faster and require less memory than
UNIVARIATE, PROC UNIVARIATE provides more descriptive information in a
single pass of the data than any other SAS PROC available.

Trademark Notice
SAS is a registered trademark of SAS Institute Inc., Cary, NC, in the
USA and other countries.

Useful Publications
SAS Institute Inc. (1990), SAS Procedures Guide, Version 6, Third
Edition, Cary, NC: SAS Institute Inc.

SAS Institute Inc. (1987) (written by Sandra D. Schlotzhauer and Dr.
Ramon Littell), SAS System for Elementary Statistical Analysis, Cary,
NC: SAS Institute Inc.

Cody, Ronald P. and Smith, Jeffrey K. (1991), Applied Statistics and
the SAS Programming Language, Third Edition, North Holland, New York

Hartwig, Frederick and Dearing, Brian E. (1979), Exploratory Data
Analysis, Third Edition, Sage University Papers, Beverly Hills, CA

Tukey, J.W. (1977), Exploratory Data Analysis, Addison-Wesley,
Reading, MA

Any questions or comments regarding the paper may be directed to the
author:
Kim L. Kolbe Ritzow
Systems Seminar Consultants
Kalamazoo Office
927 Lakeway Drive
Kalamazoo, MI 49001
Phone: (616) 345-6636
Fax:   (616) 345-5793
Table 1
(result of using the VAR statement)
THE SAS SYSTEM
UNIVARIATE PROCEDURE

VARIABLE=WEIGHT

MOMENTS
N              1017    SUM WGTS       1017
MEAN       168.5261    SUM          171391
STD DEV    51.19324    VARIANCE   2620.748
SKEWNESS   0.600273    KURTOSIS   -0.61305
USS        31546529    CSS         2662680
CV         30.37705    STD MEAN   1.605285
T:MEAN=0    104.982    PR>|T|       0.0001
NUM ^= 0       1017    NUM > 0        1017
M(SIGN)       508.5    PR>=|M|      0.0001
SGN RANK   258826.5    PR>=|S|      0.0001

QUANTILES(DEF=5)
100% MAX    316    99%    290
 75% Q3     204    95%    265
 50% MED    156    90%    246
 25% Q1     127    10%    110
  0% MIN     89     5%    102
                    1%     92
RANGE       227
Q3-Q1        77
MODE        120

EXTREMES
LOWEST     OBS    HIGHEST     OBS
    89(      4)       295(   1017)
    89(      3)       297(   1018)
    90(      9)       300(    553)
    90(      8)       300(   1019)
    90(      7)       316(   1020)

MISSING VALUE
COUNT              3
% COUNT/NOBS    0.29
Table 2
(result of using the ID statement)
THE SAS SYSTEM
UNIVARIATE PROCEDURE

VARIABLE=WEIGHT

MOMENTS
N              1017    SUM WGTS       1017
MEAN       168.5261    SUM          171391
STD DEV    51.19324    VARIANCE   2620.748
SKEWNESS   0.600273    KURTOSIS   -0.61305
USS        31546529    CSS         2662680
CV         30.37705    STD MEAN   1.605285
T:MEAN=0    104.982    PR>|T|       0.0001
NUM ^= 0       1017    NUM > 0        1017
M(SIGN)       508.5    PR>=|M|      0.0001
SGN RANK   258826.5    PR>=|S|      0.0001
W:NORMAL    0.92491    PR<W         0.0001

QUANTILES(DEF=5)
100% MAX    316    99%    290
 75% Q3     204    95%    265
 50% MED    156    90%    246
 25% Q1     127    10%    110
  0% MIN     89     5%    102
                    1%     92
RANGE       227
Q3-Q1        77
MODE        120

EXTREMES
LOWEST   ID          HIGHEST   ID
    89(FEMALE )          295(MALE   )
    89(FEMALE )          297(MALE   )
    90(FEMALE )          300(FEMALE )
    90(FEMALE )          300(MALE   )
    90(FEMALE )          316(MALE   )

MISSING VALUE
COUNT              3
% COUNT/NOBS    0.29
Table 2 Cont'd
Definition of the Statistics
Definition of the Moments Statistics:
CSS: Corrected sum of squares (the sum of squares about the mean).
CV: Coefficient of variation.
Kurtosis: Measures the shape of the distribution. Large values indicate heavy tails (values are distant from the mean).
Mean: Arithmetic average. Describes the center of a distribution of values for a variable.
M(sign): The sign statistic.
N: Number of nonmissing values.
Num ^= 0: Number of nonmissing values not equal to zero (i.e., how many values are nonzero).
Num > 0: The number of positive observations.
PR>=|M|: The probability of a greater absolute value for the sign statistic under the hypothesis that the population median is 0.
PROB>|T|: P-value for the t-statistic (2-tailed). Small values indicate that the mean differs significantly from zero.
PROB>|S|: P-value for the signed rank test. If the value is less than the predetermined significance level, conclude that the average difference is significantly different from zero.
SGN RANK: Value of the Wilcoxon signed rank statistic (the nonparametric equivalent of the paired-difference t-test).
SKEWNESS: A measure of symmetry that shows whether values are more spread out on one side of the mean than the other. Positive skewness indicates that values to the right of the mean are more spread out than values to the left of the mean.
STD DEV: Standard deviation, the square root of the variance. It measures dispersion about the mean. Easier to interpret than the variance since its values are in the same units as the data.
STD MEAN: Standard error of the mean. It puts a "confidence interval" around the mean. Useful when analyzing sampled data. It tells how far off our sample mean may be from the population mean.
SUM: The sum of values for all observations.
SUM WGTS: Sum of the observations' weights.
T:MEAN=0: Student's t-statistic value for testing the hypothesis that the population mean is 0.
VARIANCE: Most common way to measure dispersion, or variability about the mean. Variance is small when all the values are close to the mean and large when the values are scattered widely about the mean.
USS: Uncorrected sum of squares.
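Several of the Moments statistics are tied together by simple identities, so the report can be checked against itself. The sketch below uses Python only to show the arithmetic (it is not SAS code and not how PROC UNIVARIATE is implemented): it starts from the N, SUM, and VARIANCE values reported in Table 1 and recomputes the others. The function name `derived_moments` is hypothetical.

```python
import math

def derived_moments(n, total, variance):
    """Recompute several Moments statistics from N, SUM, and VARIANCE."""
    mean = total / n
    std_dev = math.sqrt(variance)
    css = variance * (n - 1)            # corrected sum of squares
    uss = css + n * mean ** 2           # uncorrected sum of squares
    cv = 100 * std_dev / mean           # coefficient of variation, in percent
    std_mean = std_dev / math.sqrt(n)   # standard error of the mean
    t = mean / std_mean                 # t statistic for the hypothesis mean = 0
    return {"MEAN": mean, "STD DEV": std_dev, "CSS": css, "USS": uss,
            "CV": cv, "STD MEAN": std_mean, "T:MEAN=0": t}

# N, SUM, and VARIANCE as reported for WEIGHT in Table 1
stats = derived_moments(n=1017, total=171391, variance=2620.748)
for name, value in stats.items():
    print(f"{name:10s}{value:14.4f}")
```

The results agree with Table 1 to the precision shown there (for example, a CV of about 30.377 and a T:MEAN=0 of about 104.98); small differences in the last digit come from the rounded VARIANCE used as input.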
Definition of the Quantile Statistics:
100% Max: The maximum, or highest value found (the heaviest person in the survey was 316 pounds).
75% Q3: The third quartile, the value that is larger than 75% of the values (a weight of 204 is larger than 75% of the values in the data set).
50% Med: The halfway point (half of the values are larger than 156 pounds, half of the values are below 156 pounds). Helps describe the center of the distribution (look for a pictorial representation in the Box Plot). Not as sensitive as the mean with skewed data.
25% Q1: The first quartile, the value that is larger than 25% of the values (a weight of 127 is larger than 25% of the values in the data set).
0% Min: The minimum, or lowest value found (the lightest person in the survey was 89 pounds).
99% - 1%: The 99th, 95th, 90th, 10th, 5th, and 1st percentiles. The values that are larger than 99%, 95%, 90%, 10%, 5%, and 1% of the values.
Range: The difference between the largest and smallest values. There was a difference in weight between our heaviest and lightest person of 227 pounds.
Q3-Q1: The interquartile range (difference between Q3 and Q1) was 77 pounds. It is used as a measure of dispersion. The smaller the value, the closer your values are to one another. The larger the value, the more spread out the values are.
Mode: The most popular value. The value that has the most observations. More people weighed 120 pounds than any other value. If the data has more than one mode, UNIVARIATE lists the mode with the smallest value.
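The Range, Q3-Q1, and Mode entries can be reproduced with a few lines of arithmetic. The sketch below is in Python for illustration only. The quantile values are those reported in Table 1; the helper `smallest_mode` and the small sample are hypothetical, included to show the tie rule (the smallest mode is listed) described above.

```python
from collections import Counter

def smallest_mode(values):
    """Most frequent value; when several values tie, return the smallest."""
    counts = Counter(values)
    top = max(counts.values())
    return min(v for v, c in counts.items() if c == top)

# Quantile values reported for WEIGHT in Table 1
maximum, q3, q1, minimum = 316, 204, 127, 89
print("RANGE =", maximum - minimum)   # RANGE = 227
print("Q3-Q1 =", q3 - q1)             # Q3-Q1 = 77

# Hypothetical sample with two modes: 120 and 150 each occur twice
sample = [150, 120, 135, 150, 120, 98]
print("MODE  =", smallest_mode(sample))  # MODE  = 120
```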
Extremes Section:
This section lists the five highest and five lowest
values found on the data set. It does not list the five
unique lowest and highest values; the same
value may appear more than once in this section.
In addition to displaying the values, it also shows
the corresponding observation number. If the ID
statement is used, the values are identified by the
value of the ID variable rather than by the
observation number (note the difference by
comparing the Extremes sections of Table 1 and
Table 2).
Normal Probability Plot
The plus signs form a straight line based on the
sample mean and standard deviation. The
asterisks represent the actual data. If the sample
is from a normal distribution, the asterisks form a
straight line and cover most of the plus signs. A
large number of visible plus signs indicates a
nonnormal distribution, as ours does.
Frequency Table
Similar to the output generated by PROC FREQ,
except PROC FREQ generates cumulative counts
in addition to the statistics shown here. The
frequency table is yet another way to see how the
data are distributed.
Missing Value Section:
The Missing Value notes how missing values are
represented on the data set. Count refers to how
many observations were found with that missing
value, and the % Count/Nobs refers to the
percent of total observations that were missing.
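As a quick check of the % Count/Nobs figure, the sketch below (Python, for the arithmetic only) assumes that NOBS is the total number of observations, nonmissing plus missing. That assumption is consistent with the Extremes section, where the highest observation number is 1020. The function name `pct_missing` is hypothetical.

```python
def pct_missing(n_nonmissing, n_missing):
    """Percent of all observations (nonmissing plus missing) that are missing."""
    nobs = n_nonmissing + n_missing
    return 100 * n_missing / nobs

# Table 1 reports N = 1017 and a missing COUNT of 3
print(f"{pct_missing(1017, 3):.2f}")  # 0.29
```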
Checking for Data Errors
By using the Quantiles, Extremes, Histogram, Box
Plot, and Frequency Table, one can quickly check
for data errors. Using the Quantiles section,
check the maximum and minimum values to make
sure they make sense (a maximum value of 990,
or a minimum value of 1, may indicate a data
error). You could also use the Frequency Table
to check for data errors. If you find something
unusual, use the Extremes section to determine
which observation number it is (if an ID statement
is not being used). The Histogram and Box Plot
can be used to identify outliers in the data.
Explanation of the Histogram and Box Plot:
If no more than 48 observations fall into a single
interval, a stem-and-leaf plot will be generated
rather than a histogram (a horizontal bar chart).
If you look at the plot sideways, it should roughly
look like a bell curve if the data are normally
distributed. Given our data, the histogram
generated suggests that the sample data are not
normally distributed; they are skewed.
The Box Plot can further describe the distribution
of the data. The lower and upper ends of the box
represent the 25th and 75th percentiles. The line
inside the box is the median. The '+' indicates
the mean. Our mean and median are not the
same. The lines coming out of the box represent
"whiskers" and extend from the box to a maximum
of 1.5 times the interquartile range. Data beyond
the whiskers, up to 3 times the interquartile range,
are represented by 0's and show more extreme
values; outliers beyond that are represented by
an '*'.
The histogram (or stem-and-leaf plot) and the Box
Plot should be used to determine if symmetry and
smoothness (i.e., no heavy tails) exist in the data.
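The whisker and outlier rules of the Box Plot can be expressed as a small calculation. The following is a minimal sketch in Python, assuming distances are measured from the nearer edge of the box; it is an illustration, not PROC UNIVARIATE's plotting code. The function name `boxplot_symbol` is hypothetical, the quartiles come from Table 1, and the weights 330 and 990 are invented test values (990 echoes the data-error example above).

```python
def boxplot_symbol(value, q1, q3):
    """Classify a value by its distance outside the box, in units of the IQR."""
    iqr = q3 - q1
    # Distance beyond the nearer edge of the box (zero for values inside it)
    distance = max(q1 - value, value - q3, 0)
    if distance <= 1.5 * iqr:
        return "within whiskers"
    if distance <= 3 * iqr:
        return "0"
    return "*"

q1, q3 = 127, 204            # WEIGHT quartiles from Table 1
iqr = q3 - q1                # 77
print("whisker limits:", q1 - 1.5 * iqr, q3 + 1.5 * iqr)  # 11.5 319.5
print("'0' limits:", q1 - 3 * iqr, q3 + 3 * iqr)          # -104 435

for weight in (89, 316, 330, 990):
    print(weight, boxplot_symbol(weight, q1, q3))
```

With these quartiles the observed minimum (89) and maximum (316) both fall within the whisker limits, while a value such as 990 would be flagged with an '*'.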
Table 3
(result of using FREQ, PLOT, and NORMAL options)
THE SAS SYSTEM
UNIVARIATE PROCEDURE

VARIABLE=WEIGHT

MOMENTS
N              1017    SUM WGTS       1017
MEAN       168.5261    SUM          171391
STD DEV    51.19324    VARIANCE   2620.748
SKEWNESS   0.600273    KURTOSIS   -0.61305
USS        31546529    CSS         2662680
CV         30.37705    STD MEAN   1.605285
T:MEAN=0    104.982    PR>|T|       0.0001
NUM ^= 0       1017    NUM > 0        1017
M(SIGN)       508.5    PR>=|M|      0.0001
SGN RANK   258826.5    PR>=|S|      0.0001

QUANTILES(DEF=5)
100% MAX    316    99%    290
 75% Q3     204    95%    265
 50% MED    156    90%    246
 25% Q1     127    10%    110
  0% MIN     89     5%    102
                    1%     92
RANGE       227
Q3-Q1        77
MODE        120

EXTREMES
LOWEST     OBS    HIGHEST     OBS
    89(      4)       295(   1017)
    89(      3)       297(   1018)
    90(      9)       300(    553)
    90(      8)       300(   1019)
    90(      7)       316(   1020)

MISSING VALUE
COUNT              3
% COUNT/NOBS    0.29

HISTOGRAM and BOXPLOT
[Character plot not recoverable from the source. WEIGHT axis runs from 85 to 315; the # column gives the count in each interval; footnote: * MAY REPRESENT UP TO 3 COUNTS]

NORMAL PROBABILITY PLOT
[Character plot not recoverable from the source. WEIGHT (85 to 315) plotted against normal quantiles from -2 to +2]

FREQUENCY TABLE
                 PERCENTS
VALUE   COUNT   CELL    CUM
   89       2    0.2    0.2
   90       5    0.5    0.7
   91       3    0.3    1.0
etc
  139      14    1.4   37.5
  140      18    1.8   39.2
  141       5    0.5   39.7
etc
  189       3    0.3   67.8
  190      11    1.1   68.9
  191       5    0.5   69.4
etc
  239       1    0.1   87.3
  240      11    1.1   88.4
  241       1    0.1   88.5
etc
Table 4
(partial output-result of using the BY statement)
THE SAS SYSTEM
UNIVARIATE PROCEDURE
SCHEMATIC PLOTS

VARIABLE=WEIGHT

[Character plot not recoverable from the source. Side-by-side box plots of WEIGHT (vertical axis from 80 to 320) for GENDER = FEMALE and GENDER = MALE]