Graphically Enhancing Univariate Descriptive Statistics

James R. O’Hearn, Pfizer Inc., New York, NY
ABSTRACT
The problem of analyzing univariate data is investigated. In this paper an
enhanced graphical approach based on a template of numerical summary
statistics and graphical plots using SAS/GRAPH, SAS/QC, and
SAS/STAT software is presented. It is shown that this template provides
a more powerful tool for assessing statistical assumptions, for suggesting
corrective actions when assumptions are not met, and for elucidating
patterns and relationships than that of the UNIVARIATE procedure. It is
anticipated that this template will be useful for the data analyst for gaining
greater insight into the structure of univariate data. An example using
fictitious exercise stress test data will be given to illustrate this
methodology.
INTRODUCTION
The first step in any statistical analysis is to generate univariate descriptive
statistics on the primary analysis variables. The most commonly used and
comprehensive method for univariate analysis in SAS software is the
UNIVARIATE procedure.
However, except for a few minor
modifications, the output from this procedure has not changed for nearly
twenty years. This is most evident in the graphical plots where a number of
shortcomings exist by today’s standards:
(1) line-printer style graphics, (2) rotated sideways plots with horizontal
axis values in reverse order, (3) lack of control in scaling the vertical axis,
and (4) lack of graphical sophistication in plot appearance.
The objective of this paper is to enhance the display, and at the same time,
improve the statistical analysis of univariate data using SAS software. Our
overall strategy is to generate and save the descriptive statistics along with
a title and a footnote to a graphics catalog, to next create a seven panel
template, and then to replay (re-display) the graphical output into the
template. The SAS code used in this paper is based on SAS Release 6.11
for Windows.
GENERATION OF GRAPHICAL OUTPUT
Numerical Summary Statistics
The first step in developing our statistical template is to create a graphics
output file of summary statistics from a text input file. To do this, we first
generate the summary statistics from the UNIVARIATE procedure, and
then direct the output to an external file referenced by the fileref on the
PRINTTO procedure. Next, we invoke the GPRINT procedure, which
inputs this text file and converts it into a graphics file, and then saves it to a
graphics catalog. The following SAS code outlines this process more
completely.
libname gph 'c:\data\nesug97';
filename univ 'c:\data\nesug97\univ.out';
options nocenter nodate nonumber ls=132 ps=25;
goptions reset=all dev=win gunit=cells chartype=007
hpos=132 vpos=25 htext=1.1 nodisplay;
proc printto print=univ new; run;
proc univariate data=univ normal;
var &var;
id pat;
output out=stats n=nobs median=median qrange=qrange
mean=mean std=std probn=probn normal=normal
min=min max=max range=range; run;
proc printto; run;
proc gprint fileref=univ gout=gph.univ name='stats';
run; quit;
Note the following features of this code:
• DEV=WIN identifies the Microsoft Windows device driver used to
produce/view graphical output on the monitor screen.
• GUNIT=CELLS specifies the graphical unit of measurement for the
numerical statistical output.
• CHARTYPE=007 defines the hardware/TrueType font number (SAS
Monospace) for the summary statistics. (See (4) for instructions on
adding TrueType fonts with the WINPRTx series of device drivers.)
• The HPOS and VPOS graphics output options are equal to the
LINESIZE and PAGESIZE SAS text output options.
• HTEXT=1.1 specifies height of all statistical text.
• The PRINTTO procedure directs all statistical output to the external file
UNIV.OUT while the NEW option removes any previous output.
• The NODISPLAY option suppresses the display of the summary
statistics for the time being.
• GOUT=GPH.UNIV indicates the graphics catalog where the graphical
output from the GPRINT procedure is stored.
• NAME='STATS' specifies the name of the graph in the graphics catalog
GPH.UNIV.
Histogram
The histogram is the most familiar and widely used plot for summarizing a
data distribution. Unfortunately, it is often the most difficult to create
given the complexity associated with choosing the appropriate (or optimal)
binwidth/number of bins to display. To deal with this problem, a more
“precise” theoretical/ad hoc approach is adopted, and then incorporated into
the CAPABILITY procedure of SAS/QC software (see p. 52 of (5) for
advantages of this procedure over the GCHART procedure). First, we
calculate the number of bins using one of the upper bound formulas of
Emerson and Hoaglin (1983), depending on the number of observations in
the data set. Then, we modify this upper bound based on the normality of
the data given by the p-value of the Shapiro-Wilk Test of Normality. In
particular, if the data is normal (p ≥ .05), we reduce the upper bound by
40% of the range from the upper bound to the default CAPABILITY lower
bound of Terrell and Scott (1985) to smooth the data a little, but to still
retain a great amount of detail. If the data is not normal (p < .05), we do not
modify the upper bound, since it is recommended to use more bins rather than
fewer under such circumstances (Scott, 1992). The following SAS code
outlines the CAPABILITY part used to create the histogram and includes
as well a nonparametric kernel density curve superimposed onto the
histogram to examine features possibly obscured by the choice of
histogram bins or sampling variation. (See pp. 116-117 of (7) for more
details on kernel density estimates.)
goptions reset=all dev=win gunit=pct ftext=zapf nodisplay;
proc capability data=plots gout=gph.univ graphics noprint;
var &var;
hist &var/midpoints=(&first to &last by &binwidth)
vaxis=axis1 haxis=axis2 vscale=percent
kernel (k=normal c=mise l=1 w=2) noframe nolegend
name='hist';
axis1 label=(j=c h=&lh a=90 r=0 'Relative Frequency (%)')
value=(h=&vh);
axis2 offset=(10,) label=(j=c h=&lh "&varlabel (&unit)")
value=(h=&vh) order=(&first to &last by &binwidth)
minor=none;
title1 j=c h=&th 'Histogram Plot';
run; quit;
Note the following features of this code:
• GUNIT=PCT defines the graphical measurement unit for the plots.
• FTEXT=zapf identifies the software font used in the plots.
• GRAPHICS invokes high-resolution graphics.
• MIDPOINTS specifies the bin midpoints, where &first and &last are
designated as the “true” first and last midpoints of the data, respectively,
whose values are calculated by subtracting and adding a small
percentage of the binwidth to the min and max values, respectively, and
where &binwidth is equal to the range of the data/ # of calculated bins.
Note that this quantity will probably have to be adjusted depending on
the range of the data and the binwidth.
• KERNEL superimposes a nonparametric density curve onto the
histogram using the normal kernel function (K=normal) to estimate the
density function with the degree of smoothness (or bandwidth) indirectly
specified by the default C= value, the value that minimizes the
approximate mean integrated square error. (See p. 116 of (7).)
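The bin-count rule described at the start of this section is not shown in the code above. The DATA step below is a minimal sketch of how it might be written. The paper does not state the formulas, so this sketch assumes the commonly cited forms: 2*sqrt(n) for n < 100 and 10*log10(n) otherwise for the Emerson and Hoaglin upper bound, and (2n)**(1/3) for the Terrell and Scott lower bound. The data set name BINS and the variables UPPER, LOWER, and NBINS are hypothetical. NOBS, PROBN, and RANGE come from the STATS data set created by the UNIVARIATE procedure earlier.

```sas
/* Sketch only: derive the number of bins and the binwidth.           */
/* The bound formulas are assumptions, not taken from the paper.      */
data bins;
  set stats;                          /* nobs, probn, range           */

  /* Assumed Emerson-Hoaglin upper bound on the number of bins */
  if nobs < 100 then upper = floor(2*sqrt(nobs));
  else upper = floor(10*log10(nobs));

  /* Assumed Terrell-Scott lower bound (CAPABILITY default) */
  lower = ceil((2*nobs)**(1/3));

  /* Normal data (p >= .05): move 40% of the way toward the lower bound */
  if probn >= .05 then nbins = round(upper - .40*(upper - lower));
  else nbins = upper;
  if nbins < 1 then nbins = 1;

  /* Binwidth = range of the data / number of calculated bins */
  binwidth = range/nbins;
  call symput('binwidth', trim(left(put(binwidth, best12.))));
run;
```

The rounding of NBINS to an integer is a choice made for this sketch. As noted above, the resulting &binwidth will probably still need manual adjustment before it is used on the MIDPOINTS= option.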
Box-and-Whisker Plot
The box-and-whisker plot presents a quick overview of a number of
prominent features (center, middle 50% of the data, tail spread, and
extreme values) of a distribution. Even though it is not customary in
quality control work, the SHEWHART procedure in SAS/QC can be used
to create an individual box-and-whisker plot. The following SAS code
creates this plot.
%let vnbrlo=%eval(&vnbr-2); %let vnbrup=%eval(&vnbr+2);
goptions dev=win hsize=4 in vsize=3.15 in nodisplay;
proc shewhart data=plots &gout graphics;
boxchart &var*vnbr/boxstyle=schematicid
stddevs serifs nolegend nolimits noframe idsymbol=circle
vminor=1 vaxis=axis1 haxis=axis2 name='box';
axis1 label=(a=90 r=0 j=c h=&lh "&varlabel (&unit)")
value=(h=&vh);
axis2 label=(j=c h=&lh 'Visit Week') order=(&vnbrlo to
&vnbrup by 1) value=(h=&vh);
symbol1 v=star h=3 c=black;
title1 j=c h=&th 'Box and Whisker Plot';
run; quit;
Note the following features of this code:
• BOXSTYLE=SCHEMATICID draws each whisker to the most extreme
observation lying within 1.5 x IQR below the 25th or above the 75th
percentile, and plots observations beyond those limits individually.
• NOLIMITS option suppresses the display of control limits.
• HSIZE= and VSIZE= options set the dimensions of the graphics output
area. Their intended purpose here is to make the graphical output
proportional to the size of the template panels, and thus prevent any
distortion. The values above are chosen based on the default HSIZE and
VSIZE and the dimensions of the template panels. (Note: These options
were not applied to the histogram plot for the sake of preventing some
potential confusion in interpretation caused by the appearance of two
unforeseen axis values, one on each end, being added to the horizontal
axis when applied, even though all other aspects of the plot remained the
same.)
It should be noted that a WARNING message appears in the log when
running this code. However, this message pertains not to the construction
of the plot, but to the customary practice in using the SHEWHART
procedure of comparing the distributions of more than one subgroup
(defined by a subgroup variable, here ‘vnbr’) of some analysis variable
(&var) via side-by-side box-and-whisker plots.
Normal and Detrended Normal Probability Plots
The normal and detrended normal probability plots describe more
systematically how an empirical (sample) distribution differs from a
theoretical standard normal distribution, or more specifically, from a whole
family of normal distributions, with different locations (means) and spreads
(standard deviations). These plots are generated using a slightly modified
version of the SAS System for Statistical Graphics NQPLOT macro. The
following SAS code outlines only the normal probability plot, with
standard error lines added in order to examine the relative variability of
points in different regions of the plot.
data plots;
set plots;
sigma = &sigma;
_p_=round((_n_ - .5)/nobs,.0001);
_z_=round(probit(_p_),.0001);
_se_=(sigma/((1/sqrt(2*3.1415926)) *exp(-(_z_**2)/2)))
*sqrt(_p_*(1-_p_)/nobs);
_normal_= sigma * _z_ + &mu ;
_resid_ = &var - _normal_;
_lower_ = _normal_ - 1.96*_se_;
_upper_ = _normal_ + 1.96*_se_;
_reslo_ = -1.96*_se_;
_reshi_ = 1.96*_se_;
label _z_='Standard Gaussian (Z) Quantile'
_resid_='Deviation From Normal';
run;
proc gplot data=plots &gout;
plot &var * _z_= 1
_normal_ * _z_= 2
%if &stderr=YES %then %do;
_lower_ * _z_= 3
_upper_ * _z_= 3 %end;
/ overlay vaxis=axis1 haxis=axis2
hminor=3 vminor=1 name='npp';
symbol1 v=dot h=2 i=none c=black l=1;
symbol2 v=none i=join c=blue l=3 w=2;
symbol3 v=none i=join c=green l=33 w=2;
axis1 label=(a=90 r=0 h=&lh "&varlabel (&unit)")
value=(h=&vh);
axis2 label=(h=&lh 'Standard Normal Quantiles')
value=(h=&vh);
title1 j=c h=&th 'Normal Probability Plot';
run; quit;
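The detrended normal probability plot is not listed above. A minimal sketch, assuming it is drawn from the same PLOTS data set with the _RESID_, _RESLO_, and _RESHI_ variables created in the DATA step, might look like the following. The entry name 'dpp' matches the name replayed into the template later. The VREF=0 reference line and the unconditional standard error bands are choices made for this sketch, not necessarily those of the modified NQPLOT macro.

```sas
/* Sketch only: detrended normal probability plot.                    */
/* Deviations from the fitted normal line are plotted against the     */
/* standard normal quantiles, with +/- 1.96 standard error bands.     */
proc gplot data=plots &gout;
  plot _resid_ * _z_ = 1
       _reslo_ * _z_ = 3
       _reshi_ * _z_ = 3
       / overlay vref=0 vaxis=axis1 haxis=axis2
         hminor=3 vminor=1 name='dpp';
  symbol1 v=dot  h=2 i=none c=black l=1;
  symbol3 v=none i=join c=green l=33 w=2;
  axis1 label=(a=90 r=0 h=&lh 'Deviation From Normal')
        value=(h=&vh);
  axis2 label=(h=&lh 'Standard Normal Quantiles')
        value=(h=&vh);
  title1 j=c h=&th 'Detrended Normal Probability Plot';
run; quit;
```

If the data are normal, the points should scatter randomly about the zero line and stay within the bands.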
Title and Footnote
The title and footnote for this template are both created using the GSLIDE
procedure. The following SAS code generates these two descriptive
features.
%let title = Descriptive Statistics: &varlabel for Treatment
&tg at Visit &vnbr;
goptions dev=win target=winprtm ftext=zapf gunit=cells
nodisplay;
proc gslide &gout name='title';
title1 j=center h=1.20 "&protid"; title2 j=c h=1.25 "&title";
run; quit;
proc gslide &gout name='foot';
title1;
footnote1 j=left h=.5 "Program: &program";
footnote2 j=left h=.5 "Data Source: &srce";
footnote3 j=left h=.5 "Output Date: &sysdate";
footnote4 j=left h=.5 "Programmer: &name";
run; quit;
STATISTICAL TEMPLATE STRUCTURE
Let us now create the template. First, the graphics environment is reset to
DEV=WIN, TARGET=WINPRTM, and DISPLAY so as to preview the
template on the monitor screen as it should appear on a hard-copy device.
Second, the GREPLAY procedure is invoked with the NOFS, IGOUT, and
TC options. The NOFS (no full-screen) option allows us to use line-mode
statements for creating templates, replaying graphics, and managing
graphics catalogs, while IGOUT=GPH.UNIV specifies the graphics
catalog to use as input to the GREPLAY procedure and TC=GPH.UNIVCAT
identifies the storage location of the template. Third, the TDEF statement
defines the template structure, including the name SPECS and the four (x,
y) coordinates for each of the seven panels. Fourth, the TEMPLATE
statement assigns the newly defined template as the current template while
the DELETE statement deletes any graphical output from the input
catalog. Lastly, the TREPLAY statement selects and replays the graphs
into the template based on the panel number and graphical catalog entry
name. See Figure 1 below.
goptions reset=all dev=win target=winprtm display hsize=0
vsize=0;
proc greplay nofs igout=gph.univ tc=gph.univcat;
tdef specs
1/ llx=0   lly=0   ulx=0   uly=100
   urx=100 ury=100 lrx=100 lry=0   color=black
2/ llx=0   lly=65  ulx=0   uly=95
   urx=100 ury=95  lrx=100 lry=65  color=black
3/ llx=0   lly=35  ulx=0   uly=65
   urx=50  ury=65  lrx=50  lry=35  color=black
4/ llx=50  lly=35  ulx=50  uly=65
   urx=100 ury=65  lrx=100 lry=35  color=black
5/ llx=0   lly=5   ulx=0   uly=35
   urx=50  ury=35  lrx=50  lry=5   color=black
6/ llx=50  lly=5   ulx=50  uly=35
   urx=100 ury=35  lrx=100 lry=5   color=black
7/ llx=0   lly=0   ulx=0   uly=100
   urx=100 ury=100 lrx=100 lry=0   color=black;
template specs;
treplay 1:title 2:stats 3:hist 4:box 5:npp 6:dpp 7:foot;
run; quit;
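Because TREPLAY selects graphs by catalog entry name, it can help to confirm which entries the input catalog actually holds before replaying them. The following is a minimal sketch using the LIST statement of the GREPLAY procedure. It assumes the catalog GPH.UNIV used throughout this paper and is not part of the original program.

```sas
/* Sketch only: list the entries in the input catalog so that the     */
/* names used on TREPLAY (title, stats, hist, box, npp, dpp, foot)    */
/* can be checked before the template is replayed.                    */
proc greplay nofs igout=gph.univ;
  list igout;
run; quit;
```

The entry names shown in the listing should match those given on the TREPLAY statement.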
CONCLUSION
This paper has attempted to improve the visualization and analysis of
univariate statistical data. By enhancing the graphical features of the plots,
the conveyance and summarization of quantitative information is more
readily apparent and more easily understandable. These graphical
enhancements are generally worthwhile; however, they do require greater
time and computer resources. Thus, it is up to the data analyst to decide
whether or not the visual improvement of the template is worth the increase
in production time for univariate statistical analysis.
REFERENCES
[1] Emerson, J.D. and Hoaglin, D.C. (1983). Understanding Robust and
Exploratory Data Analysis, eds. D.C. Hoaglin, F. Mosteller, and J.W.
Tukey, New York: John Wiley & Sons.
[2] Scott, D.W. (1992). Multivariate Density Estimation: Theory,
Practice, and Visualization, New York: John Wiley & Sons.
[3] Terrell, G.R. and Scott, D.W. (1985). “Oversmoothed Nonparametric
Density Estimates”, JASA, 80, 209-214.
[4] “Producing Hardcopy Graphics Under Windows 95 and NT”, TS300C. Cary, NC: SAS Institute Inc.
[5] SAS Institute Inc. (1991), SAS System for Statistical Graphics, First
Edition, Cary, NC: SAS Institute Inc.
[6] SAS Institute Inc. (1990), SAS/GRAPH Software: Reference,
Version 6, Vols. 1, 2, Cary, NC: SAS Institute Inc., 1458 pp.
[7] SAS Institute Inc. (1995), SAS/QC Software: Usage and Reference,
Version 6, First Edition, Vols. 1, 2, Cary, NC: SAS Institute Inc., 1697 pp.
NQPLOT macro is reprinted with permission of SAS Institute Inc. from
SAS System for Statistical Graphics, First Edition. Copyright 1991 by
SAS Institute Inc.
AUTHOR CONTACT INFORMATION:
James O’Hearn
Pfizer Inc., 235 E. 42nd St., NY, NY 10017-5755
(212) 733-2795, ohearj@pfizer.com
SAS, SAS/GRAPH, SAS/QC, and SAS/STAT are registered
trademarks of SAS Institute Inc. in the USA and in other countries.
® indicates USA registration.
Windows is a registered trademark of the Microsoft Corporation.
Figure 1. Descriptive Statistics for Duration of Exercise Time for Treatment A at Visit 10. (Figure not drawn to scale.)