Graphically Enhancing Univariate Descriptive Statistics
James R. O’Hearn, Pfizer Inc., New York, NY

ABSTRACT
The problem of analyzing univariate data is investigated. In this paper an enhanced graphical approach based on a template of numerical summary statistics and graphical plots using SAS/GRAPH, SAS/QC, and SAS/STAT software is presented. It is shown that this template provides a more powerful tool for assessing statistical assumptions, for suggesting corrective actions when assumptions are not met, and for elucidating patterns and relationships than that of the UNIVARIATE procedure. It is anticipated that this template will be useful for the data analyst for gaining greater insight into the structure of univariate data. An example using fictitious exercise stress test data will be given to illustrate this methodology.

INTRODUCTION
The first step in any statistical analysis is to generate univariate descriptive statistics on the primary analysis variables. The most commonly used and comprehensive method for univariate analysis in SAS software is the UNIVARIATE procedure. However, except for a few minor modifications, the output from this procedure has not changed for nearly twenty years. This is most evident in the graphical plots where a number of shortcomings exist by today’s standards: (1) line-printer style graphics, (2) rotated sideways plots with horizontal axis values in reverse order, (3) lack of control in scaling the vertical axis, and (4) lack of graphical sophistication in plot appearance.

The objective of this paper is to enhance the display, and at the same time, improve the statistical analysis of univariate data using SAS software. Our overall strategy is to generate and save the descriptive statistics along with a title and a footnote to a graphics catalog, to next create a seven panel template, and then to replay (re-display) the graphical output into the template.
The SAS code in this paper is based on Release 6.11 of SAS software for Windows.

GENERATION OF GRAPHICAL OUTPUT

Numerical Summary Statistics
The first step in developing our statistical template is to create a graphics output file of summary statistics from a text input file. To do this, we first generate the summary statistics with the UNIVARIATE procedure and use the PRINTTO procedure to direct the output to an external file identified by a fileref. Next, we invoke the GPRINT procedure, which reads this text file, converts it into graphics output, and saves it to a graphics catalog. The following SAS code outlines this process more completely.

    /* Library for the graphics catalog and fileref for the text output */
    libname gph 'c:\data\nesug97';
    filename univ 'c:\data\nesug97\univ.out';

    options nocenter nodate nonumber ls=132 ps=25;
    goptions reset=all dev=win gunit=cells chartype=007
             hpos=132 vpos=25 htext=1.1 nodisplay;

    /* Route procedure output to the external file */
    proc printto print=univ new;
    run;

    proc univariate data=univ normal;
       var &var;
       id pat;
       output out=stats n=nobs median=median qrange=qrange mean=mean
              std=std probn=probn normal=normal min=min max=max range=range;
    run;

    /* Restore the default output destination */
    proc printto;
    run;

    /* Convert the text file into a graph stored in GPH.UNIV */
    proc gprint fileref=univ gout=gph.univ name='stats';
    run;
    quit;

Note the following features of this code:
• DEV=WIN identifies the Microsoft Windows device driver used to produce and view graphical output on the monitor screen.
• GUNIT=CELLS sets character cells as the unit of measurement for the summary-statistics text.
• CHARTYPE=007 defines the hardware/TrueType font number (SAS Monospace) for the summary statistics. (See [4] for instructions on adding TrueType fonts with the WINPRTx series of device drivers.)
• The HPOS and VPOS graphics options are set equal to the LINESIZE and PAGESIZE text output options (LS=132, PS=25), so the graphics output area has the same number of columns and rows as the text page.
• HTEXT=1.1 specifies the height of all statistical text.
• The PRINTTO procedure directs all statistical output to the external file UNIV.OUT, while the NEW option replaces any previous contents of that file.
• The NODISPLAY option suppresses the display of the summary statistics for the time being.
• GOUT=GPH.UNIV indicates the graphics catalog where the graphical output from the GPRINT procedure is stored.
• NAME='STATS' specifies the name of the graph in the graphics catalog GPH.UNIV.

Histogram
The histogram is the most familiar and widely used plot for summarizing a data distribution. Unfortunately, it is often the most difficult to create, given the complexity of choosing an appropriate (or optimal) bin width and number of bins. To deal with this problem, we adopt a more “precise” theoretical/ad hoc approach and incorporate it into the CAPABILITY procedure of SAS/QC software (see p. 52 of [5] for the advantages of this procedure over the GCHART procedure). First, we calculate an upper bound on the number of bins using one of the formulas of Emerson and Hoaglin (1983), chosen according to the number of observations in the data set. Then, we modify this upper bound according to the normality of the data, as judged by the p-value of the Shapiro-Wilk test of normality. In particular, if normality is not rejected (p ≥ .05), we reduce the upper bound by 40% of the distance between it and the default CAPABILITY lower bound of Terrell and Scott (1985); this smooths the data a little while still retaining a great amount of detail. If normality is rejected (p < .05), we leave the upper bound unchanged, since more bins rather than fewer are recommended under such circumstances (Scott, 1992).

The following SAS code outlines the CAPABILITY step used to create the histogram. It also superimposes a nonparametric kernel density curve on the histogram to reveal features that might be obscured by the choice of histogram bins or by sampling variation. (See pp. 116-117 of [7] for more details on kernel density estimates.)

    goptions reset=all dev=win gunit=pct ftext=zapf nodisplay;

    proc capability data=plots gout=gph.univ graphics noprint;
       var &var;
       /* Histogram with user-specified midpoints and a kernel density overlay */
       hist &var / midpoints=(&first to &last by &binwidth)
                   vaxis=axis1 haxis=axis2 vscale=percent
                   kernel(k=normal c=mise l=1 w=2)
                   noframe nolegend name='hist';
       axis1 label=(j=c h=&lh a=90 r=0 'Relative Frequency (%)')
             value=(h=&vh);
       axis2 offset=(10,) label=(j=c h=&lh "&varlabel (&unit)")
             value=(h=&vh) order=(&first to &last by &binwidth) minor=none;
       title1 j=c h=&th 'Histogram Plot';
    run;
    quit;

Note the following features of this code:
• GUNIT=PCT sets percentage of the graphics output area as the unit of measurement for the plots.
• FTEXT=ZAPF identifies the software font used in the plots.
• GRAPHICS invokes high-resolution graphics.
• MIDPOINTS specifies the bin midpoints. &first and &last are the “true” first and last midpoints of the data, calculated by subtracting a small percentage of the bin width from the minimum value and adding it to the maximum value, respectively. &binwidth is the range of the data divided by the calculated number of bins. Note that this quantity will probably have to be adjusted depending on the range of the data and the bin width.
• KERNEL superimposes a nonparametric density curve on the histogram, using the normal kernel function (K=NORMAL) to estimate the density function. The degree of smoothness (or bandwidth) is specified indirectly through C=MISE, the default, which selects the value that minimizes the approximate mean integrated square error. (See p. 116 of [7].)

Box-and-Whisker Plot
The box-and-whisker plot presents a quick overview of several prominent features of a distribution: its center, the middle 50% of the data, the tail spread, and extreme values. Although the SHEWHART procedure in SAS/QC is not customarily used this way in quality control work, it can create an individual box-and-whisker plot. The following SAS code creates this plot.

    /* Horizontal axis range: two visits on either side of the current one */
    %let vnbrlo=%eval(&vnbr-2);
    %let vnbrup=%eval(&vnbr+2);

    goptions dev=win hsize=4 in vsize=3.15 in nodisplay;

    proc shewhart data=plots &gout graphics;
       boxchart &var*vnbr / boxstyle=schematicid stddevs serifs
                            nolegend nolimits noframe idsymbol=circle
                            vminor=1 vaxis=axis1 haxis=axis2 name='box';
       axis1 label=(a=90 r=0 j=c h=&lh "&varlabel (&unit)") value=(h=&vh);
       axis2 label=(j=c h=&lh 'Visit Week')
             order=(&vnbrlo to &vnbrup by 1) value=(h=&vh);
       symbol1 v=star h=3 c=black;
       title1 j=c h=&th 'Box and Whisker Plot';
    run;
    quit;

Note the following features of this code:
• BOXSTYLE=SCHEMATICID draws each whisker to the most extreme observation lying within 1.5 × IQR below the 25th percentile or above the 75th percentile, and plots observations beyond that distance individually, labeled with the ID variable.
• The NOLIMITS option suppresses the display of control limits.
• The HSIZE= and VSIZE= options set the dimensions of the graphics output area. Their purpose here is to make the graphical output proportional to the size of the template panels and thus prevent distortion. The values above are chosen based on the default HSIZE and VSIZE and the dimensions of the template panels. (Note: These options were not applied to the histogram because, when they were, two unforeseen axis values appeared on the horizontal axis, one at each end, which could confuse interpretation even though all other aspects of the plot remained the same.)

A WARNING message appears in the log when this code is run. The message does not concern the construction of the plot. It arises because the SHEWHART procedure is customarily used to compare the distributions of more than one subgroup (defined by a subgroup variable, here ‘vnbr’) of an analysis variable (&var) through side-by-side box-and-whisker plots.
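
To make the schematic rule concrete, the fence arithmetic can be sketched outside SAS. The Python sketch below is an illustration only and is not part of the template. The function names, the sample durations, and the quartile rule are assumptions introduced for this example; the quartile rule is the empirical-distribution-with-averaging definition, assumed here to correspond to the SAS default PCTLDEF=5.

```python
import math


def percentile_avg(sorted_values, p):
    """Percentile by the empirical distribution function with averaging.

    Assumed rule: write n*p = j + g, with j the integer part and g the
    fractional part. If g == 0, return the mean of the j-th and (j+1)-th
    ordered values; otherwise return the (j+1)-th ordered value
    (positions are 1-based).
    """
    n = len(sorted_values)
    target = n * p
    j = math.floor(target)
    g = target - j
    if g == 0:
        lower = sorted_values[max(j - 1, 0)]
        upper = sorted_values[min(j, n - 1)]
        return (lower + upper) / 2
    return sorted_values[j]  # 0-based index j is the (j+1)-th ordered value


def schematic_box(values, k=1.5):
    """Quartiles, fences, whisker ends, and outside values for one box."""
    if not values:
        raise ValueError("need at least one observation")
    xs = sorted(values)
    q1 = percentile_avg(xs, 0.25)
    q3 = percentile_avg(xs, 0.75)
    iqr = q3 - q1
    lower_fence = q1 - k * iqr
    upper_fence = q3 + k * iqr
    # Whiskers stop at the most extreme observations inside the fences.
    inside = [x for x in xs if lower_fence <= x <= upper_fence]
    outside = [x for x in xs if x < lower_fence or x > upper_fence]
    return {
        "q1": q1,
        "median": percentile_avg(xs, 0.5),
        "q3": q3,
        "iqr": iqr,
        "lower_fence": lower_fence,
        "upper_fence": upper_fence,
        "whisker_low": inside[0],
        "whisker_high": inside[-1],
        "outside": outside,
    }


# Hypothetical exercise durations (seconds); not data from this paper.
durations = [310, 325, 340, 355, 360, 372, 380, 395, 410, 600]
summary = schematic_box(durations)
print(summary["whisker_low"], summary["whisker_high"], summary["outside"])
```

For these made-up durations the quartiles are 340 and 395, so the fences fall at 257.5 and 477.5. The whiskers therefore end at 310 and 410, and the value 600 is the only observation that would be plotted and labeled individually.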

Normal and Detrended Normal Probability Plots
The normal and detrended normal probability plots describe more systematically how an empirical (sample) distribution differs from a theoretical standard normal distribution or, more specifically, from a whole family of normal distributions with different locations (means) and spreads (standard deviations). These plots are generated using a slightly modified version of the NQPLOT macro from SAS System for Statistical Graphics [5]. The following SAS code outlines only the normal probability plot, with standard-error lines added in order to examine the relative variability of points in different regions of the plot.

    /* Excerpt from inside the macro; _n_ is used as the rank of &var */
    data plots;
       set plots;
       sigma    = &sigma;
       _p_      = round((_n_ - .5)/nobs, .0001);      /* plotting position */
       _z_      = round(probit(_p_), .0001);          /* normal quantile   */
       /* standard error of the expected value at quantile _z_ */
       _se_     = (sigma/((1/sqrt(2*3.1415926))*exp(-(_z_**2)/2)))
                  *sqrt(_p_*(1-_p_)/nobs);
       _normal_ = sigma * _z_ + &mu;
       _resid_  = &var - _normal_;
       _lower_  = _normal_ - 1.96*_se_;
       _upper_  = _normal_ + 1.96*_se_;
       _reslo_  = -1.96*_se_;
       _reshi_  =  1.96*_se_;
       label _z_='Standard Gaussian (Z) Quantile'
             _resid_='Deviation From Normal';
    run;

    proc gplot data=plots &gout;
       plot &var     * _z_ = 1
            _normal_ * _z_ = 2
            %if &stderr=YES %then %do;
            _lower_  * _z_ = 3
            _upper_  * _z_ = 3
            %end;
            / overlay vaxis=axis1 haxis=axis2 hminor=3 vminor=1 name='npp';
       symbol1 v=dot  h=2    c=black l=1 i=none;
       symbol2 v=none i=join c=blue  l=3  w=2;
       symbol3 v=none i=join c=green l=33 w=2;
       axis1 label=(a=90 r=0 h=&lh "&varlabel (&unit)") value=(h=&vh);
       axis2 label=(h=&lh 'Standard Normal Quantiles') value=(h=&vh);
       title1 j=c h=&th 'Normal Probability Plot';
    run;
    quit;

Title and Footnote
The title and footnote for this template are both created using the GSLIDE procedure. The following SAS code generates these two descriptive features.

    %let title = Descriptive Statistics: &varlabel for Treatment &tg at Visit &vnbr;

    goptions dev=win target=winprtm ftext=zapf gunit=cells nodisplay;

    /* Title slide */
    proc gslide &gout name='title';
       title1 j=center h=1.20 "&protid";
       title2 j=c h=1.25 "&title";
    run;
    quit;

    /* Footnote slide */
    proc gslide &gout name='foot';
       title1;
       footnote1 j=left h=.5 "Program: &program";
       footnote2 j=left h=.5 "Data Source: &srce";
       footnote3 j=left h=.5 "Output Date: &sysdate";
       footnote4 j=left h=.5 "Programmer: &name";
    run;
    quit;

STATISTICAL TEMPLATE STRUCTURE
Let us now create the template. First, the graphics environment is reset to DEV=WIN, TARGET=WINPRTM, and DISPLAY so as to preview the template on the monitor screen as it should appear on a hard-copy device. Second, the GREPLAY procedure is invoked with the NOFS, IGOUT, and TC options. The NOFS (no full-screen) option allows us to use line-mode statements for creating templates, replaying graphics, and managing graphics catalogs, while IGOUT=GPH.UNIV specifies the graphics catalog to input into the GREPLAY procedure and TC=GPH.UNIVCAT identifies the storage location of the template. Third, the TDEF statement defines the template structure, including the name SPECS and the four (x, y) coordinates for each of the seven panels. Fourth, the TEMPLATE statement assigns the newly defined template as the current template while the DELETE statement deletes any graphical output from the input catalog. Lastly, the TREPLAY statement selects and replays the graphs into the template based on the panel number and graphics catalog entry name. See Figure 1 below.

    goptions reset=all dev=win target=winprtm display hsize=0 vsize=0;

    proc greplay nofs igout=gph.univ tc=gph.univcat;
       /* Panel corners are given as percentages of the graphics output area */
       tdef specs
          1/ llx=0  lly=0  ulx=0  uly=100 urx=100 ury=100 lrx=100 lry=0  color=black
          2/ llx=0  lly=65 ulx=0  uly=95  urx=100 ury=95  lrx=100 lry=65 color=black
          3/ llx=0  lly=35 ulx=0  uly=65  urx=50  ury=65  lrx=50  lry=35 color=black
          4/ llx=50 lly=35 ulx=50 uly=65  urx=100 ury=65  lrx=100 lry=35 color=black
          5/ llx=0  lly=5  ulx=0  uly=35  urx=50  ury=35  lrx=50  lry=5  color=black
          6/ llx=50 lly=5  ulx=50 uly=35  urx=100 ury=35  lrx=100 lry=5  color=black
          7/ llx=0  lly=0  ulx=0  uly=100 urx=100 ury=100 lrx=100 lry=0  color=black;
       template specs;
       treplay 1:title 2:stats 3:hist 4:box 5:npp 6:dpp 7:foot;
    run;
    quit;

CONCLUSION
This paper has attempted to improve the visualization and analysis of univariate statistical data. Enhancing the graphical features of the plots makes the quantitative information easier to convey, summarize, and understand. In general, these graphical enhancements are worthwhile; however, they require more time and computer resources. Thus, it is up to the data analyst to decide whether the visual improvement of the template is worth the increase in production time for univariate statistical analysis.

REFERENCES
[1] Emerson, J.D. and Hoaglin, D.C. (1983). Understanding Robust and Exploratory Data Analysis, eds. D.C. Hoaglin, F. Mosteller, and J.W. Tukey, New York: John Wiley & Sons.
[2] Scott, D.W. (1992). Multivariate Density Estimation: Theory, Practice, and Visualization, New York: John Wiley & Sons.
[3] Terrell, G.R. and Scott, D.W. (1985). “Oversmoothed Nonparametric Density Estimates”, JASA, 80, 209-214.
[4] “Producing Hardcopy Graphics Under Windows 95 and NT”, TS300C. Cary, NC: SAS Institute Inc.
[5] SAS Institute Inc. (1991), SAS System for Statistical Graphics, First Edition, Cary, NC: SAS Institute Inc.
[6] SAS Institute Inc. (1990), SAS/GRAPH Software: Reference, Version 6, Vols. 1, 2, Cary, NC: SAS Institute Inc., 1458 pp.
[7] SAS Institute Inc. (1995), SAS/QC Software: Usage and Reference, Version 6, First Edition, Vols. 1, 2, Cary, NC: SAS Institute Inc., 1697 pp.

NQPLOT macro is reprinted with permission of SAS Institute Inc. from SAS System for Statistical Graphics, First Edition. Copyright 1991 by SAS Institute Inc.

AUTHOR CONTACT INFORMATION:
James O’Hearn
Pfizer Inc., 235 E. 42nd St., NY, NY 10017-5755
(212) 733-2795, ohearj@pfizer.com

SAS, SAS/GRAPH, SAS/QC, and SAS/STAT are registered trademarks of SAS Institute Inc. in the USA and in other countries. ® indicates USA registration. Windows is a registered trademark of the Microsoft Corporation.

Figure 1. Descriptive Statistics for Duration of Exercise Time for Treatment A at Visit 10. (Figure not drawn to scale.)