"Graphical Representation of Data" in: Encyclopedia of Statistics in

Graphical Representation
of Data
Graphs, charts, and diagrams offer effective display
and enable easy comprehension of complex, multifaceted relationships. Gnanadesikan and Wilk [1]
point out that “man is a geometrical animal and seems
to need and want pictures for parsimony and to stimulate insight.” Various forms of statistical graphs have
been in use for over 200 years.
Beniger and Robyn [2] cite the following as first
or near-first uses of statistical graphs: Playfair’s [3]
use of the bar chart to display Scotland’s 1781
imports and exports for 17 countries; Fourier’s [4]
cumulative distribution of population age in Paris
in 1817; Lalanne’s [5] contour plot of temperature by hour and month; and Perozzo’s [6] stereogram display of Sweden’s population for the period
1750–1875 by age groupings. Fienberg [7] notes
that Lorenz [8] made the first use of the probability–probability (P–P) plot in 1905. Each of these
plots is discussed subsequently. Beniger and Robyn
[2], Cox [9], and Fienberg [7] provide additional
historical accounts and insights on the evolution of
graphics.
Today, graphical methods play an important role
in all aspects of a statistical investigation – from the
initial exploratory plots, through various stages of
analysis, to the final communication and display of
results. Many persons consider graphical displays to be
the single most effective and robust statistical tool.
Not only are graphical procedures helpful but in
many cases they are essential. Tukey [10] claims
that “the greatest value of a picture is when it
forces us to notice what we never expected to see.”
This is nowhere better exemplified than by Anscombe's
data sets [11], where plots of four equal-size data
sets (Figure 1) reveal large differences among the
sets even though all sets produce the same linear regression summaries. Mahon [12] maintains that
statisticians’ responsibilities include communication
of their findings to decision makers, who frequently are statistically naive, and the best way
to accomplish this is through the power of the
picture.
Good graphs should be simple, self-explanatory,
and not deceiving. Cox [9] offers the following
guidelines:
1. The axes should be clearly labeled with the
names of the variables and the units of measurement.
2. Scale breaks should be used for false origins.
3. Comparison of related diagrams should be easy,
for example, by using identical scales of measurement and placing diagrams side by side.
4. Scales should be arranged so that systematic
and approximately linear relations are plotted at
roughly 45° to the x axis.
5. Legends should make diagrams as nearly self-explanatory (i.e., independent of the text) as is
feasible.
6. Interpretation should not be prejudiced by the
technique of presentation.
Most of the graphs discussed here, which involve
spatial relationships, are implicitly or explicitly on
Cartesian or rectangular coordinate grids, with axes
that meet at right angles. The horizontal axis is the
abscissa or x axis and the vertical axis is the ordinate or y axis. Each point on the grid is uniquely
specified by an x and a y value, denoted by the
ordered pair (x, y). Ordinary graph paper utilizes linear scales for both axes. Other scales commonly used
are logarithmic (see Figure 8) and inverse distribution
functions (see Figure 6). Craver [13] includes a discussion of plotting techniques with over 200 graph
papers that may be copied without permission of the
publisher.
This discussion includes old and new graphical
forms that have broad application or are specialized
but commonly used. Taxonomies based on the uses
of graphs have been addressed by several authors,
including Tukey [14], Fienberg [7], and Schmid and
Schmid [15]. This discussion includes references to
more than 50 different graphical displays (Table 1)
and is organized according to the principal functions
of graphical techniques:
Exploration
Analysis
Communication and display of results
Graphical aids
Some graphical displays are used in a variety of ways;
however, each display here is discussed only in the
context of its widest use.
Figure 1 Anscombe’s [11] plots of four equal-size data sets, all of which yield the same regression summaries [Reprinted
with permission from American Statistical Association. 1973.]
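The identity of the summaries can be verified numerically. The sketch below fits least-squares lines by hand to the published x values and the first two of Anscombe's y vectors; both sets yield essentially the line y = 3.0 + 0.5x even though their scatter plots look entirely different.

```python
# Least-squares summaries for two of Anscombe's [11] four data sets.
# Data values are the published ones; the fitting is ordinary least squares.

x = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
y1 = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]
y2 = [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]

def ls_line(x, y):
    """Return (slope, intercept) of the least-squares line y = a + bx."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    sxx = sum((xi - xbar) ** 2 for xi in x)
    b = sxy / sxx
    return b, ybar - b * xbar

b1, a1 = ls_line(x, y1)
b2, a2 = ls_line(x, y2)
# Set II is strongly curvilinear when plotted, yet has the same line as set I.
print(round(b1, 3), round(a1, 2))   # 0.5 3.0
print(round(b2, 3), round(a2, 2))   # 0.5 3.0
```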
Exploratory Graphs
Exploratory graphs are used to help diagnose characteristics of the data and to suggest appropriate
statistical analyses and models. They usually do
not require assumptions about the behavior of the
data or the system or mechanism that generated
the data.
Data Condensation
A listing or tabulation of data can be very difficult to comprehend, even for relatively small data
sets. Data condensation techniques, discussed in most
elementary statistics texts, include several types of
frequency distributions (see, e.g., Freund [16] and
Johnson and Leone [17]). These distributions associate the frequency of occurrence with each distinct
value or distinct group of values in a data set. Ordinarily, data from a continuous variable will first be
grouped into intervals, preferably of equal length,
which completely cover the range of the data without overlap. The number or length of these intervals is usually best determined from the size of the
data set, with larger sets effectively able to support more intervals. Table 2 presents four commonly
used forms: frequency, relative frequency, cumulative
frequency, and cumulative relative frequency. Here
carbon monoxide emissions (grams per mile) of
794 cars are grouped into intervals of length 24,
where the upper limit is included (denoted by the
square upper interval bracket) and the lower limit
is not (denoted by the lower open parenthesis).
Columns 4 and 6 depict relative frequencies, which
are scaled versions (divided by 794) of columns
3 and 5, respectively.
The four distributions tabulated in Table 2 are
useful data summaries; however, plots of them
can help the data analyst develop an even better
understanding of the data. A histogram is a bar
graph associating frequencies or relative frequencies with data intervals. The histogram for carbon
monoxide data shown in Figure 2 clearly shows a
positively skewed (see Skewness), unimodal distribution with modal interval (72–96]. Other forms
of histograms use symbols such as dots (dot-array diagram) or asterisks in place of bars, with
each symbol representing a designated number of
counts.
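The grouping behind such a histogram can be sketched directly: intervals (lower, upper] of equal length, with the upper limit included, yield the four distributions tabulated in Table 2. The data vector below is illustrative, not the actual 794-car emissions data.

```python
import math

def frequency_distributions(data, width):
    """Group data into (k*width, (k+1)*width] intervals (upper limit included)
    and return frequency, relative, cumulative, and cumulative relative lists."""
    n = len(data)
    # a value v falls in interval ceil(v/width) - 1, so the upper limit is included
    nbins = max(math.ceil(v / width) for v in data)
    freq = [0] * nbins
    for v in data:
        freq[math.ceil(v / width) - 1] += 1
    rel = [f / n for f in freq]
    cum, total = [], 0
    for f in freq:
        total += f
        cum.append(total)
    cum_rel = [c / n for c in cum]
    return freq, rel, cum, cum_rel

freq, rel, cum, cum_rel = frequency_distributions([12, 30, 30, 50, 70, 95], 24)
print(freq)   # [1, 2, 2, 1] -> intervals (0,24], (24,48], (48,72], (72,96]
```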
A frequency polygon is similar to a histogram.
Points are plotted at coordinates representing interval
midpoints and the associated frequency; consecutive
points are connected with straight lines (e.g., in
Table 2, plot column 3 vs column 2). The form of this
graph is analogous to that of a probability density
function.
A disadvantage of a grouped-data histogram is
that individual data points cannot be identified
since all the data falling in a given interval are
indistinguishable. A display that circumvents this
difficulty is the stem and leaf diagram, a modified
histogram with "stems" corresponding to interval groups and "leaves" corresponding to bars.
Tukey [10] gives a thorough discussion of stem and
leaf and its variations.

Table 1  Graphical displays used in the analysis and interpretation of data

Exploratory plots
  Data condensation: histogram; dot-array diagram; stem and leaf diagram; frequency polygon; ogive; box and whisker plot
  Relationships among variables
    Two variables: scatter plot; sequence plot; autocorrelation plot; cross-correlation plot
    Three or more variables: labeled scatter plot; glyphs and metroglyphs; weathervane plot; biplot; face plots; Fourier plot; cluster trees; similarity and preference maps; multidimensional scaling displays

Graphs used in the analysis of data
  Distribution assessment: probability plot; Q–Q plot; P–P plot; hanging histogram; rootogram; Poissonness plot
  Model adequacy and assumption verification: average versus standard deviation; residual plots; partial residual plot; component-plus-residual plot
  Decision making: control chart; CUSUM chart; Youden plot; half-normal plot; Cp plot; ridge trace

Communication and display of results
  Quantitative graphics: bar chart; pictogram; pie chart; contour plot; stereogram; color map
  Summary of statistical analyses: means plots; sliding reference distribution; notched-box plot; factor space/response; interaction plot; contour plot; predicted response plot; confidence region plot

Graphical aids
  Power curves; sample-size curves; confidence limits; nomographs; graph paper; trilinear coordinates
An ogive is a graph of the cumulative frequencies
(or cumulative relative frequencies) against the upper
limits of the intervals (e.g., from Table 2, plot column
5 vs the upper limit of each interval in column 1)
where straight lines connect consecutive points. An
ogive is a grouped-data analog of a graph of the
empirical cumulative distribution function and is
especially useful in graphically estimating percentiles
(see Quantiles), which are data values associated
with specified cumulative percents. Figure 3 shows
the ogive for the carbon monoxide data and how it
is used to obtain the 25th percentile (i.e., the lower
quartile).
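Graphical percentile estimation from an ogive amounts to linear interpolation between the plotted points. The sketch below uses the upper limits and cumulative relative frequencies from Table 2 to recover the lower quartile read off Figure 3.

```python
# Upper interval limits and cumulative relative frequencies from Table 2.
uppers  = [24, 48, 72, 96, 120, 144, 168, 192, 216, 240, 264, 288, 312, 336, 360]
cum_rel = [0.016, 0.140, 0.343, 0.581, 0.767, 0.874, 0.931, 0.969,
           0.981, 0.987, 0.994, 0.995, 0.997, 0.999, 1.000]

def ogive_percentile(p):
    """Linearly interpolate the data value whose cumulative fraction is p."""
    xs, ys = [0] + uppers, [0.0] + cum_rel   # the ogive starts at (0, 0)
    for (x0, y0), (x1, y1) in zip(zip(xs, ys), zip(xs[1:], ys[1:])):
        if y0 <= p <= y1:
            return x0 + (x1 - x0) * (p - y0) / (y1 - y0)

q1 = ogive_percentile(0.25)   # lower quartile, about 61 grams per mile
print(round(q1, 1))           # 61.0
```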
Figure 2 Histogram of Environmental Protection Agency surveillance data (1957–1967) on carbon monoxide emissions
from 794 cars
Table 2  Frequency distributions of carbon monoxide data

     (1)              (2)         (3)         (4)         (5)          (6)
     Interval         Interval    Frequency   Relative    Cumulative   Cumulative
                      midpoint                frequency   frequency    relative
                                                                       frequency
 1.  (0–24]              12           13        0.016          13        0.016
 2.  (24–48](a)          36           98        0.123         111        0.140
 3.  (48–72]             60          161        0.203         272        0.343
 4.  (72–96]             84          189        0.238         461        0.581
 5.  (96–120]           108          148        0.186         609        0.767
 6.  (120–144]          132           85        0.107         694        0.874
 7.  (144–168]          156           45        0.057         739        0.931
 8.  (168–192]          180           30        0.038         769        0.969
 9.  (192–216]          204           10        0.013         779        0.981
10.  (216–240]          228            5        0.006         784        0.987
11.  (240–264]          252            5        0.006         789        0.994
12.  (264–288]          276            1        0.001         790        0.995
13.  (288–312]          300            2        0.003         792        0.997
14.  (312–336]          324            1        0.001         793        0.999
15.  (336–360]          348            1        0.001         794        1.000

(a) Notation designates inclusion of all values greater than 24 and less than or equal to 48.
Another display, which highlights five important
characteristics of a data set, is a box and whisker or
box plot. The box, usually aligned vertically, encloses
the interquartile range (see Descriptive Statistics),
with the lower line identifying the 25th percentile
(lower quartile, see Quartiles) and the upper one
the 75th (upper quartile). A line sectioning the
box displays the 50th percentile (median) and its
relative position within the interquartile range. The
whiskers at either end may extend to the extreme
values, or, for large data sets, to the 10th/90th
or 5th/95th percentiles. These plots are especially
convenient for comparing two or more data sets, as
shown in Figure 4 for winter snowfalls of Buffalo
and Rochester, New York. (See Tukey [10] for
further discussion, and McGill et al. [18] for some
variations.)
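The five quantities displayed by a basic box and whisker plot can be computed as sketched below. The quartile rule interpolates between order statistics, one of several common conventions, so published box plots may differ slightly; the sample values are illustrative.

```python
def five_number_summary(data):
    """Return (minimum, lower quartile, median, upper quartile, maximum)."""
    s = sorted(data)
    n = len(s)

    def quantile(p):
        # linear interpolation between order statistics at position p*(n-1)
        h = p * (n - 1)
        i = int(h)
        frac = h - i
        return s[i] if i + 1 >= n else s[i] + frac * (s[i + 1] - s[i])

    return min(s), quantile(0.25), quantile(0.50), quantile(0.75), max(s)

lo, q1, med, q3, hi = five_number_summary([40, 42, 71, 75, 78, 89, 97, 108, 162, 199])
print(lo, q1, med, q3, hi)   # 40 72.0 83.5 105.25 199
```

For large data sets the whiskers would instead be drawn at, for example, the 10th and 90th percentiles obtained from the same `quantile` rule.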
Figure 3 Ogive of automobile carbon monoxide emissions data shown in Figure 2

Figure 4 Box and whisker plots comparing winter snowfalls of Buffalo and Rochester, New York (1939–1940 to 1977–1978) and demonstrating little distributional difference (contrary to popular belief). (Based on local climatological data gathered by the National Oceanic and Atmospheric Administration, National Climatic Center, Asheville, NC)

Relationships between Two Variables
Often of interest is the relationship, if any, between
x and y, or the development of a model to predict y
given the value of x. Ordinarily, an (x, y) measurement pair is obtained from the same experimental
unit, as shown in Table 3.

Table 3  Examples of (x, y) measurement pairs

Unit       x                      y
Object     Diameter               Weight
Person     Age                    Height
Product    Raw material purity    Quality

A usual first step is to construct a scatter plot, a
collection of plotted points representing the measurement pairs (xi, yi), i = 1, . . . , n. The importance of
scatter plots was seen in Figure 1. These four very
different sets of data yield the same regression line
(drawn on the plots) and associated statistics [11].
Consequently, the numerical results of an analysis,
without the benefit of a look at plots of the data,
could result in invalid conclusions.
The objective of regression analyses is to develop
a mathematical relationship between a measured
response or dependent variable, y, and two or more
predictor or independent variables, x1 , x2 , . . . , xp . A
usual initial step is to plot y versus each of the
x variables individually and to plot each xi versus
each of the other x’s. This results in a total of
p + p(p − 1)/2 plots. Plots of y versus xi enable one
to identify the xi variables that appear to have large
effects, to assess the form of a relationship between
the y and xi variables, and to determine whether
any unusual data points are present. Plots of xi
versus xj , i = j , help to identify strong correlations
that may exist among the predictor variables. It
is important to recognize such correlations because
least-squares regression techniques work best when
these correlations are small [19] (see Collinearity;
Data Collection).
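The plot count follows directly from the strategy just described: p plots of y against each predictor plus p(p − 1)/2 pairwise predictor plots. A small sketch enumerating the plot list (variable names are illustrative):

```python
from itertools import combinations

def initial_plot_list(predictors):
    """List the p plots of y vs each x_i and the p(p-1)/2 plots of x_i vs x_j."""
    p = len(predictors)
    plots = [("y", x) for x in predictors]         # y versus each x_i
    plots += list(combinations(predictors, 2))     # x_i versus x_j, i < j
    assert len(plots) == p + p * (p - 1) // 2
    return plots

print(len(initial_plot_list(["x1", "x2", "x3", "x4"])))   # 4 + 6 = 10
```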
In many instances data are collected sequentially
in time and a plot of the data versus the sequence
of collection can help identify sources of important
effects. Figure 5 shows a sequence plot of gasoline
mileage of an automobile versus the sequence of
gasoline fill-ups. The large seasonal effect (summer
mileage is higher than winter mileage) and the
increase in gasoline mileage due to a major tune-up
are clearly evident.
Observations collected sequentially in time, such
as the gasoline mileage data plotted in Figure 5, form
a time series (see Time Series Analysis). Statistical
modeling of a time series is largely accomplished by
studying the correlation between observations separated by 1, 2, . . . , n − 1 units in time. For example,
the lag-1 autocorrelation is the correlation coefficient
between observations collected at time i and time
i + 1, i = 1, 2, . . . , n − 1; it measures the linear relationship among all pairs of consecutive observations
(see Autocorrelation Function). An autocorrelation
Figure 5 Sequence plot of gasoline mileage data. Note the seasonal variation and the increased average value and decreased
variation in mileage after tune-up
plot may be used to study the “correlation structure”
in the data where the lag-j autocorrelation coefficient
computed between observations i and i + j is plotted versus lag j . A cross-correlation plot between
two time series is developed similarly. The lag-j
correlation coefficient, computed between observations i in one series and observations i + j in the
other series, is plotted versus the lag j . Box and Jenkins [20] discuss the construction and interpretation of
these plots, in addition to presenting examples on the
use of sequence plots and the modeling of time series.
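The lag-j coefficient that such plots display can be sketched directly: the correlation between observations i and i + j, computed in the usual deviations-from-the-overall-mean form used for autocorrelation plots.

```python
def autocorrelation(series, lag):
    """Lag-j autocorrelation coefficient of a time series."""
    n = len(series)
    mean = sum(series) / n
    dev = [v - mean for v in series]
    num = sum(dev[i] * dev[i + lag] for i in range(n - lag))
    den = sum(d * d for d in dev)
    return num / den

# A strongly trending series has a large positive lag-1 autocorrelation.
r1 = autocorrelation([1, 2, 3, 4, 5, 6, 7, 8], 1)
print(r1)   # 0.625
```

An autocorrelation plot is then just `autocorrelation(series, j)` plotted against j; a cross-correlation plot replaces the second `dev` factor with deviations from a second series.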
Relationships among More than Two Variables
Scatter plots directly display relationships between
two variables. Values of a third variable can be
incorporated in a labeled scatter plot, in which each
plotted point (whose location designates the values of
two variables) is labeled by a symbol designating a
level of the third variable. Anderson [21] extended
these to “pictorialized” scatter plots, called glyphs
and metroglyphs, where each coordinate point is
plotted as a circle and has two or more rays emanating
from it; the length of each ray is indicative of the
value of the variable associated with that ray. Bruntz
et al. [22] developed a variation of the glyph for
four variables, called the weathervane plot, where
values of two of the variables are again indicated
by the plotted coordinates and the other two by using
variable-sized plotting symbols and variable-length
arrows attached to the symbols.
Tukey [10], Gabriel's biplot [23], and Mandel
[24] provide innovative methods for displaying two-way tables of data. All three approaches involve
fitting a model to the table of data and then constructing various plots of the coefficients in the fitted
model to study the relationships among the variables.
Chernoff [25] used facial characteristics to display
values of up to 18 variables through face plots.
Each face represents a multivariate datum point, and
each variable is represented by a different facial
characteristic, such as size of eyes or shape of
mouth. Experience has shown that the interpretation
of these plots can be affected by how the variables
are assigned to facial characteristics.
Exploratory plots of raw data are usually less
effective with measurements on four or more variables (e.g., height, weight, age, sex, race, etc., of
a subject). The usual approach then is to reduce
the dimensionality of the data by grouping variables
with common properties or identifying and eliminating unimportant variables. The variables plotted may
be functions of the original variables as specified by
a statistical model. Exploratory analysis of the data
is then conducted with plots of the “reduced” data.
Andrews’ Fourier plots [26], data clustering [27],
similarity and preference mapping [28], and geometrical representations of multidimensional scaling
analyses [29] are examples of such procedures. One
often attempts to determine whether there are two or
more groups (i.e., clusters) of observations within the
data set. When several different groups are identified,
the next step is usually to determine why the groups
are different. These methods are sophisticated and
require computer programs for implementation on a
routine basis. (See Gnanadesikan [30], Everitt [31].)

Graphs Used in the Analysis of Data
The graphical methods discussed next generally
depend on assumptions of the analysis. Decisions
made from these displays may be either subjective in
nature, such as a visual assessment of an underlying
distribution, or objective, such as an out-of-control
signal from a control chart.

Distribution Assessment and Probability Plots
The probability plot is a widely used graphical
procedure for data analysis. Since other graphical
techniques discussed in this article require a basic
understanding of it, a brief discussion follows (see
Probability Plots).
A probability plot on linear rectangular coordinates is a collection of two-dimensional points specifying corresponding quantiles from two distributions.
Typically, one distribution is empirical and the other
is a hypothesized theoretical one. The primary purpose is to determine visually whether the data could have
arisen from the given theoretical distribution. If the
empirical distribution is similar to the theoretical one,
the expected shape of the plot is approximately a
straight line; conversely, large departures from linearity suggest different distributions and may indicate
how the distributions differ.
Imagine a sample of size n in which the data
y1, . . . , yn are independent observations on a random variable Y having some continuous distribution function (df) (see Probability Density Function
(PDF)). The ordered data y(1) ≤ · · · ≤ y(n) represent sample quantiles and are plotted against theoretical quantiles, xi = F−1(pi), where
F−1 denotes the inverse of F, the hypothesized df
of Y. Moreover, F(y) may involve unknown location (ν) and scale (δ) parameters (not necessarily
the mean and the standard deviation) as long as
F((y − ν)/δ) is completely specified. If Y has df F,
then pi = F(xi) = F((y(i) − ν)/δ); xi is called the
reduced y(i) variate and is a function of the unknown
parameters.
Selection of pi for use in plotting has been much
discussed. Suggested choices have been i/(n + 1),
(i − 1/2)/n (equivalently (2i − 1)/2n), and (i − 3/8)/(n + 1/4).
Kimball [32] discusses some choices of pi in the
context of probability plots.
Now F−1(pi) is not expressible in closed form for
most commonly encountered distributions and thus
provides an obstacle to easy evaluation of xi. An
equivalent procedure often employed to avoid this
difficulty is to plot y(i) against pi on probability
paper, which is rectangular graph paper with an F−1
scale for the p axis. The pi entries on the p axis
are commonly called plotting positions. Naturally, a
different type of probability paper is needed for each
family of distributions, F.
Normal probability paper, with F−1 based on the
normal distribution, is most common, although many
others have been developed (see, e.g., King [33]).
Normal probability paper is available in two versions:
arithmetic probability paper, where the data axis has a
linear scale, and logarithmic probability paper, where
the data axis has a natural logarithmic scale. The latter
version is used to check for a lognormal distribution
(see Probability Density Functions).

Figure 6 Comparison of two data sets plotted on normal probability paper. Set I can be adequately approximated by a normal distribution, whereas set II cannot

To illustrate the procedure, two data sets of size
n = 15 have been plotted on normal (arithmetic)
probability paper shown in Figure 6, using pi =
i/(n + 1). Here the horizontal axis is labeled in
percentage; 100pi has been plotted against the ith
smallest observation in each set. The plotted points of
set I appear to cluster around the straight line drawn
through them, visually supportive evidence that these
data come from a population that can be adequately
approximated by a normal distribution. The points for
set II, however, bend upward at the higher percents,
suggesting that the data come from a distribution with
a longer upper tail (i.e., larger upper quantiles) than
the normal distribution.
If set I data are viewed as sufficiently normal,
graphical estimates of the mean (µ) and the standard
deviation (σ ) are easily obtained by noting that the
50th percentile of the normal distribution corresponds
to the mean and that the difference between the
84th and 50th percentiles corresponds to one standard
deviation. The graphical estimates from the line
fitted by eye are 5.7 for µ and 7.0 − 5.7 = 1.3
for σ.
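The same computation can be done numerically: order the data, pair each point with the theoretical normal quantile at plotting position pi = i/(n + 1), and fit a line; the intercept and slope then play the roles of the 50th-percentile and 84th-minus-50th-percentile readings. A minimal sketch, using idealized normal "data" so the recovered parameters are exact:

```python
from statistics import NormalDist

def normal_plot_estimates(data):
    """Fit ordered data against N(0,1) quantiles at p_i = i/(n+1);
    the line's intercept estimates mu and its slope estimates sigma."""
    y = sorted(data)
    n = len(y)
    x = [NormalDist().inv_cdf(i / (n + 1)) for i in range(1, n + 1)]
    xbar, ybar = sum(x) / n, sum(y) / n
    slope = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / \
            sum((xi - xbar) ** 2 for xi in x)
    return ybar - slope * xbar, slope      # (mu-hat, sigma-hat)

# Perfectly normal "data": quantiles of N(6, 1) themselves, n = 15.
sample = [NormalDist(6, 1).inv_cdf(i / 16) for i in range(1, 16)]
mu, sigma = normal_plot_estimates(sample)
print(round(mu, 2), round(sigma, 2))   # 6.0 1.0
```

With real data such as set I, the estimates would fluctuate around the true µ = 6 and σ = 1, just as the eye-fitted line does.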
The conclusions from Figure 6 were expected
because the data from set I are, in fact, random
normal deviates with µ = 6 and σ = 1. The data in
set II are random lognormal deviates with µ = 0.6
and σ = 1. A plot of set II data on logarithmic
probability paper produces a more nearly linear
collection of points.
Visual inference, such as determining here whether
the collection of points forms a straight line, is
fairly easy, but should be used with theoretical
understanding to enhance its reliability. A curvilinear
pattern of points based on a large sample offers more
evidence against the hypothesized distribution than
does the same pattern based on a smaller sample.
For example, the plot of set I data exhibits some
asymmetric irregularities, which are due to random
fluctuations; however, a similar pattern of irregularity
based on a much larger sample would be much more
unlikely from a normal distribution. Daniel [34] and
Daniel and Wood [35] give excellent discussions on
the behavior of normal probability plots.
The probability plot is a special case of the
quantile–quantile (Q−Q) plot [36] (see Probability
Plots), which is a quantile–quantile comparison of
two distributions, either or both of which may be
empirical or theoretical; whereas a probability plot
is typically a display of sample data on probability paper (i.e., empirical vs theoretical). Q−Q plots
are particularly useful because a straight line will
result when comparing the distributions of X and
Y , whenever one variable can be expressed as a
linear function of the other. Q−Q plots are relatively
more discriminating in low-density or low-frequency
regions (usually the tails) of a distribution than near
high-density regions, since in low-density regions
quantiles are rapidly changing functions of p. In
the plot, this translates into comparatively larger distances between consecutive quantiles in low-density
areas. The quantiles in Figure 6 illustrate this, especially the larger empirical ones of set II.
A related plot considered by Wilk and Gnanadesikan [36] is the P–P plot. Here, for varying
xi, pi1 = F1(xi) is plotted against pi2 = F2(xi),
where Fj, j = 1, 2, denotes the df (empirical or theoretical). If F1 = F2 for all xi, the resulting plot is a
straight line with unit slope through the origin. This
plot is especially discriminating near high-density
regions, since here the probabilities are more rapidly
changing functions of xi than in low-density regions.
The P − P plot is not as widely used as the Q−Q
plot since it does not remain linear if either variable
is transformed linearly (e.g., by a location or scale
change).
For large data sets, an obvious approach for comparison of data with a probability model is a graph of
a fitted theoretical density (parameters estimated from
data), with the appropriate scale adjustment, superimposed on a histogram. Gross differences between
the ordinates of the distributions are easily detected.
A translation of the differences to a reference line
(instead of a reference curve) to facilitate visual discrimination is easily accomplished by hanging the
bars of the histogram from the density curve [37].
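The hanging-histogram comparison can be sketched in numbers: each bar's deviation is its observed relative frequency minus the fitted probability of its interval. For a lognormal fit the cell probability is Φ((ln b − m)/s) − Φ((ln a − m)/s); the parameter values m and s below are illustrative, not the actual fitted values.

```python
from math import log
from statistics import NormalDist

def hanging_deviations(edges, rel_freq, m, s):
    """Observed minus fitted relative frequency for intervals (edges[k], edges[k+1]],
    with cell probabilities from a lognormal(m, s) distribution."""
    phi = NormalDist().cdf
    fitted = [phi((log(b) - m) / s) - phi((log(a) - m) / s)
              for a, b in zip(edges, edges[1:])]
    return [obs - fit for obs, fit in zip(rel_freq, fitted)]

# Three interior cells of Table 2, with illustrative parameters m = 4.4, s = 0.6.
dev = hanging_deviations([24, 48, 72, 96], [0.123, 0.203, 0.238], 4.4, 0.6)
print([round(d, 3) for d in dev])
```

Systematic sign patterns in these deviations are what the hanging histogram turns into bars above or below the reference line.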
Figure 7 Hanging carbon monoxide histogram from fitted
lognormal distribution
Figure 7 illustrates the hanging histogram, where the
histogram for the carbon monoxide data (Figure 2) is
hung from a lognormal distribution. Slight, systematic variation about the reference line suggests that
the data are slightly more skewed to the right in the
high-density area than in the lognormal distribution.
Further improvements in detecting systematic
variation may be achieved by rootograms [37]. A
hanging rootogram is analogous to a hanging histogram except that the square roots of the ordinate
values are graphed. The suspended rootogram is an
upside-down graph of the residuals about the baseline of the hanging rootogram.
Graphical assessments for discrete distributions
can also be made by comparing the histogram of
the data to the fitted probability density, p(x). However, as with continuous distributions, curvilinear
discrimination may be difficult and linearizing procedures are helpful. A general approach is to determine
a function of p(x), say, r(x) = r(p(x)), which is linearly related to a function of x, say, s(x). Then using
sample data one calculates relative frequencies to
estimate p(x), evaluates r(x), and plots r(x) against
s(x). The absence of systematic departures from linearity offers some evidence that the data could arise
from density p(x). The slope and intercept will be
functions of the parameters and can be used to estimate the parameters graphically. A suitable r(x) may
be obtained by simply transforming p(x); for example, taking logarithms of the density of the discrete
Pareto distribution, where p(x) ∝ x^λ, gives r(x) =
log p(x) and s(x) = log x. In other cases, ratios of
consecutive probabilities (e.g., p(x + 1)/p(x)) are
linear functions of s(x) [38]. Table 4 summarizes
these ratios for three commonly encountered discrete
distributions.
Ord [39] expands on the foregoing ideas by
defining a class of discrete distributions where r(x) =
xp(x)/p(x − 1) is a linear function of x (i.e.,
s(x) = x), thereby keeping the same abscissa scale.
Distributions in this class are the binomial, negative binomial, Poisson, logarithmic, and uniform.
These graphical tests for discrete distributions may
be difficult to interpret because the sample relative
frequencies have nonhomogeneous variances. This
difficulty may be compounded when using ratios
of the relative frequencies as functions of s(x).
These procedures are therefore recommended more
as exploratory than confirmatory.
Another graphical technique for the Poisson
distribution (see Probability Density Functions) is
the Poissonness plot [40], similar in spirit to probability plotting. It can also be applied to truncated
Poisson data or any one-parameter exponential family
of discrete distributions such as the binomial.
For further insights, see Parzen [41].
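A minimal sketch of the idea behind the Poissonness plot: if n_k of N observations equal k under a Poisson(λ) model, then E(n_k) = N e^(−λ) λ^k / k!, so the "count metameter" φ(k) = ln(k! n_k / N) should be roughly linear in k with slope ln λ and intercept −λ. Idealized expected counts are used below; with data, the observed cell counts n_k would be plugged in.

```python
from math import exp, factorial, log

lam, N = 2.0, 10_000
# Idealized cell counts; real data would supply observed n_k values.
n_k = [N * exp(-lam) * lam ** k / factorial(k) for k in range(6)]

# Count metameter phi(k) = ln(k! * n_k / N); successive differences
# estimate the slope, which should equal ln(lambda).
phi = [log(factorial(k) * nk / N) for k, nk in enumerate(n_k)]
slopes = [phi[k + 1] - phi[k] for k in range(5)]
print([round(s, 6) for s in slopes])   # each ~ 0.693147 = ln 2
```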
Model Adequacy and Assumption Verification
Any statistical analysis is based on certain assumptions. Those usually associated with least-squares
regression analysis are that experimental errors are
independent and have a homogeneous variance and a
normal (Gaussian) distribution. It is standard practice
to check these assumptions as part of the analysis. These checks, most often done graphically, have
the desirable by-product of forcing the analyst to
look at the data critically. This can be effectively
accomplished by graphical analysis of both the raw
data and residuals from the fitted model. In addition
to assumption verification, this evaluation frequently
results in the discovery of unusual observations or
unsuspected relationships. Most of the plots discussed
below are applications of graphical forms previously
discussed.
If repeat observations have been obtained for
each of k groups representing different situations
or conditions being studied, a scatter plot of the
group standard deviation, si , versus the group mean,
y i , i = 1, . . . , k, will appear random and show little
correlation when the homogeneous variance assumption is satisfied. Box et al. [42] point out that if these
assumptions are not satisfied, this plot can be used
to determine a transformed measurement scale on
which the assumptions will be more nearly satisfied
(Figure 8).
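A numerical version of this spread-versus-level diagnostic: fit a line to log si versus log ȳi; if s is proportional to ȳ^α, a variance-stabilizing power transformation is y → y^(1−α), with α = 1 indicating the log transformation. The group means and standard deviations below are illustrative, not the toxic agent data.

```python
from math import log

def power_transform_exponent(means, sds):
    """Slope of log s vs log ybar gives alpha; return the power 1 - alpha."""
    xs, ys = [log(m) for m in means], [log(s) for s in sds]
    n = len(xs)
    xbar, ybar = sum(xs) / n, sum(ys) / n
    alpha = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) / \
            sum((x - xbar) ** 2 for x in xs)
    return 1 - alpha

# Standard deviation proportional to the mean (alpha = 1) -> power 0, i.e. log y.
print(round(power_transform_exponent([0.2, 0.4, 0.8], [0.02, 0.04, 0.08]), 6))
```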
The normal distribution assumption can be
checked in this situation from a histogram of the
Residuals, rij = yij − y i , between the observations
in each group (yij ) and the average of the group
(y i ). This histogram will tend to be bell-shaped if
the normal distribution and homogeneous variance
assumptions are satisfied. Alternatively, especially for
small data sets, the rij ’s may be plotted on normal
probability paper. The expected shape is a straight
line if these assumptions are appropriate.
Replicate observations often are not available.
However, the analysis assumptions and adequacy
of the form of the model can still be checked by
Table 4  Ratios of consecutive probabilities for three discrete distributions

Distribution   p(x)                                                   r(x)          Intercept        Slope              s(x)
Binomial       C(n, x) π^x (1 − π)^(n−x),  x = 0, 1, . . . , n        p(x+1)/p(x)   −π/(1 − π)       (n + 1)π/(1 − π)   1/(x + 1)
Poisson        e^(−λ) λ^x / x!,  x = 0, 1, 2, . . .                   p(x)/p(x+1)   1/λ              1/λ                x
Pascal         C(x − 1, k − 1) π^k (1 − π)^(x−k),  x = k, k + 1, ...  p(x)/p(x+1)   1/(1 − π)        −(k − 1)/(1 − π)   1/x

where C(a, b) denotes the binomial coefficient and r(x) = intercept + slope × s(x).
constructing plots of the residuals or standardized
residuals [43] from the fitted model. The residual associated with observation yi is ri = yi −
ŷi, where ŷi is the value of yi predicted by the model
fitted to the data. Four types of residual plots are
routinely constructed [44]: plots on normal probability paper, residuals (ri) versus predicted values (ŷi),
sequence plot of residuals, and residuals (ri) versus predictor variables (xj). Note that although the
residuals are not mutually independent, the effect of
the correlation structure on the utility of these plots
is negligible [45].

Figure 8 Toxic agent data [42, Table 7.11]. Linear log–log relationship suggests that a power transformation will produce homogeneous variances. The slope of the line indicates the necessary power [Wiley-Interscience]

The plot of residuals on normal probability paper
provides a check on the normal distribution assumption. Substantive deviations from linearity may be due
to a nonnormal distribution of experimental errors,
the presence of atypical (i.e., outlying) data points,
or an inadequate model (Figure 9).

Figure 9 Normal probability plot of residuals. Plot (a) shows an outlying data point and plot (b) shows a set of residuals not normally distributed

The residuals (ri) versus the predicted values (ŷi)
plot will show a random distribution of points (no
trends, shifts, or peculiar points) if all assumptions are
satisfied. Any curvilinear relationship indicates that
the model is inadequate (Figure 10). Nonhomogeneous variance is indicated if the spread in ri changes
with ŷi.

Figure 10 Residuals versus fitted values. Plot (a) shows increased residual variability with increasing fitted values and plot (b) shows a curvilinear relationship

When the spread increases linearly with ŷi,
a log transformation (i.e., replace y in the analysis
by y = log y) will often produce a response scale on
which the homogeneous variance assumption will be
satisfied (Figure 10). Plots of the residuals (ri ) versus
raw observations (yi ) are of little value because they
will always show a linear correlation with a value of
(1 − R 2 )1/2 , where R 2 is the coefficient of determination (see Coefficient of Determination (R 2 )) of the
fitted model [46].
If all analysis assumptions are satisfied, the
sequence plot of the residuals (ri ) can be expected
to show a random distribution of points and contain no trends, shifts, or atypical points. Any trends
or shifts here suggest that one or more variables
not included in the model may have changed during the collection of the data (Figure 11). This plot
may show cycles in the residuals and other dependencies, indicating that the assumption of independence of experimental errors is not appropriate. This
assumption can also be checked by constructing an
autocorrelation plot of the residuals (see earlier discussion).
The residuals (ri ) versus the predictor variables (xi ) plot should also show a random distribution of points. Any smooth patterns or trends
−
(b)
1
2
n
3
Test sequence
Figure 11 Sequence plot of residuals. Plot (a) shows an
abrupt change and plot (b) shows a gradual change due to
factors not accounted for by the model
suggest that the form of the model may not be appropriate (Figure 12).
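The raw ingredients of these residual plots can be sketched with a simple straight-line fit (the data here are hypothetical): the same (ŷi, ri) pairs serve the residuals-versus-fitted plot, and the ri in observation order serve the sequence plot.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = b0 + b1*x."""
    n = len(xs)
    xbar, ybar = sum(xs) / n, sum(ys) / n
    b1 = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) / \
         sum((x - xbar) ** 2 for x in xs)
    b0 = ybar - b1 * xbar
    return b0, b1

def residual_plot_data(xs, ys):
    """Return fitted values and residuals r_i = y_i - yhat_i
    for the standard residual scatter checks."""
    b0, b1 = fit_line(xs, ys)
    fitted = [b0 + b1 * x for x in xs]
    resid = [y - f for y, f in zip(ys, fitted)]
    return fitted, resid

xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]
fitted, resid = residual_plot_data(xs, ys)
print([round(r, 2) for r in resid])
```

A least-squares fit with an intercept forces the residuals to sum to zero, so it is their pattern against ŷi, the test sequence, and the xj, not their average, that carries the diagnostic information.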
The scatter plots of residuals discussed above are
not independent of each other. Peculiarities and trends
observed in one plot usually show up in one or more
of the others. Collectively, these plots provide a good
check on data behavior and the model construction
process.
With a large number of predictor variables it
is sometimes hard to see relationships between y
and xi in scatter plots. Larsen and McCleary [47]
developed the partial residual plot to overcome
this problem. Wood [48] and Daniel and Wood [35]
refer to these as component-plus-residual plots. A
regression model must be fitted to the data before the
plot can be constructed. In effect, the relationships
of all the other variables are removed and the plot of
the component-plus-residual versus xi displays only
the computed relationship between y and xi and the
residual variation in the data.
Figure 12 Residuals versus a predictor variable. A smooth curved pattern indicates that higher-order terms are needed
Plots for Decision Making
At various points in a statistical analysis, decisions
concerning the effects of variables and differences
among groups of data are made. Statisticians and
other scientists have developed a variety of statistical
techniques that use graphical displays to make these
decisions. In some instances (e.g., control charts,
Youden plots) these contain both the data in raw or
reduced form and a measure of their uncertainty. The
user, in effect, makes decisions from the plot rather
than by calculating a test statistic. In other situations
(e.g., half-normal plot, ridge trace) a measure of
uncertainty is not available but the analyst makes
decisions concerning the magnitude of an effect or the
appropriateness of a model by assessing deviations
from expected or desired appearance.
The control chart (see Control Charts, Overview) is widely used to control industrial production
and analytical measurement processes [49]. It is a
sequence plot of a measurement or statistic (average,
range, etc.) versus time sequence together with limits
to reflect the expected random variation in the plotted
points. For example, on a plot of sample averages,
limits of ±3 standard deviations are typically shown
about the process average to reflect the uncertainty
in the averages. The process is considered to be out
of control if a plotted average falls outside the limits.
This suggests that a process shift has occurred and
a search for an assignable cause should be made.
The cumulative sum control chart (see Cumulative
Sum (CUSUM) Chart) is another popular process
control technique, particularly useful in detecting
small process shifts [50, 51].
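A minimal sketch of the control chart limits described above (hypothetical data; in practice the center line and sigma would be estimated from historical in-control data rather than supplied as constants):

```python
import statistics

def xbar_chart_limits(subgroup_means, sigma, n):
    """Shewhart chart for subgroup averages: limits at the process
    average plus or minus 3 standard deviations of the mean,
    sigma / sqrt(n). Points outside suggest an assignable cause."""
    center = statistics.fmean(subgroup_means)
    half_width = 3 * sigma / n ** 0.5
    lo, hi = center - half_width, center + half_width
    out = [i for i, m in enumerate(subgroup_means) if not lo <= m <= hi]
    return center, (lo, hi), out

means = [10.1, 9.8, 10.0, 10.2, 9.9, 11.9]   # last subgroup has shifted
center, (lo, hi), out_of_control = xbar_chart_limits(means, sigma=0.5, n=4)
print(out_of_control)
```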
Ott [53] used the control chart concept to develop
his analysis-of-means plotting procedure for the interpretation of data that would ordinarily be analyzed
by analysis-of-variance techniques. Schilling [54, 55]
systematized Ott’s procedure and extended it past
the cross-classification designs to incomplete block
experiments and studies involving random effects.
The analysis-of-means procedure enables those familiar with control chart concepts and technology to
quickly develop an ability to analyze the results from
experimental designs.
The Youden plot [56] was developed to study the
ability of laboratories to perform a test procedure.
Samples of similar materials, A and B, are sent to a
number of laboratories participating in a collaborative
test. Each laboratory runs a predetermined number of
replicate tests on each sample for a number of different characteristics. A Youden plot is constructed
for each measured characteristic. Each point represents a different laboratory, where the average of
the replicate results on material A is plotted versus the average results on material B (Figure 13).
Differences along a 45° line reflect between-lab variation. Differences in the direction perpendicular to
the 45° line reflect within-lab and lab-by-material
interaction variation. An uncertainty ellipse can be
used to identify problem laboratories. Any point
outside this ellipse is an indication that the associated laboratory’s results are significantly different
from those of the laboratories within the ellipse.
Mandel and Lashof [52] generalized and extended
the construction and interpretation of Youden's plot.

Figure 13 Youden plot [52]: average stress at 300% elongation for material 305–405 versus material 105–205, one point per laboratory [American Society for Quality, 1974.]

Figure 14 Half-normal plot from a 2^(5−1) factorial experiment showing four important effects on a color response
Ott [57] showed how this concept can be used to
study paired measurements. For example, “before”
and “after” measurements are frequently collected to
evaluate a process change or a manufacturing stage
of an industrial process.
Daniel [34] developed the half-normal plot (see Half-Normal Plot) to interpret two-level factorial and fractional factorial experiments (see Factorial Experiments). It displays the absolute values of the n − 1 contrasts (i.e., main effects and interactions) from an n-run experiment versus the probability scale of the half-normal distribution (Figure 14).
Identification of large effects (positive or negative)
is enhanced by plotting the absolute value of the
contrast. Alternatively, to preserve the sign of the
contrast, the contrast value may be plotted on normal
probability paper [42]. With no significant effects,
either plot will appear as a straight line. Significant
effects are indicated by the associated contrast falling
off the line. Although this plot is usually assessed
visually, uncertainty limits and decision guides have
been developed by Zahn [58, 59]. The half-normal
plot is not restricted to two-level experiments and
can be used in the interpretation of any experiment
for which the effects can be described by independent
1-degree-of-freedom contrasts.
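The plotting coordinates of a half-normal plot can be sketched as follows (hypothetical contrasts; the plotting position (i − 0.5)/m is one common convention, and the helper name is illustrative):

```python
from statistics import NormalDist

def half_normal_coords(contrasts):
    """Order the absolute contrasts and pair each with the
    half-normal quantile Phi^-1((1 + p)/2), p = (i - 0.5)/m.
    Inactive effects fall near a line through the origin;
    important effects fall off that line."""
    a = sorted(abs(c) for c in contrasts)
    m = len(a)
    q = [NormalDist().inv_cdf((1 + (i - 0.5) / m) / 2) for i in range(1, m + 1)]
    return list(zip(q, a))

# Hypothetical contrasts from an 8-run experiment: one large effect
coords = half_normal_coords([0.3, -0.5, 0.2, 8.9, -0.1, 0.4, 0.6])
print(coords[-1])
```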
The half-normal plot was significant in marking
the beginning of extensive research on probability
plotting methods in the early 1960s. During this time
the statistical community became convinced of the
usefulness and effectiveness of graphical techniques;
many of these developments were discussed earlier.
Figure 15 Cp plot [60]. The six variables in the equation are denoted by a, b, c, d, e, and f. The number of terms in the equation is denoted by P [Reprinted with permission from American Statistical Association, 1966.]
Research in the 1960s and 1970s also focused
on regression analysis and the fitting of equations to
data when the predictor variables (x’s) are correlated.
Two displays, the Cp plot and the ridge trace, were
developed as graphical aids in these studies.
The Cp plot, suggested by C. L. Mallows and popularized by Gorman and Toman [60] and Daniel and Wood [35], is used to determine which variables should be included in the regression equation. It is constructed (Figure 15) by plotting, for each equation considered, Cp versus p, where Cp = RSSp/s² − (n − 2p), p is the number of terms in the equation, RSSp is the residual sum of squares for the p-term equation of interest, n is the total number of observations, and s² is the residual mean square obtained when all the variables are included in the equation. If all the important terms are in the equation, Cp ≈ p. The line Cp = p is included on the plot, and one looks for equations that have points falling near this line. Points above the line indicate that significant terms have not been included. The objective is to find the equation with the smallest number of terms for which Cp ≈ p.
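The Cp statistic itself is a one-line computation; the numbers below are hypothetical candidate equations, not from the source:

```python
def mallows_cp(rss_p, p, s2_full, n):
    """Cp = RSS_p / s^2 - (n - 2p). Equations containing all the
    important terms give Cp close to p; Cp well above p signals
    that significant terms have been left out."""
    return rss_p / s2_full - (n - 2 * p)

# Hypothetical candidates: n = 20 runs, s^2 = 1.5 from the full equation
n, s2 = 20, 1.5
print(round(mallows_cp(rss_p=25.4, p=3, s2_full=s2, n=n), 1))  # near p = 3
print(round(mallows_cp(rss_p=60.0, p=2, s2_full=s2, n=n), 1))  # well above p = 2
```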
The ridge trace, developed by Hoerl and Kennard [19] and discussed by Marquardt and Snee [61],
identifies which regression coefficients are poorly
estimated because of correlations among the predictor
variables (x’s). This is accomplished by plotting the
regression coefficients versus the bias parameter, k,
which is used to calculate the ridge regression coefficients, β̂ = (X′X + kI)⁻¹X′Y (Figure 16). Coefficients whose sign and magnitude are affected by
correlations among the predictor variables (x’s) will
change rapidly as k increases. Hoerl and Kennard recommend that an appropriate value for k is the smallest
for which the coefficients “stabilize” or change very
little as k increases. If the coefficients remain nearly
constant for k > 0, then k = 0 is suggested; this
indicates that the least-squares coefficients should
be used.
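For two standardized predictors the ridge solution can be written in closed form, and evaluating it over a grid of k values gives the coordinates of the ridge trace (the X′X and X′y values below are hypothetical, chosen to mimic highly correlated predictors):

```python
def ridge_coefficients(XtX, Xty, k):
    """Solve (X'X + kI) beta = X'y for two predictors by direct
    2x2 inversion. Tracing beta over a grid of k gives the ridge trace."""
    (a, b), (c, d) = XtX
    a, d = a + k, d + k
    det = a * d - b * c
    g1, g2 = Xty
    return ((d * g1 - b * g2) / det, (a * g2 - c * g1) / det)

# Hypothetical correlation-scaled data with corr(x1, x2) = 0.95
XtX = [[1.0, 0.95], [0.95, 1.0]]
Xty = [0.60, 0.59]
for k in (0.0, 0.1, 0.3):
    b1, b2 = ridge_coefficients(XtX, Xty, k)
    print(k, round(b1, 2), round(b2, 2))
```

The coefficients move sharply between k = 0 and k = 0.1 and then change little, illustrating the "stabilization" that Hoerl and Kennard use to pick k.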
Figure 16 Ridge trace showing instability of regression coefficients 1 and 2

Communication and Display of Results

Quantitative Graphics

Quantitative graphics encompass general methods of presenting and summarizing numerical information,
usually for easy comprehension by the layman. There
can be no complete catalog of these graphics since
their form is limited only by one’s ingenuity. Beniger
and Robyn [2] trace the historical development
since the seventeenth century of quantitative graphics and offer several pictorial examples illustrating
the improved sophistication of graphs. Schmid and
Schmid [15], in a comprehensive handbook, discuss
and illustrate many quantitative graphical forms. A
few of the more common forms follow briefly.
A bar chart is similar in appearance to a histogram
(Figure 2) but more general in application. It is
frequently used to make quantitative comparisons
of qualitative variables, as a company might do to
compare revenues among departments. Comparisons
within and between companies are easily made by
superimposing a similar graph of another comparable
company and identifying the bars.
A pictogram is similar to the bar chart except
that “bars” consist of objects related to the response
tallied. For example, figures of people might be
used in a population comparison where each figure
represents a specified number of people [62]. When
partial objects are used (e.g., the bottom half of a
person), it should be stated whether the height or
volume of the figure is proportional to frequency.
A pie chart is a useful display for comparing
attributes on a relative basis, usually a percentage.
The angle formed by each slice of the pie is an indication of that attribute’s relative worth. Governments
use this device to illustrate the number of pennies
of a taxpayer’s dollar going into each major budget
category.
A contour plot may effectively depict a relationship among three variables by one or more contours.
Each contour is a locus of values of two variables
associated with a constant value of the third. A
relief map that shows latitudinal and longitudinal
locations of constant altitude by contours or isolines is a familiar example. Contours displaying a
response surface, y = f(x1, x2), are discussed subsequently (Figure 20). A stereogram is another form for displaying the spatial relationship of three variables in two dimensions, similar to a draftsman's three-dimensional perspective drawing in which the viewing angle is not perpendicular to any of the object's surfaces (planes).
The field of computer graphics has expanded rapidly, offering many new types of graphics; see Tufte [63].
Summary of Statistical Analysis
Many of the displays previously discussed are useful
in summarizing and communicating the results of a
study. For example, histograms or dot-array diagrams
are used to display the distribution of data. Box plots
are useful (see Figure 4) for large data sets or when
several groups of data are compared.
The results of many analyses may be displayed by a means plot of the group means (ȳ) together with a measure of their uncertainty, such as the standard error (ȳ ± SE), confidence limits (ȳ ± tSE), or least significant interval limits (LSI = ȳ ± LSD/2).
The use of the LSI [65] is particularly advantageous
because the intervals provide a useful decision tool
(Figure 17). Any two averages are significantly different at the assigned probability level if and only
if their LSIs do not overlap. Intervals based on the
standard error and confidence limits for the mean
do not have this straightforward interpretation. Similar intervals can be developed for other multiple
comparison procedures, such as those developed by
Tukey (honest significant difference), Dunnett, or
Scheffé.
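The LSI computation can be sketched as follows (hypothetical means and a rounded critical value; for two means with a common standard error, the difference has standard error SE·√2, so LSD = t·SE·√2 and each interval extends LSD/2 on either side of ȳ):

```python
def least_significant_intervals(means, se, t_crit):
    """LSI = ybar +/- LSD/2 with LSD = t * se * sqrt(2) for equal-size
    groups. Two means differ significantly at the chosen level iff
    their intervals do not overlap."""
    half = t_crit * se * 2 ** 0.5 / 2
    return [(m - half, m + half) for m in means]

def overlap(i1, i2):
    return i1[0] <= i2[1] and i2[0] <= i1[1]

# Hypothetical group means, common SE = 0.8, and t_crit = 2.0
# (roughly the 0.05-level value for moderate degrees of freedom)
ivals = least_significant_intervals([10.0, 11.0, 14.5], se=0.8, t_crit=2.0)
print(overlap(ivals[0], ivals[1]), overlap(ivals[1], ivals[2]))
```

Here the first two means are not significantly different while the third differs from both, which is exactly the visual verdict the plotted intervals would give.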
Box et al. [42] use the sliding reference distribution to display and compare means graphically.
The distribution width is determined by the standard error (see Standard Error) of the means. Any two means not
within the bounds of the distribution are judged to
be significantly different. They also use this display
in the interpretation of factor effects in two-level factorial experiments.
The notched-box plot of McGill et al. [18] is the
nonparametric analog of the LSI-type plots discussed
above. The median of the sample and confidence
limits for the difference between two medians are
displayed in this plot; any two medians whose intervals do not overlap are significantly different. McGill
et al. [18] discuss other variations and applications of
the box plot.
Factorial experimental designs (see Factorial
Experiments) are widely used in science and engineering. In many instances the effects of two and
three variables are displayed on factor space/response
plots with squares and cubes similar to those shown
in Figure 18. The numbers in these figures are
the mean responses obtained at designated values of
the independent variables studied. These figures show
the region, or factor space, over which the experiments were conducted, and aid the investigator in
determining how the response changes as one moves
around in the experimental region.
An interaction plot is used to study the nature
and magnitude of the interaction of two variables
from a designed experiment by plotting the response
(averaged over all replicates) versus the level of
one of the independent variables while holding the
other independent variable(s) fixed at a given level. In Figure 19 it is seen by the nonparallel lines that the size of the effect of solution temperature on weight loss depends on whether stirring occurred. Solution temperature and stirring are said to interact; this is another way of saying that their effects are not additive. Monlezun [67] discusses ways of plotting and interpreting three-factor interactions.

Figure 17 Least significant interval plot of male and female rat weight gain versus dose of test compound (ppm) [64]. Any two means whose intervals do not overlap are significantly different at the 0.05 probability level [International Biometrics Society, 1979.]

Figure 18 Factor space/response plots showing results of 2² (brittleness of iron bars) and 2³ (metal etching process weight loss) experiments. The average responses are shown at the corners of the figures [66]

Figure 19 Metal etching process weight loss versus solution temperature, stirred and not stirred. Nonparallelism of the lines suggests an interaction between stirring and solution temperature; stirring has a larger effect at 30 °C than at 0 °C

The objective of many experiments is to develop a prediction model for the response (y) of the system as a function of experimental (predictor) variables. A typical two-predictor variable model, based on a second-order Taylor series, is

y = b0 + b1 X1 + b2 X2 + b12 X1 X2 + b11 X1² + b22 X2²    (1)

where the b's are coefficients estimated by regression analysis techniques. One of the best ways to understand all the effects described by this equation is by a contour plot, which gives the loci of X1 and X2 values associated with a fixed value of the response. By constructing contours on a rectangular X1–X2 grid for a series of fixed values of y, one obtains a global picture of the response surface (Figure 20a).
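Evaluating equation (1) over a rectangular grid is the first step in drawing such a contour plot; the coefficients below are hypothetical, chosen so the surface has an interior maximum:

```python
def predict(b, x1, x2):
    """Second-order model of equation (1):
    y = b0 + b1*x1 + b2*x2 + b12*x1*x2 + b11*x1^2 + b22*x2^2."""
    b0, b1, b2, b12, b11, b22 = b
    return b0 + b1 * x1 + b2 * x2 + b12 * x1 * x2 + b11 * x1 ** 2 + b22 * x2 ** 2

def grid(b, x1s, x2s):
    """Evaluate the surface over a rectangular grid; contour lines are
    then drawn through points of (nearly) equal response."""
    return [[predict(b, x1, x2) for x2 in x2s] for x1 in x1s]

# Hypothetical fitted coefficients with a maximum response at (1, 1)
b = (80.0, 4.0, 6.0, 0.0, -2.0, -3.0)
g = grid(b, [0, 1, 2], [0, 1, 2])
print(g[1][1])  # response at x1 = 1, x2 = 1
```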
Figure 20 (a) Contour plot of a quadratic response surface, with contour curves showing constant sensitivity, and (b) X1–X2 interaction plot of predicted sensitivity versus pH (X1) at low, medium, and high enzyme concentration (X2) [70] [American Association for Clinical Chemistry, 1979.]
The response surface of a mixture system such as a
gasoline or paint can be displayed by a contour plot
on a triangular grid [68, 69]. Another use of trilinear
coordinates is shown in Figure 22.
Predicted response plots help in interpreting
interaction terms in regression equations [71]. The
interaction between Xi and Xj is studied by fixing
the other variables in the equation at some level (e.g.,
set Xk = X k , k = i, j ) and constructing an interaction plot of the values predicted by the equation for
different values of Xi and Xj (Figure 20b). Similar
plots are also useful in interpreting response surface
models for mixture systems [72] and in the analysis
of the results of mixture screening experiments [73].
A contour plot is also the basis for a confidence
region plot, used to display a plausible region of
simultaneous hypothesized values of two or more
parameters. For example, Figure 21 [74] shows a
contour representing the joint confidence region (see
Confidence Intervals) for regression parameters β1
and β2 associated with each of four coal classes. The
figure clearly shows that these four groups do not all
have the same values of regression parameters.
Confidence region plots on trilinear coordinates
[75] can also be used to display and identify variable
dependence in two-way contingency tables where one
variable is partitioned into three categories (see also
Draper et al. [76]). A two-dimensional plot results
(Figure 22) from the constraint that the sum of the
three probabilities, π1, π2, and π3, is unity. It is also possible to display three-dimensional response surfaces or confidence regions via a series of two-dimensional slices through three-dimensional space.

Figure 21 Joint 95% β1–β2 confidence regions (coking rate coefficient b1 versus volatile content coefficient b2) for coal classes A, B, C, and D [74] [American Society for Quality, 1973.]

Figure 22 Joint 95% confidence region plots for hair color probabilities for brown, hazel, green, and blue eye colors [75] [Reprinted with permission from American Statistical Association, 1974.]
Graphical Aids
Statisticians, like chemists, physicists, and other scientists, rely on graphical devices to help them do
their job more effectively. The following graphs are
used as “tools of the trade” in offering a parsimonious
representation for complex functional relationships
among two or more variables.
Graphs of power functions (see Power) or operating characteristics (e.g., Natrella [77]) are used for
the evaluation of error probabilities of hypothesis
tests (see Hypothesis Testing) when expressed as
a function of the unknown parameter(s). Test procedures are easily compared by superimposing two
power curves on the same set of axes. Sample-size
curves are constructed from a family of operating
characteristic curves, each associated with a different
sample size. These curves are useful in planning the
size of an experiment to control the chances of wrong
decisions. Natrella [77] offers a number of these for
some common testing situations.
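The points of such a power curve are straightforward to compute; the sketch below uses the two-sided one-sample z-test under normal theory (the parameter values are illustrative, not from the source):

```python
from statistics import NormalDist

def power_two_sided_z(delta, sigma, n, alpha=0.05):
    """Power of the two-sided one-sample z-test of H0: mu = mu0
    when the true mean is mu0 + delta. The test statistic is then
    N(delta*sqrt(n)/sigma, 1), and H0 is rejected when |Z| > z."""
    nd = NormalDist()
    z = nd.inv_cdf(1 - alpha / 2)
    shift = delta * n ** 0.5 / sigma
    return nd.cdf(-z + shift) + nd.cdf(-z - shift)

# Power curve points for sigma = 1, n = 25; superimposing the curve
# for another n on the same axes compares the two sample sizes
for delta in (0.0, 0.2, 0.4, 0.6):
    print(delta, round(power_two_sided_z(delta, 1.0, 25), 3))
```

At delta = 0 the curve equals the significance level alpha, and it rises toward 1 as the true shift grows, which is the shape plotted in operating-characteristic charts such as Natrella's [77].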
Similar contour graphics using families of curves
indexed by sample size are also useful for determining confidence limits (see Confidence Intervals).
These are especially convenient when the endpoints
are not expressible in closed form, such as those for
the correlation coefficient or the success probability
in a Bernoulli sample [78].
Nomographs are graphical representations of
mathematical relationships, frequently involving
more than three variables. Unlike most graphics,
they do not offer a picture of the relationship,
but only enable the determination of the value of
(usually) any one variable from the specification
of the others. Levens [79] discusses techniques for
straightline and curved scale nomograph construction.
Other statistical nomographs have appeared in the
Journal of Quality Technology.
References

[1] Gnanadesikan, R. & Wilk, M.B. (1969). in Multivariate Analysis, P.R. Krishnaiah, ed., Academic Press, New York, Vol. 2, pp. 593–637.
[2] Beniger, J.R. & Robyn, D.L. (1978). Quantitative graphics in statistics: a brief history, American Statistician 32, 1–11.
[3] Playfair, W. (1786). The Commercial and Political Atlas, London.
[4] Fourier, J.B.J. (1821). Recherches Statistiques sur la Ville de Paris et le Département de la Seine, Vol. 1, pp. 1–70.
[5] Lalanne, L. (1845). Appendix to Cours Complet de Météorologie de L.F. Kaemtz, translated and annotated by C. Martins, Paris.
[6] Perozzo, L. (1880). Della rappresentazione graphica di una collettivita di individui nella successione del tempo, Annals of Statistics 12, 1–16.
[7] Fienberg, S.E. (1979). Graphical methods in statistics, American Statistician 33, 165–178.
[8] Lorenz, M.O. (1905). Methods for measuring concentration of wealth, Journal of the American Statistical Association 9, 209–219.
[9] Cox, D.R. (1978). Some remarks on the role of statistics in graphical methods, Applied Statistics 27, 4–9.
[10] Tukey, J.W. (1977). Exploratory Data Analysis, Addison-Wesley, Reading, Mass.
[11] Anscombe, F.J. (1973). Graphs in statistical analysis, American Statistician 27, 17–21.
[12] Mahon, B.H. (1977). Journal of the Royal Statistical Society A 140, 298–307.
[13] Craver, J.S. (1980). Graph Paper from Your Copier, H.P. Books, Tucson, Ariz.
[14] Tukey, J.W. (1972). in Statistical Papers in Honor of George W. Snedecor, T.A. Bancroft, ed., Iowa State University Press, Ames, Iowa.
[15] Schmid, C.F. & Schmid, S.E. (1979). Handbook of Graphic Presentation, 2nd Edition, John Wiley & Sons, New York.
[16] Freund, J.E. (1976). Statistics: A First Course, 2nd Edition, Prentice-Hall, Englewood Cliffs.
[17] Johnson, N.L. & Leone, F.C. (1977). Statistics and Experimental Design in Engineering and the Physical Sciences, 2nd Edition, John Wiley & Sons, New York, Vol. 1.
[18] McGill, R., Tukey, J.W. & Larsen, W.A. (1978). Variations of box plots, American Statistician 32, 12–16.
[19] Hoerl, A.E. & Kennard, R.W. (1970). Ridge regression: biased estimation for nonorthogonal problems, Technometrics 12, 55–70.
[20] Box, G.E.P. & Jenkins, G.M. (1970). Time Series Analysis: Forecasting and Control, Holden-Day, San Francisco.
[21] Anderson, E. (1960). A semigraphical method for the analysis of complex problems, Technometrics 2, 387–391.
[22] Bruntz, S.M., Cleveland, W.S., Kleiner, B. & Warner, J.L. (1974). Proceedings of the Symposium on Atmospheric Diffusion and Air Pollution, American Meteorological Society, 125–128.
[23] Gabriel, K.R. (1971). The biplot graphic display of matrices with application to principal component analysis, Biometrika 58, 453–467.
[24] Mandel, J. (1971). A new analysis of variance model for non-additive data, Technometrics 13, 1–18.
[25] Chernoff, H. (1973). Using faces to represent points in k-dimensional space graphically, Journal of the American Statistical Association 68, 361–368.
[26] Andrews, D.F. (1972). Plots of high-dimensional data, Biometrics 28, 125–136.
[27] Hartigan, J.A. (1975). Clustering Algorithms, Wiley-Interscience, New York.
[28] Green, P.E. & Carmone, F.J. (1970). Multidimensional Scaling and Related Techniques in Marketing Analysis, Allyn & Bacon, Boston.
[29] Lingoes, J.C., Roskam, E.E. & Borg, I. (1979). Geometrical Representations of Relational Data – Readings in Multidimensional Scaling, Mathesis Press, Ann Arbor.
[30] Gnanadesikan, R. (1977). Methods for Statistical Data Analysis of Multivariate Observations, John Wiley & Sons, New York.
[31] Everitt, B.S. (1978). Graphical Techniques for Multivariate Data, North-Holland, Amsterdam.
[32] Kimball, B.F. (1960). On the choice of plotting positions on probability paper, Journal of the American Statistical Association 55, 546–560.
[33] King, J.R. (1971). Probability Charts for Decision Making, Industrial Press, New York.
[34] Daniel, C. (1959). Use of half-normal plots in interpreting factorial two-level experiments, Technometrics 1, 311–341.
[35] Daniel, C. & Wood, F.S. (1980). Fitting Equations to Data, 2nd Edition, John Wiley & Sons, New York.
[36] Wilk, M.B. & Gnanadesikan, R. (1968). Probability plotting methods for the analysis of data, Biometrika 55, 1–17.
[37] Wainer, H. (1974). The suspended rootogram and other visual displays: an empirical validation, American Statistician 28, 143–145.
[38] Dubey, S.D. (1966). Graphical tests for discrete distributions, American Statistician 20, 23–24.
[39] Ord, J.K. (1967). Graphical methods for a class of discrete distributions, Journal of the Royal Statistical Society A 130, 232–238.
[40] Hoaglin, D.C. (1980). A Poissonness plot, American Statistician 34, 146–149.
[41] Parzen, E. (1979). Nonparametric statistical data modeling, Journal of the American Statistical Association 74, 105–121.
[42] Box, G.E.P., Hunter, W.G. & Hunter, J.S. (1978). Statistics for Experimenters, Wiley-Interscience, New York.
[43] Draper, N.R. & Behnken, D.W. (1972). Residuals and their variance patterns, Technometrics 14, 101–111.
[44] Draper, N.R. & Smith, H. (1981). Applied Regression Analysis, 2nd Edition, John Wiley & Sons, New York.
[45] Anscombe, F.J. & Tukey, J.W. (1963). The examination and analysis of residuals, Technometrics 5, 141–160.
[46] Jackson, J.E. & Lawton, W.H. (1967). Technometrics 9, 339–341.
[47] Larsen, W. & McCleary, S. (1972). The use of partial residual plots in regression analysis, Technometrics 14, 781–790.
[48] Wood, F.S. (1973). The use of individual effects and residuals in fitting equations to data, Technometrics 15, 677–695.
[49] Grant, E.L. & Leavenworth, R.S. (1980). Statistical Quality Control, 5th Edition, McGraw-Hill, New York.
[50] Barnard, G.A. (1959). Control charts and stochastic processes, Journal of the Royal Statistical Society B 21, 239–271.
[51] Lucas, J.M. (1976). The design and use of V-mask control schemes, Journal of Quality Technology 8, 1–12.
[52] Mandel, J. & Lashof, T.W. (1974). Interpretation and generalization of Youden's two-sample diagram, Journal of Quality Technology 6, 22–36.
[53] Ott, E.R. (1967). Industrial Quality Control 24, 101–109.
[54] Schilling, E.G. (1973). A systematic approach to the analysis of means, Part 1, Journal of Quality Technology 5, 93–108.
[55] Schilling, E.G. (1973). A systematic approach to the analysis of means, Parts 2 and 3, Journal of Quality Technology 5, 147–159.
[56] Youden, W.J. (1959). Graphical diagnosis of interlaboratory test results, Industrial Quality Control 15, 133–137.
[57] Ott, E.R. (1957). Industrial Quality Control 13, 1–4.
[58] Zahn, D.A. (1975). Modifications of and revised critical values for the half-normal plot, Technometrics 17, 189–200.
[59] Zahn, D.A. (1975). An empirical study of the half-normal plot, Technometrics 17, 201–212.
[60] Gorman, J.W. & Toman, R.J. (1966). Selection of variables for fitting equations to data, Technometrics 8, 27–51.
[61] Marquardt, D.W. & Snee, R.D. (1975). Ridge regression in practice, American Statistician 29, 3–20.
[62] Joiner, B.L. (1975). International Statistical Review 43, 339–340.
[63] Tufte, E.R. (1997). Visual Explanations: Images and Quantities, Evidence and Narrative, Graphics Press, Cheshire.
[64] Snee, R.D., Acuff, S.K. & Gibson, J.R. (1979). A useful method for the analysis of growth studies, Biometrics 35, 835–848.
[65] Andrews, H.P., Snee, R.D. & Sarner, M.H. (1980). Graphical display of means, American Statistician 34, 195–199.
[66] Bennett, C.A. & Franklin, N.L. (1954). Statistical Analysis in Chemistry and the Chemical Industry, John Wiley & Sons, New York.
[67] Monlezun, C.J. (1979). Two-dimensional plots for interpreting interactions in the three-factor analysis of variance model, American Statistician 33, 63.
[68] Cornell, J.A. (1981). Experiments with Mixtures, Wiley-Interscience, New York.
[69] Snee, R.D. (1979). Experimenting with mixtures, Chemtech 9, 702–710.
[70] Rautela, G.S., Snee, R.D. & Miller, W.K. (1979). Response-surface co-optimization of reaction conditions in clinical chemical methods, Clinical Chemistry 25, 1954–1964.
[71] Snee, R.D. (1973). Some aspects of nonorthogonal data analysis. Part I. Developing prediction equations, Journal of Quality Technology 5, 67–79.
[72] Snee, R.D. (1975). Technometrics 17, 425–430.
[73] Snee, R.D. & Marquardt, D.W. (1976). Screening concepts and designs for experiments with mixtures, Technometrics 18, 19–29.
[74] Snee, R.D. (1973). Journal of Quality Technology 5, 109–122.
[75] Snee, R.D. (1974). Graphical display of two-way contingency tables, American Statistician 28, 9–12.
[76] Draper, N.R., Hunter, W.G. & Tierney, D.E. (1969). Which product is better? Technometrics 11, 309–320.
[77] Natrella, M.G. (1963). Experimental Statistics, National Bureau of Standards Handbook 91, U.S. Government Printing Office, Washington, DC.
[78] Pearson, E.S. & Hartley, H.O. (1970). Biometrika Tables for Statisticians, Cambridge University Press, Cambridge, Vol. 1.
[79] Levens, A.S. (1959). Nomography, John Wiley & Sons, New York.

RONALD D. SNEE AND CHARLES G. PFEIFER

Article originally published in Encyclopedia of Statistical Sciences, 2nd Edition (2005, John Wiley & Sons, Inc.). Minor revisions for this publication by Jeroen de Mast.