Chapter 14: Correlation and simple linear regression
Contents

14.1 Introduction
14.2 Graphical displays
  14.2.1 Scatterplots
  14.2.2 Smoothers
14.3 Correlation
  14.3.1 Scatter-plot matrix
  14.3.2 Correlation coefficient
  14.3.3 Cautions
  14.3.4 Principles of Causation
14.4 Single-variable regression
  14.4.1 Introduction
  14.4.2 Equation for a line - getting notation straight (no pun intended)
  14.4.3 Populations and samples
  14.4.4 Assumptions
  14.4.5 Obtaining Estimates
  14.4.6 Obtaining Predictions
  14.4.7 Residual Plots
  14.4.8 Example - Yield and fertilizer
  14.4.9 Example - Mercury pollution
  14.4.10 Example - The Anscombe Data Set
  14.4.11 Transformations
  14.4.12 Example: Monitoring Dioxins - transformation
  14.4.13 Example: Weight-length relationships - transformation
  14.4.14 Power/Sample Size
  14.4.15 The perils of R²
14.5 A no-intercept model: Fulton's Condition Factor K
14.6 Frequently Asked Questions - FAQ
  14.6.1 Do I need a random sample; power analysis
The suggested citation for this chapter of notes is:
Schwarz, C. J. (2015). Correlation and simple linear regression.
In Course Notes for Beginning and Intermediate Statistics.
Available at http://www.stat.sfu.ca/~cschwarz/CourseNotes. Retrieved
2015-08-20.
14.1 Introduction
A nice book explaining how to use JMP to perform regression analysis is: Freund, R., Littell, R., and
Creighton, L. (2003) Regression using JMP. Wiley Interscience.
Much of statistics is concerned with relationships among variables and whether observed relationships are real or simply due to chance. In particular, the simplest case deals with the relationship between
two variables.
Quantifying the relationship between two variables depends upon the scale of measurement of each
of the two variables. The following table summarizes some of the important analyses that are often
performed to investigate the relationship between two variables.
Type of variables and typical analyses:

• Y is Interval or Ratio (what JMP calls Continuous):
  • X is Interval or Ratio (Continuous): scatterplots, running median/spline fits, regression, correlation
  • X is Nominal or Ordinal: side-by-side dot plots, side-by-side box plots, ANOVA or t-tests

• Y is Nominal or Ordinal:
  • X is Interval or Ratio (Continuous): logistic regression
  • X is Nominal or Ordinal: mosaic charts, contingency tables, chi-square tests
In JMP these combinations of two variables are analyzed using the Analyze->Fit Y-by-X platform, the Analyze->Correlation-of-Ys platform, or the Analyze->Fit Model platform.
When analyzing two variables, one question becomes important as it determines the type of analysis
that will be done. Is the purpose to explore the nature of the relationship, or is the purpose to use one
variable to explain variation in another variable? For example, there is a difference between examining
height and weight to see if there is a strong relationship, as opposed to using height to predict weight.
Consequently, you need to distinguish between a correlational analysis in which only the strength of
the relationship will be described, or regression where one variable will be used to predict the values of
a second variable.
The two variables are often called either a response variable or an explanatory variable. A response
variable (also known as a dependent or Y variable) measures the outcome of a study. An explanatory variable (also known as an independent or X variable) is the variable that attempts to explain the
observed outcomes.
14.2 Graphical displays

14.2.1 Scatterplots
The scatter-plot is the primary graphical tool used when exploring the relationship between two interval
or ratio scale variables. This is obtained in JMP using the Analyze->Fit Y-by-X platform – be sure that
both variables have a continuous scale.
In graphing the relationship, the response variable is usually plotted along the vertical axis (the Y axis) and the explanatory variable is plotted along the horizontal axis (the X axis). It is not always perfectly clear which is the response and which is the explanatory variable. If there is no distinction between the two variables, then it doesn't matter which variable is plotted on which axis - this usually only happens when finding the correlation between the variables is the primary purpose.
For example, look at the relationship between calories/serving and fat from the cereal dataset using
JMP. [We will create the graph in class at this point.]
What to look for in a scatter-plot
Overall pattern. What is the direction of association? A positive association occurs when above-average values of one variable tend to be associated with above-average values of the other. The plot will have an upward slope. A negative association occurs when above-average values of one variable are associated with below-average values of the other variable. The plot will have a downward slope. What happens when there is "no association" between the two variables?
Form of the relationship. Does a straight line seem to fit through the 'middle' of the points? Is the relationship linear (the points cluster around a straight line) or curvilinear (the points form a curve)?
Strength of association. Are the points clustered tightly around the curve? If the points have a lot
of scatter above and below the trend line, then the association is not very strong. On the other
hand, if the amount of scatter above and below the trend line is very small, then there is a strong
association.
Outliers Are there any points that seem to be unusual? Outliers are values that are unusually far from
the trend curve - i.e., they are further away from the trend curve than you would expect from the
usual level of scatter. There is no formal rule for detecting outliers - use common sense. [If you
set the role of a variable to be a label, and click on points in a linked graph, the label for the point
will be displayed making it easy to identify such points.]
One’s usual initial suspicion about any outlier is that it is a mistake, e.g., a transcription error.
Every effort should be made to trace the data back to its original source and correct the value if
possible. If the data value appears to be correct, then you have a bit of a quandary. Do you keep
the data point in even though it doesn’t follow the trend line, or do you drop the data point because
it appears to be anomalous? Fortunately, with computers it is relatively easy to repeat an analysis
with and without an outlier - if there is very little difference in the final outcome - don’t worry
about it.
In some cases, the outliers are the most interesting part of the data. For example, for many years
the ozone hole in the Antarctic was missed because the computers were programmed to ignore
readings that were so low that ‘they must be in error’!
Lurking variables. A lurking variable is a third variable that is related to both variables and may confound the association.
For example, the amount of chocolate consumed in Canada and the number of automobile accidents are positively related, but most people would agree that this is coincidental and each variable
is independently driven by population growth.
Sometimes the lurking variable is a 'grouping' variable of sorts. This is often examined by using a different plotting symbol to distinguish between the values of the third variable. For example, consider the following plot of the relationship between salary and years of experience for nurses. The individual lines show a positive relationship, but the overall pattern, when the data are pooled, shows a negative relationship.
It is easy in JMP to assign different plotting symbols (what JMP calls markers) to different points.
From the Row menu, use Where to select rows. Then assign those rows using the Rows->Markers
menu.
14.2.2 Smoothers
Once the scatter-plot is plotted, it is natural to try and summarize the underlying trend line. For example,
consider the following data:
There are several common methods available to fit a line through this data.
By eye The eye has remarkable power for providing a reasonable approximation to an underlying
trend, but it needs a little education. A trend curve is a good summary of a scatter-plot if the differences
between the individual data points and the underlying trend line (technically called residuals) are small.
As well, a good trend curve tries to minimize the total of the residuals. And the trend line should try and
go through the middle of most of the data.
Although the eye often gives a good fit, different people will draw slightly different trend curves.
Several automated ways to derive trend curves are in common use - bear in mind that the best ways of
estimating trend curves will try and mimic what the eye does so well.
Median or mean trace The idea is very simple. We choose a “window” width of size w, say. For
each point along the bottom (X) axis, the smoothed value is the median or average of the Y -values for
all data points with X-values lying within the “window” centered on this point. The trend curve is then
the trace of these medians or means over the entire plot. The result is not exactly smooth. Generally,
the wider the window chosen the smoother the result. However, wider windows make the smoother
react more slowly to changes in trend. Smoothing techniques are too computationally intensive to be
performed by hand. Unfortunately, JMP is unable to compute the trace of data, but splines are a very
good alternative (see below).
The mean or median trace is too unsophisticated to be a generally useful smoother. For example,
the simple averaging causes it to under-estimate the heights of peaks and over-estimate the heights of
troughs. (Can you see why this is so? Draw a picture with a peak.) However, it is a useful way of trying
to summarize a pattern in a weak relationship for a moderately large data set. In a very weak relationship
it can even help you to see the trend.
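To make the windowing idea concrete, here is a minimal sketch of a mean/median trace in Python; the function name, the window width, and the simulated data are illustrative assumptions, not part of the notes.

```python
import numpy as np

def trace_smoother(x, y, w, stat=np.mean):
    # For each x, take the mean (or median) of the y-values whose
    # x lies within a window of total width w centred on that point.
    x, y = np.asarray(x, float), np.asarray(y, float)
    order = np.argsort(x)
    x, y = x[order], y[order]
    smoothed = np.array([stat(y[np.abs(x - xi) <= w / 2]) for xi in x])
    return x, smoothed

rng = np.random.default_rng(1)
x = rng.uniform(0, 10, 100)
y = np.sin(x) + rng.normal(0, 0.3, 100)

xs, mean_trace = trace_smoother(x, y, w=1.5)                    # mean trace
xs, median_trace = trace_smoother(x, y, w=1.5, stat=np.median)  # median trace
# A wider window (say w=4) gives a smoother curve that reacts
# more slowly to changes in trend, as described above.
```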
Box plots for strips The following gives a conceptually simple method which is useful for exploring a weak relationship in a large data set. The X-axis is divided into equal-sized intervals, and separate box plots of the values of Y are found for each strip. The box plots are plotted side-by-side and the means or medians are joined. Again, we are able to see what is happening to the variability as well as the trend. There is even more detailed information available in the box plots about the shape of the Y-distribution, etc. Again, this is too tedious to do by hand. It is possible to make this plot in JMP by creating a new variable that groups the values of the X variable into classes and then using the Analyze->Fit Y-by-X platform with these groupings. This is illustrated below:
Spline methods A spline is a series of short smooth curves that are joined together to create a larger
smooth curve. The computational details are complex, but can be done in JMP. The stiffness of the
spline indicates how straight the resulting curve will be. The following shows two spline fits to the same
data with different stiffness measures:
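Since the notes fit splines in JMP, here is a rough Python equivalent using SciPy; the smoothing factor s plays the role of the stiffness knob (larger s gives a stiffer, straighter curve). The data are simulated for illustration.

```python
import numpy as np
from scipy.interpolate import UnivariateSpline

rng = np.random.default_rng(2)
x = np.sort(rng.uniform(0, 10, 50))   # spline input must be increasing in x
y = np.sin(x) + rng.normal(0, 0.3, 50)

stiff = UnivariateSpline(x, y, s=50)    # stiffer: closer to a straight line
flexible = UnivariateSpline(x, y, s=2)  # more flexible: follows local wiggles

xgrid = np.linspace(0, 10, 200)
stiff_fit, flexible_fit = stiff(xgrid), flexible(xgrid)
```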
14.3 Correlation
WARNING!: Correlation is probably the most abused concept in statistics. Many people use
the word ‘correlation’ to mean any type of association between two variables, but it has a very strict
technical meaning, i.e. the strength of an apparent linear relationship between the two interval or ratio
scaled variables.
The correlation measure does not distinguish between explanatory and response variables and it
treats the two variables symmetrically. This means that the correlation between Y and X is the same as
the correlation between X and Y.
Correlations are computed in JMP using the Analyze->Correlation of Y’s platform. If there are
several variables, then the data will be organized into a table. Each cell in the table shows the correlation of the two corresponding variables. Because of symmetry (the correlation between variable1 and
variable2 is the same as between variable2 and variable1 ), only part of the complete matrix will be
shown. As well, the correlation between any variable and itself is always 1.
14.3.1 Scatter-plot matrix
To illustrate the ideas of correlation, look at the FITNESS dataset in the DATAMORE directory of JMP.
This is a dataset on 31 people at a fitness centre and the following variables were measured on each
subject:
• name
• gender
• age
• weight
• oxygen consumption (high values are typically more fit people)
• time to run one mile (1.6 km)
• average pulse rate during the run
• the resting pulse rate
• maximum pulse rate during the run.
We are interested in examining the relationship among the variables. At the moment, ignore the fact
that the data contains both genders. [It would be interesting to assign different plotting symbols to the
two genders to see if gender is a lurking variable.]
One of the first things to do is to create a scatter-plot matrix of all the variables. Use the Analyze->Correlation of Ys platform to get the following scatter-plot:
Interpreting the scatter plot matrix
The entries in the matrix are scatter-plots for all the pairs of variables. For example, the entry in row
1 column 3 represents the scatter-plot between age and oxygen consumption with age along the vertical
axis and oxygen consumption along the horizontal axis, while the entry in row 3 column 1 has age along
the horizontal axis and oxygen consumption along the vertical axis.
There is clearly a difference in the ’strength’ of relationships. Compare the scatter plot for average
running pulse rate and maximum pulse rate (row 5, column 7) to that of running pulse rate and resting
pulse rate (row 5 column 6) to that of running pulse rate and weight (row 5 column 2).
Similarly, there is a difference in the direction of association. Compare the scatter plot for the average
running pulse rate and maximum pulse rate (row 5 column 7) and that for oxygen consumption and
running time (row 3, column 4).
14.3.2 Correlation coefficient
It is possible to quantify the strength of association between two variables. As with all statistics, the way
the data are collected influences the meaning of the statistics.
The population correlation coefficient between two variables is denoted by the Greek letter rho (ρ) and is computed as:

$$\rho = \frac{1}{N} \sum_{i=1}^{N} \left( \frac{X_i - \mu_X}{\sigma_X} \right) \left( \frac{Y_i - \mu_Y}{\sigma_Y} \right)$$

The corresponding sample correlation coefficient, denoted r, has a similar form:[1]

$$r = \frac{1}{n-1} \sum_{i=1}^{n} \left( \frac{X_i - \bar{X}}{s_X} \right) \left( \frac{Y_i - \bar{Y}}{s_Y} \right)$$

[1] Note that this formula SHOULD NOT be used for the actual computation of r; it is numerically unstable and there are better computing formulae available.
If the sampling scheme is a simple random sample from the corresponding population, then r is an estimate of ρ. This is a crucial assumption. If the sampling is not a simple random sample, the above definition of the sample correlation coefficient should not be used! It is possible to find a confidence interval for ρ and to perform statistical tests that ρ is zero. However, for the most part, these are rarely done in ecological research and so will not be pursued further in this course.
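As a quick check on the definition of r, the sample version can be computed directly and compared with NumPy's built-in (which uses a more numerically stable computation, as footnote [1] suggests). The data values are made up for illustration.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

n = len(x)
# Direct transcription of the definition of r above (ddof=1 gives s_x, s_y)
r_formula = np.sum((x - x.mean()) / x.std(ddof=1)
                   * (y - y.mean()) / y.std(ddof=1)) / (n - 1)
r_builtin = np.corrcoef(x, y)[0, 1]
print(r_formula, r_builtin)  # the two values agree
```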
The form of the formula does provide some insight into interpreting its value.

• ρ and r (unlike other population parameters) are unitless measures.

• the sign of ρ and r is largely determined by the pairing of each (X, Y) value with the respective means: if both X and Y are above their means, or both are below, the pair contributes a positive value towards ρ or r, while if one is above and the other is below its mean, the pair contributes a negative value.

• ρ and r range from -1 to 1. A value of ρ or r equal to -1 implies a perfect negative correlation; a value of ρ or r equal to 1 implies a perfect positive correlation; a value of ρ or r equal to 0 implies no correlation. A perfect correlation (i.e. ρ or r equal to 1 or -1) implies that all points lie exactly on a straight line, but the slope of the line has NO effect on the correlation coefficient. This latter point is IMPORTANT and is often wrongly interpreted - give some examples.

• ρ and r are unaffected by linear transformations of the individual variables, e.g. unit changes such as converting from imperial to metric units.

• ρ and r only measure the linear association; they are not affected by the slope of the line, but only by the scatter about the line.
Because correlation assumes both variables have an interval or ratio scale, it makes no sense to compute the correlation

• between gender and oxygen consumption (gender is nominal scale data);

• between variables whose relationship is non-linear (not shown on graph);
• for data collected without a known probability scheme. If a sampling scheme other than simple random sampling is used, it is possible to modify the estimation formula; if a non-probability sampling scheme was used, the patient is dead on arrival, and no amount of statistical wizardry will revive the corpse.

The data collection scheme for the fitness data set is unknown - we will have to assume that some sort of random sample from the relevant population was taken before we can make much sense of the number computed.
Before looking at the details of its computation, look at the sample correlation coefficients for each
scatter plot above. These can be arranged into a matrix:
Variable    Age  Weight    Oxy  Runtime  RunPulse  RstPulse  MaxPulse
Age        1.00   -0.24  -0.31     0.19     -0.31     -0.15     -0.41
Weight    -0.24    1.00  -0.16     0.14      0.18      0.04      0.24
Oxy       -0.31   -0.16   1.00    -0.86     -0.39     -0.39     -0.23
Runtime    0.19    0.14  -0.86     1.00      0.31      0.45      0.22
RunPulse  -0.31    0.18  -0.39     0.31      1.00      0.35      0.92
RstPulse  -0.15    0.04  -0.39     0.45      0.35      1.00      0.30
MaxPulse  -0.41    0.24  -0.23     0.22      0.92      0.30      1.00
Notice that the sample correlation between any two variables is the same regardless of ordering of
the variables – this explains the symmetry in the matrix between the above and below diagonal elements.
As well each variable has a perfect sample correlation with itself – this explains the value of 1 along the
main diagonal.
Compare the sample correlations between the average running pulse rate and the other variables and
compare them to the corresponding scatter-plot above.
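Outside of JMP, the same kind of correlation matrix can be produced with pandas; the tiny data frame below is a hypothetical stand-in, since the actual fitness values live in the JMP table.

```python
import pandas as pd

# Hypothetical stand-in for a few columns of the fitness table
fitness = pd.DataFrame({
    "Age":      [44, 40, 38, 47, 43],
    "RunPulse": [178, 166, 170, 176, 168],
    "MaxPulse": [182, 172, 186, 180, 174],
})
corr_matrix = fitness.corr()   # symmetric, with 1.00 on the main diagonal
print(corr_matrix.round(2))
```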
14.3.3 Cautions
• Random Sampling Required Sample correlation coefficients are only valid under simple random samples. If the data were collected in a haphazard fashion or if certain data points were
oversampled, then the correlation coefficient may be severely biased.
• There are examples of high correlation but no practical use and low correlation but great practical
use. These will be presented in class. This illustrates why I almost never talk about correlation.
• correlation measures the 'strength' of a linear relationship; a curvilinear relationship may have a correlation of 0, but there can still be a strong (non-linear) relationship.
• the effect of outliers and high leverage points will be presented in class
• effects of lurking variables. For example, suppose there is a positive association between wages
of male nurses and years of experience; between female nurses and years of experience; but males
are generally paid more than females. There is a positive correlation within each group, but an
overall negative correlation when the data are pooled together.
• ecological fallacy - the problem of correlation applied to averages. Even if there is a high correlation between two variables on their averages, it does not imply that there is a correlation between
individual data values.
For example, if you look at the average consumption of alcohol and the consumption of cigarettes,
there is a high correlation among the averages when the 12 values from the provinces and territories
are plotted on a graph. However, the individual relationships within provinces can be reversed or
non-existent as shown below:
The relationship between cigarette consumption and alcohol consumption shows no relationship
for each province, yet there is a strong correlation among the per-capita averages. This is an
example of the ecological fallacy.
• correlation does not imply causation. This is the most frequent mistake made by people. There is a set of principles of causal inference that need to be satisfied in order to imply cause and effect.
14.3.4 Principles of Causation
Types of association
An association may be found between two variables for several reasons (show causal modeling figures):
• There may be direct causation, e.g. smoking causes lung cancer.
• There may be a common cause, e.g. ice cream sales and number of drownings both increase with
temperature.
• There may be a confounding factor, e.g. highway fatalities decreased when the speed limits were
reduced to 55 mph at the same time that the oil crisis caused supplies to be reduced and people
drove fewer miles.
• There may be a coincidence, e.g., the population of Canada has increased at the same time as the
moon has gotten closer by a few miles.
Establishing cause-and-effect

How do we establish a cause-and-effect relationship? Bradford Hill (Hill, A. B. (1971). Principles of Medical Statistics, 9th ed. New York: Oxford University Press) outlined 7 criteria that have been adopted by many epidemiological researchers. It is generally agreed that most or all of the following must be considered before causation can be declared.
Strength of the association. The stronger an observed association appears over a series of different
studies, the less likely this association is spurious because of bias.
Dose-response effect. The value of the response variable changes in a meaningful way with the dose
(or level) of the suspected causal agent.
Lack of temporal ambiguity. The hypothesized cause precedes the occurrence of the effect. The ability
to establish this time pattern will depend upon the study design used.
Consistency of the findings. Most, or all, studies concerned with a given causal hypothesis produce
similar findings. Of course, studies dealing with a given question may all have serious bias problems that can diminish the importance of observed associations.
Biological or theoretical plausibility. The hypothesized causal relationship is consistent with current
biological or theoretical knowledge. Note, that the current state of knowledge may be insufficient
to explain certain findings.
Coherence of the evidence. The findings do not seriously conflict with accepted facts about the outcome variable being studied.
Specificity of the association. The observed effect is associated with only the suspected cause (or few
other causes that can be ruled out).
IMPORTANT: NO CAUSATION WITHOUT MANIPULATION!
Examples:
Discuss the above in relation to:
• amount of studying vs. grades in a course.
• amount of clear cutting and sediments in water.
• fossil fuel burning and the greenhouse effect.
14.4 Single-variable regression

14.4.1 Introduction
Along with the Analysis of Variance, this is likely the most commonly used statistical methodology in
ecological research. In virtually every issue of an ecological journal, you will find papers that use a
regression analysis.
There are HUNDREDS of books written on regression analysis. Some of the better ones (IMHO)
are:
Draper and Smith. Applied Regression Analysis. Wiley.
Neter, Wasserman, and Kutner. Applied Linear Statistical Models. Irwin.
Kleinbaum, Kupper, Miller. Applied Regression Analysis. Duxbury.
Zar. Biostatistics. Prentice Hall.
Consequently, this set of notes is VERY brief and makes no pretense to be a thorough review of
regression analysis. Please consult the above references for all the gory details.
It turns out that both Analysis of Variance and Regression are special cases of a more general statistical methodology called General Linear Models which in turn are special cases of Generalized Linear
Models (covered in Stat 402/602), which in turn are special cases of Generalized Additive Models, which
in turn are special cases of .....
The key difference between a regression analysis and an ANOVA is that the X variable is nominal scaled in ANOVA, while in regression analysis the X variable is continuous scaled. This implies that in ANOVA the shape of the response profile is unspecified (the null hypothesis is that all means are equal while the alternate is that at least one mean differs), while in regression the response profile must be a straight line.
Because both ANOVA and regression are from the same class of statistical models, many of the
assumptions are similar, the fitting methods are similar, hypotheses testing and inference are similar as
well.
14.4.2 Equation for a line - getting notation straight (no pun intended)
In order to use regression analysis effectively, it is important that you understand the concepts of slopes
and intercepts and how to determine these from data values.
This will be QUICKLY reviewed here in class.
In previous courses at high school or in linear algebra, the equation of a straight line was often written y = mx + b where m is the slope and b is the intercept. In some popular spreadsheet programs, the authors decided to write the equation of a line as y = a + bx; now a is the intercept, and b is the slope. Statisticians, for good reasons, have rationalized this notation and usually write the equation of a line as y = β0 + β1 x or as Y = b0 + b1 X (the distinction between β0 and b0 will be made clearer in a few minutes). The use of the subscript 0 to represent the intercept and the subscript 1 to represent the coefficient for the X variable then readily extends to more complex cases.
Review definition of intercept as the value of Y when X=0, and slope as the change in Y per unit
change in X.
14.4.3 Populations and samples
All of statistics is about detecting signals in the face of noise and in estimating population parameters
from samples. Regression is no different.
First consider the population. As in previous chapters, the correct definition of the population is important as part of any study. Conceptually, we can think of the large set of all units of interest. On each unit, there is conceptually both an X and a Y variable present. We wish to summarize the relationship between Y and X, and furthermore wish to make predictions of the Y value for future X values that may be observed from this population. [This is analogous to having different treatment groups corresponding to different values of X in ANOVA.]
If this were physics, we might conceive of a physical law between X and Y, e.g. F = ma or PV = nRT. However, in ecology, the relationship between Y and X is much more tenuous. If you could draw a scatter-plot of Y against X for ALL elements of the population, the points would NOT fall exactly on a straight line. Rather, the value of Y would fluctuate above or below a straight line at any given X value. [This is analogous to saying that Y varies randomly around the treatment group mean in ANOVA.]
We denote this relationship as

$$Y = \beta_0 + \beta_1 X + \epsilon$$

where β0 and β1 are the POPULATION intercept and slope respectively. We say that

$$E[Y] = \beta_0 + \beta_1 X$$

is the expected or average value of Y at X. [In ANOVA, we let each treatment group have its own mean; here in regression we assume that the means must fit on a straight line.]

The term ε represents random variation of individual units in the population above and below the expected value. It is assumed to have constant standard deviation over the entire regression line (i.e. the spread of data points in the population is constant over the entire regression line). [This is analogous to the assumption of equal treatment population standard deviations in ANOVA.]
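One way to internalize this population model is to simulate from it; all parameter values below are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)
beta0, beta1, sigma = 10.0, 2.0, 1.5   # population intercept, slope, error sd

x = rng.uniform(0, 10, 200)
eps = rng.normal(0, sigma, 200)        # constant spread at every X
y = beta0 + beta1 * x + eps            # individual values scatter around E[Y]
# E[Y] = beta0 + beta1 * x is the straight line of means.
```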
Of course, we can never measure all units of the population. So a sample must be taken in order to
estimate the population slope, population intercept, and population standard deviation. Unlike a correlation analysis, it is NOT necessary to select a simple random sample from the entire population and more
elaborate schemes can be used. The bare minimum that must be achieved is that for any individual X
value found in the sample, the units in the population that share this X value, must have been selected at
random.
This is quite a relaxed assumption! For example, it allows us to deliberately choose values of X from the extremes and then, only at those X values, randomly select from the relevant subset of the population, rather than having to select at random from the population as a whole. [This is analogous to the assumptions made in an analytical survey, where we assumed that even though we can't randomly assign a treatment to a unit (e.g. we can't assign sex to an animal) we must ensure that animals are randomly selected from each group.]
Once the data points are selected, the estimation process can proceed, but not before assessing the
assumptions!
14.4.4 Assumptions
The assumptions for a regression analysis are very similar to those found in ANOVA.
Linearity
Regression analysis assumes that the relationship between Y and X is linear. Make a scatter-plot between Y and X to assess this assumption. Perhaps a transformation is required (e.g. log(Y) vs. log(X)). Some caution is required with transformations in dealing with the error structure, as you will see in later examples.

Second, plot residuals vs. the X values. If the scatter is not random around 0 but shows some pattern (e.g. a quadratic curve), this usually indicates that the relationship between Y and X is not linear. Or, fit a model that includes X and X² and test if the coefficient associated with X² is zero. Unfortunately, this test could fail to detect a higher order relationship. Third, if there are multiple readings at some X-values, then a test of goodness-of-fit can be performed where the variation of the responses at the same X value is compared to the variation around the regression line.
Correct scale of predictor and response
The response and predictor variables must both have interval or ratio scale. In particular, using a numerical value to represent a category and then using this numerical value in a regression is not valid. For
example, suppose that you code hair color as (1 = red, 2 = brown, and 3 = black). Then using these
values in a regression either as predictor variable or as a response variable is not sensible.
Correct sampling scheme
The Y must be a random sample from the population of Y values for every X value in the sample.
Fortunately, it is not necessary to have a completely random sample from the population as the regression
line is valid even if the X values are deliberately chosen. However, for a given X, the values from the
population must be a simple random sample.
No outliers or influential points
All the points must belong to the relationship – there should be no unusual points. The scatter-plot of Y
vs. X should be examined. If in doubt, fit the model with the points in and out of the fit and see if this
makes a difference in the fit.
Outliers can have a dramatic effect on the fitted line. For example, in the following graph, the single point is both an outlier and an influential point:
Equal variation along the line
The variability about the regression line is similar for all values of X, i.e. the scatter of the points above
and below the fitted line should be roughly constant over the entire line. This is assessed by looking at
the plots of the residuals against X to see if the scatter is roughly uniformly scattered around zero with
no increase and no decrease in spread over the entire line.
Independence

Each value of Y is independent of any other value of Y. The most common case where this fails is time series data where X is a time measurement. In these cases, time series analysis should be used.

This assumption can be assessed by again looking at residual plots against time or other variables.
Normality of errors

The difference between the value of Y and the expected value of Y is assumed to be normally distributed. This is one of the most misunderstood assumptions. Many people erroneously assume that the distribution of Y over all X values must be normally distributed, i.e. they look simply at the distribution of the Y's ignoring the X's. The assumption only states that the residuals, the differences between the values of Y and the corresponding points on the line, must be normally distributed.

This can be assessed by looking at normal probability plots of the residuals. As in ANOVA, for small sample sizes you have little power to detect non-normality, and for large sample sizes it is not that important.
X measured without error

This is a new assumption for regression as compared to ANOVA. In ANOVA, the group membership was always "exact", i.e. the treatment applied to an experimental unit was known without ambiguity. However, in regression, it can turn out that the X value may not be known exactly.
This general problem is called the “error in variables” problem and has a long history in statistics.
It turns out that there are two important cases. If the value reported for X is a nominal value and the
actual value of X varies randomly around this nominal value, then there is no bias in the estimates. This
is called the Berkson case after Berkson who first examined this situation. The most common cases are
where the recorded X is a target value (e.g. temperature as set by a thermostat) while the actual X that
occurs would vary randomly around this target value.
However, if the value used for X is an actual measurement of the underlying X, then there is uncertainty in both the X and Y directions. In this case, estimates of the slope are attenuated towards zero (i.e. positive slopes are biased downwards, negative slopes biased upwards). More alarmingly, the estimates are no longer consistent, i.e. as the sample size increases, the estimates no longer tend to the population values! For example, suppose that yield of a crop is related to amount of rainfall. A rain gauge may not be located exactly at the plot where the crop is grown, but may instead be at a nearby weather station a fair distance away. The reading at the weather station is NOT a true reflection of the rainfall at the test plot.
This latter case of “error in variables” is very difficult to analyze properly and there are not universally
accepted solutions. Refer to the reference books listed at the start of this chapter for more details.
The problem is set up as follows. Let

$$Y_i = \eta_i + \epsilon_i$$
$$X_i = \xi_i + \delta_i$$

with the straight-line relationship between the population (but unobserved) values:

$$\eta_i = \beta_0 + \beta_1 \xi_i$$

Note that the (population, but unknown) regression equation uses ξi rather than the observed (with error) values Xi.

Now if the regression is done on the observed X (i.e. the error-prone measurement), the regression equation reduces to:

$$Y_i = \beta_0 + \beta_1 X_i + (\epsilon_i - \beta_1 \delta_i)$$

Now this violates the independence assumption of ordinary least squares because the new "error" term is not independent of the Xi variable.

If an ordinary least squares model is fit, the estimated slope is biased (Draper and Smith, 1998, p. 90) with

$$E[\hat{\beta}_1] = \beta_1 - \frac{\beta_1 r(\rho + r)}{1 + 2\rho r + r^2}$$

where ρ is the correlation between ξ and δ, and r is the ratio of the variance of the error in X to the variance of the error in Y.

The bias is negative, i.e. the estimated slope is too small, in most practical cases (ρ + r > 0). This is known as attenuation of the estimate, and in general, pulls the estimate towards zero.
The bias will be small in the following cases:

• the error variance of X is small relative to the error variance in Y. This means that r is small (i.e. close to zero), and so the bias is also small. In the case where X is measured without error, then r = 0 and the bias vanishes as expected.

• if the X are fixed (the Berkson case) and actually used[2], then ρ + r = 0 and the bias also vanishes.

[2] For example, a thermostat measures (with error) the actual temperature of a room. But if the experiment is based on the thermostat readings rather than the (true) unknown temperature, this corresponds to the Berkson case.
The proper analysis of the error-in-variables case is quite complex – see Draper and Smith (1998, p.
91) for more details.
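A small simulation illustrates the attenuation effect in the classical (non-Berkson) error-in-variables case; all numbers are invented, and the factor of about 0.5 follows from var(ξ)/(var(ξ)+var(δ)) with both variances equal to 1.

```python
import numpy as np

rng = np.random.default_rng(4)
beta0, beta1 = 0.0, 1.0
xi = rng.normal(0, 1, 100_000)                       # true (unobserved) X
y = beta0 + beta1 * xi + rng.normal(0, 0.5, xi.size)
x_obs = xi + rng.normal(0, 1, xi.size)               # measured X = xi + delta

slope_true = np.polyfit(xi, y, 1)[0]    # close to 1.0
slope_obs = np.polyfit(x_obs, y, 1)[0]  # close to 0.5: attenuated towards zero
print(slope_true, slope_obs)
```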
14.4.5 Obtaining Estimates
To distinguish between population parameters and sample estimates, we denote the sample intercept by b0 and the sample slope by b1. The equation of the line fitted to a particular sample of points is expressed as $\hat{Y}_i = b_0 + b_1 X_i$ where b0 is the estimated intercept and b1 is the estimated slope. The symbol $\hat{Y}$ indicates that we are referring to the estimated line and not to a line in the entire population.
How is the best fitting line found when the points are scattered? We typically use the principle of
least squares. The least-squares line is the line that makes the sum of the squares of the deviations of
the data points from the line in the vertical direction as small as possible.
Mathematically, the least-squares line is the line that minimizes $\frac{1}{n}\sum_{i=1}^{n} \left( Y_i - \hat{Y}_i \right)^2$, where $\hat{Y}_i$ is the point on the line corresponding to each X value. This is also known as the predicted value of Y for a given value of X. This formal definition of least squares is not that important - the concept as expressed in the previous paragraph is more important - in particular, it is the SQUARED deviation in the VERTICAL direction that is used.
It is possible to write out a formula for the estimated intercept and slope, but who cares - let the
computer do the dirty work.
The estimated intercept (b0 ) is the estimated value of Y when X = 0. In some cases, it is meaningless
to talk about values of Y when X = 0 because X = 0 is nonsensical. For example, in a plot of income vs.
year, it seems kind of silly to investigate income in year 0. In these cases, there is no clear interpretation
of the intercept, and it merely serves as a placeholder for the line.
The estimated slope (b1 ) is the estimated change in Y per unit change in X. For every unit change
in the horizontal direction, the fitted line increased by b1 units. If b1 is negative, the fitted line points
downwards, and the increase in the line is negative, i.e., actually a decrease.
As with all estimates, a measure of precision can be obtained. As before, this is the standard error
of each of the estimates. Again, there are computational formulae, but in this age of computers, these
are not important. As before, approximate 95% confidence intervals for the corresponding population
parameters are found as estimate ± 2 × se.
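For readers working outside JMP, here is a sketch of the same quantities with statsmodels on simulated data; conf_int() gives the "exact" t-based intervals rather than the estimate ± 2 se approximation.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(5)
x = rng.uniform(0, 10, 30)
y = 3.0 + 1.2 * x + rng.normal(0, 1, 30)

X = sm.add_constant(x)            # adds the intercept column
fit = sm.OLS(y, X).fit()
print(fit.params)                 # b0 and b1
print(fit.bse)                    # standard errors of b0 and b1
print(fit.conf_int(alpha=0.05))   # 95% confidence intervals
```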
Formal tests of hypotheses can also be done. Usually, these are only done on the slope parameter as this is typically of most interest. The null hypothesis is that the population slope is 0, i.e. there is no relationship between Y and X (can you draw a scatter-plot showing such a relationship?). More formally
the null hypothesis is:

$$H: \beta_1 = 0$$

Again notice that the null hypothesis is ALWAYS in terms of a population parameter and not in terms of a sample statistic.

The alternate hypothesis is typically chosen as:

$$A: \beta_1 \neq 0$$

although one-sided tests looking for either a positive or negative slope are possible.

The test statistic is found as

$$T = \frac{b_1 - 0}{se(b_1)}$$
and is compared to a t-distribution with appropriate degrees of freedom to obtain the p-value. This is usually done automatically by most computer packages. The p-value is interpreted in exactly the same way as in ANOVA, i.e. it measures the probability of observing this data if the hypothesis of no relationship were true.
As before, the p-value does not tell the whole story, i.e. statistical vs. biological (non)significance
must be determined and assessed.
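The slope test can be replicated by hand from the estimate and its standard error; the numbers below are placeholders for whatever a fitted model reports.

```python
from scipy import stats

b1, se_b1, n = 1.2, 0.1, 30      # placeholder values for illustration
T = (b1 - 0) / se_b1             # estimate vs. hypothesized value
p_two_sided = 2 * stats.t.sf(abs(T), df=n - 2)
print(T, p_two_sided)
```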
14.4.6 Obtaining Predictions
Once the best fitting line is found it can be used to make predictions for new values of X.
There are two types of predictions that are commonly made. It is important to distinguish between
them as these two intervals are the source of much confusion in regression problems.
First, the experimenter may be interested in predicting a SINGLE future individual value for a particular X. Second, the experimenter may be interested in predicting the AVERAGE of ALL future responses at a particular X.[3] The prediction interval for an individual response is sometimes called a confidence interval for an individual response, but this is an unfortunate (and incorrect) use of the term confidence interval. Strictly speaking, confidence intervals are computed for fixed unknown parameter values; prediction intervals are computed for future random variables.

[3] There is actually a third interval, the mean of the next "m" individual values, but this is rarely encountered in practice.
Both of the above intervals should be distinguished from the confidence interval for the slope.
In both cases, the estimate is found in the same manner – substitute the new value of X into the equation and compute the predicted value Yb . In most computer packages this is accomplished by inserting a
new “dummy” observation in the dataset with the value of Y missing, but the value of X present. The
missing Y value prevents this new observation from being used in the fitting process, but the X value
allows the package to compute an estimate for this observation.
What differs between the two predictions are the estimates of uncertainty.
In the first case, there are two sources of uncertainty involved in the prediction. First, there is the
uncertainty caused by the fact that this estimated line is based upon a sample. Then there is the additional
uncertainty that the value could be above or below the predicted line. This interval is often called a
prediction interval at a new X.
In the second case, only the uncertainty caused by estimating the line based on a sample is relevant.
This interval is often called a confidence interval for the mean at a new X.
The prediction interval for an individual response is typically MUCH wider than the confidence
interval for the mean of all future responses because it must account for the uncertainty from the fitted
line plus individual variation around the fitted line.
Many textbooks have the formulae for the se for the two types of predictions, but again, there is
little to be gained by examining them. What is important is that you read the documentation carefully to
ensure that you understand exactly what interval is being given to you.
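As one concrete (non-JMP) illustration, statsmodels reports both intervals from a single call; the data and the new X value here are simulated. Note that the individual ("obs") interval is wider than the interval for the mean.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(6)
x = rng.uniform(0, 10, 30)
y = 3.0 + 1.2 * x + rng.normal(0, 1, 30)
fit = sm.OLS(y, sm.add_constant(x)).fit()

# has_constant="add" forces the intercept column for a single new row
x_new = sm.add_constant(np.array([7.0]), has_constant="add")
pred = fit.get_prediction(x_new).summary_frame(alpha=0.05)
print(pred[["mean", "mean_ci_lower", "mean_ci_upper"]])  # CI for the mean
print(pred[["obs_ci_lower", "obs_ci_upper"]])            # PI for an individual
```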
14.4.7 Residual Plots
After the curve is fit, it is important to examine if the fitted curve is reasonable. This is done using residuals. The residual for a point is the difference between the observed value and the predicted value, i.e., the residual from fitting a straight line is found as: $\text{residual}_i = Y_i - (b_0 + b_1 X_i) = Y_i - \hat{Y}_i$.
There are several standard residual plots:
• plot of residuals vs. predicted ($\hat{Y}$);
• plot of residuals vs. X;
• plot of residuals vs. time ordering.
In all cases, the residual plots should show random scatter around zero with no obvious pattern.
Don’t plot residual vs. Y - this will lead to odd looking plots which are an artifact of the plot and don’t
mean anything.
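A sketch of the three standard residual plots with matplotlib, using a simulated fit as in the earlier sketches:

```python
import matplotlib.pyplot as plt
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(7)
x = np.sort(rng.uniform(0, 10, 30))   # sorted so index stands in for time order
y = 3.0 + 1.2 * x + rng.normal(0, 1, 30)
fit = sm.OLS(y, sm.add_constant(x)).fit()

fig, axes = plt.subplots(1, 3, figsize=(12, 3))
axes[0].scatter(fit.fittedvalues, fit.resid)   # residuals vs. predicted
axes[1].scatter(x, fit.resid)                  # residuals vs. X
axes[2].plot(fit.resid, marker="o")            # residuals vs. (time) ordering
for ax in axes:
    ax.axhline(0, linestyle="--")              # look for random scatter about 0
plt.show()
```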
14.4.8 Example - Yield and fertilizer
We wish to investigate the relationship between yield (Liters) and fertilizer (kg/ha) for tomato plants. An
experiment was conducted in the Schwarz household one summer on 11 plots of land where the amount
of fertilizer was varied and the yield measured at the end of the season.
The amount of fertilizer (randomly) applied to each plot was chosen between 5 and 18 kg/ha. While
the levels were not systematically chosen (e.g. they were not evenly spaced between the highest and
lowest values), they represent commonly used amounts based on a preliminary survey of producers. At
the end of the experiment, the yields were measured and the following data were obtained.
Interest also lies in predicting the yield when 16 kg/ha are assigned.
Fertilizer (kg/ha)   Yield (Liters)
        12                 24
         5                 18
        15                 31
        17                 33
        14                 30
         6                 20
        11                 25
        13                 27
        15                 31
         8                 21
        18                 29
The raw data is also available in a JMP datasheet called fertilizer.jmp available from the Sample Program Library at http://www.stat.sfu.ca/~cschwarz/Stat-650/Notes/MyPrograms.
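For readers without JMP, the same fit can be reproduced in Python from the table above; because the data are copied verbatim, the estimates should match the values reported below (intercept 12.856, slope 1.10137).

```python
import numpy as np
import statsmodels.api as sm

fertilizer = np.array([12, 5, 15, 17, 14, 6, 11, 13, 15, 8, 18], dtype=float)
yield_l    = np.array([24, 18, 31, 33, 30, 20, 25, 27, 31, 21, 29], dtype=float)

fit = sm.OLS(yield_l, sm.add_constant(fertilizer)).fit()
print(fit.params)   # approximately [12.856, 1.10137]
```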
In this study, it is quite clear that the fertilizer is the predictor (X) variable, while the response variable (Y) is the yield.
The population consists of all possible field plots with all possible tomato plants of this type grown
under all possible fertilizer levels between about 5 and 18 kg/ha.
If all of the population could be measured (which it can't), you could find a relationship between the yield and the amount of fertilizer applied. This relationship would have the form: Y = β0 + β1 × (amount of fertilizer) + ε, where β0 and β1 represent the population intercept and population slope respectively. The term ε represents random variation that is always present, i.e. even if the same plot was grown twice in a row with the same amount of fertilizer, the yield would not be identical (why?).
The population parameters to be estimated are β0 - the population average yield when the amount of
fertilizer is 0, and β1 , the population average change in yield per unit change in the amount of fertilizer.
These are taken over all plants in all possible field plots of this type. The values of β0 and β1 are
impossible to obtain as the entire population could never be measured.
Here is the data entered into a JMP data sheet. Note the scale of both variables (continuous) and that
an extra row was added to the data table with the value of 16 for the fertilizer and the yield left missing.
The ordering of the rows in the data table is NOT important; however, it is often easier to find
individual data points if the data is sorted by the X value and the rows for future predictions are placed
at the end of the dataset. Notice how missing values are represented.
Use the Analyze->Fit Y-by-X platform to start the analysis. Specify the Y and X variable as needed.
Notice that JMP “reminds” you of the analysis that you will obtain based on the scale of the X and
Y variables as shown in the bottom left of the menu. In this case, both X and Y have a continuous scale,
so JMP will perform a bi-variate fitting procedure. It starts by showing the scatter-plot between Yield
(Y ) and fertilizer (X).
The relationship looks approximately linear; there don't appear to be any outliers or influential points; the scatter appears to be roughly equal across the entire regression line. Residual plots will be used later to check these assumptions in more detail.
The drop-down menu item (from the red triangle beside the Bivariate Fit...) allows you to fit the least-squares line. This produces much output, but the three important parts of the output are discussed below.

First, the actual fitted line is drawn on the scatter plot, and the equation of the fitted line is printed below the plot.
The estimated regression line is

$$\hat{Y} = b_0 + b_1(\text{fertilizer}) = 12.856 + 1.10137 \times (\text{amount of fertilizer})$$
In terms of estimates, b0 =12.856 is the estimated intercept, and b1 =1.101 is the estimated slope.
The estimated slope is the estimated change in yield when the amount of fertilizer is increased by 1
unit. In this case, the yield is expected to increase (why?) by 1.10137 L when the fertilizer amount is
increased by 1 kg/ha. NOTE that the slope is the CHANGE in Y when X increases by 1 unit - not the
value of Y when X = 1.
The estimated intercept is the estimated yield when the amount of fertilizer is 0. In this case, the estimated yield when no fertilizer is added is 12.856 L. In this particular case the intercept has a meaningful
interpretation, but I’d be worried about extrapolating outside the range of the observed X values. If the
intercept is 12.85, why does the line intersect the left part of the graph at about 15 rather than closer to
13?
Once again, these are the results from a single experiment. If another experiment was repeated, you
would obtain different estimates (b0 and b1 would change). The sampling distribution over all possible experiments would describe the variation in b0 and b1 over all possible experiments. The standard
deviation of b0 and b1 over all possible experiments is again referred to as the standard error of b0 and
b1 .
The formulae for the standard errors of b0 and b1 are messy, and hopeless to compute by hand.
And just like inference for a mean or a proportion, the program automatically computes the se of the
regression estimates.
The estimated standard error for b1 (the estimated slope) is 0.132 L/kg. This is an estimate of the
standard deviation of b1 over all possible experiments. Normally, the intercept is of limited interest, but
a standard error can also be found for it as shown in the above table.
Using exactly the same logic as when we found a confidence interval for the population mean, or for the population proportion, a confidence interval for the population slope (β1) is found (approximately) as b1 ± 2 × (estimated se). In the above example, an approximate confidence interval for β1 is found as

$$1.101 \pm 2 \times (.132) = 1.101 \pm .264 = (.837 \rightarrow 1.365) \text{ L/kg}$$

of fertilizer applied.
An "exact" confidence interval can be computed by JMP as shown above.[4] The "exact" confidence interval is based on the t-distribution and is slightly wider than our approximate confidence interval because the total sample size (11 pairs of points) is rather small. We interpret this interval as 'being 95% confident that the population increase in yield when the amount of fertilizer is increased by one unit is somewhere between (.837 to 1.365) L/kg.'

[4] If your table doesn't show the confidence interval, use a Control-Click or Right-Click in the table and select the columns to be displayed.
Be sure to carefully distinguish between β1 and b1 . Note that the confidence interval is computed
using b1 , but is a confidence interval for β1 - the population parameter that is unknown.
In linear regression problems, one hypothesis of interest is if the population slope is zero. This would
correspond to no linear relationship between the response and predictor variable (why?) Again, this is
a good time to read the papers by Cherry and Johnson about the dangers of uncritical use of hypothesis
testing. In many cases, a confidence interval tells the entire story.
JMP produces a test of the hypothesis that each of the parameters (the slope and the intercept in the
population) is zero. The output is reproduced again below:
The test of hypothesis about the intercept is not of interest (why?).
Let
• β1 be the population (unknown) slope.
• b1 be the estimated slope. In this case b1 = 1.1014.
The hypothesis testing proceeds as follows. Again note that we are interested in the population
parameters and not the sample statistics.
1. Specify the null and alternate hypotheses:

   H: β1 = 0
   A: β1 ≠ 0.

   Notice that the null hypothesis is in terms of the population parameter β1. This is a two-sided test as we are interested in detecting differences from zero in either direction.

2. Find the test statistic and the p-value. The test statistic is computed as:

   $$T = \frac{\text{estimate} - \text{hypothesized value}}{\text{estimated se}} = \frac{1.1014 - 0}{.132} = 8.36$$

   In other words, the estimate is over 8 standard errors away from the hypothesized value! This is compared to a t-distribution with n − 2 = 9 degrees of freedom. The p-value is found to be very small (less than 0.0001).

3. Conclusion. There is strong evidence that the population slope is not zero. This is not too surprising given that the 95% confidence intervals show that plausible values for the population slope are from about .8 to about 1.4.
It is possible to construct tests of the slope equal to some value other than 0. Most packages can’t do
this. You would compute the T value as shown above, replacing the value 0 with the hypothesized value.
It is also possible to construct one-sided tests. Most computer packages only do two-sided tests.
Proceed as above, but the one-sided p-value is the two-sided p-value reported by the packages divided
by 2.
If sufficient evidence is found against the hypothesis, a natural question to ask is ‘well, what values
of the parameter are plausible given this data’. This is exactly what a confidence interval tells you.
Consequently, I usually prefer to find confidence intervals, rather than doing formal hypothesis testing.
What about making predictions for future yields when certain amounts of fertilizer are applied? For
example, what would be the future yield when 16 kg/ha of fertilizer are applied?
The predicted value is found by substituting the new X into the estimated regression line:

$$\hat{Y} = b_0 + b_1(\text{fertilizer}) = 12.856 + 1.10137 \times (16) = 30.48 \text{ L}$$
This can also be found by using the cross hairs tool on the actual graph (to be demonstrated in class). JMP can compute the predicted value by selecting the appropriate option under the drop-down menu in the Linear Fit item:
and then going back to look at the new column in the data table:
As noted earlier, there are two types of estimates of precision associated with predictions using the
regression line. It is important to distinguish between them as these two intervals are the source of much
confusion in regression problems.
First, the experimenter may be interested in predicting a single FUTURE individual value for a
particular X. This would correspond to the predicted yield for a single future plot with 16 kg/ha of
fertilizer added.
Second, the experimenter may be interested in predicting the average of ALL FUTURE responses at
a particular X. This would correspond to the average yield for all future plots when 16 kg/ha of fertilizer
is added. The prediction interval for an individual response is sometimes called a confidence interval for an individual response, but this is an unfortunate (and incorrect) use of the term confidence interval. Strictly speaking, confidence intervals are computed for fixed unknown parameter values; prediction intervals are computed for future random variables.
Both intervals can be computed and plotted by JMP by again using the pop-down menu beside the
Linear Fit box:
In this menu, the Confid Curves Fit correspond to confidence intervals for the MEAN response, while
the Confid Curves Indiv correspond to prediction intervals for the future single response. Both can be
plotted on the graph. Unfortunately, there does not appear to be a way to save the prediction limits into
a data table from this platform - the cross hairs tool must be used, or the Analyze->Fit Model platform
should be used.
The innermost set of lines represents the confidence bands for the mean response. The outermost
band of lines represents the prediction intervals for a single future response. As noted earlier, the latter
must be wider than the former to account for an additional source of variation.
The numerical values from the Analyze->Fit Model platform are shown below:
Here the predicted yield for a single future trial at 16 kg/ha is 30.5 L, but the 95% prediction interval
is between 26.1 and 34.9 L. The predicted AVERAGE yield for ALL future plots when 16 kg/ha of
fertilizer is applied is also 30.5 L, but the 95% confidence interval for the MEAN yield is between 28.8
and 32.1 L.
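In R, the same two intervals are available from predict(); a minimal sketch, assuming a data frame (here called plots, a hypothetical name) with columns Yield and Fertilizer:

    fit <- lm(Yield ~ Fertilizer, data = plots)
    new <- data.frame(Fertilizer = 16)
    predict(fit, new, interval = "confidence")  # CI for the MEAN yield at 16 kg/ha
    predict(fit, new, interval = "prediction")  # PI for a single FUTURE plot at 16 kg/ha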
Finally, residual plots can be made using the pop-down menu:
The residuals are simply the difference between the actual data point and the corresponding spot on
the line measured in the vertical direction. The residual plot shows no trend in the scatter around the
value of zero.
The same items are available from the Analyze->Fit Model platform. Here you would specify Yield
as the Y variable, Fertilizer as the X variable in much the same way as in the Analyze->Fit Y-by-X
platform. Much of the same output is produced. Additionally, you can save the actual confidence bounds
for predictions into the data table (as shown above). This will be demonstrated in class.
14.4.9 Example - Mercury pollution
Mercury pollution is a serious problem in some waterways. Mercury levels often increase after a lake
is flooded due to leaching of naturally occurring mercury by the higher levels of the water. Excessive
consumption of mercury is well known to be deleterious to human health. It is difficult and time-consuming to measure every person's mercury level. It would be nice to have a quick procedure that could
be used to estimate the mercury level of a person based upon the average mercury level found in fish
and estimates of the person’s consumption of fish in their diet. The following data were collected on
the methyl mercury intake of subjects and the actual mercury levels recorded in the blood stream from a
random sample of people around recently flooded lakes.
Here are the raw data:
Methyl Mercury Intake   Mercury in whole blood
(ug Hg/day)             (ng/g)
180                      90
200                     120
230                     125
410                     290
600                     310
550                     290
275                     170
580                     375
600                     150
105                      70
250                     105
 60                     205
650                     480
The data are available in a JMP datasheet called mercury.jmp available from the Sample Program
Library at http://www.stat.sfu.ca/~cschwarz/Stat-650/Notes/MyPrograms.
The ordering of the rows in the data table is NOT important; however, it is often easier to find
individual data points if the data is sorted by the X value and the rows for future predictions are placed
at the end of the dataset. Notice how missing values are represented.
The population of interest is the people around recently flooded lakes.
This experiment is an analytical survey as it is quite impossible to randomly assign people different
amounts of mercury in their food intake. Consequently, the key assumption is that the subjects chosen
to be measured are random samples from those with similar mercury intakes. Note it is NOT necessary
for this to be a random sample from the ENTIRE population (why?).
The explanatory variable is the amount of mercury ingested by a person. The response variable is the
amount of mercury in the blood stream.
We start by producing the scatter-plot.
There appear to be two outliers (identified by an X). To illustrate the effects of these outliers upon the estimates and the residual plots, the line was first fit using all of the data.
The residual plot shows the clear presence of the two outliers, but also identifies a third potential
outlier not evident from the original scatter-plot (can you find it?).
The data were rechecked and it appears that there was an error in the blood work used in determining
the readings. Consequently, these points were removed for the subsequent fit.
The estimated regression line (after removing the outliers) is

Blood = −1.951691 + 0.581218 × Intake.
The estimated slope of 0.58 indicates that the mercury level in the blood increases by 0.58 ng/g when the intake level in the food is increased by 1 ug/day. The intercept has no real meaning in the context of this experiment. The negative value is merely a placeholder for the line. Also notice that the estimated intercept is not very precise in any case (how do I know this, and what implications does this have for worrying that it is not zero?)5
What would the impact upon the estimated slope and intercept have been had the outliers been retained?
The estimated slope has been determined relatively well (relative standard error of about 10% – how
5 It is possible to fit a regression line that is constrained to go through Y = 0 when X = 0. These must be fit carefully and are not covered in this course.
is the relative standard error computed?). There is clear evidence that the hypothesis of no relationship
between blood mercury levels and food mercury levels is not tenable.
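As a sketch of the relative standard error computation in R, assuming a data frame mercury with columns Blood and Intake (hypothetical names) and with the outliers already removed:

    fit    <- lm(Blood ~ Intake, data = mercury)
    est    <- coef(summary(fit))["Intake", "Estimate"]    # estimated slope
    se     <- coef(summary(fit))["Intake", "Std. Error"]  # its standard error
    rel.se <- se / est    # relative standard error; about 0.10 (10%) here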
The two types of predictions would also be of interest in this study. First, an individual would like to
know the impact upon personal health. Secondly, the average level would be of interest to public health
authorities.
JMP was used to plot both intervals on the scatter-plot:
14.4.10 Example - The Anscombe Data Set
Anscombe (1973, American Statistician 27, 17-21) created a set of 4 data sets that were quite remarkable.
All four datasets gave exactly the same results when a regression line was fit, yet are quite different in
their interpretation.
The Anscombe data is available from the Sample Program Library at http://www.stat.sfu.ca/~cschwarz/Stat-650/Notes/MyPrograms. Fitting of regression lines to this data will be demonstrated in class.
14.4.11 Transformations
In some cases, the plot of Y vs. X is obviously non-linear and a transformation of X or Y may be used to
establish linearity. For example, many dose-response curves are linear in log(X). Or the equation may
be intrinsically non-linear, e.g. a weight-length relationship is of the form weight = β0 × length^β1. Or,
some variables may be recorded in an arbitrary scale, e.g. should the fuel efficiency of a car be measured
in L/100 km or km/L? You are already familiar with some variables measured on the log-scale - pH is a common example.
Often a visual inspection of a plot may identify the appropriate transformation.
There is no theoretical difficulty in fitting a linear regression using transformed variables other than
an understanding of the implicit assumption of the error structure. The model for a fit on transformed
data is of the form
trans(Y) = β0 + β1 × trans(X) + error
Note that the error is assumed to act additively on the transformed scale. All of the assumptions of
linear regression are assumed to act on the transformed scale – in particular that the population standard
deviation around the regression line is constant on the transformed scale.
The most common transformation is the logarithmic transform. It doesn’t matter if the natural logarithm (often called the ln function) or the common logarithm transformation (often called the log10
transformation) is used. There is a 1-1 relationship between the two transformations, and linearity on
one transform is preserved on the other transform. The only change is that values on the ln scale are
2.302 = ln(10) times that on the log10 scale which implies that the estimated slope and intercept both
differ by a factor of 2.302. There is some confusion in scientific papers about the meaning of log - some
papers use this to refer to the ln transformation, while others use this to refer to the log10 transformation.
After the regression model is fit, remember to interpret the estimates of slope and intercept on the
transformed scale. For example, suppose that a ln(Y ) transformation is used. Then we have
ln(Yt+1) = b0 + b1 × (t + 1)
ln(Yt) = b0 + b1 × t

and

ln(Yt+1) − ln(Yt) = ln(Yt+1/Yt) = b1 × (t + 1 − t) = b1

so that

exp(ln(Yt+1/Yt)) = Yt+1/Yt = exp(b1)
Hence a one unit increase in X causes Y to be MULTIPLIED by exp(b1). As an example, suppose that on the log-scale the estimated slope was −.07. Then every unit change in X causes Y to change by a multiplicative factor of exp(−.07) = .93, i.e. roughly a 7% decline per year.6
Similarly, predictions made on the transformed scale must be back-transformed to the untransformed scale.
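A minimal R sketch of this back-transformation, assuming a fit of the form lm(log(Y) ~ X) with hypothetical variable names:

    fit <- lm(log(Y) ~ X, data = dat)        # fit on the ln scale
    b1  <- coef(fit)["X"]                    # slope on the ln scale
    exp(b1)                                  # multiplicative change in Y per unit change in X
    exp(predict(fit, data.frame(X = 5)))     # back-transformed prediction at X = 5

Note that the back-transformed prediction estimates a median on the original scale, a point discussed in the dioxin example below.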
In some problems, scientists search for the ‘best’ transform. This is not an easy task and using simple
statistics such as R2 to search for the best transformation should be avoided. Seek help if you need to
find the best transformation for a particular dataset.
JMP makes it particularly easy to fit regressions to transformed data, as shown below. SAS and R have an extensive array of functions so that you can create new variables based on the transformation of an existing variable.
14.4.12 Example: Monitoring Dioxins - transformation
An unfortunate byproduct of pulp-and-paper production used to be dioxins - a very hazardous material. This material was discharged into waterways with the pulp-and-paper effluent, where it bioaccumulated in living organisms such as crabs. Newer processes have eliminated this byproduct, but the dioxins in the organisms take a long time to degrade.
6 It can be shown that on the log scale, for smallish values of the slope, the change is almost the same on the untransformed scale, i.e. if the slope is −.07 on the log scale, this implies roughly a 7% decline per year; a slope of +.07 implies roughly a 7% increase per year.
Government environmental protection agencies take samples of crabs from affected areas each year
and measure the amount of dioxins in the tissue. The following example is based on a real study.
Each year, four crabs are captured from a monitoring station. The liver is excised and the livers from all four crabs are composited together into a single sample.7 The dioxin levels in this composite sample are measured. As there are many different forms of dioxins with different toxicities, a summary measure,
called the Total Equivalent Dose (TEQ) is computed from the sample.
Here is the raw data.
Site   Year   TEQ
a      1990   179.05
a      1991    82.39
a      1992   130.18
a      1993    97.06
a      1994    49.34
a      1995    57.05
a      1996    57.41
a      1997    29.94
a      1998    48.48
a      1999    49.67
a      2000    34.25
a      2001    59.28
a      2002    34.92
a      2003    28.16
The data is available in a JMP data file dioxinTEQ.jmp in the Sample Program Library at http://www.stat.sfu.ca/~cschwarz/Stat-650/Notes/MyPrograms.
As with all analyses, start with a preliminary plot of the data. Use the Analyze->Fit Y-by-X platform.
7 Compositing is a common analytical tool. There is little loss of useful information induced by the compositing process - the only loss of information is the among individual-sample variability, which can be used to determine the optimal allocation between samples within years and the number of years to monitor.
The preliminary plot of the data shows a decline in levels over time, but it is clearly non-linear. Why
is this so? In many cases, a fixed fraction of dioxins degrades per year, e.g. a 10% decline per year. This
can be expressed in a non-linear relationship:
TEQ = C × r^t
where C is the initial concentration, r is the rate reduction per year, and t is the elapsed time. If this is
plotted over time, this leads to the non-linear pattern seen above.
If logarithms are taken, this leads to the relationship:

log(TEQ) = log(C) + t × log(r)

which can be expressed as:

log(TEQ) = β0 + β1 × t

which is the equation of a straight line with β0 = log(C) and β1 = log(r).
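The same fit is easy to sketch in R, assuming a data frame dioxin with the Year and TEQ columns from the table above:

    dioxin$logTEQ <- log(dioxin$TEQ)          # natural-log transform
    fit <- lm(logTEQ ~ Year, data = dioxin)   # slope estimates log(r), intercept log(C)
    summary(fit)
    exp(coef(fit)["Year"])                    # estimated fraction of TEQ remaining per year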
JMP can easily be used to compute log(TEQ) by using the Formula Editor in the usual fashion. A plot of log(TEQ) vs. year gives the following:
The relationship looks approximately linear; there don't appear to be any outliers or influential points; the scatter appears to be roughly equal across the entire regression line. Residual plots will be used later to check these assumptions in more detail.
A line can be fit as before by selecting the Fit Line option from the red triangle in the upper left side
of the plot:
This gives the following output:
The fitted line is:

log(TEQ) = 218.9 − .11(year).
The intercept (218.9) would be the log(TEQ) in the year 0, which is clearly nonsensical. The slope (−.11) is the estimated log(ratio) from one year to the next. For example, exp(−.11) = .898 would mean that the TEQ in one year is only 89.8% of the TEQ in the previous year, or roughly an 11% decline per year.8
8 It can be shown that in regressions of log(Y) vs. time, the estimated slope on the logarithmic scale is the approximate fractional decline per time interval. For example, in the above, the estimated slope of −.11 corresponds to an approximate 11% decline per year. This approximation only works well when the slopes are small, i.e. close to zero.
The standard error of the estimated slope is .02. If you want to find the standard error of the anti-log of the estimated slope, you DO NOT take exp(0.02). Rather, the standard error of the anti-logged value is found as se_antilog = se_log × exp(slope) = 0.02 × .898 = .01796.9
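A one-line check of this delta-method computation in R, using the rounded values reported above:

    slope <- -0.11; se.slope <- 0.02
    exp(slope)               # estimated year-to-year ratio, about 0.898
    se.slope * exp(slope)    # delta-method SE of the ratio, about 0.018 (NOT exp(0.02))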
A 95% confidence interval for the slope can be obtained by pressing a Right-Click (for Windoze
machines) or a Ctrl-Click (for Macintosh machines) in the Parameter Estimates summary table and
selecting the confidence intervals to display in the table.
The 95% confidence interval for the slope (on the log-scale) is (−.154 to −.061). If you take the anti-logs of the endpoints, this gives a 95% confidence interval for the fraction of TEQ that remains from year to year, i.e. between 0.86 and 0.94 of the TEQ in one year remains to the next year.
As always, the model diagnostics should be inspected early on in the process; these are produced automatically by JMP.
The residual plot looks fine with no apparent problems but the dip in the middle years could require
further exploration if this pattern was apparent in other sites as well. This type of pattern may be evidence
of autocorrelation.
Here there is no evidence of auto-correlation so we can proceed without worries.
Several types of predictions can be made. For example, what would be the estimated mean logTEQ
in 2010? What is the range of logTEQ’s in 2010? Again, refer back to previous chapters about the
differences in predicting a mean response and predicting an individual response.
The computations could be done by hand, or by using the cross-hairs on the plot from the Analyze->Fit Y-by-X platform. Confidence intervals for the mean response, or prediction intervals for an individual response, can be added to the plot from the pop-down menu.
However, a more powerful tool is available from the Analyze->Fit Model platform.
9 This is computed using a method called the delta-method.
Start first by adding rows to the original data table corresponding to the years for which a prediction
is required. In this case, the additional row would have the value of 2010 in the Year column with the
remainder of the row unspecified. Missing values will be automatically inserted for the other variables.
Then invoke the Analyze->Fit Model platform:
This gives much the same output as from the Analyze->Fit Y-by-X platform with a few new (useful)
features, a few of which we will explore in the remainder of this section.
Next, save the prediction formula, and the confidence interval for the mean, and for an individual
prediction to the data table (this will take three successive saves):
Now the data table has been augmented with additional columns and more importantly predictions
for 2010 are now available:
The estimated mean log(T EQ) in 2010 is 2.60 (corresponding to an estimated MEDIAN TEQ of
exp(2.60) = 13.46). A 95% confidence interval for the mean log(T EQ) is (1.94 to 3.26) corresponding
to a 95% confidence interval for the actual MEDIAN TEQ of between (6.96 and 26.05).10 Note that the
confidence interval after taking anti-logs is no longer symmetrical.
Why does a mean of a logarithm transform back to the median on the untransformed scale? Basically, because the transformation is non-linear, properties such as means and standard errors cannot simply be anti-transformed without introducing some bias. However, measures of location (such as a median) are unaffected. On the transformed scale, it is assumed that the sampling distribution about the estimate is symmetrical, which makes the mean and median take the same value. So what really is happening is that the median on the transformed scale is back-transformed to the median on the untransformed scale.
Similarly, a 95% prediction interval for the log(T EQ) for an INDIVIDUAL composite sample can
be found. Be sure to understand the difference between the two intervals.
Finally, an inverse prediction is sometimes of interest, i.e. in what year will the TEQ be equal to some particular value? For example, health regulations may require that the TEQ of the composite sample be below 10 units.
The Analyze->Fit Model platform has an inverse prediction function:
10 A minor correction can be applied to estimate the mean if required.
Specify the required value for Y – in this case log(10) = 2.302,
and then press the RUN button to get the following output:
The predicted year is found by solving

2.302 = 218.9 − .11(year)

(using the full-precision estimates of the coefficients) and gives an estimated year of 2012.7. A confidence interval for the time when the mean log(TEQ) is equal to log(10) is somewhere between 2007 and 2026!
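In R, the inverse prediction is one line of algebra on the fitted coefficients (fit from the earlier sketch). Note that full precision matters here: with an intercept near 219, rounding the slope to −.11 shifts the solved year by decades.

    (log(10) - coef(fit)["(Intercept)"]) / coef(fit)["Year"]  # year when mean log(TEQ) = log(10)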
The application of regression to non-linear problems is fairly straightforward after the transformation
is made. The most error-prone step of the process is the interpretation of the estimates on the TRANSFORMED scale and how these relate to the untransformed scale.
14.4.13 Example: Weight-length relationships - transformation
A common technique in fisheries management is to investigate the relationship between weight and
lengths of fish.
This is expected to be a non-linear relationship because as fish get longer, they also get wider and thicker. If a fish grew “equally” in all directions, then the weight of a fish should be proportional to length^3 (why?). However, fish do not grow equally in all directions, i.e. a doubling of length is not necessarily associated with a doubling of width or thickness. The pattern of association of weight with length may reveal information on how fish grow.
The traditional model between weight and length is often postulated to be of the form:
weight = a × length^b
where a and b are unknown constants to be estimated from data.
If the estimated value of b is much less than 3, this indicates that as fish get longer, they do not get
wider and thicker at the same rates.
How are such models fit? If logarithms are taken on each side, the above equation is transformed to:
log(weight) = log(a) + b × log(length)
or
log(weight) = β0 + β1 × log(length)
where the usual linear relationship on the log-scale is now apparent.
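A minimal R sketch of the log-log fit, assuming a data frame fish with columns weight and length as in the table below:

    fit <- lm(log(weight) ~ log(length), data = fish)
    coef(fit)      # intercept estimates log(a); slope estimates b, the power coefficient
    confint(fit)   # does the interval for the power coefficient include 3?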
The following example was provided by Randy Zemlak of the British Columbia Ministry of Water,
Land, and Air Protection.
Length (mm)   Weight (g)
34             585
46            1941
33             462
36             511
32             428
33             396
34             527
34             485
33             453
44            1426
35             488
34             511
32             403
31             379
30             319
33             483
36             600
35             532
29             326
34             507
32             414
33             432
33             462
35             566
34             454
35             600
29             336
31             451
33             474
32             480
35             474
30             330
30             376
34             523
31             353
32             412
32             407
A sample of fish was measured at a lake in British Columbia. The data is shown above and is available in a JMP datasheet called wtlen.jmp in the Sample Program Library at http://www.stat.sfu.ca/~cschwarz/Stat-650/Notes/MyPrograms.
The following is an initial plot with a spline fit (lambda=10) to the data.
The fit appears to be non-linear but this may simply be an artifact of the influence of the two largest
fish. The plot appears to be linear in the area of 30-35 mm in length. If you look at the plot carefully, the
variance appears to be increasing with the length with the spread noticeably wider at 35 mm than at 30
mm.
There are several (equivalent) ways to fit the growth model to such data in JMP:
• Use Analyze->Fit Y-by-X directly with the Fit Special feature.
• Create two new variables log(weight) and log(length) and then use Analyze->Fit Y-by-X on these
derived variables.
• Use Analyze->Fit Model on these derived variables.
We will fit a model on the log-log scale. Note that there is some confusion in scientific papers about a “log” transform. In general, a log-transformation refers to taking natural logarithms (base e), and NOT the base-10 logarithm. This mathematical convention is often broken in scientific papers where authors try to use ln to represent natural logarithms, etc. It does not affect the analysis in any way which transformation is used, other than that values on the natural log scale are approximately 2.3 times larger than values on the log10 scale. Of course, the appropriate back transformation is required.
Using the Fit Special

The Fit Special option is available from the drop-down menu item:
It presents a dialogue box where a transformation on both the Y and X axes may be specified. After the transformations are specified, the following output is obtained:
The fit is not very satisfactory. The curve doesn't seem to fit the two “outlier” points very well. At smaller lengths, the curve seems to be under-fitting the weight. The residual plot appears to show the two definite outliers and also shows some evidence of a poor fit, with positive residuals at lengths 30 mm and
negative residuals at 35 mm.
The fit was repeated dropping the two largest fish with the following output:
Now the fit appears to be much better. The relationship (on the log-scale) is linear, the residual plot
looks OK.
The estimated power coefficient is 2.76 (SE .21). We find the 95% confidence interval for the slope
(the power coefficient):
The 95% confidence interval for the power coefficient is from (2.33 to 3.2), which includes the value of 3 – hence the growth could be isometric, i.e. a fish that is twice the length is also twice the width and twice the thickness. Of course, with this small sample size, it is difficult to say too much.
The actual model in the population is:

log(weight) = β0 + β1 × log(length) + error

This implies that the “errors” in growth act on the LOG-scale. This seems reasonable.
For example, a regression on the original scale would make the assumption that a 20 g error in
predicting weight is equally severe for a fish that (on average) weighs 200 or 400 grams even though the
"error" is 20/200=10% of the predicted value in the first case, while only 5% of the predicted value in the
second case. On the log-scale, it is implicitly assumed that the “errors” operate on the log-scale, i.e. a
10% error in a 200 g fish is equally severe as a 10% error in a 400 g fish even though the absolute errors
of 20g and 40g are quite different.
Another assumption of regression analysis is that the population error variance is constant over the entire regression line, but the original plot shows that the standard deviation is increasing with length. On the log-scale, the standard deviation is roughly constant over the entire regression line.
Using derived variables
The same analysis was repeated using the derived variables log(weight) and log(length) and again
using the Analyze->Fit Y-by-X platform, but this time without the Fit Special. [The Fit Special is not
needed because the derived variables have already been transformed.]
The following are the outputs using the derived variables, again with and without the two largest fish.
Because derived variables are used, the fitting plot uses the derived variables and is on the log-scale. This has the advantage that the fit at the lower lengths is easier to see, but the lack of fit for the two largest fish is not as clear. However, it is now easier to see on the residual plot the apparent lack of fit,
with the downward sloping part of the residual plot in the 3.4 to 3.6 log(length) range.
The two largest fish were removed and the fit repeated using the derived variables:
The results are identical to the previous section.
A non-linear fit
It is also possible to do a direct non-linear least-squares fit. Here the objective is to find values of β0 and β1 that minimize

Σ (weight − β0 × length^β1)²

directly.
This can also be done in JMP using the Fit NonLinear platform and won’t be explored in much detail
here.
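For readers working in R, nls() performs the analogous fit; a sketch, assuming the fish data frame from the earlier sketch, with starting values taken from the log-log fit:

    fit.nls <- nls(weight ~ b0 * length^b1, data = fish,
                   start = list(b0 = 0.03, b1 = 2.7))  # starting values from the log-log fit
    summary(fit.nls)    # estimates of b0 and b1 with standard errors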
First here are the results from using all of the fish:
Note that the fit apparently is better than the fit on the log-scale, as the fitted curve goes through the middle of the points from the two largest fish. However, there still appear to be problems with the fit at the lower lengths.
The same fit, dropping the two largest fish, gives the following output:
The estimated power coefficient from the non-linear fit is 2.73 with a standard error of .24. The
estimated intercept is 0.0323 with an estimated standard error of .027. Both estimates are similar to the
previous fit.
Which is the better method to fit this data? The non-linear fit assumes that errors are additive on the original scale. The consequences of this were discussed earlier, i.e. a 20 g error is equally serious for a 200 g fish as for a 400 g fish.
For this problem, both the non-linear fit and the fit on the log-scale gave the same results, but this will not always be true. In particular, look at the large difference in estimates when the models were fit to all of the fish. The non-linear fit was more influenced by the two large fish - this is a consequence of minimizing the square of the absolute deviation (as opposed to the relative deviation) between the observed weight and predicted weight.
14.4.14 Power/Sample Size
A power analysis and sample size determination can also be done for regression problems, but is more
complicated than power analyses for simple experimental designs. This is for a number of reasons:
• The power depends not only on the total number of points collected, but also on the actual distribution of the X values.
For example, the power to detect a trend is different if the X values are evenly distributed over the
range of predictors than if the X values are clustered at the ends of the range of the predictors. A
regression analysis has the most power to detect a trend if half the observations are collected at a
small X value and half of the observations are collected at a large X value. However, this type of
data gives no information on the linearity (or lack thereof) between the two X values and is not recommended in practice. A less powerful design would have a range of X values collected; this is often of more interest as lack-of-fit and non-linearity can then be detected.
• Data collected for regression analysis is often opportunistic with little chance of choosing the X
values. Unless you have some prior information on the distribution of the X values, it is difficult
to determine the power.
• The formulae are clumsy to compute by hand, and most power packages tend not to have modules for power analysis of regression. However, modern software should be able to deal with this issue.
For a power analysis, the information required is similar to that requested for ANOVA designs:
• α level. As in power analyses for ANOVA, this is traditionally set to α = 0.05.
• effect size. In ANOVA, power deals with detection of differences among means. In regression
analysis, power deals with detection of slopes that are different from zero. Hence, the effect size
is measured by the slope of the line, i.e. the rate of change in the mean of Y per unit change in X.
• sample size. Recall in ANOVA with more than two groups that the power depended not only on the sample size per group, but also on how the means are separated. In regression analysis, the power will depend upon the number of observations taken at each value of X and the spread of the X values. For example, the greatest power is obtained when half the sample is taken at each of the two extremes of the X space - but at the cost of not being able to detect non-linearity. It turns out that a simple summary of the distribution of the X values (the standard deviation of the X values) is all that is needed.
• standard deviation. As in ANOVA, the power will depend upon the variation of the individual
objects around the regression line.
JMP (v.10) does not currently contain a module to do power analysis for regression. R also does not include a power computation module for regression analysis, but I have written a small function that is available in the Sample Program Library. SAS (Version 9+) includes a power analysis module (GLMPOWER). Russ Lenth also has a JAVA applet that can be used for determining power in a regression context: http://homepage.stat.uiowa.edu/~rlenth/Power/.
The problem simplifies considerably when the X variable is time and interest lies in detecting a trend (increasing or decreasing) over time. A linear regression of the quantity of interest against time is commonly used to evaluate such a trend. For many monitoring designs, observations are taken on a yearly basis, so the question reduces to the number of years of monitoring required. The analysis of trend data and power/sample size computations is treated in a following chapter.
Let us return to the example of the yield of tomatoes vs. the amount of fertilizer. We wish to design an experiment to detect a slope of 1 (the effect size). From past data (on a different field), the standard deviation of values about the regression line is about 4 units (the standard deviation of the residuals). We have enough money to plant 12 plots with levels of fertilizer ranging from 10 to 20. How does the power compare under different configurations of fertilizer levels? More specifically, how
does the power compare between using fertilizer levels (10, 11, 12, 13, 14, 15, 15, 16, 17, 18, 19, 20),
i.e. an even distribution of levels of fertilizer, and (10, 10, 12, 12, 14, 14, 16, 16, 18, 18, 20, 20), i.e.
doing two replicates at each level of fertilizer but doing fewer distinct levels?
JMP (v.10) does not currently have a module for a power analysis of a regression problem. Please consult the documentation on R, SAS, or Lenth's JAVA program (see below).
The power to detect a range of slopes using the last set of X values was also computed (see the R and
SAS code) and a plot of the power vs. the size of the slope can be made.
Because JMP does not include facilities for a power analysis of a simple linear regression, the plot from R
is shown.
The power to detect smaller slopes is limited.
Russ Lenth’s power modules11 can be used to compute the power for these two cases. Here the
modules require the standard deviation of the X values but this needs to be computed using the n divisor
rather than the n − 1 divisor, i.e.
s
P
(X − X)2
SDLenth (X) =
n
For the two sets of fertilizer values the SDs are 3.02765 and 3.41565 respectively.
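The power computation itself can be sketched from first principles in R using the non-central t distribution (this is a minimal stand-in, not the author's Sample Program Library function): the t statistic for the slope has df = n − 2 and non-centrality parameter slope × sqrt(Σ(X − X̄)²)/σ.

    # Two-sided power for the test of slope = 0 in simple linear regression
    power.slope <- function(x, slope, sigma, alpha = 0.05) {
      n     <- length(x)
      ncp   <- slope * sqrt(sum((x - mean(x))^2)) / sigma  # non-centrality parameter
      tcrit <- qt(1 - alpha / 2, df = n - 2)
      pt(-tcrit, df = n - 2, ncp = ncp) + 1 - pt(tcrit, df = n - 2, ncp = ncp)
    }
    even <- c(10, 11, 12, 13, 14, 15, 15, 16, 17, 18, 19, 20)  # spread of distinct levels
    reps <- c(10, 10, 12, 12, 14, 14, 16, 16, 18, 18, 20, 20)  # two replicates per level
    power.slope(even, slope = 1, sigma = 4)
    power.slope(reps, slope = 1, sigma = 4)   # larger spread in X, hence higher power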
11 http://homepage.stat.uiowa.edu/~rlenth/Power/
The output from Lenth’s power analysis are:
which match the earlier results (as they must).
14.4.15 The perils of R2
R2 is a “popular” measure of the fit of a regression model and is often quoted in research papers as
evidence of a good fit etc. However, there are several fundamental problems of R2 which, in my opinion,
make it less desirable. A nice summary of these issues is presented in Draper and Smith (1998, Applied
Regression Analysis, p. 245-246).
Before exploring this, how is R2 computed and how is it interpreted?
While I haven’t discussed the decomposition of the Error SS into Lack-of-Fit and Pure error, this can
be done when there are replicated X values. A prototype ANOVA table would look something like:
Source            df             SS
Regression        p − 1          A
Lack-of-fit       n − p − ne     B
Pure error        ne             C
Corrected Total   n − 1          D
where there are n observations and the regression model has p parameters (the intercept plus the X terms).
R2 is computed as

R2 = SS(regression)/SS(total) = A/D = 1 − (B + C)/D
where SS(·) represents the sum of squares for that term in the ANOVA table. At this point, rerun the
three examples presented earlier to find the value of R2 .
For example, in the fertilizer example, the ANOVA table is:
Analysis of Variance
Source     DF   Sum of Squares   Mean Square   F Ratio   p-value
Model       1        225.18035       225.180   69.8800    <.0001
Error       9         29.00147         3.222
C. Total   10        254.18182
Here R2 = 225.18035/254.18182 = .885 = 88.5%.
R2 is interpreted as the proportion of variance in Y accounted for by the regression. In this case,
almost 90% of the variation in Y is accounted for by the regression. The value of R2 must range between
0 and 1.
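As a quick check, the computation in R from the table above (for a fit from lm(), summary(fit)$r.squared returns the same quantity):

    SS.model <- 225.18035
    SS.total <- 254.18182
    SS.model / SS.total   # 0.885, i.e. 88.5%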
It is tempting to think that R2 must be a measure of the “goodness of fit”. In a technical sense it is, but R2 is not a very good measure of fit, and other characteristics of the regression equation are much more informative. In particular, the estimate of the slope and the se of the slope are much more informative.
Here are some reasons why I decline to use R2 very much:

• Overfitting. If there are no replicate X points, then ne = 0, C = 0, and R2 = 1 − B/D. B has n − p degrees of freedom. As more and more X variables are added to the model, n − p and B become smaller, and R2 must increase even if the additional variables are useless.
• Outliers distort. Outliers produce Y values that are extreme relative to the fit. This can inflate the value of C (if the outlier occurs among a set of replicate X values), or B if the outlier occurs at a singleton X value. In either case, they reduce R2, so R2 is not resistant to outliers.
• People misinterpret high R2 as implying the regression line is useful. It is tempting to believe
that a higher value of R2 implies that a regression line is more useful. But consider the pair of
plots below:
The graph on the left has a very high R2 , but the change in Y as X varies is negligible. The graph
on the right has a lower R2 , but the average change in Y per unit change in X is considerable. R2
measures the “tightness” of the points about the line – the higher value of R2 on the left indicates
that the points fit the line very well. The value of R2 does NOT measure how much actual change
occurs.
• Upper bound is not always 1. People often assume that a low R2 implies a poor-fitting line. If you have replicate X values, then C > 0. The maximum value of R2 for this problem can be much less than 100% - it is mathematically impossible for R2 to reach 100% with replicated X values. In the extreme case where the model “fits perfectly” (i.e. the lack-of-fit term is zero), R2 can never exceed 1 − C/D.
• No-intercept models. If there is no intercept, then D = Σ(Yi − Ȳ)² does not exist, and R2 is not really defined.
• R2 gives no additional information. In actual fact, R2 is a 1-1 transformation of the slope and
its standard error, as is the p-value. So there is no new information in R2 .
• R2 is not useful for non-linear fits. R2 is really only useful for linear fits with the estimated
regression line free to have a non-zero intercept. The reason is that R2 is really a comparison
between two types of models. For example, refer back to the length-weight relationship examined
earlier.
In the linear fit case, the two models being compared are

log(weight) = log(b0) + error

vs.

log(weight) = log(b0) + b1 × log(length) + error

and so R2 is a measure of the improvement with the regression line. [In actual fact, it is a 1-1 transform of the test that β1 = 0, so why not use that statistic directly?] In the non-linear fit case, the two models being compared are:

weight = 0 + error

vs.

weight = b0 × length^b1 + error

The model weight = 0 is silly, and so R2 is silly.
Hence, the R2 values reported are really all for linear fits - it is just that sometimes the actual linear
fit is hidden.
• Not defined in generalized least squares. There are more complex fits that don’t assume equal
variance around the regression line. In these cases, R2 is again not defined.
• Cannot be used with different transformations of Y . R2 cannot be used to compare models
that are fit to different transformations of the Y variable. For example, many people try fitting a
model to Y and to log(Y ) and choose the model with the highest R2 . This is not appropriate as
the D terms are no longer comparable between the two models.
• Cannot be used for non-nested models. R2 cannot be used to compare models with different
sets of X variables unless one model is nested within another model (i.e. all of the X variables in
the smaller model also appear in the larger model). So using R2 to compare a model with X1 , X3 ,
and X5 to a model with X1 , X2 , and X4 is not appropriate as these two models are not nested. In
these cases, AIC should be used to select among models.
14.5 A no-intercept model: Fulton's Condition Factor K
It is possible to fit a regression line that has an intercept of 0, i.e., goes through the origin. Most computer
packages have an option to suppress the fitting of the intercept.
The biggest ‘problem’ lies in interpreting some of the output – some of the statistics produced are
misleading for these models. As this varies from package to package, please seek advice when fitting
such models.
The following is an example of where such a model may be sensible.
Not all fish within a lake are identical. How can a single summary measure be developed to represent
the condition of fish within a lake?
In general, the relationship between fish weight and length follows a power law:

W = aL^b

where W is the observed weight; L is the observed length; and a and b are coefficients relating length to weight. The usual assumption is that heavier fish of a given length are in better condition than lighter fish. Condition indices are a popular summary measure of the condition of the population.
There are at least eight different measures of condition which can be found by a simple literature
search. Conne (1989) raises some important questions about the use of a single index to represent the
two-dimensional weight-length relationship.
One common measure is Fulton's12 K:

K = Weight / (Length/100)³
This index makes an implicit assumption of isometric growth, i.e. as the fish grows, its body proportions
and specific gravity do not change.
How can K be computed from a sample of fish, and how can K be compared among different subsets of fish from the same lake or across lakes?
The B.C. Ministry of Environment takes regular samples of rainbow trout using a floating and a sinking net. For each fish captured, the weight (g), length (mm), sex, and maturity of the fish were recorded.
12 There is some doubt about the first authorship of this condition factor. See Nash, R. D. M., Valencia, A. H., and Geffen, A. J. (2005). The Origin of Fulton's Condition Factor – Setting the Record Straight. Fisheries, 31, 236-238.
The data is available in the rainbow-condition.csv file in the Sample Program Library at http:
//www.stat.sfu.ca/~cschwarz/Stat-650/Notes/MyPrograms.
The data are imported into rainbow-condition.jmp, a JMP data file, in the usual way. A portion of the raw data appears below:
K was computed for each individual fish, and the resulting histogram is displayed below:
There is a range of condition numbers among the individual fish with an average (among the fish caught)
K of about 13.6.
Deriving a single summary measure to represent the entire population of fish in the lake depends
heavily on the sampling design used to capture fish.
Some care must be taken to ensure that the fish collected are a simple random sample from the fish in the population. If a net of a single mesh size is used, then it has a selectivity curve and is typically more selective for fish of a certain size. In this experiment, several different mesh sizes were used to try and ensure that fish of all sizes have an equal chance of being selected.
As well, regression methods have an advantage in that a simple random sample from the population is no longer required to estimate the regression coefficients. As an analogy, suppose you are interested in the relationship between yield of plants and soil fertility. Such a study could be conducted by finding a random sample of soil plots, but this may lead to many plots with similar fertility and only a few plots
with fertility at the tails of the relationship. An alternate scheme is to deliberately seek out soil plots with
a range of fertilities or to purposely modify the fertility of soil plots by adding fertilizer, and then fit a
regression curve to these selected data points.
Fulton’s index is often re-expressed for regression purposes as:
W =K
This looks like a simple regression between W and
L
100
3
L 3
100
but with no intercept.
A plot of these two variables:
shows a tight relationship among fish but with possible increasing variance with length.
There is some debate about the proper way to estimate the regression coefficient K. Classical regression methods (least squares) implicitly assume that all of the “error” in the regression is in the vertical direction, i.e. they condition on the observed lengths. However, the structural relationship between weight and length likely has errors in both variables. This leads to the error-in-variables problem in regression, which has a long history. Fortunately, the relationship between the two variables is often sufficiently tight that it really doesn't matter which method is used to find the estimates.
JMP can be used to fit the regression line constraining the intercept to be zero by using the Fit Special
option under the red-triangle:
This gives rise to the fitted line and statistics about the fit:
Note that R2 really doesn’t make sense in cases where the regression is forced through the origin because
the null model to which it is being compared is the line Y = 0 which is silly.13 For this reason, JMP
does not report a value of R2 .
The estimated value of K is 13.72 (SE 0.099).
13 Consult any of the standard references on regression, such as Draper and Smith, for more details.
The residual plot:

shows clear evidence of increasing variation with the length variable. This usually implies that a weighted regression is needed, with weights proportional to 1/length². In this case, such a regression gives essentially the same estimate of the condition factor (K̂ = 13.67, SE = .11).
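Both fits are easy to sketch in R, assuming a data frame trout with columns Weight and Length (hypothetical names); the '-1' in the formula suppresses the intercept:

    trout$L3 <- (trout$Length / 100)^3
    fit  <- lm(Weight ~ L3 - 1, data = trout)   # unweighted no-intercept fit; coef is K
    fitw <- lm(Weight ~ L3 - 1, data = trout,
               weights = 1 / trout$Length^2)    # weighted fit for the increasing spread
    coef(fit); coef(fitw)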
Comparing condition factors
This dataset has a number of sub-groups – do all of the subgroups have the same condition factor?
For example, suppose we wish to compare the K value for immature and mature fish. This is covered in
more detail in the Chapter on the Analysis of Covariance (ANCOVA).
14.6 Frequently Asked Questions - FAQ
14.6.1 Do I need a random sample; power analysis
A student wrote:
I am studying the hydraulic geometry of small, steep streams in Southwest BC (abstract attached). I would like to define a regional hydraulic geometry for a fairly small hydrologic/geologic homogeneous area in the coast mountains close to SFU. Hydraulic geometry is the study of how the primary flow variables (width, depth and velocity) change with discharge in a stream. Typically, a straight regression line is fitted to data plotted on a log-log plot. The equation is of the form w = aQ^b where a is the intercept, b is the slope, w is the water surface width, and Q is the stream discharge.
I am struggling with the last part of my research proposal, which is how do I select (randomly) my field sites and how many sites are required. My supervisor suggests that I select stream segments for study based on a-priori knowledge of my field area and select streams from across it. My argument is that to define a regionally applicable relationship (not just one that characterizes my chosen sites) I must randomly select the sites.
I think that GIS will help me select my sites but have the usual questions of how many
sites are required to give me a certain level of confidence and whether or not I’m on the
right track. As well, the primary controlling variables that I am looking at are discharge
and stream slope. I will be plotting the flow variables against discharge directly but will
deal with slope by breaking my stream segments into slope classes - I guess that the null
hypothesis would be that there is no difference in the exponents and intercepts between slope
classes.
You are both correct!
If you were doing a simple survey, then you are correct in that a random sample from the entire
population must be selected - you can’t deliberately choose streams.
However, because you are interested in a regression approach, the assumption can be relaxed a bit.
You can deliberately choose values of the X variables, but must randomly select from streams with
similar X values.
As an analogy, suppose you wanted to estimate the average length of male adult arms. You would need a random sample from the entire population. However, suppose that you were interested in the relationship between body height (X) and arm length (Y). You could deliberately choose which X values to measure - indeed it would be a good idea to get a good contrast among the X values, i.e. find people who are 4 ft tall, 5 ft tall, 6 ft tall, 7 ft tall and measure their height and arm length and then fit the regression curve. However, at each height level, you must now choose randomly among those people that meet that criterion. Hence you could deliberately choose to have 1/4 of people who are 4 ft tall, 1/4 who are 5 feet tall, 1/4 who are 6 feet tall, 1/4 who are 7 feet tall, which is quite different from the proportions in the population, but at each height level you must choose people randomly, i.e. don't always choose skinny 4 ft people and over-weight 7 ft people.
Now sample size is a bit more difficult as the required sample size depends both on the number of
streams selected and how they are scattered along the X axis. For example, the highest power occurs
when observations are evenly divided between the very smallest X and very largest X value. However,
without intermediate points, you can’t assess linearity very well. So you will want points scattered
around the range of X values.
If you have some preliminary data, a power/sample size analysis can be done using JMP, SAS, and other packages. If you do a google search for power analysis regression, there are several direct links to examples. Refer to the earlier section of the notes.