
Chapter 14: Correlation and simple linear regression

Contents

14.1 Introduction
14.2 Graphical displays
    14.2.1 Scatterplots
    14.2.2 Smoothers
14.3 Correlation
    14.3.1 Scatter-plot matrix
    14.3.2 Correlation coefficient
    14.3.3 Cautions
    14.3.4 Principles of Causation
14.4 Single-variable regression
    14.4.1 Introduction
    14.4.2 Equation for a line - getting notation straight (no pun intended)
    14.4.3 Populations and samples
    14.4.4 Assumptions
    14.4.5 Obtaining Estimates
    14.4.6 Obtaining Predictions
    14.4.7 Residual Plots
    14.4.8 Example - Yield and fertilizer
    14.4.9 Example - Mercury pollution
    14.4.10 Example - The Anscombe Data Set
    14.4.11 Transformations
    14.4.12 Example: Monitoring Dioxins - transformation
    14.4.13 Example: Weight-length relationships - transformation
    14.4.14 Power/Sample Size
    14.4.15 The perils of R2
14.5 A no-intercept model: Fulton's Condition Factor K
14.6 Frequently Asked Questions - FAQ
    14.6.1 Do I need a random sample; power analysis

The suggested citation for this chapter of notes is:

Schwarz, C. J. (2015). Correlation and simple linear regression. In Course Notes for Beginning and Intermediate Statistics. Available at http://www.stat.sfu.ca/~cschwarz/CourseNotes. Retrieved 2015-08-20.

14.1 Introduction

A nice book explaining how to use JMP to perform regression analysis is:

Freund, R., Littell, R., and Creighton, L. (2003). Regression Using JMP. Wiley Interscience.

Much of statistics is concerned with relationships among variables and whether observed relationships are real or simply due to chance. The simplest case deals with the relationship between two variables. Quantifying the relationship between two variables depends upon the scale of measurement of each of the two variables. The following table summarizes some of the important analyses that are often performed to investigate the relationship between two variables.
The type of analysis depends on the scales of the two variables:

Y is Interval or Ratio (what JMP calls Continuous):
    X is Interval or Ratio (Continuous): scatterplots; running median/spline fits; regression; correlation.
    X is Nominal or Ordinal: side-by-side dot plots; side-by-side box plots; ANOVA or t-tests.

Y is Nominal or Ordinal:
    X is Interval or Ratio (Continuous): logistic regression.
    X is Nominal or Ordinal: mosaic charts; contingency tables; chi-square tests.

In JMP these combinations of two variables are analyzed with the Analyze->Fit Y-by-X platform, the Analyze->Correlation-of-Ys platform, or the Analyze->Fit Model platform.

When analyzing two variables, one question is important because it determines the type of analysis that will be done: is the purpose to explore the nature of the relationship, or to use one variable to explain variation in another variable? For example, there is a difference between examining height and weight to see if there is a strong relationship, as opposed to using height to predict weight. Consequently, you need to distinguish between a correlational analysis, in which only the strength of the relationship is described, and regression, in which one variable is used to predict the values of a second variable.

The two variables are often called either a response variable or an explanatory variable. A response variable (also known as a dependent or Y variable) measures the outcome of a study. An explanatory variable (also known as an independent or X variable) attempts to explain the observed outcomes.

14.2 Graphical displays

14.2.1 Scatterplots

The scatter-plot is the primary graphical tool used when exploring the relationship between two interval or ratio scale variables. It is obtained in JMP using the Analyze->Fit Y-by-X platform – be sure that both variables have a continuous scale.
In graphing the relationship, the response variable is usually plotted along the vertical axis (the Y axis) and the explanatory variable is plotted along the horizontal axis (the X axis). It is not always perfectly clear which is the response and which is the explanatory variable. If there is no distinction between the two variables, then it doesn't matter which variable is plotted on which axis – this usually happens only when finding the correlation between variables is the primary purpose. For example, look at the relationship between calories/serving and fat from the cereal dataset using JMP. [We will create the graph in class at this point.]

What to look for in a scatter-plot

Overall pattern. What is the direction of association? A positive association occurs when above-average values of one variable tend to be associated with above-average values of the other. The plot will have an upward slope. A negative association occurs when above-average values of one variable are associated with below-average values of the other. The plot will have a downward slope. What happens when there is "no association" between the two variables?

Form of the relationship. Does a straight line seem to fit through the 'middle' of the points? Is the relationship linear (the points seem to cluster around a straight line) or curvi-linear (the points seem to form a curve)?

Strength of association. Are the points clustered tightly around the curve? If the points have a lot of scatter above and below the trend line, then the association is not very strong. On the other hand, if the amount of scatter above and below the trend line is very small, then there is a strong association.

Outliers. Are there any points that seem to be unusual? Outliers are values that are unusually far from the trend curve - i.e., they are further away from the trend curve than you would expect from the usual level of scatter. There is no formal rule for detecting outliers - use common sense.
[If you set the role of a variable to be a label and click on points in a linked graph, the label for the point will be displayed, making it easy to identify such points.]

One's usual initial suspicion about any outlier is that it is a mistake, e.g., a transcription error. Every effort should be made to trace the data back to its original source and correct the value if possible. If the data value appears to be correct, then you have a bit of a quandary. Do you keep the data point in even though it doesn't follow the trend line, or do you drop the data point because it appears to be anomalous? Fortunately, with computers it is relatively easy to repeat an analysis with and without an outlier - if there is very little difference in the final outcome, don't worry about it.

In some cases, the outliers are the most interesting part of the data. For example, for many years the ozone hole over the Antarctic was missed because the computers were programmed to ignore readings that were so low that 'they must be in error'!

Lurking variables. A lurking variable is a third variable that is related to both variables and may confound the association. For example, the amount of chocolate consumed in Canada and the number of automobile accidents are positively related, but most people would agree that this is coincidental and each variable is independently driven by population growth.

Sometimes the lurking variable is a 'grouping' variable of sorts. This is often examined by using a different plotting symbol to distinguish between the values of the third variable. For example, consider the following plot of the relationship between salary and years of experience for nurses. The individual lines show a positive relationship, but the overall pattern, when the data are pooled, shows a negative relationship.
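This pooled-versus-within-group reversal is easy to demonstrate numerically. The sketch below uses invented data (the numbers are made up for illustration and do not come from any nurse salary study): each group shows salary rising with experience, but the group with less experience sits on a higher pay scale, so the pooled correlation is negative.

```python
import numpy as np

rng = np.random.default_rng(42)

# Group 1: less experience, but a higher pay scale (invented numbers).
exp1 = rng.uniform(0, 10, 50)
sal1 = 80 + 1.0 * exp1 + rng.normal(0, 1, 50)

# Group 2: more experience, but a lower pay scale.
exp2 = rng.uniform(10, 20, 50)
sal2 = 40 + 1.0 * exp2 + rng.normal(0, 1, 50)

# Within each group, salary rises with experience...
r1 = np.corrcoef(exp1, sal1)[0, 1]
r2 = np.corrcoef(exp2, sal2)[0, 1]

# ...but pooling the two groups reverses the apparent direction.
r_pooled = np.corrcoef(np.concatenate([exp1, exp2]),
                       np.concatenate([sal1, sal2]))[0, 1]
```

Plotting the two groups with different markers (as JMP does, below) is what reveals the reversal; a single pooled scatter-plot hides it.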
It is easy in JMP to assign different plotting symbols (what JMP calls markers) to different points. From the Row menu, use Where to select rows. Then assign markers to those rows using the Rows->Markers menu.

14.2.2 Smoothers

Once the scatter-plot is plotted, it is natural to try to summarize the underlying trend line. For example, consider the following data:

There are several common methods available to fit a line through this data.

By eye

The eye has remarkable power for providing a reasonable approximation to an underlying trend, but it needs a little education. A trend curve is a good summary of a scatter-plot if the differences between the individual data points and the underlying trend line (technically called residuals) are small. As well, a good trend curve tries to minimize the total of the residuals, and the trend line should try to go through the middle of most of the data. Although the eye often gives a good fit, different people will draw slightly different trend curves. Several automated ways to derive trend curves are in common use - bear in mind that the best ways of estimating trend curves try to mimic what the eye does so well.

Median or mean trace

The idea is very simple. We choose a "window" width of size w, say. For each point along the bottom (X) axis, the smoothed value is the median or average of the Y-values for all data points with X-values lying within the "window" centered on this point. The trend curve is then the trace of these medians or means over the entire plot. The result is not exactly smooth. Generally, the wider the window chosen, the smoother the result. However, wider windows make the smoother react more slowly to changes in trend. Smoothing techniques are too computationally intensive to be performed by hand. Unfortunately, JMP is unable to compute the trace of data, but splines are a very good alternative (see below).
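The window-based trace described above can be sketched in a few lines. This is a minimal illustration (the window widths and data are invented), not a JMP feature:

```python
import numpy as np

def mean_trace(x, y, width):
    """Running-mean smoother: for each x-value, average the y-values of
    all points whose x lies within a window of the given width centred there."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    half = width / 2.0
    return np.array([y[np.abs(x - xi) <= half].mean() for xi in x])

rng = np.random.default_rng(1)
x = np.linspace(0, 10, 101)
y = np.sin(x) + rng.normal(0, 0.3, x.size)   # noisy curvilinear trend

# Wider windows give a smoother trace but react more slowly to changes.
trace_narrow = mean_trace(x, y, width=0.5)
trace_wide = mean_trace(x, y, width=3.0)
```

A median trace is obtained by replacing `.mean()` with `np.median(...)`; medians are less sensitive to outliers inside the window.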
The mean or median trace is too unsophisticated to be a generally useful smoother. For example, the simple averaging causes it to under-estimate the heights of peaks and over-estimate the heights of troughs. (Can you see why this is so? Draw a picture with a peak.) However, it is a useful way of trying to summarize a pattern in a weak relationship for a moderately large data set. In a very weak relationship it can even help you to see the trend.

Box plots for strips

The following gives a conceptually simple method which is useful for exploring a weak relationship in a large data set. The X-axis is divided into equal-sized intervals, and separate box plots of the Y values are found for each strip. The box-plots are plotted side-by-side and the means or medians are joined. Again, we are able to see what is happening to the variability as well as the trend, and there is even more detailed information available in the box plots about the shape of the Y-distribution, etc. Again, this is too tedious to do by hand. It is possible to make this plot in JMP by creating a new variable that groups the values of the X variable into classes and then using the Analyze->Fit Y-by-X platform with these groupings. This is illustrated below:

Spline methods

A spline is a series of short smooth curves that are joined together to create a larger smooth curve. The computational details are complex, but can be done in JMP. The stiffness of the spline indicates how straight the resulting curve will be. The following shows two spline fits to the same data with different stiffness measures:

14.3 Correlation

WARNING!: Correlation is probably the most abused concept in statistics.
Many people use the word 'correlation' to mean any type of association between two variables, but it has a very strict technical meaning: the strength of an apparent linear relationship between two interval or ratio scaled variables.

The correlation measure does not distinguish between explanatory and response variables and treats the two variables symmetrically. This means that the correlation between Y and X is the same as the correlation between X and Y.

Correlations are computed in JMP using the Analyze->Correlation of Y's platform. If there are several variables, the correlations are organized into a table; each cell in the table shows the correlation of the two corresponding variables. Because of symmetry (the correlation between variable1 and variable2 is the same as between variable2 and variable1), only part of the complete matrix will be shown. As well, the correlation between any variable and itself is always 1.

14.3.1 Scatter-plot matrix

To illustrate the ideas of correlation, look at the FITNESS dataset in the DATAMORE directory of JMP. This is a dataset on 31 people at a fitness centre; the following variables were measured on each subject:

• name
• gender
• age
• weight
• oxygen consumption (high values are typically more fit people)
• time to run one mile (1.6 km)
• average pulse rate during the run
• the resting pulse rate
• maximum pulse rate during the run

We are interested in examining the relationships among the variables. At the moment, ignore the fact that the data contain both genders. [It would be interesting to assign different plotting symbols to the two genders to see if gender is a lurking variable.]

One of the first things to do is to create a scatter-plot matrix of all the variables. Use the Analyze->Correlation of Ys platform to get the following scatter-plot:
Interpreting the scatter plot matrix

The entries in the matrix are scatter-plots for all the pairs of variables. For example, the entry in row 1, column 3 represents the scatter-plot between age and oxygen consumption with age along the vertical axis and oxygen consumption along the horizontal axis, while the entry in row 3, column 1 has age along the horizontal axis and oxygen consumption along the vertical axis.

There is clearly a difference in the 'strength' of relationships. Compare the scatter plot for average running pulse rate and maximum pulse rate (row 5, column 7) to that of running pulse rate and resting pulse rate (row 5, column 6) to that of running pulse rate and weight (row 5, column 2).

Similarly, there is a difference in the direction of association. Compare the scatter plot for the average running pulse rate and maximum pulse rate (row 5, column 7) and that for oxygen consumption and running time (row 3, column 4).

14.3.2 Correlation coefficient

It is possible to quantify the strength of association between two variables. As with all statistics, the way the data are collected influences the meaning of the statistics.

The population correlation coefficient between two variables is denoted by the Greek letter rho (ρ) and is computed as:

    ρ = (1/N) Σ_{i=1..N} [ (X_i − μ_X)/σ_X ] [ (Y_i − μ_Y)/σ_Y ]

The corresponding sample correlation coefficient, denoted r, has a similar form:¹

    r = (1/(n−1)) Σ_{i=1..n} [ (X_i − X̄)/s_X ] [ (Y_i − Ȳ)/s_Y ]

If the sampling scheme is a simple random sample from the corresponding population, then r is an estimate of ρ. This is a crucial assumption. If the sampling is not a simple random sample, the above definition of the sample correlation coefficient should not be used!

It is possible to find a confidence interval for ρ and to perform statistical tests that ρ is zero.
However, for the most part, these are rarely done in ecological research and so will not be pursued further in this course.

The form of the formula does provide some insight into interpreting its value:

• ρ and r (unlike other population parameters) are unitless measures.

• The sign of ρ and r is largely determined by the pairing of each (X, Y) value with the respective means: if both X and Y are above their means, or both are below, the pair contributes a positive value towards ρ or r; if X is above and Y is below, or X is below and Y is above, the pair contributes a negative value.

• ρ and r range from -1 to 1. A value of -1 implies a perfect negative correlation; a value of 1 implies a perfect positive correlation; a value of 0 implies no correlation. A perfect correlation (ρ or r equal to 1 or -1) implies that all points lie exactly on a straight line, but the slope of the line has NO effect on the correlation coefficient. This latter point is IMPORTANT and is often wrongly interpreted - some examples will be given in class.

• ρ and r are unaffected by linear transformations of the individual variables, e.g. unit changes such as converting from imperial to metric units.

• ρ and r measure only the linear association; they are not affected by the slope of the line, but only by the scatter about the line.

Because correlation assumes both variables have an interval or ratio scale, it makes no sense to compute the correlation

• between gender and oxygen consumption (gender is nominal scale data);

• between non-linearly related variables (not shown on graph);

¹ Note that this formula SHOULD NOT be used for the actual computation of r; it is numerically unstable and there are better computing formulae available.
• for data collected without a known probability scheme. If a sampling scheme other than simple random sampling is used, it is possible to modify the estimation formula; if a non-probability sampling scheme was used, the patient is dead on arrival, and no amount of statistical wizardry will revive the corpse.

The data collection scheme for the fitness data set is unknown - we will have to assume that some sort of random sample from the relevant population was taken before we can make much sense of the numbers computed.

Before looking at the details of its computation, look at the sample correlation coefficients for each scatter plot above. These can be arranged into a matrix:

Variable    Age   Weight   Oxy   Runtime RunPulse RstPulse MaxPulse
Age         1.00  -0.24  -0.31    0.19    -0.31    -0.15    -0.41
Weight     -0.24   1.00  -0.16    0.14     0.18     0.04     0.24
Oxy        -0.31  -0.16   1.00   -0.86    -0.39    -0.39    -0.23
Runtime     0.19   0.14  -0.86    1.00     0.31     0.45     0.22
RunPulse   -0.31   0.18  -0.39    0.31     1.00     0.35     0.92
RstPulse   -0.15   0.04  -0.39    0.45     0.35     1.00     0.30
MaxPulse   -0.41   0.24  -0.23    0.22     0.92     0.30     1.00

Notice that the sample correlation between any two variables is the same regardless of the ordering of the variables – this explains the symmetry in the matrix between the above- and below-diagonal elements. As well, each variable has a perfect sample correlation with itself – this explains the values of 1 along the main diagonal. Compare the sample correlations between the average running pulse rate and the other variables to the corresponding scatter-plots above.

14.3.3 Cautions

• Random sampling required. Sample correlation coefficients are only valid under simple random samples. If the data were collected in a haphazard fashion or if certain data points were oversampled, then the correlation coefficient may be severely biased.

• There are examples of high correlation but no practical use, and of low correlation but great practical use. These will be presented in class.
This illustrates why I almost never talk about correlation.

• Correlation measures the 'strength' of a linear relationship; a curvilinear relationship may have a correlation of 0, yet the two variables may still be strongly related.

• The effect of outliers and high leverage points will be presented in class.

• Effects of lurking variables. For example, suppose there is a positive association between wages of male nurses and years of experience, and between wages of female nurses and years of experience, but males are generally paid more than females. There is a positive correlation within each group, but an overall negative correlation when the data are pooled together.

• Ecological fallacy - the problem of correlation applied to averages. Even if there is a high correlation between two variables on their averages, it does not imply that there is a correlation between individual data values. For example, if you look at the average consumption of alcohol and the average consumption of cigarettes, there is a high correlation among the averages when the 12 values from the provinces and territories are plotted on a graph. However, the individual relationships within provinces can be reversed or non-existent, as shown below: the relationship between cigarette consumption and alcohol consumption shows no relationship within each province, yet there is a strong correlation among the per-capita averages. This is an example of the ecological fallacy.

• Correlation does not imply causation. This is the most frequent mistake made by people. There is a set of principles of causal inference that need to be satisfied in order to imply cause and effect.
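Several of the properties and cautions above can be checked numerically. Here is a small sketch with made-up data (nothing below comes from the fitness dataset): the symmetry of r, its invariance under unit changes, and the caution that a perfect curvilinear relationship can have a correlation of zero.

```python
import numpy as np

rng = np.random.default_rng(7)
x = rng.normal(50, 10, 200)
y = 2.0 * x + rng.normal(0, 5, 200)   # strong positive linear relationship
n = len(x)

# Sample correlation straight from the definition (fine for illustration,
# though numerically better computing formulae exist for real work).
r_manual = np.sum((x - x.mean()) / x.std(ddof=1)
                  * (y - y.mean()) / y.std(ddof=1)) / (n - 1)
r = np.corrcoef(x, y)[0, 1]

# Symmetric in X and Y, and unaffected by a linear unit change (in -> cm).
assert np.isclose(r, np.corrcoef(y, x)[0, 1])
assert np.isclose(r, np.corrcoef(x * 2.54, y)[0, 1])

# Caution: r measures only LINEAR association. An exact quadratic
# relationship, symmetric about zero, has a correlation of zero.
u = np.linspace(-1, 1, 101)
r_curve = np.corrcoef(u, u ** 2)[0, 1]
```

The last pair (u, u²) is the classic counterexample: the relationship is perfect, but r says "no (linear) association".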
14.3.4 Principles of Causation

Types of association

An association may be found between two variables for several reasons (show causal modeling figures):

• There may be direct causation, e.g. smoking causes lung cancer.

• There may be a common cause, e.g. ice cream sales and the number of drownings both increase with temperature.

• There may be a confounding factor, e.g. highway fatalities decreased when the speed limits were reduced to 55 mph at the same time that the oil crisis caused supplies to be reduced and people drove fewer miles.

• There may be a coincidence, e.g. the population of Canada has increased at the same time as the moon has gotten closer by a few miles.

Establishing cause-and-effect

How do we establish a cause-and-effect relationship? Bradford Hill (Hill, A. B. 1971. Principles of Medical Statistics, 9th ed. New York: Oxford University Press) outlined 7 criteria that have been adopted by many epidemiological researchers. It is generally agreed that most or all of the following must be considered before causation can be declared.

Strength of the association. The stronger an observed association appears over a series of different studies, the less likely it is that the association is spurious because of bias.

Dose-response effect. The value of the response variable changes in a meaningful way with the dose (or level) of the suspected causal agent.

Lack of temporal ambiguity. The hypothesized cause precedes the occurrence of the effect. The ability to establish this time pattern will depend upon the study design used.

Consistency of the findings. Most, or all, studies concerned with a given causal hypothesis produce similar findings. Of course, studies dealing with a given question may all have serious bias problems that can diminish the importance of observed associations.

Biological or theoretical plausibility. The hypothesized causal relationship is consistent with current biological or theoretical knowledge.
Note that the current state of knowledge may be insufficient to explain certain findings.

Coherence of the evidence. The findings do not seriously conflict with accepted facts about the outcome variable being studied.

Specificity of the association. The observed effect is associated with only the suspected cause (or with few other causes that can be ruled out).

IMPORTANT: NO CAUSATION WITHOUT MANIPULATION!

Examples: Discuss the above in relation to:

• amount of studying vs. grades in a course.
• amount of clear cutting and sediments in water.
• fossil fuel burning and the greenhouse effect.

14.4 Single-variable regression

14.4.1 Introduction

Along with the Analysis of Variance, this is likely the most commonly used statistical methodology in ecological research. In virtually every issue of an ecological journal, you will find papers that use a regression analysis.

There are HUNDREDS of books written on regression analysis. Some of the better ones (IMHO) are:

Draper and Smith. Applied Regression Analysis. Wiley.
Neter, Wasserman, and Kutner. Applied Linear Statistical Models. Irwin.
Kleinbaum, Kupper, Miller. Applied Regression Analysis. Duxbury.
Zar. Biostatistics. Prentice Hall.

Consequently, this set of notes is VERY brief and makes no pretense of being a thorough review of regression analysis. Please consult the above references for all the gory details.

It turns out that both Analysis of Variance and Regression are special cases of a more general statistical methodology called General Linear Models, which in turn are special cases of Generalized Linear Models (covered in Stat 402/602), which in turn are special cases of Generalized Additive Models, which in turn are special cases of .....

The key difference between a regression analysis and an ANOVA is that the X variable is nominal scaled in ANOVA, while in regression analysis the X variable is continuous scaled.
This implies that in ANOVA the shape of the response profile is unspecified (the null hypothesis is that all means are equal, while the alternative is that at least one mean differs), while in regression the response profile must be a straight line. Because both ANOVA and regression are from the same class of statistical models, many of the assumptions are similar, the fitting methods are similar, and hypothesis testing and inference are similar as well.

14.4.2 Equation for a line - getting notation straight (no pun intended)

In order to use regression analysis effectively, it is important that you understand the concepts of slopes and intercepts and how to determine these from data values. This will be QUICKLY reviewed here in class.

In previous courses at high school or in linear algebra, the equation of a straight line was often written y = mx + b, where m is the slope and b is the intercept. In some popular spreadsheet programs, the authors decided to write the equation of a line as y = a + bx; now a is the intercept and b is the slope. Statisticians, for good reasons, have rationalized this notation and usually write the equation of a line as y = β0 + β1 x or as Y = b0 + b1 X (the distinction between β0 and b0 will be made clearer in a few minutes). The use of the subscript 0 to represent the intercept and the subscript 1 to represent the coefficient for the X variable then readily extends to more complex cases.

Review the definition of the intercept as the value of Y when X = 0, and the slope as the change in Y per unit change in X.

14.4.3 Populations and samples

All of statistics is about detecting signals in the face of noise and estimating population parameters from samples. Regression is no different.

First consider the population. As in previous chapters, the correct definition of the population is an important part of any study.
Conceptually, we can think of the large set of all units of interest. On each unit there are, conceptually, both an X and a Y variable present. We wish to summarize the relationship between Y and X, and furthermore wish to make predictions of the Y value for future X values that may be observed from this population. [This is analogous to having different treatment groups corresponding to different values of X in ANOVA.]

If this were physics, we might conceive of a physical law between X and Y, e.g. F = ma or PV = nRT. However, in ecology, the relationship between Y and X is much more tenuous. If you could draw a scatter-plot of Y against X for ALL elements of the population, the points would NOT fall exactly on a straight line. Rather, the value of Y would fluctuate above or below a straight line at any given X value. [This is analogous to saying that Y varies randomly around the treatment group mean in ANOVA.]

We denote this relationship as

    Y = β0 + β1 X + ε

where now β0 and β1 are the POPULATION intercept and slope respectively. We say that E[Y] = β0 + β1 X is the expected or average value of Y at X. [In ANOVA, we let each treatment group have its own mean; here in regression we assume that the means must fall on a straight line.]

The term ε represents random variation of individual units in the population above and below the expected value. It is assumed to have constant standard deviation over the entire regression line (i.e. the spread of data points in the population is constant over the entire regression line). [This is analogous to the assumption of equal treatment population standard deviations in ANOVA.]

Of course, we can never measure all units of the population. So a sample must be taken in order to estimate the population slope, population intercept, and population standard deviation.
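The population model can be made concrete with a small simulation; the parameter values below are invented for illustration. Fitting a least-squares line to the sample (estimation details come later in this chapter) gives estimates b0 and b1 close to the population values β0 and β1:

```python
import numpy as np

rng = np.random.default_rng(2015)
beta0, beta1, sigma = 10.0, 3.0, 2.0   # population intercept, slope, sd

# The X values need not be randomly sampled - here they sit on a grid.
x = np.linspace(0, 20, 100)
eps = rng.normal(0, sigma, x.size)     # epsilon: scatter about the line
y = beta0 + beta1 * x + eps            # Y = beta0 + beta1*X + epsilon

# Least-squares estimates (np.polyfit returns slope first for deg=1).
b1, b0 = np.polyfit(x, y, deg=1)
```

With a larger sample (or a smaller σ), b0 and b1 move closer to β0 and β1; the key distinction is that b0 and b1 are statistics computed from the sample, while β0 and β1 are fixed but unknown population parameters.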
Unlike a correlation analysis, it is NOT necessary to select a simple random sample from the entire population, and more elaborate schemes can be used. The bare minimum that must be achieved is that, for any individual X value found in the sample, the units in the population that share this X value must have been selected at random. This is quite a relaxed assumption! For example, it allows us to deliberately choose values of X from the extremes and then, only at those X values, randomly select from the relevant subset of the population, rather than having to select at random from the population as a whole. [This is analogous to the assumptions made in an analytical survey, where we assumed that even though we can't randomly assign a treatment to a unit (e.g. we can't assign sex to an animal), we must ensure that animals are randomly selected from each group.]

Once the data points are selected, the estimation process can proceed, but not before assessing the assumptions!

14.4.4 Assumptions

The assumptions for a regression analysis are very similar to those found in ANOVA.

Linearity

Regression analysis assumes that the relationship between Y and X is linear. Make a scatter-plot of Y against X to assess this assumption. Perhaps a transformation is required (e.g. log(Y) vs. log(X)); some caution is required with transformations in dealing with the error structure, as you will see in later examples. Plot the residuals vs. the X values. If the scatter is not random around 0 but shows some pattern (e.g. a quadratic curve), this usually indicates that the relationship between Y and X is not linear. Alternatively, fit a model that includes X and X² and test if the coefficient associated with X² is zero; unfortunately, this test could fail to detect a higher-order relationship. Third, if there are multiple readings at some X values, then a test of goodness-of-fit can be performed where the variation of the responses at the same X value is compared to the variation around the regression line.
Correct scale of predictor and response

The response and predictor variables must both have interval or ratio scale. In particular, using a numerical value to represent a category and then using this numerical value in a regression is not valid. For example, suppose that you code hair color as (1 = red, 2 = brown, and 3 = black). Then using these values in a regression, either as a predictor variable or as a response variable, is not sensible.

Correct sampling scheme

The Y must be a random sample from the population of Y values for every X value in the sample. Fortunately, it is not necessary to have a completely random sample from the population, as the regression line is valid even if the X values are deliberately chosen. However, for a given X, the values from the population must be a simple random sample.

No outliers or influential points

All the points must belong to the relationship – there should be no unusual points. The scatter-plot of Y vs. X should be examined. If in doubt, fit the model with the points in and out of the fit and see if this makes a difference in the fit. Outliers can have a dramatic effect on the fitted line. For example, in the following graph, the single point is both an outlier and an influential point:

Equal variation along the line

The variability about the regression line is similar for all values of X, i.e. the scatter of the points above and below the fitted line should be roughly constant over the entire line. This is assessed by looking at the plots of the residuals against X to see if the scatter is roughly uniform around zero with no increase and no decrease in spread over the entire line.

Independence

Each value of Y is independent of any other value of Y. The most common case where this fails is time series data where X is a time measurement. In these cases, time series analysis should be used.
This assumption can be assessed by again looking at residual plots against time or other variables.

Normality of errors

The difference between the value of Y and the expected value of Y is assumed to be normally distributed. This is one of the most misunderstood assumptions. Many people erroneously assume that the distribution of Y over all X values must be normally distributed, i.e. they look simply at the distribution of the Y’s ignoring the Xs. The assumption only states that the residuals, the differences between the values of Y and the corresponding points on the line, must be normally distributed. This can be assessed by looking at normal probability plots of the residuals. As in ANOVA, for small sample sizes you have little power to detect non-normality, and for large sample sizes it is not that important.

X measured without error

This is a new assumption for regression as compared to ANOVA. In ANOVA, the group membership was always “exact”, i.e. the treatment applied to an experimental unit was known without ambiguity. However, in regression, it can turn out that the X value may not be known exactly. This general problem is called the “error in variables” problem and has a long history in statistics.

It turns out that there are two important cases. If the value reported for X is a nominal value and the actual value of X varies randomly around this nominal value, then there is no bias in the estimates. This is called the Berkson case after Berkson who first examined this situation. The most common cases are where the recorded X is a target value (e.g. temperature as set by a thermostat) while the actual X that occurs varies randomly around this target value.

However, if the value used for X is an actual measurement of the underlying X, then there is uncertainty in both the X and Y directions. In this case, estimates of the slope are attenuated towards zero (i.e.
positive slopes are biased downwards, negative slopes biased upwards). More alarmingly, the estimates are no longer consistent, i.e. as the sample size increases, the estimates no longer tend to the population values! For example, suppose that the yield of a crop is related to the amount of rainfall. A rain gauge may not be located exactly at the plot where the crop is grown; rainfall may instead be recorded at a nearby weather station a fair distance away. The reading at the weather station is NOT a true reflection of the rainfall at the test plot.

This latter case of “error in variables” is very difficult to analyze properly and there are no universally accepted solutions. Refer to the reference books listed at the start of this chapter for more details. The problem is set up as follows. Let

Yi = ηi + εi
Xi = ξi + δi

with the straight-line relationship between the population (but unobserved) values:

ηi = β0 + β1 ξi

Note that the (population, but unknown) regression equation uses ξi rather than the observed (with error) values Xi. Now if the regression is done on the observed X (i.e. the error-prone measurement), the regression equation reduces to:

Yi = β0 + β1 Xi + (εi − β1 δi)

This violates the independence assumption of ordinary least squares because the new “error” term is not independent of the Xi variable. If an ordinary least squares model is fit, the estimated slope is biased (Draper and Smith, 1998, p. 90) with

E[β̂1] = β1 − β1 r(ρ + r) / (1 + 2ρr + r²)

where ρ is the correlation between ξ and δ, and r is the ratio of the variance of the error in X to the variance of the error in Y. The bias is negative, i.e. the estimated slope is too small, in most practical cases (ρ + r > 0). This is known as attenuation of the estimate and, in general, pulls the estimate towards zero.

The bias will be small in the following cases:

• the error variance of X is small relative to the error variance in Y.
This means that r is small (i.e. close to zero), and so the bias is also small. In the case where X is measured without error, r = 0 and the bias vanishes as expected.

• if the X are fixed (the Berkson case) and actually used2, then ρ + r = 0 and the bias also vanishes.

The proper analysis of the error-in-variables case is quite complex – see Draper and Smith (1998, p. 91) for more details.

14.4.5 Obtaining Estimates

To distinguish between population parameters and sample estimates, we denote the sample intercept by b0 and the sample slope by b1. The equation fit to a particular sample of points is expressed as

Ŷi = b0 + b1 Xi

where b0 is the estimated intercept and b1 is the estimated slope. The symbol Ŷ indicates that we are referring to the estimated line and not to a line in the entire population.

How is the best fitting line found when the points are scattered? We typically use the principle of least squares. The least-squares line is the line that makes the sum of the squares of the deviations of the data points from the line in the vertical direction as small as possible.

Mathematically, the least-squares line is the line that minimizes Σ(i=1..n) (Yi − Ŷi)², where Ŷi is the point on the line corresponding to each X value. This is also known as the predicted value of Y for a given value of X. This formal definition of least squares is not that important – the concept as expressed in the previous paragraph is more important. In particular, it is the SQUARED deviation in the VERTICAL direction that is used.

It is possible to write out a formula for the estimated intercept and slope, but who cares – let the computer do the dirty work.

The estimated intercept (b0) is the estimated value of Y when X = 0. In some cases, it is meaningless to talk about values of Y when X = 0 because X = 0 is nonsensical. For example, in a plot of income vs. year, it seems kind of silly to investigate income in year 0.
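Returning to the errors-in-variables attenuation discussed above, the effect is easy to demonstrate by simulation. The sketch below (not from the notes; all names and parameter values are hypothetical) generates data where the true slope is 2 but X is observed with error whose variance equals that of the underlying ξ; classical errors-in-variables theory then predicts the fitted slope shrinks by roughly the factor Var(ξ)/(Var(ξ) + Var(δ)) = 0.5:

```python
import random

random.seed(42)

beta0, beta1 = 1.0, 2.0      # true (hypothetical) intercept and slope
n = 50_000

xi    = [random.gauss(0, 1) for _ in range(n)]   # true, unobserved X
delta = [random.gauss(0, 1) for _ in range(n)]   # measurement error in X
eps   = [random.gauss(0, 1) for _ in range(n)]   # error in Y

x = [a + d for a, d in zip(xi, delta)]           # observed, error-prone X
y = [beta0 + beta1 * a + e for a, e in zip(xi, eps)]

# Ordinary least-squares slope computed on the observed (x, y) pairs
xbar = sum(x) / n
ybar = sum(y) / n
sxy = sum((xv - xbar) * (yv - ybar) for xv, yv in zip(x, y))
sxx = sum((xv - xbar) ** 2 for xv in x)
slope = sxy / sxx

print(slope)   # close to 1.0 = 2.0 * 0.5, not the true slope 2.0
```

No matter how large n becomes, the fitted slope stays near 1.0 rather than converging to the true slope of 2.0, which is exactly the inconsistency described above.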
In these cases, there is no clear interpretation of the intercept, and it merely serves as a placeholder for the line.

The estimated slope (b1) is the estimated change in Y per unit change in X. For every unit change in the horizontal direction, the fitted line increases by b1 units. If b1 is negative, the fitted line points downwards, and the “increase” in the line is negative, i.e. actually a decrease.

As with all estimates, a measure of precision can be obtained. As before, this is the standard error of each of the estimates. Again, there are computational formulae, but in this age of computers, these are not important. As before, approximate 95% confidence intervals for the corresponding population parameters are found as estimate ± 2 × se.

Formal tests of hypotheses can also be done. Usually, these are only done on the slope parameter as this is typically of most interest. The null hypothesis is that the population slope is 0, i.e. there is no relationship between Y and X (can you draw a scatter-plot showing such a relationship?). More formally, the null hypothesis is:

H: β1 = 0

Again notice that the null hypothesis is ALWAYS in terms of a population parameter and not in terms of a sample statistic. The alternate hypothesis is typically chosen as:

A: β1 ≠ 0

although one-sided tests looking for either a positive or negative slope are possible. The test statistic is found as

T = (b1 − 0) / se(b1)

and is compared to a t-distribution with the appropriate degrees of freedom to obtain the p-value. This is usually automatically done by most computer packages. The p-value is interpreted in exactly the same way as in ANOVA, i.e.

2 For example, a thermostat measures (with error) the actual temperature of a room. But if the experiment is based on the thermostat readings rather than the (true) unknown temperature, this corresponds to the Berkson case.
it measures the probability of observing this data if the hypothesis of no relationship were true. As before, the p-value does not tell the whole story; statistical vs. biological (non)significance must be determined and assessed.

14.4.6 Obtaining Predictions

Once the best fitting line is found, it can be used to make predictions for new values of X.

There are two types of predictions that are commonly made. It is important to distinguish between them, as these two intervals are the source of much confusion in regression problems. First, the experimenter may be interested in predicting a SINGLE future individual value for a particular X. Second, the experimenter may be interested in predicting the AVERAGE of ALL future responses at a particular X.3

The prediction interval for an individual response is sometimes called a confidence interval for an individual response, but this is an unfortunate (and incorrect) use of the term confidence interval. Strictly speaking, confidence intervals are computed for fixed unknown parameter values; prediction intervals are computed for future random variables. Both of the above intervals should be distinguished from the confidence interval for the slope.

In both cases, the estimate is found in the same manner – substitute the new value of X into the equation and compute the predicted value Ŷ. In most computer packages this is accomplished by inserting a new “dummy” observation in the dataset with the value of Y missing, but the value of X present. The missing Y value prevents this new observation from being used in the fitting process, but the X value allows the package to compute an estimate for this observation.

What differs between the two predictions are the estimates of uncertainty. In the first case, there are two sources of uncertainty involved in the prediction. First, there is the uncertainty caused by the fact that this estimated line is based upon a sample.
Then there is the additional uncertainty that the individual value could be above or below the predicted line. This interval is often called a prediction interval at a new X.

3 There is actually a third interval, the mean of the next “m” individual values, but this is rarely encountered in practice.

In the second case, only the uncertainty caused by estimating the line based on a sample is relevant. This interval is often called a confidence interval for the mean at a new X.

The prediction interval for an individual response is typically MUCH wider than the confidence interval for the mean of all future responses because it must account for the uncertainty from the fitted line plus individual variation around the fitted line.

Many textbooks have the formulae for the se for the two types of predictions, but again, there is little to be gained by examining them. What is important is that you read the documentation carefully to ensure that you understand exactly what interval is being given to you.

14.4.7 Residual Plots

After the curve is fit, it is important to examine if the fitted curve is reasonable. This is done using residuals. The residual for a point is the difference between the observed value and the predicted value, i.e. the residual from fitting a straight line is found as:

residuali = Yi − (b0 + b1 Xi) = (Yi − Ŷi)

There are several standard residual plots:
• plot of residuals vs. predicted (Ŷ);
• plot of residuals vs. X;
• plot of residuals vs. time ordering.

In all cases, the residual plots should show random scatter around zero with no obvious pattern. Don’t plot residuals vs. Y – this will lead to odd-looking plots which are an artifact of the plot and don’t mean anything.

14.4.8 Example - Yield and fertilizer

We wish to investigate the relationship between yield (Liters) and fertilizer (kg/ha) for tomato plants.
An experiment was conducted in the Schwarz household one summer on 11 plots of land where the amount of fertilizer was varied and the yield measured at the end of the season. The amount of fertilizer (randomly) applied to each plot was chosen between 5 and 18 kg/ha. While the levels were not systematically chosen (e.g. they were not evenly spaced between the highest and lowest values), they represent commonly used amounts based on a preliminary survey of producers.

At the end of the experiment, the yields were measured and the following data were obtained. Interest also lies in predicting the yield when 16 kg/ha are assigned.

Fertilizer   Yield
(kg/ha)     (Liters)
    12         24
     5         18
    15         31
    17         33
    14         30
     6         20
    11         25
    13         27
    15         31
     8         21
    18         29

The raw data is also available in a JMP datasheet called fertilizer.jmp available from the Sample Program Library at http://www.stat.sfu.ca/~cschwarz/Stat-650/Notes/MyPrograms.

In this study, it is quite clear that the fertilizer is the predictor (X) variable, while the response variable (Y) is the yield.

The population consists of all possible field plots with all possible tomato plants of this type grown under all possible fertilizer levels between about 5 and 18 kg/ha. If all of the population could be measured (which it can’t), you could find a relationship between the yield and the amount of fertilizer applied. This relationship would have the form:

Y = β0 + β1 × (amount of fertilizer) + ε

where β0 and β1 represent the population intercept and population slope respectively. The term ε represents random variation that is always present, i.e. even if the same plot was grown twice in a row with the same amount of fertilizer, the yield would not be identical (why?).
The population parameters to be estimated are β0, the population average yield when the amount of fertilizer is 0, and β1, the population average change in yield per unit change in the amount of fertilizer. These are taken over all plants in all possible field plots of this type. The values of β0 and β1 are impossible to obtain as the entire population could never be measured.

Here is the data entered into a JMP data sheet. Note the scale of both variables (continuous) and that an extra row was added to the data table with the value of 16 for the fertilizer and the yield left missing.

The ordering of the rows in the data table is NOT important; however, it is often easier to find individual data points if the data is sorted by the X value and the rows for future predictions are placed at the end of the dataset. Notice how missing values are represented.

Use the Analyze->Fit Y-by-X platform to start the analysis. Specify the Y and X variables as needed. Notice that JMP “reminds” you of the analysis that you will obtain based on the scale of the X and Y variables, as shown in the bottom left of the menu. In this case, both X and Y have a continuous scale, so JMP will perform a bi-variate fitting procedure.

It starts by showing the scatter-plot between Yield (Y) and fertilizer (X). The relationship looks approximately linear; there don’t appear to be any outliers or influential points; and the scatter appears to be roughly equal across the entire regression line. Residual plots will be used later to check these assumptions in more detail.

The drop-down menu item (from the red triangle beside the Bivariate Fit...) allows you to fit the least-squares line. This produces much output, but the three important parts of the output are discussed below.
First, the actual fitted line is drawn on the scatter plot, and the equation of the fitted line is printed below the plot. The estimated regression line is

Ŷ = b0 + b1(fertilizer) = 12.856 + 1.10137(amount of fertilizer)

In terms of estimates, b0 = 12.856 is the estimated intercept, and b1 = 1.101 is the estimated slope.

The estimated slope is the estimated change in yield when the amount of fertilizer is increased by 1 unit. In this case, the yield is expected to increase (why?) by 1.10137 L when the fertilizer amount is increased by 1 kg/ha. NOTE that the slope is the CHANGE in Y when X increases by 1 unit - not the value of Y when X = 1.

The estimated intercept is the estimated yield when the amount of fertilizer is 0. In this case, the estimated yield when no fertilizer is added is 12.856 L. In this particular case the intercept has a meaningful interpretation, but I’d be worried about extrapolating outside the range of the observed X values. If the intercept is 12.85, why does the line intersect the left part of the graph at about 15 rather than closer to 13?

Once again, these are the results from a single experiment. If the experiment were repeated, you would obtain different estimates (b0 and b1 would change). The sampling distribution over all possible experiments would describe the variation in b0 and b1 over all possible experiments. The standard deviation of b0 and b1 over all possible experiments is again referred to as the standard error of b0 and b1.

The formulae for the standard errors of b0 and b1 are messy, and hopeless to compute by hand. And just like inference for a mean or a proportion, the program automatically computes the se of the regression estimates. The estimated standard error for b1 (the estimated slope) is 0.132 L/kg.
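The fitted values quoted above can be checked outside JMP. Here is a sketch in Python (not part of the original notes) applying the closed-form least-squares formulas to the 11 fertilizer/yield pairs:

```python
fert   = [12, 5, 15, 17, 14, 6, 11, 13, 15, 8, 18]      # kg/ha
yield_ = [24, 18, 31, 33, 30, 20, 25, 27, 31, 21, 29]   # Liters

n = len(fert)
xbar = sum(fert) / n
ybar = sum(yield_) / n
sxx = sum((x - xbar) ** 2 for x in fert)
sxy = sum((x - xbar) * (y - ybar) for x, y in zip(fert, yield_))

b1 = sxy / sxx           # estimated slope
b0 = ybar - b1 * xbar    # estimated intercept
print(round(b0, 3), round(b1, 5))   # 12.856 1.10137
```

The computed intercept and slope agree with the JMP output to the digits shown.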
This is an estimate of the standard deviation of b1 over all possible experiments. Normally, the intercept is of limited interest, but a standard error can also be found for it as shown in the above table.

Using exactly the same logic as when we found a confidence interval for the population mean or for the population proportion, a confidence interval for the population slope (β1) is found (approximately) as b1 ± 2 × (estimated se). In the above example, an approximate confidence interval for β1 is found as

1.101 ± 2 × (.132) = 1.101 ± .264 = (.837 → 1.365) L/kg

of fertilizer applied. An “exact” confidence interval can be computed by JMP as shown above.4 The “exact” confidence interval is based on the t-distribution and is slightly wider than our approximate confidence interval because the total sample size (11 pairs of points) is rather small. We interpret this interval as ‘being 95% confident that the population increase in yield when the amount of fertilizer is increased by one unit is somewhere between .837 and 1.365 L/kg.’

Be sure to carefully distinguish between β1 and b1. Note that the confidence interval is computed using b1, but is a confidence interval for β1 - the population parameter that is unknown.

In linear regression problems, one hypothesis of interest is whether the population slope is zero. This would correspond to no linear relationship between the response and predictor variable (why?). Again, this is a good time to read the papers by Cherry and Johnson about the dangers of uncritical use of hypothesis testing. In many cases, a confidence interval tells the entire story.

JMP produces a test of the hypothesis that each of the parameters (the slope and the intercept in the population) is zero. The output is reproduced again below. The test of hypothesis about the intercept is not of interest (why?).

Let
• β1 be the population (unknown) slope.
• b1 be the estimated slope. In this case b1 = 1.1014.

The hypothesis testing proceeds as follows.
Again note that we are interested in the population parameters and not the sample statistics.

1. Specify the null and alternate hypotheses:

H: β1 = 0
A: β1 ≠ 0

Notice that the null hypothesis is in terms of the population parameter β1. This is a two-sided test as we are interested in detecting differences from zero in either direction.

2. Find the test statistic and the p-value. The test statistic is computed as:

T = (estimate − hypothesized value) / estimated se = (1.1014 − 0) / .132 = 8.36

In other words, the estimate is over 8 standard errors away from the hypothesized value! This is compared to a t-distribution with n − 2 = 9 degrees of freedom. The p-value is found to be very small (less than 0.0001).

3. Conclusion. There is strong evidence that the population slope is not zero. This is not too surprising given that the 95% confidence intervals show that plausible values for the population slope are from about .8 to about 1.4.

It is possible to construct tests of the slope equal to some value other than 0. Most packages can’t do this. You would compute the T value as shown above, replacing the value 0 with the hypothesized value.

It is also possible to construct one-sided tests. Most computer packages only do two-sided tests. Proceed as above, but the one-sided p-value is the two-sided p-value reported by the packages divided by 2.

If sufficient evidence is found against the hypothesis, a natural question to ask is ‘well, what values of the parameter are plausible given this data?’. This is exactly what a confidence interval tells you. Consequently, I usually prefer to find confidence intervals rather than doing formal hypothesis testing.

4 If your table doesn’t show the confidence interval, use a Control-Click or Right-Click in the table and select the columns to be displayed.
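The standard error and test statistic quoted above can also be reproduced from the raw data. The sketch below (Python, not from the notes) uses the usual textbook formula se(b1) = s/√Sxx, where s is the residual standard deviation:

```python
import math

fert   = [12, 5, 15, 17, 14, 6, 11, 13, 15, 8, 18]
yield_ = [24, 18, 31, 33, 30, 20, 25, 27, 31, 21, 29]

n = len(fert)
xbar = sum(fert) / n
ybar = sum(yield_) / n
sxx = sum((x - xbar) ** 2 for x in fert)
b1 = sum((x - xbar) * (y - ybar) for x, y in zip(fert, yield_)) / sxx
b0 = ybar - b1 * xbar

sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(fert, yield_))
s = math.sqrt(sse / (n - 2))    # residual standard deviation
se_b1 = s / math.sqrt(sxx)      # standard error of the slope
T = (b1 - 0) / se_b1            # test statistic for H: slope = 0

print(round(se_b1, 3), round(T, 2))   # 0.132 8.36
```

This matches the JMP output: the slope is about 8.36 standard errors from zero, so the p-value from a t-distribution with 9 df is far below 0.0001.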
What about making predictions for future yields when certain amounts of fertilizer are applied? For example, what would be the future yield when 16 kg/ha of fertilizer are applied?

The predicted value is found by substituting the new X into the estimated regression line:

Ŷ = b0 + b1(fertilizer) = 12.856 + 1.10137(16) = 30.48 L

This can also be found by using the cross-hairs tool on the actual graph (to be demonstrated in class). JMP can compute the predicted value by selecting the appropriate option under the drop-down menu item in the Linear Fit item, and then going back to look at the new column in the data table.

As noted earlier, there are two types of estimates of precision associated with predictions using the regression line. It is important to distinguish between them as these two intervals are the source of much confusion in regression problems.

First, the experimenter may be interested in predicting a single FUTURE individual value for a particular X. This would correspond to the predicted yield for a single future plot with 16 kg/ha of fertilizer added. Second, the experimenter may be interested in predicting the average of ALL FUTURE responses at a particular X. This would correspond to the average yield for all future plots when 16 kg/ha of fertilizer is added.

The prediction interval for an individual response is sometimes called a confidence interval for an individual response, but this is an unfortunate (and incorrect) use of the term confidence interval. Strictly speaking, confidence intervals are computed for fixed unknown parameter values; prediction intervals are computed for future random variables.
Both intervals can be computed and plotted by JMP by again using the pop-down menu beside the Linear Fit box. In this menu, the Confid Curves Fit item corresponds to confidence intervals for the MEAN response, while the Confid Curves Indiv item corresponds to prediction intervals for a future single response. Both can be plotted on the graph. Unfortunately, there does not appear to be a way to save the prediction limits into a data table from this platform - the cross-hairs tool must be used, or the Analyze->Fit Model platform should be used.

The innermost set of lines represents the confidence bands for the mean response. The outermost band of lines represents the prediction intervals for a single future response. As noted earlier, the latter must be wider than the former to account for an additional source of variation.

The numerical values from the Analyze->Fit Model platform are shown below. Here the predicted yield for a single future trial at 16 kg/ha is 30.5 L, but the 95% prediction interval is between 26.1 and 34.9 L. The predicted AVERAGE yield for ALL future plots when 16 kg/ha of fertilizer is applied is also 30.5 L, but the 95% confidence interval for the MEAN yield is between 28.8 and 32.1 L.

Finally, residual plots can be made using the pop-down menu. The residuals are simply the difference between the actual data points and the corresponding spots on the line measured in the vertical direction. The residual plot shows no trend in the scatter around the value of zero.

The same items are available from the Analyze->Fit Model platform. Here you would specify Yield as the Y variable and Fertilizer as the X variable, in much the same way as in the Analyze->Fit Y-by-X platform. Much of the same output is produced.
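The two intervals quoted above can be reproduced from the raw data. The sketch below (Python, not from the notes) uses the standard textbook interval formulas with the t critical value t(0.975, 9) ≈ 2.262; only the extra "1 +" inside the square root separates the individual prediction interval from the mean confidence interval:

```python
import math

fert   = [12, 5, 15, 17, 14, 6, 11, 13, 15, 8, 18]
yield_ = [24, 18, 31, 33, 30, 20, 25, 27, 31, 21, 29]
x0, tcrit = 16, 2.262       # new X value; t(0.975, df = 9)

n = len(fert)
xbar = sum(fert) / n
ybar = sum(yield_) / n
sxx = sum((x - xbar) ** 2 for x in fert)
b1 = sum((x - xbar) * (y - ybar) for x, y in zip(fert, yield_)) / sxx
b0 = ybar - b1 * xbar

s = math.sqrt(sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(fert, yield_)) / (n - 2))
yhat = b0 + b1 * x0                    # predicted yield, about 30.5 L
h = 1 / n + (x0 - xbar) ** 2 / sxx     # leverage of the new X

# Confidence interval for the MEAN response at x0
ci = (yhat - tcrit * s * math.sqrt(h), yhat + tcrit * s * math.sqrt(h))
# Prediction interval for a SINGLE future response at x0 (note the 1 +)
pi = (yhat - tcrit * s * math.sqrt(1 + h), yhat + tcrit * s * math.sqrt(1 + h))

print(yhat, ci, pi)   # roughly 30.5, (28.8, 32.1), (26.1, 34.9)
```

The intervals match the Analyze->Fit Model output: (28.8, 32.1) for the mean and the much wider (26.1, 34.9) for a single future plot.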
Additionally, you can save the actual confidence bounds for predictions into the data table (as shown above). This will be demonstrated in class.

14.4.9 Example - Mercury pollution

Mercury pollution is a serious problem in some waterways. Mercury levels often increase after a lake is flooded due to leaching of naturally occurring mercury by the higher levels of the water. Excessive consumption of mercury is well known to be deleterious to human health.

It is difficult and time consuming to measure every person’s mercury level. It would be nice to have a quick procedure that could be used to estimate the mercury level of a person based upon the average mercury level found in fish and estimates of the person’s consumption of fish in their diet. The following data were collected on the methyl mercury intake of subjects and the actual mercury levels recorded in the blood stream from a random sample of people around recently flooded lakes. Here are the raw data:

Methyl Mercury   Mercury in
    Intake      whole blood
 (ug Hg/day)      (ng/g)
      180            90
      200           120
      230           125
      410           290
      600           310
      550           290
      275           170
      580           375
      600           150
      105            70
      250           105
       60           205
      650           480

The data are available in a JMP datasheet called mercury.jmp available from the Sample Program Library at http://www.stat.sfu.ca/~cschwarz/Stat-650/Notes/MyPrograms.

The ordering of the rows in the data table is NOT important; however, it is often easier to find individual data points if the data is sorted by the X value and the rows for future predictions are placed at the end of the dataset. Notice how missing values are represented.

The population of interest is the people around recently flooded lakes. This experiment is an analytical survey as it is quite impossible to randomly assign people different amounts of mercury in their food intake.
Consequently, the key assumption is that the subjects chosen to be measured are random samples from those with similar mercury intakes. Note it is NOT necessary for this to be a random sample from the ENTIRE population (why?).

The explanatory variable is the amount of mercury ingested by a person. The response variable is the amount of mercury in the blood stream. We start by producing the scatter-plot.

There appear to be two outliers (identified by an X). To illustrate the effects of these outliers upon the estimates and the residual plots, the line was fit using all of the data. The residual plot shows the clear presence of the two outliers, but also identifies a third potential outlier not evident from the original scatter-plot (can you find it?).

The data were rechecked and it appears that there was an error in the blood work used in determining the readings. Consequently, these points were removed for the subsequent fit. The estimated regression line (after removing the outliers) is

Blood = −1.951691 + 0.581218 × Intake

The estimated slope of 0.58 indicates that the mercury level in the blood increases by 0.58 ng/g when the intake level in the food is increased by 1 ug/day. The intercept has no real meaning in the context of this experiment. The negative value is merely a placeholder for the line. Also notice that the estimated intercept is not very precise in any case (how do I know this and what implications does this have for worrying that it is not zero?)5

What would have been the impact upon the estimated slope and intercept if the outliers had been retained?

The estimated slope has been determined relatively well (relative standard error of about 10% – how

5 It is possible to fit a regression line that is constrained to go through Y = 0 when X = 0.
These must be fit carefully and are not covered in this course.

is the relative standard error computed?). There is clear evidence that the hypothesis of no relationship between blood mercury levels and food mercury levels is not tenable.

The two types of predictions would also be of interest in this study. First, an individual would like to know the impact upon personal health. Second, the average level would be of interest to public health authorities. JMP was used to plot both intervals on the scatter-plot.

14.4.10 Example - The Anscombe Data Set

Anscombe (1973, American Statistician 27, 17-21) created a set of 4 data sets that were quite remarkable. All four datasets gave exactly the same results when a regression line was fit, yet are quite different in their interpretation. The Anscombe data is available at http://www.stat.sfu.ca/~cschwarz/Stat-650/Notes/MyPrograms. Fitting of regression lines to this data will be demonstrated in class.

14.4.11 Transformations

In some cases, the plot of Y vs. X is obviously non-linear and a transformation of X or Y may be used to establish linearity. For example, many dose-response curves are linear in log(X). Or the equation may be intrinsically non-linear, e.g. a weight-length relationship is of the form weight = β0 × length^β1. Or, some variables may be recorded in an arbitrary scale, e.g. should the fuel efficiency of a car be measured in L/100 km or km/L? You are already familiar with some variables measured on the log-scale - pH is a common example. Often a visual inspection of a plot may identify the appropriate transformation.

There is no theoretical difficulty in fitting a linear regression using transformed variables other than an understanding of the implicit assumption about the error structure.
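For readers without JMP, the Anscombe quartet's headline property mentioned above is easy to verify. The sketch below hard-codes Anscombe's published values (reproduced from the 1973 paper, not from these notes) and shows that all four datasets yield essentially the same fitted line, Ŷ ≈ 3.0 + 0.5X, despite looking completely different when plotted:

```python
x_common = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
datasets = [
    (x_common, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    (x_common, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    (x_common, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
     [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
]

def fit(xs, ys):
    """Closed-form least-squares intercept and slope."""
    n = len(xs)
    xbar, ybar = sum(xs) / n, sum(ys) / n
    b1 = (sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
          / sum((x - xbar) ** 2 for x in xs))
    return ybar - b1 * xbar, b1

fits = [fit(xs, ys) for xs, ys in datasets]
for b0, b1 in fits:
    print(round(b0, 2), round(b1, 3))   # each close to 3.0 and 0.5
```

This is exactly why the notes stress residual plots and scatter-plots: identical fitted equations can hide a curve, an outlier, or a single influential point.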
The model for a fit on transformed data is of the form

trans(Y) = β0 + β1 × trans(X) + error

Note that the error is assumed to act additively on the transformed scale. All of the assumptions of linear regression are assumed to hold on the transformed scale – in particular, that the population standard deviation around the regression line is constant on the transformed scale.

The most common transformation is the logarithmic transform. It doesn't matter if the natural logarithm (often called the ln function) or the common logarithm (often called the log10 transformation) is used. There is a 1-1 relationship between the two transformations, and linearity on one transform is preserved on the other. The only change is that values on the ln scale are 2.302 = ln(10) times those on the log10 scale, which implies that the estimated slope and intercept both differ by a factor of 2.302. There is some confusion in scientific papers about the meaning of log - some papers use this to refer to the ln transformation, while others use it to refer to the log10 transformation.

After the regression model is fit, remember to interpret the estimates of slope and intercept on the transformed scale. For example, suppose that a ln(Y) transformation is used and Y is regressed against time t. Then we have

ln(Y_t+1) = b0 + b1 × (t + 1)
ln(Y_t)   = b0 + b1 × t

and so

ln(Y_t+1) − ln(Y_t) = ln(Y_t+1 / Y_t) = b1 × (t + 1 − t) = b1

exp(ln(Y_t+1 / Y_t)) = Y_t+1 / Y_t = exp(b1) = e^b1

Hence a one unit increase in X causes Y to be MULTIPLIED by e^b1. As an example, suppose that on the log-scale the estimated slope was −.07. Then every unit change in X causes Y to change by a multiplicative factor of e^−.07 = .93, i.e. roughly a 7% decline per year.6 Similarly, predictions on the transformed scale must be back-transformed to the untransformed scale.

In some problems, scientists search for the ‘best’ transform.
This is not an easy task, and using simple statistics such as R2 to search for the best transformation should be avoided. Seek help if you need to find the best transformation for a particular dataset.

JMP makes it particularly easy to fit regressions to transformed data, as shown below. SAS and R have an extensive array of functions so that you can create new variables based on the transformation of an existing variable.

6 It can be shown that on the log scale, for smallish values of the slope, the change is almost the same on the untransformed scale, i.e. if the slope is −.07 on the log scale, this implies roughly a 7% decline per year; a slope of +.07 implies roughly a 7% increase per year.

14.4.12 Example: Monitoring Dioxins - transformation

An unfortunate byproduct of pulp-and-paper production used to be dioxins - a very hazardous material. This material was discharged into waterways with the pulp-and-paper effluent, where it bioaccumulated in living organisms such as crabs. Newer processes have eliminated this byproduct, but the dioxins in the organisms take a long time to degrade.

Government environmental protection agencies take samples of crabs from affected areas each year and measure the amount of dioxins in the tissue. The following example is based on a real study. Each year, four crabs are captured from a monitoring station. The liver is excised and the livers from all four crabs are composited together into a single sample.7 The dioxin levels in this composite sample are measured. As there are many different forms of dioxins with different toxicities, a summary measure, called the Total Equivalent Dose (TEQ), is computed from the sample. Here is the raw data.
Site  Year  TEQ
a     1990  179.05
a     1991   82.39
a     1992  130.18
a     1993   97.06
a     1994   49.34
a     1995   57.05
a     1996   57.41
a     1997   29.94
a     1998   48.48
a     1999   49.67
a     2000   34.25
a     2001   59.28
a     2002   34.92
a     2003   28.16

The data is available in a JMP data file dioxinTEQ.jmp at http://www.stat.sfu.ca/~cschwarz/Stat-650/Notes/MyPrograms. As with all analyses, start with a preliminary plot of the data. Use the Analyze->Fit Y-by-X platform.

7 Compositing is a common analytical tool. There is little loss of useful information induced by the compositing process - the only loss of information is the among-individual-sample variability, which can be used to determine the optimal allocation between samples within years and the number of years to monitor.

The preliminary plot of the data shows a decline in levels over time, but it is clearly non-linear. Why is this so? In many cases, a fixed fraction of dioxins degrades per year, e.g. a 10% decline per year. This can be expressed as a non-linear relationship:

TEQ = C × r^t

where C is the initial concentration, r is the rate reduction per year, and t is the elapsed time. If this is plotted over time, this leads to the non-linear pattern seen above. If logarithms are taken, this leads to the relationship:

log(TEQ) = log(C) + t × log(r)

which can be expressed as:

log(TEQ) = β0 + β1 × t

which is the equation of a straight line with β0 = log(C) and β1 = log(r).

JMP can easily be used to compute log(TEQ) by using the Formula Editor in the usual fashion. A plot of log(TEQ) vs. year gives the following:

The relationship looks approximately linear; there don't appear to be any outliers or influential points; and the scatter appears to be roughly equal across the entire regression line. Residual plots will be used later to check these assumptions in more detail.
A line can be fit as before by selecting the Fit Line option from the red triangle in the upper left side of the plot. This gives the following output:

The fitted line is:

log(TEQ) = 218.9 − .11 × (year)

The intercept (218.9) would be the log(TEQ) in the year 0, which is clearly nonsensical. The slope (−.11) is the estimated log(ratio) from one year to the next. For example, exp(−.11) = .898 would mean that the TEQ in one year is only 89.8% of the TEQ in the previous year, or roughly an 11% decline per year.8

8 It can be shown that in regressions of log(Y) vs. time, the estimated slope on the logarithmic scale is the approximate fractional decline per time interval. For example, in the above, the estimated slope of −.11 corresponds to an approximate 11% decline per year. This approximation only works well when the slopes are small, i.e. close to zero.

The standard error of the estimated slope is .02. If you want to find the standard error of the anti-log of the estimated slope, you DO NOT take exp(0.02). Rather, the standard error of the anti-logged value is found as se_antilog = se_log × exp(slope) = 0.02 × .898 = .01796.9

A 95% confidence interval for the slope can be obtained by pressing a Right-Click (for Windoze machines) or a Ctrl-Click (for Macintosh machines) in the Parameter Estimates summary table and selecting the confidence intervals to display in the table. The 95% confidence interval for the slope (on the log-scale) is (−.154 to −.061). If you take the anti-logs of the endpoints, this gives a 95% confidence interval for the fraction of TEQ that remains from year to year, i.e. between (0.86 and 0.94) of the TEQ in one year remains to the next year.

As always, the model diagnostics should be inspected early on in the process. These are produced automatically by JMP.
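The estimates reported above can be reproduced directly from the raw TEQ table with a short least-squares computation. The sketch below (plain Python, no JMP needed) recovers the slope and intercept, and then applies the delta-method formula for the standard error of the anti-logged slope using the reported se of 0.02:

```python
import math

# TEQ readings for site 'a', 1990-2003 (from the data table above)
years = list(range(1990, 2004))
teq = [179.05, 82.39, 130.18, 97.06, 49.34, 57.05, 57.41,
       29.94, 48.48, 49.67, 34.25, 59.28, 34.92, 28.16]

y = [math.log(t) for t in teq]            # natural-log transform
xbar = sum(years) / len(years)
ybar = sum(y) / len(y)

sxx = sum((x - xbar) ** 2 for x in years)
sxy = sum((x - xbar) * (v - ybar) for x, v in zip(years, y))

slope = sxy / sxx                          # about -0.11
intercept = ybar - slope * xbar            # about 218.9
print(round(slope, 2), round(intercept, 1))

# Delta-method se of the anti-logged slope: se_antilog = se_log * exp(slope),
# with se_log = 0.02 as reported in the JMP output
print(round(0.02 * math.exp(slope), 5))    # about 0.018
```

Working at full precision, exp(slope) is about 0.898, matching the anti-logged slope quoted above.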
The residual plot looks fine with no apparent problems, but the dip in the middle years could require further exploration if this pattern were apparent at other sites as well. This type of pattern may be evidence of autocorrelation. Here there is no evidence of autocorrelation, so we can proceed without worries.

9 This is computed using a method called the delta-method.

Several types of predictions can be made. For example, what would be the estimated mean log(TEQ) in 2010? What is the range of log(TEQ) values in 2010? Again, refer back to previous chapters about the differences between predicting a mean response and predicting an individual response.

The computations could be done by hand, or by using the cross-hairs on the plot from the Analyze->Fit Y-by-X platform. Confidence intervals for the mean response, or prediction intervals for an individual response, can be added to the plot from the pop-down menu. However, a more powerful tool is available from the Analyze->Fit Model platform.

Start first by adding rows to the original data table corresponding to the years for which a prediction is required. In this case, the additional row would have the value 2010 in the Year column with the remainder of the row unspecified. Missing values will be automatically inserted for the other variables. Then invoke the Analyze->Fit Model platform:

This gives much the same output as the Analyze->Fit Y-by-X platform with a few new (useful) features, a few of which we will explore in the remainder of this section. Next, save the prediction formula, the confidence interval for the mean, and the confidence interval for an individual prediction to the data table (this will take three successive saves):
Now the data table has been augmented with additional columns, and more importantly, predictions for 2010 are now available. The estimated mean log(TEQ) in 2010 is 2.60 (corresponding to an estimated MEDIAN TEQ of exp(2.60) = 13.46). A 95% confidence interval for the mean log(TEQ) is (1.94 to 3.26), corresponding to a 95% confidence interval for the actual MEDIAN TEQ of between (6.96 and 26.05).10 Note that the confidence interval after taking anti-logs is no longer symmetrical.

Why does a mean of a logarithm transform back to the median on the untransformed scale? Basically, because the transformation is non-linear, properties such as means and standard errors cannot simply be anti-transformed without introducing some bias. However, measures of location (such as a median) are unaffected. On the transformed scale, it is assumed that the sampling distribution about the estimate is symmetrical, which makes the mean and median take the same value. So what really is happening is that the median on the transformed scale is back-transformed to the median on the untransformed scale.

Similarly, a 95% prediction interval for the log(TEQ) for an INDIVIDUAL composite sample can be found. Be sure to understand the difference between the two intervals.

Finally, an inverse prediction is sometimes of interest, i.e. in what year will the TEQ be equal to some particular value? For example, health regulations may require that the TEQ of the composite sample be below 10 units. The Analyze->Fit Model platform has an inverse prediction function:

10 A minor correction can be applied to estimate the mean if required.

Specify the required value for Y – in this case log(10) = 2.302 – and then press the RUN button to get the following output: The predicted year is found by solving 2.302 = 218.9 − .11 × (year), and gives an estimated year of 2012.7.
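The point predictions above can be checked by hand. The sketch below refits the line at full precision first, because the rounded coefficients quoted in the text (218.9 and −.11) will not reproduce 2012.7 exactly:

```python
import math

# Refit log(TEQ) on year at full precision (data from the table above)
years = list(range(1990, 2004))
teq = [179.05, 82.39, 130.18, 97.06, 49.34, 57.05, 57.41,
       29.94, 48.48, 49.67, 34.25, 59.28, 34.92, 28.16]
y = [math.log(t) for t in teq]
xbar, ybar = sum(years) / 14, sum(y) / 14
sxx = sum((x - xbar) ** 2 for x in years)
slope = sum((x - xbar) * (v - ybar) for x, v in zip(years, y)) / sxx
intercept = ybar - slope * xbar

# Point prediction of the mean log(TEQ) in 2010; back-transforming gives
# the estimated MEDIAN TEQ (not the mean) on the original scale
pred_2010 = intercept + slope * 2010
print(round(pred_2010, 2))        # about 2.60

# Inverse prediction: the year in which the mean log(TEQ) hits log(10)
year_hit = (math.log(10) - intercept) / slope
print(round(year_hit, 1))         # about 2012.7
```

Note that this reproduces only the point estimates; the confidence and prediction intervals require the standard errors from the fit.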
A confidence interval for the time when the mean log(TEQ) is equal to log(10) is somewhere between 2007 and 2026!

The application of regression to non-linear problems is fairly straightforward after the transformation is made. The most error-prone step of the process is the interpretation of the estimates on the TRANSFORMED scale and how these relate to the untransformed scale.

14.4.13 Example: Weight-length relationships - transformation

A common technique in fisheries management is to investigate the relationship between weight and length of fish. This is expected to be a non-linear relationship because as fish get longer, they also get wider and thicker. If a fish grew “equally” in all directions, then the weight of a fish should be proportional to length^3 (why?). However, fish do not grow equally in all directions, i.e. a doubling of length is not necessarily associated with a doubling of width or thickness. The pattern of association of weight with length may reveal information on how fish grow.

The traditional model between weight and length is often postulated to be of the form:

weight = a × length^b

where a and b are unknown constants to be estimated from data. If the estimated value of b is much less than 3, this indicates that as fish get longer, they do not get wider and thicker at the same rates.

How are such models fit? If logarithms are taken on each side, the above equation is transformed to:

log(weight) = log(a) + b × log(length)

or

log(weight) = β0 + β1 × log(length)

where the usual linear relationship on the log-scale is now apparent.

The following example was provided by Randy Zemlak of the British Columbia Ministry of Water, Land, and Air Protection.
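The log-linearization above is an exact algebraic identity, which a couple of lines of Python make concrete. The coefficients a and b below are hypothetical, chosen only for illustration:

```python
import math

# Check of the log-linearization above with hypothetical coefficients
a, b = 0.03, 2.8
for length in [30.0, 33.0, 36.0]:
    weight = a * length ** b                   # power-law model: weight = a * length^b
    lhs = math.log(weight)                     # log(weight)
    rhs = math.log(a) + b * math.log(length)   # log(a) + b * log(length)
    assert abs(lhs - rhs) < 1e-9               # identical, up to floating-point rounding
print("log-linearization verified")
```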
A sample of fish was measured at a lake in British Columbia. The data is as follows and is available in a JMP datasheet called wtlen.jmp at the Sample Program Library at http://www.stat.sfu.ca/~cschwarz/Stat-650/Notes/MyPrograms.

Length (mm)  Weight (g)
34    585
46   1941
33    462
36    511
32    428
33    396
34    527
34    485
33    453
44   1426
35    488
34    511
32    403
31    379
30    319
33    483
36    600
35    532
29    326
34    507
32    414
33    432
33    462
35    566
34    454
35    600
29    336
31    451
33    474
32    480
35    474
30    330
30    376
34    523
31    353
32    412
32    407

The following is an initial plot with a spline fit (lambda = 10) to the data. The fit appears to be non-linear, but this may simply be an artifact of the influence of the two largest fish. The plot appears to be linear in the range of 30-35 mm in length. If you look at the plot carefully, the variance appears to be increasing with length, with the spread noticeably wider at 35 mm than at 30 mm.

There are several (equivalent) ways to fit the growth model to such data in JMP:

• Use Analyze->Fit Y-by-X directly with the Fit Special feature.
• Create two new variables log(weight) and log(length) and then use Analyze->Fit Y-by-X on these derived variables.
• Use Analyze->Fit Model on these derived variables.

We will fit a model on the log-log scale. Note that there is some confusion in scientific papers about a “log” transform. In general, a log-transformation refers to taking natural logarithms (base e), and NOT the base-10 logarithm. This mathematical convention is often broken in scientific papers where authors use ln to represent natural logarithms, etc. It does not affect the analysis in any way which transformation is used, other than that values on the natural log scale are approximately 2.3 times larger than values on the log10 scale. Of course, the appropriate back-transformation is required.

Using the Fit Special
The Fit Special is available from the drop-down menu item. It presents a dialogue box where a transformation on both the Y and X axes may be specified. The following output is obtained:

The fit is not very satisfactory. The curve doesn’t seem to fit the two “outlier” points very well, and at smaller lengths the curve seems to be under-fitting the weight. The residual plot shows the two definite outliers and also shows some evidence of a poor fit, with positive residuals at lengths near 30 mm and negative residuals near 35 mm.

The fit was repeated dropping the two largest fish, with the following output:

Now the fit appears to be much better. The relationship (on the log-scale) is linear, and the residual plot looks OK. The estimated power coefficient is 2.76 (SE .21). We find the 95% confidence interval for the slope (the power coefficient): the 95% confidence interval is from (2.33 to 3.2), which includes the value of 3 – hence the growth could be isometric, i.e. a fish that is twice the length is also twice the width and twice the thickness. Of course, with this small sample size, it is difficult to say much.

The actual model in the population is:

log(weight) = β0 + β1 × log(length) + error

This implies that the “errors” in growth act on the LOG-scale. This seems reasonable.
For example, a regression on the original scale would make the assumption that a 20 g error in predicting weight is equally severe for a fish that (on average) weighs 200 or 400 grams, even though the “error” is 20/200 = 10% of the predicted value in the first case, but only 5% of the predicted value in the second case. On the log-scale, it is implicitly assumed that the “errors” operate multiplicatively, i.e. a 10% error in a 200 g fish is equally severe as a 10% error in a 400 g fish, even though the absolute errors of 20 g and 40 g are quite different.

Another assumption of regression analysis is that the population error variance is constant over the entire regression line, but the original plot shows that the standard deviation is increasing with length. On the log-scale, the standard deviation is roughly constant over the entire regression line.

Using derived variables

The same analysis was repeated using the derived variables log(weight) and log(length), again using the Analyze->Fit Y-by-X platform, but this time without the Fit Special. [The Fit Special is not needed because the derived variables have already been transformed.] The following are the outputs using the derived variables, again with and without the two largest fish.

Because derived variables are used, the fitting plot uses the derived variables and is on the log-scale. This has the advantage that the fit at the lower lengths is easier to see, but the lack of fit for the two largest fish is not as clear. However, it is now easier to see on the residual plot the apparent lack of fit
with the downward-sloping part of the residual plot in the 3.4 to 3.6 log(length) range.

The two largest fish were removed and the fit repeated using the derived variables. The results are identical to the previous section.

A non-linear fit

It is also possible to do a direct non-linear least-squares fit. Here the objective is to find values of β0 and β1 to minimize:

Σ (weight − β0 × length^β1)^2

directly. This can also be done in JMP using the Fit NonLinear platform and won’t be explored in much detail here.

First, here are the results from using all of the fish: Note that the fit apparently is better than the fit on the log-scale, as the fitted curve goes through the middle of the points from the two largest fish. Note that there still appear to be problems with the fit at the lower lengths. The same fit, dropping the two largest fish, gives the following output:

The estimated power coefficient from the non-linear fit is 2.73 with a standard error of .24. The estimated intercept is 0.0323 with an estimated standard error of .027. Both estimates are similar to the previous fit.

Which is a better method to fit this data? The non-linear fit assumes that errors are additive on the original scale. The consequences of this were discussed earlier, i.e. a 20 g error is equally serious for a 200 g fish as for a 400 g fish. For this problem, both the non-linear fit and the fit on the log-scale gave similar results, but this will not always be true. In particular, look at the large difference in estimates when the models were fit to all of the fish. The non-linear fit was more influenced by the two large fish - this is a consequence of minimizing the square of the absolute deviation (as opposed to the relative deviation) between the observed weight and predicted weight.
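The log-log fit (with the two largest fish dropped) can be sketched without JMP using closed-form least squares. The exact point estimate may differ slightly from the 2.76 reported by JMP, but the slope should land comfortably inside the reported 95% confidence interval of (2.33, 3.2):

```python
import math

# Length (mm) / weight (g) pairs from the table above
data = [(34, 585), (46, 1941), (33, 462), (36, 511), (32, 428), (33, 396),
        (34, 527), (34, 485), (33, 453), (44, 1426), (35, 488), (34, 511),
        (32, 403), (31, 379), (30, 319), (33, 483), (36, 600), (35, 532),
        (29, 326), (34, 507), (32, 414), (33, 432), (33, 462), (35, 566),
        (34, 454), (35, 600), (29, 336), (31, 451), (33, 474), (32, 480),
        (35, 474), (30, 330), (30, 376), (34, 523), (31, 353), (32, 412),
        (32, 407)]

# Drop the two largest fish (lengths 46 and 44), as in the text
data = [(l, w) for l, w in data if l < 40]

x = [math.log(l) for l, _ in data]   # log(length)
y = [math.log(w) for _, w in data]   # log(weight)
n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n
sxx = sum((v - xbar) ** 2 for v in x)
b1 = sum((u - xbar) * (v - ybar) for u, v in zip(x, y)) / sxx  # power coefficient
b0 = ybar - b1 * xbar                                          # estimate of log(a)

print(round(b1, 2))   # should fall within the reported CI (2.33, 3.2)
```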
14.4.14 Power/Sample Size

A power analysis and sample size determination can also be done for regression problems, but this is more complicated than power analyses for simple experimental designs. This is for a number of reasons:

• The power depends not only on the total number of points collected, but also on the actual distribution of the X values. For example, the power to detect a trend is different if the X values are evenly distributed over the range of predictors than if the X values are clustered at the ends of the range. A regression analysis has the most power to detect a trend if half the observations are collected at a small X value and half are collected at a large X value. However, this type of data gives no information on the linearity (or lack thereof) between the two X values and is not recommended in practice. A less powerful design would have a range of X values collected, but this is often of more interest, as lack-of-fit and non-linearity can be detected.

• Data collected for regression analysis is often opportunistic, with little chance of choosing the X values. Unless you have some prior information on the distribution of the X values, it is difficult to determine the power.

• The formulae are clumsy to compute by hand, and most power packages tend not to have modules for power analysis of regression. However, modern software should be able to deal with this issue.

For a power analysis, the information required is similar to that requested for ANOVA designs:

• α level. As in power analyses for ANOVA, this is traditionally set to α = 0.05.

• effect size. In ANOVA, power deals with detection of differences among means. In regression analysis, power deals with detection of slopes that are different from zero. Hence, the effect size is measured by the slope of the line, i.e.
the rate of change in the mean of Y per unit change in X.

• sample size. Recall that in ANOVA with more than two groups, the power depended not only on the sample size per group, but also on how the means are separated. In regression analysis, the power will depend upon the number of observations taken at each value of X and the spread of the X values. For example, the greatest power is obtained when half the sample is taken at each of the two extremes of the X space - but at the cost of not being able to detect non-linearity. It turns out that a simple summary of the distribution of the X values (the standard deviation of the X values) is all that is needed.

• standard deviation. As in ANOVA, the power will depend upon the variation of the individual objects around the regression line.

JMP (v. 10) does not currently contain a module to do power analysis for regression. R also does not include a power computation module for regression analysis, but I have written a small function that is available in the Sample Program Library. SAS (Version 9+) includes a power analysis module (GLMPOWER). Russ Lenth also has a JAVA applet that can be used for determining power in a regression context: http://homepage.stat.uiowa.edu/~rlenth/Power/.

The problem simplifies considerably when the X variable is time, and interest lies in detecting a trend (increasing or decreasing) over time. A linear regression of the quantity of interest against time is commonly used to evaluate such a trend. For many monitoring designs, observations are taken on a yearly basis, so the question reduces to the number of years of monitoring required. The analysis of trend data and power/sample size computations is treated in a following chapter.

Let us return to the example of the yield of tomatoes vs. the amount of fertilizer. We wish to design an experiment to detect a slope of 1 (the effect size).
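The design comparison developed below (a true slope of 1, a residual standard deviation of about 4, and 12 plots split between two candidate sets of fertilizer levels) can also be approximated by Monte Carlo simulation. The sketch below is purely illustrative and is not the author's R function; the critical value 2.228 is the two-sided 5% point of the t distribution with n − 2 = 10 degrees of freedom:

```python
import math
import random

# Monte Carlo power for detecting a non-zero slope with a two-sided
# t-test at alpha = 0.05 (illustrative sketch, not the author's R code)
def power(xs, slope=1.0, sd=4.0, reps=4000, tcrit=2.228, seed=1):
    rng = random.Random(seed)
    n = len(xs)
    xbar = sum(xs) / n
    sxx = sum((x - xbar) ** 2 for x in xs)
    hits = 0
    for _ in range(reps):
        ys = [slope * x + rng.gauss(0, sd) for x in xs]
        ybar = sum(ys) / n
        b1 = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) / sxx
        b0 = ybar - b1 * xbar
        sse = sum((y - b0 - b1 * x) ** 2 for x, y in zip(xs, ys))
        se_b1 = math.sqrt(sse / (n - 2) / sxx)   # estimated se of the slope
        if abs(b1 / se_b1) > tcrit:              # reject H0: slope = 0
            hits += 1
    return hits / reps

even   = [10, 11, 12, 13, 14, 15, 15, 16, 17, 18, 19, 20]
paired = [10, 10, 12, 12, 14, 14, 16, 16, 18, 18, 20, 20]
p_even, p_paired = power(even), power(paired)
print(p_even, p_paired)   # the design with the larger spread in X wins
```

Because the paired design has a larger standard deviation of X values (3.42 vs. 3.03), it yields a noticeably higher simulated power, illustrating the point about the spread of the X values.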
From past data (on a different field), the standard deviation of values about the regression line is about 4 units (the standard deviation of the residuals). We have enough money to plant 12 plots, with levels of fertilizer ranging from 10 to 20. How does the power compare under different configurations of fertilizer levels? More specifically, how does the power compare between using fertilizer levels (10, 11, 12, 13, 14, 15, 15, 16, 17, 18, 19, 20), i.e. an even distribution of levels of fertilizer, and (10, 10, 12, 12, 14, 14, 16, 16, 18, 18, 20, 20), i.e. doing two replicates at each level of fertilizer but doing fewer distinct levels?

JMP (v. 10) does not currently have a module for a power analysis of a regression problem. Please consult the documentation on R, SAS, or Lenth’s JAVA program (see below). The power to detect a range of slopes using the last set of X values was also computed (see the R and SAS code) and a plot of the power vs. the size of the slope can be made. Because JMP does not include facilities for a power analysis of a simple linear regression, the plot from R is shown. The power to detect smaller slopes is limited.

Russ Lenth’s power modules11 can be used to compute the power for these two cases. Here the modules require the standard deviation of the X values, but this needs to be computed using the n divisor rather than the n − 1 divisor, i.e.

SD_Lenth(X) = sqrt( Σ (X − X̄)^2 / n )

For the two sets of fertilizer values the SDs are 3.02765 and 3.41565 respectively.

11 http://homepage.stat.uiowa.edu/~rlenth/Power/

The output from Lenth’s power analysis matches the earlier results (as it must).

14.4.15 The perils of R2

R2 is a “popular” measure of the fit of a regression model and is often quoted in research papers as evidence of a good fit, etc.
However, there are several fundamental problems with R2 which, in my opinion, make it less desirable. A nice summary of these issues is presented in Draper and Smith (1998, Applied Regression Analysis, p. 245-246).

Before exploring this, how is R2 computed and how is it interpreted? While I haven’t discussed the decomposition of the Error SS into Lack-of-Fit and Pure Error, this can be done when there are replicated X values. A prototype ANOVA table would look something like:

Source           df           SS
Regression       p − 1        A
Lack-of-fit      n − p − ne   B
Pure error       ne           C
Corrected Total  n − 1        D

where there are n observations and a regression model is fit with p − 1 X variables over and above the intercept. R2 is computed as

R2 = SS(regression)/SS(total) = A/D = 1 − (B + C)/D

where SS(·) represents the sum of squares for that term in the ANOVA table.

At this point, rerun the three examples presented earlier to find the value of R2. For example, in the fertilizer example, the ANOVA table is:

Analysis of Variance
Source    DF   Sum of Squares   Mean Square   F Ratio   p-value
Model      1        225.18035       225.180   69.8800    <.0001
Error      9         29.00147         3.222
C. Total  10        254.18182

Here R2 = 225.18035/254.18182 = .885 = 88.5%. R2 is interpreted as the proportion of variance in Y accounted for by the regression. In this case, almost 90% of the variation in Y is accounted for by the regression. The value of R2 must range between 0 and 1.

It is tempting to think that R2 must be a measure of the “goodness of fit”. In a technical sense it is, but R2 is not a very good measure of fit, and other characteristics of the regression equation are much more informative. In particular, the estimate of the slope and the se of the slope are much more informative.

Here are some reasons why I decline to use R2 very much:

• Overfitting. If there are no replicate X points, then ne = 0, C = 0, and R2 = 1 − B/D. B has n − p degrees of freedom.
As more and more X variables are added to the model, n − p and B become smaller, and R2 must increase even if the additional variables are useless.

• Outliers distort. Outliers produce Y values that are extreme relative to the fit. This can inflate the value of C (if the outlier occurs among a set of replicate X values) or B (if the outlier occurs at a singleton X value). In either case, they reduce R2, so R2 is not resistant to outliers.

• People misinterpret a high R2 as implying the regression line is useful. It is tempting to believe that a higher value of R2 implies that a regression line is more useful. But consider the pair of plots below:

The graph on the left has a very high R2, but the change in Y as X varies is negligible. The graph on the right has a lower R2, but the average change in Y per unit change in X is considerable. R2 measures the “tightness” of the points about the line – the higher value of R2 on the left indicates that the points fit the line very well. The value of R2 does NOT measure how much actual change occurs.

• Upper bound is not always 1. People often assume that a low R2 implies a poor-fitting line. If you have replicate X values, then C > 0. The maximum value of R2 for this problem can be much less than 100% - it is mathematically impossible for R2 to reach 100% with replicated X values. In the extreme case where the model “fits perfectly” (i.e. the lack-of-fit term is zero), R2 can never exceed 1 − C/D.

• No-intercept models. If there is no intercept, then D = Σ (Yi − Ȳ)^2 does not exist, and R2 is not really defined.

• R2 gives no additional information. In actual fact, R2 is a 1-1 transformation of the slope and its standard error, as is the p-value. So there is no new information in R2.

• R2 is not useful for non-linear fits.
R2 is really only useful for linear fits where the estimated regression line is free to have a non-zero intercept. The reason is that R2 is really a comparison between two types of models. For example, refer back to the length-weight relationship examined earlier. In the linear fit case, the two models being compared are

log(weight) = log(b0) + error

vs.

log(weight) = log(b0) + b1 × log(length) + error

and so R2 is a measure of the improvement with the regression line. [In actual fact, it is a 1-1 transform of the test that β1 = 0, so why not use that statistic directly?] In the non-linear fit case, the two models being compared are:

weight = 0 + error

vs.

weight = b0 × length^b1 + error

The model weight = 0 is silly, and so R2 is silly. Hence, the R2 values reported are really all for linear fits - it is just that sometimes the actual linear fit is hidden.

• Not defined in generalized least squares. There are more complex fits that don’t assume equal variance around the regression line. In these cases, R2 is again not defined.

• Cannot be used with different transformations of Y. R2 cannot be used to compare models that are fit to different transformations of the Y variable. For example, many people try fitting a model to Y and to log(Y) and choose the model with the highest R2. This is not appropriate, as the D terms are no longer comparable between the two models.

• Cannot be used for non-nested models. R2 cannot be used to compare models with different sets of X variables unless one model is nested within another model (i.e. all of the X variables in the smaller model also appear in the larger model). So using R2 to compare a model with X1, X3, and X5 to a model with X1, X2, and X4 is not appropriate, as these two models are not nested. In these cases, AIC should be used to select among models.
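The "tightness vs. usefulness" point above is easy to demonstrate with synthetic data (the numbers below are made up purely for illustration): a nearly flat line with tiny scatter has a very high R2, while a steep line with large scatter has a lower R2 but a far larger change in Y:

```python
import random

# Synthetic illustration: R^2 measures tightness about the line,
# not the size of the change in Y. (Made-up data, illustration only.)
def fit_r2(xs, ys):
    n = len(xs)
    xbar, ybar = sum(xs) / n, sum(ys) / n
    sxx = sum((x - xbar) ** 2 for x in xs)
    sxy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
    b1 = sxy / sxx
    b0 = ybar - b1 * xbar
    ss_tot = sum((y - ybar) ** 2 for y in ys)
    ss_res = sum((y - b0 - b1 * x) ** 2 for x, y in zip(xs, ys))
    return b1, 1 - ss_res / ss_tot

rng = random.Random(42)
xs = [float(i) for i in range(1, 21)]

# "Left-hand plot": tiny slope, tiny scatter -> high R^2, negligible change
y_tight = [100 + 0.01 * x + rng.gauss(0, 0.01) for x in xs]
# "Right-hand plot": big slope, big scatter -> lower R^2, large change
y_loose = [100 + 5.0 * x + rng.gauss(0, 30.0) for x in xs]

b_tight, r2_tight = fit_r2(xs, y_tight)
b_loose, r2_loose = fit_r2(xs, y_loose)
print(round(r2_tight, 2), round(r2_loose, 2))
```

The slope and its standard error, not R2, tell you how much Y actually changes with X.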
14.5 A no-intercept model: Fulton's Condition Factor K

It is possible to fit a regression line that has an intercept of 0, i.e., goes through the origin. Most computer packages have an option to suppress the fitting of the intercept. The biggest 'problem' lies in interpreting some of the output – some of the statistics produced are misleading for these models. As this varies from package to package, please seek advice when fitting such models. The following is an example of where such a model may be sensible.

Not all fish within a lake are identical. How can a single summary measure be developed to represent the condition of the fish within a lake? In general, the relationship between fish weight and length follows a power law:

W = a L^b

where W is the observed weight, L is the observed length, and a and b are coefficients relating length to weight. The usual assumption is that heavier fish of a given length are in better condition than lighter fish. Condition indices are a popular summary measure of the condition of the population. There are at least eight different measures of condition, which can be found by a simple literature search. Cone (1989) raises some important questions about the use of a single index to represent the two-dimensional weight-length relationship. One common measure is Fulton's12 K:

K = Weight / (Length/100)^3

This index makes an implicit assumption of isometric growth, i.e. as the fish grows, its body proportions and specific gravity do not change. How can K be computed from a sample of fish, and how can K be compared among different subsets of fish from the same lake or across lakes?

The B.C. Ministry of Environment takes regular samples of rainbow trout using a floating and a sinking net. For each fish captured, the weight (g), length (mm), sex, and maturity of the fish were recorded.

12 There is some doubt about the first authorship of this condition factor. See Nash, R. D. M., Valencia, A. H., and Geffen, A. J. (2005).
The Origin of Fulton's Condition Factor – Setting the Record Straight. Fisheries, 31, 236-238.

The data is available in the rainbow-condition.csv file in the Sample Program Library at http://www.stat.sfu.ca/~cschwarz/Stat-650/Notes/MyPrograms. The data are imported into rainbow-condition.jmp, a JMP data file, in the usual way. A portion of the raw data appears below:

K was computed for each individual fish, and the resulting histogram is displayed below:

There is a range of condition numbers among the individual fish, with an average (among the fish caught) K of about 13.6.

Deriving a single summary measure to represent the entire population of fish in the lake depends heavily on the sampling design used to capture fish. Some care must be taken to ensure that the fish collected are a simple random sample from the fish in the population. If a net of a single mesh size is used, it has a selectivity curve and will typically be more selective for fish of a certain size. In this experiment, several different mesh sizes were used to try to ensure that fish of all sizes have an equal chance of being selected.

As well, regression methods have an advantage in that a simple random sample from the population is no longer required to estimate the regression coefficients. As an analogy, suppose you are interested in the relationship between yield of plants and soil fertility. Such a study could be conducted by finding a random sample of soil plots, but this may lead to many plots with similar fertility and only a few plots with fertility at the tails of the relationship.
An alternate scheme is to deliberately seek out soil plots with a range of fertilities, or to purposely modify the fertility of soil plots by adding fertilizer, and then fit a regression curve to these selected data points.

Fulton's index is often re-expressed for regression purposes as:

W = K (L/100)^3

This looks like a simple regression of W on (L/100)^3, but with no intercept. A plot of these two variables:

shows a tight relationship among fish, but with possibly increasing variance with length.

There is some debate about the proper way to estimate the regression coefficient K. Classical regression methods (least squares) implicitly assume that all of the "error" in the regression is in the vertical direction, i.e. the fit conditions on the observed lengths. However, the structural relationship between weight and length likely has "error" in both variables. This leads to the error-in-variables problem in regression, which has a long history. Fortunately, the relationship between the two variables is often sufficiently tight that it really doesn't matter which method is used to find the estimates.

JMP can be used to fit the regression line constraining the intercept to be zero by using the Fit Special option under the red-triangle:

This gives rise to the fitted line and statistics about the fit:

Note that R2 really doesn't make sense in cases where the regression is forced through the origin, because the null model to which it is being compared is the line Y = 0, which is silly.13 For this reason, JMP does not report a value of R2. The estimated value of K is 13.72 (SE 0.099).

13 Consult any of the standard references on regression, such as Draper and Smith, for more details.
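The no-intercept least-squares estimate has a simple closed form, K = Σ(xW)/Σ(x2) with x = (L/100)^3, so the fit is easy to reproduce outside JMP. The sketch below uses simulated fish (invented lengths, a made-up "true" K of 13.7, and noise that grows with size), not the rainbow-condition.csv values:

```python
import numpy as np

rng = np.random.default_rng(42)

# Simulated fish: lengths in mm, weights from W = K*(L/100)^3 plus noise
# whose spread grows with size, mimicking the fan shape in the data.
length_mm = rng.uniform(150, 400, 50)
x = (length_mm / 100.0) ** 3            # the regressor (L/100)^3
true_k = 13.7                           # invented "true" condition factor
weight_g = true_k * x + rng.normal(0.0, 0.5 * x)

# Fulton's K for each individual fish, and its simple average:
k_per_fish = weight_g / x
k_mean = k_per_fish.mean()

# No-intercept least squares: closed form K = sum(x*W) / sum(x^2)
k_ols = np.sum(x * weight_g) / np.sum(x * x)

# Weighted version with weights proportional to 1/length^2, one option
# when the variance increases with length:
w = 1.0 / length_mm**2
k_wls = np.sum(w * x * weight_g) / np.sum(w * x * x)

print(f"mean per-fish K    : {k_mean:.2f}")
print(f"no-intercept OLS K : {k_ols:.2f}")
print(f"weighted (1/L^2) K : {k_wls:.2f}")
```

As in the notes, the weighted and unweighted estimates agree closely here because the weight-length relationship is so tight.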
The residual plot:

shows clear evidence of increasing variation with the length variable. This usually implies that a weighted regression is needed, with weights proportional to 1/length^2. In this case, such a regression gives essentially the same estimate of the condition factor (K̂ = 13.67, SE = 0.11).

Comparing condition factors

This dataset has a number of sub-groups – do all of the sub-groups have the same condition factor? For example, suppose we wish to compare the K value for immature and mature fish. This is covered in more detail in the chapter on the Analysis of Covariance (ANCOVA).

14.6 Frequently Asked Questions - FAQ

14.6.1 Do I need a random sample; power analysis

A student wrote:

I am studying the hydraulic geometry of small, steep streams in Southwest BC (abstract attached). I would like to define a regional hydraulic geometry for a fairly small hydrologically/geologically homogeneous area in the coast mountains close to SFU. Hydraulic geometry is the study of how the primary flow variables (width, depth and velocity) change with discharge in a stream. Typically, a straight regression line is fitted to data plotted on a log-log plot. The equation is of the form

w = a Q^b

where a is the intercept, b is the slope, w is the water surface width, and Q is the stream discharge.

I am struggling with the last part of my research proposal, which is how do I select (randomly) my field sites and how many sites are required. My supervisor suggests that I select stream segments for study based on a-priori knowledge of my field area and select streams from across it. My argument is that to define a regionally applicable relationship (not just one that characterizes my chosen sites) I must randomly select the sites. I think that GIS will help me select my sites, but I have the usual questions of how many sites are required to give me a certain level of confidence and whether or not I'm on the right track.
As well, the primary controlling variables that I am looking at are discharge and stream slope. I will be plotting the flow variables against discharge directly, but will deal with slope by breaking my stream segments into slope classes – I guess that the null hypothesis would be that there is no difference in the exponents and intercepts between slope classes.

You are both correct! If you were doing a simple survey, then you are correct in that a random sample from the entire population must be selected – you can't deliberately choose streams. However, because you are interested in a regression approach, the assumption can be relaxed a bit. You can deliberately choose values of the X variables, but must randomly select from streams with similar X values.

As an analogy, suppose you wanted to estimate the average length of male adult arms. You would need a random sample from the entire population. However, suppose that you were interested in the relationship between body height (X) and arm length (Y). You could deliberately choose which X values to measure – indeed, it would be a good idea to get a good contrast among the X values, i.e. find people who are 4 ft, 5 ft, 6 ft, and 7 ft tall and measure their heights and arm lengths, and then fit the regression curve. However, at each height level, you must choose randomly among those people who meet that criterion. Hence you could deliberately choose to have 1/4 of the people 4 ft tall, 1/4 who are 5 ft tall, 1/4 who are 6 ft tall, and 1/4 who are 7 ft tall, which is quite different from the proportions in the population, but at each height level you must choose people randomly, i.e. don't always choose skinny 4 ft people and over-weight 7 ft people.

Now sample size is a bit more difficult, as the required sample size depends both on the number of streams selected and on how they are scattered along the X axis.
For example, the highest power occurs when the observations are evenly divided between the very smallest and very largest X values. However, without intermediate points, you can't assess linearity very well, so you will want points scattered across the range of X values.

If you have some preliminary data, a power/sample-size analysis can be done using JMP, SAS, and other packages. If you do a Google search for power analysis regression, there are several direct links to examples. Refer to the earlier section of the notes.
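When no canned routine is at hand, the power of the slope test can also be approximated by simulation. The sketch below is illustrative only: the design values (slope, error SD, X range, sample sizes) are invented placeholders, and a normal critical value is used in place of the exact t quantile.

```python
import numpy as np

def power_for_slope(n, slope, sigma, x_low, x_high, z_crit=1.96,
                    nsim=2000, seed=0):
    """Approximate power of the test of H0: slope = 0 by simulation.

    X values are spread evenly over [x_low, x_high]; a normal critical
    value is used instead of the exact t quantile (fine for moderate n).
    """
    rng = np.random.default_rng(seed)
    x = np.linspace(x_low, x_high, n)
    xc = x - x.mean()
    sxx = np.sum(xc**2)
    hits = 0
    for _ in range(nsim):
        y = slope * x + rng.normal(0.0, sigma, n)
        b1 = np.sum(xc * (y - y.mean())) / sxx     # least-squares slope
        resid = (y - y.mean()) - b1 * xc           # residuals from the fit
        se = np.sqrt(np.sum(resid**2) / (n - 2) / sxx)
        if abs(b1 / se) > z_crit:
            hits += 1
    return hits / nsim

# More sites over the same X span means higher power for the same slope:
p20 = power_for_slope(20, slope=0.15, sigma=1.0, x_low=0, x_high=10)
p40 = power_for_slope(40, slope=0.15, sigma=1.0, x_low=0, x_high=10)
print(f"approx. power with n=20: {p20:.2f}, with n=40: {p40:.2f}")
```

Varying x_low and x_high in the same way shows the point made above: widening the spread of the X values raises power just as effectively as adding observations.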