Level 2 Computing - Project 2: Linear Regression Analysis

Introduction

This exercise will be quite different from the first. We are going to explore regression analysis, which is commonly used in geology to quantify associations between variables. For years, scientific hand calculators, and later spreadsheet programs like MS Excel and many graphing programs, have been able to perform regression analyses of data sets. We (collectively) tend to take it on faith that these calculators and programs handle the data we input in an appropriate way. In project 2, we will cast a critical eye on exactly how MS Excel and the graphing program DeltaGraph perform linear regressions of geochemical data.

Why would one want to quantify associations among variables? Because, for valid correlations, knowledge of one variable can be used to predict the amount of another. Example: Assume you are doing geochemical prospecting for element X and you know from earlier detailed work that X's concentration correlates positively with that of element Y. Further assume that measuring element X directly is costly and time-consuming, but measuring element Y is fast and cheap. The correlation would give you a cost-effective method to identify concentrations of X by using measurements of Y.

Linear Regressions

A linear regression is simply a line fitted to a set of data points. In geochemistry these data are generally X-Y plots of amounts of chemical elements obtained by some analytical technique. There are several methods to fit lines to data points. One method that is commonly used in geochemistry is the least squares method, and there are two variants of this method. The Y on X variant fits the line for which the sum of the squares of the Y deviations is minimized (Fig. 1). The other variant of the ordinary least squares method is X on Y.
The line fitted by this method minimizes the sum of the squares of the X deviations between the data points and the line (Fig. 2).

[Figure 1. Ordinary least squares: Y on X.]

[Figure 2. Ordinary least squares: X on Y.]

[Figure 3. Ordinary least squares: the X on Y and Y on X lines, intersecting at (x̄, ȳ).]

As a result, the ordinary least squares method can yield two regression lines from a single data set. The intersection point gives the mean x and y values (x̄ and ȳ) of the entire data set (Fig. 3). The angle between the two regression lines is larger for poorly correlated data and becomes smaller as the correlation coefficient (usually indicated with r) approaches 1 or -1. The two lines coincide when r = 1 or -1.

The reduced major axis (RMA) method minimizes the areas of the triangles between the points and the best-fit line (Fig. 4). This method is less widely applied than the least squares method, but it is more appropriate for geochemical data because the fitted line is independent of the correlation coefficient. The regression line for the RMA method will lie between the lines for the least squares Y on X and X on Y methods. As mentioned above, as the correlation coefficient approaches 1 or -1 the lines produced by all three of these regression methods will converge.

[Figure 4. Reduced major axis.]

Basic Excel Formulae

Calculating the Regression Lines

In order to calculate the regression lines, one needs to be familiar with the equation for a straight line. The basic equation is given below.

Equation for a straight line:

  Y = m · X + by

where Y = vertical axis, X = horizontal axis, by = Y intercept (value of Y where X = 0), and m = slope = ΔY/ΔX.

[Figure: a straight line on X-Y axes, with the origin and the intercept by marked.]

The main objective of project 2 is to calculate the regression lines for the three methods of linear regression described at the beginning of this document. The regression lines are all derived from the same data set of X and Y values.
The only differences among the three methods will be the way in which the slope (m) and the y-intercept (by) are calculated.

Quantity                            Name    Math expression/value   Excel formula for math expression
number of samples                   n       31
X data set                          Fe2O3
Y data set                          CaO
each X value squared                Xsq     X²
each Y value squared                Ysq     Y²
mean of X values                    mX      ΣX/n                    = Sum(Fe2O3)/n
mean of Y values                    mY      ΣY/n
sum of squares about X              SSx     Σ(X²) - [(ΣX)²]/n       = Sum(Xsq)-Sum(Fe2O3)^2/n
sum of squares about Y              SSy     Σ(Y²) - [(ΣY)²]/n
each cross product for X and Y      cPxy    (X-mX)·(Y-mY)
sum of cross products for X and Y   SPxy    Σ[(X-mX)·(Y-mY)]

Assume that we have used the Name... item in the Excel Insert menu to name cells and ranges of cells in the XL worksheet. Fe2O3, CaO, Xsq, Ysq and cPxy are ranges of cells, and the remaining Names are values in individual cells. The cross product for x and y must be calculated for each pair of values; assume x is in B4 and y is in C4. Two of the Excel formulae are filled in above as examples. Fill in the blanks for the others.

Notation & Calculation - Example (fill in the blanks)

Method                   Quantity    Symbol   Math expression   Excel formula
least squares (Y on X)   intercept   b(0)     mY - b(1)·mX
                         slope       b(1)     SPxy/SSx
least squares (X on Y)   intercept   b(0)     mX - b(1)·mY
                         slope       b(1)     SPxy/SSy
reduced major axis       intercept   b(0)     mY - b(1)·mX
                         slope       b(1)     SQRT(SSy/SSx)

Calculation - Exercise (fill in the blanks)

Using the data in Excel Sheet 1 (web link) and the formulae you have just written out, calculate the values indicated below. Note that Sheet 1 is locked, so you will need to copy its values into a new worksheet.

  mX, mY, SSx, SSy, SPxy
  for Y on X: b(0), b(1)
  for X on Y: b(0), b(1)
  for RMA:    b(0), b(1)

The values below are the correct ones (in no particular order). Use them to check your calculations:

  0.5988  57.779  0.7280  3.610  1.1298  0.7636  3.909  96.492  1.2688  -0.1689  51.141

Calculation - Exercise

Using the data in Excel Sheet 2 (web link), repeat the calculations you just did. Sheet 2 is also locked.
You will need to copy these values into a new worksheet.

  mX, mY, SSx, SSy, SPxy
  for Y on X: b(0), b(1)
  for X on Y: b(0), b(1)
  for RMA:    b(0), b(1)

Plotting the data

Use the intercepts [b(0)] and slopes [b(1)] to calculate the three regression lines. Points on each line can be obtained from the equation for a line, here y = b(1) · x + b(0), by choosing some values of x. For this example use X = 0 and X = 6 and calculate the corresponding y-values. Make one chart. Plot the data points in one chart and the three regression lines in another.

Important: The slope and intercept values that were determined above for the (Y on X) and (X on Y) least squares regression lines do not use the same X and Y axes. These axes are transposed. In order to plot all the lines on the same diagram with the same X and Y orientation, the (X on Y) linear equation needs to be rearranged. See the explanation below.

To plot the Y on X and RMA regression lines, use slopey [= b(1)] and by [= b(0)] in equation 1 to obtain values of Y where X = 0 and X = 6.

  Y = slopey · X + by    (1)

The slopex [= b(1)] and bx [= b(0)] for the X on Y regression line cannot be used in equation 1; equation 2 shows the form of the X on Y regression line for the values of slopex and bx that you have determined.

  X = slopex · Y + bx    (2)

Equation 2 must be rearranged in order to plot the X on Y regression line on the same x-y plot as the RMA and Y on X regression lines.

  X = slopex · Y + bx                (2)
  -slopex · Y = bx - X               (rearranging)
  Y = (1/slopex) · X - bx/slopex     (3)

Use slopex [= b(1)] and bx [= b(0)] in equation 3 to obtain values of Y where X = 0 and X = 6. These values can be plotted on the same x-y plot as the RMA and Y on X regression lines (CaO is the ordinate; Al2O3 is the abscissa).

Plotting the data with MS Excel

Make two plots. One plot should contain the CaO and Al2O3 data, and the other plot should contain the data points and the (Y on X), (X on Y) and RMA regression lines. Size both plots such that they can be printed on a single A4 page.
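As a cross-check on the spreadsheet work, the calculations described so far (the summary quantities, the three slope and intercept pairs, and the rearrangement in equation 3) can be sketched in a few lines of Python. This is only an illustrative sketch, not part of the handout's Excel workflow: the function names and the five data points are made up for the example and are not the Sheet 1 or Sheet 2 values. One detail goes beyond the handout: the RMA slope is given the sign of SPxy, which is identical to SQRT(SSy/SSx) for positively correlated data.

```python
from math import copysign, sqrt


def summary_stats(xs, ys):
    """Return mX, mY, SSx, SSy, SPxy using the handout's expressions."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    ssx = sum(x * x for x in xs) - sum(xs) ** 2 / n
    ssy = sum(y * y for y in ys) - sum(ys) ** 2 / n
    spxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return mx, my, ssx, ssy, spxy


def regression_lines(xs, ys):
    """Return {method: (intercept, slope)}, every line written as Y = slope*X + intercept.

    Uncorrelated data (SPxy = 0) is not handled: equation 3 would divide by zero.
    """
    mx, my, ssx, ssy, spxy = summary_stats(xs, ys)

    # Least squares, Y on X: usable directly in equation 1.
    slopey = spxy / ssx
    by = my - slopey * mx

    # Least squares, X on Y: this is X = slopex*Y + bx (equation 2) ...
    slopex = spxy / ssy
    bx = mx - slopex * my
    # ... so rearrange with equation 3 before plotting against X.
    xy_slope = 1.0 / slopex
    xy_intercept = -bx / slopex

    # Reduced major axis. Assumption beyond the handout: take the sign of SPxy,
    # which equals SQRT(SSy/SSx) whenever the correlation is positive.
    rma_slope = copysign(sqrt(ssy / ssx), spxy)
    rma_intercept = my - rma_slope * mx

    return {
        "Y on X": (by, slopey),
        "X on Y": (xy_intercept, xy_slope),
        "RMA": (rma_intercept, rma_slope),
    }


def endpoints(intercept, slope, x_low=0.0, x_high=6.0):
    """Two points are enough to draw a straight line: evaluate Y at both X bounds."""
    return (x_low, intercept + slope * x_low), (x_high, intercept + slope * x_high)


# Hypothetical illustrative data, NOT the values from Sheet 1 or Sheet 2.
demo_x = [1.0, 2.0, 3.0, 4.0, 5.0]
demo_y = [13.1, 13.9, 15.2, 15.8, 17.0]

for method, (b0, b1) in regression_lines(demo_x, demo_y).items():
    print(method, "b(0) =", round(b0, 4), "b(1) =", round(b1, 4), endpoints(b0, b1))
```

Because all three lines are returned in the same Y-versus-X form, their endpoints at X = 0 and X = 6 can go straight onto one plot, which is the purpose of equation 3.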
To start plotting, use this XL tool.

[Screenshot of the XL tool.]

On your plots set the lower and upper values of the X axis to 0 and 6, and set the lower and upper values of the Y axis to 12 and 18. Your plots should look roughly like this:

[Figure: two example plots, each with the X axis running from 0 to 6 and the Y axis from 12 to 18.]

If you have succeeded you will have noticed that plotting in Excel is neither an intuitive nor a fun experience.

Things to think about

Do the two least squares regression lines nearly coincide or are they far apart? What does this tell us? In the introduction it was stated that as the correlation coefficient (r) approaches 1 or -1 the two least squares lines will approach coincidence. Do you think the value of r is closer to 1 or closer to 0?

Determining r in Excel

Select an empty cell and, in the Insert menu, choose Function... Then pick the function PEARSON. Its arguments array1 and array2 are the ranges of cells containing the X and Y data. The result is the correlation coefficient (r). So what is it, closer to 1 or closer to 0?

DeltaGraph

In the final part of this exercise you will plot the same data in DeltaGraph. This should be a fairly quick exercise. Open DeltaGraph (Launcher). Copy the data out of Sheet 2 and paste it into the DeltaGraph worksheet. Select the data and plot it (see below).

[Screenshot callouts: "Shortcut: Clicking here selects all the data in the worksheet." "Opens the chart type and plot windows."]

[Screenshot callouts: "Select Scatters to reduce the choices." "Make sure you have selected a paired scatter diagram." "This is the simplest. I suggest using it."]

You can set the lower and upper bounds and the axis length with this menu. Use the same bounds as in the Excel plot, but here set the length of each axis to 10 cm.

[Screenshot callout: "Toggle these buttons and examine this formula."]

Are these values similar to what you previously calculated in Excel?

DeltaGraph plots can be edited in Adobe Illustrator.
In the File menu there is an item called Export... If you export as EPSF (encapsulated PostScript file), the file can be opened and edited in Illustrator via the Open command in the File menu. Make and export the following three plots.

[Figure: three separate plots (X on Y, Y on X, and RMA), each with the X axis from 0 to 6, the Y axis from 12 to 18, and both axes 10 cm long.]

You will need to plot the RMA regression line separately (as before, 2 points are sufficient).

Using Adobe Illustrator, edit these three plots into a single diagram containing the X on Y, Y on X and RMA regression lines (all labeled). Turn in only one DeltaGraph plot on an A4 page.

[Figure: the combined diagram with the Y on X, X on Y and RMA lines labeled; X axis from 0 to 6, Y axis from 12 to 18, both axes 10 cm long.]
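The value that PEARSON returns in the "Determining r in Excel" step can also be checked by hand, since the Pearson correlation coefficient in this handout's notation is r = SPxy / SQRT(SSx · SSy). The sketch below is a hedged illustration of that relationship and of the introduction's claim that the two least squares lines close up as r approaches 1 or -1. The function names and data are hypothetical and do not come from the worksheets.

```python
from math import atan, degrees, sqrt


def sums_of_squares(xs, ys):
    """Return SSx, SSy and SPxy computed from deviations about the means."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    ssx = sum((x - mx) ** 2 for x in xs)
    ssy = sum((y - my) ** 2 for y in ys)
    spxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return ssx, ssy, spxy


def pearson_r(xs, ys):
    """Correlation coefficient r = SPxy / sqrt(SSx * SSy)."""
    ssx, ssy, spxy = sums_of_squares(xs, ys)
    return spxy / sqrt(ssx * ssy)


def ols_slopes(xs, ys):
    """Return (slopey, slopex): the Y on X slope and the X on Y slope."""
    ssx, ssy, spxy = sums_of_squares(xs, ys)
    return spxy / ssx, spxy / ssy


def angle_between_ols_lines_deg(xs, ys):
    """Angle between the two least squares lines when both are drawn on Y-versus-X axes.

    The X on Y line has slope 1/slopex on those axes (equation 3).
    Uncorrelated data (SPxy = 0) is not handled here.
    """
    slopey, slopex = ols_slopes(xs, ys)
    return abs(degrees(atan(1.0 / slopex) - atan(slopey)))


# Hypothetical illustrative data, NOT the values from Sheet 1 or Sheet 2.
demo_x = [1.0, 2.0, 3.0, 4.0, 5.0]
demo_y = [13.1, 13.9, 15.2, 15.8, 17.0]

r = pearson_r(demo_x, demo_y)
slopey, slopex = ols_slopes(demo_x, demo_y)
print("r =", round(r, 4))
print("r squared =", round(r * r, 6), "slopey * slopex =", round(slopey * slopex, 6))
print("angle between lines (degrees) =", round(angle_between_ols_lines_deg(demo_x, demo_y), 4))
```

The product of the two least squares slopes equals r squared, so a product close to 1 means the two lines nearly coincide, while a small product means a wide angle between them.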