MS Excel Linear Regression

advertisement
LINEAR REGRESSION:
One of the most common methods of evaluating data, e.g., calibration curve data, is to graph the points and
draw a straight line through them. Often the points have some degree of scatter and do not perfectly fit a
straight line so the student must exercise judgment in determining the best straight-line fit.
The method of least-squares (linear regression) is completely objective and can be performed easily in Excel.
Recall the equation of a straight line is y = mx + b, where m is the slope and b is the y-intercept. For
example, x-values may be molar concentrations and y-values may be absorbance readings from a
spectrophotometric calibration curve.
KMnO4 Calibration Curve using SLOPE, INTERCEPT, RSQ, STEYX & TREND Functions
X values Y observed
Y calc.
Y(obs -calc)
% deviation
new y's
calc. x's
[M]
Abs Abs (calc)
sple. Abs
sple. calc. [M]
0.000E+00
0
0.002
-0.002
-100
0.2
7.96E-05
1.029E-04
0.257
0.258
-0.001
-0.310
0.4
1.60E-04
2.058E-04
0.518
0.513
0.005
0.896
0.6
2.41E-04
3.087E-04
0.771
0.769
0.002
0.260
0.8
3.21E-04
4.116E-04
1.021
1.025
-0.004
-0.351
1.2
4.82E-04
slope =
intercept =
R2 =
2483.965
0.0022
0.9999333
std. error of est.
0.003812261
R2 measures the goodness of fit of data to the regression equation. A perfect fit has R2 = 1.0 while the worst
possible fit, i.e., completely scattered data, has R2 = 0. For established analytical methods such as
spectrophotometry, specific ion, pH, you should obtain an R2 value of at least ‘four 9’s’, i.e.,  0.9999.
The SLOPE, INTERCEPT, RSQ, STEYX and TREND Functions:
The slope, y-intercept, R2 value (RSQ), standard error (STEYX) and data interpolation (TREND) of an array
of (x,y) data points are obtained as follows:
SLOPE:
1. Select a blank cell where the slope value is to appear. Click in the Formula Entry box, then, on the
toolbar click Insert, Function, Statistical, SLOPE, OK.
2. In the Slope Dialog Box that appears, click the ‘go to worksheet’ icon in the known_y’s box, select
the y-value cells (absorbance values), Enter.
3. In the known_x’s box click the ‘go to worksheet’ icon, select the x-value cells (conc. values),
Enter, OK. The slope is displayed in the cell where the formula was written.
CH 503 LINEAR REGRESSION
PAGE 1
INTECEPT, RSQ, STEYX and TREND Functions:
Repeat the process for the intercept using INTERCEPT(known_y’s, known_x’s)
Repeat the process for R2 value using RSQ(known_y’s, known_x’s)
Repeat the process for the standard error of y-estimates using STEYX(known_y’s, known_x’s)
TREND Function:
‘Trend’ allows you to calculate unknown values of linear functions by interpolating the data mathematically,
i.e., using the regression equation to calculate unknown values. This is more accurate than trying to visually
read a graph whether it is hand-drawn or computer-generated, e.g., enter a set of sample absorbances and
calculate the sample concentrations. (Always include  2 standard absorbances as a check). Here’s how:
1. Select a range of empty cells opposite the sample absorbance values (new_y’s), then click the cursor
in the formula bar, then click Insert, Function, Statistical, TREND, OK.
2. In the Trend dialog box you’ll enter the cell addresses requested but wherever y-values are requested
insert x-values instead. Be careful here. i.e.,
use
TREND(known_x’s, known_y’s, new_y’s) (This calculates conc., x from Abs., y)
instead of
TREND(known_y’s, known_x’s, new_x’s). (This calculates Abs., y from conc., x)
3. Finally, CTRL+SHIFT+ENTER (instead of just ENTER or OK since you want an array of results).
If, by mistake, you just press ENTER, press F2 (edit) and press CTRL+SHIFT+ENTER.
CHECK YOUR DATA FOR FLYERS (Outliers):
Calculate y-values (absorbance values) based on your regression equation and compare these with the
corresponding observed y-values. If only one of the calculated y-values differs greatly from the observed
value, it is likely that an error was made in preparing/reading the sample. If possible, the student should
repeat this measurement or delete this point from the data set and repeat the regression analysis.
1. In one of the cells where a y-calc. value is to appear, type: =m*x+b e.g., =$C$10*B4+$C$11,
where absolute cell references, e.g., $C$10 and $C$11 are used to get the values of m and b, but
relative referencing, e.g., B4, is used to get a value of x. This formula is then copied (by dragging)
to other cells to get other values of x-calc.
Alternately, use the Trend function and x-obs. values to generate y-calc. values
2. Now calculate the deviations (yobs-ycalc) and % deviations to identify outliers.
Number Format: Number format can be interchanged between scientific notation, common notation and
others as follows: Select the cells, click Format, Cells, Number tab, Number; also select # of decimal
places, then OK.
PLOTTING YOUR DATA:
Plot both your observed data and your calculated regression line. This will give a visual indication of
goodness of fit of your data. To plot both data sets proceed as follows:
1. In the spreadsheet, drag to select the x-values and both sets of y-values. Note that Excel
automatically chooses the left-most column of selected data as the x-values. Then click the Chart
Wizard icon.
2. In the Chart Type Dialog box, on the Standard Types tab, click XY(Scatter). On the Chart subtype section, click the various types until you see the description ‘Scatter with data points connected
by lines’, then Next.
3. On the Data Range tab, note that ‘Series in Columns’ should be selected. Click the Series tab and
with Series 1 selected, in the Name box, type in a name to appear in the legend for series 1 data,
e.g., ‘y-obs.’ Verify that the x-values are correct and y-values correspond to the y-observed values
on the spreadsheet. Next select Series 2 and name it, e.g., ‘y-calc’, then Next
4. On step 3 of Chart Wizard, on the Titles tab, type in a name for the graph. Don’t worry about the
font size or subscripts. They can be adjusted later. Then tab or click to enter labels for the X-axis
and Y-axis. On the Gridlines tab, you may add more gridlines if desired. Then click Next.
CH 503 LINEAR REGRESSION
PAGE 2
5. On step 4 of Chart Wizard, you may choose to have the graph displayed on a new (separate) sheet or
‘as an object in’ the same sheet. Then click Finish.
6. To obtain a white rather than gray background (save your printer some ink), place the cursor on the
gray area (do not allow the cursor arrow to touch any other line; just gray space). You should see a
pop-up menu that says ‘Plot Area’. Right click the Plot Area then left click ‘Clear’.
7. Note that the two lines are superimposed and as such are not useful for examining goodness of fit or
identifying flyers. You will now change the format of the observed data to points only and change
the format of the regression line to a line only so that the goodness of fit will be more apparent.
Proceed as follows:
8. Move the mouse pointer over the lines on the graph and read the pop-up menu to see which data set
is indicated (e.g., Series “y-obs”). With the chart selected, you can use any of the cursor move
arrows to cycle through the chart components. Read their names in the Formula bar. Also note the
blue box Excel draws around the selected y-values. Be sure that the observed y-values are selected,
then left click on that data set (if not already selected). Click Format, Selected Data Series. In
the Format Data Series dialog box that appears, on the Patterns tab, choose line, none and marker,
automatic, then OK. This will show only the (x,y-obs.) data points and not a connecting line.
9. Move the mouse pointer over the remaining line and note that the pop-up menu identifies it as Series
“y-calc”. Left click this series and then Format, Selected Data Series. Then on the Patterns tab,
select line, automatic and marker, none, then OK. This will show only the (x,y-calc) line without
its points.
10. Click the edge of the graph to deselect the lines and points; then examine the fit of y-obs. points to
the y-calc. line. You may decide to repeat or delete flyers based the fit you observe.
11. Additional formatting can be done. For example, click the title ‘KMnO4 Calibration Curve’. To
change the ‘4’ to subscript, select the character, ‘4’, by dragging the I-beam over it, then Format,
Selected Chart Title. On the Font screen Effects area, select ‘subscript’, OK. The title will
change to ‘KMnO4 Calibration Curve’. The font size of axes labels can be changed in similar
fashion. The font size and number of decimal places of the axes values can be changed after
selecting the axes, then click Format, Selected axis and use the Font and Number tabs to access
these controls.
12. Click the Drawing Tools icon to Insert a Text Box and in it type important regression statistics..
KMnO4 Calibration Curve
Observed Data
Calc. Trendline
1.2
Absorbance
1
0.8
0.6
m = 2484.0
b = 0.0022
R2 = 0.999933
0.4
0.2
0
0.0E+00
1.0E-04
2.0E-04
3.0E-04
4.0E-04
5.0E-04
conc. [M]
CH 503 LINEAR REGRESSION
PAGE 3
The LINEST Function:
LINEST (short for LINear ESTimation) function performs linear regression analyses on a set of (x,y) data
points. The general form of the linear equation that can be handled by LINEST is
y = m1x1 + m2x2 + m3x3 + … + b
LINEST returns the array of regression parameters mn …m2, m1, b
The syntax is LINEST(known_ys, known_xs, const_logical, stats_logical)
If const_logical is TRUE, or 1, or omitted, the intercept b is given.
If const_logical is FALSE, or 0, the intercept b is not given.
If stats_logical is TRUE, or 1, an array of regression statistics are given in addition to values of m and b.
KMnO4 calibration curve using Linest
X values
Y observed
[M]
Abs
0.000E+00
0 parameters
1.029E-04
0.257 std. dev's
2.058E-04
0.518 R2, SE(y)
3.087E-04
0.771 F, df
4.116E-04
1.021 ss(reg), ss(resid)
m
2484.0
11.7
0.999933
44952.8
0.653
b
0.0022
0.00295
0.003812261
3
4.36E-05
Applying LINEST to the spectrophotometric data for KMnO4 follows:
1. Select an array of cells (5R  2C). Click in the formula box, then Insert, Function, Statistical,
LINEST, OK.
2. In the LINEST dialog box for known_ys, click the go to worksheet icon and select the known y
values, Enter.
3. In the LINEST dialog box for known_xs, click the go to worksheet icon and select the known x
values, Enter.
4. In the LINEST dialog box for Const, type 1, then tab to the Stats box and type 1, then do not click
the enter button but rather type CTRL+SHIFT+ENTER. If, by mistake, you hit only ENTER, type
F2 (formula edit), then CTRL + SHIFT + ENTER.
SE(y) is short for ‘Standard Error’ of the y-values.
Explain the meaning of SE(y), i.e., how is it derived?
How can you use it?
CH 503 LINEAR REGRESSION
PAGE 4
REGRESSION USING DATA ANALYSIS IN TOOL PAK:
1. Arrange the data in columns with the x variable on the left and the y variable on the right. Make
space for the results of the regression analysis to the right of the data. Allow a 16R7C array of
space.
2. Choose Data Analysis from the Tools menu; if the Data Analysis command is not present in the
Tools menu, you must use the Add-Ins command in the Tools menu to install it.
3. After choosing Data Analysis, choose Regression, OK from the Analysis Tools list box. The
Regression dialog box will prompt you to enter the range of dependent variables (y). Enter your data
label along with the data.
4. Tab to the ‘Input X_Range box and enter the range of independent variables (x) along with its label.
5. Select the Labels box because we have included the labels with the data ranges
6. Select ‘Constant is zero’ only if you want to force the regression line through the origin (0,0).
7. Confidence Level: Excel automatically includes 95% confidence intervals for the regression
coefficients.
8. Output location: Click the Output Range button. Click to select the range edit box on its right, and
point to or type a reference for the top left corner cell of the array where the output summary will
appear.
9. Residuals: Select this box to obtain the fitted values (predicted y) and residuals.
10. Residual Plots: Select this box to obtain charts of residuals versus each x-value.
11. Standardized Residuals: Select this box to obtain standardized residuals (each residual divided by
the standard deviation of the residuals). This output makes it easy to identify outliers.
12. Line Fit Plots: Select this box to obtain and XY(Scatter) chart of markers for the (x,y) data points
and a line showing predicted (x,y) values, i.e., the regression line.
13. Normal Probability: Do not check this option.
14. Finally click OK The summary chart appears.
15. To change column widths so that all summary info is visible, make nonadjacent selections of all
cells that are too small. From the Format menu choose Column, Auto Fit Selection.
16. Three tables are produced: regression statistics, analysis of variance (labeled ANOVA), and
regression coefficients. The analysis of variance data can be deleted as follows: Select the cells, the
Edit (or right click), Delete, Shift Cells Up, OK.
17. Residuals data may take extra space here and may have to be moved opposite raw data.
CH 503 LINEAR REGRESSION
PAGE 5
SUMMARY OUTPUT
This section moved from right to fit page.
Upper 95% Lower 95.0% Upper 95.0%
0.01159766 -0.00719766 0.01159766
2521.24955 2446.680479 2521.24955
Regression Statistics
Multiple R
0.999966633
R Square
0.999933268
Adjusted R Square
0.999911024
Standard Error
0.003812261
Observations
5
ANOVA
df
Regression
Residual
Total
SS
0.6533136
4.36E-05
0.6533572
1
3
4
Coefficients
Standard Error
0.0022
0.002952965
2483.965015
11.71567303
Intercept
[M]
MS
F
Significance F
0.6533136 44952.77 2.31367E-07
1.45333E-05
t Stat
P-value
0.745014008 0.510319
212.0206845 2.31E-07
Lower 95%
-0.00719766
2446.680479
RESIDUAL OUTPUT
Observation
1
2
3
4
5
Predicted Abs
0.0022
0.2578
0.5134
0.769
1.0246
Residuals
Standard Residuals
-0.0022
-0.666360786
-0.0008
-0.242313013
0.0046
1.393299825
0.002
0.605782533
-0.0036
-1.090408559
Comparison of LINEST and Data Analysis:
 Data Analysis regression provides more complete statistical data and the output is clearly labeled.
 The advantage of LINEST is that it is dynamically linked to the original data and is automatically
updated if the raw data is changed.
Abs
[M] Line Fit Plot
1.5
1
0.5
0
0.0E+00
Abs
Predicted Abs
1.0E-04
2.0E-04
3.0E-04
4.0E-04
5.0E-04
[M]
The ‘Line Fit Plot’ is simply a graph of the data. Both predicted and observed data appear as points.
Change the predicted values to a trend line without markers to see how well the data fits the trend line.
CH 503 LINEAR REGRESSION
PAGE 6
Residual plots are better indicators of data/trend line fit. The residual chart is located to the right of the
regression summary output. If the relationship between concentration and absorbance is linear, then random
pattern should appear in the residuals plot. On the other hand, if we see curvature or some other systematic
pattern, then we should change our model to incorporate the nonlinear relationship. We will study nonlinear
regression next.
Residual plots are also helpful in identifying situations where the residuals are smaller in one region and
larger in another. The residuals plot would have the shape of a tree resting on its side. In such cases the
standard error of the estimate, which summarizes all of the residuals terms, would overestimate the
calculated concentration in one region and underestimate the concentration in another.
Residuals
[M] Residual Plot
0.01
0.005
0
-0.005
0
0.0001
0.0002
0.0003
0.0004
0.0005
[M]
Residuals are determined as (yobserved-ycalculated). Both the residuals plot and the residuals output data show
that the calculated y-values are low in the middle concentrations and higher at the extremes. However the
relative deviation, as we have already calculated, is quite small (in the order of 0.3 to 1.0% deviation).
Exercise: Apply Data Analysis to the KMnO4 calibration data. Obtain the same results as shown above.
CH 503 LINEAR REGRESSION
PAGE 7
LINEAR REGRESSION by ‘Adding a TRENDLINE’:
For ‘Add a Trendline’ function, you plot a graph first and then get the regression equation from the graph.
1. Enter the x and y values, e.g., conc. and absorbance, on an Excel spreadsheet, putting the y-values in
a column to the right of the column of x-values.
2. Drag to select the entire array of x and y values.
3. Click the Chart Wizard icon on the toolbar.
4. Step 1 of the Wizard: Under Standard type choose XY(Scatter). Under Chart sub-type select
‘Scatter, compares pairs of values’; then Next. (Do not use a line joining data points.)
5. Step 2: On the Data Range tab, ensure that ‘series in columns’ in selected and under the Series tab
type in a name, e.g., Data; then Next.
6. Step 3: Under the Titles tab, type a chart title and X axis and Y-axis labels. Under the Gridlines tab,
add x-axis major gridlines; then Next.
7. Step 4: Click to place the chart as an object in the spreadsheet or on a new sheet, Finish.
8. Clear the background by right clicking on the gray area and click Clear. Using the cursor click on a
data point, click Chart, Add Trendline. Under the Type tab, in the Trendline dialog box choose the
Linear Trend/Regression type. Also under the Options tab, select ‘Display equation of chart’ and
‘Display R-squared value on chart, then click OK. The R2 value and regression equation are
displayed and a line corresponding to the regression equation is drawn on the chart. Make the trend
line very thin by first selecting it, then choose Format, Selected Trendline, and change the ‘weight’
on the patterns tab to hairline (very thin), OK.
9. IMPORTANT: Increase the number of decimal points in the regression equation and R2-value by
clicking to select this data and repeatedly clicking the ‘increase decimal points’ icon. This change
cannot be made after any other reformatting of this data, such as changing font size.
10. Change the name of the trend line as follows: Click to select the trend line, then click Format,
Selected Trendline. Under the Options tab change Trendline name from automatic to custom and
type in ‘Regression Line’, OK. Make other formatting changes as desired, e.g., font size of X axes
values, etc.
11. You now have a graph and an equation of the line of best fit. Use this equation to calculate any
unknown x-values, e.g., conc. of samples, from measured y-values, e.g., absorbance of samples. Do
the calculation by writing the appropriate formula in an empty spreadsheet cell and drag to other
cells to automatically obtain all results.
What formula will you write in an empty cell to complete step 11? Do this now.
Graph & Regression Line using Trendline Function
Data Points
Regression Line
Absorbance
1.2
1
0.8
y = 2483.96501x + 0.00220
R2 = 0.99993
0.6
0.4
0.2
0
0.0E+00
1.0E-04
2.0E-04
3.0E-04
4.0E-04
5.0E-04
Molar Conc.
CH 503 LINEAR REGRESSION
PAGE 8
PRINTING YOUR GRAPH:
1. Printing from Excel: In step 4 of the Chart Wizard, you may opt to ‘place your chart as an object in’ the
current spreadsheet or to ‘place the chart as a new sheet’. You can print the graph from either location.
If the chart is placed in the current spreadsheet, select it first (sizing handles appear when selected), then
choose File, Print. It is always good practice to Print Preview before activating the Print command.
2. Printing from MS Word (or other programs): In Excel, Select, then Copy the Excel graph to the
clipboard using the copy icon or the Edit, Copy command (or CTRL + C).
Open MSWord. In MS Word, place the cursor where you want the graph to be inserted. When inserting
between existing lines of text, you do not have to create blank spaces; Word will do this for you. Then
use the Edit, Paste Special command. In the Paste Special dialog box choose either ‘Paste’ or ‘Paste
Link’ and choose ‘Microsoft Excel Chart Object’, then OK. Using the Paste Special command pastes
the chart as an object making it easier to move and center on the page compared to simply Pasting from
the clipboard (CTRL + V). The Paste Link option allows your graph to automatically update in
MSWord if you later make changes to the Excel data or graph, however, both the Excel and Word files
must be located in the same place (disk or drive) as they were when the Paste Link was created.
3. Print the graph using the program’s Print command.
CH 503 LINEAR REGRESSION
PAGE 9
Download