A cross validation macro

Sixth handout

I have written a macro, specifically for our data, that does cross validation. It is written to check the prediction of each of the nine molecular weights from the other eight using the regression model

Relative mobility = a + b * logMW + c * (logMW)^2.

This macro could be modified, fairly easily, for any model that fits a dependent variable with an intercept and two “columns” of “independent” data. To modify for more than two variables, some reorganization of the placement of data is necessary.

This subroutine was a little more difficult to write. Writing procedures can snag in many unforeseen ways. This is a corollary of the axiom that changes may produce unintended consequences. Patience is necessary. Here is a part of the code.

For Col = 4 To 23
    For Band = 1 To 9
        Range("A2:W12").Select
        Selection.Copy
        Range("A14").Select
        ActiveSheet.Paste
        Range("A25").Select
        Selection.EntireRow.Insert
        Range(Cells(15 + Band, 1), Cells(15 + Band, 1)).Select
        Application.CutCopyMode = False
        Selection.EntireRow.Delete

The first two lines start a “for-loop” that will go through all of the gels (in columns with the name Col) and all of the rows from Band 1 to Band 9. Then the entire data set is copied (next four lines). Then a row is inserted (next two lines). Then the row corresponding to a particular band is deleted (last three lines). We can go through this macro step by step using the step through option (part of the “debugger”); in this way we can see the effect of each step.

Perhaps you understand perfectly why everything was done except the second operation, inserting a row. Why insert a row? To be honest, my first attempt to write the macro did not have this step. But it did have the row deletion step, a step that is necessary for cross-validation; we have to imagine a data point is missing and estimate the appropriate value. The rest of the macro does the regression, estimates the correct value, and records this value in the spreadsheet.

The problem was that I was losing my recorded values, and the reason is tricky.

Sixth handout, p.1 of 6 Dr. M. Lawrence Clevenson

The results were recorded in a row below the row that was deleted. Each time through the loop, a row was deleted, and when this happens, all the other rows automatically move up. So the cell in which I was recording my results was migrating up a row each time I copied all the data and then eliminated one row.

Eventually, it would end up in the cells on which I was recopying the original data and be overwritten. No good. Inserting a row would move all the cells below that row down one row, and then when a row of cells was deleted, these cells would come back to their original position. (There are undoubtedly better ways of handling this problem; one might be to record the results on a new sheet.)
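As a rough sketch of the "new sheet" alternative mentioned above: if each prediction is written to a separate worksheet, inserting or deleting rows on the data sheet cannot move it. The sheet name "Results" and the procedure name RecordPrediction are assumptions made for illustration; they are not part of the macro in this handout.

```vba
' Hypothetical helper: writes one cross-validation prediction to a separate sheet.
' Assumes a worksheet named "Results" already exists in the workbook.
Sub RecordPrediction(ByVal Band As Integer, ByVal Col As Integer, _
                     ByVal Predicted As Double)
    ' Band 1 goes to row 25, Band 9 to row 33, in the same column as the gel,
    ' so the layout matches the prediction rows used on the data sheet.
    Worksheets("Results").Cells(24 + Band, Col).Value = Predicted
End Sub
```

With something like this in place, the row insertion step would no longer be needed to protect the recorded values, since the "Results" sheet is never touched by the row deletion.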

This spreadsheet still requires some work. After running the macro, you have all the predicted values, and these need to be compared to the actual values. Perhaps we can use all of our data to remove the “bias” (consistent underestimation or overestimation) by a model. (This is only possible because we have replicates.)

1. In cell X24, enter the label Average Prediction.

2. Compute the average predictions in cells X25:X33. E.g., in X25 is the formula =AVERAGE(D25:W25).

3. In cell Y24, enter the label Bias.

4. In cells Y25:Y33, enter the bias: Average Prediction minus Correct value. E.g., in cell Y25 is the formula =X25-B4 .
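The same bias calculation can be sketched in VBA, which may be handy if the predictions are already held in an array inside a macro. The function name MeanBias is an assumption for illustration, and the function expects a one-dimensional VBA array (for example, one built with Array), not a worksheet range.

```vba
' Hypothetical helper: bias = (average of the predictions) - (correct value).
' Predictions must be a one-dimensional array of numbers.
Function MeanBias(Predictions As Variant, ByVal Correct As Double) As Double
    Dim i As Long, N As Long, Total As Double
    For i = LBound(Predictions) To UBound(Predictions)
        Total = Total + Predictions(i)
        N = N + 1
    Next i
    MeanBias = Total / N - Correct
End Function
```

For example, MeanBias(Array(1.9, 2.1, 2.3), 2) averages the three predictions to 2.1 and returns a bias of about 0.1, which is a consistent overestimate.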

After looking at these biases, I think this model is not working as a cross-validation model, especially at the extreme bands. The data at one end is not being well predicted by all the other data, and there is little hope that this will be a useful model at the extreme points. Here is the work that would be done if these results had been better:

1. We want to compute the squares of the differences between predicted values and actual values, so:
   a. In cell C34, enter the label “Squares of Differences”.
   b. In cell D35, enter the formula =(D25-$B4)^2 .
   c. Copy this formula appropriately.

2. Sum all the squares of differences across the rows.

If we adjust for the bias, and this is only appropriate if we have many samples to replicate with unknowns, then the quality of the cross validation will be measured simply by the standard deviation of the predictions. So compute these in column Z.
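The two quality measures discussed here, the sum of squared differences and the standard deviation of the predictions, can be sketched as VBA functions. The names SumSquaredErrors and PredictionSD are assumptions for illustration, and both expect a one-dimensional VBA array. PredictionSD uses the n - 1 divisor, the same convention as the worksheet function STDEV.

```vba
' Hypothetical helper: sum of (prediction - correct value)^2 for one band.
Function SumSquaredErrors(Predictions As Variant, ByVal Correct As Double) As Double
    Dim i As Long, Total As Double
    For i = LBound(Predictions) To UBound(Predictions)
        Total = Total + (Predictions(i) - Correct) ^ 2
    Next i
    SumSquaredErrors = Total
End Function

' Hypothetical helper: sample standard deviation (n - 1 divisor) of the predictions.
Function PredictionSD(Predictions As Variant) As Double
    Dim i As Long, N As Long, Avg As Double, SS As Double
    N = UBound(Predictions) - LBound(Predictions) + 1
    For i = LBound(Predictions) To UBound(Predictions)
        Avg = Avg + Predictions(i) / N
    Next i
    For i = LBound(Predictions) To UBound(Predictions)
        SS = SS + (Predictions(i) - Avg) ^ 2
    Next i
    PredictionSD = Sqr(SS / (N - 1))
End Function
```

The two measures are linked: the sum of squared differences from the correct value equals the sum of squared deviations from the average prediction plus n times the squared bias. Once the bias is removed, only the spread of the predictions remains, which is why the standard deviation is the natural measure in that case.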

Homework:

A graph of log(Molecular weight) versus log relative mobility indicates that these data might be fit well with a quadratic function, where y = log(Molecular weight) and x = log(relative mobility).

1. Graph some of the data of log(Molecular weight) versus log(Relative mobility). Extra credit: Write a macro that can add more series to the graphs for data like this. You cannot just “add data” the way we did before because the new series are x-values, not y-values.

2. Write a macro to produce the estimates and residuals for this model, a quadratic model of log(Molecular weight) versus log(Relative mobility); be sure to include both the R^2 and adjusted R^2 values.

3. Modify the cross-validation macro to do a cross validation prediction of the molecular weights using this model. (You will be predicting log molecular weights using the model above.)

4. Modify the cross-validation macro to do a cross validation prediction of log molecular weights using either log(Relative mobility) or just Relative mobility, but not a quadratic model.

5. Based on what you’ve seen so far, which model will you use to estimate unknowns?

Local Prediction

There is one major reason why we searched for a model that was good for the entire data set. When we estimate an unknown, we’d like to have some idea of the accuracy of the estimate. A single data set and model, together with all the predicted values and residuals, will provide that estimate, not only for the data for which the model was used, but for a new data point as well. This accuracy is part of the answer to the guess for the unknown.

Because we have many gels, this variation can be measured without the need for measuring the residual variation. For example, if we are still concerned with the estimate for the molecular weight for serum albumin, and we have not been successful at finding a global model that seems to work, we could use the two closest relative mobility values to the relative mobility of serum albumin and fit a straight line for these two points.

Local linear prediction

This is sufficiently easy that we do not need a macro. We can enter the local linear predicted values right on the spreadsheet.

1. Make a new sheet with values (not formulas) of the relative mobility values and (log) molecular weights. So as to keep a pattern to the data, keep everything in the same relative position.

2. Write a formula for the linear interpolation of the log molecular weight of Band 2, using Bands 1 and 3. Write it so that it can be copied to Bands 3 through 8. This formula should look something like

=$B6+(D6-D5)/(D6-D4)*($B4-$B6)

3. Band 1 and Band 9 require slightly different treatment, since the point being predicted is not between the two points used; in fact, this should be called linear extrapolation, not linear interpolation. For one of the bands, the result might be

=$B6+($B5-$B6)*(D4-D6)/(D5-D6)

4. Find a similar formula for Band 9, and copy and paste the results. I’ll leave this as a homework problem.

5. Once you have the linearly interpolated results for all the bands and all the gels, compare these to the actual values, find differences, square, and sum. Compare to other predictive results.
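The spreadsheet formulas in steps 2 and 3 are both instances of one straight-line calculation, which can be sketched as a small VBA function. The name LinearPredict and its argument order are assumptions for illustration. The function takes two known points (x1, y1) and (x2, y2) and returns the y-value of the line through them at x; it interpolates when x lies between x1 and x2 and extrapolates otherwise.

```vba
' Hypothetical helper: value at x of the straight line through (x1, y1) and (x2, y2).
' Written in the same form as the worksheet formula
'   =$B6+(D6-D5)/(D6-D4)*($B4-$B6)
' with (x1, y1) = (D4, $B4), (x2, y2) = (D6, $B6) and x = D5.
Function LinearPredict(ByVal x1 As Double, ByVal y1 As Double, _
                       ByVal x2 As Double, ByVal y2 As Double, _
                       ByVal x As Double) As Double
    LinearPredict = y2 + (x2 - x) / (x2 - x1) * (y1 - y2)
End Function
```

As a check on the line y = 2x: LinearPredict(0, 0, 2, 4, 1) interpolates to 2, and LinearPredict(1, 2, 2, 4, 0) extrapolates to 0. The function fails with a division by zero if x1 equals x2, so two bands with identical relative mobility cannot be used as the pair of known points.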

Quadratic and Higher Order Interpolation

If we had enough energy, we could try thousands of models. Perhaps a supermacro could try linear fits with two points, three points, four points, …, up to 9 points (our first attempt). Then we could move onto quadratic fits with three points, 4 points, …, up to 9 points. Models with powers other than 1, 2, …, and so forth could be attempted. For example, one of your homework problems involves using x and x. There was one aspect of the pavement data with which I was working that found the best three-parameter model in powers of x that used y = a + b x^0.6 + c x^2.2 at the left-hand end of the data, but at the right-hand end, the 0.6 value had to be changed to 0.9. Data analysis has a reward-to-effort ratio that can become small if we keep trying new models.

The guide to the models to try is in the graphs. If they look approximately linear, then there is no reason to attempt to fit fourth degree polynomials. Our data seems to be quadratic on the left, and then approximately linear. Or, perhaps they are cubic throughout or reverse S-shaped (the SLIC models). We’ve looked at cubic regressions and the SLIC model, and the residuals are still patterned. Our last attempt will be local quadratic interpolation, which looks more promising than linear, especially for the low molecular weight values.

What is quadratic interpolation or extrapolation? It is the use of three known points to interpolate or extrapolate the unknown y-coordinate of a fourth point.

The first three known points determine a quadratic function that fits them perfectly. Do you know the equation for a quadratic function given three points?

How about a cubic function given four points? You don’t have to try to compute these relations. Multilinear regression will do it for you.
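For readers who do want to see the relation, one standard way to write the quadratic through three points is the Lagrange form, sketched here as a VBA function. This is not the regression route used in this handout; it is offered only as an independent check on the regression result. The name QuadraticPredict is an assumption for illustration, and the three x-values must all be different.

```vba
' Hypothetical helper: value at x of the parabola through
' (x1, y1), (x2, y2) and (x3, y3), written in Lagrange form.
Function QuadraticPredict(ByVal x1 As Double, ByVal y1 As Double, _
                          ByVal x2 As Double, ByVal y2 As Double, _
                          ByVal x3 As Double, ByVal y3 As Double, _
                          ByVal x As Double) As Double
    QuadraticPredict = y1 * (x - x2) * (x - x3) / ((x1 - x2) * (x1 - x3)) _
                     + y2 * (x - x1) * (x - x3) / ((x2 - x1) * (x2 - x3)) _
                     + y3 * (x - x1) * (x - x2) / ((x3 - x1) * (x3 - x2))
End Function
```

As a check with y = x^2: QuadraticPredict(0, 0, 1, 1, 2, 4, 3) returns 9. Each term equals its own y-value at its own x and zero at the other two x-values, which is why the curve passes through all three points exactly.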

Again, make a new workbook and copy the relevant values in the usual places, from A2 to W12. Let us try to predict the smallest log(molecular weight) using the three points closest to this on the graph.

1. Copy the values from B9:B11 to A15.

2. Give them the label 3 y values (or a more creative choice) in A14.

3. Copy the values from D9:D11 to A20.

4. Give them a label in A19 like 3 x values.

5. In B19, enter a label for the squares of the x’s.

6. In B20:B22, square the x-values.

7. Turn on the macro recorder, and give the macro a name like RegressionStep.
   a. Tools > Macro > Record
   b. Type RegressionStep
   c. A control letter is not necessary, since this step will become part of a larger macro
   d. OK

8. Do the regression of the 3 y-values on the 3 x-values and their squares.
   a. Tools > Data Analysis > Regression
   b. Choose the 3 y-values for the y-input (with labels for clarity in output)
   c. Choose the x’s and x-squareds (with labels also) for x’s.
   d. Check labels
   e. Put the output in Y1
   f. Check residuals; they should be zero. Why?

9. Stop the recorder. Please remember this step.

Stop for a moment and examine the regression output and note that the adjusted R-squared is questionable. (There were some divisions by small numbers in this procedure.) This is an overfit model, and cannot be used for any statistical purposes. The whole purpose was to obtain the perfect parabola to fit the x and y values based on Bands 6, 7, and 8. We want to use this parabola to predict Band 9.

First, to stay organized, copy the band labels in Column C so that we know where to place the predicted value for Band 9.

1. Select cell D23 or whatever is opposite the Band 9 label you just copied.

2. Type =

3. Find the value of the intercept (it is cell Z17).

4. Click on Z17

5. Type +

6. Click on Z18 (x coefficient)

7. Type *

8. Type D12, which is the x-value we want, which is adjacent to Band 9 in the original data.

9. Type +

10. Click on Z19 (x^2 coefficient)

11. Type *

12. Type D12^2

13. Enter

14. Stop the recorder.

In Cell D23 should be the formula, =Z17+Z18*D12+Z19*D12^2 .

Turn on the recorder again and give this macro the name PredictionStep (See step 7 above if necessary). Try recording a macro that gives this formula; perhaps it will be easy to modify.

That was no help. I’d rather start from scratch than to attempt to modify that sucker. I only took you through this so that you would know that sometimes the recorder doesn’t help so much.

OK. So what do we want to do? In this column, we want to find the values in the above bold-faced formula. Enter the macro editor and go to the bottom with Ctrl+End.

1. Type Sub Band9 ( Enter )

2. Type Dim Col as Integer (note that it is trying to help you.) ( Enter )

3. Type Col = 4 ( Enter )

4. Type Cells(23, Col) = Cells(17, 26) + Cells(18, 26) * Cells(12, Col) + Cells(19, 26) * Cells(12, Col) ^ 2 (All on one line with no attention to capitalization or spaces.) ( Enter )

5. See if it works.
   a. Go back to the worksheet (Alt+Tab or use the taskbar)
   b. Delete the formula in D23
   c. Run this macro, called Band9
   d. Look at the value in D23; notice it is not a formula but a value.

Now what? We’d like to do this in every column, i.e., get all of the Band 9 predictions. We need a for-loop from Col = 4 to Col = 23. Then we need to copy the correct x and y values over to the place where the regression grabs its inputs.

For Band9, the y-values stay the same, and the x-values move as the column changes, and we want to do all that. We’ll try to do it in class, step by step, but in case we don’t have time, here is the result.

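Sub Band9()
    Dim Col As Integer
    For Col = 4 To 23
        ' Copy this gel's x-values for Bands 6 to 8 to where the regression reads its input
        Range(Cells(9, Col), Cells(11, Col)).Copy Destination:=Cells(20, 1)
        ' Clear the previous regression output (Y1:AN30)
        Range(Cells(1, 25), Cells(30, 40)).ClearContents
        RegressionStep
        ' Intercept is in Z17, x coefficient in Z18, x^2 coefficient in Z19
        Cells(23, Col) = Cells(17, 26) + Cells(18, 26) * Cells(12, Col) + Cells(19, 26) * Cells(12, Col) ^ 2
    Next Col
End Sub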
