Second Steps: Graphs and Statistics for Two

advertisement
Second steps, p.1 of 3, Dr. M. Lawrence Clevenson
Second Steps: Graphs and Statistics for Two-Variable Data
Retrieve the Excel file BostonMarathonWomen…. The other files we’ll be retrieving
today are GasVsTemp, FuelVsSpeed, StatesAndEducation, and SpeakAndReading.
Scatterplots
1. Open the Boston Marathon File. Let us plot the data in the most common way for
two-variable non-deterministic (random) data, a scatterplot. Excel calls this x-y
scatter.
2. Click anywhere in the data, and then click the chart wizard.
a. Choose x-y scatter
b. Connect the dots either smoothly or with straight lines, or leave them
disconnected.
c. Save in this sheet. It’s ugly; too compact.
3. Change the location to a new sheet with Chart > Options
a. Then change the scale on the y-axis by double clicking on the y-axis
i. Choose the Scale tab
ii. Change the minimum to 140 or less
iii. Change the maximum to 195 or more.
b. Give titles to the graph and the x and y-axes.
Is the graph revealing anything that you didn’t know? (Note: 1972 was the first
year women were allowed to compete in the Boston Marathon.)
Fitting Functions to the Data
1.
2.
3.
4.
Retrieve the data on Fuel efficiency.
Plot it following the steps above.
Then choose Chart > Add Trendline
For Type, choose Linear (OK, it’s not, but there is always a “best-fitting”
curve or straight line to any set of data.)
5. Click the Options Tab, and check the last two boxes for equation, and Rsquared. R-squared is an important topic in statistics that will be discussed
later.
The point: Even if your data are clearly non-linear, be wary of statistical
procedures that are inappropriate. They “work” to give you results, but the results
are not useful.
Perhaps these data are better fit by some other simple polynomial. Since these
data don’t appear to have a change of concavity, try a quadratic function, i.e., a
polynomial of degree 2 as follows: Start at the same Step 3 again:
4. For Type, choose polynomial of order (degree) 2
5. Same as above.
Second steps, p.2 of 3, Dr. M. Lawrence Clevenson
This is better, and gives a higher value of R-squared.
Formulas for Linear Regression
Usually, when a statistician use the term "best-fit", she wants to choose values for
unknown parameters that minimize the sum of squares of errors, or residuals. To
understand this concept requires some effort, but we can learn a little more about
spreadsheet formulas in the process. First, let us use some data that seems fairly
linear, GasVsTemp.
1.
2.
3.
4.
5.
6.
7.
8.
9.
Open the Excel file, GasVsTemp.
Plot the data on a scatterplot
Use fill handles to fill-in Column A for the month.
Format cells B1 and C1 with Wrap Text by:
a. Highlight B1 and C1
b. Choose Format > Cells
c. Click on the Alignment Tab
d. Check Wrap Text. While in this dialog box, note that you can choose
to rotate your text either up, down, or slanted, and some people like to
use these. Try it, if you’re waiting for others to catch up.
Insert twos row by highlighting rows 1 and 2 anywhere and click Insert >
Rows
In D1, type Intercept, and in D2 type Slope.
In E1, paste the Intercept function and input the appropriate y and x values.
The y-Array is the Gas data, and the x-Array is the Temp data.
In E2, paste the Slope function with the appropriate y and x values.
Check that these values are the same as given by Add Linear Trendline option
in Chart described above.
Predicted (Fitted) Values and Residuals (Errors)
The values of the slope and intercept can be used to predict or guess (estimate) the
temperature on any number of degrees (average) below 65 degrees. The predicted
value is Gas used = Intercept + Slope  Degrees. For each actual value in the
table, find the predicted value as follows:
1.
2.
3.
4.
5.
6.
Activate E4
Enter the formula = E$1 + E$2 * B4.
Copy and paste this formula into E5 to E19.
Enter the label Predicted Gas Usage from Linear Model into E3
Go back to the Chart and choose Chart > Add Data
Point and click on these new predicted values, which range from E3 to E19 in
the Data sheet. (Including E3, the label, gives a name to this series for the
legend.)
Second steps, p.3 of 3, Dr. M. Lawrence Clevenson
How would you tell someone what the predicted values are, and how to plot
them?
The residuals or errors are the differences between the actual y-values and the
predicted y-values. On the spreadsheet, make a column of residuals by:
1. In F3, enter the label Residuals
2. In F4, enter the formula = C3 – E3
3. Copy and paste this formula into F5 to F19.
Statistics textbooks try to emphasize the formula: Data = Fit + Error. This is not
a complicated equation, since Error is defined as Data – Fit. So your data, in
column C, is the sum of your fitted values in column E and the Errors in Column
F.
A model is considered a good model when the errors, or residuals, do not have a
pattern to them. If there is a pattern, then perhaps a better model will reduce the
magnitudes of the errors. The linear model for Gas Used versus Heating Degree
Days was good for two reasons. The errors, in column F on our worksheet, do not
have a pattern, and the errors are “small.” Can we quantify what we mean by
small?
R-squared, Total Variation, Predicted Variation, and Residual Variation
One of the main ideas in statistics is to study and quantify variation, the
variability of data. The Gas used to heat a home varies from month to month.
One measure of the variability of that variable is its standard deviation. Compute
the standard deviation of the Gas data, and while we’re considering this, let’s
compute the means and standard deviations of both variables in GasVsTemp.
1. Go below the note in the Data sheet to cell A22 and enter the label Mean
2. In cell A23 enter the label Std Devn
3. Compute these values using the Average and Stdev functions. Note:
Once you’ve done it for column B, copy and paste the formulas to column
C.
The standard deviation has this formula
 ( y  y)
2
. The numerator inside the
n 1
radical is called the total variation in y, the sum of the squared deviations of y from its
mean.
Download