ST7002 Introduction to Regression 2015-16
Lab 2
Diamonds Data Set
This exercise will provide you with practice at creating and using derived variables: in particular we use log
transforms and indicator variables. The critical interpretation of the model leads to the idea of raising questions by
exploring a data set through tools provided by Multiple Linear Regression.
We use the Diamonds data set at Week 4 in https://www.scss.tcd.ie/John.Haslett/st7002/
More details are available at http://www.amstat.org/publications/jse/jse_data_archive.htm and
http://www.amstat.org/publications/jse/v9n2/datasets.chu.html. The analysis we will follow will differ a little from
that discussed on these sites.
Open the Minitab file and save it as a Minitab project. Alternatively, open the Excel file, copy the data, click in
the C1 name box of the worksheet in a new Minitab project, and paste. Save the Minitab file [1].
Create a ‘log book’ for your notes as you proceed through the questions below. This can be a physical page, or it
could be a new Word document. MINITAB will create a NotePad file for this purpose; see > Tools. Note that you can
copy and paste from the session window (and in fact you can cut, as well.)
1. As in Week 1, we use versions of Stat > Regression > Fitted Line Plot to plot Y (= Price) against X (= Carat
weight). Having selected X and Y and the linear regression option, choose in turn, from the Options dialogue:
(i) the (default) untransformed data; (ii) log transformation of both variables, with graphs in the regular scale;
and (iii) as in (ii) but using log scales. Do include the option for prediction intervals [2]. In the following we will
consider Multiple Linear Regression variations of the model, in the log scale, and plotted in the log scale. We do
this by transformation.
2. Use the Calc menu to derive new variables LogPrice, via LogTen(Price), and LogCarat, via LogTen(Carat). Use the
default Fitted Line Plot with these variables as Y and X, and confirm that the regular plot of your newly derived
variables, and the fitted equation, correspond to one of the variations of the raw data. (Don’t inadvertently ask
Minitab to apply log transforms to your newly derived variables. Check the Options.)
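As a cross-check on Q2 outside Minitab, the following is a minimal Python sketch of the same idea: take base-10 logs of both variables and fit a straight line by least squares. The function name fit_log_log and the numbers are illustrative assumptions; the values are made up to follow an exact power law and are not taken from the Diamonds data.

```python
import numpy as np

def fit_log_log(price, carat):
    """Least-squares line for log10(price) on log10(carat).

    Returns (intercept, slope) of LogPrice = intercept + slope * LogCarat.
    """
    log_price = np.log10(np.asarray(price, dtype=float))
    log_carat = np.log10(np.asarray(carat, dtype=float))
    # np.polyfit returns coefficients from the highest degree down.
    slope, intercept = np.polyfit(log_carat, log_price, deg=1)
    return intercept, slope

# Made-up values (NOT the Diamonds data): price = 1000 * carat ** 1.5
carat = np.array([0.25, 0.5, 1.0, 2.0])
price = 1000.0 * carat ** 1.5

b0, b1 = fit_log_log(price, carat)
print(b0, b1)
```

Because the made-up prices follow price = 1000 × carat^1.5 exactly, the fit should recover an intercept of 3 and a slope of 1.5, which shows how a straight line in the log scale corresponds to a power law in the raw scale.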
3. Repeat, using Stat>Regression>Regression>Fit Regression Model. From the Dialogue Window, under Graphs…
select the Four-in-One option. We will examine these below. Choose OK. Now, inspect the output in the ‘Session
Window’. Note that there are many ‘unusual observations’; we return to these later. What do the coefficients
mean?
Using ‘pencil and paper’ (or a calculator, or Excel):
a. What are the LogPrice and the Price in Singapore $ of a diamond whose LogCarat is 0? Whose Carat weight
is 0.7? is 1.0? is 1.1?
b. What are 95% prediction intervals for LogPrice at these LogCarat values? Use the fitted value ± 2 S as an
approximation.
c. What are the corresponding 95% intervals for Price?
Now use the Stat>Regression>Regression>Predict option to compute the same prediction intervals. Note that
MINITAB uses a more accurate formula than 2 S.
[1] As you proceed below, repeatedly Save your project; I have managed to cause MINITAB to crash when using
brushing. I had to use Ctrl-Alt-Del to invoke ‘Task Manager’, and within that End Task for MINITAB. Saving
your work is wise.
[2] Tiling and Cascading under the Windows menu is one way to keep track of the many graphs you will create.
Deletion is also good practice. Most plots will allow you to edit the Title more informatively than MINITAB’s default.
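The pencil-and-paper arithmetic in Q3 can be sketched in Python as follows. This is only an illustration under stated assumptions: predict_price is a hypothetical helper, and the coefficients b0, b1 and the value of S used in the example are placeholders, not estimates from the Diamonds data. Substitute the values from your own Session Window.

```python
import math

def predict_price(b0, b1, s, carat):
    """Approximate 95% prediction interval using fitted value +/- 2 S.

    b0, b1: intercept and slope of LogPrice on LogCarat (base-10 logs).
    s:      the regression standard error S reported by the fit.
    """
    log_carat = math.log10(carat)
    log_fit = b0 + b1 * log_carat
    log_lo, log_hi = log_fit - 2 * s, log_fit + 2 * s
    return {
        "log_price": log_fit,
        "log_interval": (log_lo, log_hi),
        # Back-transform from the log scale to the price scale.
        "price": 10 ** log_fit,
        "price_interval": (10 ** log_lo, 10 ** log_hi),
    }

# Placeholder coefficients, NOT taken from the Diamonds data.
result = predict_price(b0=3.0, b1=1.5, s=0.1, carat=1.0)
print(result)
```

Note that the interval is symmetric about the fitted value in the log scale, but after back-transformation it is multiplicative: the upper limit is the predicted price times 10^(2S) and the lower limit is the predicted price divided by 10^(2S).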
4. Now “Interrogate” the Four-in-One plot via Brushing; our focus in this lab is on the Residual vs Fitted Values plot.
Do this by selecting this plot and, by ‘right-clicking’, selecting the ‘brush option’. The cursor becomes a ‘pointing
finger’. Clicking on individual data points shows the same point in all of these plots. Selecting a set of points by
dragging identifies all the points in all the plots. We will look for structure in these residuals.
We explore this by considering other aspects of the data, such as the colour and the clarity, and by using indicator
variables below. Here we use graphical methods.
a. Create an ‘Individual Value Plot’ using the Graph> menu. Select the ‘One Y With Groups’ option. Choose
LogPrice as the ‘Graph variable’ and Clarity as the Categorical Variable.
b. Using brushing [3] of this plot, connect the information to the residual plot. Confirm that Clarity “VS2” has
many negative residuals; what does this mean? Many of the Clarity “VVS2” have high Fitted Values; what
does this mean?
c. Using brushing of the residual plots, connect the information to the Clarity grades. What are the main
clarities of the largest positive residuals? Of the largest negative residuals?
In the light of all this, is there other ‘structure’ in the residuals? Are they – at least partially – predictable from
Clarity? There is no formal test for ‘structure’ in this sense for it is not perfectly defined.
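A simple numerical companion to the brushing in Q4 is the mean residual within each Clarity grade: if the residuals were unrelated to Clarity, these means would all be close to zero. The sketch below assumes the residuals and Clarity labels have been exported as two parallel lists; the helper name and the values are made up for illustration and are not from the Diamonds data.

```python
from collections import defaultdict

def mean_residual_by_group(residuals, groups):
    """Return {group: mean residual} for paired residuals and group labels."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for r, g in zip(residuals, groups):
        totals[g] += r
        counts[g] += 1
    return {g: totals[g] / counts[g] for g in totals}

# Made-up residuals and labels (NOT from the Diamonds data).
residuals = [-0.04, -0.02, 0.03, 0.05, -0.03]
clarity = ["VS2", "VS2", "VVS2", "VVS2", "VS2"]

means = mean_residual_by_group(residuals, clarity)
print(means)
```

Group means that are consistently negative for one grade and positive for another are one informal sign of the kind of ‘structure’ discussed above.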
5. A graphical way to view, for each Clarity separately, the relationships between LogPrice and LogCarat is to use
Scatterplot (with Regression and Groups) from the Graph menu. Use this to plot LogPrice vs LogCarat, using
Clarity as the categorical variable; under Multiple Graphs I find that ‘overlaid’ plots are a useful place to start. Do
straight line plots seem sensible? How about parallel lines? In Q6 we formalise the idea of parallel straight lines.
6. Using Calc>Make Indicator Variables create a set of binary variables corresponding to Clarity. These will be
added to your worksheet by MINITAB. Inspect these new variables. Add ALL of these to your regression model
by selecting these indicator variables as ‘continuous predictors’. Minitab will then drop the ‘last’ one of the
indicator variables. Consider the output in the Session Window. (Again ignore the ‘Unusual values’ for the
moment.)
a. Is the S lower than in 3 above? Is the R² larger? (You can scroll up and down in the Session window.)
b. Write down the linear equations for each of the categories (including the one dropped by MINITAB).
c. What are the predicted LogPrices, for a LogCarat of 0, for each of the Clarities? (Use pencil and paper
and/or Excel, or the > Predict option as above.) What are 95% prediction intervals for these LogPrices?
What are the corresponding Prices in Singapore $?
d. Which Clarity is most expensive on average for a LogCarat value of 0.3? Of 1.0?
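The parallel-lines model of Q6 can be sketched in Python to make explicit how the indicator coefficients turn into one equation per category. This is a minimal sketch, not Minitab's own procedure: the function names are hypothetical, dropping the last level in sorted order is an assumption standing in for Minitab's choice, and the data are made up to lie exactly on three parallel lines.

```python
import numpy as np

def make_indicators(labels):
    """Return (levels, matrix) with one 0/1 column per distinct label."""
    levels = sorted(set(labels))
    matrix = np.array([[1.0 if lab == lev else 0.0 for lev in levels]
                       for lab in labels])
    return levels, matrix

def fit_parallel_lines(log_carat, log_price, labels, drop=None):
    """Fit LogPrice = b0 + slope*LogCarat + indicator terms.

    One indicator (the baseline, `drop`) is left out to avoid redundancy
    with the constant. Returns (slope, {level: intercept}).
    """
    levels, ind = make_indicators(labels)
    drop = levels[-1] if drop is None else drop  # assumed baseline choice
    keep = [i for i, lev in enumerate(levels) if lev != drop]
    X = np.column_stack([np.ones(len(labels)),
                         np.asarray(log_carat, dtype=float),
                         ind[:, keep]])
    coef, *_ = np.linalg.lstsq(X, np.asarray(log_price, dtype=float),
                               rcond=None)
    b0, slope = coef[0], coef[1]
    # The baseline keeps intercept b0; every other level adds its coefficient.
    intercepts = {drop: b0}
    for i, c in zip(keep, coef[2:]):
        intercepts[levels[i]] = b0 + c
    return slope, intercepts

# Made-up data (NOT the Diamonds data): three exactly parallel lines.
labels = ["VS1", "VS1", "VS2", "VS2", "VVS2", "VVS2"]
log_carat = [-0.5, 0.0, -0.5, 0.0, -0.5, 0.0]
true_intercepts = {"VS1": 3.1, "VS2": 3.0, "VVS2": 3.2}
log_price = [true_intercepts[lab] + 1.5 * x
             for lab, x in zip(labels, log_carat)]

slope, intercepts = fit_parallel_lines(log_carat, log_price, labels)
print(slope, intercepts)
```

The equation for the dropped category is simply the constant plus the common slope term; every other category has the same slope and an intercept shifted by its indicator coefficient.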
[3] Before you can brush in a window, you need to enable the window for brushing. The easiest way is via the
‘right-button’, as in Q4.
7. Now repeat Q6b by:
a. Using a different set of 4 of the 5 Clarity indicator variables.
b. Using the option to enter ‘Clarity’ as a categorical predictor instead of using the indicator variables as
above.
Confirm that all procedures in Q6 and Q7 yield exactly the same fitted equations for each Clarity.
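The point of Q7, that the choice of dropped indicator changes the individual coefficients but not the fit, can be checked numerically. The sketch below is an assumption-laden illustration: fitted_values is a hypothetical helper, and the numbers are made up rather than drawn from the Diamonds data.

```python
import numpy as np

def fitted_values(log_carat, log_price, labels, drop):
    """Fitted LogPrice from a parallel-lines model omitting indicator `drop`."""
    levels = sorted(set(labels))
    kept = [lev for lev in levels if lev != drop]
    columns = [np.ones(len(labels)), np.asarray(log_carat, dtype=float)]
    for lev in kept:
        columns.append(np.array([1.0 if lab == lev else 0.0
                                 for lab in labels]))
    X = np.column_stack(columns)
    y = np.asarray(log_price, dtype=float)
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return X @ coef

# Made-up data (NOT the Diamonds data).
labels = ["VS1", "VS1", "VS1", "VS2", "VS2", "VS2", "VVS2", "VVS2", "VVS2"]
log_carat = [-0.5, -0.3, 0.0, -0.5, -0.2, 0.0, -0.4, -0.1, 0.0]
log_price = [2.45, 2.80, 3.18, 2.30, 2.81, 3.05, 2.78, 3.20, 3.41]

fit_a = fitted_values(log_carat, log_price, labels, drop="VVS2")
fit_b = fitted_values(log_carat, log_price, labels, drop="VS1")
same = bool(np.allclose(fit_a, fit_b))
print(same)
```

Both design matrices span the same column space (a constant, LogCarat and the Clarity groups), so the least-squares fitted values agree up to rounding error whichever indicator is omitted.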
8. Inspect the role of Colour (Use the categorical variable procedure as in Q7b).
a. Is the S lower than in 5 above? Is the R² larger?
b. What are the predicted LogPrices, for a LogCarat of 0, for each of the Colours? What are 95% prediction
intervals for these LogPrices? What are the Corresponding Prices in Singapore $? Which Colour is most
expensive on average – for this carat weight? How much more expensive? What if the Carat Weight was
0.3? 1.0?
9. Use both Colour and Clarity in the regression as categorical variables.
a. Is the S lower than in 6 above? Is the R² larger?
b. What are the predicted LogPrices, for a LogCarat of 0, for each of several “Clarity/Colour” combinations?
(You choose.) What are 95% prediction intervals for these LogPrices? What are the Corresponding Prices
in Singapore $?
c. Which “Clarity/Colour” combination is most expensive on average – for a given carat weight? How much
more expensive?
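For Q9, with both factors entered additively, the predicted LogPrice of a combination is the constant plus the LogCarat term plus one Clarity effect plus one Colour effect. A difference d in LogPrice then corresponds to a price ratio of 10^d. The sketch below uses hypothetical function names and placeholder effects, not estimates from the Diamonds data.

```python
def predicted_log_price(b0, slope, clarity_effect, colour_effect, log_carat):
    """Additive model: constant + slope*LogCarat + Clarity and Colour effects."""
    return b0 + slope * log_carat + clarity_effect + colour_effect

def price_ratio(log_price_a, log_price_b):
    """How many times more expensive A is than B, from base-10 LogPrices."""
    return 10.0 ** (log_price_a - log_price_b)

# Placeholder coefficients, NOT taken from the Diamonds data.
b0, slope = 3.0, 1.5
combo_a = {"clarity_effect": 0.10, "colour_effect": 0.05}
combo_b = {"clarity_effect": 0.00, "colour_effect": 0.00}  # baseline levels

ratios = []
for log_carat in (0.0, -0.5):
    lp_a = predicted_log_price(b0, slope, log_carat=log_carat, **combo_a)
    lp_b = predicted_log_price(b0, slope, log_carat=log_carat, **combo_b)
    ratios.append(price_ratio(lp_a, lp_b))
print(ratios)
```

Because the lines are parallel in the log scale, the ratio is the same at every carat weight; this is why ‘how much more expensive’ is most naturally answered as a multiplicative factor rather than a difference in Singapore $.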
10. Study the additional role of Certifying Body as in Q9.
11. Consider the residual plots arising from Q7, 8, 9 above. There is, by construction, no relationship between the
size of the residuals and the variables used to predict; the ‘Individual Value Plots’ are the simplest way to check
this. But the residuals in all models seem to be negative for mid-range Weight and Price. Additionally, all models
consistently produce a surprisingly large number of outliers (very large residuals). All in all, I regard the model
fits as suggesting that we are missing something. I’d prefer at this stage to talk to a Diamond expert!
Review the discussions referenced below. These raise the issue of whether there IS a parallel line solution for all
the data. The author there suggests, in effect, differing analyses for small, medium and large Carat Weight.
It seems to me, purely empirically, that those with Certifying Body IGI are (a) mostly small weight and (b) rather
different from the other bodies (which are mostly heavier). Alternatively, the appropriate metric for weight
may be neither Carat Weight nor LogCarat weight, and this shows up more for the IGI group than the others
because it alone spans the range of small to large weight and price. There’s no strong reason why either of these
explanations should be right. Nor in fact is there any overwhelming reason why LogPrice should be linearly
related to LogCarat, though that does seem to me to be a good starting point.
Discussion at http://www.amstat.org/publications/jse/jse_data_archive.htm and
http://www.amstat.org/publications/jse/v9n2/datasets.chu.html