Assignment 2

STAT 360 - Assignment 2
Fall 2015
83 pts.
Name(s): ________________________
________________________
1 – Haystack Volume
Data Description: These data are haystack measurements taken in Nebraska in 1927 and 1928,
when farmers sold hay unbaled and in a mound-shaped stack, requiring estimation of the
volume of a stack. Two measurements that could be made easily with a rope were usually
employed: the circumference around the base of the stack and the OVER distance, i.e. the
distance from the ground on one side of a stack to the ground on the other side of the stack.
Stacks vary in height and shape so using a simple computation like the volume of a hemisphere,
while perhaps a useful first approximation, may or may not be sufficiently accurate.
Source: Methods of Correlation Analysis, Ezekiel, M., (1941), 2nd ed, New York: Wiley, p. 378-380
Datafile: Haystacks.JMP
[Figure: Typical Haystack (I think…)]
Answer the following using whatever software you’d like.
Marginal Distributions
a. Obtain estimates of the following quantities for Volume (ft³). (3 pts)
Mean =
Variance (SD) =
Total Variation (SYY) =
b. Obtain estimates of the following quantities for Circumference (ft.) and Over (ft.). (3 pts.)

Statistic                Circumference (ft.)    Over (ft.)
Mean                     ________               ________
Variance (SD)            ________               ________
Total Variation (SXX)    ________               ________

Note: To obtain SXX and SYY you can use Corrected SS from the Summary Statistics > Customize Summary Statistics list.
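If you are working outside JMP, the same quantities can be computed directly, since the corrected sum of squares is SYY = Σ(y_i − ȳ)² = (n − 1)·s². The following is only a minimal Python sketch: the function name `summary_stats` and the numbers in `example` are made up for illustration and are not taken from Haystacks.JMP.

```python
import statistics

def summary_stats(values):
    """Return the mean, sample variance, sample SD and corrected sum of squares."""
    n = len(values)
    mean = statistics.fmean(values)
    variance = statistics.variance(values)   # sample variance, divides by n - 1
    sd = statistics.stdev(values)             # square root of the sample variance
    corrected_ss = sum((v - mean) ** 2 for v in values)  # SXX or SYY
    return {"n": n, "mean": mean, "variance": variance, "sd": sd, "ss": corrected_ss}

# Hypothetical measurements, NOT the haystack data
example = [2.0, 4.0, 4.0, 6.0, 9.0]
stats = summary_stats(example)
print(stats)
```

The `ss` entry should always equal `(n - 1) * variance`, which is a quick check on whatever software you use.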
Joint Distributions of (Circumference,Volume) and (Over,Volume)
c. Create a plot visualizing the joint distribution of Volume (Y) vs. Circumference (X).
Give a brief statement (one or two sentences) about the general patterns you see in this
plot. Would you characterize the joint distribution as BVN? Why or why not? (4 pts.)
d. Create a plot visualizing the joint distribution of Volume (Y) vs. Over (X). Give a brief
statement (one or two sentences) about the general patterns you see in this plot. Would
you characterize the joint distribution as BVN? Why or why not? (4 pts.)
e. Which potential predictor, circumference or over, appears to have the stronger
relationship with the volume of the haystack? Explain. (2 pts.)
f.
Show the math as to why it is reasonable to estimate the volume of a haystack using the
relationship below. (3 pts)
$E(\text{Volume} \mid \text{Circumference}) \approx \dfrac{\text{Circumference}^3}{12\pi^2}$
g.
Use the function above as an estimate of the mean function for Volume | Circumference.
You can use the JMP Calculator to create a new column containing the mean function
values. Plot this estimated mean function on a scatterplot of Volume vs. Circumference.
This can be done by first forming a new column containing the formula above and then
using Graph > Overlay Plots in JMP to plot both Volume and Volume Formula vs.
Circumference. (3 pts)
h.
Obtain the Residual² ($\hat{e}^2$) value for each point in your dataset. Sum up these values to
obtain the total unexplained variation for the mean function given above. (3 pts)
$RSS = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 =$
i.
What proportion of the total variation in volume (SYY) is being explained by using this
mean function? Show the math here. (2 pts)
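For those not using the JMP Calculator, parts (g) through (i) can be sketched in Python as follows. This is an illustration under stated assumptions: the function names are invented here, the (circumference, volume) pairs are hypothetical rather than the Haystacks data, and the proportion explained is taken to be 1 − RSS/SYY.

```python
import math

def hemisphere_volume(circumference):
    """Mean function from part (f): C^3 / (12 * pi^2)."""
    return circumference ** 3 / (12 * math.pi ** 2)

def rss(y, y_hat):
    """Residual sum of squares: sum of squared (observed - fitted)."""
    return sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat))

def proportion_explained(y, y_hat):
    """1 - RSS/SYY, where SYY is the corrected sum of squares of y."""
    mean_y = sum(y) / len(y)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    return 1 - rss(y, y_hat) / syy

# Hypothetical (circumference, volume) pairs, NOT the haystack data
circ = [60.0, 65.0, 70.0, 75.0]
vol = [1900.0, 2400.0, 2800.0, 3700.0]
fitted = [hemisphere_volume(c) for c in circ]
print(rss(vol, fitted), proportion_explained(vol, fitted))
```

Because this mean function is fixed in advance rather than fit by least squares, 1 − RSS/SYY is not guaranteed to fall between 0 and 1. It can be negative if the formula predicts worse than the sample mean.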
Distribution of Volume | Over
j.
What would be a reasonable function for estimating volume if the Over measurement
was used instead of the Circumference measurement? Explain how you obtained this
function. (4 pts)
$E(\text{Volume} \mid \text{Over}) = \; ??$
k.
Use your function from part (j) to obtain the Residual² ($\hat{e}^2$) value for each point in
your dataset. Sum up these values to obtain the variation unexplained by the mean
function using Over. (3 pts)
$RSS = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 =$
l.
Once again, what proportion of the total variation in volume (SYY) is being explained by
using a mean function based on Over? (2 pts)
Comparing the Estimating Mean Functions
m.
Which is the better measurement to use for estimating the average volume of a haystack – the
Circumference or the Over measurement? Explain using appropriate summary statistics
from fitting these two models for haystack volume. (3 pts)
Investigation of the Variance Function
n.
Pick one set of the residuals obtained above. Obtain the |Residual| value for each data
point and plot it against Circumference or Over, depending on which mean function
was used. Discuss the general pattern seen in this plot. Does the variance function
appear to increase/decrease/not change as the haystacks get larger? Discuss the possible
consequences of a changing variance function on someone who purchases hay in this
manner. (4 pts)
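As a rough numerical companion to the |Residual| plot (not a replacement for it), one can compare the average |Residual| for the smaller stacks with that for the larger stacks. The sketch below is only one way to do this; the helper name, the split into two halves, and the data are all assumptions made for illustration.

```python
def mean_abs_residual_by_half(x, residuals):
    """Sort points by x, split them into a lower and an upper half,
    and return the mean |residual| within each half."""
    pairs = sorted(zip(x, residuals))          # order points by the predictor
    half = len(pairs) // 2
    lower = [abs(r) for _, r in pairs[:half]]
    upper = [abs(r) for _, r in pairs[half:]]
    return sum(lower) / len(lower), sum(upper) / len(upper)

# Hypothetical predictor values and residuals, NOT the haystack data
x = [50.0, 55.0, 60.0, 65.0, 70.0, 75.0]
res = [-20.0, 30.0, -60.0, 90.0, -150.0, 200.0]
low, high = mean_abs_residual_by_half(x, res)
print(low, high)
```

A noticeably larger value in the upper half would suggest that the variance function increases with the size of the stack.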
2 – Percent Body Fat (%) and Abdominal Circumference (cm)
(Datafiles: Bodyfat.JMP and Bodyfat.txt)
In these data, percent body fat was found by determining each subject’s density by
weighing them underwater. This is an expensive and inconvenient way to measure a subject’s
percent body fat. Regression techniques can be used to develop models to predict percent body
fat (Y) using easily measured body dimensions. In this study n = 252 men were used, and in this
problem we will focus on the relationship between percent body fat and abdominal
circumference, the same relationship examined with these data in the notes.
a. Create a plot visualizing the joint distribution of (Percent Body Fat, Abdomen). Does the
joint distribution appear to be approximately bivariate normal? Justify your answer.
(3 pts.)
b. Find the estimates for the following parameters of a bivariate normal distribution
for these data ($\mu_X, \mu_Y, \sigma_X^2, \sigma_Y^2, \rho$).
(5 pts.)
c. Assuming bivariate normality and using the statistics found in part (b), find
$\hat{E}(\text{Percent Body Fat} \mid \text{Abdominal Circumference}) = \hat{E}(Y \mid X)$
$\widehat{Var}(\text{Percent Body Fat} \mid \text{Abdominal Circumference}) = \widehat{Var}(Y \mid X)$
$\widehat{SD}(\text{Percent Body Fat} \mid \text{Abdominal Circumference}) = \widehat{SD}(Y \mid X)$
(7 pts.)
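For reference, under bivariate normality the conditional mean is linear, E(Y|X = x) = μ_Y + ρ(σ_Y/σ_X)(x − μ_X), and the conditional variance is constant, Var(Y|X) = σ_Y²(1 − ρ²). The Python sketch below plugs estimates into these standard results. The function name and the input values are placeholders chosen for illustration, not the body fat estimates.

```python
import math

def bvn_conditional(mu_x, mu_y, var_x, var_y, rho):
    """Return (intercept, slope, conditional variance, conditional SD)
    of Y given X for a bivariate normal with the given parameters."""
    sd_x, sd_y = math.sqrt(var_x), math.sqrt(var_y)
    slope = rho * sd_y / sd_x              # change in E(Y|X) per unit of X
    intercept = mu_y - slope * mu_x        # so that E(Y|X = mu_x) = mu_y
    cond_var = var_y * (1 - rho ** 2)      # does not depend on x
    return intercept, slope, cond_var, math.sqrt(cond_var)

# Placeholder parameter values, NOT estimates from Bodyfat.JMP
b0, b1, cvar, csd = bvn_conditional(mu_x=10.0, mu_y=5.0, var_x=4.0, var_y=9.0, rho=0.5)
print(b0, b1, cvar, csd)
```

With real estimates in place of the placeholders, `b0 + b1 * x` gives the estimated mean percent body fat at abdominal circumference `x`.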
d. What is the $R^2$ for the regression of percent body fat (Y) on abdominal
circumference (X)? Interpret this quantity. (2 pts.)
e. Use Analyze > Fit Y by X to fit a regression line to these data. How does the
estimated regression line compare to your estimate of $E(Y \mid X)$ found in part (c)
above?
(2 pts.)
f. How do the RMSE and $R^2$ for the regression line fit in part (e) compare to your
estimate of $SD(Y \mid X)$ from part (c) and your R-square estimate from part (d)?
(2 pts.)
3 – List Price ($) and Finished Living Area (ft²) for Minneapolis and St. Paul
These data were obtained using a real estate browsing site called Redfin® (www.redfin.com). By
using this site you can search numerous metropolitan areas within the U.S. for homes currently
on the market using a mapping tool, and then download numerous characteristics of the
properties on the map into a .csv file.
Data Source: www.redfin.com
Data File: Twin Cities Homes.JMP and Twin Cities Homes.txt
The variables in this data file are:
• City – city or suburb the home is located in
• ZIP – ZIP code
• ListPrice – current list price of the home ($)
• BEDS – number of bedrooms
• BATHS – number of bathrooms
• SQFT – square footage of the finished living area (ft²)
• LotSize – square footage of the lot (ft²)
• Age – age of the home in years
• HasGarage – categorical variable (Garage or No Garage)
• LATITUDE – latitude of the home
• LONGITUDE – longitude of the home
a. How would you characterize the marginal distributions of List Price (Y) and
SQFT (X), the finished living area (ft²)? (2 pts.)
b. Use Tukey’s Ladder of Powers to find power transformations of ListPrice and
SQFT to improve approximate normality. Include plots of your transformed
variables. (4 pts.)
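One way to explore the ladder numerically is to apply each power and look at a crude symmetry measure such as sample skewness. The sketch below is an illustration only: the rungs listed in `LADDER`, the convention that power 0 means the natural log, the use of skewness as the criterion, and the sample values are all assumptions. The plots the question asks for should still be the basis of your answer.

```python
import math

LADDER = [-2, -1, -0.5, 0, 0.5, 1, 2]   # power 0 is taken to mean log(x)

def tukey_transform(values, power):
    """Apply one rung of the ladder to strictly positive values."""
    if any(v <= 0 for v in values):
        raise ValueError("values must be strictly positive")
    if power == 0:
        return [math.log(v) for v in values]
    return [v ** power for v in values]

def skewness(values):
    """Moment-based sample skewness: m3 / m2^(3/2)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

def least_skewed_power(values, ladder=LADDER):
    """Return the rung whose transformed values have the smallest |skewness|."""
    return min(ladder, key=lambda p: abs(skewness(tukey_transform(values, p))))

# Hypothetical right-skewed sample, NOT the Twin Cities data
sample = [math.exp(k) for k in (-2, -1, 0, 1, 2)]
print(least_skewed_power(sample))
```

In practice you would run `least_skewed_power` separately on ListPrice and SQFT and then confirm the suggested power with a histogram or normal quantile plot.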
c. Use trendscatter(ListPrice~SQFT) to obtain visualizations of
$E(Y \mid X)$ and $SD(Y \mid X)$. Discuss each. (4 pts.)
d. Use the Bulging Rule to find a transformation of List Price (call it $Y^*$) and SQFT
(call it $X^*$) such that $E(Y^* \mid X^*) \approx \beta_0 + \beta_1 X^*$ and $Var(Y^* \mid X^*) \approx \text{constant}$. You may
not be able to achieve both of these goals, but try. Include a scatterplot of
$Y^*$ vs. $X^*$. You can do this in JMP or using trendscatter in R. (5 pts.)