STAT 360 - Assignment 2
Fall 2015
83 pts.

Name(s): ________________________ ________________________

1 – Haystack Volume

Data Description: These data are haystack measurements taken in Nebraska in 1927 and 1928, when farmers sold hay unbaled and in a mound-shaped stack, requiring estimation of the volume of a stack. Two measurements that could be made easily with a rope were usually employed: the circumference around the base of the stack and the OVER distance, i.e. the distance from the ground on one side of a stack to the ground on the other side of the stack. Stacks vary in height and shape so using a simple computation like the volume of a hemisphere, while perhaps a useful first approximation, may or may not be sufficiently accurate.

Source: Methods of Correlation Analysis, Ezekiel, M., (1941), 2nd ed, New York: Wiley, p. 378-380

Datafile: Haystacks.JMP

Figure: Typical Haystack (I think…)

Answer the following using whatever software you’d like.

Marginal Distributions

a. Obtain estimates of the following quantities for Volume (ft³). (3 pts)

   Mean =
   Variance (SD) =
   Total Variation (SYY) =

b. Obtain estimates of the following quantities for Circumference (ft.) and Over (ft.). (3 pts.)

   Statistic              Mean        Variance (SD)    Total Variation (SXX)
   Circumference (ft.)    ________    ________         ________
   Over (ft.)             ________    ________         ________

   Note: To obtain SXX and SYY you can use Corrected SS from the Summary Statistics > Customize Summary Statistics list.

Joint Distributions of (Circumference, Volume) and (Over, Volume)

c. Create a plot visualizing the joint distribution of Volume (Y) vs. Circumference (X). Give a brief statement (one or two sentences) about the general patterns you see in this plot. Would you characterize the joint distribution as BVN? Why or why not? (4 pts.)

d. Create a plot visualizing the joint distribution of Volume (Y) vs. Over (X). Give a brief statement (one or two sentences) about the general patterns you see in this plot. Would you characterize the joint distribution as BVN? Why or why not? (4 pts.)

e.
Which potential predictor, circumference or over, appears to have the strongest relationship with the volume of the haystack? Explain. (2 pts.)

f. Show the math as to why it is reasonable to estimate the volume of a haystack using the relationship below. (3 pts)

   \( E(\text{Volume} \mid \text{Circumference}) \approx \dfrac{\text{Circumference}^3}{12\pi^2} \)

g. Use the function above as an estimate of the mean function for Volume | Circumference. You can use the JMP Calculator to create a new column containing the mean function values. Plot this estimated mean function on a scatterplot of Volume vs. Circumference. This can be done by first forming a new column containing the formula above and then using Graph > Overlay Plots in JMP to plot both Volume and Volume Formula vs. Circumference. (3 pts)

h. Obtain the squared residual, Residual² (\( \hat{e}^2 \)), for each point in your dataset. Sum up these values to obtain the total unexplained variation for the mean function given above. (3 pts)

   \( RSS = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = \)

i. What proportion of the total variation in volume (SYY) is being explained by using this mean function? Show the math here. (2 pts)

Distribution of Volume | Over

j. What would be a reasonable function for estimating volume if the Over measurement was used instead of the Circumference measurement? Explain how you obtained this function. (4 pts)

   \( E(\text{Volume} \mid \text{Over}) = \; ?? \)

k. Use your function from part (j) to obtain the squared residual, Residual² (\( \hat{e}^2 \)), for each point in your dataset. Sum up these values to obtain the variation unexplained by the mean function using Over. (3 pts)

   \( RSS = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = \)

l. Once again, what proportion of the total variation in volume (SYY) is being explained by using a mean function based on Over? (2 pts)

Comparing the Estimated Mean Functions

m. Which is the better measurement to use for estimating the average volume of a haystack: the Circumference or the Over measurement? Explain using appropriate summary statistics from fitting these two models for haystack volume.
(3 pts)

Investigation of the Variance Function

n. Pick one set of the residuals obtained above. Obtain the |Residual| value for each data point and plot it against Circumference or Over, depending on which mean function was used. Discuss the general pattern seen in this plot. Does the variance function appear to increase/decrease/not change as the haystacks get larger? Discuss the possible consequences of a changing variance function on someone who purchases hay in this manner. (4 pts)

2 – Percent Body Fat (%) and Abdominal Circumference (cm)
(Datafiles: Bodyfat.JMP and Bodyfat.txt)

In these data, percent body fat was found by determining each subject’s density by weighing them underwater. This is an expensive and inconvenient way to measure a subject’s percent body fat. Regression techniques can be used to develop models to predict percent body fat (Y) using easily measured body dimensions. In this study n = 252 men were used, and in this problem we will focus on the relationship between percent body fat and abdominal circumference. These data were also used in the notes to examine this relationship.

a. Create a plot visualizing the joint distribution of (Percent Body Fat, Abdomen). Does the joint distribution appear to be approximately bivariate normal? Justify your answer. (3 pts.)

b. Find the estimates for the following parameters of a bivariate normal distribution for these data: \( (\mu_X, \mu_Y, \sigma_X^2, \sigma_Y^2, \rho) \). (5 pts.)

c. Assuming bivariate normality and using the statistics found in part (b), find the following. (7 pts.)

   \( \hat{E}(\text{Percent Body Fat} \mid \text{Abdominal Circumference}) = \hat{E}(Y \mid X) = \)
   \( \widehat{Var}(\text{Percent Body Fat} \mid \text{Abdominal Circumference}) = \widehat{Var}(Y \mid X) = \)
   \( \widehat{SD}(\text{Percent Body Fat} \mid \text{Abdominal Circumference}) = \widehat{SD}(Y \mid X) = \)

d. What is the \( R^2 \) for the regression of percent body fat (Y) on abdominal circumference (X)? Interpret this quantity. (2 pts.)

e. Use Analyze > Fit Y by X to fit a regression line to these data.
How does the estimated regression line compare to your estimate of \( E(Y \mid X) \) found in part (c) above? (2 pts.)

f. How do the RMSE and \( R^2 \) for the regression line fit in part (e) compare to your estimate of \( SD(Y \mid X) \) from part (c) and your R-square estimate from part (d)? (2 pts.)

3 – List Price ($) and Finished Living Area (ft²) for Minneapolis and St. Paul

These data were obtained using a real estate browsing site called Redfin ® (www.redfin.com). By using this site you can search numerous metropolitan areas within the U.S. for homes currently on the market using a mapping tool, and then download numerous characteristics of the properties on the map into a .csv file.

Data Source: www.redfin.com
Data File: Twin Cities Homes.JMP and Twin Cities Homes.txt

The variables in this data file are:

• City – city or suburb the home is located in
• ZIP – ZIP code
• ListPrice – current list price of the home ($)
• BEDS – number of bedrooms
• BATHS – number of bathrooms
• SQFT – square footage of the finished living area (ft²)
• LotSize – square footage of the lot (ft²)
• Age – age of the home in years
• HasGarage – categorical variable (Garage or No Garage)
• LATITUDE – latitude of the home
• LONGITUDE – longitude of the home

a. How would you characterize the marginal distributions of List Price (Y) and SQFT (X), the finished living area (ft²)? (2 pts.)

b. Use Tukey’s Ladder of Powers to find power transformations of ListPrice and SQFT to improve approximate normality. Include plots of your transformed variables. (4 pts.)

c. Use trendscatter(ListPrice~SQFT) to obtain visualizations of \( E(Y \mid X) \) and \( SD(Y \mid X) \). Discuss each. (4 pts.)

d. Use the Bulging Rule to find a transformation of List Price (call it \( Y^* \)) and SQFT (call it \( X^* \)) such that \( E(Y^* \mid X^*) \approx \beta_0 + \beta_1 X^* \) and \( Var(Y^* \mid X^*) \approx \text{constant} \). You may not be able to achieve both of these goals, but try. Include a scatterplot of \( Y^* \) vs. \( X^* \). You can do this in JMP or using trendscatter in R. (5 pts.)
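Note: For anyone working outside JMP or R, the ladder-of-powers search in Problem 3 can be sketched in Python as shown here. This is only a rough illustration under stated assumptions: the data are synthetic (not the Twin Cities file), the helper names ladder_transform, skewness, and best_power are made up for this sketch, and "skewness closest to zero" is just one convenient numerical stand-in for judging approximate normality. You should still look at the plots.

```python
import math
import random
import statistics


def ladder_transform(values, power):
    """Tukey's ladder of powers: x**power, with log(x) standing in for power 0."""
    if any(v <= 0 for v in values):
        raise ValueError("ladder of powers needs strictly positive values")
    if power == 0:
        return [math.log(v) for v in values]
    return [v ** power for v in values]


def skewness(values):
    """Sample skewness: the average cubed z-score (population form)."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return sum(((v - mean) / sd) ** 3 for v in values) / len(values)


def best_power(values, powers=(-1, -0.5, 0, 0.5, 1, 2)):
    """Return the ladder power whose transformed values have skewness closest to 0."""
    return min(powers, key=lambda p: abs(skewness(ladder_transform(values, p))))


# Synthetic, right-skewed stand-in for a variable like a list price.
# These numbers are an assumption for illustration, NOT the assignment data.
random.seed(360)
fake_prices = [math.exp(random.gauss(12.5, 0.6)) for _ in range(500)]

for p in (-1, -0.5, 0, 0.5, 1, 2):
    transformed = ladder_transform(fake_prices, p)
    print(f"power {p:>4}: skewness = {skewness(transformed):.3f}")
print("power with skewness closest to zero:", best_power(fake_prices))
```

To try this on real data, replace fake_prices with a list of the ListPrice or SQFT values read from Twin Cities Homes.txt, and compare the suggested power with what the histograms or normal quantile plots show.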