Assignment 2

STAT 360 - Assignment 2
Fall 2015
83 pts.
Name(s): ________________________ ________________________

1 - Haystack Volume

Data Description: These data are haystack measurements taken in Nebraska in 1927 and 1928, when farmers sold hay unbaled and in a mound-shaped stack, requiring estimation of the volume of a stack. Two measurements that could be made easily with a rope were usually employed: the circumference around the base of the stack and the OVER distance, i.e. the distance from the ground on one side of a stack to the ground on the other side of the stack. Stacks vary in height and shape, so using a simple computation like the volume of a hemisphere, while perhaps a useful first approximation, may or may not be sufficiently accurate.

Source: Methods of Correlation Analysis, Ezekiel, M., (1941), 2nd ed, New York: Wiley, p. 378-380
Datafile: Haystacks.JMP

[Figure: Typical Haystack (I think…)]

Answer the following using whatever software you'd like.

Marginal Distributions

a. Obtain estimates of the following quantities for Volume (ft³). (3 pts)

   Mean =
   Variance (SD) =
   Total Variation (SYY) =

b. Obtain estimates of the following quantities for Circumference (ft.) and Over (ft.). (3 pts.)

   Statistic              Mean    Variance (SD)    Total Variation (SXX)
   Circumference (ft.)
   Over (ft.)

   Note: To obtain SXX and SYY you can use Corrected SS from the Summary Statistics > Customize Summary Statistics list.

Joint Distributions of (Circumference, Volume) and (Over, Volume)

c. Create a plot visualizing the joint distribution of Volume (Y) vs. Circumference (X). Give a brief statement (one or two sentences) about the general patterns you see in this plot. Would you characterize the joint distribution as BVN? Why or why not? (4 pts.)

d. Create a plot visualizing the joint distribution of Volume (Y) vs. Over (X). Give a brief statement (one or two sentences) about the general patterns you see in this plot. Would you characterize the joint distribution as BVN? Why or why not? (4 pts.)

e. Which potential predictor, Circumference or Over, appears to have the strongest relationship with the volume of the haystack? Explain. (2 pts.)

f. Show the math as to why it is reasonable to estimate the volume of a haystack using the relationship below. (3 pts)

   $$E(\text{Volume} \mid \text{Circumference}) \approx \frac{\text{Circumference}^3}{12\pi^2}$$

g. Use the function above as an estimate of the mean function for Volume | Circumference. You can use the JMP Calculator to create a new column containing the mean function values. Plot this estimated mean function on a scatterplot of Volume vs. Circumference. This can be done by first forming a new column containing the formula above and then using Graph > Overlay Plots in JMP to plot both Volume and Volume Formula vs. Circumference. (3 pts)

h. Obtain the squared residual ($\hat{e}^2$) for each point in your dataset. Sum these values to obtain the total unexplained variation for the mean function given above. (3 pts)

   $$RSS = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 =$$

i. What proportion of the total variation in volume (SYY) is being explained by using this mean function? Show the math here. (2 pts)

Distribution of Volume | Over

j. What would be a reasonable function for estimating volume if the Over measurement were used instead of the Circumference measurement? Explain how you obtained this function. (4 pts)

   $$E(\text{Volume} \mid \text{Over}) = \;???$$

k. Use your function from part (j) to obtain the squared residual ($\hat{e}^2$) for each point in your dataset. Sum these values to obtain the variation unexplained by the mean function using Over. (3 pts)

   $$RSS = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 =$$

l. Once again, what proportion of the total variation in volume (SYY) is being explained by using a mean function based on Over? (2 pts)

Comparing the Estimated Mean Functions

m. Which is the better measurement to use for estimating the average volume of a haystack: Circumference or Over? Explain using appropriate summary statistics from fitting these two models for haystack volume. (3 pts)

Investigation of the Variance Function

n. Pick one set of the residuals obtained above. Obtain the |Residual| value for each data point and plot it against Circumference or Over, depending on which mean function was used. Discuss the general pattern you see in this plot. Does the variance function appear to increase, decrease, or not change as the haystacks get larger? Discuss the possible consequences of a changing variance function for someone who purchases hay in this manner. (4 pts)

2 - Percent Body Fat (%) and Abdominal Circumference (cm)

Datafiles: Bodyfat.JMP and Bodyfat.txt

These data contain percent body fat values found by determining each subject's density through underwater weighing. This is an expensive and inconvenient way to measure a subject's percent body fat. Regression techniques can be used to develop models to predict percent body fat (Y) using easily measured body dimensions. In this study n = 252 men were used, and in this problem we will focus on the relationship between percent body fat and abdominal circumference. These data were used in the notes to examine the relationship between percent body fat and abdominal circumference.

a. Create a plot visualizing the joint distribution of (Percent Body Fat, Abdomen). Does the joint distribution appear to be approximately bivariate normal? Justify your answer. (3 pts.)

b. Find the estimates for the following parameters of a bivariate normal distribution for these data: $(\mu_X, \mu_Y, \sigma_X^2, \sigma_Y^2, \rho)$. (5 pts.)

c. Assuming bivariate normality and using the statistics found in part (b), find the following. (7 pts.)

   $$\hat{E}(\text{Percent Body Fat} \mid \text{Abdominal Circumference}) = \hat{E}(Y \mid X)$$
   $$\widehat{Var}(\text{Percent Body Fat} \mid \text{Abdominal Circumference}) = \widehat{Var}(Y \mid X)$$
   $$\widehat{SD}(\text{Percent Body Fat} \mid \text{Abdominal Circumference}) = \widehat{SD}(Y \mid X)$$

d. What is the $R^2$ for the regression of percent body fat (Y) on abdominal circumference (X)? Interpret this quantity. (2 pts.)

e. Use Analyze > Fit Y by X to fit a regression line to these data. How does the estimated regression line compare to your estimate of $E(Y \mid X)$ found in part (c) above? (2 pts.)

f. How do the RMSE and $R^2$ for the regression line fit in part (e) compare to your estimate of $SD(Y \mid X)$ from part (c) and your R-square estimate from part (d)? (2 pts.)

3 - List Price ($) and Finished Living Area (ft²) for Minneapolis and St. Paul

These data were obtained using a real estate browsing site called Redfin® (www.redfin.com). By using this site you can search numerous metropolitan areas within the U.S. for homes currently on the market using a mapping tool, and then download numerous characteristics of the properties on the map into a .csv file.

Data Source: www.redfin.com
Data File: Twin Cities Homes.JMP and Twin Cities Homes.txt

The variables in this data file are:

   City: city or suburb the home is located in
   ZIP: ZIP code
   ListPrice: current list price of the home ($)
   BEDS: number of bedrooms
   BATHS: number of bathrooms
   SQFT: square footage of the finished living area (ft²)
   LotSize: square footage of the lot (ft²)
   Age: age of the home in years
   HasGarage: categorical variable (Garage or No Garage)
   LATITUDE: latitude of the home
   LONGITUDE: longitude of the home

a. How would you characterize the marginal distributions of ListPrice (Y) and SQFT (X), the finished living area (ft²)? (2 pts.)

b. Use Tukey's Ladder of Powers to find power transformations of ListPrice and SQFT to improve approximate normality. Include plots of your transformed variables. (4 pts.)

c. Use trendscatter(ListPrice~SQFT) to obtain visualizations of $E(Y \mid X)$ and $SD(Y \mid X)$. Discuss each. (4 pts.)

d. Use the Bulging Rule to find a transformation of ListPrice (call it $Y^*$) and SQFT (call it $X^*$) such that $E(Y^* \mid X^*) \approx \beta_0 + \beta_1 X^*$ and $Var(Y^* \mid X^*) \approx \text{constant}$. You may not be able to achieve both of these goals, but try. Include a scatterplot of $Y^*$ vs. $X^*$. You can do this in JMP or using trendscatter in R. (5 pts.)
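As a rough illustration of the Tukey's Ladder of Powers idea mentioned in Problem 3(b), the sketch below tries each rung of the ladder on a variable and reports the sample skewness after transformation. It is only a minimal sketch under stated assumptions: the data are synthetic (generated here, not taken from Twin Cities Homes), the function names tukey_transform, skewness, and least_skewed_power are hypothetical, and "skewness closest to zero" is just one convenient numeric stand-in for the visual judgment the assignment asks for.

```python
import math
import random

# Rungs of the ladder considered here; 0 stands for the log transformation.
LADDER = [2, 1, 0.5, 0, -0.5, -1]


def tukey_transform(values, power):
    """Apply one rung of Tukey's ladder to strictly positive values."""
    if power == 0:
        return [math.log(v) for v in values]
    # Negative powers are negated so the ordering of the data is preserved.
    sign = 1 if power > 0 else -1
    return [sign * v ** power for v in values]


def skewness(values):
    """Moment-based sample skewness (third central moment / variance^1.5)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


def least_skewed_power(values):
    """Return the ladder power whose transformed data has skewness nearest 0."""
    return min(LADDER, key=lambda p: abs(skewness(tukey_transform(values, p))))


# Synthetic, right-skewed stand-in for list prices (NOT the Redfin data).
rng = random.Random(360)
fake_prices = [math.exp(rng.gauss(12.3, 0.5)) for _ in range(2000)]

for p in LADDER:
    print(f"power {p:>4}: skewness = {skewness(tukey_transform(fake_prices, p)):.3f}")
print("least skewed power:", least_skewed_power(fake_prices))
```

A skewness summary like this can help narrow down candidate powers, but histograms or normal quantile plots of the transformed variables are still needed to judge approximate normality.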
