STAT 425 – Modern Methods of Data Analysis (41 pts.)
Assignment 4 – ACE/AVAS and Projection Pursuit Regression

PROBLEM 1 – PREDICTING THE AGE OF AN ABALONE

The age of an abalone is determined by cutting the shell through the cone, staining it, and counting the number of rings through a microscope -- a boring and time-consuming task. Other measurements, which are easier to obtain, are often used to predict the age. Further information, such as weather patterns and location (hence food availability), may be required to solve the problem.

Attribute Information: For each attribute, the name, data type, measurement unit, and a brief description are given. The number of rings is the value to predict. These data are contained in the data frame Abalone.

Name / Data Type / Measurement Unit / Description
Length / continuous / mm / longest shell measurement
Diameter / continuous / mm / perpendicular to length
Height / continuous / mm / with meat in shell
Whole weight / continuous / grams / whole abalone
Shucked weight / continuous / grams / weight of meat
Viscera weight / continuous / grams / gut weight (after bleeding)
Shell weight / continuous / grams / after being dried
Rings / integer / -- / +1.5 gives the age in years

Data frame columns: length, diam, height, whole.weight, shucked.weight, visc.weight, shell.weight, Rings

a) Use ACE to find "optimal" transformations for the predictors and for the response. Plot the results. Discuss the transformations and try to identify which parametric transformations you think they suggest. (5 pts.)

b) Use the parametric versions of the transformations suggested by ACE in an OLS model. Compare these results to ACE via the R². Examine residual plots and critique this model. (5 pts.)

c) Use AVAS to find "optimal" transformations for the predictors and the response. Plot the results. How do the transformations compare to those from ACE? Discuss the transformations and try to identify which parametric transformations you think they suggest. (6 pts.)
d) Use the parametric versions of the transformations suggested by AVAS in an OLS model. Compare these results to ACE and OLS via the R². Examine residual plots and critique this model. (6 pts.)

PROBLEM 2 – BOSTON HOUSING DATA

The Boston Housing data set was the basis for a 1978 paper by Harrison and Rubinfeld, which discussed approaches for using housing market data to estimate the willingness to pay for clean air. The authors employed a hedonic price model, based on the premise that the price of a property is determined by structural attributes (such as size, age, condition) as well as neighborhood attributes (such as crime rate, accessibility, environmental factors). This type of approach is often used to quantify the effects of environmental factors on the price of a property.

Data were gathered for 506 census tracts in the Boston Standard Metropolitan Statistical Area (SMSA) in 1970, collected from a number of sources including the 1970 US Census and the Boston Metropolitan Area Planning Committee. The variables used to develop the Harrison-Rubinfeld housing value equation are listed in the table below (Boston.working).

Variables Used in the Harrison-Rubinfeld Housing Value Equation

VARIABLE / TYPE / DEFINITION / SOURCE
CMEDV / Dependent Variable / Median value of homes in thousands of dollars / 1970 U.S. Census
RM / Structural / Average number of rooms / 1970 U.S. Census
AGE / Structural / % of units built prior to 1940 / 1970 U.S. Census
B / Neighborhood / Black % of population / 1970 U.S. Census
LSTAT / Neighborhood / % of population that is lower socioeconomic status / 1970 U.S. Census
CRIM / Neighborhood / Crime rate / FBI (1970)
ZN / Neighborhood / % of residential land zoned for lots greater than 25,000 sq. ft. / Metro Area Planning Commission (1972)
INDUS / Neighborhood / % of non-retail business acres (proxy for industry) / Mass. Dept. of Commerce & Development (1965)
TAX / Neighborhood / Property tax rate / Mass. Taxpayers Foundation (1970)
PTRATIO / Neighborhood / Pupil-teacher ratio / Mass. Dept. of Ed. (1971-72)
CHAS / Neighborhood / Dummy variable indicating proximity to Charles River (1 = on river) / 1970 U.S. Census Tract maps
DIS / Accessibility / Weighted distances to major employment centers in area / Schnare dissertation (Unpublished, 1973)
RAD / Accessibility / Index of accessibility to radial highways / MIT Boston Project
NOX / Air Pollution / Nitrogen oxide concentrations (pphm) / TASSIM

REFERENCE
Harrison, D., and Rubinfeld, D. L., "Hedonic Housing Prices and the Demand for Clean Air," Journal of Environmental Economics and Management, 5 (1978), 81-102.

a) Use projection pursuit regression to develop a model for CMEDV using the available predictors. Choose an optimal number of terms accordingly. (5 pts.)

b) Show the results of a PPplot of your final model from part (a). Use the bar = T option to obtain loadings for the predictors on each term in the final model. Which variables are most important on each term, and in the overall model in general? Your predictor matrix should be scaled so that these loadings can be fairly compared. (5 pts.)

c) Cross-validate the final model using MCCV and compare to MCCV results for the following OLS model, which the authors of the paper above used. (5 pts.)

$$
\log(\mathrm{CMEDV}) = \beta_0 + \beta_1 \mathrm{RM}^2 + \beta_2 \mathrm{AGE} + \beta_3 \log(\mathrm{DIS}) + \beta_4 \log(\mathrm{RAD}) + \beta_5 \mathrm{TAX} + \beta_6 \mathrm{PTRATIO} + \beta_7 (\mathrm{B} - .63)^2 + \beta_8 \log(\mathrm{LSTAT}) + \beta_9 \mathrm{CRIM} + \beta_{10} \mathrm{ZN} + \beta_{11} \mathrm{INDUS} + \beta_{12} \mathrm{CHAS} + \beta_{13} \mathrm{NOX}^2 + \varepsilon
$$

d) What do you think of the model used by the authors in terms of regression assumptions and diagnostics? Explain. (4 pts.)
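
The MCCV comparison asked for in Problem 2(c) can be sketched in R as follows. This is only an illustrative outline, not a prescribed solution. The helper mccv_rmsep, its arguments (B, p_test, back_transform), and the simulated data frame sim are hypothetical names introduced here; sim stands in for Boston.working so that the code runs on its own. The number of splits and the held-out fraction are arbitrary choices. Because the OLS model has a logged response, its predictions are back-transformed with exp() (ignoring retransformation bias) so that both models are scored on the original scale.

```r
# Monte Carlo cross-validation (MCCV): repeatedly split the data at random,
# fit on the training part, and record the root mean squared error of
# prediction (RMSEP) on the held-out part.
# NOTE: mccv_rmsep and all of its arguments are hypothetical helper names.
mccv_rmsep <- function(data, response, fit_fun, B = 100, p_test = 1/3,
                       back_transform = identity) {
  n <- nrow(data)
  n_test <- floor(p_test * n)
  rmsep <- numeric(B)
  for (b in seq_len(B)) {
    test_idx <- sample(n, n_test)                      # random held-out rows
    fit <- fit_fun(data[-test_idx, , drop = FALSE])    # fit on training rows
    pred <- back_transform(
      predict(fit, newdata = data[test_idx, , drop = FALSE])
    )
    obs <- data[[response]][test_idx]
    rmsep[b] <- sqrt(mean((obs - pred)^2))
  }
  rmsep
}

# Simulated stand-in data (an assumption; replace with the real data frame).
set.seed(425)
n <- 300
sim <- data.frame(x1 = runif(n), x2 = runif(n), x3 = runif(n))
sim$y <- exp(1 + sin(2 * pi * sim$x1) + 2 * sim$x2^2 + rnorm(n, sd = 0.2))

# Projection pursuit regression (stats::ppr) versus OLS on a logged response.
fit_ppr <- function(d) ppr(y ~ x1 + x2 + x3, data = d, nterms = 2, max.terms = 4)
fit_ols <- function(d) lm(log(y) ~ x1 + x2 + x3, data = d)

rmsep_ppr <- mccv_rmsep(sim, "y", fit_ppr, B = 50)
rmsep_ols <- mccv_rmsep(sim, "y", fit_ols, B = 50, back_transform = exp)

# Average RMSEP over the random splits for each model.
print(c(ppr = mean(rmsep_ppr), ols = mean(rmsep_ols)))
```

To apply the sketch to the assignment, the two fitting functions would be rewritten in terms of CMEDV and the predictors in Boston.working, with the number of PPR terms taken from part (a).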