Assignment #4

STAT 425 – Modern Methods of Data Analysis (41 pts.)
Assignment 4 – ACE/AVAS and Projection Pursuit Regression
PROBLEM 1 – PREDICTING THE AGE OF AN ABALONE
The age of an abalone is determined by cutting the shell through the cone, staining it, and
counting the number of rings through a microscope, a boring and time-consuming
task. Other measurements, which are easier to obtain, are often used to predict the
age. Further information, such as weather patterns and location (hence food
availability), may be required to solve the problem.
Attribute Information:
Each attribute is listed below with its name, data type, measurement unit, and a brief
description. The number of rings is the value to predict. These data are contained in the
data frame Abalone.
Name / Variable in Abalone / Data Type / Measurement Unit / Description
Length / length / continuous / mm / longest shell measurement
Diameter / diam / continuous / mm / perpendicular to length
Height / height / continuous / mm / with meat in shell
Whole weight / whole.weight / continuous / grams / whole abalone
Shucked weight / shucked.weight / continuous / grams / weight of meat
Viscera weight / visc.weight / continuous / grams / gut weight (after bleeding)
Shell weight / shell.weight / continuous / grams / after being dried
Rings / Rings / integer / -- / adding 1.5 gives the age in years
a) Use ACE to find “optimal” transformations for the predictors and for the response. Plot
the results. Discuss the transformations and try to identify what parametric
transformations you think they suggest. (5 pts.)
b) Use the parametric versions of the transformations suggested by ACE in an OLS model.
Compare these results to ACE via the R². Examine residual plots and critique this
model. (5 pts.)
c) Use AVAS to find “optimal” transformations for the predictors and the response. Plot
the results. How do the transformations compare to those from ACE? Discuss the
transformations and try to identify what parametric transformations you think they
suggest. (6 pts.)
d) Use the parametric versions of the transformations suggested by AVAS in an OLS model.
Compare these results to ACE and OLS via the R². Examine residual plots and critique
this model. (6 pts.)
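Parts (b) and (d) compare fits through R². As a rough illustration of that comparison, the Python sketch below computes R² for a straight-line fit to a raw response and to its log transform. It assumes a single predictor and simulated data (not the Abalone data); the function names and the simulated relationship are hypothetical choices made only for this example.

```python
import math
import random


def ols_line(x, y):
    """Return (intercept, slope) of the least-squares line for one predictor."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def r_squared(x, y):
    """R^2 = 1 - SSE/SST for the least-squares line of y on x."""
    intercept, slope = ols_line(x, y)
    mean_y = sum(y) / len(y)
    sse = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    sst = sum((yi - mean_y) ** 2 for yi in y)
    return 1.0 - sse / sst


# Simulated data (an assumption for illustration): the response is linear
# in x on the log scale, with multiplicative noise.
rng = random.Random(425)
x = [rng.uniform(0.0, 3.0) for _ in range(500)]
y = [math.exp(1.0 + 1.5 * xi + rng.gauss(0.0, 0.2)) for xi in x]

r2_raw = r_squared(x, y)                          # response left as is
r2_log = r_squared(x, [math.log(yi) for yi in y])  # response log-transformed

print("R^2, raw response:", round(r2_raw, 3))
print("R^2, log response:", round(r2_log, 3))
```

Each R² here is measured on the scale of the response used in that fit, so the two values describe different quantities; that is worth keeping in mind when critiquing the models.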
PROBLEM 2 – BOSTON HOUSING DATA
The Boston Housing data set was the basis for a 1978 paper by Harrison and Rubinfeld,
which discussed approaches for using housing market data to estimate the willingness
to pay for clean air. The authors employed a hedonic price model, based on the premise
that the price of the property is determined by structural attributes (such as size, age,
condition) as well as neighborhood attributes (such as crime rate, accessibility,
environmental factors). This type of approach is often used to quantify the effects of
environmental factors that affect the price of a property.
Data were gathered for 506 census tracts in the Boston Standard Metropolitan Statistical
Area (SMSA) in 1970, collected from a number of sources including the 1970 US Census
and the Boston Metropolitan Area Planning Committee. The variables used to develop
the Harrison-Rubinfeld housing value equation are listed in the table below.
(Boston.working)
Variables Used in the Harrison-Rubinfeld Housing Value Equation
Type / Variable / Definition / Source
Dependent Variable / CMEDV / Median value of homes in thousands of dollars / 1970 U.S. Census
Structural / RM / Average number of rooms / 1970 U.S. Census
Structural / AGE / % of units built prior to 1940 / 1970 U.S. Census
Neighborhood / B / Black % of population / 1970 U.S. Census
Neighborhood / LSTAT / % of population that is lower socioeconomic status / 1970 U.S. Census
Neighborhood / CRIM / Crime rate / FBI (1970)
Neighborhood / ZN / % of residential land zoned for lots greater than 25,000 sq. ft. / Metro Area Planning Commission (1972)
Neighborhood / INDUS / % of non-retail business acres (proxy for industry) / Mass. Dept. of Commerce & Development (1965)
Neighborhood / TAX / Property tax rate / Mass. Taxpayers Foundation (1970)
Neighborhood / PTRATIO / Pupil-Teacher ratio / Mass. Dept. of Ed ('71-'72)
Neighborhood / CHAS / Dummy variable indicating proximity to Charles River (1 = on river) / 1970 U.S. Census Tract maps
Accessibility / DIS / Weighted distances to major employment centers in area / Schnare dissertation (Unpublished, 1973)
Accessibility / RAD / Index of accessibility to radial highways / MIT Boston Project
Air Pollution / NOX / Nitrogen oxide concentrations (pphm) / TASSIM
REFERENCE
Harrison, D., and Rubinfeld, D. L., “Hedonic Housing Prices and the Demand for Clean
Air,” Journal of Environmental Economics and Management, 5 (1978), 81-102.
a) Use projection pursuit regression to develop a model for CMEDV using the
available predictors. Choose an optimal number of terms accordingly. (5 pts.)
b) Show the results of a PPplot of your final model from part (a).
Use the bar = T option to obtain loadings for the predictors on each term in
the final model. Which variables are most important on each term and in the
overall model? Your predictor matrix should be scaled so that these
loadings can be fairly compared. (5 pts.)
c) Cross-validate the final model using MCCV and compare to MCCV results for
the following OLS model that the authors of the paper above used. (5 pts.)
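$$
\begin{aligned}
\log(\mathrm{CMEDV}) = {} & \beta_0 + \beta_1\,\mathrm{RM}^2 + \beta_2\,\mathrm{AGE} + \beta_3 \log(\mathrm{DIS}) + \beta_4 \log(\mathrm{RAD}) + \beta_5\,\mathrm{TAX} \\
& + \beta_6\,\mathrm{PTRATIO} + \beta_7 (\mathrm{B} - 0.63)^2 + \beta_8 \log(\mathrm{LSTAT}) + \beta_9\,\mathrm{CRIM} + \beta_{10}\,\mathrm{ZN} + \beta_{11}\,\mathrm{INDUS} \\
& + \beta_{12}\,\mathrm{CHAS} + \beta_{13}\,\mathrm{NOX}^2 + \epsilon
\end{aligned}
$$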
d) What do you think of the model used by the authors in terms of regression
assumptions and diagnostics? Explain. (4 pts.)
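For part (c), Monte Carlo cross-validation (MCCV) repeatedly splits the data at random into a training set and a held-out set, fits the model on the training set, and records the prediction error on the held-out set. The Python sketch below shows the idea under stated assumptions: a one-predictor least-squares line on simulated data stands in for the PPR and OLS models, and the function names, the 20% hold-out fraction, and the 200 replications are illustrative choices, not values taken from the assignment.

```python
import math
import random


def fit_line(x, y):
    """Return (intercept, slope) of the least-squares line for one predictor."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def mccv_rmse(x, y, n_splits=200, test_frac=0.2, seed=1):
    """Average held-out RMSE over repeated random train/test splits."""
    rng = random.Random(seed)
    n = len(x)
    n_test = max(1, int(round(test_frac * n)))
    indices = list(range(n))
    rmses = []
    for _ in range(n_splits):
        rng.shuffle(indices)                      # new random split each time
        test, train = indices[:n_test], indices[n_test:]
        intercept, slope = fit_line([x[i] for i in train],
                                    [y[i] for i in train])
        sq_err = [(y[i] - (intercept + slope * x[i])) ** 2 for i in test]
        rmses.append(math.sqrt(sum(sq_err) / len(sq_err)))
    return sum(rmses) / len(rmses)


# Simulated data (an assumption for illustration), noise s.d. = 1.
data_rng = random.Random(425)
x = [data_rng.uniform(0.0, 10.0) for _ in range(300)]
y = [2.0 + 0.5 * xi + data_rng.gauss(0.0, 1.0) for xi in x]

cv_rmse = mccv_rmse(x, y)
print("MCCV estimate of RMSE:", round(cv_rmse, 3))
```

When one model predicts log(CMEDV) and the other predicts CMEDV, the held-out errors need to be put on a common scale (for example, by back-transforming the predictions) before the two MCCV results are compared.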