Stat 512-1 Project Dr. Levine Data set 1 & Task

Stat 512-1
Project
Dr. Levine
Due: Mon, Dec 14th
Data set 1 & Task
Serum prostate-specific antigen (PSA) was determined in 97 men with advanced prostate
cancer. PSA is a well-known screening test for prostate cancer; the goal is to examine the
correlation between the level of PSA and a number of clinical measures for men who were
about to undergo radical prostatectomy. The measures are cancer volume, prostate
weight, patient age, the amount of benign prostatic hyperplasia, seminal vesicle invasion,
capsular penetration, and Gleason score. Your goal is to develop a multiple regression
model to analyze how strongly the measures given above correlate with PSA.
There are 97 men (subjects) in this study. You should select a random sample of 65 men
to be used in model-building. Check p. 1353 in KNNL for the detailed description of the
dataset.
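A minimal sketch of one way to draw the random model-building sample, written in Python purely for illustration (the course itself appears to use SAS). The function name, the 1-based subject IDs, and the seed value are assumptions, not part of the assignment; fixing a seed simply makes the split reproducible so that the same 32 men are always held out.

```python
import random

def split_subjects(n_subjects, n_train, seed):
    """Return (train_ids, test_ids) as sorted lists of 1-based subject IDs."""
    rng = random.Random(seed)                  # fixed seed: reproducible split
    ids = list(range(1, n_subjects + 1))
    train = sorted(rng.sample(ids, n_train))   # sampling without replacement
    train_set = set(train)
    test = [i for i in ids if i not in train_set]
    return train, test

# 97 men in total, 65 for model building, the remaining 32 for prediction.
# The seed value 512 is an arbitrary, hypothetical choice.
train_ids, test_ids = split_subjects(n_subjects=97, n_train=65, seed=512)
print(len(train_ids), len(test_ids))  # prints "65 32"
```

Recording the seed (or the list of held-out IDs) in the report lets a reader reproduce exactly the same split.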
There are two objectives in this project:
1. Come up with the best model for PSA using the best (in your opinion) subset of
the predictor variables.
2. Provide PSA predictions for the remaining 32 subjects and compare them to the true
values.
Note: If you prefer, the model used for prediction can be different from the model you
use to explain the data.
Data set 2 & Task
The city tax assessor tried to predict residential home sales prices in a Midwestern city as
a function of various characteristics of the home and surrounding property. The possible
predictor variables are finished square feet, number of bedrooms, number of bathrooms,
air conditioning (its presence or absence), garage size, presence/absence of the swimming
pool, year the house was built in, index of construction quality, are finished, style
indicator, lot size and adjacency to the highway. The details can be found on pp. 1353-1354
in KNNL. Your goal is to develop a multiple regression model to analyze how
strongly the measures given above correlate with the sales price.
Select a random sample of 300 observations to be used for model building. Use the
remainder of the observations for prediction validation. Pursue the same objectives as
for data set 1; the note about using possibly different models for explanation and
prediction applies here as well.
Report & test set prediction
The written report should be 8 pages or less. It should include the following sections:
1. Describe data. Describe the variables and report any interesting patterns you
found about the data set.
2. Model. Explain the statistical model you used for the data analysis and why.
Explain the model you use for prediction (if it is different from the previous
model).
3. Results. Describe the results of your analysis. You can include short tables and
graphical displays but DO NOT include unprocessed SAS output.
4. Conclusions.
5. Submit the prediction results on the test set. Your file format: each line contains a
numerical value which is the predicted PSA, in the same order as the test set file;
next to it should be the true value of PSA.
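The file format described in item 5 could be produced along the following lines. This is a sketch in Python under stated assumptions: the whitespace separator, the three-decimal formatting, the function names, and the example numbers are all hypothetical, since the assignment only says that the true value goes "next to" the prediction. The mean squared prediction error shown here is one common way to summarize the comparison with the true values; the assignment does not prescribe a particular measure.

```python
import io

def write_predictions(out, predicted, actual):
    """Write one 'predicted actual' pair per line, in test-set order."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    for p, a in zip(predicted, actual):
        out.write(f"{p:.3f} {a:.3f}\n")

def mean_squared_prediction_error(predicted, actual):
    """Average squared difference between true and predicted values."""
    return sum((a - p) ** 2 for p, a in zip(predicted, actual)) / len(actual)

# Made-up numbers for illustration only; they are not from the PSA data.
predicted = [10.0, 4.5, 21.0]
actual = [12.0, 4.0, 20.0]

buf = io.StringIO()            # stands in for an open output file
write_predictions(buf, predicted, actual)
print(buf.getvalue(), end="")
print(mean_squared_prediction_error(predicted, actual))  # prints 1.75
```

To write a real file, replace the `io.StringIO()` buffer with `open("predictions.txt", "w")` (a hypothetical file name).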
Suggestions for getting started
1. Plan your analysis.
2. Study the data set. Do the first set of analyses including basic descriptive statistics
with plots and charts as appropriate.
3. Run some initial models; discuss the results and refine the analysis. Do model
selection.
4. Check model assumptions; take remedial actions as necessary. Re-run the models.
5. Interpret the results.
6. Make predictions for test set.
7. Prepare the report.
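The fit, check, and predict steps above can be sketched as follows. This is an illustrative Python example and not the required tool (the course appears to use SAS, where a regression procedure would do this work). It fits an ordinary least-squares model by solving the normal equations, computes training residuals for assumption checks, and predicts held-out cases. The function names and the tiny data set are made up; the toy response is constructed as exactly 1 + 2*x1 + 3*x2, so the fit should recover those coefficients up to rounding.

```python
def fit_ols(X, y):
    """Least-squares coefficients [b0, b1, ...] via the normal equations."""
    rows = [[1.0] + list(x) for x in X]          # prepend intercept column
    p = len(rows[0])
    # Augmented matrix [X'X | X'y]
    A = [[sum(r[i] * r[j] for r in rows) for j in range(p)]
         + [sum(r[i] * yi for r, yi in zip(rows, y))]
         for i in range(p)]
    # Gauss-Jordan elimination with partial pivoting
    for col in range(p):
        pivot = max(range(col, p), key=lambda r: abs(A[r][col]))
        if abs(A[pivot][col]) < 1e-12:
            raise ValueError("X'X is singular; predictors are collinear")
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(p):
            if r != col:
                factor = A[r][col] / A[col][col]
                A[r] = [a - factor * b for a, b in zip(A[r], A[col])]
    return [A[i][p] / A[i][i] for i in range(p)]

def predict(coefs, X):
    """Fitted or predicted values for each row of predictors in X."""
    return [coefs[0] + sum(b * v for b, v in zip(coefs[1:], x)) for x in X]

# Toy training data (hypothetical): y = 1 + 2*x1 + 3*x2 with no noise.
X_train = [(0, 0), (1, 0), (0, 1), (1, 1), (2, 1)]
y_train = [1.0, 3.0, 4.0, 6.0, 8.0]
X_test = [(3, 2)]

coefs = fit_ols(X_train, y_train)
residuals = [yi - fi for yi, fi in zip(y_train, predict(coefs, X_train))]
test_predictions = predict(coefs, X_test)
print([round(b, 6) for b in coefs])
print([round(v, 6) for v in test_predictions])
```

With real data, the residuals would be plotted against fitted values and each predictor to check the model assumptions before any predictions are reported.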