Stat 512-1 Project
Dr. Levine
Due: Mon, Dec 14th

Data set 1 & Task

Serum prostate-specific antigen (PSA) was determined in 97 men with advanced prostate cancer. PSA is a well-known screening test for prostate cancer; the goal is to examine the correlation between the level of PSA and a number of clinical measures for men who were about to undergo radical prostatectomy. The measures are cancer volume, prostate weight, patient age, the amount of benign prostatic hyperplasia, seminal vesicle invasion, capsular penetration, and Gleason score. Your goal is to develop a multiple regression model to analyze how strongly the measures given above correlate with PSA. There are 97 men (subjects) in this study. You should select a random sample of 65 men to be used in model building. Check p. 1353 in KNNL for the detailed description of the data set.

There are two objectives in this project:
1. Come up with the best model for PSA using the best (in your opinion) subset of the predictor variables.
2. Provide PSA predictions for the remaining 32 subjects and compare them to the true values.

Note: If you prefer, the model used for prediction can be different from the model you use to explain the data.

Data set 2 & Task

The city tax assessor tried to predict residential home sales prices in a Midwestern city as a function of various characteristics of the home and surrounding property. The possible predictor variables are finished square feet, number of bedrooms, number of bathrooms, air conditioning (its presence or absence), garage size, presence or absence of a swimming pool, year the house was built, index of construction quality, are finished, style indicator, lot size, and adjacency to the highway. The details can be found on pp. 1353-1354 in KNNL. Your goal is to develop a multiple regression model to analyze how strongly the measures given above correlate with the sales price. Select a random sample of 300 observations to be used for model building.
Use the remaining observations for prediction validation. The same two objectives as above apply, as does the note about possibly using two different models.

Report & test set prediction

The written report should be no more than 8 pages. It should include the following sections:
1. Describe data. Describe the variables and report any interesting patterns you found in the data set.
2. Model. Explain the statistical model you used for the data analysis and why. Explain the model you used for prediction (if it is different from the previous model).
3. Results. Describe the results of your analysis. You can include short tables and graphical displays but DO NOT include unprocessed SAS output.
4. Conclusions.
5. Submit the prediction results on the test set. Your file format: each line contains a numerical value, which is the predicted PSA, in the same order as the test set file; next to it should be the true value of PSA.

Suggestions for getting started

1. Plan your analysis.
2. Study the data set.
3. Do the first set of analyses, including basic descriptive statistics with plots and charts as appropriate.
4. Run some initial models; discuss the results and refine the analysis.
5. Do model selection.
6. Check model assumptions; take remedial actions as necessary.
7. Re-run the models.
8. Interpret the results.
9. Make predictions for the test set.
10. Prepare the report.
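
The sample selection and the test-set prediction file described above can be sketched in a few lines. This is only an illustration in Python, not a required implementation, and the same steps can be done in SAS. The function names, the seed value, the three-decimal formatting, and the single-space separator are all assumptions made for this example; the handout only asks that each line hold the predicted value followed by the true value. The mean squared prediction error is shown as one common way to compare predictions with true values; the handout does not prescribe a specific measure.

```python
import random


def split_subjects(n_total, n_train, seed):
    """Randomly pick n_train of the subjects 1..n_total for model building.

    Returns (train_ids, test_ids), both sorted. A fixed seed (an assumption
    of this sketch) makes the split reproducible.
    """
    rng = random.Random(seed)
    all_ids = range(1, n_total + 1)
    train_ids = sorted(rng.sample(all_ids, n_train))
    chosen = set(train_ids)
    test_ids = [i for i in all_ids if i not in chosen]
    return train_ids, test_ids


def mspr(predicted, observed):
    """Mean squared prediction error over the test set."""
    if len(predicted) != len(observed):
        raise ValueError("predicted and observed must have the same length")
    squared_errors = [(y - y_hat) ** 2 for y_hat, y in zip(predicted, observed)]
    return sum(squared_errors) / len(squared_errors)


def format_prediction_lines(predicted, observed):
    """One line per test subject: predicted value, then true value.

    The separator and number of decimals are assumptions of this sketch.
    """
    return [f"{y_hat:.3f} {y:.3f}" for y_hat, y in zip(predicted, observed)]


# Data set 1: 97 subjects, 65 for model building, the remaining 32 for testing.
train_ids, test_ids = split_subjects(n_total=97, n_train=65, seed=512)
print(len(train_ids), len(test_ids))
```

For data set 2, the same helper would be called with the total number of observations in that data set and `n_train=300`. The lines returned by `format_prediction_lines` can be written to the submission file in test-set order.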