Statistics 112: Final Project - Wharton Statistics Department

For the final project, you may work alone or with a partner. The purpose of the project
is for you to gain experience in applying the methods taught in the class to a real data set
of interest to you.
Due Dates
Now – November 16th: Set up an appointment with me or talk with me during office hours
about your ideas for the final project.
Thursday, November 16th (beginning of class): Hand in to me a paragraph describing the
data set you plan to analyze and the questions of interest.
Tuesday, December 12th (5 p.m.): Hand in to me annotated JMP output on which your
report will be based, along with a few paragraphs describing your results. If you have any
questions about what you should do in your data analysis, write them down for me and I will
discuss them with you. I will look this over and have my comments available for you by
Wednesday afternoon (December 13th). If you give me your draft earlier, I will return it to you earlier.
Tuesday, December 19th (5 p.m.): Hand in to me your final report. Note that the final
homework assignment will also be due at this time.
I will be available throughout the reading and exam period to discuss your projects with
you.
Project Description
The standard project is to use multiple regression analysis to analyze a data set that is of
interest to you. If you have a strong interest in analysis of variance (the topic we will
cover after multiple regression), your project can consist of using analysis of variance to
analyze a data set.
The final report for the project should be a 5-10 page paper (not including
additional JMP output) that describes the questions of interest, how you used your data
set to address them (with details on the steps in your analysis), your findings, and the
limitations of your study. Specifically, your report should contain the following:
1. Abstract: A one paragraph summary of what you set out to learn, and what you
ended up finding. It should summarize the entire report.
2. Introduction: A discussion of what questions you are interested in.
3. Data Set: Describe details about how the data set was collected and the variables
in the data set.
4. Analysis: Describe how you used multiple regression to analyze the data set.
Specifically, you should discuss how you carried out the steps of the analysis
discussed in class, i.e., exploring the data to find a reasonable initial model,
checking the model, and changing the model based on what the checks showed.
5. Results: Provide inferences about the questions of interest and discuss them.
6. Limitations of study and conclusion: Describe any limitations of your study and
how they might be overcome in future research and provide brief conclusions
about the results of your study.
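
As a rough illustration of the fit-then-check cycle described in item 4, here is a minimal sketch in Python. The class uses JMP, so this is an assumption made for illustration only and not the required tool. The function names and the data are hypothetical; the data are constructed so that the response is exactly 2 + 3*x1 - x2, which makes it easy to see whether the fit recovers the coefficients. With real data, the residuals would be plotted and examined rather than just summarized.

```python
# Minimal multiple-regression sketch using only the standard library.
# All names and data here are hypothetical, made up for illustration.

def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Augmented matrix, so row operations apply to both sides at once.
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular system (collinear predictors?)")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back substitution.
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def fit_multiple_regression(predictors, response):
    """Least-squares fit of response on predictors plus an intercept.

    Returns (coefficients, residuals, r_squared); coefficients[0] is
    the intercept.
    """
    rows = [[1.0] + list(p) for p in predictors]
    k = len(rows[0])
    # Normal equations: (X'X) beta = X'y.
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * y for r, y in zip(rows, response)) for i in range(k)]
    coefs = solve_linear_system(xtx, xty)
    fitted = [sum(c * v for c, v in zip(coefs, r)) for r in rows]
    residuals = [y - f for y, f in zip(response, fitted)]
    mean_y = sum(response) / len(response)
    ss_total = sum((y - mean_y) ** 2 for y in response)
    ss_resid = sum(e ** 2 for e in residuals)
    return coefs, residuals, 1.0 - ss_resid / ss_total


# Hypothetical data: response = 2 + 3*x1 - 1*x2 with no noise.
predictors = [(1, 2), (2, 1), (3, 5), (4, 3), (5, 8), (6, 4)]
response = [2 + 3 * x1 - 1 * x2 for x1, x2 in predictors]

coefs, residuals, r_squared = fit_multiple_regression(predictors, response)
print("coefficients:", [round(c, 3) for c in coefs])
print("R^2:", round(r_squared, 3))
print("largest |residual|:", max(abs(e) for e in residuals))
```

Large or patterned residuals would be a reason to revisit the model (for example, by transforming a variable or adding a predictor) and then refit, which is the checking-and-changing step the report should document.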
Data Sets
The project will be most rewarding if you choose questions and a data set that genuinely
interest you.
Examples of questions of interest are as follows:
What properties of a baseball team best predict its success over the course of a season?
What properties of a college are related to its rank in the U.S. News and World Report
rankings?
Is the unemployment rate related to economic measures such as interest rates, stock
returns, and the inflation rate?
What properties of a state predict the proportion of its vote that George Bush (or John
Kerry) received?
You will need a data set to explore your question of interest. I will be happy to help you
with suggestions. The data set should ideally contain at least 30-50 observations (e.g.,
companies, people, countries, etc., as the case may be), and at least 4 variables (pieces of
information about the observations; e.g., stock price, revenues, profits, salaries, gender,
etc.), although if that is not possible, exceptions will be allowed (subject to my approval).
One of the variables should be a numerical variable that would be of
interest to model or forecast (e.g., for the examples above, team winning
percentage, U.S. News and World Report rank, unemployment rate, and proportion of
vote received, respectively; stock price change or gas mileage would be other possibilities).
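
The guidelines above can be turned into a quick self-check before you commit to a data set. The following Python sketch is only an illustration: the function name, the made-up team data, and the choice of 30 (the lower end of the 30-50 range) as the threshold are all assumptions, and a warning from it does not rule out a data set, since exceptions are possible with approval.

```python
def is_number(value):
    """True for ints and floats, but not for booleans."""
    return isinstance(value, (int, float)) and not isinstance(value, bool)


def check_data_set(rows, response, min_observations=30, min_variables=4):
    """Return a list of warnings; an empty list means the guidelines are met.

    rows is a list of dicts, one per observation, mapping variable
    name to value. response is the name of the variable to model.
    """
    warnings = []
    if len(rows) < min_observations:
        warnings.append(
            f"only {len(rows)} observations (guideline: at least {min_observations})"
        )
    variables = set().union(*(row.keys() for row in rows))
    if len(variables) < min_variables:
        warnings.append(
            f"only {len(variables)} variables (guideline: at least {min_variables})"
        )
    values = [row.get(response) for row in rows]
    if not values or not all(is_number(v) for v in values):
        warnings.append(f"response variable '{response}' is not numerical in every row")
    return warnings


# Hypothetical data set: 30 made-up teams with 4 variables each.
teams = [
    {
        "team": f"Team {i}",
        "win_pct": 0.400 + 0.005 * i,
        "runs_scored": 650 + 5 * i,
        "era": 4.5 - 0.02 * i,
    }
    for i in range(30)
]

print(check_data_set(teams, "win_pct"))    # guidelines met
print(check_data_set(teams[:10], "team"))  # too few rows, non-numerical response
```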
I will be happy to discuss ideas with you. Here are a few potential sources of ideas and
data:
The Data and Story Library (DASL) has many interesting data sets:
http://lib.stat.cmu.edu/DASL/
The following web site from a course at Duke has several interesting data sets:
http://www.isds.duke.edu/courses/Spring02/sta114/
Samples
A good sample of what I’m expecting from the projects and reports is available at the
following web site:
http://pages.stern.nyu.edu/~jsimonof/classes/1305/projdoc/
Note that these reports are for a class taught at New York University by Jeffrey Simonoff,
so some of the methods used in the regression analyses may be unfamiliar to you.