Statistics 112: Final Project - Wharton Statistics Department

Statistics 112: Final Project
For the final project, you should work alone or with a partner; a group of three
will be allowed only in exceptional circumstances. The purpose of the
project is for you to gain experience in applying the methods taught in the class to a real
data set of interest to you.
Due Dates
Tuesday, November 23rd (beginning of class): Hand in to me a paragraph describing the
data set you plan to analyze and the questions of interest.
Monday, December 13th (5 p.m.): Hand in to me the annotated JMP output on which your
report will be based, along with a few paragraphs describing your results. If you have any
questions about what you should do in your data analysis, write them down for me and I will
discuss them with you. I will look this over and have my comments available for you by
Tuesday afternoon. If you give me your draft earlier, I will return it to you earlier.
Tuesday, December 21st (Noon): Hand in to me your final report. Note that the final
homework assignment will also be due at this time.
I will be available throughout the reading and exam period to discuss your projects with
you.
Project Description
The standard project is to use multiple regression analysis to analyze a data set that is of
interest to you. If you have a strong interest in analysis of variance (the topic we will
cover after multiple regression), your project can consist of using analysis of variance to
analyze a data set.
The final report for the project should be a 5-10 page paper that describes the questions of
interest, how you used your data set to address them (with details on the steps in your
analysis), your findings, and the limitations of your study. Specifically, your report
should contain the following:
1. Abstract: A one paragraph summary of what you set out to learn, and what you
ended up finding. It should summarize the entire report.
2. Introduction: A discussion of what questions you are interested in.
3. Data Set: Describe details about how the data set was collected and the variables
in the data set.
4. Analysis: Describe how you used multiple regression to analyze the data set.
Specifically, you should discuss how you carried out the steps of analysis
covered in class: exploring the data to find a reasonable initial model,
checking the model, and revising the model based on those checks.
5. Results: Provide inferences about the questions of interest and discuss them.
6. Limitations of study and conclusion: Describe any limitations of your study and
how they might be overcome in future research and provide brief conclusions
about the results of your study.
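
The fit-and-check steps named in the Analysis item can be sketched in code. The class uses JMP, so the Python below is only an illustration of the idea, not a required tool or the method taught in class. The function names (fit_multiple_regression, predict, r_squared) are hypothetical, and the car data are invented for the example, not real measurements.

```python
def solve_linear_system(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Augmented matrix, so each row operation applies to both sides.
    m = [list(row) + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("predictors are collinear; the fit is not unique")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back substitution on the upper-triangular system.
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


def fit_multiple_regression(predictors, response):
    """Least-squares fit with an intercept.

    Returns [intercept, slope_1, ..., slope_k].
    """
    rows = [[1.0] + list(p) for p in predictors]  # prepend intercept column
    k = len(rows[0])
    # Normal equations: (X'X) beta = X'y
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * y for r, y in zip(rows, response)) for i in range(k)]
    return solve_linear_system(xtx, xty)


def predict(coefs, predictors):
    """Fitted values for each observation."""
    return [coefs[0] + sum(b * x for b, x in zip(coefs[1:], p))
            for p in predictors]


def r_squared(coefs, predictors, response):
    """Proportion of the variation in the response explained by the fit."""
    fitted = predict(coefs, predictors)
    mean_y = sum(response) / len(response)
    ss_res = sum((y - f) ** 2 for y, f in zip(response, fitted))
    ss_tot = sum((y - mean_y) ** 2 for y in response)
    return 1.0 - ss_res / ss_tot


# Invented toy data (NOT real measurements): (weight in 1000 lb, horsepower).
cars = [(2.0, 90), (2.5, 110), (3.0, 150), (3.5, 140),
        (4.0, 200), (2.2, 120), (3.2, 100), (3.8, 180)]
mpg = [34, 30, 24, 23, 17, 30, 27, 19]

coefs = fit_multiple_regression(cars, mpg)
# Residuals are the starting point for checking the model.
residuals = [y - f for y, f in zip(mpg, predict(coefs, cars))]
print("coefficients:", coefs)
print("R^2:", r_squared(coefs, cars, mpg))
```

Plotting the residuals against the fitted values and against each predictor is the usual next step; curvature or a fan shape in those plots would suggest revising the model, for example by transforming a variable.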
Data Sets
The project will be most rewarding if you choose questions and a data set that
genuinely interest you.
Examples of questions of interest are as follows: What properties of a baseball team best
predict its success over the course of a season? What properties of a college are related
to its rank in the U.S. News and World Report rankings? Is the gas mileage of an
automobile predictable from properties such as weight, horsepower, and so on? Is the
unemployment rate related to economic measures such as interest rates, stock returns, and
the inflation rate? What properties of a state predict the proportion of the vote that
George Bush (John Kerry) received in it? You will need a data set to explore your
question of interest. I will be happy to help you with suggestions. The data set should
ideally contain at least 30-50 observations (companies, people, or countries, as the
case may be) and at least 4 variables (pieces of information about the observations, such
as stock price, revenues, profits, salaries, or gender); if that is not possible,
exceptions will be allowed, subject to my approval. One of the variables should be a
numerical variable that would be of interest to model or forecast (for the examples
above: team winning percentage, U.S. News and World Report rank, gas mileage,
unemployment rate, and proportion of the vote received, respectively).
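
As a rough illustration, the size guidelines above can be checked before committing to a data set. This is a minimal sketch: the function name check_data_set and the list-of-dicts layout are assumptions made for the example, and the threshold of 30 is the lower end of the suggested 30-50 range. Because exceptions are allowed with approval, a reported problem is a prompt for discussion, not a rejection.

```python
from numbers import Real

MIN_OBSERVATIONS = 30  # lower end of the suggested 30-50 range
MIN_VARIABLES = 4


def check_data_set(rows, response):
    """Return a list of problems; an empty list means the guidelines are met.

    rows: list of dicts, one per observation, mapping variable name -> value.
    response: name of the variable to model or forecast.
    """
    problems = []
    if len(rows) < MIN_OBSERVATIONS:
        problems.append(
            f"only {len(rows)} observations; aim for at least {MIN_OBSERVATIONS}")
    # Collect every variable name that appears in any observation.
    variables = set().union(*(row.keys() for row in rows))
    if len(variables) < MIN_VARIABLES:
        problems.append(
            f"only {len(variables)} variables; aim for at least {MIN_VARIABLES}")
    # The response must be a number (not text or a boolean) in every row.
    values = [row.get(response) for row in rows]
    if not values or not all(
            isinstance(v, Real) and not isinstance(v, bool) for v in values):
        problems.append(
            f"response '{response}' is not numeric for every observation")
    return problems
```

For example, calling check_data_set(rows, "mpg") on a list of car records returns an empty list when the guidelines are met and a list of short messages otherwise.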
I will be happy to discuss ideas with you. Here are a few potential sources of ideas and
data:
The Data and Story Library (DASL) has many interesting data sets:
http://lib.stat.cmu.edu/DASL/
The following web site from a course at Duke has several interesting data sets:
http://www.isds.duke.edu/courses/Spring02/sta114/
I am handing out a list of web sites with interesting data sets.
Samples
A good sample of what I’m expecting from the projects and reports is contained at the
web site http://pages.stern.nyu.edu/~jsimonof/classes/1305/projdoc/ . Note that
these reports are for a class taught at New York University by Jeffrey Simonoff, so some
of the methods used in the regression analyses may be unfamiliar to you.