Applied Statistics
Final Project
1 Outline
From a mathematical point of view, statistics is simply a collection of analytical
tools used to gain insight into data. It is assumed that working statisticians have
a broad knowledge of these tools and understand their mathematical underpinnings.
The goal of my instruction up until this point has been to introduce you to such
tools and to develop the mathematical intuition you need to understand how they work.
Working with data is, however, frequently the easiest part of a statistician’s job. Since
statisticians almost always work as members of a much larger research team, a
statistician must be able both to work within a team and to convey findings
to others. The goal of this project is to practice these aspects of a statistician’s job.
1.1 Teamwork
This project is a “group” project in the sense that you will be working in small
groups of 3-4 students. You will be allowed to form groups on your own. I will intervene
in this process only if necessary. At the end of the project, each group member will
fill out a questionnaire evaluating their teammates’ performance within the project.
Part of your individual grade will be based on your group members’
assessment of the effort that you contributed to the overall completion of the
project.
1.2 Write-up and Oral Presentation
The final result of the project is a write-up and an oral presentation that address
your group’s findings related to the problems below.
1. Write-up: The write-up should include an introduction to each problem, a
hypotheses section (when appropriate), a results section, and an appendix for
each problem that includes any code you used to solve the problem.
The write-up should be clear, concise, well organized, and grammatically
correct. In my opinion, the most important part of technical writing is making
the problem you are addressing clear (and important sounding) and
giving a concise and insightful solution.
2. Oral Presentation: The project will also include an oral presentation in which
you present your results. Each member of the group is expected to contribute
to the oral presentation (i.e., to speak). The presentation should last 35-40
minutes and must be aided by an electronic slide show, for example
PowerPoint, Prezi, or Beamer.
1.3 Due Dates
Your write-up is due at the time of your presentation. Your presentation must
take place between April 22 and May 3. Presentation times will be offered
during our final exam period, as well as at other times within this window,
scheduled according to the availability of the groups.
1.4 Grading
You will be graded on both an individual and a group basis. Your final project grade
will be determined as follows.
1. Write-up: 50%. For each problem:
(a) 20% Grammar and Organization
(b) 50% Solution and Analysis
(c) 20% Organization of Code
(d) 10% Aesthetics (pictures, overall appearance, etc.)
2. Presentation: 35%
(a) 30% Appearance of slide show
(b) 40% Content and Organization
(c) 30% Delivery and preparedness
3. Teamwork Evaluation: 15%
2 Project Problems
Your team must address two of the four problems outlined below: one chosen from
the first two (2.1 and 2.2) and one chosen from the last two (2.3 and 2.4).
2.1 Comparing Tukey’s Method to Bonferroni’s Method
During the course, we discussed two methods of producing simultaneous confidence
intervals for the differences between the theoretical means of independent normal
populations: Tukey’s method, which is based on the studentized range
distribution, and Bonferroni’s method, which uses a conservative approximation. By
means of a simulation study, investigate and compare the effectiveness of these two
methods. Using a confidence level of 95%, construct confidence intervals for the
differences between the mean parameters from simulated normal data with different
means (and the same variance). By repeating the simulation many times, estimate
the rate at which these confidence intervals simultaneously contain their respective
parameters. This should be done for both methods. If I is the number of
populations and J is the number of observations sampled from each population, the
analysis should be done for I = 3, 5, and 10 and J = 10, 50, 100, and 200.
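As a starting point, the coverage estimate described above can be sketched as follows. This is a minimal illustration in Python using only the standard library, not a prescribed solution; the function name `simultaneous_coverage` and its parameters are hypothetical. It assumes equal group sizes and intervals of the form (difference of sample means) ± crit · s_p · sqrt(2/J), where s_p is the pooled standard deviation. The critical value `crit` is not computed here and must be supplied by the caller.

```python
import math
import random
from itertools import combinations


def simultaneous_coverage(means, sigma, n_per_group, crit, reps=2000, seed=0):
    """Estimate how often ALL pairwise intervals cover the true mean differences.

    Each interval is (xbar_a - xbar_b) +/- crit * s_p * sqrt(2 / J), where s_p is
    the pooled standard deviation and J = n_per_group. The critical value `crit`
    is an input (assumption: the caller obtains it from tables or software).
    """
    rng = random.Random(seed)
    n_groups = len(means)
    df = n_groups * (n_per_group - 1)  # error degrees of freedom, I(J - 1)
    hits = 0
    for _ in range(reps):
        # Simulate J normal observations from each of the I populations.
        samples = [[rng.gauss(mu, sigma) for _ in range(n_per_group)] for mu in means]
        xbars = [sum(s) / n_per_group for s in samples]
        sse = sum((x - xb) ** 2 for s, xb in zip(samples, xbars) for x in s)
        half_width = crit * math.sqrt(sse / df) * math.sqrt(2.0 / n_per_group)
        # The replication counts only if every pairwise interval covers its target.
        covered = all(
            abs((xbars[a] - xbars[b]) - (means[a] - means[b])) <= half_width
            for a, b in combinations(range(n_groups), 2)
        )
        hits += covered
    return hits / reps
```

With m = I(I − 1)/2 pairwise comparisons and df = I(J − 1), the two methods differ only in the critical value passed in. In R, for example, the Bonferroni value is `qt(1 - 0.05 / (2 * m), df)` and the Tukey value is `qtukey(0.95, I, df) / sqrt(2)`.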
2.2 ANOVA under non-normality
One focus of the course was to determine whether normal populations with the same
variance have equal means. We called our approach to this problem the “analysis
of variance”. The assumption for the analysis of variance F-test was that the
observations were of the form
$$X_{i,j} = \mu_i + \epsilon_{i,j},$$
where $\epsilon_{i,j} \sim N(0, \sigma^2)$. Sometimes the assumption of normality of the errors $\epsilon_{i,j}$
is not valid. Does the F-test still work when the errors are not normally
distributed? To address this, apply the F-test to non-normal data, i.e., data whose
errors do not follow the normal distribution. Nice examples to check include
errors $\epsilon_{i,j}$ distributed as Exponential(λ) − λ, Uniform(−a, a), or Poisson(λ) − λ,
each of which has mean zero when λ denotes the mean of the distribution.
If I is the number of populations and J is the number of observations sampled from
each population, the analysis should be done for I = 3, 5, and 10 and J = 10, 50, 100,
and 200. Determine by simulation whether the size of the test (the probability of a
type I error) is close to α when the size-α F-test is applied to the simulated data.
Also investigate the power of the test under non-normality. Report and explain your results.
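One way to organize such a simulation is sketched below in Python (standard library only). It is an illustration under stated assumptions, not a required design: all function names are hypothetical, group sizes are equal, and the F critical value `f_crit` is supplied by the caller rather than computed. The exponential errors are parameterized by their mean so that subtracting the mean centers them at zero.

```python
import random


def f_statistic(samples):
    """One-way ANOVA F statistic for equally sized groups."""
    n_groups = len(samples)
    n_per_group = len(samples[0])
    group_means = [sum(s) / n_per_group for s in samples]
    grand_mean = sum(group_means) / n_groups
    ss_between = n_per_group * sum((m - grand_mean) ** 2 for m in group_means)
    ss_within = sum((x - m) ** 2 for s, m in zip(samples, group_means) for x in s)
    ms_between = ss_between / (n_groups - 1)
    ms_within = ss_within / (n_groups * (n_per_group - 1))
    return ms_between / ms_within


def rejection_rate(draw_error, means, n_per_group, f_crit, reps=2000, seed=0):
    """Fraction of simulated data sets on which the F-test rejects.

    With all entries of `means` equal this estimates the size of the test;
    with unequal entries it estimates the power at that alternative.
    """
    rng = random.Random(seed)
    rejections = 0
    for _ in range(reps):
        samples = [[mu + draw_error(rng) for _ in range(n_per_group)] for mu in means]
        rejections += f_statistic(samples) > f_crit
    return rejections / reps


def centered_exponential(rng, mean=1.0):
    # random.expovariate takes a rate, so rate = 1 / mean; subtracting the
    # mean gives an error with expectation zero.
    return rng.expovariate(1.0 / mean) - mean


def centered_uniform(rng, a=1.0):
    # Uniform(-a, a) already has expectation zero.
    return rng.uniform(-a, a)
```

For I = 3 and J = 10 at α = 0.05, the critical value is `qf(0.95, 2, 27)` in R, approximately 3.35, so a size estimate would be `rejection_rate(centered_exponential, [0.0, 0.0, 0.0], 10, f_crit=3.354)`. Centered Poisson errors are omitted because the Python standard library has no Poisson sampler; one would have to be added.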
2.3 Multiple Linear Regression
Consider the model
$$Y_i = \beta_0 + \beta_1 x_{i,1} + \beta_2 x_{i,2} + \cdots + \beta_J x_{i,J} + \epsilon_i, \qquad (1)$$
where $\epsilon_i \sim N(0, \sigma^2)$. For a detailed description of such models, see Section 13.4
of our textbook. This is sometimes referred to as a multiple linear model,
since the dependent variable Y is thought of as a linear function of J independent
variables $x_1, \ldots, x_J$. Define $Y = (Y_1, Y_2, \ldots, Y_n)'$, $\beta = (\beta_0, \beta_1, \ldots, \beta_J)'$,
$\epsilon = (\epsilon_1, \epsilon_2, \ldots, \epsilon_n)'$, and $X = (x_{i,j})$ with $x_{i,0} = 1$ for every i;
i.e., Y is the vector of n observations of Y, β is the vector of parameters, and X is the
n × (J + 1) matrix consisting of a column of ones followed by the values of the
independent variables. The model in (1) can then be expressed via the matrix equation
$$Y = X\beta + \epsilon.$$
1. Using arguments from linear algebra, show that $\hat{\beta} = (X'X)^{-1}X'Y$ is the least
squares estimator of the vector of parameters β. That is to say, show that, when
$X'X$ is invertible, $\hat{\beta}$ is the unique minimizer of $f(\beta) = \|Y - X\beta\|$, where $\|\cdot\|$ is the
standard Euclidean norm in $\mathbb{R}^n$.
2. Consider the built-in R data set
mtcars
Suppose we wish to model a car’s mpg as a linear function of its weight,
number of cylinders, and horsepower:
$$\text{mpg} = \beta_0 + \beta_1\,\text{weight} + \beta_2\,(\text{\# of cylinders}) + \beta_3\,\text{horsepower}.$$
Using the data set, estimate the parameters $\beta_i$, 0 ≤ i ≤ 3, via least squares.
Let $\hat{y}_i$ be the fitted values using these parameter estimates. Define
$$SSE = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$
and
$$SST = \sum_{i=1}^{n} (y_i - \bar{y})^2,$$
where $\bar{y}$ is the sample mean of the $y_i$.
The coefficient of determination for a multiple linear regression model is
defined by
$$R^2 = 1 - \frac{SSE}{SST}.$$
Discuss whether or not the use of the multiple linear model seems reasonable
in this case. Consider transformations of the variables as well as interactions
between the independent variables. Are there any variables you would add to or
delete from the model? Using the estimated parameters, estimate the mpg of
a car that weighs 2000 pounds, has 6 cylinders, and has 250 horsepower. Of the
three variables we are considering, which seems the most important when it
comes to predicting the mpg?
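The two computations in this problem, solving the normal equations $X'X\beta = X'Y$ and evaluating $R^2 = 1 - SSE/SST$, can be sketched in a few lines of Python (standard library only). This is an illustration, not the expected solution: the helper names are hypothetical, and the data are a made-up toy example in which the response is an exact linear function of two predictors, so the estimates can be checked by eye. It does not use mtcars. In R itself, the usual route would be `lm(mpg ~ wt + cyl + hp, data = mtcars)`, keeping in mind that `wt` is recorded in thousands of pounds.

```python
def solve(a, b):
    """Solve the square system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):  # back substitution
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x


def least_squares(rows, y):
    """Return beta_hat solving (X'X) beta = X'Y, where X gets a leading column of ones."""
    x = [[1.0] + list(r) for r in rows]
    p = len(x[0])
    xtx = [[sum(xi[j] * xi[k] for xi in x) for k in range(p)] for j in range(p)]
    xty = [sum(xi[j] * yi for xi, yi in zip(x, y)) for j in range(p)]
    return solve(xtx, xty)


def r_squared(rows, y, beta):
    """Coefficient of determination R^2 = 1 - SSE / SST."""
    fitted = [beta[0] + sum(b * v for b, v in zip(beta[1:], r)) for r in rows]
    y_bar = sum(y) / len(y)
    sse = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    sst = sum((yi - y_bar) ** 2 for yi in y)
    return 1.0 - sse / sst


# Made-up toy data (NOT mtcars): y = 1 + 2*x1 + 3*x2 exactly.
rows = [(0, 0), (1, 0), (0, 1), (1, 1), (2, 1)]
y = [1 + 2 * x1 + 3 * x2 for x1, x2 in rows]
beta_hat = least_squares(rows, y)
print(beta_hat, r_squared(rows, y, beta_hat))
```

Because the toy response is exactly linear, the estimates recover the coefficients (1, 2, 3) up to rounding and $R^2$ is 1 up to rounding. Solving the normal equations directly, rather than forming $(X'X)^{-1}$, is a design choice made here for brevity; it gives the same estimator whenever $X'X$ is invertible.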
2.4 Predicting Student Debt
A hot topic of late is the large amount of debt college students are accruing in
the US. What factors are important in predicting a student’s indebtedness based on
the institution they will attend? The problem above, as well as Section 13.4 of our
textbook, discusses the use of multiple linear regression models. Using the Colleges
and Universities Data below, develop a multiple linear regression model to predict
the Average Indebtedness from all other variables. Be sure to consider
transformations of the variables as well as interactions between the independent
variables. Are there any variables you would delete from the model? If so, explain
why. Remove the variables you do not find useful and give the best model you
would recommend. Give the reasons for your choice. Perform a thorough residual
analysis, discuss the usefulness of your model, and discuss whether the model
assumptions are reasonably satisfied.
This data is taken from David A. Levine, Patricia M. Ramsey & Robert K. Smidt,
"Applied Statistics for Engineers and Scientists," Prentice Hall (2001) p. 668.
Variables and Description:
School                 Name of Institution
Type of Term           Academic Calendar Type (1=semester, 0=other)
Type of School         Institution is (1=private, 0=public)
Average Total SAT      School average for total score of Scholastic Aptitude Test
TOEFL Score            Test of English as a Foreign Language (1=criterion at least 550, 0=otherwise)
Room and Board         Room and board expenses (in thousands of dollars)
Annual Total Cost      Annual total cost (in thousands of dollars)
Average Indebtedness   Average indebtedness at graduation (in thousands of dollars)

School                          Term  Type  SAT   TOEFL  Room/Board  Total Cost  Indebtedness
Arizona State University        1     0     1080  0      4.3         12.7        12.90
Ball State University           1     0     985   1      4.0         12.5        8.21
Cal. State Univ.-Fresno         1     0     955   0      5.4         13.1        8.76
Clemson University              1     0     1130  1      3.9         12.4        9.98
College of William-Mary         1     0     1295  1      4.5         19.4        13.42
Florida International Univ.     1     0     1135  0      2.7         10.0        4.14
Florida State University        1     0     1180  1      4.5         11.5        16.50
George Mason University         1     0     1055  1      5.0         17.0        13.00
Georgia State University        0     0     1115  0      7.4         15.4        8.08
Montclair State University      1     0     1025  0      5.3         10.2        4.50
North Carolina State Univ.      1     0     1145  1      4.0         14.3        14.99
Oregon State University         0     0     1072  1      4.4         15.5        10.50
Purdue University               1     0     1095  1      4.5         15.2        11.84
San Diego State University      1     0     945   1      6.2         13.6        6.75
Slippery Rock Univ. of Penn.    1     0     955   0      3.6         12.9        17.00
SUNY-Binghamton                 1     0     1039  1      4.6         13.4        6.25
Texas A-M University            1     0     1150  1      3.9         12.7        4.10
Univ. of Georgia                0     0     1180  1      4.0         11.9        10.80
Univ. of Hawaii-Manoa           1     0     1075  0      4.7         12.6        3.62
Univ. of Houston                1     0     1065  1      4.1         12.1        9.40
Univ. of Maryland               1     0     1170  1      5.5         15.7        16.64
Univ. of Mass.-Amherst          1     0     1100  1      4.2         16.4        10.20
Univ. of Nevada-Las Vegas       1     0     980   0      5.5         12.3        10.00
Univ. of New Hampshire          1     0     1110  1      4.4         18.6        9.66
Univ. of North Carolina-C.H.    1     0     1225  1      4.5         15.2        9.41
Univ. of Texas-Austin           1     0     1215  1      3.9         12.9        10.20
Univ. of Vermont                1     0     1115  1      5.1         22.4        21.50
Virginia Commonwealth Univ.     1     0     1005  1      4.3         16.3        14.73
Virginia Tech                   1     0     1265  1      3.5         14.9        10.33
West Virginia University        1     0     1025  1      4.6         11.7        10.70
Babson College                  1     1     1165  1      7.6         26.4        18.00
Boston College                  1     1     1285  1      7.5         26.8        15.86
Boston University               1     1     1235  1      7.0         27.9        14.46
Bowdoin College                 1     1     1345  1      6.0         27.8        13.64
Bryant College                  0     1     1080  1      6.7         20.6        18.00
Bucknell University             1     1     1255  1      5.0         25.4        12.50
Canisius College                1     1     1143  0      5.9         18.8        14.82
Carnegie Mellon University      1     1     1335  1      6.1         25.6        15.68
Case Western Reserve Univ.      1     1     1330  1      5.0         22.2        26.03
Clark University                1     1     1121  1      4.4         24.4        17.50
Colby College                   0     1     1275  1      5.7         27.9        11.63
Colgate University              1     1     1300  1      5.9         27.6        9.24
College of Holy Cross           1     1     1275  1      6.7         26.8        12.63
Emory University                1     1     1310  1      6.5         26.6        15.31
Fordham University              1     1     1150  1      7.4         23.4        8.59
Franklin-Marshall College       1     1     1260  1      4.5         26.4        11.50
George Washington University    1     1     1235  1      6.9         26.7        14.37
Georgetown University           1     1     1330  1      7.5         27.5        14.01
Gettysburg College              1     1     1200  1      4.8         26.4        11.75
Harvard University              1     1     1465  1      7.0         28.9        11.65
Iona College                    1     1     955   1      7.3         19.8        18.00
Lafayette College               1     1     1185  1      6.3         26.7        11.50
La Salle University             1     1     1105  0      6.7         20.8        11.70
Lehigh University               1     1     1225  1      6.0         26.8        13.84
Manhattan College               1     1     952   1      7.1         22.0        9.27
New York University             1     1     1260  1      7.8         28.6        17.32
Niagara University              1     1     1065  0      5.4         17.6        11.58
Northeastern University         0     1     1055  1      8.2         23.4        25.60
Northwestern University         0     1     1350  1      6.1         24.2        11.98
Providence College              1     1     1185  1      6.7         22.9        17.50
Rice University                 1     1     1395  1      6.0         18.0        2.32
Rochester Inst. Technology      0     1     1185  0      6.1         21.8        17.50
Seattle University              0     1     1100  1      5.3         19.5        12.00
Seton Hall University           1     1     1030  1      7.1         20.8        14.90
Siena College                   1     1     1095  0      5.4         17.6        18.25
Southern Methodist University   1     1     1150  1      5.3         21.3        12.11
St. Bonaventure University      1     1     1098  1      5.1         17.5        14.00
Stanford University             0     1     1430  1      7.3         27.8        12.77
Syracuse University             1     1     1180  1      7.2         24.3        14.50
Tulane University               1     1     1270  1      6.3         27.5        13.85
Univ. of Chicago                0     1     1370  1      7.3         28.8        14.07
Univ. of Miami                  1     1     1145  1      7.1         25.7        16.07
Univ. of Notre Dame             1     1     1320  1      4.8         23.8        16.57
Univ. of Pennsylvania           1     1     1355  1      7.5         28.6        17.62
Univ. of Portland               1     1     1135  0      4.5         18.9        13.90
Univ. of Scranton               0     1     1115  0      6.6         21.6        13.50
Vanderbilt University           1     1     1295  1      7.1         27.3        14.50
Villanova University            1     1     1242  1      7.0         24.8        17.13
Wake Forest University          1     1     1280  1      5.2         23.7        18.70
Yale University                 1     1     1450  1      6.7         28.9        13.57