Applied Statistics

1 Final Project Outline

From a mathematical point of view, statistics is simply a collection of analytical tools used to gain insight into data. It is assumed that working statisticians have a broad knowledge of these tools and understand their mathematical underpinnings. The goal of my instruction up to this point has been to introduce you to such tools and to develop the mathematical intuition needed to understand how they work. Working with data is, however, frequently the easiest part of a statistician’s job. Since statisticians almost always work as part of a much larger research team, a statistician must be able both to work within a team and to convey findings to others. The goal of this project is to practice these aspects of a statistician’s job.

1.1 Teamwork

This project is a “group” project in the sense that you will be working in small groups of 3-4. You will be allowed to form groups on your own; I will intervene in this process only if necessary. At the end of the project, each group member will fill out a questionnaire evaluating their teammates’ performance on the project. Part of your individual grade will be based on your group members’ assessment of the effort you contributed to the overall completion of the project.

1.2 Write-up and Oral Presentation

The final result of the project is a write-up and an oral presentation that address your group’s findings on the problems below.

1. Write-up: The write-up should include, for each problem, an introduction, a hypotheses section (when appropriate), a results section, and an appendix containing any code you used to solve the problem. The write-up should be clear, concise, well organized, and grammatically correct. In my opinion, the most important part of technical writing is making the problem you are addressing clear (and important sounding) and giving a concise and insightful solution.

2. Oral Presentation: The project also includes an oral presentation of your results. Each member of the group is expected to contribute to the oral presentation (i.e. speak). The presentation should last 35-40 minutes and must be supported by an electronic slide show, for example PowerPoint, Prezi, Beamer, etc.

1.3 Due Dates

Your write-up is due at the time of your presentation. Your presentation must take place between April 22 and May 3. Presentation slots will be offered during our final exam period, and at other times within this window according to the availability of the groups.

1.4 Grading

You will be graded on both an individual and a group basis. Your final project grade will be determined as follows.

1. Write-up: 50%
   For each problem:
   (a) 20% Grammar and Organization
   (b) 50% Solution and Analysis
   (c) 20% Organization of Code
   (d) 10% Aesthetics (pictures, overall appearance, etc.)
2. Presentation: 35%
   (a) 30% Appearance of slide show
   (b) 40% Content and Organization
   (c) 30% Delivery and preparedness
3. Teamwork Evaluation: 15%

2 Project Problems

Your team must address two of the four problems outlined below: one of the first two (2.1 or 2.2) and one of the last two (2.3 or 2.4).

2.1 Comparing Tukey’s Method to Bonferroni’s Method

During the course, we discussed two methods of producing simultaneous confidence intervals for the differences between the theoretical means of independent normal populations: Tukey’s Method, which is based on the studentized range distribution, and Bonferroni’s Method, which uses a conservative approximation. By means of a simulation study, investigate and compare the effectiveness of these two methods.
Using a confidence level of 95%, construct confidence intervals for the differences between the mean parameters from simulated normal data with different means (and equal variances). By repeating the simulation, estimate the rate at which these confidence intervals simultaneously contain their respective parameters. This should be done for both methods. If $I$ is the number of populations and $J$ is the number of observations per population, the analysis should be done for $I = 3, 5, 10$ and $J = 10, 50, 100, 200$.

2.2 ANOVA under non-normality

One focus of the course was to determine whether normal populations with the same variance have equal means. We called our approach to this problem the “analysis of variance”. The assumption for the analysis of variance $F$-test was that the observations are of the form

$X_{i,j} = \mu_i + \epsilon_{i,j}$, where $\epsilon_{i,j} \sim N(0, \sigma^2)$.

Sometimes the assumption of normality of the errors $\epsilon_{i,j}$ is not valid. Does the $F$-test still work when the errors are not normally distributed? To address this, apply the $F$-test to non-normal data, i.e. data whose errors do not follow the normal distribution. Nice examples to check include errors $\epsilon_{i,j}$ distributed as $\text{Exponential}(\lambda) - \lambda$, $\text{Unif}(-a, a)$, or $\text{Poisson}(\lambda) - \lambda$. If $I$ is the number of populations and $J$ is the number of observations per population, the analysis should be done for $I = 3, 5, 10$ and $J = 10, 50, 100, 200$. Determine by simulation whether the size of the test (the probability of a type I error) is close to $\alpha$ when the size-$\alpha$ $F$-test is applied to the simulated data. Also investigate the power of the test under non-normality. Report and explain your results.

2.3 Multiple Linear Regression

Consider the model

$Y_i = \beta_0 + \beta_1 x_{i,1} + \beta_2 x_{i,2} + \cdots + \beta_J x_{i,J} + \epsilon_i$, (1)

where $\epsilon_i \sim N(0, \sigma^2)$. For a detailed description of such models, see Chapter 13, Section 4 of our textbook.
This is sometimes referred to as a multiple linear model, since the dependent variable $Y$ is thought of as a linear function of the $J$ independent variables $x_1, \ldots, x_J$. Define $\mathbf{Y} = (Y_1, Y_2, \ldots, Y_n)'$, $\boldsymbol{\beta} = (\beta_0, \beta_1, \ldots, \beta_J)'$ and $\mathbf{X} = (x_{i,j})$, i.e. $\mathbf{Y}$ is the vector of $n$ observations of $Y$, $\boldsymbol{\beta}$ is the vector of parameters, and $\mathbf{X}$ is the matrix of values of the independent variables (with $x_{i,0} = 1$ for every $i$, so that the first column carries the intercept $\beta_0$). The model in (1) can then be expressed via the matrix equation

$\mathbf{Y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\epsilon}$.

1. Using arguments from linear algebra, show that

   $\hat{\boldsymbol{\beta}} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{Y}$

   is the least squares estimator of the vector of parameters $\boldsymbol{\beta}$. That is to say, show that $\hat{\boldsymbol{\beta}}$ is the unique minimizer of $f(\boldsymbol{\beta}) = \|\mathbf{Y} - \mathbf{X}\boldsymbol{\beta}\|$, where $\|\cdot\|$ is the standard Euclidean norm in $\mathbb{R}^n$.

2. Consider the built-in R data set mtcars. Suppose we wish to model a car’s mpg as a linear function of its weight, number of cylinders, and horsepower:

   $\text{mpg} = \beta_0 + \beta_1\,\text{weight} + \beta_2\,(\text{number of cylinders}) + \beta_3\,\text{horsepower}$.

   Using the data set, estimate the parameters $\beta_i$, $0 \le i \le 3$, via least squares. Let $\hat{y}_i$ be the fitted values using these parameter estimates. Define

   $SSE = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$

   and

   $SST = \sum_{i=1}^{n} (y_i - \bar{y})^2$,

   where $\bar{y}$ is the sample mean of the $y_i$. The coefficient of determination for a multiple linear regression model is defined by

   $R^2 = 1 - \frac{SSE}{SST}$.

   Discuss whether or not the use of the multiple linear model seems reasonable in this case. Consider transformations of the variables as well as interactions between the independent variables. Are there any variables you would add to or delete from the model? Using the estimated parameters, estimate the mpg of a car which weighs 2000 pounds, has 6 cylinders, and has 250 horsepower. Of the three variables we are considering, which seems the most important for predicting the mpg?

2.4 Predicting Student Debt

A hot topic as of late is the large amount of debt college students are accruing in the US. What factors are important in predicting a student’s indebtedness based on the institution they will attend?
The problem above, as well as Section 13.4 of our textbook, discusses the use of multiple linear regression models. Using the Colleges and Universities data below, develop a multiple linear regression model to predict the Average Indebtedness from all other variables. Be sure to consider transformations of the variables as well as interactions between the independent variables. Are there any variables you would delete from the model? If so, explain why. Remove the variables you do not find useful and give the best model you would recommend. Give the reasons for your choice. Perform a thorough residual analysis, discuss the usefulness of your model, and discuss whether the model assumptions are reasonably satisfied.

These data are taken from David A. Levine, Patricia M. Ramsey & Robert K. Smidt, "Applied Statistics for Engineers and Scientists," Prentice Hall (2001), p. 668.

Variables and Description:

School (School)                      Name of institution
Type of Term (Term)                  Academic calendar type (1=semester, 0=other)
Type of School (Type)                Institution is (1=private, 0=public)
Average Total SAT (SAT)              School average for total score of Scholastic Aptitude Test
TOEFL Score (TOEFL)                  Test of English as a Foreign Language (1=criterion at least 550, 0=otherwise)
Room and Board (Room / Board)        Room and board expenses (in thousands of dollars)
Annual Total Cost (Total Cost)       Annual total cost (in thousands of dollars)
Average Indebtedness (Indebtedness)  Average indebtedness at graduation (in thousands of dollars)

School                        Term  Type  SAT   TOEFL  Room / Board  Total Cost  Indebtedness
ArizonaStateUniversity        1     0     1080  0      4.3           12.7        12.9
BallStateUniversity           1     0     985   1      4             12.5        8.21
Cal.StateUniv.-Fresno         1     0     955   0      5.4           13.1        8.76
ClemsonUniversity             1     0     1130  1      3.9           12.4        9.98
CollegeofWilliam-Mary         1     0     1295  1      4.5           19.4        13.42
FloridaInternationalUniv.     1     0     1135  0      2.7           10          4.14
FloridaStateUniversity        1     0     1180  1      4.5           11.5        16.5
GeorgeMasonUniversity         1     0     1055  1      5             17          13
GeorgiaStateUniversity        0     0     1115  0      7.4           15.4        8.08
MontclairStateUniversity      1     0     1025  0      5.3           10.2        4.5
NorthCarolinaStateUniv.       1     0     1145  1      4             14.3        14.99
OregonStateUniversity         0     0     1072  1      4.4           15.5        10.5
PurdueUniversity              1     0     1095  1      4.5           15.2        11.84
SanDiegoStateUniversity       1     0     945   1      6.2           13.6        6.75
SlipperyRockUniv.ofPenn.      1     0     955   0      3.6           12.9        17
SUNY-Binghamton               1     0     1039  1      4.6           13.4        6.25
TexasA-MUniversity            1     0     1150  1      3.9           12.7        4.1
Univ.ofGeorgia                0     0     1180  1      4             11.9        10.8
Univ.ofHawaii-Manoa           1     0     1075  0      4.7           12.6        3.62
Univ.ofHouston                1     0     1065  1      4.1           12.1        9.4
Univ.ofMaryland               1     0     1170  1      5.5           15.7        16.64
Univ.ofMass.-Amherst          1     0     1100  1      4.2           16.4        10.2
Univ.ofNevada-LasVegas        1     0     980   0      5.5           12.3        10
Univ.ofNewHampshire           1     0     1110  1      4.4           18.6        9.66
Univ.ofNorthCarolina-C.H.     1     0     1225  1      4.5           15.2        9.41
Univ.ofTexas-Austin           1     0     1215  1      3.9           12.9        10.2
Univ.ofVermont                1     0     1115  1      5.1           22.4        21.5
VirginiaCommonwealthUniv.     1     0     1005  1      4.3           16.3        14.73
VirginiaTech                  1     0     1265  1      3.5           14.9        10.33
WestVirginiaUniversity        1     0     1025  1      4.6           11.7        10.7
BabsonCollege                 1     1     1165  1      7.6           26.4        18
BostonCollege                 1     1     1285  1      7.5           26.8        15.86
BostonUniversity              1     1     1235  1      7             27.9        14.46
BowdoinCollege                1     1     1345  1      6             27.8        13.64
BryantCollege                 0     1     1080  1      6.7           20.6        18
BucknellUniversity            1     1     1255  1      5             25.4        12.5
CanisiusCollege               1     1     1143  0      5.9           18.8        14.82
CarnegieMellonUniversity      1     1     1335  1      6.1           25.6        15.68
CaseWesternReserveUniv.       1     1     1330  1      5             22.2        26.03
ClarkUniversity               1     1     1121  1      4.4           24.4        17.5
ColbyCollege                  0     1     1275  1      5.7           27.9        11.63
ColgateUniversity             1     1     1300  1      5.9           27.6        9.24
CollegeofHolyCross            1     1     1275  1      6.7           26.8        12.63
EmoryUniversity               1     1     1310  1      6.5           26.6        15.31
FordhamUniversity             1     1     1150  1      7.4           23.4        8.59
Franklin-MarshallCollege      1     1     1260  1      4.5           26.4        11.5
GeorgeWashingtonUniversity    1     1     1235  1      6.9           26.7        14.37
GeorgetownUniversity          1     1     1330  1      7.5           27.5        14.01
GettysburgCollege             1     1     1200  1      4.8           26.4        11.75
HarvardUniversity             1     1     1465  1      7             28.9        11.65
IonaCollege                   1     1     955   1      7.3           19.8        18
LafayetteCollege              1     1     1185  1      6.3           26.7        11.5
LaSalleUniversity             1     1     1105  0      6.7           20.8        11.7
LehighUniversity              1     1     1225  1      6             26.8        13.84
ManhattanCollege              1     1     952   1      7.1           22          9.27
NewYorkUniversity             1     1     1260  1      7.8           28.6        17.32
NiagaraUniversity             1     1     1065  0      5.4           17.6        11.58
NortheasternUniversity        0     1     1055  1      8.2           23.4        25.6
NorthwesternUniversity        0     1     1350  1      6.1           24.2        11.98
ProvidenceCollege             1     1     1185  1      6.7           22.9        17.5
RiceUniversity                1     1     1395  1      6             18          2.32
RochesterInst.Technology      0     1     1185  0      6.1           21.8        17.5
SeattleUniversity             0     1     1100  1      5.3           19.5        12
SetonHallUniversity           1     1     1030  1      7.1           20.8        14.9
SienaCollege                  1     1     1095  0      5.4           17.6        18.25
SouthernMethodistUniversity   1     1     1150  1      5.3           21.3        12.11
St.BonaventureUniversity      1     1     1098  1      5.1           17.5        14
StanfordUniversity            0     1     1430  1      7.3           27.8        12.77
SyracuseUniversity            1     1     1180  1      7.2           24.3        14.5
TulaneUniversity              1     1     1270  1      6.3           27.5        13.85
Univ.ofChicago                0     1     1370  1      7.3           28.8        14.07
Univ.ofMiami                  1     1     1145  1      7.1           25.7        16.07
Univ.ofNotreDame              1     1     1320  1      4.8           23.8        16.57
Univ.ofPennsylvania           1     1     1355  1      7.5           28.6        17.62
Univ.ofPortland               1     1     1135  0      4.5           18.9        13.9
Univ.ofScranton               0     1     1115  0      6.6           21.6        13.5
VanderbiltUniversity          1     1     1295  1      7.1           27.3        14.5
VillanovaUniversity           1     1     1242  1      7             24.8        17.13
WakeForestUniversity          1     1     1280  1      5.2           23.7        18.7
YaleUniversity                1     1     1450  1      6.7           28.9        13.57
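As an illustration of how the least squares formula from Problem 2.3 applies to the data above, the following is a minimal sketch and not a model recommendation. It is written in Python using only the standard library (the course itself works in R), it uses only the first ten schools and three predictors (an arbitrary choice made to keep the example short), and the helper names `solve_linear_system` and `least_squares` are hypothetical names introduced here. It solves the normal equations $(\mathbf{X}'\mathbf{X})\boldsymbol{\beta} = \mathbf{X}'\mathbf{Y}$ directly and then computes $SSE$, $SST$ and $R^2$ as defined in Problem 2.3.

```python
# Illustrative subset: the first 10 schools from the data above.
# Columns: average SAT, room and board, total cost, average indebtedness
# (the last three in thousands of dollars).
rows = [
    (1080, 4.3, 12.7, 12.90),
    (985, 4.0, 12.5, 8.21),
    (955, 5.4, 13.1, 8.76),
    (1130, 3.9, 12.4, 9.98),
    (1295, 4.5, 19.4, 13.42),
    (1135, 2.7, 10.0, 4.14),
    (1180, 4.5, 11.5, 16.50),
    (1055, 5.0, 17.0, 13.00),
    (1115, 7.4, 15.4, 8.08),
    (1025, 5.3, 10.2, 4.50),
]


def solve_linear_system(A, b):
    """Solve A z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [list(A[i]) + [b[i]] for i in range(n)]  # augmented matrix [A | b]
    for col in range(n):
        # Swap in the row with the largest pivot for numerical stability.
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    # Back substitution on the upper-triangular system.
    z = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(M[i][j] * z[j] for j in range(i + 1, n))
        z[i] = (M[i][n] - tail) / M[i][i]
    return z


def least_squares(X, y):
    """Return beta-hat by solving the normal equations (X'X) beta = X'y."""
    p = len(X[0])
    XtX = [[sum(row[a] * row[b] for row in X) for b in range(p)]
           for a in range(p)]
    Xty = [sum(row[a] * yi for row, yi in zip(X, y)) for a in range(p)]
    return solve_linear_system(XtX, Xty)


# Design matrix with an intercept column. SAT is rescaled to hundreds
# (an assumption of this sketch) so the columns have comparable magnitudes.
X = [[1.0, sat / 100.0, rb, cost] for sat, rb, cost, _ in rows]
y = [debt for *_, debt in rows]

beta_hat = least_squares(X, y)
fitted = [sum(b * x for b, x in zip(beta_hat, row)) for row in X]
residuals = [yi - fi for yi, fi in zip(y, fitted)]

y_bar = sum(y) / len(y)
sse = sum(e ** 2 for e in residuals)
sst = sum((yi - y_bar) ** 2 for yi in y)
r_squared = 1.0 - sse / sst

print("beta_hat:", [round(b, 4) for b in beta_hat])
print("R^2:", round(r_squared, 4))
```

Because the model includes an intercept, the residuals of a correct least squares fit are orthogonal to every column of $\mathbf{X}$, which gives a quick sanity check on any implementation. A full analysis would use all 80 schools, the remaining variables, and the transformations, interactions, and residual diagnostics requested in the problem.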