R Workshop: Day 3 Yun Ju Sung 7/15/2014 5. Regression Analysis Regression analysis is fundamental and forms a major part of statistical analysis! Linear regression analysis • The regression model: y = β0 + β1x + ε • The values of β0 and β1 are unknown and will be estimated in a reasonable manner from the data • The estimated regression line is yˆ ˆ 0 ˆ 1 x (using "hats" to denote the estimates) • For each data point xi we have yˆ i ˆ 0 ˆ1 x i (called the predicted value) • The difference between the true and predicted value is the residual e i y i yˆ i Example: linear regression with R • The maximum heart rate of a person is often said to be related to age by the equation: Max rate= 220 - Age • Suppose this is to be empirically proven and 15 people of varying ages are tested for their maximum heart rate. • We use lm() to fit a linear model Example: linear regression with R • The result of the lm function can be stored: lm.result=lm(y~x) • We can use summary to get regression coefficients (and more): summary(lm.result) • The result of the lm function is of class lm and so the plot and summary commands have been adapted. • For several generic functions (including print, plot, and summary), the result depends on the class of object that is given as argument. Prediction for new values • One primary use of regression analysis is to make predictions for the response value for new values of the predictor. • We can use a regression line yˆ ˆ 0 ˆ1 x to get the predicted value for a new value of x. • It is a little chunk to use the predict function! • For example, to get a prediction for 50 and 60 year old • We need a data frame with column names matching the predictor variable, which can be done with the data.frame function with name=values. • Can you get the same answer with estimated coefficients? Assignment: Question 10 For the data set on housing prices, homedata, investigate the relationship between old assessed value and new. a. Use old as the predictor variable. Does the data suggest a linear relationship? b. Are there any outliers? What may have caused these outliers? b. What is the predicted new assessed value for a $75,000 house in 1970? Interacting with a scatterplot • When looking at a scatterplot we see the trend of the data as • • • • well as individual data points. If one of these data points stands out, how can it be identified? We can use identify function to identify points on a scatterplot by their corresponding index with each mouse click. A template for its usage is identify (x, y, labels=…,n=…) The value n specifies the number of points to identify. For example, identify(BUSH,BUCHANAN,n=2,labels=County) can be used to identify two outliers in Question 11. Assignment: Question 11 For the florida dataset, investigate the relationship between Bush vs. Buchanan. a. Fit a regression line using Bush as the predictor variable. Does the data suggest a linear relationship? b. Are there any outliers? Identify two outliers c. Fit another regression line after excluding two outliers. What is the predicted value for the number of votes Buchanan would get in Miami-Dade county based on the number of Bush votes? The assumptions of the linear model • The regression model is yi=β0+β1xi+εi • In order to make statistical inference, we need to make assumptions about the errors εi. • The standard assumptions are that these errors are independent, normals with mean 0 and common variance 2 • The validity of the model can be checked graphically • The plot function can do this if we give it the result of the regression, as plot(lm.result, which=1:4) Checking the validity of the linear model • Residuals vs. fitted: Look for spread around the line y = 0 and no obvious trend. • Normal Q-Q plot: The residuals are normal if this graph falls close to a straight line. • Scale-Location plot shows the square root of the standardized residuals. The tallest points, are the largest residuals. • Cook's distance plot identifies points which have a lot of influence in the regression line. Inference about the slope β1 • Once you are satisfied that the model fits the data, statistical inferences can be made. • The normalized value of estimator b1 for β1 has the t-distribution with n-2 degrees of freedom (here we use b1 instead of ˆ1 ) • To test H0: β1 = a (the null hypothesis) against HA: β1 ≠ a (the alternative hypothesis) we compute the t statistic and finds the p-value from the t-distribution. Inference about the slope β1 • Let’s do a test to see if the slope of -1 is correct for our heart- rate example. This corresponds to H0: β1 = -1 vs. HA: β1 ≠ -1 • The actual p-value is twice this as the problem is two-sided. It is unlikely that for this data the slope is -1. • R will automatically test for the assumption β1= 0, which means no slope. See how the p-value is included in the output of the summary command in the column Pr(>|t|). • Can you compute this p-value by yourself? Inference about the intercept β0 • A statistical test for b0 can be done using that has also a t-distribution with n-2 degrees of freedom • Let's check if the data supports the intercept of 220. Formally, we test H0: β0 = 220 vs. HA: β0 < 220. • We need to compute the value of the test statistic and then look up the one-sided p-value. • We would reject the value of 220 based on this p-value as well. Assignment: Question 12 The cost of a home depends on the number of bedrooms in the house. Suppose the following data is recorded for homes in a given town Price (in 300 250 400 550 317 389 425 289 389 559 thousands) No. of 3 3 4 5 4 3 6 3 4 5 bedrooms Make a scatterplot, and fit the data with a regression line. b. Test the hypothesis that an extra bedroom costs $60,000 against the alternative that it costs more. a. Assignment: Question 13 It is well known that the more beer you drink, the more your blood alcohol level rises. Suppose we have the following data on student beer consumption Beers BAL 5 2 9 8 3 7 3 5 3 5 0.10 0.03 0.19 0.12 0.04 0.095 0.07 0.06 0.02 0.05 Make a scatterplot and fit the data with a regression line. b. Test the hypothesis that another beer raises your BAL by 0.02 percent against the alternative that it is less. c. Test the hypothesis that the intercept is 0 with a two-sided alternative. a. 6. Multivariate Data Graphics and other simple functions to explore multivariate data, data with many variables Multivariate data • Multivariate data consists of multiple data vectors considered as a whole. • There are many advantages to combining them into a single data object. This makes it easier to save our work, is convenient for many functions, and is much more organized. • Multivariate data can be summarized and viewed in ways that are similar to those for bivariate and univariate data. • As there are many more possible relationships, we can look at all the variables simultaneously or hold some variables constant while we look at others. Data frame • Often in statistics, data is presented in a tabular format similar to a spreadsheet. The columns are for different variables. Each row corresponds to measurements on the same subject or experimental unit. • R uses a data frame to store rectangular grids of data. • Each column is a data vector containing data for one of the variables. • The columns need not all be of the same type (e.g., numeric, character, or logical). Store multivariate data in data frames • Data frames are created with the data.frame() function • We can use the row.names() function to set the rows names: List • A list is a more general type of storage than a data frame. • Think of a list as a collection of components. Each component can be any R object, such as a vector, a data frame, a function, or even another list. • A list is a very flexible type of data object containing ordered collections of objects (components)! • Many of the functions in R, such as lm(), have return values that are lists, although only selected portions may be displayed when the return value is printed. • Data frames must have variables of the same length, whereas lists can have variables of different length. Store multivariate data in list • Lists are created with list() > x=1:2 ; y=letters[1:2] ; z=1:3 > data.frame(x,y,z) # Can’t > list(x,y,z) [[1]] [1] 1 2 [[2]] [1] "a" "b" [[3]] [1] 1 2 3 > list(x=x, y=y, z=z) $x [1] 1 2 $y [1] "a" "b" $z [1] 1 2 3 Accessing data in data frames • To access as an array, use single brackets [row, column] • To access as a list, use either a dollar sign ($) or double bracket Sorting a data frame by one column • The sort() function can be used to arrange the values of a data vector in increasing or decreasing order. • To sort a data frame by one of its columns, sort() will not work. • We use order() to rearrange, or permute, the row indices so that the new data frame is sorted as desired. • Try: > sort(mpg) > mpg[order(mpg)] > mtcars[order(mpg),] # sort by miles per gallon > mtcars[order(mpg, decreasing=TRUE), ] # best mpg first Assignment: Question 14 Use the data set mtcars: a. Sort the data set by weight, heaviest first. b. Which car gets the best mileage (largest mpg)? Which gets the worst? c. The cars in rows c(1:3, 8:14, 18:21, 26:28, 30:32) were imported into the United States. Compare the variable mpg for imported and domestic cars using a boxplot. Is there a difference? d. Make a scatterplot of weight, wt, versus miles per gallon, mpg. Label the points according to the number of cylinders, cyl. Describe any trends. Assignment: Question 15 The data set u2 (in UsingR) contains the time in seconds for albums released by the band U2 from 1980 to 1997. The data is stored in a list. a. Make a boxplot of the song lengths by album. Which album has the most spread? Are the means all similar? b. Use sapply() to find the mean song time for each album. Which album has the shortest mean? Repeat with the median. Are the results similar? c. What are the three longest songs? The unlist() function will turn a list into a vector. First unlist the song lengths, then sort. Could you use a data frame to store this data? Scatterplot matrix • R has a built-in function for creating scatterplots of all possible pairs of variables. • This scatterplot matrix can be made with the pairs() function Scatterplot with colors • We can plot multiple relationships on the same scatterplot 180 using different plot characters or colors to distinguish the variables. • Example: gestation period vs. birth weight in babies data > gestation[gestation == 999]= NA # 999 for NA > plot(gestation, wt, pch=smoke) > table(smoke) # values of plot characters 120 60 80 100 wt 140 160 never yes until pregnant longago unknown 150 > legend(locator(1), legend=c("never","yes","until pregnant","longago","unknown"), pch=c(0:3,9),) 200 250 gestation 300 350 180 Example: gestation vs. weight by smoke 160 never smoked smokes now 120 100 80 60 wt 140 > f0=wt[smoke==0]~gestation[smoke==0] > plot(f0,xlab="gestation",ylab="weight") > abline(lm(f0)) > f1=wt[smoke==1]~gestation[smoke==1] > points(f1,col=2,pch=3) > abline(lm(f1),col=2) > legend("topleft", legend=c("never smoked","smokes now"), pch=c(1,3), lty=1, bty="n", col=1:2) 150 200 250 gestation 300 350 Assignment: Question 16 The data set UScereal (in MASS) lists facts about specific cereals sold in a United States supermarket. a. Is there a relationship between manufacturer (mfr) and vitamin type (vitamins) by shelf location (shelf)? Do you expect one? Why? b. Look at the relationship between calories and sugars with a scatterplot. Identify the outliers. Are these also fat-laden cereals? c. Now look at the relationship between calories and sugars with a scatterplot using different size points given by cex=2*sqrt (fat). (This is called a bubble plot. The area of each bubble is proportional to the value of fat.) Describe any additional trend that is seen by adding the bubbles. Any other expected relationships between the variables? The lattice package • Using various colors and plotting characters, we tried to show • • • • whether the third variable, smoke, affected the relationship between gestation time and birth weight. A better solution would be to create a separate scatterplot for each level of the third variable. These graphs can be made using the lattice graphics package. If the package is loaded (with library (lattice)), a description of lattice graphics is available in the help system under ?Lattice. Let's use the data set Cars93 to illustrate. Histogram and boxplot using lattice • Histograms • Boxplots 20 6 40 60 20 80 8 rotary 40 60 80 6 8 rotary 3 4 5 100 80 60 Percent of Total 40 20 0 3 4 5 100 80 60 40 20 0 20 20 40 60 80 20 Max.Price 40 60 80 40 60 80 20 Max.Price 40 60 80 The lattice package • The basic idea is that the graphic consists of a number of panels. Each panel corresponds to some value(s) of a conditioning variable. • The lattice graphing functions are called using the modelformula interface. The formulas have the format response ~ predictor|condition • For univariate graphs such as histograms, the response variable, the left side, is empty. Scatterplot: MPG and fuel rank size • The following will plot the relationship between MPG and tank size. We expect that cars with better mileage can have smaller tanks. This type of plot allows us to check if it is the same for all types of cars. Scatterplot: MPG and fuel rank size 10 Small 15 20 25 Sporty Van 50 45 40 35 MPG.highway 30 25 20 Compact Large Midsize 50 45 40 35 30 25 20 10 15 20 25 10 Fuel.tank.capacity 15 20 25 Back to gestation vs weight by smoke 150 200 250 300 350 3 9 180 160 140 120 100 80 wt 60 0 1 2 180 160 140 120 100 80 60 150 200 250 300 350 150 200 250 300 350 gestation Assignment: Question 17 The data set survey (in MASS) contains survey information on students. a. Make scatterplots using xyplot() of the writing-hand size (Wr.Hnd) versus nonwriting-hand size (NW.Hnd), broken down by gender (Sex). Is there a trend? b. Make boxplots using bwplot() of Pulse for the four levels of Smoke broken down by gender Sex. Are there any trends? Differences? Do you expect any other linear relationships among the variables? Summary Data, univariate and bivariate data • One use of R is as a calculator, to evaluate arithmetic expressions. Calculations can be carried out in parallel, across all elements of a vector at once. • The data is stored in R as a vector. Most functions can be applied to vectors without loops! • Exploratory data analysis aims to allow the data to speak for themselves. It gives multiple views of the data that may provide useful insights. • We covered commonly-used functions for exploring data with graphs. Histograms, density plots, stem-and-leaf displays, and boxplots are useful for examining the distributions of individual variables. Scatterplots are useful for looking at relationships two at a time. Bivariate and regression analysis • In exploring the relationships in bivariate data, the correlation • • • • • coefficient provides a numeric measure. The linear regression analysis can provide the richer and more insightful framework. There may be many more possibilities than regression lines. It is useful to fit a smooth curve, to see whether it suggests systematic departure from a line. Model assumptions, such as normality, independence, and constancy of the variance, should be checked. In particular, plot residuals against fitted values and draw a plot that checks homogeneity of variance. Use of the function plot(), with an lm model, gives both these plots. Some technical notes • Help functions include help() (help page of a known function) • • • • and apropos() (search for function names that include a specified character string). The function c() (concatenate) joins vector elements into vectors. It may be used for numeric, logical and character vectors. Use is.na() to identify elements that are NAs. Commonly used generic functions include print(), plot(), and summary(). For such functions, the result depends on the class of object that is given as argument. For simple forms of scatterplot, use plot() or the lattice function xyplot(). More technical notes • Important R data structures are vectors, factors, lists, and data • • • • frames. Vectors may be of mode numeric, or logical, or character. Data frames use a list structure to group columns, which must all have the same length, together into a single R object. The different columns may be any mix of logical, numeric, character, or factor. Contrast data frames with matrices. All matrix elements have the same mode. A matrix is stored as one long vector that is formatted to appear in a row by column layout. The R system has extensive abilities for inputting data from rectangular files (see help(read.table)), from spreadsheets, and from a variety of statistical package formats. Use attach() or with() (temporary attachment) to give access to the columns of a data frame, without the need to name the data frame whenever a column is accessed. Things not covered • Random data: uniform, normal, binomial, exponential,… • Confidence interval estimation: proportion test, t test • Hypothesis testing • Two sample tests • Chi square tests: goodness of fit, independence,… • Analysis of variance Additional reading • To learn more material at a level of this lecture: • simpleR – using R for Introductory Statistics by John Verzani • Using R for Data Analysis and Graphics by John Maindonald • For those who need more sophisticated and challenging material: • Data Analysis and Graphics Using R: An Example-Based Approach by John Maindonald and W. Hohn Braun is my favoriate! • The manual An Introduction to R by the R core development team is excellent for people familiar with statistics. Left-over Material May be included in the next year? Prediction for new values • Interpretation: Changes in the predictor variable correspond to changes in the response variable in a linear manner: a unit change in the predictor corresponds to a change in the response. 7. Multiple Linear Regression Assignment: Question 8 The data set babies contains much information about babies and their mothers for 1236 observations. Find the correlation coefficient (both Pearson and Spearman) between age and weight. Repeat for the relationship between height and weight. Make scatter plots of each pair and see if your answer makes sense. Functions in R • Easy to create your own functions in R • As tasks become complex, it is a good idea to organize code into functions that perform defined tasks • In R, it is good practice to give default values to function arguments • Template for function: Example for defining functions • A function to simulate the t statistic from a sample of normals with mean 10 and standard deviation 5 • To make mean and standard deviation variable: Control statements in R Similar to c, R has a rich set of program control options • if … else … • for, repeat, while loops • next, break statements For loops • A for loop allows you to loop over values in a vector or list of numbers. It is a powerful programming feature. • To add up the numbers in the vector x • better done with sum Conditional expressions • To do different things based on the value of a variable, use conditional expressions • For example, to get the absolute value function, • To fix a bug, try x[x<0]<- -x[x<0] (or the built in function abs). DAGG (page 148)