Computational Methods for Data Analysis – 2014/15

Lab 2: Linear Regression

Predicting Weight from Height (Hackers, ch. 5)

In this first example of linear regression, we use it to predict weight from height. Download the height and weight dataset from the website and load it in R:

    heights.weights <- read.csv('01_heights_weights_genders.csv', header = TRUE, sep = ',')

Have a look at the first few lines of the dataset:

    head(heights.weights)

You can visualize this dataset using a scatterplot:

    plot(heights.weights[,2:3])

Looking at the plot should convince you that using a linear function to predict a person's weight from her height should work pretty well. The R function that computes linear regression is called lm(). lm() takes as its first argument a regression formula using the ~ operator. In this case we are going to predict weight from height, so we are going to use the formula

    Weight ~ Height

As we will see, such formulas can also be used with more than one variable.

QUESTION: what if you want to predict Height from Weight?

We run linear regression as follows:

    fitted.regression <- lm(Weight ~ Height, data = heights.weights)

The value returned by lm() is a model – a representation of the linear hypothesis that regression computed from the data. We can find out which parameters – the coefficients of the linear equation – were computed by the fit using the function coef():

    coef(fitted.regression)
    intercept <- coef(fitted.regression)[1]
    slope <- coef(fitted.regression)[2]

QUESTION: what are these values?

The value of the slope parameter makes sense – it says that an increase of 1 inch (~2.5 cm) in somebody's height results in an increase of 7.7 pounds (~3.5 kg) in weight. The value of the intercept parameter is weird, but this is because there were no data about people with zero height, and predictive models generally have trouble predicting outputs for inputs that are far removed from the observed data.
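The intercept and slope define the fitted line directly, so a prediction can be computed by hand from the coefficients. The sketch below illustrates this on synthetic data (the variable names and the example height of 70 inches are invented for illustration, not part of the course dataset):

```r
# Sketch on synthetic data: recover a prediction by hand from the
# coefficients and check it against predict().
set.seed(1)
height <- rnorm(100, mean = 66, sd = 4)              # heights in inches
weight <- -350 + 7.7 * height + rnorm(100, sd = 10)  # weights in pounds
fit <- lm(weight ~ height)

intercept <- unname(coef(fit)[1])
slope     <- unname(coef(fit)[2])

# Prediction for a (hypothetical) 70-inch person, computed by hand...
by.hand <- intercept + slope * 70
# ...and the same prediction via predict():
via.predict <- unname(predict(fit, newdata = data.frame(height = 70)))
```

The two values agree exactly, because predict() is just evaluating the same linear equation intercept + slope * x.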
To verify how well the model predicts actual values we can use R's function predict():

    predict(fitted.regression)

We can compare predictions with true values as follows:

    true.values <- with(heights.weights, Weight)
    errors <- true.values - predict(fitted.regression)

(Note the use of with() to operate on a data frame, and of vectorized subtraction.)

The errors computed by the code above are called residuals, because they are the part of the data that is 'left over'. You can compute them directly using R's function residuals():

    residuals(fitted.regression)

We can now plot residuals against fitted values:

    plot(fitted.regression, which = 1)

In order to get a single measure of the quality of the hypothesis, we compute the cost function we saw in the lectures:

    squared.errors <- errors ^ 2
    mse <- mean(squared.errors)
    J <- mse / 2

QUESTION: what value do you get for J?

Other measures of the quality of a hypothesis are also possible – a popular alternative to J is the Root Mean Squared Error (RMSE), the square root of the mean of the squared errors:

    rmse <- sqrt(mse)

Web traffic (Hackers, ch. 5)

The second example of linear regression we're going to see uses it to predict the number of page views for the top 1,000 websites. The data can be found in the top_1000_sites.tsv file. Notice that this is a tab-separated file (the cells are separated by tabs instead of commas), so we need to specify a different separator to read.csv():

    top.1000.sites <- read.csv('top_1000_sites.tsv', sep = '\t', stringsAsFactors = FALSE)

Inspection of the dataset will show that there are nine columns (variables): Rank, Site, Category, UniqueVisitors, Reach, PageViews, HasAdvertising, InEnglish, etc. The top site in the list was Facebook. We're going to predict PageViews. To begin with, we check whether UniqueVisitors could be a good predictor:

    plot(top.1000.sites$UniqueVisitors, top.1000.sites$PageViews)

EXERCISE: do the same but using with() to avoid repeating 'top.1000.sites' many times.
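The steps above can be checked end to end on synthetic data: manually computed errors coincide with residuals(), and MSE, J and RMSE all derive from them. A self-contained sketch (variable names x and y are invented stand-ins, not the course dataset):

```r
# Sketch: errors computed by hand equal residuals(), and the quality
# measures MSE, J and RMSE are simple functions of those errors.
set.seed(2)
x <- runif(50, 0, 10)
y <- 3 + 2 * x + rnorm(50)
fit <- lm(y ~ x)

errors <- y - predict(fit)        # true values minus predictions
mse  <- mean(errors ^ 2)          # mean squared error
J    <- mse / 2                   # the cost function from the lectures
rmse <- sqrt(mse)                 # root mean squared error
```

Note that errors and residuals(fit) are the same vector, so in practice you rarely compute the subtraction yourself.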
The resulting plot is completely impenetrable. In such cases it's often a good idea to try taking the log of the values we're trying to analyze:

    plot(log(top.1000.sites$UniqueVisitors), log(top.1000.sites$PageViews))

The resulting plot looks much more reasonable, and it looks like there's a potential line to be drawn. We build a linear regression hypothesis as follows:

    lm.fit <- lm(log(PageViews) ~ log(UniqueVisitors), data = top.1000.sites)

We can now use summary() to study this model:

    summary(lm.fit)

The output of summary() should look something like what is shown below. Note that summary() displays residuals, coefficients, significance, and various measures of error, including the residual standard error.

    Call:
    lm(formula = log(PageViews) ~ log(UniqueVisitors), data = top.1000.sites)

    Residuals:
        Min      1Q  Median      3Q     Max
    -2.1825 -0.7986 -0.0741  0.6467  5.1549

    Coefficients:
                        Estimate Std. Error t value Pr(>|t|)
    (Intercept)         -2.83441    0.75201  -3.769 0.000173 ***
    log(UniqueVisitors)  1.33628    0.04568  29.251  < 2e-16 ***
    ---
    Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

    Residual standard error: 1.084 on 998 degrees of freedom
    Multiple R-squared: 0.4616,  Adjusted R-squared: 0.4611
    F-statistic: 855.6 on 1 and 998 DF,  p-value: < 2.2e-16

EXERCISE: try to build a model using the other variables as well!
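For the exercise, recall that a formula can combine several predictors with the + operator, and that lm() handles categorical variables (like HasAdvertising and InEnglish) automatically. A sketch of the shape of such a model, run here on synthetic data (the generated values are invented for illustration; only the column names mirror the real dataset):

```r
# Sketch: a multi-predictor formula mixing a numeric and two
# categorical variables, on synthetic data shaped like top_1000_sites.
set.seed(3)
n <- 200
UniqueVisitors <- exp(rnorm(n, mean = 12, sd = 1))
HasAdvertising <- sample(c("Yes", "No"), n, replace = TRUE)
InEnglish      <- sample(c("Yes", "No"), n, replace = TRUE)
PageViews      <- exp(1 + 1.3 * log(UniqueVisitors) + rnorm(n))
d <- data.frame(PageViews, UniqueVisitors, HasAdvertising, InEnglish)

multi.fit <- lm(log(PageViews) ~ log(UniqueVisitors) + HasAdvertising + InEnglish,
                data = d)
summary(multi.fit)
```

Each categorical variable contributes one coefficient per non-baseline level, so this model has four coefficients in total: the intercept, the log(UniqueVisitors) slope, and one for each "Yes" level.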