Computational Methods for Data Analysis – 2014/15
Lab 2: Linear Regression
Predicting Weight from Height (Hackers, ch. 5)
In this first example of the use of linear regression, we use it to predict a person's
weight from their height.
Download the height and weight dataset from the website and load it in R:
heights.weights <- read.csv('01_heights_weights_genders.csv',
                            header = TRUE,
                            sep = ',')
Have a look at the first few lines of the dataset:
head(heights.weights)
You can visualize this dataset using a scatterplot:
plot(heights.weights[, 2:3])  # columns 2 and 3 are Height and Weight
Looking at the plot should convince you that a linear function predicting a
person’s weight from their height should work pretty well.
The R function that computes linear regression is called lm(). lm() takes as its
first argument a regression formula built with the ~ operator. In this case we are
going to predict weight from height, so we use the formula
Weight ~ Height
As we will see, such formulas can also be used with more than one variable.
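For instance (a small sketch; it assumes the first column of the file is called Gender, as the file name suggests), a formula with two predictor variables looks like this:

lm(Weight ~ Height + Gender, data = heights.weights)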
QUESTION: what if you want to predict Height from Weight?
We run linear regression as follows:
fitted.regression <- lm(Weight ~ Height,
                        data = heights.weights)
The value returned by lm() is a model: a representation of the linear hypothesis
that regression computed from the data. (Note that lm() fits this hypothesis by
solving the least-squares problem directly, rather than by the gradient descent
procedure we saw in the lectures, but it finds the same optimum.)
We can find out which parameters (the coefficients of the linear equation) were
computed using the function coef():
coef(fitted.regression)
intercept <- coef(fitted.regression)[1]
slope <- coef(fitted.regression)[2]
QUESTION: what are these values?
The value of the slope parameter makes sense: it says that an increase of 1 inch
(≈ 2.5cm) in somebody’s height corresponds to an increase of about 7.7 pounds
(≈ 3.5kg) in weight. The value of the intercept parameter looks weird, but this is
because there were no data about people of zero height, and predictive models
generally have trouble predicting outputs for inputs far removed from the data
they were trained on.
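To see what the fitted line looks like, we can overlay it on the scatterplot. A small sketch (the axis labels are assumptions based on the units above; abline() accepts an lm model directly):

plot(heights.weights$Height, heights.weights$Weight,
     xlab = 'Height (inches)', ylab = 'Weight (pounds)')
abline(fitted.regression, col = 'red')  # draw the fitted regression line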
To verify how well the model predicts actual values we can use R’s function
predict():
predict(fitted.regression)
We can compare the predictions with the true values as follows:
true.values <- with(heights.weights, Weight)
errors <- true.values - predict(fitted.regression)
(Note the use of with() to operate on a data frame, and the vectorized subtraction.)
The errors computed by the code above are called residuals because they are the
part of the data that’s ‘left over’. You can compute them directly using R’s
function residuals():
residuals(fitted.regression)
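As a quick sanity check (a one-line sketch), the two computations should agree:

all.equal(errors, residuals(fitted.regression))  # should print TRUE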
We can now plot the residuals against the fitted values:
plot(fitted.regression, which = 1)
In order to get a single measure of the quality of the hypothesis, we compute the
cost function we saw in the lectures:
squared.errors <- errors ^ 2
mse <- mean(squared.errors)
J <- mse / 2
QUESTION: what value do you get for J?
Other measures of the quality of a hypothesis are also possible. A popular
alternative to J is the Root Mean Squared Error, or RMSE: the square root of the
mean of the squared errors.
rmse <- sqrt(mse)
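The same value can be computed in one line directly from the model's residuals (a small sketch):

sqrt(mean(residuals(fitted.regression) ^ 2))  # equals rmse above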
Web traffic (Hackers, ch. 5)
The second example of the use of linear regression is predicting the number of
page views for the top 1,000 websites. The data can be found in the file
top_1000_sites.tsv. Notice that this is a tab-separated file (the cells are
separated by tabs instead of commas), so we need to override the default
separator of read.csv():
top.1000.sites <- read.csv('top_1000_sites.tsv',
                           sep = '\t',
                           stringsAsFactors = FALSE)
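Alternatively (an equivalent one-liner), R's read.delim() uses tab as its default separator:

top.1000.sites <- read.delim('top_1000_sites.tsv',
                             stringsAsFactors = FALSE)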
Inspection of the dataset will show that there are nine columns (variables),
among them Rank, Site, Category, UniqueVisitors, Reach, PageViews,
HasAdvertising, and InEnglish. The top site in the list was Facebook.
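A standard way to carry out this inspection is with str() and head():

str(top.1000.sites)   # column names, types, and a preview of the values
head(top.1000.sites)  # first few rows, with Facebook at the top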
We’re going to predict PageViews. To begin with, we check whether
UniqueVisitors could be a good predictor:
plot(top.1000.sites$UniqueVisitors,
     top.1000.sites$PageViews)
EXERCISE: Do the same but using with to avoid repeating ‘top.1000.sites’ many
times.
The resulting plot is completely impenetrable. In such cases it’s often a good idea
to try taking the log of the values we’re trying to analyze:
plot(log(top.1000.sites$UniqueVisitors),
     log(top.1000.sites$PageViews))
The resulting plot looks much more reasonable and it looks like there’s a
potential line to be drawn.
We build a linear regression hypothesis as follows:
lm.fit <- lm(log(PageViews) ~ log(UniqueVisitors),
             data = top.1000.sites)
We can now use summary to study this model:
summary(lm.fit)
The output of summary() should look something like what is shown below. Note
that summary() displays the residuals, the coefficients with their significance
levels, and various measures of error, including the residual standard error.
Call:
lm(formula = log(PageViews) ~ log(UniqueVisitors), data = top.1000.sites)

Residuals:
    Min      1Q  Median      3Q     Max
-2.1825 -0.7986 -0.0741  0.6467  5.1549

Coefficients:
                    Estimate Std. Error t value Pr(>|t|)
(Intercept)         -2.83441    0.75201  -3.769 0.000173 ***
log(UniqueVisitors)  1.33628    0.04568  29.251  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.084 on 998 degrees of freedom
Multiple R-squared: 0.4616, Adjusted R-squared: 0.4611
F-statistic: 855.6 on 1 and 998 DF, p-value: < 2.2e-16
EXERCISE: try to build a model using the other variables as well!
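As a starting point (one possible sketch, not the only sensible model), you could add the HasAdvertising and InEnglish columns as extra predictors:

lm.fit2 <- lm(log(PageViews) ~ HasAdvertising + log(UniqueVisitors) + InEnglish,
              data = top.1000.sites)
summary(lm.fit2)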