Computational Methods for Data Analysis – 2014/15
Lab 3: Linear Regression and Logistic Regression for Data Analysis
In the Cognitive Sciences, linear and logistic regression are used intensively for
data analysis. In this lab we explore this use. Along the way, we'll also take a
further look at how statistical tests are carried out in R.
The lab follows rather closely the presentation in chapters 4.3, 6.1, 6.2 and 6.3 of
Baayen’s book.
Comparing paired vectors
(Baayen ch. 4.3, p. 82-84)
In statistical courses, linear regression is generally introduced as a more
informative method to compare paired vectors. Consider again the ratings
dataset seen in Lab 1 (Baayen ch. 1) containing ratings about 81 nouns denoting
animals and plants, and two of these ratings: weight and size. One way to
compare these ratings is to ask whether their means are the same. This can be
tested using the t-test:
>t.test(ratings$meanWeightRating, ratings$meanSizeRating)
The output of the test is as follows, indicating marginal significance:
Welch Two Sample t-test
data: ratings$meanWeightRating and ratings$meanSizeRating
t = -2.1421, df = 159.092, p-value = 0.0337
alternative hypothesis: true difference in means is not
equal to 0
95 percent confidence interval:
-0.64964319 -0.02637656
sample estimates:
mean of x mean of y
  2.57037   2.90838
But because these ratings are about the same set of 81 nouns, we can use the
more powerful paired version of the t-test:
> t.test(ratings$meanWeightRating, ratings$meanSizeRating, paired=T)
Paired t-test
data: ratings$meanWeightRating and ratings$meanSizeRating
t = -36.0408, df = 80, p-value < 2.2e-16
alternative hypothesis: true difference in means is not
equal to 0
95 percent confidence interval:
-0.3566737 -0.3193460
sample estimates:
mean of the differences
-0.3380099
This gives us much stronger evidence that the two means are in fact different.
Much of data analysis is about avoiding simple-minded applications of such tests.
For instance, an issue that is sometimes forgotten is that these tests are based on
the assumption that weight ratings and size ratings are normally distributed. To
be sure of our results we better check if this assumption is justified. In the first
instance, this can be done graphically, with a boxplot:
> par(mfrow=c(1,2))
> boxplot(ratings$meanWeightRating, ratings$meanSizeRating, names=c("weight", "size"), ylab = "mean rating")
The boxplot indicates a certain degree of asymmetry, but this is pretty mild; in
particular, the vector of differences does not show much asymmetry:
> boxplot(ratings$meanWeightRating - ratings$meanSizeRating, names="difference", ylab = "mean rating difference")
A stronger test is given by normality tests such as the Shapiro-Wilk test. Here
the p-value falls below 0.05, suggesting that the vector of differences in fact
deviates somewhat from normality:
> shapiro.test(ratings$meanWeightRating-ratings$meanSizeRating)
Shapiro-Wilk normality test
data: ratings$meanWeightRating - ratings$meanSizeRating
W = 0.9644, p-value = 0.02374
In case of such doubts about normality, we can use the Wilcoxon signed-rank test
with paired=T instead of the t-test:
> wilcox.test(ratings$meanWeightRating, ratings$meanSizeRating, paired = T)
Linear regression for paired vectors analysis
What some of you may have already known before this course is that linear
regression provides an even more systematic comparison: instead of comparing
just the means, we compare the entire distributions. A scatterplot of the two
variables, produced as follows:
> plot(ratings$meanWeightRating, ratings$meanSizeRating, xlab = "mean weight rating", ylab = "mean size rating")
indicates that the data points pattern along a straight line. This can be
highlighted by plotting a line through the scatterplot using abline().
abline() takes two parameters, an intercept and a slope, and adds the line
specified by these parameters to an existing plot:
> plot(c(-4,4), c(-4,4))
> abline(2,-2)
The slope and intercept of the line through the scatterplot of weight ratings and
size ratings can be computed using lm():
> ratings.lm = lm(meanSizeRating ~ meanWeightRating, data = ratings)
We saw in a previous lab that ratings.lm is a linear model. The intercept and
slope specified by this model can be seen by typing ratings.lm at the prompt:
>ratings.lm
Call:
lm(formula = meanSizeRating ~ meanWeightRating, data =
ratings)
Coefficients:
     (Intercept)  meanWeightRating
          0.5270            0.9265
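As a side note, the intercept and slope can also be extracted as a numeric vector
with coef(); we will use the same function further below when adding regression
lines for subsets of the data:
> coef(ratings.lm)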
R makes it very easy to add the regression line to a scatterplot: we just need to
feed the linear model to abline():
> abline(ratings.lm)
The degree to which the data points cluster around the regression line is called
correlation, and is quantified by a correlation coefficient. The correlation
coefficient of a sample, indicated with r, takes values between +1 (perfect
correlation) and -1 (perfect negative correlation). r can be seen as a measure of
how useful it is to fit a regression line through the data. The function
mvrnormplot.fnc() from the languageR package can be used to visualize the fit
between a regression line and data. It takes as input a value of r. Try
increasing values of r:
> mvrnormplot.fnc(r = -0.067)
> mvrnormplot.fnc(r = 0.2)
> mvrnormplot.fnc(r = 0.61)
> mvrnormplot.fnc(r = 0.888)
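For the actual weight and size ratings, the correlation coefficient itself,
together with a test of whether it differs from 0, can be obtained with
cor.test(); this is a side remark, as the lab only needs the visual intuition:
> cor.test(ratings$meanWeightRating, ratings$meanSizeRating)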
The measure used to indicate how much of the scatter is accounted for by the
regression line is r², typically written R². Going back to the weight/size
example, we can use summary() to get the information about R²:
> summary(ratings.lm)
Let's look in detail at the output of this command. summary() shows us:
- the residuals (see lectures);
- a table with the coefficients of the model. This specifies, predictor by
predictor, the extent to which that predictor gives us information about
the dependent variable. If the coefficient is 0, there is no relation at all.
The p-value tells us whether the coefficient is significantly different from
0;
- the residual standard error;
- R², which quantifies on a scale from 0 to 1 the proportion of the variance
that the model explains. We get r by taking the square root of R².
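If you want these quantities programmatically rather than reading them off the
printed summary (a convenience, not something the lab requires), they can be
extracted from the summary object:
> summary(ratings.lm)$coefficients     # coefficient table with p-values
> summary(ratings.lm)$r.squared        # R-squared
> sqrt(summary(ratings.lm)$r.squared)  # r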
Limits of linear regression
One problem with linear regression is the inordinate effect of outliers. Consider
the plot comparing the frequencies of plural and singular words in ratings,
produced as follows:
> plot(ratings$FreqSingular, ratings$FreqPlural)
and now add a regression line:
> abline(lm(FreqPlural ~ FreqSingular, data=ratings), lty = 1)
you’ll see that the regression line is greatly affected by the four points with
singular frequency greater than 500. If we refit the model excluding these four
points and add the resulting line (dashed):
> abline(lm(FreqPlural ~ FreqSingular, data = ratings[ratings$FreqSingular < 500, ]), lty = 2)
we get a very different picture.
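To see which items are responsible (a quick check, assuming the ratings data
frame still has the Word column used in Lab 1), we can subset the rows with high
singular frequency:
> ratings[ratings$FreqSingular > 500, c("Word", "FreqSingular", "FreqPlural")]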
The second problem is exemplified by the following plot:
> with(ratings, plot(meanFamiliarity, meanSizeRating))
If we add the regression line:
> ratings.lm = lm(meanSizeRating ~ meanFamiliarity, data = ratings)
> abline(ratings.lm)
we see that the regression line is not very informative.
Part of the problem is that there seem to be two groups of items. If we fit them
separately, the results are better. We can do this by creating separate data frames:
> plants = ratings[ratings$Class == "plant", ]
> animals = ratings[ratings$Class == "animal", ]
Then adding the points separately, first for plants:
> points(plants$meanFamiliarity, plants$meanSizeRating, pch = "p", col = "darkgrey")
> lines(lowess(plants$meanFamiliarity, plants$meanSizeRating), col = "darkgrey")
and then for animals:
> points(animals$meanFamiliarity, animals$meanSizeRating, pch = "a")
> lines(lowess(animals$meanFamiliarity, animals$meanSizeRating))
Now we can fit separate models:
> plants.lm = lm(meanSizeRating ~ meanFamiliarity, plants)
> abline(coef(plants.lm), col = "darkgrey", lty = 2)
> animals.lm = lm(meanSizeRating ~ meanFamiliarity, animals)
> abline(coef(animals.lm), lty = 2)
But we also need non-linearity: the lowess curve for the plants suggests a curved
relation, which we can capture by adding a quadratic term with I():
> plants.lm = lm(meanSizeRating ~ meanFamiliarity + I(meanFamiliarity^2), data = plants)
> summary(plants.lm)$coef
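To check whether the quadratic term actually improves the fit (a quick sanity
check, with plants.lin just an illustrative name for the purely linear model), we
can compare the two models with anova():
> plants.lin = lm(meanSizeRating ~ meanFamiliarity, data = plants)
> anova(plants.lin, plants.lm)   # F-test for the added quadratic term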
Logistic regression
(Baayen ch. 6.3)
For classification, we can use Generalized Linear Models, implemented in R by the
function glm().
Consider the visual lexical decision dataset english, and let's look at the
variable CorrectLexdec, which records for each word the number of correct
lexical decision responses (out of 30, as reflected in the code below).
We can build a logistic model for this prediction task by giving glm() a
two-column matrix of successes and failures and specifying family = "binomial":
> english.glm = glm(cbind(english$CorrectLexdec, 30 - english$CorrectLexdec) ~
Voice + MeanBigramFrequency + LengthInLetters + Ncount +
WordCategory + poly(WrittenFrequency, 2) +
poly(WrittenSpokenFrequencyRatio, 2) + poly(FamilySize, 2) +
InflectionalEntropy + NumberComplexSynsets + AgeSubject,
english, family = "binomial")
Let us now inspect the output of summary():
> summary(english.glm)
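The printed summary for this model is long; if you only want the coefficient
table (a convenience, not something the lab asks for), it can be extracted and
rounded directly:
> round(summary(english.glm)$coefficients, 3)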
More linear regression
(Baayen ch. 6.1 and 6.2)