R for psychologists

Professor Thom Baguley, Psychology, Nottingham Trent University
Thomas.Baguley@ntu.ac.uk

0. Overview

1. What is R?
2. Why use R?
3. R basics
4. R graphics
5. Linear models in R
6. ANOVA and ANCOVA
7. Writing your own R functions
8. Simulation and bootstrapping

1. What is R?

R is a software environment for statistical analysis

“the lingua franca of computational statistics”
De Leeuw & Mair (2007)

2. Why use R?

It is:
- free
- open source
- cross-platform (Mac, Linux, PC)

It has:
- excellent basic statistics functions
- powerful and versatile graphics
- hundreds of user-contributed packages
- a large community of users

3. R basics

Some common R objects:
- characters (text strings e.g., 'a' or '1')
- numbers (numbers like 2 or 1e+3)
- vectors (1D set of numbers or other elements)
- data frames (like vectors organized in columns)
- matrices (r by c array of numbers)
- lists (objects that contain other objects)

Assignment

Input:
> vector1 <- 6
> vector1

Output:
[1] 6

Arithmetic

> 6 + 6 * 2
[1] 18
> vector1^2
[1] 36

Calling functions

> c(1, 9, 25)
[1]  1  9 25
> log(vector1)
[1] 1.791759
> rnorm(1, 100, 15)
[1] 123.5336
> ?rnorm
> help(rnorm)

Loading data

> dat1 <- read.csv('pride.csv')
> library(foreign)
> h05 <- read.spss('Hayden_2005.sav', to.data.frame = TRUE)

Addressing the contents of an object

> vector1[1]
[1] 6
> days <- h05$days
> days[1:9]
[1] 2864 1460 2921 2921 2921 1460 2921 1460   31

Or combine the steps with h05$days[1:9]

Data frames

> is.data.frame(h05)
[1] TRUE
> dim(h05)
[1] 43  2
> names(h05)
[1] "names" "days"

- this data frame has 43 rows and 2 named columns
- it can be helpful to think of a data frame as a set of named variables (vectors) bound into columns

Matrix objects (matrices)

> cells <- c(3677,56,3924,31)
> cat.names <- list(c("Before", "After"), c("Alive", "Dead"))
> checklist <- matrix(cells, 2, 2, byrow=TRUE, dimnames= cat.names)
> checklist
       Alive Dead
Before  3677   56
After   3924   31

The power of objects

> chisq.test(checklist)

    Pearson's Chi-squared test with Yates' continuity correction

data:  checklist
X-squared = 8.1786, df = 1, p-value = 0.004239

> chisq.test(c(2, 18))

    Chi-squared test for given probabilities

data:  c(2, 18)
X-squared = 12.8, df = 1, p-value = 0.0003466

Calling functions: defaults and named arguments

> chisq.test(checklist, correct=FALSE)

    Pearson's Chi-squared test

data:  checklist
X-squared = 8.8072, df = 1, p-value = 0.003000

- the default (for matrix input) is correct=TRUE
- setting correct=FALSE overrides the default
- because unnamed arguments to functions are matched in order, and correct is not the second argument of chisq.test(), this argument must be named (otherwise naming the arguments is optional)

Exercise 1 (R Basics)

- entering data
- working with objects
- simple statistical functions

4. R graphics

n = 43

> boxplot(h05$days)

> hist(h05$days)
> plot(density(h05$days))

Distribution functions

e.g., Normal distribution

dnorm(x, mean = 0, sd = 1)
pnorm(q, mean = 0, sd = 1, lower.tail = TRUE)
qnorm(p, mean = 0, sd = 1, lower.tail = TRUE)
rnorm(n, mean = 0, sd = 1)

> rdat <- rnorm(30, 100, 15)

> curve(dnorm(x), xlim=c(-4,4), col='purple')
> curve(dt(x, 1), col='red', add=TRUE, lty=3)

Exercise 2 (R Graphics)

- exploratory plots
- plot parameters
- plotting distribution functions
- plotting a serial position curve (optional)

5.
Linear models in R

> expenses <- read.csv('expenses.csv')
> t.test(majority ~ problem, data = expenses)

    Welch Two Sample t-test

data:  majority by problem
t = -3.7477, df = 505.923, p-value = 0.000199
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -2044.457  -638.160
sample estimates:
mean in group 0 mean in group 1
       7092.712        8434.021

Data courtesy of Mark Thompson
http://markreckons.blogspot.com/

> expenses <- read.csv('expenses.csv')
> lm(majority ~ problem, data = expenses)

Call:
lm(formula = majority ~ problem, data = expenses)

Coefficients:
(Intercept)      problem
       7093         1341

> lm.out <- lm(majority ~ problem, data = expenses)
> max(rstandard(lm.out))
[1] 2.629733
> AIC(lm.out)
[1] 12674.85
> lm.out$coefficients
(Intercept)     problem
   7092.712    1341.308
> ?lm

> summary(lm.out)

Call:
lm(formula = majority ~ problem, data = expenses)

Residuals:
    Min      1Q  Median      3Q     Max
-8397.0 -3403.5  -260.9  3118.8 11543.3

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   7092.7      218.9  32.397  < 2e-16 ***
problem       1341.3      357.0   3.758 0.000187 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 4395 on 644 degrees of freedom
Multiple R-squared: 0.02145, Adjusted R-squared: 0.01993
F-statistic: 14.12 on 1 and 644 DF, p-value: 0.0001871

> glm(problem~I(majority/10000), family='binomial', data = expenses)

Call:  glm(formula = problem ~ I(majority/10000), family = "binomial", data = expenses)

Coefficients:
      (Intercept)  I(majority/10000)
          -1.0394             0.6878

Degrees of Freedom: 645 Total (i.e.
Null);  644 Residual
Null Deviance:     855.5
Residual Deviance: 841.6    AIC: 845.6

> lrm.out <- glm(problem ~ I(majority/10000), family='binomial', data = expenses)
> plot(expenses$majority, lrm.out$fitted, col='dark green')

Exercise 3 (Linear models)

- using a formula in a call to a model function
- linear models
- plotting a regression line
- generalized linear models (optional)
- plotting confidence bands (optional)

6. ANOVA

> factor1 <- gl(4,10)
> factor1
[1] 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 3 3 3 3 3 3 3 3 3 3 4 4 4 4 4 4 4 4 4 4
Levels: 1 2 3 4

> library(foreign)
> diag.data <- read.spss("diagram.sav", to.data.frame = T)
> diag.fit <- lm(descript ~ group, data=diag.data)

Regression model (dummy coding)

> diag.fit

Call:
lm(formula = descript ~ group, data = diag.data)

Coefficients:
      (Intercept)       groupPicture  groupFull Diagram     groupSegmented
             17.8                0.5                4.6                9.5

aov()

> aov(diag.fit)
Call:
   aov(formula = diag.fit)

Terms:
                 group Residuals
Sum of Squares   583.7    2440.2
Deg. of Freedom      3        36

anova()

> summary(aov(diag.fit))
> anova(diag.fit)
Analysis of Variance Table

Response: descript
          Df Sum Sq Mean Sq F value  Pr(>F)
group      3  583.7 194.567  2.8704 0.04977 *
Residuals 36 2440.2  67.783
---
Signif.
codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Factorial ANOVA and ANCOVA

> lm(DV ~ factor1 + factor2 + factor1:factor2, data=data2)
> lm(DV ~ factor1 * factor2, data=data2)

> lm(descript ~ group + time, diag.data)
> lm(descript ~ 0 + group + scale(time, scale=F), diag.data)
> diag.ancov <- lm(descript ~ group + scale(time, scale=F), diag.data)

Type I and Type II Sums of squares

> drop1(diag.ancov, test = 'F')
Single term deletions

Model:
descript ~ group + scale(time, scale = F)
                       Df Sum of Sq    RSS    AIC F value    Pr(F)
<none>                              2154.4 169.46
group                   3    821.45 2975.9 176.38  4.4484 0.009478 **
scale(time, scale = F)  1    285.78 2440.2 172.44  4.6427 0.038149 *

- sequential sums of squares (Type I) tests are given by default (e.g., by anova())
- hierarchical sums of squares (Type II) tests are given by the drop1() function

[Software such as SAS and SPSS defaults to unique (Type III) SS]

Repeated measures (within-subjects) factors

- a weakness in the base R installation, but easily done using packages such as ez, nlme or lme4 (the latter two can fit a wider range of repeated measures models)

> pride.long <- read.csv("pride_long.csv")
> pride.mixed <- aov(accuracy ~ emotion*condition + Error(factor(participant)/emotion), pride.long)
> summary(pride.mixed)

Multiple comparisons and contrasts

- you can run contrasts by changing the default contrasts in R
- Fisher LSD, Tukey’s HSD and p adjustment (e.g., Hochberg, Holm, Bonferroni, FDR) in the R base installation
- powerful modified Bonferroni (e.g., Shaffer, Westfall) and general linear contrasts available in the multcomp package
- flexible contrast and estimable functions in the gmodels package

Exercise 4 (ANOVA)

- factors
- aov() and anova()
- ANCOVA
- repeated measures (within-subjects)
- contrasts and multiple comparisons (optional)

7.
Writing your own functions

sd.pool <- function(sd.1, n.1, sd.2, n.2 = n.1){
  num <- (n.1-1)*sd.1^2+(n.2-1)*sd.2^2
  denom <- n.1+n.2-2
  output <- sqrt(num/denom)
  output
}

> sd.pool(6.1, 20, 5.3, 25)
[1] 5.66743

Exercise 5 (Write a function)

- pick a simple statistical procedure
- write a function to automate it e.g., pooled standard deviation

8. Simulation and bootstrapping

- distribution functions
> rnorm(100, 10, 1)
- R boot package
- sample() and replicate()

Bootstrap CIs for a median or trimmed mean

The bootstrap (and other Monte Carlo methods)

Bootstrapping involves repeatedly resampling with replacement from a data set

e.g., Data set = {0,1}

7 simulated samples:
> x <- c(0, 1)
> replicate(7, sample(x,2,replace=TRUE))

{1,0}, {0,1}, {0,0}, {0,0}, {0,1}, {0,0}, {1,1}

Bootstrapping effectively treats the sample distribution as the population distribution

Bootstrapping using R

> library(boot)
> x1.boot <- boot(samp,function(x,i) median(x[i]),R=10000)
> boot.ci(x1.boot)
BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
Based on 10000 bootstrap replicates

CALL :
boot.ci(boot.out = x1.boot)

Intervals :
Level      Normal              Basic
95%   (-0.1747,  0.6948 )   (-0.1648,  0.6614 )

Level     Percentile            BCa
95%   (-0.1538,  0.6724 )   (-0.2464,  0.6620 )
Calculations and Intervals on Original Scale

Normal 95% CI for mean (using t) [-2.00, 1.33]

Exercise 6 (Bootstrapping)

- resample data
- construct a simple percentile bootstrap
- run a BCa bootstrap using the boot package (optional)

9.
Advanced statistical modelling in R

Multilevel modeling
- nlme package
- lme4 package
- MCMCglmm package

Meta-analysis
- meta package
- metafor package

… and many, many other packages

metafor package

> m.e <- c(10.7, 16.24, 10.03, 3.65, 5.73)
> n.e <- c(31, 57, 9, 17, 10)
> m.c <- c(2.87, 6.83, 7.18, 4.65, 7.47)
> n.c <- c(14, 52, 9, 18, 10)
> sd.pooled <- c(7.88, 9.84, 8.67, 6.34, 5.74)
> diff <- m.c - m.e
> se.diffs <- sd.pooled * sqrt(1/n.e + 1/n.c)

> install.packages('metafor')
> library(metafor)
> meta.out <- rma(yi=diff, sei=se.diffs, method = 'FE')
> meta.out

Fixed-Effects Model (k = 5)

Test for Heterogeneity:
Q(df = 4) = 21.0062, p-val = 0.0003

Model Results:

estimate      se     zval    pval    ci.lb    ci.ub
 -4.1003  1.0750  -3.8141  0.0001  -6.2073  -1.9933  ***

---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

> funnel(trimfill(meta.out))

> forest(rma(yi=diff, sei=se.diffs))
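
Checking the fixed-effect result by hand

The fixed-effect estimate from rma() with method = 'FE' can be checked in base R using standard inverse-variance weighting. This is a minimal sketch of that textbook calculation, not a description of how metafor is implemented. The data are copied from the metafor slide; the names w, est, se.est, z, ci, Q and p.Q are illustrative and do not come from the original slides.

```r
# Data copied from the metafor example
m.e <- c(10.7, 16.24, 10.03, 3.65, 5.73)
n.e <- c(31, 57, 9, 17, 10)
m.c <- c(2.87, 6.83, 7.18, 4.65, 7.47)
n.c <- c(14, 52, 9, 18, 10)
sd.pooled <- c(7.88, 9.84, 8.67, 6.34, 5.74)
diff <- m.c - m.e
se.diffs <- sd.pooled * sqrt(1/n.e + 1/n.c)

# Inverse-variance weights: more precise studies count for more
w <- 1 / se.diffs^2

# Weighted mean difference and its standard error
est <- sum(w * diff) / sum(w)
se.est <- sqrt(1 / sum(w))

# z test and 95% confidence interval for the pooled estimate
z <- est / se.est
ci <- est + c(-1, 1) * qnorm(0.975) * se.est

# Cochran's Q test for heterogeneity (df = number of studies - 1)
Q <- sum(w * (diff - est)^2)
p.Q <- pchisq(Q, df = length(diff) - 1, lower.tail = FALSE)

round(c(estimate = est, se = se.est, zval = z,
        ci.lb = ci[1], ci.ub = ci[2], Q = Q, p.Q = p.Q), 4)
```

The rounded values should agree with the rma() output for meta.out, up to rounding.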