Advanced Research Skills
Lecture 1: Introduction
Olivier MISSA, om502@york.ac.uk

Aims
  Introduce the use of R for advanced statistical analyses beyond "Statistics for Ecologists".
  Demonstrate these analyses on a broad range of questions and situations.
  Develop your understanding of statistical programming.
  Empower you to tackle future analytical challenges on your own.

  Other skills will be developed too:
  Produce posters using CorelDraw (graphics package).
  Learn how to write a grant proposal.

Learning Outcomes
At the end of the module, you should be able to:
  Determine which test to use for significance testing.
  Explore the inherent structure of your data through a wide range of multivariate techniques.
  Work out which model "best explains" the variable you are interested in.
  Produce high-quality, publication-ready graphs making full use of R's graphical capabilities.

Organisation

Staff
  Olivier Missa (OM), module organiser, R sessions, om502@york.ac.uk
  Emma Rand (ER), R sessions, er13@york.ac.uk
  Phil Roberts (PTR), CorelDraw session, ptr2@york.ac.uk
  Peter Mayhew (PJM), Grant writing session, pjm19@york.ac.uk

Structure
  9 theoretical lectures (OM) on advanced stats.
  9 practical sessions (OM & ER) on using R.
  1 practical session (PTR) on CorelDraw.
  1 tutorial session (PJM) on Grant writing.

Content
  L1       Introduction
  L2 – L4  Linear Models
  L5 – L6  GLMs & Mixed-effects models
  L7       Non-Linear Models
  L8 – L9  Multivariate Analyses
  Each lecture is accompanied by a practical session.

Assessment
  Open Data Analysis exercise.
  Written report with Introduction, Material & Methods, Results, Discussion.
  Particular emphasis on justifying the analyses and interpreting the results properly.

What is R ?
"R is a language and environment for statistical computing and graphics" (R website)
A programming language, actually a dialect of S, which was developed from the mid-1970s onwards by John Chambers and colleagues at Bell Labs.
Bell Labs then sold S to MathSoft (now Insightful Co.), which developed it further into S-Plus, a commercial statistical package.
In the 90s, the S language was re-implemented from scratch, as R, by two statisticians, Ross Ihaka & Rob Gentleman, in New Zealand.
Since then R has continued to grow in scale and scope and is currently maintained by about 20 people across the globe.

Why use R ?
The Key Benefits: it's
  Free                It won't cost you a penny, ever
  Open                How things are calculated is not hidden
  Fully customisable  The user is in full control
  Cutting Edge        Stats pros use it to create new techniques
  Very Widespread (increasingly so)
                      Thousands of contributors (packages), millions of users
                      Supported by an international user community happy to provide help and assistance
The Drawback:
  Steep Learning Curve
                      You need to learn the language
                      You need to know what you are doing (stats)

What is R Good for ?
Absolutely everything (to do with data)
  Statistics
  Modelling
  Programming / Simulations
  Graphics (from very simple to complex, 2D, 3D, ...)
  Database (simple relational functions)
  Bioinformatics (Bioconductor project)
  Platform interacting with other software (e.g. GGobi, WinBUGS, MySQL, GRASS GIS)

Example of a session
> data(volcano)
> dim(volcano)
[1] 87 61
> volcano
      [,1] [,2] [,3] [,4] [,5] [,6] [,7] . . . [,61]
 [1,]  100  100  101  101  101  101  101 . . .   103
 [2,]  101  101  102  102  102  102  102 . . .   104
 . . .
[87,]   97   97   97   98   98   99   99 . . .    94
> volcano[1:3,1:3]
     [,1] [,2] [,3]
[1,]  100  100  101
[2,]  101  101  102
[3,]  102  102  103

> range(volcano)
[1]  94 195
> mean(volcano)
[1] 130.1879
> sd(volcano)    ## here: one standard deviation per column
 [1]  6.902227  7.565538  8.203669  8.735686 . . .
 [8] 11.165554 11.735217 12.733854 13.668694 . . .
 . . .
> ?sd    ## help('sd') does the same
> sd
function (x, na.rm = FALSE)
{
    if (is.matrix(x))
        apply(x, 2, sd, na.rm = na.rm)
    else if (is.vector(x))
        sqrt(var(x, na.rm = na.rm))
    else if (is.data.frame(x))
        sapply(x, sd, na.rm = na.rm)
    else sqrt(var(as.vector(x), na.rm = na.rm))
}
. . .

> sd(as.vector(volcano))
[1] 25.83233
> summary(as.vector(volcano))
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
   94.0   108.0   124.0   130.2   150.0   195.0
> volcano.v <- as.vector(volcano)
> dim(volcano.v)
NULL
> length(volcano.v)
[1] 5307
> 61*87
[1] 5307
> volcano.v[1:87] == volcano[,1]
 [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE . . .
 . . .
[87] TRUE
> volcano.v[1:61] == volcano[1,]
 . . .    ## only three values (out of 61) show "TRUE"

> plot(volcano)    ## not useful: only shows that the elevations in columns 1 and 2 tend to be correlated
> plot(volcano.v, pch=20)
> hist(volcano, prob=TRUE,
+      xlab="volcano elevation (m)")
> x <- seq(90,200,1)
> curve(dnorm(x, mean=mean(volcano.v), sd=sd(volcano.v)),
+       add=TRUE)

> shapiro.test(volcano.v)
Error in shapiro.test(volcano.v) :
  sample size must be between 3 and 5000
> smpl <- sample(volcano.v, 5000)
> shapiro.test(smpl)

        Shapiro-Wilk normality test

data:  smpl
W = 0.9358, p-value < 2.2e-16

> library(nortest)         ## Package of Normality tests
> ad.test(volcano)         ## Anderson-Darling

        Anderson-Darling normality test

data:  volcano
A = 106.2715, p-value < 2.2e-16

> cvm.test(volcano)        ## Cramer-von Mises
> lillie.test(volcano)     ## Lilliefors
> pearson.test(volcano)    ## Pearson (Chi2)
> sf.test(smpl)            ## Shapiro-Francia

> qqnorm(volcano.v)
> qqline(volcano.v, col="red")

> x <- 10*(1:nrow(volcano))    ## 10, 20, ..., 870
> y <- 10*(1:ncol(volcano))    ## 10, 20, ..., 610
> image(x, y, volcano)
> image(x, y, volcano, asp=1)
> image(x, y, volcano, asp=1,
+       col = terrain.colors(100),
+       axes = FALSE)
> contour(x, y, volcano,
+         levels = seq(90, 200, by=5),
+         add = TRUE, col = "peru")
> image(x, y, volcano, asp=1,
+       col = terrain.colors(100),
+       axes = FALSE)
> contour(x, y, volcano,
+         levels = seq(90, 200, by=10),
+         add = TRUE, col = "peru")

Gallery of other Volcano Graphs
  image + contour
  persp
  persp with shading
  surface3d

More Classical Graphs
  Histogram + Theoretical curve
  Boxplot
  Stripchart
  Barplot
  Pie chart
  3D models
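
Checking the volcano session without plots
Several details of the volcano session can be verified numerically. The sketch below is not part of the lecture: the names first_col_ok, first_row_ok, col_sd, all_sd and sw, and the set.seed() call, are assumptions added for illustration. It checks that as.vector() unrolls a matrix column by column, computes the column-wise and overall standard deviations in a way that should not depend on how sd() treats a matrix in your version of R, confirms the axis lengths that image() expects, and draws a reproducible sample small enough for shapiro.test().

```r
# Illustrative sketch only; helper names are hypothetical, not from the lecture.
data(volcano)

# as.vector() unrolls a matrix column by column (column-major order),
# so the first 87 values are column 1, not row 1.
volcano.v    <- as.vector(volcano)
first_col_ok <- all(volcano.v[1:nrow(volcano)] == volcano[, 1])
first_row_ok <- all(volcano.v[1:ncol(volcano)] == volcano[1, ])

# Standard deviations computed explicitly, whatever sd() does with a matrix.
col_sd <- apply(volcano, 2, sd)   # one value per column
all_sd <- sd(volcano.v)           # one value for the whole data set

# image(x, y, z) needs length(x) == nrow(z) and length(y) == ncol(z).
x <- 10 * seq_len(nrow(volcano))  # 10, 20, ..., 870
y <- 10 * seq_len(ncol(volcano))  # 10, 20, ..., 610

# shapiro.test() accepts at most 5000 values; set.seed() (an assumption,
# not used in the lecture) makes the random subsample reproducible.
set.seed(1)
smpl <- sample(volcano.v, 5000)
sw   <- shapiro.test(smpl)

print(c(first_col_ok = first_col_ok, first_row_ok = first_row_ok))
print(all_sd)
```

Because a different subsample is drawn for each seed, the W statistic will differ slightly from the value shown in the session, but the conclusion (strong departure from normality) should not.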