Introduction

Advanced Research Skills
Lecture 1
Introduction
Olivier MISSA, om502@york.ac.uk
Aims
Introduce the use of R for advanced statistical analyses
beyond "Statistics for Ecologists".
Demonstrate these analyses on a broad range of
questions and situations.
Develop your understanding of statistical programming.
Empower you to tackle future analytical challenges on
your own.
Other skills will be developed too.
Produce posters using CorelDraw (graphics package).
Learn how to write a grant proposal.
Learning Outcomes
At the end of the module, you should be able to :
Determine which test to use for significance testing.
Explore the inherent structure of your data
through a wide range of multivariate techniques.
Work out which model "best explains"
the variable you are interested in.
Produce high-quality graphs (ready for publication)
using the full range of R's graphical capabilities.
Organisation
Staff
Olivier Missa (OM), module organiser, R sessions
om502@york.ac.uk
Emma Rand (ER), R sessions
er13@york.ac.uk
Phil Roberts (PTR), CorelDraw session
ptr2@york.ac.uk
Peter Mayhew (PJM), Grant writing session
pjm19@york.ac.uk
Structure
9 theoretical lectures (OM) on advanced stats.
9 practical sessions (OM & ER) on using R.
1 practical session (PTR) on CorelDraw.
1 tutorial session (PJM) on Grant writing.
Content
L1: Introduction
L2 – L4: Linear Models
L5 – L6: GLMs & Mixed-effects models
L7: Non-Linear Models
L8 – L9: Multivariate Analyses
Each lecture is accompanied by a practical session
Assessment
Open data analysis exercise.
Written report with Introduction, Material & Methods, Results and Discussion.
Particular emphasis on justifying the analyses
and interpreting the results properly.
What is R ?
"R is a language and environment
for statistical computing and graphics"
R website
A programming language, actually a dialect of S, which was
developed from the mid-1970s by John Chambers and colleagues at Bell Labs.
Bell Labs then sold S to MathSoft (now Insightful Co.),
which developed it further into S-Plus,
a commercial statistical package.
In the 90s, the S language was reimplemented from scratch as R by two statisticians,
Ross Ihaka & Robert Gentleman, at the University of Auckland, New Zealand.
Since then R has continued to grow in scale and scope and is
currently maintained by a core team of about 20 people across the globe.
Why use R ?
The Key Benefits :
It is:
Free: it won't cost you a penny, ever.
Open: how things are calculated is not hidden.
Fully customisable: the user is in full control.
Cutting edge: stats pros use it to create new techniques.
Very widespread (increasingly so): thousands of contributors (packages), millions of users.
Supported by an international user community, happy to provide help and assistance.
The Drawback :
Steep Learning Curve
You need to learn the language
You need to know what you are doing (stats)
What is R Good for ?
Absolutely everything (to do with data)
Statistics
Modelling
Programming / Simulations
Graphics (from very simple to complex, 2D, 3D, ...)
Database (simple relational functions)
Bioinformatics (Bioconductor project)
Platform for interacting with other software
(e.g. GGobi, WinBUGS, MySQL, GRASS GIS)
Example of a session
> data(volcano)
> dim(volcano)
[1] 87 61
> volcano
      [,1] [,2] [,3] [,4] [,5] [,6] [,7] . . . [,61]
 [1,]  100  100  101  101  101  101  101 . . .   103
 [2,]  101  101  102  102  102  102  102 . . .   104
 . . . . . . . . . . . . . . . . . . . . . . . . . .
[87,]   97   97   97   98   98   99   99 . . .    94
> volcano[1:3,1:3]
     [,1] [,2] [,3]
[1,]  100  100  101
[2,]  101  101  102
[3,]  102  102  103
> range(volcano)
[1] 94 195
> mean(volcano)
[1] 130.1879
> sd(volcano)
[1] 6.902227 7.565538 8.203669 8.735686 . . .
[8] 11.165554 11.735217 12.733854 13.668694 . . .
. . .
> ?sd ## help('sd') does the same
> sd
function (x, na.rm = FALSE)
{
    if (is.matrix(x))
        apply(x, 2, sd, na.rm = na.rm)
    else if (is.vector(x))
        sqrt(var(x, na.rm = na.rm))
    else if (is.data.frame(x))
        sapply(x, sd, na.rm = na.rm)
    else sqrt(var(as.vector(x), na.rm = na.rm))
} . . .
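The per-column output of sd(volcano) shown above comes from the older R version used in this session; in recent R releases, sd() applied to a matrix treats it as one long vector and returns a single value. A minimal sketch that makes both calculations explicit, whatever the version (the names col.sd and all.sd are illustrative, not from the lecture):

```r
data(volcano)

## One standard deviation per column, requested explicitly
col.sd <- apply(volcano, 2, sd)
length(col.sd)                 # one value per column of the matrix

## One standard deviation for all elevations pooled together
all.sd <- sd(as.vector(volcano))
all.sd
```

Spelling out apply() or as.vector() avoids depending on how a given R version handles a matrix argument.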
> sd(as.vector(volcano))
[1] 25.83233
> summary(as.vector(volcano))
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
   94.0   108.0   124.0   130.2   150.0   195.0
> volcano.v <- as.vector(volcano)
> dim(volcano.v)
NULL
> length(volcano.v)
[1] 5307
> 61*87
[1] 5307
> volcano.v[1:87] == volcano[,1]
 [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE . . .
. . . . . . . . . . . . . . . . . . . . . .
[87] TRUE
> volcano.v[1:61] == volcano[1,]
. . . only three values (out of 61) show "TRUE"
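The two comparisons above differ because as.vector() unrolls a matrix column by column. The sketch below illustrates this on a small matrix and then recovers row 1 of volcano by unrolling the transposed matrix instead (m and volcano.rows are illustrative names, not from the lecture):

```r
data(volcano)
volcano.v <- as.vector(volcano)

## as.vector() unrolls a matrix column by column
m <- matrix(1:6, nrow = 2)
as.vector(m)                      # 1 2 3 4 5 6
as.vector(t(m))                   # 1 3 5 2 4 6 (row by row)

## The first 87 values are column 1, so every comparison is TRUE
all(volcano.v[1:87] == volcano[, 1])

## To recover row 1, unroll the transposed matrix instead
volcano.rows <- as.vector(t(volcano))
all(volcano.rows[1:61] == volcano[1, ])
```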
> plot(volcano)
Not useful: it only shows that elevations
in columns 1 and 2 tend to be correlated.
> plot(volcano)
> plot(volcano.v, pch=20)
> hist(volcano, prob=TRUE,
+      xlab="volcano elevation (m)")
> x <- seq(90,200,1)
> curve(dnorm(x, mean=mean(volcano.v),
+       sd=sd(volcano.v)), add=TRUE)
> shapiro.test(volcano.v)
Error in shapiro.test(volcano.v) :
sample size must be between 3 and 5000
> smpl <- sample(volcano.v, 5000)
> shapiro.test(smpl)
Shapiro-Wilk normality test
data: smpl
W = 0.9358, p-value < 2.2e-16
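Because sample() draws at random, the W statistic above will differ slightly from one run to the next. A minimal sketch of a reproducible version, assuming an arbitrary seed (42 is chosen purely for illustration; res is an illustrative name):

```r
data(volcano)
volcano.v <- as.vector(volcano)

set.seed(42)                        # arbitrary seed, for illustration only
smpl <- sample(volcano.v, 5000)     # 5000 values drawn without replacement
res  <- shapiro.test(smpl)

res$statistic                       # the W statistic
res$p.value                         # the p-value as a plain number
```

Storing the result in an object gives access to the statistic and p-value as numbers, which is convenient when reporting them.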
> library(nortest) ## Package of Normality tests
> ad.test(volcano) ## Anderson-Darling
Anderson-Darling normality test
data: volcano
A = 106.2715, p-value < 2.2e-16
> cvm.test(volcano) ## Cramer-von Mises
> lillie.test(volcano) ## Lilliefors
> pearson.test(volcano) ## Pearson (Chi2)
> sf.test(smpl) ## Shapiro-Francia
> qqnorm(volcano.v)
> qqline(volcano.v, col="red")
> x <- 10*(1:nrow(volcano)) ## 10, 20, ..., 870
> y <- 10*(1:ncol(volcano)) ## 10, 20, ..., 610
> image(x, y, volcano)
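image() expects x to match the rows of the matrix and y its columns. A quick check of the coordinate vectors before plotting, as a sketch:

```r
data(volcano)
x <- 10 * (1:nrow(volcano))   # one coordinate per row
y <- 10 * (1:ncol(volcano))   # one coordinate per column

range(x)                      # 10 870
range(y)                      # 10 610

## Stops with an error if the coordinates do not match the matrix
stopifnot(length(x) == nrow(volcano),
          length(y) == ncol(volcano))
```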
> image(x, y, volcano, asp=1)
> image(x, y, volcano, asp=1,
+       col = terrain.colors(100),
+       axes = FALSE)
> contour(x, y, volcano,
+         levels = seq(90, 200, by=5),
+         add = TRUE, col = "peru")
> image(x, y, volcano, asp=1,
+       col = terrain.colors(100),
+       axes = FALSE)
> contour(x, y, volcano,
+         levels = seq(90, 200, by=10),
+         add = TRUE, col = "peru")
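To keep a map like this for a report, the same calls can be sent to a file instead of the screen. A minimal sketch using the pdf() device; the file name and the 7 x 5 inch size are assumptions made for illustration:

```r
data(volcano)
x <- 10 * (1:nrow(volcano))
y <- 10 * (1:ncol(volcano))

out.file <- file.path(tempdir(), "volcano_map.pdf")  # hypothetical file name
pdf(out.file, width = 7, height = 5)                 # size in inches
image(x, y, volcano, asp = 1,
      col = terrain.colors(100), axes = FALSE)
contour(x, y, volcano,
        levels = seq(90, 200, by = 10),
        add = TRUE, col = "peru")
dev.off()                                            # close the device to finish the file
```

Nothing is written completely until dev.off() is called, so it should always close the sequence.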
Gallery of other Volcano Graphs
image + contour
persp
persp with shading
surface3d
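As one possible way to obtain a shaded perspective view like those in the gallery, here is a sketch with persp(); the viewing angles, colour and shading values are arbitrary choices, not the settings used for the slides:

```r
data(volcano)
x <- 10 * (1:nrow(volcano))
y <- 10 * (1:ncol(volcano))

## theta and phi set the viewing angles; shade adds lighting to the surface
vt <- persp(x, y, volcano,
            theta = 135, phi = 30,
            scale = FALSE,            # x, y and z are not rescaled separately
            col = "green3", border = NA, shade = 0.5,
            xlab = "x (m)", ylab = "y (m)", zlab = "elevation (m)")
```

persp() invisibly returns the viewing transformation matrix (stored here in vt), which can be used to add points or lines to the plot afterwards.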
More Classical Graphs
Histogram + Theoretical curve
Boxplot
Stripchart
Barplot
Pie chart
3D models
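Several of these classical graphs can be sketched with the volcano data already loaded. The 20 m elevation classes and the 2 x 2 layout below are arbitrary choices made for illustration:

```r
data(volcano)
volcano.v <- as.vector(volcano)

## Elevation classes 20 m wide (the class width is an arbitrary choice)
classes <- cut(volcano.v, breaks = seq(80, 200, by = 20))
counts  <- table(classes)

op <- par(mfrow = c(2, 2))                 # 2 x 2 grid of panels
boxplot(volcano.v, horizontal = TRUE, xlab = "elevation (m)")
stripchart(volcano[, 1], method = "jitter",
           xlab = "elevation (m), column 1")
barplot(counts, las = 2, ylab = "number of grid cells")
pie(counts)
par(op)                                    # restore the previous layout
```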