Section 1

advertisement
第九届生物多样性会议(2010-11-4, 福建, 厦门)
An brief introduction to R , Package and
Graphics used in biodiversity research
Kechang NIU (牛克昌)
Department of Ecology , Peking University
Outline
What this is
Section 1 >-A short, highly incomplete tour around some
of the basic concepts of R as a programming language
Section 2 >-Some hints on how to obtain documentation
on the many library functions (packages)
Section 3>-A brief introduction of R graphics
Section 1
R, S and S-plus
S: an interactive environment for data analysis
developed at Bell Laboratories since 1976
1988 - S2: RA Becker, JM Chambers, A Wilks
1992 - S3: JM Chambers, TJ Hastie
1998 - S4: JM Chambers
Exclusively licensed by AT&T/Lucent to Insightful
Corporation, Seattle WA. Product name: “S-plus”.
Implementation languages C, Fortran.
See:
http://cm.bell-labs.com/cm/ms/departments/sia/S/history.html
Section 1
R, S and S-plus
R: initially written by Ross Ihaka and Robert Gentleman
at Dep. of Statistics of U of Auckland, New Zealand
during 1990s.
Since 1997: international “R-core” team of ca. 15 people
with access to common CVS archive.
GNU General Public License (GPL)
- can be used by anyone for any purpose
- contagious
Open Source
-quality control!
-efficient bug tracking and fixing system supported by
the user community
Section 1
What R is
What R is not
o data handling and
storage: numeric, textual
o is not a database, but
connects to DBMSs
o matrix algebra
o has no graphical user
interfaces, but connects
to Java, TclTk
o hash tables and regular
expressions
o high-level data analytic
and statistical functions
o classes (“OO”)
o graphics
o programming language:
loops, branching,
subroutines
o language interpreter
can be very slow, but
allows to call own C/C++
code
o no spreadsheet view of
data, but connects to
Excel/MsOffice
o no professional /
commercial support
Section 1
R Packaging and statistics
o Packaging: a crucial infrastructure to efficiently
produce, load and keep consistent software libraries
from (many) different sources / authors
o Statistics: most packages deal with statistics and data
analysis
o State of the art: many statistical researchers provide
their methods as R packages
Section 1
R as a calculator
> seq(0, 5, length=6)
0.0
-0.5
[1] 1.414214
-1.0
> sqrt(2)
sin(seq(0, 2 * pi, length = 100))
[1] 5
0.5
1.0
> log2(32)
0
20
40
[1] 0 1 2 3 4 5
> plot(sin(seq(0, 2*pi, length=100)))
60
Index
80
100
Section 1
Variables
> a = 49
> sqrt(a)
[1] 7
> a = "The dog ate my homework"
> sub("dog","cat",a)
[1] "The cat ate my homework“
> a = (1+1==3)
>a
[1] FALSE
Numeric
Character
string
Logical
Section 1
Missing values
Variables of each data type (numeric, character, logical)
can also take the value NA: not available.
>- NA is not the same as 0
>- NA is not the same as “”
>- NA is not the same as FALSE
Any operations (calculations, comparisons) that involve NA
> NA | TRUE
may or may not produce NA:
[1] TRUE
> NA==1
> NA & TRUE
[1] NA
[1] NA
> 1+NA
[1] NA
> max(c(NA, 4, 7))
[1] NA
Section 1
Functions and Operators
Functions do things with data
“Input”: function arguments (0,1,2,…)
“Output”: function result (exactly one)
Example:
add = function(a,b)
{ result = a+b
return(result) }
Operators:
Short-cut writing for frequently used functions
of one or two arguments.
Examples: + - * / ! & | %%
Section 1
Vectors and Matrices
Vector: an ordered collection of data of the
same type
> a = c(1,2,3)
> a*2
[1] 2 4 6
Matrix: a rectangular table of data of the same
type
Example: the expression values for 10000 genes
for 30 tissue biopsies: a matrix with 10000
rows and 30 columns.
Section 1
Data frames
Data frame: is supposed to represent the typical data
table that researchers come up with – like a
spreadsheet.
It is a rectangular table with rows and columns; data
within each column has the same type (e.g. number, text,
logical), but different columns may have different
types. And each row on one case
Example:
>a
localisation tumorsize progress
XX348 proximal
6.3 FALSE
XX234
distal
8.0 TRUE
XX987 proximal
10.0 FALSE
Section 1
Loops
When the same or similar tasks need to be
performed multiple times; for all elements of a
list; for all columns of an array; etc.
for(i in 1:10) {
print(i*i)
}
i=1
while(i<=10) {
print(i*i)
i=i+sqrt(i)
}
Section 1
lapply, sapply, apply
When the same or similar tasks need to be performed multiple
times for all elements of a list or for all columns of an array. May be
easier and faster than “for” loops
lapply( li, fct )
To each element of the list li, the function fct is applied. The
result is a list whose elements are the individual fct results.
>
>
>
>
>
>
>
>
li = list("klaus","martin","georg")
lapply(li, toupper)
[[1]]
[1] "KLAUS"
[[2]]
[1] "MARTIN"
[[3]]
[1] "GEORG"
Section 1
Regular expressions
A tool for text matching and replacement which is available in similar
forms in many programming languages (Perl, Unix shells, Java)
> a = c("CENP-F","Ly-9", "MLN50", "ZNF191", "CLH-17")
> grep("L", a)
[1] 2 3 5
> grep("L", a, value=T)
[1] "Ly-9"
"MLN50" "CLH-17"
> grep("^L", a, value=T)
[1] "Ly-9"
> grep("[0-9]", a, value=T)
[1] "Ly-9"
"MLN50" "ZNF191" "CLH-17"
> gsub("[0-9]", "X", a)
[1] "CENP-F" "Ly-X"
"MLNXX"
"ZNFXXX" "CLH-XX"
Section 1
Storing data
Every R object can be stored into and restored
from a file with the commands
“save” and “load”.
This uses the XDR (external data representation)
standard of Sun Microsystems and others, and is
portable between MS-Windows, Unix, Mac.
> save(x, file=“x.Rdata”)
> load(“x.Rdata”)
Section 1
Importing and exporting data
There are many ways to get data into R and out of R.
Most programs (e.g. Excel), as well as humans, know how
to deal with rectangular tables in the form of tabdelimited text files.
> x = read.delim(“filename.txt”)
also: read.table, read.csv
> write.table(x, file=“x.txt”, sep=“\t”)
Section 1
Statistical functions
t.test()
Student (test t), determines if the averages of two
populations are statistically different.
prop.test()
hypothesis tests with proportions
Non-parametrical tests
kruskal.test()
Kruskal-Wallis test (variance analysis)
chisq.test() 2 test for convergence
ks.test()
Kolmogorov-Smirnov test
Section 1
lm()
> pisa.lm <- lm(inclin ~ year)
> pisa.lm
> summary(pisa.lm)
Call:
lm(formula = inclin ~ year)
Residuals:
Min
1Q Median
3Q Max
-5.9670 -3.0989 0.6703 2.3077 7.3956
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -61.1209 25.1298 -2.432 0.0333 *
year
9.3187 0.3099 30.069 6.5e-12 ***
--Signif. codes: 0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1
Residual standard error: 4.181 on 11 degrees of freedom
Multiple R-Squared: 0.988,
Adjusted R-squared: 0.9869
F-statistic: 904.1 on 1 and 11 DF, p-value: 6.503e-012
Section 1
Getting help
Details about a specific command whose name you
know (input arguments, options, algorithm, results):
>? t.test
or
>help(t.test)
Section 1
Getting help
o
HTML search
engine
o
search for topics
with regular
expressions:
“help.search”
Web sites
Section 1
www.r-project.org
cran.r-project.org
www.bioconductor.org
Full text search:
www.r-project.org
or
www.google.com
with ‘… site:.r-project.org’ or other R-specific
keywords
Section 2
R Packages
• There are many contributed
packages that can be used to
extend R.
• These libraries are created and
maintained by the authors.
Section 2
Installing R packages from CRAN
• install.packages(‘packageName’)
• OR in Rgui:
– select a local repository (if needed)
– select package(s) from list
Section 2
Ade4>-Analysis of Ecological Data : Exploratory
and Euclidean methods in Environmental sciences
http://pbil.univlyon1.fr/ADE4/home.php?lang=eng
ade4 is characterized by :
-the implementation of graphical and statistical functions
- the availability of numerical data
- the redaction of technical and thematic documentation
- the inclusion of bibliographic references
Section 2
Vegan: R functions for vegetation ecologists
Vegan: R Labs for Vegetation Ecologists
http://ecology.msu.montana.edu/labdsv/R/labs/
•Familiarization with Data
Lab 1 Loading Vegetation Data and Simple Graphical Data Summaries
Lab 2 Loading Site/Environment Data and Simple Graphical Summaries
Lab 3 Vegetation Tables and Summaries
•Modeling Species Distributions
Lab 4 Modeling Species Distributions with Generalized Linear Models
Lab 5 Modeling Species Distributions with Generalized Additive Models
Lab 6 Modeling Species Distributions with Classification Trees
•Ordination
Lab 7 Principal Components Analysis
Lab 8 Principal Coordinates Analysis
Lab 9 Nonmetric Multi-Dimensional Scaling
Lab 10 Correspondence Analysis and Detrended Corresponence Analysis
Lab 11 Fuzzy Set Ordination
Lab 12 Canonical Correspondence Analysis
•Cluster Analysis
Lab 13 Cluster Analysis
Lab 14 Discriminant Analysis with Tree Classifiers
•Introduction
R for Ecologists, a primer on the S language and available software
Analysis of Phylogenetics
and Evolution
http://ape.mpl.ird.fr/
•ape is not a classical software (i.e., a "black box" though it can be used as
one) but an environment for data analyses and development of new methods
where interactivity and user's decisions are left an important place.
•ape is written in R, but some functions for computer-intensive tasks are
written in C.
•ape is available for all main computer operating systems (Linux, Unix,
Windows, MacOS X 10.4 and later).
•ape is distributed under the terms of the GNU General Public Licence,
meaning that it is free and can be freely modified and redistribued (under
some conditions).
smatr:(Standardised) Major Axis Estimation and Testing Routines
This package provides methods of fitting bivariate lines in allometry using
the major axis (MA) or standardised major axis (SMA), and for making
inferences about such lines.
BiodiversityR: GUI for biodiversity and community
ecology analysis
This package provides a GUI and some utility functions
(often based on the vegan package) for statistical analysis
of biodiversity and ecological communities,
• species accumulation curves,
• diversity indices, Renyi profiles,
• GLMs for analysis of species abundance and presenceabsence
• distance matrices,
• Mantel tests, and cluster, constrained and unconstrained
ordination analysis. A book on biodiversity and community
ecology analysis is available for free download from the
website.
untb: ecological drift under the UNTB
A collection of utilities for biodiversity data. Includes the simulation of
ecological drift under Hubbell's Unified Neutral Theory of Biodiversity, and the
calculation of various diagnostics such as Preston curves.
Functions (34)
alonso Various functions from Alonso and McKane
bci Barro Colorado Island (BCI) dataset
butterflies
abundance data for butterflies
etienne Etienne's sampling formula
expected.abundance
Expected abundances under the neutral model
plot.count Abundance curves
rand.neutral
Random neutral ecosystem
volkov Expected frequency of species
zsm Zero sum multinomial distribution as derived by McKane 2004
sem:
Structural Equation Models
sem
is an R package for fitting structural-equation models. The package supports general
structural equation models with latent varibles, fit by maximum likelihood assuming
multinormality, and single-equation estimation for observed-variable models by two-stage
least.squares
1
2
x1
x2
x11
1
x 21
 31
3
4
x3
x4
5
x5
6
x6
x32
 21
 12
2
x42
x53
x63
1
 11
 22
 32
 23
3
1

2
21

y11
y1
1
y21
y2
2
y32
y3
3
y42
y4
4
2
R2WinBUGS: Running WinBUGS and
OpenBUGS from R / S-PLUS
The R2WinBUGS package provides convenient
functions to call WinBUGS from R.
It automatically writes the data and scripts in a
format readable by WinBUGS for processing in
batch mode, which is possible since version 1.4.
After the WinBUGS process has nished, it is
possible either to read the resulting data into R
by the package itself which gives a compact
graphical summary of inference and
convergence diagnostics or to use the facilities
of the coda package for further analyses of the
output.
Others
bio.infer: Maximum Likelihood Method for Predicting
Environmental Conditions from Assemblage Composition
popbio:
Construction and analysis of matrix population models
Construct and analyze projection matrix models from a demography study
of marked individuals classified by age or stage. The package covers
methods described in Matrix Population Models by Caswell (2001) and
Quantitative Conservation Biology by Morris and Doak (2002).
simecol
allows to implement ecological models (ODEs, IBMs, ...) using a
template-like object-oriented stucture. It helps to organize scenarios and may
also be useful for other areas.
demogR: Analysis of age-structured demographic models
Section 3
Making Elegant Graphics in R
Graphical capabilities
100
120
Simple, exploratory graphics are
easy to produce.
0
20
40
60
dist
80
Publication quality graphics can be created.
Several device drivers are available including:
5
15
20
25
speed
8
6
4
10
2
0
Postscript,
5
-2
-10
0
-5
0
-5
X
5
10
-10
Y
Z
On-screen graphics
10
Section 3
Types of graphics functions
High level functions
>-Plot graphs from data
Low level functions
>-Add to/Modify existing graphs
Graphics parameters
>-Control how plots are displayed
Section 3
Some and high level functions
•
•
•
•
•
•
•
•
Low
points()
lines()
abline()
text()
mtext()
title()
legend()
etc
High
•
•
•
•
plot()
hist()
barplot()
boxplot()
Section 3
Graphical parameters
• list of parameters held in par()
• controls how plots are displayed
• values in par() can be over-ridden by
calls to high-level functions
– values are changed temporarily for plot
– no permanent change to par() values
• can be modified directly
Section 3
Plotting single variables: histogram
• First generate data
rrnorm takes n random samples from a
normal distribution
Can specify mean and sd, defaults are
(0,1)
x<-rnorm(100)
• Plot histogram:
Histogram of x
hist(x)
10
5
0
Frequency
15
20
hist is a generic function to plot
histograms of a dataset
-2
-1
0
x
1
2
Section 3
Plotting single variables: boxplot
-1
-2
• Add title
title(main="Boxplot
of x1")
0
1
• boxplot function
boxplot(x1)
2
Boxplot of x1
Section 3
3
0
-1
-2
-3
boxplot(x1,x2,x3)
title(main=“three
boxplots”)
OR
bp3<data.frame(x1,x2,x3)
boxplot(bp3)
1
2
Generate more data
x2<-rnorm(100)
x3<-rnorm(100,2)
4
Plotting single variables: boxplot
1
2
3
Section 3
• barplot(tN,col=rainbo
w(20)
15
10
5
• barplot(tN)
0
• tN <- table(Ni <rpois(100, lambda=5))
20
Plotting single variables: barplot
0
1
2
3
4
5
6
7
8
9
11
12
Section 3
Barplots for grouped data
• Using the VADeaths table from the datasets package:
•
>-library(datasets); data(VADeaths); VADeaths
• Rural Male Rural Female Urban Male Urban Female
50-54
11.7
8.7
15.4
8.4
55-59
18.1
11.7
24.3
13.6
60-64
26.9
20.3
37.0
19.3
65-69
41.0
30.9
54.6
35.1
70-74
66.0
54.3
71.1
50.0
Section 3
Barplots for grouped data
60
50-54
55-59
60-64
65-69
70-74
0
0
10
20
50
30
40
100
50
150
70-74
65-69
60-64
55-59
50-54
70
200
barplot(VADeaths, col=rainbow(5
barplot(VADeaths,
col=rainbow(5),legend=T) legend=T,beside=T)
Rural Male
Rural Female
Urban Male
Urban Female
Rural Male
Rural Female
Urban Male
Urban Female
Section 3
Plotting single variables: scatterplot
function (n=20)
{x1<-rnorm(n)
par(mfrow=c(3,2))
plot(x1,type="p",main="type p:
points")
plot(x1,type="l",main="type l:
lines")
plot(x1,type="b",main="type b:
both")
plot(x1,type="o",main="type o:
overplot")
plot(x1,type="h",main="type h:
histogram")
plot(x1,type="s",main="type s:
steps")
par(mfrow=c(1,1))}
Section 3
Plotting two variables: scatterplot
• Example data: “women” from the datasets package .
Heights vs weights for America women
To get data into workspace: >library(datasets);data(women); attach(women)
140
130
plot(women$height,women$weight)
120
plot(weight~height)
weight
150
160
plot(x=height,y=weight, data=women)
58
60
62
64
66
height
68
70
72
Section 3
Adding to plots: lines
• Straight lines added using “abline” function
abline(v=x) >-draws a vertical line at value x
abline(v=y) >-draws a horizontal line at value y
abline(a=x,b=m)>- draws a line with intercept x and
gradient m
• lines() adds a series of connected lines
– takes vectors of x and y co-ordinates and draws lines
between them
• segments() draws (disjoint) lines segments between
pairs of points
Section 3
Adding to plots: lines
100
80
60
40
20
abline(cars.lm,
col="red“,lty=2)
0
cars.lm<-lm(dist~speed)
dist
library(datasets)
data(cars)
attach(cars)
plot(dist~speed)
120
• “cars” from the datasets package has speed
and stopping distances
• draw scatter plot and add regression line
5
10
15
speed
20
25
Section 3
Adding to plots: lines and text
•
•
•
•
•
•
•
•
•
•
•
•
•
•
function (df=1,p=0.05)
{x<-seq(1,50,0.01)
y<-dchisq(x,df)
plot(x,y,type="l",xlim=c(0,10),ann=F)
title(xlab=substitute(paste(chi^2,"stati
stic")),ylab="Frequency",main=(substitut
e(paste(chi^2,"distribution with",df,"df,
p=",p))))
critx<-qchisq(p,df,lower.tail=F)
abline(v=critx,col="red")
xr<-c(critx,x[x>critx])
yr<-c(dchisq(critx,df),y[x>critx])
polygon(x=c(critx,critx,xr,10),
y=c(0,dchisq(critx,df),yr,0),col="red”)
xr<-round(critx,0)+2
yr<-max(y)/4
rx<-round(critx,3)
text(xr,yr,substitute(paste("Critical
",chi^2,"=",rx)),col="red")}
Section 3
Adding to plots: margin text
• mtext() adds text in the
margins of figures
• Margins are numbered
1:4, clockwise from the
bottom
• Text is printed a
specified number of
lines away from the edge
of the plot area
• Can print in outer
margins if available
Section 3
Adding to plots: legends
• “Legend” function adds labels to plots
• Needs x and y co-ordinates, and a list of labels
• Can be argument to high-level function, or called separately
legend(x=30,y=0.06,
legend=c("Null","Alternative"
,"Threshold"), lty=c(1,1,2),
col=c("blue","black","red"),lw
d=2)
Section 3
Plotting symbols, pch
• 25 default plotting characters
• Controlled by pch parameter
• Add pch = n argument to a high-level function
plots points of type n
• Argument to pch can be a single character, so
pch = “x” will plot letter x
• pch can take vectors as arguments, so
different points will have different symbols
function () {x<-rep(1:5,times=5) ,y<-rep(1:5,each=5)
par(xpd=T),
plot(x,y,type="p",pch=1:25,cex=3,axes=F,ann=F,bty="n",bg="red")
title(main="Data symbols 1:25")
text(x,y,labels=1:25,pos=1,offset=1)}
Section 3
Plotting symbols: size and width
• symbol size controlled by cex
– cex=n plots a figure n times
the default size
• Line width controlled by lwd
– lwd=n gives the width of the
line
function ()
{x<-rep(1:5,times=5),y<-rep(1:5,each=5)
plot(x,y,type="p",pch=3,lwd=x,cex=y,xli
m=c(0,5),ylim=c(0,5),bty="n", ann=F)
title(main="R plotting symbols: size and
width", xlab="Width (lwd)",ylab="Size
(cex)")}
Section 3
Plotting symbols: colour
• Colour of objects controlled by “col”, see
colours() for all ; palette()
function ()
{x<-rep(1:5,times=5)
y<-rep(1:5,each=5)
plot(x,y,type="p",pch=3,lwd=x,ce
x=y,
col=1:25,
xlim=c(0,5),ylim=c(0,5),bty="
n", ann=F)
title(main="R plotting symbols:
size and width",
xlab="Width (lwd)",
ylab="Size (cex)")}
Section 3
Types of lines, lty
function () {x<-1:10
plot(x,y=rep(1,10),ylim=c(1,6.5),
type="l",lty=1,axes=F,ann=F)
title(main="Line Types")
for(i in 2:6){lines(x,y=rep(i,10),lty=i)}
axis(2,at=1:6,tick=F,las=1)
linenames<paste("Type",1:6,":",c("solid","dashed
","dotted","dotdash","longdash","two
dash"))
text(2,(1:6)+0.25,labels=linenames)
box()
}
Section 3
Using lattice graphics
par(mfrow=c(2,2))
hist(D$wg,
main='Histogram',xlab='Weight Gain',
ylab ='Frequency', col=heat.colors(14))
boxplot(wg.7a$wg, wg.8a$wg, wg.9a$wg,
wg.10a$wg, wg.11a$wg, wg.12p$wg,
main='Weight Gain', ylab='Weight Gain
(lbs)',
xlab='Shift', names =
c('7am','8am','9am','10am','11am','12pm
'))
plot(D$metmin,D$wg,main='Met Minutes
vs. Weight Gain', xlab='Mets
(min)',ylab='Weight Gain (lbs)',pch=2)
plot(t1,D2$Intel,type="l",main='Closing
Stock Prices',xlab='Time',ylab='Price
$')
lines(t1,D2$DELL,lty=2)
Section 3
Grid Graphics
Object Oriented - graphical components can be reused and
recombined: library(grid); library(lattice)
•
•
•
•
•
•
•
A standard set of graphical primitives:
grid.rect(...)
grid.lines(...)
grid.polygon(...)
grid.circle(...)
grid.text(...)
library(help = grid) for details
0
0.25
0.5
text
0.75
1
R Package for Elegant Graphics–
Lattice, ggplot2
Ggplot2: R Graphics
Paul Murrell
Ggplot2: Elegant Graphics
for Data Analysis (Use R)
Hadley Wickham
Thanks you !
www.r-project.org
Download