A brief introduction to R and
R packages used in ecology
Kechang Niu
Department of Ecology , PKU
2010-06-19
What this is
o A short, highly incomplete tour around some
of the basic concepts of R as a programming
language
o Some hints on how to obtain documentation on
the many library functions (packages)
R, S and S-plus
S: an interactive environment for data analysis
developed at Bell Laboratories since 1976
1988 - S2: RA Becker, JM Chambers, A Wilks
1992 - S3: JM Chambers, TJ Hastie
1998 - S4: JM Chambers
Exclusively licensed by AT&T/Lucent to Insightful
Corporation, Seattle WA. Product name: “S-plus”.
Implementation languages C, Fortran.
See:
http://cm.bell-labs.com/cm/ms/departments/sia/S/history.html
R, S and S-plus
R: initially written by Ross Ihaka and Robert
Gentleman at the Department of Statistics, University
of Auckland, New Zealand, during the 1990s.
Since 1997: international “R-core” team of ca. 15
people with access to common CVS archive.
GNU General Public License (GPL)
- can be used by anyone for any purpose
- contagious
Open Source
-quality control!
-efficient bug tracking and fixing system supported
by the user community
Advantages
o data handling and storage: numeric, textual
o matrix algebra
o hash tables and regular expressions
o high-level data analytic and statistical functions
o classes (“OO”)
o graphics
o programming language: loops, branching, subroutines
Disadvantages
o is not a database, but connects to DBMSs
o has no graphical user interfaces, but connects to Java, TclTk
o language interpreter can be very slow, but allows you to call your own C/C++ code
o no spreadsheet view of data, but connects to Excel/MsOffice
o no professional / commercial support
R Packaging and statistics
o Packaging: a crucial infrastructure to efficiently
produce, load and keep consistent software libraries
from (many) different sources / authors
o Statistics: most packages deal with statistics and
data analysis
o State of the art: many statistical researchers
provide their methods as R packages
R as a calculator
> seq(0, 5, length=6)
[1] 0 1 2 3 4 5
> sqrt(2)
[1] 1.414214
> log2(32)
[1] 5
> plot(sin(seq(0, 2*pi, length=100)))
[Figure: sin(seq(0, 2 * pi, length = 100)) plotted against Index]
variables
> a = 49                              # numeric
> sqrt(a)
[1] 7
> a = "The dog ate my homework"       # character string
> sub("dog","cat",a)
[1] "The cat ate my homework"
> a = (1+1==3)                        # logical
> a
[1] FALSE
missing values
Variables of each data type (numeric, character, logical)
can also take the value NA: not available.
o NA is not the same as 0
o NA is not the same as “”
o NA is not the same as FALSE
Any operations (calculations, comparisons) that involve
NA may or may not produce NA:
> NA==1
[1] NA
> 1+NA
[1] NA
> max(c(NA, 4, 7))
[1] NA
> NA | TRUE
[1] TRUE
> NA & TRUE
[1] NA
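In practice, missing values are usually detected with is.na() (a comparison with == only returns NA) and dropped inside summary functions through the na.rm argument. A minimal sketch, with an illustrative vector x:

```r
x <- c(NA, 4, 7)

is.na(x)                 # which elements are missing: TRUE FALSE FALSE
n_missing <- sum(is.na(x))   # count of missing values

# Many summary functions skip NA when asked to
max(x, na.rm = TRUE)     # 7
mean(x, na.rm = TRUE)    # 5.5
```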
Functions and Operators
Functions do things with data
“Input”: function arguments (0,1,2,…)
“Output”: function result (exactly one)
Example:
add = function(a,b)
{ result = a+b
return(result) }
Operators:
Short-cut writing for frequently used
functions of one or two arguments.
Examples: + - * / ! & | %%
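The link between functions and operators can be tried out directly: every operator is an ordinary function that can be called by its back-quoted name, and new binary operators can be defined with names of the form %name%. A small sketch; the operator %+% is a made-up name used only for illustration:

```r
# The add() function from the example
add <- function(a, b) {
  result <- a + b
  return(result)
}
add(2, 3)            # 5

# An operator is a function with a special calling syntax
`+`(2, 3)            # same as 2 + 3

# Hypothetical user-defined operator that joins two strings
`%+%` <- function(a, b) paste(a, b)
"R" %+% "rocks"      # "R rocks"
```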
vectors and matrices
vector: an ordered collection of data of the
same type
> a = c(1,2,3)
> a*2
[1] 2 4 6
matrix: a rectangular table of data of the
same type
example: the expression values for 10000
genes for 30 tissue biopsies: a matrix with
10000 rows and 30 columns.
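A matrix can be built from a vector by giving its dimensions. The sketch below uses a tiny 2 x 3 matrix instead of the 10000 x 30 expression matrix; the row and column names are invented for illustration:

```r
# Values are filled column by column
m <- matrix(1:6, nrow = 2, ncol = 3)
rownames(m) <- c("gene1", "gene2")            # illustrative names
colnames(m) <- c("biopsyA", "biopsyB", "biopsyC")

dim(m)             # 2 3
m["gene2", 3]      # element in row 2, column 3: 6
rowMeans(m)        # mean per row: 3 4
m %*% t(m)         # matrix product, a 2 x 2 matrix
```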
Data frames
data frame: is supposed to represent the typical
data table that researchers come up with – like a
spreadsheet.
It is a rectangular table with rows and columns;
data within each column has the same type (e.g.
number, text, logical), but different columns may
have different types.
Example:
> a
      localisation tumorsize progress
XX348     proximal       6.3    FALSE
XX234       distal       8.0     TRUE
XX987     proximal      10.0    FALSE
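The example table could be created and queried roughly as follows. This is a sketch: how the original data frame was built is not shown on the slide, so the data.frame() call is an assumption.

```r
a <- data.frame(
  localisation = c("proximal", "distal", "proximal"),
  tumorsize    = c(6.3, 8.0, 10.0),
  progress     = c(FALSE, TRUE, FALSE),
  row.names    = c("XX348", "XX234", "XX987"),
  stringsAsFactors = FALSE     # keep text columns as character
)

str(a)                          # one type per column
a$tumorsize                     # a single column as a vector
a["XX234", ]                    # a single row by name
big <- a[a$tumorsize > 7, ]     # rows that satisfy a condition
```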
Loops
When the same or similar tasks need to be
performed multiple times; for all elements of
a list; for all columns of an array; etc.
for(i in 1:10) {
print(i*i)
}
i=1
while(i<=10) {
print(i*i)
i=i+sqrt(i)
}
lapply, sapply, apply
When the same or similar tasks need to be performed multiple
times for all elements of a list or for all columns of an array.
May be easier and faster than “for” loops
lapply( li, fct )
To each element of the list li, the function fct is applied. The
result is a list whose elements are the individual fct results.
> li = list("klaus","martin","georg")
> lapply(li, toupper)
[[1]]
[1] "KLAUS"

[[2]]
[1] "MARTIN"

[[3]]
[1] "GEORG"
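The two other functions named in the slide title work in the same spirit: sapply simplifies the result to a vector where possible, and apply works over the rows or columns of a matrix. A brief sketch with illustrative data:

```r
li <- list("klaus", "martin", "georg")

upper   <- sapply(li, toupper)   # character vector instead of a list
lengths <- sapply(li, nchar)     # 5 6 5

m <- matrix(1:6, nrow = 2)       # 2 rows, 3 columns
col_sums <- apply(m, 2, sum)     # 2 = over columns: 3 7 11
row_sums <- apply(m, 1, sum)     # 1 = over rows:    9 12
```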
Regular expressions
A tool for text matching and replacement which is available in similar
forms in many programming languages (Perl, Unix shells, Java)
> a = c("CENP-F","Ly-9", "MLN50", "ZNF191", "CLH-17")
> grep("L", a)
[1] 2 3 5
> grep("L", a, value=T)
[1] "Ly-9"   "MLN50"  "CLH-17"
> grep("^L", a, value=T)
[1] "Ly-9"
> grep("[0-9]", a, value=T)
[1] "Ly-9"   "MLN50"  "ZNF191" "CLH-17"
> gsub("[0-9]", "X", a)
[1] "CENP-F" "Ly-X"   "MLNXX"  "ZNFXXX" "CLH-XX"
Storing data
Every R object can be stored into and
restored from a file with the commands
“save” and “load”.
This uses the XDR (external data
representation) standard of Sun
Microsystems and others, and is portable
between MS-Windows, Unix, Mac.
> save(x, file="x.Rdata")
> load("x.Rdata")
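A complete round trip might look like the sketch below. It writes to a temporary file (an assumption made so the example leaves nothing behind) and removes x before loading to show that load() restores the object under its original name.

```r
x <- c(1, 2, 3)
f <- tempfile(fileext = ".RData")   # temporary file path

save(x, file = f)    # write the object x to disk
rm(x)                # x no longer exists in the workspace
load(f)              # restores x under its original name
print(x)
```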
Importing and exporting data
There are many ways to get data into R and out of R.
Most programs (e.g. Excel), as well as humans, know
how to deal with rectangular tables in the form of
tab-delimited text files.
> x = read.delim("filename.txt")
also: read.table, read.csv
> write.table(x, file="x.txt", sep="\t")
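As a sketch of a full export and re-import, the following writes a small data frame to a tab-delimited file and reads it back. The data and the temporary file are illustrative; quote = FALSE and row.names = FALSE are optional choices that give a plain table that spreadsheet programs open cleanly.

```r
x <- data.frame(
  id        = c("XX348", "XX234", "XX987"),
  tumorsize = c(6.3, 8.0, 10.0),
  stringsAsFactors = FALSE
)
f <- tempfile(fileext = ".txt")

# Export: tab-separated, header line, no row names, no quotes
write.table(x, file = f, sep = "\t", quote = FALSE, row.names = FALSE)

# Import: read.delim expects a header and tab separators by default
y <- read.delim(f, stringsAsFactors = FALSE)
print(y)
```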
Getting help
Details about a specific command whose name
you know (input arguments, options, algorithm,
results):
> ?t.test
or
> help(t.test)
Getting help
o HTML search engine
o search for topics with regular expressions: “help.search”
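When only part of a function name is known, apropos() complements help.search: it matches a regular expression against the names of objects on the search path. A short sketch; the interactive help calls are shown as comments because they open a help viewer:

```r
# All visible objects whose name starts with "read."
readers <- apropos("^read\\.")
print(readers)          # includes read.csv, read.delim, read.table

# Interactive equivalents for searching the documentation text:
# help.search("t test")
# ??"t test"
```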
Web sites
www.r-project.org
cran.r-project.org
www.bioconductor.org
Full text search:
www.r-project.org
or
www.google.com
with ‘… site:.r-project.org’ or other R-specific
keywords
R Packages
• There are many contributed
packages that can be used to
extend R.
• These libraries are created
and maintained by the authors.
Installing R packages from CRAN
• install.packages("packageName")
• OR in Rgui:
– select a local repository (if needed)
– select package(s) from list
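In scripts it is common to install a package only when it is missing and then load it. The helper below is a hypothetical convenience function, not part of R, and the CRAN mirror URL is one possible choice:

```r
# Hypothetical helper: install pkg from CRAN only if it is not yet available
ensure_package <- function(pkg, repos = "https://cloud.r-project.org") {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg, repos = repos)
  }
  requireNamespace(pkg, quietly = TRUE)   # TRUE if the package can be loaded
}

ok <- ensure_package("stats")   # ships with R, so nothing is downloaded
library(stats)                  # attach the package for use

# ensure_package("vegan") would download vegan from CRAN if it is missing
```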
[Figures: scatter plot of dist against speed; perspective (3D surface) plot with axes X, Y and Z]
barplot()
hist()
image()
plot()
pairs()
persp()
pie()
polygon()
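A few of these functions can be tried with data that ships with R. The sketch below assumes the built-in cars data set (columns speed and dist); the surface drawn by persp() is an arbitrary illustrative function.

```r
# Scatter plot and histogram from the built-in cars data set
plot(cars$speed, cars$dist, xlab = "speed", ylab = "dist")
hist(cars$dist, main = "Stopping distance")

# Perspective plot of a surface z = f(x, y) on a regular grid
x <- seq(-10, 10, length = 30)
y <- x
f <- function(x, y) {
  r <- sqrt(x^2 + y^2)
  10 * sin(r) / r
}
z <- outer(x, y, f)              # 30 x 30 matrix of heights
persp(x, y, z, theta = 30, phi = 30,
      xlab = "X", ylab = "Y", zlab = "Z")
```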
R Package – barplot, simpleboot
R Package for Elegant Graphics – Lattice, ggplot2
R Graphics
Paul Murrell
ggplot2: Elegant Graphics for Data Analysis (Use R!)
Hadley Wickham
R Package
for Ecological Research
ade4: Analysis of Ecological Data : Exploratory
and Euclidean methods in Environmental sciences
http://pbil.univ-lyon1.fr/ADE4/home.php?lang=eng
ade4 is characterized by :
-the implementation of graphical and statistical functions
- the availability of numerical data
- the writing of technical and thematic documentation
- the inclusion of bibliographic references
Vegan: R functions for
vegetation ecologists
Vegan: R Labs for Vegetation Ecologists
http://ecology.msu.montana.edu/labdsv/R/labs/
•Familiarization with Data
Lab 1 Loading Vegetation Data and Simple Graphical Data Summaries
Lab 2 Loading Site/Environment Data and Simple Graphical Summaries
Lab 3 Vegetation Tables and Summaries
•Modeling Species Distributions
Lab 4 Modeling Species Distributions with Generalized Linear Models
Lab 5 Modeling Species Distributions with Generalized Additive Models
Lab 6 Modeling Species Distributions with Classification Trees
•Ordination
Lab 7 Principal Components Analysis
Lab 8 Principal Coordinates Analysis
Lab 9 Nonmetric Multi-Dimensional Scaling
Lab 10 Correspondence Analysis and Detrended Correspondence Analysis
Lab 11 Fuzzy Set Ordination
Lab 12 Canonical Correspondence Analysis
•Cluster Analysis
Lab 13 Cluster Analysis
Lab 14 Discriminant Analysis with Tree Classifiers
•Introduction
R for Ecologists, a primer on the S language and available software
ape: Analyses of Phylogenetics
and Evolution
http://ape.mpl.ird.fr/
•ape is not classical "black box" software (though it can be used as one) but an
environment for data analysis and the development of new methods, in which interactivity
and the user's decisions play an important role.
•ape is written in R, but some functions for computer-intensive tasks are written in C.
•ape is available for all main computer operating systems (Linux, Unix, Windows,
MacOS X 10.4 and later).
•ape is distributed under the terms of the GNU General Public Licence, meaning that it
is free and can be freely modified and redistributed (under some conditions).
untb: ecological drift under the UNTB
A collection of utilities for biodiversity data. Includes the simulation of
ecological drift under Hubbell's Unified Neutral Theory of Biodiversity, and the
calculation of various diagnostics such as Preston curves.
Functions (34)
alonso              Various functions from Alonso and McKane
bci                 Barro Colorado Island (BCI) dataset
butterflies         abundance data for butterflies
etienne             Etienne's sampling formula
expected.abundance  Expected abundances under the neutral model
plot.count          Abundance curves
rand.neutral        Random neutral ecosystem
volkov              Expected frequency of species
zsm                 Zero sum multinomial distribution as derived by McKane 2004
BiodiversityR: GUI for biodiversity and community
ecology analysis
This package provides a GUI and some utility functions
(often based on the vegan package) for statistical analysis
of biodiversity and ecological communities, including:
• species accumulation curves,
• diversity indices, Renyi profiles,
• GLMs for analysis of species abundance and presence-absence,
• distance matrices,
• Mantel tests, and cluster, constrained and unconstrained
ordination analysis.
A book on biodiversity and community ecology analysis is
available for free download from the website.
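To make "diversity indices" concrete, the sketch below computes the Shannon and Simpson indices by hand in base R. It does not use BiodiversityR or vegan, and the species counts are invented for illustration.

```r
# Hypothetical abundance counts for four species at one site
counts <- c(sp1 = 10, sp2 = 10, sp3 = 10, sp4 = 10)

p <- counts / sum(counts)        # relative abundances

shannon <- -sum(p * log(p))      # Shannon index H' = -sum(p_i * ln p_i)
simpson <- 1 - sum(p^2)          # Simpson index 1 - sum(p_i^2)

shannon                          # log(4) for four equally common species
simpson                          # 0.75
```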
sem: Structural Equation Models
sem is an R package for fitting structural-equation models. The package
supports general structural equation models with latent variables, fit by
maximum likelihood assuming multinormality, and single-equation estimation
for observed-variable models by two-stage least squares.
1
2
x1
x2
x11
1
x 21
 31
3
4
x3
x4
5
x5
6
x6
x32
 21
 12
2
x42
x53
x63
1
 11
 22
 32
 23
3
1

2
21

y11
y1
1
y21
y2
2
y32
y3
3
y42
y4
4
2
smatr: (Standardised) Major Axis Estimation and Testing Routines
This package provides methods of fitting bivariate lines in allometry
using the major axis (MA) or standardised major axis (SMA), and for
making inferences about such lines. The available methods of inference
include confidence intervals and one-sample tests for slope and elevation,
testing for a common slope or elevation amongst several allometric lines,
constructing a confidence interval for a common slope or elevation, and
testing for no shift along a common axis, amongst several samples.
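The standardised major axis itself is simple to compute: its slope is the ratio of the standard deviations of y and x, carrying the sign of their correlation. The base R sketch below illustrates only this point estimate with invented data; it is not the smatr interface and performs none of the package's tests or confidence intervals.

```r
# Hypothetical bivariate data (e.g. two log-transformed size measurements)
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 7.8, 10.1)

# SMA slope: sign of the correlation times sd(y) / sd(x)
sma_slope <- sign(cor(x, y)) * sd(y) / sd(x)

# The line passes through the centroid (mean(x), mean(y))
sma_elevation <- mean(y) - sma_slope * mean(x)

c(slope = sma_slope, elevation = sma_elevation)
```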
bio.infer: Maximum Likelihood Method for Predicting
Environmental Conditions from Assemblage Composition
popbio:
Construction and analysis of matrix population models
Construct and analyze projection matrix models from a demography study
of marked individuals classified by age or stage. The package covers
methods described in Matrix Population Models by Caswell (2001) and
Quantitative Conservation Biology by Morris and Doak (2002).
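The core calculation behind a projection matrix model can be sketched in base R: the dominant eigenvalue of the matrix is the asymptotic population growth rate, and the matching eigenvector gives the stable stage distribution. The two-stage matrix below is invented for illustration, and the code does not use the popbio functions.

```r
# Hypothetical two-stage projection matrix (juvenile, adult)
# row 1: contributions to juveniles; row 2: survival into adults
A <- matrix(c(0.5, 2.0,
              0.5, 0.0), nrow = 2, byrow = TRUE)

e <- eigen(A)                       # eigenvalues sorted by decreasing modulus
lambda <- Re(e$values[1])           # asymptotic growth rate

w <- Re(e$vectors[, 1])
stable_stage <- w / sum(w)          # proportions in each stage

# Project a starting population one time step forward
n0 <- c(10, 5)
n1 <- as.vector(A %*% n0)           # 15 juveniles, 5 adults
```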
simecol: allows ecological models (ODEs, IBMs, ...) to be implemented using a
template-like object-oriented structure. It helps to organize scenarios and may
also be useful for other areas.
demogR: Analysis of age-structured demographic models
R2WinBUGS: Running WinBUGS and
OpenBUGS from R / S-PLUS
The R2WinBUGS package provides convenient
functions to call WinBUGS from R.
It automatically writes the data and scripts in a
format readable by WinBUGS for processing in
batch mode, which is possible since version 1.4.
After the WinBUGS process has finished, it is
possible either to read the resulting data into R
with the package itself, which gives a compact
graphical summary of inference and
convergence diagnostics, or to use the facilities
of the coda package for further analyses of the
output.