第九届生物多样性会议(2010-11-4, 福建, 厦门) An brief introduction to R , Package and Graphics used in biodiversity research Kechang NIU (牛克昌) Department of Ecology , Peking University Outline What this is Section 1 >-A short, highly incomplete tour around some of the basic concepts of R as a programming language Section 2 >-Some hints on how to obtain documentation on the many library functions (packages) Section 3>-A brief introduction of R graphics Section 1 R, S and S-plus S: an interactive environment for data analysis developed at Bell Laboratories since 1976 1988 - S2: RA Becker, JM Chambers, A Wilks 1992 - S3: JM Chambers, TJ Hastie 1998 - S4: JM Chambers Exclusively licensed by AT&T/Lucent to Insightful Corporation, Seattle WA. Product name: “S-plus”. Implementation languages C, Fortran. See: http://cm.bell-labs.com/cm/ms/departments/sia/S/history.html Section 1 R, S and S-plus R: initially written by Ross Ihaka and Robert Gentleman at Dep. of Statistics of U of Auckland, New Zealand during 1990s. Since 1997: international “R-core” team of ca. 15 people with access to common CVS archive. GNU General Public License (GPL) - can be used by anyone for any purpose - contagious Open Source -quality control! -efficient bug tracking and fixing system supported by the user community Section 1 What R is What R is not o data handling and storage: numeric, textual o is not a database, but connects to DBMSs o matrix algebra o has no graphical user interfaces, but connects to Java, TclTk o hash tables and regular expressions o high-level data analytic and statistical functions o classes (“OO”) o graphics o programming language: loops, branching, subroutines o language interpreter can be very slow, but allows to call own C/C++ code o no spreadsheet view of data, but connects to Excel/MsOffice o no professional / commercial support Section 1 R Packaging and statistics o Packaging: a crucial infrastructure to efficiently produce, load and keep consistent software libraries from (many) different sources / authors o Statistics: most packages deal with statistics and data analysis o State of the art: many statistical researchers provide their methods as R packages Section 1 R as a calculator > seq(0, 5, length=6) 0.0 -0.5 [1] 1.414214 -1.0 > sqrt(2) sin(seq(0, 2 * pi, length = 100)) [1] 5 0.5 1.0 > log2(32) 0 20 40 [1] 0 1 2 3 4 5 > plot(sin(seq(0, 2*pi, length=100))) 60 Index 80 100 Section 1 Variables > a = 49 > sqrt(a) [1] 7 > a = "The dog ate my homework" > sub("dog","cat",a) [1] "The cat ate my homework“ > a = (1+1==3) >a [1] FALSE Numeric Character string Logical Section 1 Missing values Variables of each data type (numeric, character, logical) can also take the value NA: not available. >- NA is not the same as 0 >- NA is not the same as “” >- NA is not the same as FALSE Any operations (calculations, comparisons) that involve NA > NA | TRUE may or may not produce NA: [1] TRUE > NA==1 > NA & TRUE [1] NA [1] NA > 1+NA [1] NA > max(c(NA, 4, 7)) [1] NA Section 1 Functions and Operators Functions do things with data “Input”: function arguments (0,1,2,…) “Output”: function result (exactly one) Example: add = function(a,b) { result = a+b return(result) } Operators: Short-cut writing for frequently used functions of one or two arguments. Examples: + - * / ! & | %% Section 1 Vectors and Matrices Vector: an ordered collection of data of the same type > a = c(1,2,3) > a*2 [1] 2 4 6 Matrix: a rectangular table of data of the same type Example: the expression values for 10000 genes for 30 tissue biopsies: a matrix with 10000 rows and 30 columns. Section 1 Data frames Data frame: is supposed to represent the typical data table that researchers come up with – like a spreadsheet. It is a rectangular table with rows and columns; data within each column has the same type (e.g. number, text, logical), but different columns may have different types. And each row on one case Example: >a localisation tumorsize progress XX348 proximal 6.3 FALSE XX234 distal 8.0 TRUE XX987 proximal 10.0 FALSE Section 1 Loops When the same or similar tasks need to be performed multiple times; for all elements of a list; for all columns of an array; etc. for(i in 1:10) { print(i*i) } i=1 while(i<=10) { print(i*i) i=i+sqrt(i) } Section 1 lapply, sapply, apply When the same or similar tasks need to be performed multiple times for all elements of a list or for all columns of an array. May be easier and faster than “for” loops lapply( li, fct ) To each element of the list li, the function fct is applied. The result is a list whose elements are the individual fct results. > > > > > > > > li = list("klaus","martin","georg") lapply(li, toupper) [[1]] [1] "KLAUS" [[2]] [1] "MARTIN" [[3]] [1] "GEORG" Section 1 Regular expressions A tool for text matching and replacement which is available in similar forms in many programming languages (Perl, Unix shells, Java) > a = c("CENP-F","Ly-9", "MLN50", "ZNF191", "CLH-17") > grep("L", a) [1] 2 3 5 > grep("L", a, value=T) [1] "Ly-9" "MLN50" "CLH-17" > grep("^L", a, value=T) [1] "Ly-9" > grep("[0-9]", a, value=T) [1] "Ly-9" "MLN50" "ZNF191" "CLH-17" > gsub("[0-9]", "X", a) [1] "CENP-F" "Ly-X" "MLNXX" "ZNFXXX" "CLH-XX" Section 1 Storing data Every R object can be stored into and restored from a file with the commands “save” and “load”. This uses the XDR (external data representation) standard of Sun Microsystems and others, and is portable between MS-Windows, Unix, Mac. > save(x, file=“x.Rdata”) > load(“x.Rdata”) Section 1 Importing and exporting data There are many ways to get data into R and out of R. Most programs (e.g. Excel), as well as humans, know how to deal with rectangular tables in the form of tabdelimited text files. > x = read.delim(“filename.txt”) also: read.table, read.csv > write.table(x, file=“x.txt”, sep=“\t”) Section 1 Statistical functions t.test() Student (test t), determines if the averages of two populations are statistically different. prop.test() hypothesis tests with proportions Non-parametrical tests kruskal.test() Kruskal-Wallis test (variance analysis) chisq.test() 2 test for convergence ks.test() Kolmogorov-Smirnov test Section 1 lm() > pisa.lm <- lm(inclin ~ year) > pisa.lm > summary(pisa.lm) Call: lm(formula = inclin ~ year) Residuals: Min 1Q Median 3Q Max -5.9670 -3.0989 0.6703 2.3077 7.3956 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) -61.1209 25.1298 -2.432 0.0333 * year 9.3187 0.3099 30.069 6.5e-12 *** --Signif. codes: 0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1 Residual standard error: 4.181 on 11 degrees of freedom Multiple R-Squared: 0.988, Adjusted R-squared: 0.9869 F-statistic: 904.1 on 1 and 11 DF, p-value: 6.503e-012 Section 1 Getting help Details about a specific command whose name you know (input arguments, options, algorithm, results): >? t.test or >help(t.test) Section 1 Getting help o HTML search engine o search for topics with regular expressions: “help.search” Web sites Section 1 www.r-project.org cran.r-project.org www.bioconductor.org Full text search: www.r-project.org or www.google.com with ‘… site:.r-project.org’ or other R-specific keywords Section 2 R Packages • There are many contributed packages that can be used to extend R. • These libraries are created and maintained by the authors. Section 2 Installing R packages from CRAN • install.packages(‘packageName’) • OR in Rgui: – select a local repository (if needed) – select package(s) from list Section 2 Ade4>-Analysis of Ecological Data : Exploratory and Euclidean methods in Environmental sciences http://pbil.univlyon1.fr/ADE4/home.php?lang=eng ade4 is characterized by : -the implementation of graphical and statistical functions - the availability of numerical data - the redaction of technical and thematic documentation - the inclusion of bibliographic references Section 2 Vegan: R functions for vegetation ecologists Vegan: R Labs for Vegetation Ecologists http://ecology.msu.montana.edu/labdsv/R/labs/ •Familiarization with Data Lab 1 Loading Vegetation Data and Simple Graphical Data Summaries Lab 2 Loading Site/Environment Data and Simple Graphical Summaries Lab 3 Vegetation Tables and Summaries •Modeling Species Distributions Lab 4 Modeling Species Distributions with Generalized Linear Models Lab 5 Modeling Species Distributions with Generalized Additive Models Lab 6 Modeling Species Distributions with Classification Trees •Ordination Lab 7 Principal Components Analysis Lab 8 Principal Coordinates Analysis Lab 9 Nonmetric Multi-Dimensional Scaling Lab 10 Correspondence Analysis and Detrended Corresponence Analysis Lab 11 Fuzzy Set Ordination Lab 12 Canonical Correspondence Analysis •Cluster Analysis Lab 13 Cluster Analysis Lab 14 Discriminant Analysis with Tree Classifiers •Introduction R for Ecologists, a primer on the S language and available software Analysis of Phylogenetics and Evolution http://ape.mpl.ird.fr/ •ape is not a classical software (i.e., a "black box" though it can be used as one) but an environment for data analyses and development of new methods where interactivity and user's decisions are left an important place. •ape is written in R, but some functions for computer-intensive tasks are written in C. •ape is available for all main computer operating systems (Linux, Unix, Windows, MacOS X 10.4 and later). •ape is distributed under the terms of the GNU General Public Licence, meaning that it is free and can be freely modified and redistribued (under some conditions). smatr:(Standardised) Major Axis Estimation and Testing Routines This package provides methods of fitting bivariate lines in allometry using the major axis (MA) or standardised major axis (SMA), and for making inferences about such lines. BiodiversityR: GUI for biodiversity and community ecology analysis This package provides a GUI and some utility functions (often based on the vegan package) for statistical analysis of biodiversity and ecological communities, • species accumulation curves, • diversity indices, Renyi profiles, • GLMs for analysis of species abundance and presenceabsence • distance matrices, • Mantel tests, and cluster, constrained and unconstrained ordination analysis. A book on biodiversity and community ecology analysis is available for free download from the website. untb: ecological drift under the UNTB A collection of utilities for biodiversity data. Includes the simulation of ecological drift under Hubbell's Unified Neutral Theory of Biodiversity, and the calculation of various diagnostics such as Preston curves. Functions (34) alonso Various functions from Alonso and McKane bci Barro Colorado Island (BCI) dataset butterflies abundance data for butterflies etienne Etienne's sampling formula expected.abundance Expected abundances under the neutral model plot.count Abundance curves rand.neutral Random neutral ecosystem volkov Expected frequency of species zsm Zero sum multinomial distribution as derived by McKane 2004 sem: Structural Equation Models sem is an R package for fitting structural-equation models. The package supports general structural equation models with latent varibles, fit by maximum likelihood assuming multinormality, and single-equation estimation for observed-variable models by two-stage least.squares 1 2 x1 x2 x11 1 x 21 31 3 4 x3 x4 5 x5 6 x6 x32 21 12 2 x42 x53 x63 1 11 22 32 23 3 1 2 21 y11 y1 1 y21 y2 2 y32 y3 3 y42 y4 4 2 R2WinBUGS: Running WinBUGS and OpenBUGS from R / S-PLUS The R2WinBUGS package provides convenient functions to call WinBUGS from R. It automatically writes the data and scripts in a format readable by WinBUGS for processing in batch mode, which is possible since version 1.4. After the WinBUGS process has nished, it is possible either to read the resulting data into R by the package itself which gives a compact graphical summary of inference and convergence diagnostics or to use the facilities of the coda package for further analyses of the output. Others bio.infer: Maximum Likelihood Method for Predicting Environmental Conditions from Assemblage Composition popbio: Construction and analysis of matrix population models Construct and analyze projection matrix models from a demography study of marked individuals classified by age or stage. The package covers methods described in Matrix Population Models by Caswell (2001) and Quantitative Conservation Biology by Morris and Doak (2002). simecol allows to implement ecological models (ODEs, IBMs, ...) using a template-like object-oriented stucture. It helps to organize scenarios and may also be useful for other areas. demogR: Analysis of age-structured demographic models Section 3 Making Elegant Graphics in R Graphical capabilities 100 120 Simple, exploratory graphics are easy to produce. 0 20 40 60 dist 80 Publication quality graphics can be created. Several device drivers are available including: 5 15 20 25 speed 8 6 4 10 2 0 Postscript, 5 -2 -10 0 -5 0 -5 X 5 10 -10 Y Z On-screen graphics 10 Section 3 Types of graphics functions High level functions >-Plot graphs from data Low level functions >-Add to/Modify existing graphs Graphics parameters >-Control how plots are displayed Section 3 Some and high level functions • • • • • • • • Low points() lines() abline() text() mtext() title() legend() etc High • • • • plot() hist() barplot() boxplot() Section 3 Graphical parameters • list of parameters held in par() • controls how plots are displayed • values in par() can be over-ridden by calls to high-level functions – values are changed temporarily for plot – no permanent change to par() values • can be modified directly Section 3 Plotting single variables: histogram • First generate data rrnorm takes n random samples from a normal distribution Can specify mean and sd, defaults are (0,1) x<-rnorm(100) • Plot histogram: Histogram of x hist(x) 10 5 0 Frequency 15 20 hist is a generic function to plot histograms of a dataset -2 -1 0 x 1 2 Section 3 Plotting single variables: boxplot -1 -2 • Add title title(main="Boxplot of x1") 0 1 • boxplot function boxplot(x1) 2 Boxplot of x1 Section 3 3 0 -1 -2 -3 boxplot(x1,x2,x3) title(main=“three boxplots”) OR bp3<data.frame(x1,x2,x3) boxplot(bp3) 1 2 Generate more data x2<-rnorm(100) x3<-rnorm(100,2) 4 Plotting single variables: boxplot 1 2 3 Section 3 • barplot(tN,col=rainbo w(20) 15 10 5 • barplot(tN) 0 • tN <- table(Ni <rpois(100, lambda=5)) 20 Plotting single variables: barplot 0 1 2 3 4 5 6 7 8 9 11 12 Section 3 Barplots for grouped data • Using the VADeaths table from the datasets package: • >-library(datasets); data(VADeaths); VADeaths • Rural Male Rural Female Urban Male Urban Female 50-54 11.7 8.7 15.4 8.4 55-59 18.1 11.7 24.3 13.6 60-64 26.9 20.3 37.0 19.3 65-69 41.0 30.9 54.6 35.1 70-74 66.0 54.3 71.1 50.0 Section 3 Barplots for grouped data 60 50-54 55-59 60-64 65-69 70-74 0 0 10 20 50 30 40 100 50 150 70-74 65-69 60-64 55-59 50-54 70 200 barplot(VADeaths, col=rainbow(5 barplot(VADeaths, col=rainbow(5),legend=T) legend=T,beside=T) Rural Male Rural Female Urban Male Urban Female Rural Male Rural Female Urban Male Urban Female Section 3 Plotting single variables: scatterplot function (n=20) {x1<-rnorm(n) par(mfrow=c(3,2)) plot(x1,type="p",main="type p: points") plot(x1,type="l",main="type l: lines") plot(x1,type="b",main="type b: both") plot(x1,type="o",main="type o: overplot") plot(x1,type="h",main="type h: histogram") plot(x1,type="s",main="type s: steps") par(mfrow=c(1,1))} Section 3 Plotting two variables: scatterplot • Example data: “women” from the datasets package . Heights vs weights for America women To get data into workspace: >library(datasets);data(women); attach(women) 140 130 plot(women$height,women$weight) 120 plot(weight~height) weight 150 160 plot(x=height,y=weight, data=women) 58 60 62 64 66 height 68 70 72 Section 3 Adding to plots: lines • Straight lines added using “abline” function abline(v=x) >-draws a vertical line at value x abline(v=y) >-draws a horizontal line at value y abline(a=x,b=m)>- draws a line with intercept x and gradient m • lines() adds a series of connected lines – takes vectors of x and y co-ordinates and draws lines between them • segments() draws (disjoint) lines segments between pairs of points Section 3 Adding to plots: lines 100 80 60 40 20 abline(cars.lm, col="red“,lty=2) 0 cars.lm<-lm(dist~speed) dist library(datasets) data(cars) attach(cars) plot(dist~speed) 120 • “cars” from the datasets package has speed and stopping distances • draw scatter plot and add regression line 5 10 15 speed 20 25 Section 3 Adding to plots: lines and text • • • • • • • • • • • • • • function (df=1,p=0.05) {x<-seq(1,50,0.01) y<-dchisq(x,df) plot(x,y,type="l",xlim=c(0,10),ann=F) title(xlab=substitute(paste(chi^2,"stati stic")),ylab="Frequency",main=(substitut e(paste(chi^2,"distribution with",df,"df, p=",p)))) critx<-qchisq(p,df,lower.tail=F) abline(v=critx,col="red") xr<-c(critx,x[x>critx]) yr<-c(dchisq(critx,df),y[x>critx]) polygon(x=c(critx,critx,xr,10), y=c(0,dchisq(critx,df),yr,0),col="red”) xr<-round(critx,0)+2 yr<-max(y)/4 rx<-round(critx,3) text(xr,yr,substitute(paste("Critical ",chi^2,"=",rx)),col="red")} Section 3 Adding to plots: margin text • mtext() adds text in the margins of figures • Margins are numbered 1:4, clockwise from the bottom • Text is printed a specified number of lines away from the edge of the plot area • Can print in outer margins if available Section 3 Adding to plots: legends • “Legend” function adds labels to plots • Needs x and y co-ordinates, and a list of labels • Can be argument to high-level function, or called separately legend(x=30,y=0.06, legend=c("Null","Alternative" ,"Threshold"), lty=c(1,1,2), col=c("blue","black","red"),lw d=2) Section 3 Plotting symbols, pch • 25 default plotting characters • Controlled by pch parameter • Add pch = n argument to a high-level function plots points of type n • Argument to pch can be a single character, so pch = “x” will plot letter x • pch can take vectors as arguments, so different points will have different symbols function () {x<-rep(1:5,times=5) ,y<-rep(1:5,each=5) par(xpd=T), plot(x,y,type="p",pch=1:25,cex=3,axes=F,ann=F,bty="n",bg="red") title(main="Data symbols 1:25") text(x,y,labels=1:25,pos=1,offset=1)} Section 3 Plotting symbols: size and width • symbol size controlled by cex – cex=n plots a figure n times the default size • Line width controlled by lwd – lwd=n gives the width of the line function () {x<-rep(1:5,times=5),y<-rep(1:5,each=5) plot(x,y,type="p",pch=3,lwd=x,cex=y,xli m=c(0,5),ylim=c(0,5),bty="n", ann=F) title(main="R plotting symbols: size and width", xlab="Width (lwd)",ylab="Size (cex)")} Section 3 Plotting symbols: colour • Colour of objects controlled by “col”, see colours() for all ; palette() function () {x<-rep(1:5,times=5) y<-rep(1:5,each=5) plot(x,y,type="p",pch=3,lwd=x,ce x=y, col=1:25, xlim=c(0,5),ylim=c(0,5),bty=" n", ann=F) title(main="R plotting symbols: size and width", xlab="Width (lwd)", ylab="Size (cex)")} Section 3 Types of lines, lty function () {x<-1:10 plot(x,y=rep(1,10),ylim=c(1,6.5), type="l",lty=1,axes=F,ann=F) title(main="Line Types") for(i in 2:6){lines(x,y=rep(i,10),lty=i)} axis(2,at=1:6,tick=F,las=1) linenames<paste("Type",1:6,":",c("solid","dashed ","dotted","dotdash","longdash","two dash")) text(2,(1:6)+0.25,labels=linenames) box() } Section 3 Using lattice graphics par(mfrow=c(2,2)) hist(D$wg, main='Histogram',xlab='Weight Gain', ylab ='Frequency', col=heat.colors(14)) boxplot(wg.7a$wg, wg.8a$wg, wg.9a$wg, wg.10a$wg, wg.11a$wg, wg.12p$wg, main='Weight Gain', ylab='Weight Gain (lbs)', xlab='Shift', names = c('7am','8am','9am','10am','11am','12pm ')) plot(D$metmin,D$wg,main='Met Minutes vs. Weight Gain', xlab='Mets (min)',ylab='Weight Gain (lbs)',pch=2) plot(t1,D2$Intel,type="l",main='Closing Stock Prices',xlab='Time',ylab='Price $') lines(t1,D2$DELL,lty=2) Section 3 Grid Graphics Object Oriented - graphical components can be reused and recombined: library(grid); library(lattice) • • • • • • • A standard set of graphical primitives: grid.rect(...) grid.lines(...) grid.polygon(...) grid.circle(...) grid.text(...) library(help = grid) for details 0 0.25 0.5 text 0.75 1 R Package for Elegant Graphics– Lattice, ggplot2 Ggplot2: R Graphics Paul Murrell Ggplot2: Elegant Graphics for Data Analysis (Use R) Hadley Wickham Thanks you ! www.r-project.org