Introduction to the R language Ahmed Rebai Bioinformatics and Comparative Genome Analysis Websites • R www.r-project.org – software; – documentation; – RNews. • Bioconductor www.bioconductor.org – software, data, and documentation; – training materials from short courses; – www.bioconductor.org/workshops/UCSC03/uc sc03.html – mailing list. R language: Overview • Open source and open development. • Design and deployment of portable, extensible, and scalable software. • Interoperability with other languages: C, XML. • Variety of statistical and numerical methods. • High quality visualization and graphics tools. • Effective, extensible user interface. • Innovative tools for producing documentation and training materials: vignettes. • Supports the creation, testing, and distribution of software and data modules: packages. R user interface • Batch or command line processing bash$ R R> q() to start to quit • Graphics windows > X11() > postscript() > dev.off() • File path is relative to working directory > getwd() > setwd() • Load a package library with library() • GUIs, tcltk Getting Help o Details about a specific command whose name you know (input arguments, options, algorithm): > ? t.test > help(t.test) o See an example of usage: > demo(graphics) > example(mean) mean> x <- c(0:10, 50) mean> xm <- mean(x) mean> c(xm, mean(x, trim = 0.1)) [1] 8.75 5.50 Getting Help o HTML search engine lets you search for topics with regular expressions: > help.search o Find commands containing a regular expression or object name: > apropos("var") [1] "var.na" ".__M__varLabels:Biobase" [3] "varLabels" "var.test" [5] "varimax" "all.vars" [7] "var" "variable.names" [9] "variable.names.default" "variable.names.lm" Getting Help o Vignettes contain text and executable code: > library(tkWidgets) > vExplorer() > openVignette() Created using the Sweave() function. .Rnw files produce a PDF file and a vignette. o To see code for a function, type the name with no parentheses or arguments: > plot R as a Calculator > log2(32) [1] 5 > print(sqrt(2)) [1] 1.414214 > pi [1] 3.141593 > seq(0, 5, length=6) [1] 0 1 2 3 4 5 > 1+1:10 [1] 2 3 4 5 6 7 8 9 10 11 R as a Graphics Tool 0.5 0.0 -0.5 -1.0 sin(seq(0, 2 * pi, length = 100)) 1.0 > plot(sin(seq(0, 2*pi, length=100))) 0 20 40 60 Index 80 100 Variables > a <- 49 > sqrt(a) [1] 7 numeric > b <- "The dog ate my homework" character > sub("dog","cat",b) string [1] "The cat ate my homework" > c <- (1+1==3) > c [1] FALSE > as.character(b) [1] "FALSE" logical Missing Values Variables of each data type (numeric, character, logical) can also take the value NA: not available. o NA is not the same as 0 o NA is not the same as “” o NA is not the same as FALSE o NA is not the same as NULL Operations that involve NA may or may not produce NA: > NA==1 [1] NA > 1+NA [1] NA > max(c(NA, 4, 7)) [1] NA > max(c(NA, 4, 7), na.rm=T) [1] 7 > NA | TRUE [1] TRUE > NA & TRUE [1] NA Vectors vector: an ordered collection of data of the same type > a <- c(1,2,3) > a*2 [1] 2 4 6 Example: the mean spot intensities of all 15488 spots on a microarray is a numeric vector In R, a single number is the special case of a vector with 1 element. Other vector types: character strings, logical Matrices and Arrays matrix: rectangular table of data of the same type Example: the expression values for 10000 genes for 30 tissue biopsies is a numeric matrix with 10000 rows and 30 columns. array: 3-,4-,..dimensional matrix Example: the red and green foreground and background values for 20000 spots on 120 arrays is a 4 x 20000 x 120 (3D) array. Lists list: ordered collection of data of arbitrary types. Example: > doe <- list(name="john",age=28,married=F) > doe$name [1] "john“ > doe$age [1] 28 > doe[[3]] [1] FALSE Typically, vector elements are accessed by their index (an integer) and list elements by $name (a character string). But both types support both access methods. Slots are accessed by @name. Data Frames data frame: rectangular table with rows and columns; data within each column has the same type (e.g. number, text, logical), but different columns may have different types. Represents the typical data table that researchers come up with – like a spreadsheet. Example: > a <data.frame(localization,tumorsize,progress,row .names=patients) > a localization tumorsize progress XX348 proximal 6.3 FALSE XX234 distal 8.0 TRUE XX987 proximal 10.0 FALSE What type is my data? class Class from which object inherits (vector, matrix, function, logical, list, … ) mode Numeric, character, logical, … storage.mode Mode used by R to store object typeof is.function is.na names dimnames slotNames attributes (double, integer, character, logical, …) Logical (TRUE if function) Logical (TRUE if missing) Names associated with object Names for each dim of array Names of slots of BioC objects Names, class, etc. Subsetting Individual elements of a vector, matrix, array or data frame are accessed with “[ ]” by specifying their index, or their name > a localization tumorsize progress XX348 proximal 6.3 0 XX234 distal 8.0 1 XX987 proximal 10.0 0 > a[3, 2] [1] 10 > a["XX987", "tumorsize"] [1] 10 > a["XX987",] localization tumorsize progress XX987 proximal 10 0 Example: subset rows by a vector of indices >a localization tumorsize progress XX348 proximal 6.3 0 XX234 distal 8.0 1 XX987 proximal 10.0 0 > a[c(1,3),] localization tumorsize progress XX348 proximal 6.3 0 XX987 proximal 10.0 0 > a[-c(1,2),] localization tumorsize progress XX987 proximal 10.0 0 subset rows by a logical vector > a[c(T,F,T),] localization tumorsize progress XX348 proximal 6.3 0 XX987 proximal 10.0 0 subset columns > a$localization [1] "proximal" "distal" comparison resulting in logical vector > a$localization=="proximal" [1] TRUE FALSE TRUE subset the selected rows > a[ a$localization=="proximal", ] localization tumorsize progress XX348 proximal 6.3 0 XX987 proximal 10.0 0 "proximal" Functions and Operators Functions do things with data “Input”: function arguments (0,1,2,…) “Output”: function result (exactly one) Example: add <- function(a,b) { result <- a+b return(result) } Operators: Short-cut writing for frequently used functions of one or two arguments. Frequently used operators <+ * / ^ %% %*% %/% %in% Assign Sum Difference Multiplication Division Exponent Mod Dot product Integer division Subset | & < > <= >= ! != == Or And Less Greater Less or = Greater or = Not Not equal Is equal Frequently used functions c Concatenate cbind, rbind min max length dim floor which table Concatenate vectors Minimum Maximum # values # rows, cols Max integer in TRUE indices Counts summary Sort, order, rank print cat paste round apply Generic stats Sort, order, rank a vector Show value Print as char c() as char Round Repeat over rows, cols Statistical functions rnorm, dnorm, pnorm, qnorm Normal distribution random sample, density, cdf and quantiles lm, glm, anova Model fitting loess, lowess Smooth curve fitting sample Resampling (bootstrap, permutation) .Random.seed Random number generation mean, median Location statistics var, cor, cov, mad, range Scale statistics svd, qr, chol, eigen Linear algebra Graphical functions plot Generic plot eg: scatter points Add points lines, abline Add lines text, mtext Add text legend Add a legend axis Add axes box Add box around all axes par Plotting parameters (lots!) colors, palette Use colors Branching if (logical expression) { statements } else { alternative statements } else branch is optional { } are optional with one statement ifelse (logical expression, yes statement, no statement) Loops When the same or similar tasks need to be performed multiple times; for all elements of a list; for all columns of an array; etc. for(i in 1:10) { print(i*i) } i<-1 while(i<=10) { print(i*i) i<-i+sqrt(i) } Also: repeat, break, next Regular Expressions Tools for text matching and replacement which are available in similar forms in many programming languages (Perl, Unix shells, Java) > a <- c("CENP-F","Ly-9", "MLN50", "ZNF191", "CLH-17") > grep("L", a) [1] 2 3 5 > grep("L", a, value=T) [1] "Ly-9" "MLN50" "CLH-17" > grep("^L", a, value=T) [1] "Ly-9" > grep("[0-9]", a, value=T) [1] "Ly-9" "MLN50" "ZNF191" "CLH-17" > gsub("[0-9]", "X", a) [1] "CENP-F" "Ly-X" "MLNXX" "ZNFXXX" "CLH-XX" Storing Data Every R object can be stored into and restored from a file with the commands “save” and “load”. This uses the XDR (external data representation) standard of Sun Microsystems and others, and is portable between MSWindows, Unix, Mac. > save(x, file=“x.Rdata”) > load(“x.Rdata”) Importing and Exporting Data There are many ways to get data in and out. Most programs (e.g. Excel), as well as humans, know how to deal with rectangular tables in the form of tab-delimited text files. > x <- read.delim(“filename.txt”) Also: read.table, read.csv, scan > write.table(x, file=“x.txt”, sep=“\t”) Also: write.matrix, write Importing Data: caveats Type conversions: by default, the read functions try to guess and auto convert the data types of the different columns (e.g. number, factor, character). There are options as.is and colClasses to control this. Special characters: the delimiter character (space, comma, tabulator) and the end-of-line character cannot be part of a data field. To circumvent this, text may be “quoted”. However, if this option is used (the default), then the quote characters themselves cannot be part of a data field. Except if they themselves are within quotes… Understand the conventions your input files use and set the quote options accordingly. Bioconductor • R software project for the analysis and comprehension of biomedical and genomic data. – Gene expression arrays (cDNA, Affymetrix) – Pathway graphs – Genome sequence data • Started in 2001 by Robert Gentleman, Dana Farber Cancer Institute. • About 25 core developers, at various institutions in the US and Europe. • Tools for integrating biological metadata from the web (annotation, literature) in the analysis of experimental metadata. • End-user and developer packages. Example R/BioC Packages methods Class/method tools tools tkWidgets marrayTools, marrayPlots Sweave(),gui tools affy Affymetrix array analysis annotate Link microarray data to metadata on the web Clustering and classification Spotted cDNA array analysis mva, cluster, clust, class t.test, prop.test, Statistical tests wilcox.test