Introduction to the R language

advertisement
Introduction to the R language
Ahmed Rebai
Bioinformatics and Comparative Genome Analysis
Websites
• R www.r-project.org
– software;
– documentation;
– RNews.
• Bioconductor www.bioconductor.org
– software, data, and documentation;
– training materials from short courses;
– www.bioconductor.org/workshops/UCSC03/uc
sc03.html
– mailing list.
R language: Overview
• Open source and open development.
• Design and deployment of portable, extensible,
and scalable software.
• Interoperability with other languages: C, XML.
• Variety of statistical and numerical methods.
• High quality visualization and graphics tools.
• Effective, extensible user interface.
• Innovative tools for producing documentation
and training materials: vignettes.
• Supports the creation, testing, and distribution of
software and data modules: packages.
R user interface
• Batch or command line processing
bash$ R
R> q()
to start
to quit
• Graphics windows
> X11()
> postscript()
> dev.off()
• File path is relative to working directory
> getwd()
> setwd()
• Load a package library with library()
• GUIs, tcltk
Getting Help
o Details about a specific command whose name
you know (input arguments, options, algorithm):
> ? t.test
> help(t.test)
o See an example of usage:
> demo(graphics)
> example(mean)
mean> x <- c(0:10, 50)
mean> xm <- mean(x)
mean> c(xm, mean(x, trim = 0.1))
[1] 8.75 5.50
Getting Help
o HTML search engine lets you search for topics
with regular expressions:
> help.search
o Find commands containing a regular expression
or object name:
> apropos("var")
[1] "var.na"
".__M__varLabels:Biobase"
[3] "varLabels"
"var.test"
[5] "varimax"
"all.vars"
[7] "var"
"variable.names"
[9] "variable.names.default" "variable.names.lm"
Getting Help
o Vignettes contain text and executable code:
> library(tkWidgets)
> vExplorer()
> openVignette()
Created using the Sweave() function.
.Rnw files produce a PDF file and a vignette.
o To see code for a function, type the name
with no parentheses or arguments:
> plot
R as a Calculator
> log2(32)
[1] 5
> print(sqrt(2))
[1] 1.414214
> pi
[1] 3.141593
> seq(0, 5, length=6)
[1] 0 1 2 3 4 5
> 1+1:10
[1] 2 3
4
5
6
7
8
9 10 11
R as a Graphics Tool
0.5
0.0
-0.5
-1.0
sin(seq(0, 2 * pi, length = 100))
1.0
> plot(sin(seq(0, 2*pi, length=100)))
0
20
40
60
Index
80
100
Variables
> a <- 49
> sqrt(a)
[1] 7
numeric
> b <- "The dog ate my homework"
character
> sub("dog","cat",b)
string
[1] "The cat ate my homework"
> c <- (1+1==3)
> c
[1] FALSE
> as.character(b)
[1] "FALSE"
logical
Missing Values
Variables of each data type (numeric, character, logical)
can also take the value NA: not available.
o NA is not the same as 0
o NA is not the same as “”
o NA is not the same as FALSE
o NA is not the same as NULL
Operations that involve NA may or may not produce NA:
> NA==1
[1] NA
> 1+NA
[1] NA
> max(c(NA, 4, 7))
[1] NA
> max(c(NA, 4, 7), na.rm=T)
[1] 7
> NA | TRUE
[1] TRUE
> NA & TRUE
[1] NA
Vectors
vector: an ordered collection of data of the
same type
> a <- c(1,2,3)
> a*2
[1] 2 4 6
Example: the mean spot intensities of all 15488
spots on a microarray is a numeric vector
In R, a single number is the special case of a
vector with 1 element.
Other vector types: character strings, logical
Matrices and Arrays
matrix: rectangular table of data of the same
type
Example: the expression values for 10000
genes for 30 tissue biopsies is a numeric
matrix with 10000 rows and 30 columns.
array: 3-,4-,..dimensional matrix
Example: the red and green foreground and
background values for 20000 spots on 120
arrays is a 4 x 20000 x 120 (3D) array.
Lists
list: ordered collection of data of arbitrary types.
Example:
> doe <- list(name="john",age=28,married=F)
> doe$name
[1] "john“
> doe$age
[1] 28
> doe[[3]]
[1] FALSE
Typically, vector elements are accessed by their
index (an integer) and list elements by $name (a
character string). But both types support both
access methods. Slots are accessed by @name.
Data Frames
data frame: rectangular table with rows and columns; data
within each column has the same type (e.g. number, text,
logical), but different columns may have different types.
Represents the typical data table that researchers come up
with – like a spreadsheet.
Example:
> a <data.frame(localization,tumorsize,progress,row
.names=patients)
> a
localization tumorsize progress
XX348
proximal
6.3
FALSE
XX234
distal
8.0
TRUE
XX987
proximal
10.0
FALSE
What type is my data?
class
Class from which object inherits
(vector, matrix, function, logical, list, … )
mode
Numeric, character, logical, …
storage.mode Mode used by R to store object
typeof
is.function
is.na
names
dimnames
slotNames
attributes
(double, integer, character, logical, …)
Logical (TRUE if function)
Logical (TRUE if missing)
Names associated with object
Names for each dim of array
Names of slots of BioC objects
Names, class, etc.
Subsetting
Individual elements of a vector, matrix, array or data frame are
accessed with “[ ]” by specifying their index, or their name
> a
localization tumorsize progress
XX348
proximal
6.3
0
XX234
distal
8.0
1
XX987
proximal
10.0
0
> a[3, 2]
[1] 10
> a["XX987", "tumorsize"]
[1] 10
> a["XX987",]
localization tumorsize progress
XX987
proximal
10
0
Example:
subset rows by a
vector of indices
>a
localization tumorsize progress
XX348
proximal
6.3
0
XX234
distal
8.0
1
XX987
proximal
10.0
0
> a[c(1,3),]
localization tumorsize progress
XX348
proximal
6.3
0
XX987
proximal
10.0
0
> a[-c(1,2),]
localization tumorsize progress
XX987
proximal
10.0
0
subset rows by a
logical vector
> a[c(T,F,T),]
localization tumorsize progress
XX348
proximal
6.3
0
XX987
proximal
10.0
0
subset columns
> a$localization
[1] "proximal" "distal"
comparison resulting
in logical vector
> a$localization=="proximal"
[1] TRUE FALSE TRUE
subset the
selected rows
> a[ a$localization=="proximal", ]
localization tumorsize progress
XX348
proximal
6.3
0
XX987
proximal
10.0
0
"proximal"
Functions and Operators
Functions do things with data
“Input”: function arguments (0,1,2,…)
“Output”: function result (exactly one)
Example:
add <- function(a,b) {
result <- a+b
return(result)
}
Operators: Short-cut writing for frequently
used functions of one or two arguments.
Frequently used operators
<+
*
/
^
%%
%*%
%/%
%in%
Assign
Sum
Difference
Multiplication
Division
Exponent
Mod
Dot product
Integer division
Subset
|
&
<
>
<=
>=
!
!=
==
Or
And
Less
Greater
Less or =
Greater or =
Not
Not equal
Is equal
Frequently used functions
c
Concatenate
cbind,
rbind
min
max
length
dim
floor
which
table
Concatenate
vectors
Minimum
Maximum
# values
# rows, cols
Max integer in
TRUE indices
Counts
summary
Sort,
order,
rank
print
cat
paste
round
apply
Generic stats
Sort, order,
rank a vector
Show value
Print as char
c() as char
Round
Repeat over
rows, cols
Statistical functions
rnorm, dnorm,
pnorm, qnorm
Normal distribution random sample,
density, cdf and quantiles
lm, glm, anova
Model fitting
loess, lowess
Smooth curve fitting
sample
Resampling (bootstrap, permutation)
.Random.seed
Random number generation
mean, median
Location statistics
var, cor, cov,
mad, range
Scale statistics
svd, qr, chol,
eigen
Linear algebra
Graphical functions
plot
Generic plot eg: scatter
points
Add points
lines, abline
Add lines
text, mtext
Add text
legend
Add a legend
axis
Add axes
box
Add box around all axes
par
Plotting parameters (lots!)
colors, palette Use colors
Branching
if (logical expression) {
statements
}
else {
alternative statements
}
else branch is optional
{ } are optional with one statement
ifelse (logical expression, yes
statement, no statement)
Loops
When the same or similar tasks need to be
performed multiple times; for all elements of a
list; for all columns of an array; etc.
for(i in 1:10) {
print(i*i)
}
i<-1
while(i<=10) {
print(i*i)
i<-i+sqrt(i)
}
Also: repeat, break, next
Regular Expressions
Tools for text matching and replacement which are available in similar
forms in many programming languages (Perl, Unix shells, Java)
> a <- c("CENP-F","Ly-9", "MLN50", "ZNF191", "CLH-17")
> grep("L", a)
[1] 2 3 5
> grep("L", a, value=T)
[1] "Ly-9"
"MLN50" "CLH-17"
> grep("^L", a, value=T)
[1] "Ly-9"
> grep("[0-9]", a, value=T)
[1] "Ly-9"
"MLN50" "ZNF191" "CLH-17"
> gsub("[0-9]", "X", a)
[1] "CENP-F" "Ly-X"
"MLNXX"
"ZNFXXX" "CLH-XX"
Storing Data
Every R object can be stored into and restored
from a file with the commands
“save” and “load”.
This uses the XDR (external data
representation) standard of Sun Microsystems
and others, and is portable between MSWindows, Unix, Mac.
> save(x, file=“x.Rdata”)
> load(“x.Rdata”)
Importing and Exporting Data
There are many ways to get data in and out.
Most programs (e.g. Excel), as well as humans,
know how to deal with rectangular tables in the
form of tab-delimited text files.
> x <- read.delim(“filename.txt”)
Also: read.table, read.csv, scan
> write.table(x, file=“x.txt”, sep=“\t”)
Also: write.matrix, write
Importing Data: caveats
Type conversions: by default, the read functions try to
guess and auto convert the data types of the different
columns (e.g. number, factor, character). There are
options as.is and colClasses to control this.
Special characters: the delimiter character (space,
comma, tabulator) and the end-of-line character cannot
be part of a data field. To circumvent this, text may be
“quoted”. However, if this option is used (the default),
then the quote characters themselves cannot be part of
a data field. Except if they themselves are within
quotes…
Understand the conventions your input files use and set
the quote options accordingly.
Bioconductor
• R software project for the analysis and
comprehension of biomedical and genomic data.
– Gene expression arrays (cDNA, Affymetrix)
– Pathway graphs
– Genome sequence data
• Started in 2001 by Robert Gentleman, Dana
Farber Cancer Institute.
• About 25 core developers, at various institutions
in the US and Europe.
• Tools for integrating biological metadata from the
web (annotation, literature) in the analysis of
experimental metadata.
• End-user and developer packages.
Example R/BioC Packages
methods
Class/method tools
tools
tkWidgets
marrayTools,
marrayPlots
Sweave(),gui tools
affy
Affymetrix array analysis
annotate
Link microarray data to metadata
on the web
Clustering and classification
Spotted cDNA array analysis
mva, cluster,
clust, class
t.test, prop.test, Statistical tests
wilcox.test
Download