Statistical lingua franca Introduction to R Lecture 2 22 Sept 2015 Kyrylo Bessonov 1 Outline 1. Introduction to R language 1. basics 2. loops 3. data structures 2. Visualization 1. rich plotting options 3. Bioinformatics 1. repositories 2. annotation of biological IDs 2 Definition • “R is a free software environment for statistical computing and graphics”1 • R is considered to be one of the most widely used languages amongst statisticians, data miners, bioinformaticians and others. • R is free implementation of S language • Other commercial statistical packages are SPSS, SAS, MatLab 3 1 R Core Team, R: A Language and Environment for Statistical Computing, Vienna, Austria (http://www.R-project.org/) History • 1993 creation – by Ross Ihaka and Robert Gentleman at the University of Auckland • Community extended via – Packages • Facts – initial interpreter ~ 1000 lines of C code – GNU project 4 R language features • Based on S language (commercial) – Created in 1976 by Bell Labs – Main moto: "to turn ideas into software, quickly and faithfully" (John Chambers) • Memory allocation – Fixed allocation at startup – ‘on-the-fly’ garbage collector – memory hungry • Variables have lexical scoping – functions have access to the variables which were in the effect when the function was defined 5 Lexical scoping The value of the variable is searched in the environment it is being called. If the value is not found, the search is continued in parent environment make.counter <- function(x = 0){ function(){ x <<- x + 1 cat("Inside the make.counter() the x is",x,"and not 5!\n") print(environment()) } } x <- 5 cat("Global value of x is", x , "\n"); print(environment()) counter <- make.counter() counter() print(x) 1) The value of x inside make.counter()is stored even after the function call and belongs to the function environment (i.e. scope) 2) The global value of x is not modified 3) The “inner function” (i.e. function()) can only access the local value of x More details are given at the functions section of the tutorial… 6 R language features • Flexible Plot Layouts – The layout command divides up the area into a matrix. nf <- layout(matrix(c(1,1,0,2), 2, 2, byrow=TRUE), respect=TRUE) – Calling layout.show()results in the layout being displayed for your reference. 7 R language features • A flexible statistical analysis toolkit – Statistical models • Regression, ANOVA, GLM, trees • Free data analysis – No subscription fees – Transparent code • Is a language – Can write complex scripts – Not limited by the GUI • Active community 8 Why to learn R? • R is widely used by bioinformaticians and statisticians • It is multiplatform – iOS, Windows, Linux • Large number of libraries • Main library repositories CRAN and BioConductor 9 In a nut-shell… • R is a scripting language and, as such, is much more easier to learn than other compiled languages such as C • R has reasonably well written documentation (vignettes) • Syntax in R is simple and intuitive if one has basic statistics skills • R scripts will be provided and explained in-class 10 Installation of R 1. go to http://www.r-project.org/ 2. select CRAN (Comprehensive R Archive Network) from left menu 3. link to nearby geographic site 4. select your operating system 5. choose "Base" installation 6. save R-X.X.X-win32.exe (windows) or R-X.X.Xmini.dmg (Mac OS X) 7. run the installation program accepting defaults 11 Tutorial Introduction to R language Part 1: Basics 12 Topics covered in this tutorial • Operators / Variables • Main objects types • Visualization – plot modification functions • Writing and reading data to/from files • Bioinformatics applications – Annotation of array probes 13 Variables/Operators • Variables store one element x <- 25 Here x variable is assigned value 25 • Check value assigned to the variable x >x [1] 25 • Basic mathematical operators : – arithmetic: + , - , * , / – power: ^ • Use parenthesis to obtain desired sequence of mathematical operations 14 Arithmetic operators • What is the value of small z here? >x <- 25 > y <- 15 > z <- (x + y)*2 > Z <- z*z > z [1] 80 15 Logical operators • These operators mostly work on vectors, matrices and other data types • Type of data is not important, the same operators are used for numeric and character data types Operator < <= > >= == != !x x|y x&y Description less than less than or equal to greater than greater than or equal to exactly equal to not equal to Not x x OR y x AND y 16 Logical operators • Can be applied to vectors in the following way. The return value is either True or False > v1 [1] 48 2 3 4 5 > v1 <= 3 [1] FALSE TRUE TRUE FALSE FALSE 17 Key functions • mathematical functions – sqrt, log, exp, sin, cos, tan • simple functions – max, min, length, sum, mean, var, sort • other useful functions – abs(-5) #absolute value – exp(8) # exponentiation – log(exp(8)) # natural logarithm – sqrt(64) # square root 18 R workspace • Display all workplace objects (variables, vectors, etc.) via ls(): >ls() [1] "Z" "v1" "x" "y" "z" • Useful tip: to save “workplace” and restore from a file use: >save.image(file = " workplace.rda") >load(file = "workplace.rda") 19 How to find help info? • Any function in R has help information • To invoke help use ? Sign or help(): ? function_name() ? mean help(mean, try.all.packages=T) • To search in all packages installed in your R installation always use try.all.packages=T in help() • To search for a key word in R documentation use help.search(): help.search("mean") 20 Basic data types • Data could be of 3 basic data types: – numeric – character – logical • Numeric variable type / class: > x <- 1 > mode(x) [1] "numeric" > class(x) [1] "numeric" 21 Basic data types • Logical variable type (True/False): > y <- 3<4 > mode(y) [1] "logical" • Character variable type: > z <- "Hello class" > mode(z) [1] "character" 22 Objects/Data structures • The main data objects in R are: – Matrices (single data type) – Data frames (supports various data types) – Lists (contain set of vectors) – Other more complex objects with slots • S3 and S4 objects • Matrices are 2D objects (rows/columns) > m <- matrix(0,2,3) > m [,1] [,2] [,3] [1,] 0 0 0 [2,] 0 0 0 23 Vectors c() • Vectors have only 1 dimension and represent enumerated sequence of data. They can also store variables > v1 <- c(1, 2, 3, 4, 5) > mean(v1) [1] 3 The elements of a vector are specified /modified with braces (e.g. [number]) > v1[1] <- 48 > v1 [1] 48 2 3 4 5 24 Lists • Lists contain various vectors. Each vector in the list can be accessed by double braces [[number]]. Lists could be of different types > x <- c(1, 2, 3, 4) > y <- c(2, 3, 4) > L1 <- list(x, y) > L1 [[1]] [1] 1 2 3 4 [[2]] [1] 2 3 4 25 Matrices • Contain data of the same type – i.e. all cells are one of the 3 basic types • Defined with matrix() – Arguments: • initialization value (i.e. 0) • ncol: number of columns • nrow: number of rows matrix(0,ncol=2,nrow=2) [,1] [,2] [1,] 0 0 [2,] 0 0 26 Data frames • Data frames are similar to matrices but can contain various data types > x <- c(1,5,10) > y <- c("A", "B", "C") > z <-data.frame(x,y) x y 1 1 A 2 5 B 3 10 C • To get/change column and row names use colnames() and rownames() 27 Factors • Factors are special – could be vectors of integers or characters – can identify levels • i.e. unique variables in a vector > letters = c("A","B","C","A","C","C") - the vector letters has 6 chars but 3 levels > letters = factor(letters) [1] A B C A C C Levels: A B C > summary(letters) A B C 2 1 3 - summary()allows to see - number of variables per level - unique variables (levels) 28 Conversion between data types • One can convert one type of data into another using as.xxx where xxx is a data type 29 S4 objects • Allow to store more complex data strutures • Benefits – Provide greater modularity – Check for data type errors during assignment – Cleaner code • Every S4 objects belongs to a class require(methods) setClass("Box", slots=c(box_name = "character", address = "numeric", files = "list“) ,prototype=list(box_name=NULL, address=NULL, files=list()) ) 30 S4 objects • Each object is an instance of a class object = new("Box") • To access slots in the object use @ operator object@box_name object@address object@files • Assign values to slots object@box_name="pretty_box" 31 S4 object functions • Use functions to get and set values in objects setMethod(f="getBoxName", signature="Box", definition=function(object){ return(object@box_name) }) setMethod(f="setBoxName", signature="Box", definition=function(object,name){ object@box_name <-name }) 32 Read Input/ Write Output • To read data into R from a text file use read.table() – read help(read.table) to learn more – scan() is a more flexible alternative raw_data <-read.table(file="data_file.txt") • To write data into R from a text file use write.table() > write.table(mydata, "data_file.txt") 33 Loops • Loops allow repetition operations – can check given condition – execute n number of times • R is not specifically designed for loops – use them sparsely or replace by apply() – avoid nested loops • Will look at – for loop – while loop – repeat loop 34 For loop • Executed n number of times – safe to use – variable i will take values found in the array (i.e. 1:10) for(i in 1:10){ print(i) } 35 while loop • Will test infinitely for a particular condition until it becomes TRUE i=1; while(i<=10){ print(i) i=i+1 } 36 repeat • similar to the while loop • will always begin the loop – Executed at least once • Important to have a breaking condition – otherwise, infinite loop x <-1 repeat { x <- x+1 #if(x >= 5)break; } print(x) 37 apply() • A faster alternative to loops • Apply function to 2D data structures (matrix, df) apply(X, MARGIN, FUN, ...) X: is matrix or data frame (df) MARGIN: 1-columns and 2-rows FUN: function to apply to rows or columns M <- matrix(1:6, nrow=3, byrow=TRUE) apply(M, 1, sum) apply(M, 2, sum) #columns #rows 38 lapply() • is applied on lists / vectors • it does not work on matrices or higherdimensional arrays directly • returns list() M <- matrix(1:6, nrow=3, byrow=TRUE) lapply(1:2,function(x){sum(M[,x])}) L=list(sample(1:10,10)) lapply(L,mean) 39 sapply() • the "s" in "sapply" stands for "simplify • simplifies output to a vector M <- matrix(1:6, nrow=3, byrow=TRUE) sapply(1:2,function(x){sum(M[,x])}) [1] 9 12 40 tapply() • Apply function to subset data matrix tapply(summary, group, Function) Summary variable: Group variable: Function: selected variable to apply function on the filter variable used to split data function to apply to resulting sets medical.data <data.frame(patient = 1:100, age = rnorm(100, mean = 60, sd = 12), treatment = gl(2, 50, labels = c("Treatment", "Control"))) tapply(medical.data$age, medical.data$treatment, mean) Treatment Control 61.10559 59.37849 41 Conditional statements • allow to make choices – specific parts of the code executed • Add flexibility • Main statements – if … else – switch 42 If..else if (test expression) { #code } else if (test expression) { #code } else () { #code } • else()does not have any test expression – accepts all cases not covered by if()or else if() • test expression returns logical value (T or F) 43 If..else sign <- function(x){ if(x > 0){ print("x is a positive number"); } else if (x == 0){ print("x is zero"); } else{ print("x is negative") } } x<- 5; sign(x); x<- 0; sign(x); x<- -2; sign(x); 44 swich() switch (statement, list) • the statement is evaluated and value returned • based on this value, the corresponding item in the list is returned y <- rnorm(5) x <- "sd" z <- switch(x,"mean"=mean(y),"median"=median(y),"variance"=var(y),"sd"=sd(y)) print(z) sd(y) 45 Tutorial Introduction to R language Part 2: Functions 46 Functions in R • Functions encapsulate chucks of code – Helps with debugging and code readability – Variables passed by ‘pass by value’ add.function<-function(x){ x+5 } > add.function(2) [1] 7 47 Functions in R • • • • defined with the function() directive stored as R objects can be nested The return() directive is the last expression in the function body to be evaluated f <- function() { # Do something interesting return(1) } 48 Arguments • named arguments – optionally could have default values • not every function call in R makes use of all arguments – i.e. function arguments could be missing • then the default values are used mydata <- sample(1:10,20, replace=T) sort(mydata) sort(mydata, decreasing=T) 49 Arguments • can be matched by – relative sequential position: f(arg1, arg2, arg3) rnorm(100,0,1) – name: f(data=arg1, replace=arg2) rnorm(n=100, mean = 0, sd = 1) – can mix positional and by name matching mydata=data.frame(y=rnorm(100),x=rnorm(100)) lm(data = mydata, y ~ x, model = FALSE, 1:100) #equivalently lm(y ~ x, mydata, 1:100, model = FALSE) 50 Lazy evaluation • Arguments to functions are evaluated lazily – i.e. R does not check the # of arguments supplied to function – i.e. can give less but not more args to a function – args are evaluated as needed f <- function(a,b) { a^2 } f(a=10) #no error even if b is missing 51 The … argument • The ... argument indicates – a variable # of arguments – adds flexibility and abstraction • Uknown number of variables to be passed to a function apriori x=1:10; y=rnorm(10) f <- function(x, y, ...) { cat("value of x:",x,"\n"); cat("value of y:",y,"\n"); plot(x,y, ...) } f(x,y) #adding extra argument type=‘l’ f(x,y, type="l") 52 Scoping • Variable value is searched through a series of environments – the global environment – the namespaces of loaded libraries search() [1] ".GlobalEnv" "package:stats" [4] "package:grDevices" "package:utils" [7] "package:methods" "Autoloads" "package:graphics" "package:datasets" "package:base" • The global environment – is the user’s workspace 53 Scoping • Loading a package via library()adds the namespace of that package to position 2 of the search list library(igraph) search() [1] ".GlobalEnv" [4] "package:graphics" [7] "package:datasets" [10] "package:base" "package:igraph" "package:stats" "package:grDevices" "package:utils" "package:methods" "Autoloads" 54 Scoping • The scoping rules for R – different from the original S language – uses lexical scoping • related on how R “searches” for the variable value – simplifies statistical calculations f <- function(x, y) { y=2x+z } – z is a ‘free variable’ not defined in f() 55 Searching for the value • free variable value will be searched – in the environment in which a function was called – In the parent environment – In the top-level environment (i.e. global) – In the empty environment • if value not found – error thrown 56 Scoping • Typically, the user defined values are found – in the workspace (i.e. global environment) – a function is defined in the global environment • can have functions – defined inside other functions – the environment in which a function is defined could be environment of another function • child ο parent environments build.power <- function(n) { p <- function(x) { x^n } print(p) } square <- build.power(2) cube <- build.power(3) 57 Lexical Scoping • With lexical scoping – the variable value is looked up in the environment in which the function was defined y <- 10 f <- function(x) { y <- 2 y^2 } f() [1] 4 58 Other languages • Lexical scoping is supported in – Perl – Python –… 59 Dynamic scoping • With dynamic scoping – The variable value is looked up in the environment from which the function was called (the calling environment) y <- 10 f <- function(x) { y^2 } f() [1] 100 60 Regression and Generalized Linear Models Functions Examples 61 Regression • Regression analysis is used to – explain or model the relationship between • a single variable Y, called the response, output or dependent variable, and • one or more predictor, input, independent or explanatory variables, X1, …, Xp – p=1 simple regression – p>1 multivariate regression 62 Main uses of regression analysis • Prediction of future observations. • Assessment of the effect of, or relationship between, explanatory variables on the response. • A general description of data structure 63 Regression in R • the basic syntax for doing regression in R – lm(Y~model)to fit linear models – glm(Y~model)to fit generalized linear models. • where model could be specified following the general syntax rules in R – consists of predictor terms – provides Y~X relationship 64 Model syntax 65 Simple linear regression • The regression of Y on X is given by π¦π = βπ + β1 π₯π + επ for i = 1, … , p • Unknown parameters – βπ intercept • point in which the line intercepts the y-axis – β1 slope or coefficient • Increase in Y per unit change in X 66 Objective • To find the equation of the line that “best” fits the data. – find βπ and β1 such that the fitted values of Ε·π are as close as possible to observed π¦π – Residuals • The difference between – the observed value π¦π and the fitted value Ε·π » ππ = π¦π − Ε·π 67 residual sum of squares(RSS) • A usual way of calculating βπ and β1 – based on the minimization of the sum of the squared residuals, or • residual sum of squares (RSS) π ππ = π π (ππ )2 = π π (π¦π − Ε·π ) 2 68 Regression example • thuesen dataset – Ventricular shortening velocity – 24 rows and 2 columns • short.velocity (Y) • blood.glucose (X) – It contains ventricular shortening velocity and blood glucose for type 1 diabetic patients. 69 Regression example file="http://www.montefiore.ulg.ac.be/~kbessonov/present_data /GBIO0009-1_TopInBioinf2015-16/lectures/L2/thuesen.txt" data <- read.table(file, header=TRUE, stringsAsFactors=FALSE) options(na.action =na.exclude) fit.lm <- lm(short.velocity ~ blood.glucose, data=data) data.frame(data, fitted.value=fitted(fit.lm), residual=resid(fit.lm)) 1 2 3 4 5 blood.glucose 15.3 10.8 8.1 19.5 7.2 short.velocity fitted.value residual 1.76 1.433841 0.326158532 1.34 1.335010 0.004989882 1.27 1.275711 -0.005711308 1.47 1.526084 -0.056084062 1.27 1.255945 0.014054962 70 Measuring Goodness of Fit • Coefficient of Determination, R2 – Need to calculate TSS, RSS and SSreg • The ANOVA breaks the total variability observed in the sample into two parts – TSS = SSreg + RSS – TSS: total sum of squares (entire sample variability) – SSreg: regression sum of squares • Explained by fitted model – RSS: residual sum of squares (unexplained variability) 71 Calculating R2 anova(fit.lm) Analysis of Variance Table Response: short.velocity Df Sum Sq Mean Sq F value Pr(>F) blood.glucose 1 0.20727 0.207269 4.414 0.0479 * Residuals 21 0.98610 0.046957 --- R2 = SSreg/TSS R2 = 0.20727/(0.98610+0.20727) R2 = 0.1736846 72 Getting lm() summary summary(fit.lm) Call: lm(formula = short.velocity ~ blood.glucose, data = data) Residuals: Min 1Q Median 3Q Max -0.40141 -0.14760 -0.02202 0.03001 0.43490 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 1.09781 0.11748 9.345 6.26e-09 *** blood.glucose 0.02196 0.01045 2.101 0.0479 * --Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Residual standard error: 0.2167 on 21 degrees of freedom (1 observation deleted due to missingness) Multiple R-squared: 0.1737, Adjusted R-squared: 0.1343 F-statistic: 4.414 on 1 and 21 DF, p-value: 0.0479 * - are shorthand for significance levels Estimate - is the value of slope calculated by the regression Std. Error - Measure of the variability in the estimate for the coefficient β t value - measures whether or not the coefficient for this variable is meaningful for the model. It is used to calculate the significance levels. R-squared - Metric for evaluating the goodness of fit of your model. Higher is better with 1 being the best. DF - The Degrees of Freedom is the difference between the number of observations (24) and the number of variables used in your model minus 1 73 (intercept counts as a variable). DF = 24-2-1 = 23 Generalized Linear Models Functions • The glm() – is designed to perform generalized linear models regression on outcome data that is • • • • binary count proportion continuous – overcomes limitation of classical regression models • do not need to transform the response Y to have a normal distribution N(0, σ2) – limitations • assumes linear Y~X relationship 74 glm() • glm(formula, family = "gaussian", data, …) – formula: symbolic description of the model to fit • e.g. y~x1+x2 or y~. or y~x1+x2+x1*x2 – family: type of distribution to apply to the response variable – data: data frame with the variables • glm returns an object with slots • summary(obj)provides compact summary 75 logistic regression example Using data on admissions where rank of 1 represents the highest and 4 the lowest prestige. Determine the major variable/factor impacting the admission decision mydata <- read.csv("http://www.ats.ucla.edu/stat/data/binary.csv") head(mydata) admit gre gpa rank 1 0 380 3.61 3 2 1 660 3.67 3 fit <- glm(admit ~ gre + gpa + rank, family=binomial(link="logit"), data=mydata) summary(fit) Coefficients: Estimate Std. Error z value Pr(>|z|) (Intercept) -3.449548 1.132846 -3.045 0.00233 ** gre 0.002294 0.001092 2.101 0.03564 * gpa 0.777014 0.327484 2.373 0.01766 * rank -0.560031 0.127137 -4.405 1.06e-05 *** AIC: 467.44 (the lower the better) 76 Exploring glm() object attributes(fit) $names [1] "coefficients" [4] "effects" [7] "qr" [10] "deviance" [13] "iter" [16] "df.residual" [19] "converged" [22] "call" [25] "data" [28] "method" "residuals" "R" "family" "aic" "weights" "df.null" "boundary" "formula" "offset" "contrasts" "fitted.values" "rank" "linear.predictors" "null.deviance" "prior.weights" "y" "model" "terms" "control" "xlevels" $class [1] "glm" "lm" 77 Accessing glm S3 object values • Can look at beta coefficients of each variable fit$coefficients • Can check how good is my fitted model fit$deviance fit$aic • Use TAB key for auto-complete 78 Regression in Genetics • Response yj for j=1..n • where case(1) or control(0) • Could refer to two types of patients or phenotypes • Genotypes gi for markers i=1..p • To run glm() need to code data – Additive – Dominant – Recesive 79 Data • There are genotypes for – 1018 individuals at 32 SNP markers – 32 columns give the marker genotypes • AA-11, AG-12 and GG-22, with NA for missing – The genotypes • 890kb region flanking the CYP2D6 gene • associated with the metabolism of drugs file="http://www.well.ox.ac.uk/rmott/LECTURES/LOGIST IC_REGRESSION/ugeno.dat" data <- read.table(file, header=TRUE, stringsAsFactors=FALSE) 80 Coding • Additive coding Code genotypes as – AA(11) x=0, – AG(12) x=1, – GG(22) x=2 additive <- function( x ) { return(as.numeric(factor(x))-1) } Code genotypes as – AA x=0, – AG x=0, – GG x=1 recessive <- function( x ) { return ( ifelse( additive(x) > 1, 1, 0 ) ) } Code genotypes as – AA x=0, – AG x=1, – GG x=1 dominant <- function( x ) { return ( ifelse( additive(x) > 0, 1, 0 ) ) } • Recessive coding • Dominant coding these functions allow to convert a genotype vector in a certain way (i.e. coding) 81 Fitting the additive model • Additive coding x=apply(data[,-c(1:2)],2,additive) data_prep=as.data.frame(cbind(y=data$y,x)) fit.add <- glm(y~m1, data=data_prep, family = 'binomial' ) Call: glm(formula = y ~ m1, family = "binomial", data = data_prep) Coefficients: (Intercept) -2.8695 m1 -0.9872 Degrees of Freedom: 1017 Total (i.e. Null); Null Deviance: 343.7 Residual Deviance: 335.1 AIC: 339.1 1016 Residual 82 additive model example 83 Fitting the dominant model • Dominant coding x=apply(data[,-c(1:2)],2,dominant) data_prep=as.data.frame(cbind(y=data$y,x)) fit.dom <- glm(y~m1, data=data_prep, family = 'binomial' ) Call: glm(formula = y ~ m1, family = "binomial", data = data_prep) Coefficients: (Intercept) -2.880 m1 -1.004 Degrees of Freedom: 1017 Total (i.e. Null); Null Deviance: 343.7 Residual Deviance: 336.2 AIC: 340.2 1016 Residual 84 Dominant model example 85 Fitting the recessive model • Recessive coding x=apply(data[,-c(1:2)],2,recessive) data_prep=as.data.frame(cbind(y=data$y,x)) fit.recessive <- glm(y~m1, data=data_prep, family = 'binomial' ) Call: glm(formula = y ~ m1, family = "binomial", data = data_prep) Coefficients: (Intercept) -3.126 m1 -15.440 Degrees of Freedom: 1017 Total (i.e. Null); Residual Null Deviance: 343.7 Residual Deviance: 340.1 AIC: 344.1 1016 86 Recessive model example 87 In-class exercise Part 1 of 2 88 In-class Exercises 1) Create vectors a=(5, 6, 7) and b=(10,3,1) and obtain a total sum of their elements that are greater than 5 2) Calculate 100 π=1 π + 5 3) Create matrix A and replace the element located in the 3rd row and 3rd column by the sum of the 1st and 2nd row 1 1 3 π΄= 5 2 6 −2 −1 −3 89 In-class Exercises 4) Write a function which takes a single argument which is a matrix A (see Q3). The function should return a matrix which is the same as the function argument but every odd number is doubled. π π π ππ 2 6 −2 −π −π 90 Plots generation in R • R provides very rich set of plotting possibilities • The basic command is plot() • Each library has its own version of plot() function • When R plots graphics it opens “graphical device” that could be either a window or a file 91 Plotting functions • R offers following array of plotting functions Function Description plot(x) plot of the values of x variable on the y axis bi-variable plot of x and y values (both axis scaled based on values of x and y variables) circular pie-char Plots a box plot showing variables via their quantiles Plots a histogram(bar plot) plot(x,y) pie(y) boxplot(x) hist(x) 92 Plot modification functions • Often R plots are not optimal at 1st • R has an array of graphical parameters Consult here is the full list • Some of the graphical parameters can be specified inside plot() or using other graphical functions such as lines() 93 Plot modification functions Function points(x,y) lines(x,y) Description add points to the plot using coordinates specified in x and y vectors adds a line using coordinates in x and y mtext(text,side=3) adds text to a given margin specified by side number boxplot(x) arrows(x0,y0,x1,y1, angle=30, code=1) abline(h=y) rect(x1, y1, x2, y2) this a histogram that bins values of x into categories represented as bars adds arrow to the plot specified by the x0, y0, x1, y1 coordinates. Angle provides rotational angle and code specifies at which end arrow should be drawn draws horizontal line at y coordinate draws rectangle at x1, y1, x2, y2 coordinates plots legend of the plot at the position specified by x and y vectors used to generate a given plot adds title to the plot adds axis depending on the chosen one of the 4 sides; vector specifying where tick marks are drawn used interactively to select locations with mouse legend(x,y) title() axis(side, vect) locator() 94 Plot margins • Outer margin – par(oma=c(3, 2, 2,1)) • Fig. margin – par(mar=c(5, 4, 4, 2)) • Note the num. order – down, left, up, right 95 Visualization of data in R cars <- c(1, 3, 6, 4, 9) trucks <- c(2, 5, 4, 5, 12) g_range <- range(0, cars, trucks) plot(cars, type="o", col="blue", ylim=g_range, axes=FALSE, ann=FALSE) lines(trucks, type="o", pch=22, lty=2, col="red") axis(2, las=1, at=4*0:g_range[2]) axis(1, at=1:5, lab=c("Mon","Tue","Wed","Thu","Fri")) box() title(xlab="Days", col.lab=rgb(0,0.5,0)) title(ylab="Total", col.lab=rgb(0,0.5,0)) 96 Visualization of data in R c1 <- c(1,3,6,4,9); c2 <- c(2,5,4,5,12); c3 <- c(4,4,6,6,16); autos_data <- cbind(c1,c2,c3); colnames(autos_data)<-c("cars", "trucks", "suvs"); barplot(as.matrix(autos_data), main="Autos", ylab= "Total", beside=TRUE, col=rainbow(5)) legend("topleft", c("Mon","Tue","Wed","Thu","Fri"), cex=0.8, bty="n", fill=rainbow(5)) 97 Visualization of data in R #Expand right side of the margin for the legend par(xpd=T, mar=par()$mar+c(0,0,0,4)) #Graph autos using heat colors #put 10% of the space between each bar, and make #labels smaller with horizontal y-axis labels barplot(t(autos_data), main="Autos", ylab="Total", col=heat.colors(3), space=0.1, cex.axis=0.8, las=1, names.arg=c("Mon","Tue","Wed","Thu","Fri"), cex=0.8) legend(6, 30, colnames(autos_data), cex=0.8, fill=heat.colors(3)); # Restore default margins par(mar=c(5, 4, 4, 2) + 0.1); 98 Visualization of data in R dotchart(t(autos_data), color=c("red","blue","darkgreen"), main="Dotchart for Autos", cex=0.8) 99 Visualization of data in R r <- rlnorm(1000); hist(r) 100 Visualization of data in R r <- rlnorm(1000) # Get the distribution h <- hist(r, plot=F, breaks=c(seq(0,max(r)+1, .1))) # Plot the distribution using log scale on both axes, and # use blue points plot(h$counts[h$counts > 0], log="xy", pch=20, col="blue", main="Log-normal distribution", xlab="Value", ylab="Frequency") 101 In-class exercises Part 2 of 2 102 In-class exercises 1) create two perfectly correlated vectors (with ρ=1) and plot them. 2) Change data points to a dashed line 3) Add a horizontal and vertical red lines to your plot at coordinates: a) (1,10) and (4,4); b) (7,7) and (1,20) 4) Add text to the plot with text() to the left and right of the diagonal line 5) Add title to your plot 6) Add legend to the plot 103 Installation of new libraries • There are two main R repositories – CRAN – BioConductor • To install package/library from CRAN install.packages("seqinr") To install packages from BioConductor source("http://bioconductor.org/biocLite.R") biocLite("GenomicRanges") 104 The Comprehensive R Archive Network (CRAN) • CRAN – package repository features 7154 available packages – install.packages("packageName") – Some popular packages • ggplot2 – beautiful plots • Matrix – sparse matrices • igraph – graphs and analysis thereof 105 source("http://bioconductor.org/biocLite.R"); biocLite("packageName") • Repository of biology-related libraries in R – 178,856 packages • Some list of libraries – biomaRt – IRanges 106 Installation of the new libraries • Download and install latest R version on your PC. Go to http://cran.r-project.org/ • Install following libraries by running install.packages(c("seqinr", "ape", "GenABEL") source("http://bioconductor.org/biocLite.R") biocLite(c("limma", "muscle", "affy","hgu133plus2.db","Biostings")) 107 Biological annotation with biomaRt • Biological experiments require ID conversions – e.g. microarray probe id to gene name – e.g. mapping of SNP to gene symbol • BioMart online service – can be accessed via web-GUI – programmatically via biomaRt 108 BioMart • • • • Go to http://central.biomart.org/converter/#!/ID_converter/ Select genome assembly version Paste or upload the ID list E.g. covert SOX1 to GOID 109 biomaRt • Access BioMart programmatically – install the biomaRt library source("http://www.bioconductor.org/biocLite.R"); biocLite("biomaRt"); require(biomaRt); - use the listMarts() to see the different databases listMarts() - we will use the ensembl mart ensMart<-useMart("ensembl") 110 biomaRt - listDatasets() to see which data sets that are available in the database listDatasets(ensMart) - Will use Κ»homo sapiensΚΌ dataset ensembl_hs_mart <- useMart(biomart="ensembl", dataset="hsapiens_gene_ensembl") - List retrieved attributes (db fields) listAttributes(ensembl_hs_mart)[1:100,] - Download data ensembl_df <- getBM( attributes=c("ensembl_gene_id", "ensembl_transcript_id", "hgnc_symbol","chromosome_name", "entrezgene"), mart=ensembl_hs_mart ) 111 biomaRt - List of genes to annotate my_genes = c("ENSG00000197971", "ENSG00000153165", "ENSG00000159352", "ENSG00000146006", "ENSG00000149809", "ENSG00000204179", "ENSG00000213023", "ENSG00000115008", "ENSG00000130844","ENSG00000155363") - Annotate IDs my_genes_ann = ensembl_df[match(my_genes, ensembl_df$ensembl_gene_id),] ensembl_gene_id ensembl_transcript_id hgnc_symbol chromosome_name entrezgene 66435 ENSG00000197971 ENST00000382582 MBP 18 4155 66038 ENSG00000153165 ENST00000409886 RGPD3 2 653489 190545 ENSG00000159352 ENST00000368884 PSMD4 1 5710 112 In-class exercise 1) Map SNP rs2066844 to a gene with biomaRt library. What is its gene symbol, full gene name and gene biological function? 113 References [1] Team, R. Core. "R Language Definition." (2000). [2] Durinck, Steffen, et al. "BioMart and Bioconductor: a powerful link between biological databases and microarray data analysis." Bioinformatics 21.16 (2005): 3439-3440. 114 115 In-class exercises (answers) s.103 x=seq(1,10) y=2*seq(1,10) lines(c(1,10),c(4,4)) lines(c(7,7),c(1,20)) text(5, 8, "This is my plot", adj = c(0,0)) plot(x,y, type="c") legend(9,15, "data" , lty=1, col=c('red'), cex=0.8) 116 In-class exercises (answers). s113 library(biomaRt) snp_db=useMart("snp", dataset="hsapiens_snp") getBM(attributes="associated_gene", filters = "snp_filter", values = "rs2066844", mart=snp_db) 117