Canadian Bioinformatics Workshops
www.bioinformatics.ca

Lecture 1
Course Structure & Introduction to R
MBP1010H †
Dr. Paul C. Boutros
DEPARTMENT OF MEDICAL BIOPHYSICS
† Aegeus, King of Athens, consulting the Delphic Oracle. High Classical (~430 BCE)
This workshop includes material originally developed by Drs. Raphael Gottardo, Sohrab Shah, Boris Steipe and others

Who Am I?
• Received my PhD here in Medical Biophysics in 2009
• Started a lab that year at OICR
• Research focuses on statistical techniques for developing biomarkers for personalized cancer treatment
• The interface of clinical research, molecular biology, computer science and biostatistics
• Six graduate students
• This is the second full grad-course I am teaching
• TA – Brendan Innes

Lecture 1: Course Overview & Introduction to R bioinformatics.ca

Who Are You?
• MSc Students? PhD Students? Others?
• First Year? Second Year? Third Year? Others?
• Prior use of R?
• (Bio)statistics in your thesis project?
• Computational biology in your thesis project?
• Genomics in your thesis project?
• What do you want to get out of this course?

My Philosophy For This Course
• Learn how to do first (application), theory second
• Cover less material, but make sure it is clear when and how to use it
• Sometimes, the correct answer is “I’ll ask a real statistician”
  • I use this answer routinely
• Grades are mostly based on your ability to get things done (56% assignments + 9% participation + 35% exam)

Course Overview
• Lecture 1: What is Statistics?
  Introduction to R
• Lecture 2: Univariate Analyses I: continuous
• Lecture 3: Univariate Analyses II: discrete
• Lecture 4: Multivariate Analyses I: specialized models
• Lecture 5: Multivariate Analyses II: general models
• Lecture 6: Microarray Analysis I: Pre-Processing
• Lecture 7: Microarray Analysis II: Multiple-Testing
• Lecture 8: Data Visualization & Machine-Learning
• Lecture 9: Sequence Analysis Basics
• Final Exam (written)

Lecture Weeks Are Not Contiguous!
• Lecture 1: 2016/01/04 (today!)
• Lecture 2: 2016/01/11
• Lecture 3: 2016/01/18 (tentative)
• Lecture 4: 2016/01/25 (tentative)
• Lecture 5: 2016/02/01
• Lecture 6: 2016/02/08
• Lecture 7: 2016/02/29
• Lecture 8: 2016/03/07
• Lecture 9: 2016/03/14
• Final Exam: 2016/03/28 (to be confirmed)
FINAL DATES TO BE POSTED ASAP

How Will You Be Graded?
• 9% Participation: 1% per week
• 56% Assignments:
  • 5 Individual @ 7% each = 35%
  • 1 Group @ 21% = 21%
• 35% Final Examination: in-class
• For most assignments each individual will get their own, unique assignment
• Assignments will all be in R, and will be graded largely according to computational correctness only (i.e. does your R script yield the correct result when run)
• Final Exam: both multiple-choice and written answers

What Resources Can I Use?
• Lecture notes and core R documentation alone should be sufficient, but if you want:
  • Introductory Statistics with R; Peter Dalgaard
  • Tutorial sessions (to be scheduled – likely this time-slot in non-class weeks)
• Course Email: quantitativebiology.utoronto@gmail.com
• My email: Paul.Boutros@oicr.on.ca

House Rules
• Cell phones to silent
• No side conversations
• Hands up for questions
• Pay attention – I will randomly call on people during the course of each lecture.
• Others?

What Is Statistics?
• The study of all aspects of data itself:
  • Collection
  • Organization
  • Quantifying uncertainty
  • Data presentation
    • Reporting/Description
    • Visualization
  • Analysis/Inference
• Distinct from, but closely related to, probability theory:
  • Statistics: learning from data (sample → population)
  • Probability Theory: inferring from the underlying population (population → sample)

Population vs. Sample
Population: all possible measurements
Sample: the portion of the population we are studying
All MBP Students = Population
MBP Students in 1010 = Sample
Is that sample representative?

When Do We Use Statistics?
• Ubiquitous in modern biology
• Every class I will show an example of statistical analysis (good or bad) selected from a relatively recent paper
[Image: example paper, January 2, 2014]

Figure 1: At Least 6 P-Values

How Do You Report Statistical Analyses?
• Ideas?
• What is a P-Value?
• What is an Effect-Size?
• Which matters to you as a biologist? Why?
• Always report both

R
Latest version: 3.2.3
I am using v3.2.0. The differences are minimal regarding the functionality we are going to use, and are mostly minor bug-fixes. Either version will be perfectly fine, and most older versions should work as well until the last few lectures.

R Studio
Don’t use this! Not ready for production-use!

Why Are You Learning R?
Why not Excel?
Spreadsheets Are Hard Even If You Do It Right…
• “What we know about spreadsheet errors”, Journal of End User Computing 10(2):15-21
  • Spreadsheet Error Rate: 88%
  • Cell Error Rate: 2-7%
• “The accuracy of statistical distributions in Microsoft Excel 2007”
  • “researchers should continue to avoid using the statistical functions in Excel 2007 for any scientific purpose”
• “On the accuracy of statistical procedures in Microsoft Excel 2007”, Computational Statistics & Data Analysis 52
  • “it is not safe to assume that Microsoft Excel’s statistical procedures give the correct answer. Persons who wish to conduct statistical analyses should use some other package.”

Excel Ate My Gene Names

Other Reasons to Use R
• Emerging as the lingua franca of statistics
  • New methods first developed for and implemented in R
• Extraordinarily flexible:
  • Moving from simple to sophisticated analyses is easy
• Free
• Community development leading to rapid improvements
• Works identically on any type of computer (PC, Mac, Linux)
• Extraordinarily high-quality visualizations possible
• Reproducible research

Complex Data Visualization in R

Rest of Class
• Live introduction to R
• Many extra slides in this slide deck
• The assignment will cover R basics
• Topics we will try to cover (time-permitting):
  • R as a calculator
  • Help!
  • Data-types
  • Basic flow-control operations
  • Functions

Let’s Look at the Parts of R
• Overall Editor Experience
• R can act as a very good calculator
• It can store variables
• But you should always save your R commands in a separate file containing nothing else. Why?
  • Reproducibility
  • Separation of code & data = Reusability

Different Data-Types
• Scalar vs. Vector
• String vs. Numeric
• Categorical Data
  • Male vs. Female
  • Days of the Week
  • Colours
• Functions

expressions
R evaluates expressions. Entering expressions allows you to use R like a calculator.
> 2+2
[1] 4
> exp(-2)
[1] 0.1353353
> pi
[1] 3.141593
> sin(2*pi)
[1] -2.449294e-16
> 0/0
[1] NaN
Tip: Predefined symbols: pi, letters, month.name
Special symbols: NA, NaN, Inf, NULL, TRUE, FALSE

strings
R has a string datatype. Although you can accomplish all of your string-handling needs in R, other programming languages may be more suitable.
> "Hello"
[1] "Hello"
> x <- paste("Hello", "World")
> x
[1] "Hello World"
> m <- gregexpr("(\\b\\w{2})", x, perl=TRUE)
> y <- regmatches(x, m)
> y
[[1]]
[1] "He" "Wo"

> paste(y[[1]], collapse='')
[1] "HeWo"
Task: Assign a first and a last name to two variables. Create a third variable that contains the initials.

dates
R has a date datatype.
> format(ISOdate(2000, 1:12, 1), "%b")
 [1] "Jan" "Feb" "Mar" "Apr" "May" "Jun"
 [7] "Jul" "Aug" "Sep" "Oct" "Nov" "Dec"
> format(Sys.time(), "%W")
[1] "20"
See strptime() for formatting options.
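The format codes listed in ?strptime work in both directions: for parsing text into a date and for printing a date as text. A minimal sketch, assuming dates written in the YYYY/MM/DD style used for the lecture schedule (the variable names are illustrative only):

```r
# Parse text into Date objects; the format string describes the input layout
LectureDate <- as.Date("2016/01/04", format = "%Y/%m/%d")
NextLecture <- as.Date("2016/01/11", format = "%Y/%m/%d")

# Print a Date with a different layout
IsoString <- format(LectureDate, "%Y-%m-%d")   # ISO 8601 layout
DayOfYear <- format(LectureDate, "%j")         # day of the year, zero-padded

# Dates support arithmetic; the difference is measured in days
DaysBetween <- as.numeric(NextLecture - LectureDate)

print(IsoString)
print(DayOfYear)
print(DaysBetween)
```

Codes such as "%b" and "%A" return month and weekday names in the language of your computer's locale, so their output can differ between machines.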
Task: What weekday will your birthday be this year?

help
R has extensive help available for all of its functions and objects.
> help(pi)
> ?pi
> ?sqrt
> ?Special
Task: Print pi to 10 digits.
Fix: help "sqrt"

help searches
If you don't know the function name, try a keyword search.
> help.search("trigonometry")
> ??input
However, often a Google search will give you more immediate results.
Tip: A table of all available packages is at: http://cran.r-project.org/
R Manuals are at: http://cran.r-project.org/manuals.html
For a list of all functions in the base package see e.g.: http://ugrad.stat.ubc.ca/R/library/base/html/00Index.html

assignments
We need to be able to store intermediate results. In R, we can assign data to variables.
> x <- 1/sqrt(4)
> y <- sin(pi/6)
> x+y
[1] 1
The R community prefers "<-" to "=". Both are possible. "<-" is more general. Don't confuse "=" (assignment) with "==" (test for equality)!
Tip: Be explicit in variable names. Avoid "-" (R reads it as subtraction) and "_"; use mixed case or dots instead. Make variables upper-case nouns, functions lower-case verbs.
Good: GeneIDs, simulateAlleles()
Poor: q, calculate-number

vectors
We can't do much statistics with scalars. R is built to handle collections of numbers and other elements efficiently. Such collections (vectors) can be created with the c() function (concatenate).
> Weight <- c(60,72,75,90,95,72)
> Weight[1]
[1] 60
> Weight[2]
[1] 72
> Weight
[1] 60 72 75 90 95 72
> Height <- c(1.75,1.80,1.65,1.90,1.74,1.91)
> BMI <- Weight/Height^2 # vector based operation
> BMI
[1] 19.59184 22.22222 27.54821 24.93075 31.37799 19.73630

vector operations
If you apply an operation to a vector, it is applied to each element of the vector.
If you apply an operation to two vectors, it is applied to each matching pair of elements.
> x <- 1:5
> x+2
[1] 3 4 5 6 7
> y <- 6:2
> x+y
[1] 7 7 7 7 7
Exercise: What happens if the vectors have different types (numeric, character, logical)? What happens if the vectors have different lengths?

vector operations
Exercise: Create a vector "x" with the following elements: 1, 3, 10, -1. Print the square of these elements. Take the square root of x. Take the log of all values in x after adding 1.

vector types
R vectors can be of type:
• numeric
• character
• logical
> x <- c(1, 5, 8) # Numeric
> x
[1] 1 5 8
> x <- c(TRUE, TRUE, FALSE, TRUE) # Logical
> x
[1]  TRUE  TRUE FALSE  TRUE
> x <- c("Hello","world") # Character
> x
[1] "Hello" "world"
> x <- c(1, TRUE, "Thursday") # Mixed
> x
[1] "1"        "TRUE"     "Thursday"
Task: Show that "TRUE" is no longer a logical type.

missing and special values
We have already encountered the NaN symbol meaning Not-a-Number, and Inf, -Inf.
In practical data analysis a data point is frequently unavailable. In R, missing values are denoted by NA ("Not Available").
> Weight[5] <- NA
> mean(Weight)
[1] NA
> mean(Weight, na.rm=TRUE)
[1] 73.8

matrices and arrays
A matrix is a two dimensional array of numbers. Matrices can be used to perform statistical operations (linear algebra). However, they can also be used to hold tables.
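As a rough sketch of the linear-algebra use mentioned above: matrix(), t(), %*% and solve() are all part of base R, while the numbers and variable names here are made up for illustration and are not from the slides.

```r
# A 2 x 2 matrix, filled column by column (R's default order)
CoefMatrix <- matrix(c(2, 1,
                       1, 3), nrow = 2)
RightSide <- c(1, 2)

Transposed <- t(CoefMatrix)               # transpose
Product    <- CoefMatrix %*% RightSide    # matrix-vector product
Solution   <- solve(CoefMatrix, RightSide)  # solves CoefMatrix %*% Solution == RightSide

print(Product)
print(Solution)
```

Note that "*" multiplies matrices element by element, like any other vector operation; "%*%" is the operator for true matrix multiplication.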
> x <- 1:12
> x
 [1]  1  2  3  4  5  6  7  8  9 10 11 12
> length(x)
[1] 12
> dim(x)
NULL
> dim(x) <- c(3,4)
> x
     [,1] [,2] [,3] [,4]
[1,]    1    4    7   10
[2,]    2    5    8   11
[3,]    3    6    9   12
> a <- matrix(1:12,nrow=3,byrow=TRUE)
> a
     [,1] [,2] [,3] [,4]
[1,]    1    2    3    4
[2,]    5    6    7    8
[3,]    9   10   11   12
> a <- matrix(1:12,nrow=3,byrow=FALSE)
> a
     [,1] [,2] [,3] [,4]
[1,]    1    4    7   10
[2,]    2    5    8   11
[3,]    3    6    9   12
> rownames(a) <- c("A","B","C")
> a
  [,1] [,2] [,3] [,4]
A    1    4    7   10
B    2    5    8   11
C    3    6    9   12
> colnames(a) <- c("1","2","x","y")
> a
  1 2 x  y
A 1 4 7 10
B 2 5 8 11
C 3 6 9 12

matrices and arrays
> a
  1 2 x  y
A 1 4 7 10
B 2 5 8 11
C 3 6 9 12
Exercise: Print the values of the second column of a. Print the values of the second row of a. Print the value of the element in the lower left corner.

matrices and arrays
Matrices can also be formed by "gluing" rows and columns using cbind and rbind. This is the equivalent of c for vectors.
> x1 <- 1:4 # Define three vectors
> x2 <- 5:8
> y1 <- c(3,9)
> MyMatrix <- rbind(x1,x2)
> MyMatrix
   [,1] [,2] [,3] [,4]
x1    1    2    3    4
x2    5    6    7    8
> MyNewMatrix <- cbind(MyMatrix,y1)
> MyNewMatrix
           y1
x1 1 2 3 4  3
x2 5 6 7 8  9

factors
It is common to have categorical data in statistical data analysis (e.g. Male/Female). In R such variables are referred to as factors. This makes it possible to assign meaningful names to categories. A factor has a set of levels.
> Pain <- c(0,3,2,2,1)
> SevPain <- as.factor(c(0,3,2,2,1))
> levels(SevPain) <- c("none","mild","medium","severe")
> is.factor(SevPain)
[1] TRUE
> is.vector(SevPain)
[1] FALSE

lists
Lists can be used to combine objects (of possibly different kinds/sizes) into a larger composite object. The components of the list are named according to the arguments used.
Components can be extracted with the double bracket operator [[ ]]. Alternatively, named components can be accessed with the "$" separator.
> A <- c(31,32,40)
> S <- as.factor(c("F","M","M","F"))
> L <- c("London","School")
> MyFriends <- list(age=A,sex=S,meta=L)
> MyFriends
$age
[1] 31 32 40

$sex
[1] F M M F
Levels: F M

$meta
[1] "London" "School"

> MyFriends[[1]]
[1] 31 32 40
> MyFriends$age
[1] 31 32 40
Exercise: Combine Pain and SevPain into a list with a meaningful name.

data frames
A data frame is a table or "set" of data (like a matrix, but its columns may have different types). It is a list of vectors and/or factors of the same length that are related "across", such that data in the same position come from the same experimental unit (subject, animal, etc).
> Probands <- data.frame(age=c(31,32,40,50),sex=S)
> Probands
  age sex
1  31   F
2  32   M
3  40   M
4  50   F
> Probands$age
[1] 31 32 40 50
Why do we need data frames if they do the same as a list? More efficient storage, and indexing!
R's read...() functions return data frames.

names
Names of an R object can be accessed and/or modified with the names() function.
> x <- 1:3
> names(x)
NULL
> names(x) <- c("a", "b", "c")
> x
a b c 
1 2 3 
> names(Probands)
[1] "age" "sex"
> names(Probands) <- c("age", "gender")
> names(Probands)[1] <- c("Age")
Tip: Give explicit names to variables. Names can be used for indexing.

indexing (extracting)
Indexing (> ?Extract) is a great way to directly access elements of interest.
> # Indexing a vector
> Pain <- c(0,3,2,2,1)
> Pain[1]
[1] 0
> Pain[2]
[1] 3
> Pain[1:2]
[1] 0 3
> Pain[c(1,3)]
[1] 0 2
> Pain[-5]
[1] 0 3 2 2
> # Indexing a matrix
> MyNewMatrix[1,1]
[1] 1
> MyNewMatrix[1,]
            y1 
 1  2  3  4  3 
> MyNewMatrix[,1]
x1 x2 
 1  5 
> MyNewMatrix[,-2]
         y1
x1 1 3 4  3
x2 5 7 8  9
> # Indexing a list
> MyFriends[3]
$meta
[1] "London" "School"

> MyFriends[[3]]
[1] "London" "School"
> MyFriends[[3]][1]
[1] "London"
> # Indexing a data frame
> Probands[1,]
  Age gender
1  31      F
> Probands[2,]
  Age gender
2  32      M

indexing by name
Names can also be used to index an R object.
> MyFriends$age
[1] 31 32 40
> MyFriends["age"]
$age
[1] 31 32 40

> MyFriends[["age"]]
[1] 31 32 40
> Probands["Age"]
  Age
1  31
2  32
3  40
4  50
> Probands[1]
  Age
1  31
2  32
3  40
4  50
> Probands[[1]]
[1] 31 32 40 50
Exercise: Can the results of "[ ]" and "[[ ]]" extractions both be used in vector operations?

conditional indexing
Indexing can be conditional on another variable.
> Pain; SevPain
[1] 0 3 2 2 1
[1] none   severe medium medium mild  
Levels: none mild medium severe
> Age <- c(45,51,45,32,90)
> Pain[SevPain=="medium" | SevPain=="severe"]
[1] 3 2 2
> Pain[Age>32]
[1] 0 3 2 1
Note: the conditional variable does not have to be part of the same data object.
Exercise: Extract elements for "none" and for Age < 90.

data input
Normally, you would start your R session by reading in some data to be analysed. This can be done with the read.table function. Download the sample data to your local directory...
> GvHD <- read.table("GvHD.txt", header=TRUE)
> GvHD[1:10,]
   FSC.Height SSC.Height CD4.FITC CD8.B.PE CD3.PerCP CD8.APC
1         321        199      308      220       157     339
2         303        210      319      271       223     350
3         318        170      215      148       119     221
4         202         49      104       49       284     178
5         353        248      262      167       144     156
6         192         68      423       97       344     113
7         322        225      236      214       141     209
8         350        152      258       82       253     205
9         351        223      286      128       172     220
10        269         78      169      289       224     537
Tip: Alternatively – use the RStudio GUI.

functions and arguments
Many things in R are done using function calls: commands that look like an application of a mathematical function to one or several variables, e.g. log(x), plot(Weight, Height).
When you use plot(Weight, Height), R assumes that the first argument is the x variable and the second is the y. If you do not know how to specify the arguments, look at ?plot.
Most function arguments have sensible defaults and can thus be omitted, e.g. col in plot(Weight, Height, col=1).
If you do not specify the names of the arguments, R matches them by position, in the order given in the function's definition.

libraries
Many contributed functionalities of R are available in R packages/libraries. Some of these are distributed with R while others need to be downloaded and installed separately.
> library(survival)
Loading required package: splines
> library(samr)
Error in library(samr) : there is no package called 'samr'
> install.packages("samr")
--- Please select a CRAN mirror for use in this session ---
also installing the dependencies ‘R.methodsS3’, ‘impute’, ‘matrixStats’
trying URL 'http://probability.ca/cran/bin/macosx/leopard/contrib/2.13/R.methodsS3_1.2.1.tgz'
Content type 'application/x-gzip' length 47709 bytes (46 Kb)
opened URL
==================================================
downloaded 46 Kb
[...]
The downloaded packages are in
/var/folders/dq/dqPEEPbFGFWs6MKN40ApRU+++TI/-Tmp-//RtmpNDvKDp/downloaded_packages
> library(samr)
Loading required package: impute
Loading required package: matrixStats
Loading required package: R.methodsS3
R.methodsS3 v1.2.1 (2010-09-18) successfully loaded. See ?R.methodsS3 for help.
matrixStats v0.2.2 (2010-10-06) successfully loaded. See ?matrixStats for help.

R programming: conditional statements
R is a full-featured programming language.
# if statement
> x <- -2
> if(x>0) {
+   print(x)
+ } else {
+   print(-x)
+ }
[1] 2
>
> if(x>0) {
+   print(x)
+ } else if(x==0) {
+   print(0)
+ } else {
+   print(-x)
+ }
[1] 2

R programming: loops
# for loop
n <- 1000000
x <- rnorm(n,10,1)
y <- x^2         # vectorized operation: no loop needed
y <- rep(0,n)    # pre-allocate the result vector for the loops below
for (i in 1:n) {
  y[i] <- sqrt(x[i])
}
# while loop
Counter <- 1
while (Counter <= n) {
  y[Counter] <- sqrt(x[Counter])
  Counter <- Counter+1
}
Exercise: Apply sqrt() to x as a vector and compare execution speed.

creating your own functions
Function objects can simply be assigned.
Oracle <- function() {
  WiseWords <- c(
    "Joy",
    "Plan",
    "Disappear",
    "Perhaps",
    "Sorrow",
    "Hope",
    "Change"
  )
  n <- sample(WiseWords, 1)
  return(n)
}
> Oracle()
[1] "Disappear"
Exercise: Write a function to return the inverse of a number. Warn if input == 0.

creating your own functions
Computing a square root, based on Newton's method.
MySqrt <- function(y) {
  x <- y/2
  while (abs(x*x - y) > 1e-10) {
    x <- (x + y/x)/2
  }
  x
}
Why would we do this? Because now we have the internals of a function exposed and can manipulate them.
Exercise: Compare execution speed with sqrt(). Store and return all intermediate values of x to see how the computation converges.
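Where does the update x <- (x + y/x)/2 in MySqrt come from? Newton's method improves a guess x for a root of f by replacing it with x - f(x)/f'(x). For f(x) = x^2 - y, whose positive root is the square root of y, f'(x) = 2x, and the update simplifies to (x + y/x)/2. The sketch below only checks that the two forms agree for one step; the function names newtonStep and averageStep are illustrative and not part of the slides.

```r
# Generic Newton update for f(x) = x^2 - y, with derivative f'(x) = 2x
newtonStep <- function(x, y) {
  x - (x^2 - y) / (2 * x)
}

# The simplified update used inside MySqrt
averageStep <- function(x, y) {
  (x + y / x) / 2
}

# One step from the starting guess x = 1 towards sqrt(2)
FromNewton  <- newtonStep(1, 2)
FromAverage <- averageStep(1, 2)

print(FromNewton)
print(FromAverage)
```

Both calls return 1.5, already closer to sqrt(2) than the starting guess of 1.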