bioinformatics - Department of Medical Biophysics

Canadian Bioinformatics
Workshops
www.bioinformatics.ca
Lecture 1
Course Structure & Introduction to R
MBP1010
Dr. Paul C. Boutros
Winter 2014
DEPARTMENT OF
MEDICAL BIOPHYSICS
Aegeus, King of Athens, consulting the Delphic Oracle. High Classical (~430 BCE)
This workshop includes material
originally developed by Drs. Raphael Gottardo,
Sohrab Shah, Boris Steipe and others
Who Am I?
• Got my PhD here in Medical Biophysics in 2008
• Started a lab that year at OICR
• Research focuses on statistical techniques for developing
biomarkers for personalized cancer treatment
• The interface of clinical research, molecular biology, computer
science and biostatistics
• Five MBP graduate students
• This is the first full grad-course I am teaching
• TAs
Lecture 1: Course Overview & Introduction to R
bioinformatics.ca
Who Are You?
• MSc Students? PhD Students? Others?
• First Year? Second Year? Third Year? Others?
• Prior use of R?
• (Bio)statistics in your thesis project?
• Computational biology in your thesis project?
• Genomics in your thesis project?
• What do you want to get out of this course?
My Philosophy For This Course
• Learn how to do first (application), theory second
• Cover less material, but make sure it is clear when and
how to use it
• Sometimes, the correct answer is “I’ll ask a real
statistician”
• I use this answer routinely
• Grades are mostly based on the ability to get things done
Course Overview
• Lecture 1: What is Statistics? Introduction to R
• Lecture 2: Univariate Analyses I: continuous
• Lecture 3: Univariate Analyses II: discrete
• Lecture 4: Multivariate Analyses I: specialized models
• Lecture 5: Multivariate Analyses II: general models
• Lecture 6: Sequence Analysis
• Lecture 7: Microarray Analysis I: Pre-Processing
• Lecture 8: Microarray Analysis II: Multiple-Testing
• Lecture 9: Machine-Learning
• Final Exam (written)
How Will You Be Graded?
• 9% Participation: 1% per week
• 56% Assignments: 8 x 7% each
• 35% Final Examination: in-class
• Each individual will get their own, unique assignment
• Assignments will all be in R, and will be graded according
to computational correctness only (i.e. does your R script
yield the correct result when run)
• Final Exam will include multiple-choice and written
answers
What Resources Can I Use?
• Lecture notes alone should be sufficient
• Tutorial sessions
• Recommended Book:
• Introductory Statistics with R; Peter Dalgaard
• Extensive documentation for R itself
• Several online tutorials
• Course Email: quantitativebiology.utoronto@gmail.com
House Rules
• Cell phones to silent
• No side conversations
• Hands up for questions
• Others?
What Is Statistics?
• The study of all aspects of data itself:
• Collection
• Organization
• Quantifying uncertainty
• Data presentation
• Reporting/Description
• Visualization
• Analysis/Inference
• Distinct from, but closely related to, probability theory:
• Statistics: learning about the underlying population from data
• Probability Theory: inferring what data to expect from the underlying population
Population vs. Sample
Population: all possible measurements
Sample: the portion of the population we are studying
All MBP Students = Population
MBP Students in 1010 = Sample
Is that sample representative?
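A minimal sketch of the population/sample distinction, using simulated values rather than real student data. The heights, the population size of 200 and the sample size of 20 are all made up for illustration:

```r
# Hypothetical population: heights (cm) of 200 simulated students
set.seed(1010)                          # fixed seed so the sketch is repeatable
Population <- rnorm(200, mean = 170, sd = 10)

# A simple random sample of 20 students drawn from that population
Sample <- sample(Population, size = 20)

# The sample mean estimates the population mean, but will rarely equal it
PopulationMean <- mean(Population)
SampleMean <- mean(Sample)
print(c(population = PopulationMean, sample = SampleMean))
```

A random sample like this one is representative on average. Students who chose to enrol in one particular course were not selected at random, which is why the question above matters.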
When Do We Use Statistics?
• Ubiquitous in modern biology
• Every class I will show a use of statistics in a (very, very)
recent Nature paper.
January 2, 2014
Figure 1: At Least 6 P-Values
How Do You Report Statistical Analyses?
• Ideas?
• What is a P-Value?
• What is an Effect-Size?
• Which matters to you as a biologist? Why?
• Always report both
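As a hedged illustration of reporting both numbers, the sketch below compares two simulated groups with a t-test. The data, the group names and the choice of a difference in means as the effect size are assumptions made for this example only:

```r
set.seed(42)
Control <- rnorm(30, mean = 10, sd = 2)   # simulated measurements
Treated <- rnorm(30, mean = 12, sd = 2)

Result <- t.test(Treated, Control)        # Welch two-sample t-test (R's default)

# Effect size: how large is the difference between the groups?
EffectSize <- mean(Treated) - mean(Control)

# P-value: how surprising would a difference at least this large be
# if there were no true difference?
PValue <- Result$p.value

cat("Difference in means:", round(EffectSize, 2), "\n")
cat("95% confidence interval:", round(Result$conf.int, 2), "\n")
cat("P-value:", signif(PValue, 3), "\n")
```

The effect size says whether the difference is large enough to matter biologically; the P-value only says whether it is distinguishable from noise.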
R
Latest version: 3.0.2 (released September 2013).
I am using v3.0.1. The differences are minimal regarding the
functionality we are going to use, and are mostly minor bug-fixes.
Either version will be perfectly fine, and most older versions
should work as well until the last few lectures.
RStudio
Don’t use this!
Not ready for production-use!
Why Are You Learning R? Why not Excel?
Spreadsheets Are Hard
• “What we know about spreadsheet errors”, Journal of End User Computing 10(2):15-21
• Spreadsheet Error Rate: 88%
• Cell Error Rate: 2-7%
Even If You Do It Right…
• “The accuracy of statistical distributions in Microsoft Excel 2007” and “On the accuracy of statistical procedures in Microsoft Excel 2007”, Computational Statistics & Data Analysis 52
• “researchers should continue to avoid using the statistical functions in Excel 2007 for any scientific purpose”
• “it is not safe to assume that Microsoft Excel’s statistical procedures give the correct answer. Persons who wish to conduct statistical analyses should use some other package.”
Other Reasons to Use R
• Emerging as the lingua franca of statistics
• New methods are first developed for and implemented in R
• Extraordinarily flexible:
• Moving from simple to sophisticated analyses is easy
• Free
• Community development leading to rapid improvements
• Works identically on any type of computer (PC, Mac, Linux)
• Extraordinarily high-quality visualizations possible
• Reproducible research
Complex Data Visualization in R
Let’s Look at the Parts of R
• Overall Editor Experience
• R can act as a very good calculator
• It can store variables
• But you should always save your R commands in a
separate file containing nothing else
• Why? Reproducibility, separation of code & data, reusability
Different Data-Types
• Scalar vs. Vector
• String vs. Numeric
• Categorical Data
• Male vs. Female
• Days of the Week
• Colours
• Functions
expressions
R evaluates expressions.
Entering expressions allows you to use R like a calculator.
> 2+2
[1] 4
> exp(-2)
[1] 0.1353353
> pi
[1] 3.141593
> sin (2*pi)
[1] -2.449294e-16
> 0/0
[1] NaN
Tip:
Predefined symbols: pi, letters, month.name
Special symbols: NA, NaN, Inf, NULL, TRUE, FALSE
strings
R has a string datatype.
Although you can accomplish all of your string-handling needs in R,
other programming languages may be more suitable.
> "Hello"
[1] "Hello"
> x <- paste("Hello", "World")
> x
[1] "Hello World"
> m <- gregexpr("(\\b\\w{2})", x, perl=TRUE)
> y <- regmatches(x, m)
> y
[[1]]
[1] "He" "Wo"
> paste(y[[1]], collapse='')
[1] "HeWo"
Task:
Assign a first and a last name to two variables.
Create a third variable that contains the initials.
dates
R has a date datatype.
> format(ISOdate(2000, 1:12, 1), "%b")
[1] "Jan" "Feb" "Mar" "Apr" "May" "Jun"
[7] "Jul" "Aug" "Sep" "Oct" "Nov" "Dec"
> format(Sys.time(), "%W")
[1] "20"
See strptime() for formatting options.
Task:
What weekday will your birthday be this year?
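One possible way to approach date questions like this is to convert a string to a Date and ask for its weekday. The date below is arbitrary (the first day of 2014), not anyone's birthday, and the variable names are introduced here for illustration:

```r
# as.Date() parses ISO-formatted strings (YYYY-MM-DD)
D <- as.Date("2014-01-01")

# "%w" gives the weekday as a number (0 = Sunday, ..., 6 = Saturday),
# which does not depend on the language settings of the computer
WeekdayNumber <- format(D, "%w")
print(WeekdayNumber)

# weekdays() gives the name of the day in the current locale
print(weekdays(D))

# Dates support arithmetic: the date 30 days later
Later <- D + 30
print(Later)
```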
help
R has extensive help available
for all of its functions and objects.
> help (pi)
> ?pi
> ?sqrt
> ?Special
Task:
Print pi to 10 digits.
Fix:
help "sqrt"
help searches
If you don't know the function name,
try a keyword search.
> help.search ("trigonometry")
> ??input
However, often a Google search will give you more immediate results.
Tip:
A table of all available packages is at: http://cran.r-project.org/
R Manuals are at: http://cran.r-project.org/manuals.html
For a list of all functions in the base package see e.g.:
http://ugrad.stat.ubc.ca/R/library/base/html/00Index.html
assignments
We need to be able to store intermediate results.
In R, we can assign data to variables.
> x <- 1/sqrt(4)
> y <- sin(pi/6)
> x+y
[1] 1
The R community prefers "<-" to "=". Both are possible. "<-" is more general.
Don't confuse "=" with "==" !!!
Tip:
Be explicit in variable names. "-" cannot appear in a name (R reads it as subtraction); avoid "_" and
use mixed case or dots instead. Make variables upper-case nouns, functions lower-case verbs.
Good: GeneIDs, simulateAlleles()
Poor: q, calculate_number; invalid: calculate-number
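A short sketch of the difference between assigning and comparing. The variable names are arbitrary and chosen only for this example:

```r
x <- 5          # assignment with "<-"
y = 5           # assignment with "=" also works at the top level

# "==" compares and returns a logical value; it does not change x
Same <- (x == y)
print(Same)

# Inside a function call, "=" names an argument rather than assigning a variable,
# so x in the workspace is still 5 afterwards
m <- mean(x = c(1, 2, 3))
print(m)
print(x)
```

This is one reason "<-" is described as more general: it always means assignment, wherever it appears.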
vectors
We can't do much statistics with scalars. R is built to handle sequences
of numbers and other elements efficiently.
Such sequences (vectors) can be created with the c() function
(concatenate).
> Weight <- c(60,72,75,90,95,72)
> Weight[1]
[1] 60
> Weight[2]
[1] 72
> Weight
[1] 60 72 75 90 95 72
> Height <- c(1.75,1.80,1.65,1.90,1.74,1.91)
> BMI <- Weight/Height^2 # vector based operation
> BMI
[1] 19.59184 22.22222 27.54821 24.93075 31.37799 19.73630
vector operations
If you apply an operation to a vector, it is applied to each element of the vector.
If you apply an operation to two vectors, it is applied to each matching pair of elements.
> x <- 1:5
> x+2
[1] 3 4 5 6 7
> y <- 6:2
> x+y
[1] 7 7 7 7 7
Exercise:
What happens if the vectors have different types (numeric, character, logical)?
What happens if the vectors have different lengths?
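For the question about lengths, the sketch below shows R's recycling rule on two small vectors. The vector names are made up for this example, and it covers lengths only, not mixed types:

```r
Short <- c(1, 2)
Long <- 1:6

# The shorter vector is recycled to match the longer one: 1, 2, 1, 2, 1, 2
Recycled <- Long + Short
print(Recycled)

# If the longer length is not a multiple of the shorter one, R still recycles
# but issues a warning (suppressed here to keep the output short)
Uneven <- suppressWarnings(1:5 + Short)
print(Uneven)
```

Recycling is what makes x+2 work: the single value 2 is a vector of length one that is reused for every element of x.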
vector operations
Exercise:
Create a vector "x" with the following elements 1,3,10,-1.
Print the square of these elements.
Take the square root of x.
Take the log of all values in x after adding 1.
vector types
R vectors can be of type:
• numeric
• character
• logical
> x <- c(1, 5, 8) # Numeric
> x
[1] 1 5 8
> x <- c(TRUE, TRUE, FALSE, TRUE) # Logical
> x
[1]  TRUE  TRUE FALSE  TRUE
> x <- c("Hello", "world") # Character
> x
[1] "Hello" "world"
> x <- c(1, TRUE, "Thursday") # Mixed
> x
[1] "1"        "TRUE"     "Thursday"
Task:
Show that "TRUE" is no longer a logical type.
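One way to approach this task is to inspect the class of the vector after mixing types. This is a sketch; the names Mixed, NumLog and Count are introduced here:

```r
Mixed <- c(1, TRUE, "Thursday")

# All elements of a vector share one type; here everything became character
print(class(Mixed))
print(is.logical(Mixed[2]))      # the second element is now the string "TRUE"

# Numeric and logical together coerce to numeric: TRUE becomes 1, FALSE becomes 0
NumLog <- c(10, TRUE, FALSE)
print(NumLog)

# This is why sum() on a logical vector counts the TRUE values
Count <- sum(c(TRUE, TRUE, FALSE, TRUE))
print(Count)
```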
missing and special values
We have already encountered the NaN symbol, meaning Not-a-Number,
and Inf, -Inf.
> Weight[5] <- NA
> mean(Weight)
[1] NA
> mean(Weight, na.rm=TRUE)
[1] 73.8
In practical data analysis a data point is frequently unavailable.
In R, missing values are denoted by NA ("Not Available").
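A small sketch of how missing values can be located and excluded. It reuses the weights from the earlier slide with the fifth value set to NA; Missing and Observed are names introduced here:

```r
Weight <- c(60, 72, 75, 90, NA, 72)

# is.na() flags missing entries; "== NA" does not work because any
# comparison with NA is itself NA
Missing <- is.na(Weight)
print(Missing)
print(which(Missing))

# Keep only the observed values
Observed <- Weight[!Missing]
print(mean(Observed))

# NaN counts as missing for is.na(), but NA is not NaN
print(is.na(0/0))
print(is.nan(NA))
```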
matrices and arrays
A matrix is a two dimensional array of numbers. Matrices can be used to
perform statistical operations (linear algebra). However, they can also be
used to hold tables.
> x <- 1:12
> x
 [1]  1  2  3  4  5  6  7  8  9 10 11 12
> length(x)
[1] 12
> dim(x)
NULL
> dim(x) <- c(3,4)
> x
     [,1] [,2] [,3] [,4]
[1,]    1    4    7   10
[2,]    2    5    8   11
[3,]    3    6    9   12
> a <- matrix(1:12, nrow=3, byrow=TRUE)
> a
     [,1] [,2] [,3] [,4]
[1,]    1    2    3    4
[2,]    5    6    7    8
[3,]    9   10   11   12
> a <- matrix(1:12, nrow=3, byrow=FALSE)
> a
     [,1] [,2] [,3] [,4]
[1,]    1    4    7   10
[2,]    2    5    8   11
[3,]    3    6    9   12
> rownames(a) <- c("A","B","C")
> a
  [,1] [,2] [,3] [,4]
A    1    4    7   10
B    2    5    8   11
C    3    6    9   12
> colnames(a) <- c("1","2","x","y")
> a
  1 2 x  y
A 1 4 7 10
B 2 5 8 11
C 3 6 9 12
matrices and arrays
> a
  1 2 x  y
A 1 4 7 10
B 2 5 8 11
C 3 6 9 12
Exercise:
Print the values of the second column of a.
Print the values of the second row of a.
Print the value of the element in the lower left corner.
matrices and arrays
Matrices can also be formed by "glueing" rows and columns
using cbind and rbind. This is the equivalent of c for vectors.
> x1 <- 1:4 # Define three vectors
> x2 <- 5:8
> y1 <- c(3,9)
> MyMatrix <- rbind(x1,x2)
> MyMatrix
   [,1] [,2] [,3] [,4]
x1    1    2    3    4
x2    5    6    7    8
> MyNewMatrix <- cbind(MyMatrix,y1)
> MyNewMatrix
           y1
x1 1 2 3 4  3
x2 5 6 7 8  9
factors
It is common to have categorical data in statistical data analysis
(e.g. Male/ Female). In R such variables are referred to as
factors. This makes it possible to assign meaningful names to
categories.
A factor has a set of levels.
> Pain <- c(0,3,2,2,1)
> SevPain <- as.factor(c(0,3,2,2,1))
> levels(SevPain) <- c("none","mild","medium","severe")
> is.factor(SevPain)
[1] TRUE
> is.vector(SevPain)
[1] FALSE
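An alternative sketch uses factor() with explicit levels and labels, so that each code is tied to its name in one step and a category that happens to be absent from the data still keeps its place. The coding 0 to 3 is taken from the example above; PainFactor, Counts and Codes are names introduced here:

```r
Pain <- c(0, 3, 2, 2, 1)

# Map each numeric code to a named level in a fixed order
PainFactor <- factor(Pain,
                     levels = c(0, 1, 2, 3),
                     labels = c("none", "mild", "medium", "severe"))
print(PainFactor)

# table() counts the observations in each level
Counts <- table(PainFactor)
print(Counts)

# as.integer() returns the internal level codes (1 to 4), not the original 0 to 3
Codes <- as.integer(PainFactor)
print(Codes)
```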
lists
Lists can be used to combine objects (of possibly different kinds/sizes)
into a larger composite object.
The components of the list are named according to the arguments used.
Components can be extracted with the double bracket operator [[ ]].
Alternatively, named components can be accessed with the "$" separator.
> A <- c(31,32,40)
> S <- as.factor(c("F","M","M","F"))
> L <- c("London","School")
> MyFriends <- list(age=A, sex=S, meta=L)
> MyFriends
$age
[1] 31 32 40
$sex
[1] F M M F
Levels: F M
$meta
[1] "London" "School"
> MyFriends[[1]]
[1] 31 32 40
> MyFriends$age
[1] 31 32 40
Exercise: Combine Pain and SevPain into a list with a meaningful name.
data frames
A data frame is a table-like "set" of data. It is a list of vectors
and/or factors of the same length that are related "across", such
that data in the same position come from the same
experimental unit (subject, animal, etc).
> Probands <- data.frame(age=c(31,32,40,50), sex=S)
> Probands
  age sex
1  31   F
2  32   M
3  40   M
4  50   F
> Probands$age
[1] 31 32 40 50
Why do we need data frames if they do the same as a list?
Convenient row-and-column indexing, with all columns kept at the same length! R's read...() functions return data frames.
names
Names of an R object can be accessed and/or modified with the
names() function.
> x <- 1:3
> names(x)
NULL
> names(x) <- c("a", "b", "c")
> x
a b c
1 2 3
> names(Probands)
[1] "age" "sex"
> names(Probands) <- c("age", "gender")
> names(Probands)[1] <- c("Age")
Tip: Give explicit names to variables. Names can be used for indexing.
indexing (extracting)
Indexing (> ?Extract) is a great way to
directly access elements of interest.
> # Indexing a vector
> Pain <- c(0,3,2,2,1)
> Pain[1]
[1] 0
> Pain[2]
[1] 3
> Pain[1:2]
[1] 0 3
> Pain[c(1,3)]
[1] 0 2
> Pain[-5]
[1] 0 3 2 2
> # Indexing a matrix
> MyNewMatrix[1,1]
[1] 1
> MyNewMatrix[1,]
            y1
 1  2  3  4  3
> MyNewMatrix[,1]
x1 x2
 1  5
> MyNewMatrix[,-2]
         y1
x1 1 3 4  3
x2 5 7 8  9
> # Indexing a list
> MyFriends[3]
$meta
[1] "London" "School"
> MyFriends[[3]]
[1] "London" "School"
> MyFriends[[3]][1]
[1] "London"
> # Indexing a data frame
> Probands[1,]
  Age gender
1  31      F
> Probands[2,]
  Age gender
2  32      M
indexing by name
Names can also be used
to index an R object.
> MyFriends$age
[1] 31 32 40
> MyFriends["age"]
$age
[1] 31 32 40
> MyFriends[["age"]]
[1] 31 32 40
> Probands["Age"]
  Age
1  31
2  32
3  40
4  50
> Probands[1]
  Age
1  31
2  32
3  40
4  50
> Probands[[1]]
[1] 31 32 40 50
Exercise: Can the results of "[ ]" and "[[ ]]" extractions both be used in
vector operations?
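A sketch that explores this question on a small list. The list is a reduced copy of MyFriends from the earlier slide, and AgeVector, AgeList and Attempt are names introduced here:

```r
MyFriends <- list(age = c(31, 32, 40), meta = c("London", "School"))

# "[[ ]]" extracts the component itself: here a numeric vector
AgeVector <- MyFriends[["age"]]
print(AgeVector + 1)

# "[ ]" returns a list containing that component
AgeList <- MyFriends["age"]
print(class(AgeList))

# Arithmetic on a list fails; try() lets the script continue past the error
Attempt <- try(AgeList + 1, silent = TRUE)
print(inherits(Attempt, "try-error"))
```

Data frames behave differently: a single-bracket extraction from a data frame is itself a data frame, and arithmetic is defined for numeric data frames.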
conditional indexing
Indexing can be conditional on another variable.
> Pain; SevPain
[1] 0 3 2 2 1
[1] none   severe medium medium mild
Levels: none mild medium severe
> Age <- c(45,51,45,32,90)
> Pain[SevPain=="medium" | SevPain=="severe"]
[1] 3 2 2
> Pain[Age>32]
[1] 0 3 2 1
Note: the conditional variable does not have to be part of the same data object.
Exercise: Extract elements for "none" and for Age < 90.
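The mechanics behind conditional indexing can be sketched in a few steps: a comparison yields a logical vector, and that vector selects elements. Older and Selected are names introduced for this example:

```r
Pain <- c(0, 3, 2, 2, 1)
Age <- c(45, 51, 45, 32, 90)

# A comparison produces a logical vector, one TRUE/FALSE per element
Older <- Age > 32
print(Older)

# Indexing with a logical vector keeps the elements where it is TRUE
print(Pain[Older])

# which() converts the logical vector to positions
print(which(Older))

# Conditions combine element-wise with & (and) and | (or)
Selected <- Pain[Age > 32 & Pain >= 2]
print(Selected)
```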
data input
Normally, you would start your R session by reading in some
data to be analysed. This can be done with the read.table
function. Download the sample data to your local directory...
> GvHD <- read.table("GvHD.txt", header=TRUE)
> GvHD[1:10,]
   FSC.Height SSC.Height CD4.FITC CD8.B.PE CD3.PerCP CD8.APC
1         321        199      308      220       157     339
2         303        210      319      271       223     350
3         318        170      215      148       119     221
4         202         49      104       49       284     178
5         353        248      262      167       144     156
6         192         68      423       97       344     113
7         322        225      236      214       141     209
8         350        152      258       82       253     205
9         351        223      286      128       172     220
10        269         78      169      289       224     537
Tip: Alternatively – use the RStudio GUI.
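For readers without the sample file at hand, the sketch below writes a tiny made-up table to a temporary file and reads it back. The column names and values are hypothetical and are not the GvHD data:

```r
# Write a small made-up table to a temporary file
Path <- tempfile(fileext = ".txt")
writeLines(c("Sample Height Weight",
             "A 1.75 60",
             "B 1.80 72",
             "C 1.65 75"), Path)

# header=TRUE uses the first line as column names
Data <- read.table(Path, header = TRUE)

# Quick checks worth running after reading any file
print(dim(Data))       # number of rows and columns
str(Data)              # type of each column
print(head(Data, 2))   # first rows
```

Checking dimensions and column types straight away catches most import problems, such as a header read as data or numbers read as text.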
functions and arguments
Many things in R are done using function calls, commands that
look like an application of a mathematical function to one or
several variables, e.g. log(x), plot(Weight,Height).
When you use plot(Weight, Height) R assumes that the first
argument is the x variable and the second is the y. If you do not
know how to specify the arguments look at ?plot.
Most function arguments have sensible defaults and can thus
be omitted, e.g. plot(Weight, Height,col=1).
If you do not specify the names of the argument, R interprets
them by their default order.
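Positional and named matching can be compared on seq() and log(), two base functions. This is an illustrative sketch; ByPosition and ByName are names introduced here:

```r
# By position: arguments are matched in the order from, to, by (see ?seq)
ByPosition <- seq(2, 10, 4)
print(ByPosition)

# By name: the order no longer matters
ByName <- seq(by = 4, to = 10, from = 2)
print(ByName)

# Arguments with defaults can be omitted: log() uses base e unless told otherwise
print(log(100))
print(log(100, base = 10))
```

Naming arguments beyond the first one or two tends to make calls easier to read and protects against mistakes in argument order.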
libraries
Many contributed functionalities of R are available in R
packages/libraries. Some of these are distributed with R while
others need to be downloaded and installed separately.
> library(survival)
Loading required package: splines
> library(samr)
Error in library(samr) : there is no package called 'samr'
> install.packages("samr")
--- Please select a CRAN mirror for use in this session ---
also installing the dependencies ‘R.methodsS3’, ‘impute’, ‘matrixStats’
trying URL 'http://probability.ca/cran/bin/macosx/leopard/contrib/2.13/R.methodsS3_1.2.1.tgz'
Content type 'application/x-gzip' length 47709 bytes (46 Kb)
opened URL
==================================================
downloaded 46 Kb
[...]
The downloaded packages are in
/var/folders/dq/dqPEEPbFGFWs6MKN40ApRU+++TI/-Tmp-//RtmpNDvKDp/downloaded_packages
> library(samr)
Loading required package: impute
Loading required package: matrixStats
Loading required package: R.methodsS3
R.methodsS3 v1.2.1 (2010-09-18) successfully loaded. See ?R.methodsS3 for help.
matrixStats v0.2.2 (2010-10-06) successfully loaded. See ?matrixStats for help.
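A script can check whether a package is installed before trying to load it. The helper isInstalled() below is a hypothetical name introduced for this sketch; it relies only on requireNamespace() from base R, and the install step is left commented out because it needs network access:

```r
# Returns TRUE if the named package is installed, without attaching it
isInstalled <- function(PackageName) {
    requireNamespace(PackageName, quietly = TRUE)
}

# "stats" ships with R, so this should be TRUE
print(isInstalled("stats"))

# A made-up package name that should not exist
print(isInstalled("notARealPackage12345"))

# Typical pattern (commented out because it downloads from CRAN):
# if (!isInstalled("samr")) install.packages("samr")
# library(samr)
```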
R programming: conditional statements
R is a full-featured programming language.
# if statement
> x <- -2
> if(x>0) {
+   print(x)
+ } else {
+   print(-x)
+ }
[1] 2
>
> if(x>0) {
+   print(x)
+ } else if(x==0) {
+   print(0)
+ } else {
+   print(-x)
+ }
[1] 2
R programming: loops
# for loop
n <- 1000000
x <- rnorm(n,10,1)
y <- x^2          # vectorized: squares every element without a loop
y <- rep(0,n)     # pre-allocate y before filling it in the loop
for (i in 1:n) {
  y[i] <- sqrt(x[i])
}
# while loop
Counter <- 1
while (Counter <= n) {
  y[Counter] <- sqrt(x[Counter])
  Counter <- Counter+1
}
Exercise: Apply sqrt() to x as a vector and compare execution speed.
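One way to make this comparison is with system.time(). The sketch uses a smaller n than the slide so that it runs quickly; YLoop, YVector, LoopTime and VectorTime are names introduced here, and the timings will differ between machines:

```r
n <- 100000                      # smaller than in the slide so the sketch runs quickly
x <- rnorm(n, 10, 1)

# Loop version, timed
LoopTime <- system.time({
    YLoop <- numeric(n)          # pre-allocate the result
    for (i in seq_len(n)) {
        YLoop[i] <- sqrt(x[i])
    }
})

# Vectorized version, timed
VectorTime <- system.time(YVector <- sqrt(x))

print(LoopTime["elapsed"])
print(VectorTime["elapsed"])

# Both approaches give the same result
print(all.equal(YLoop, YVector))
```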
creating your own functions
Function objects can simply be assigned.
Oracle <- function() {
  WiseWords <- c(
    "Joy",
    "Plan",
    "Disappear",
    "Perhaps",
    "Sorrow",
    "Hope",
    "Change"
  )
  n <- sample(WiseWords, 1)
  return(n)
}
> Oracle()
[1] "Disappear"
Exercise: Write a function to return the inverse of a number.
Warn if the input is 0.
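One possible solution sketch is shown below. The function name invert and the wording of the warning are choices made for this example, and returning Inf for an input of 0 simply follows R's own arithmetic:

```r
invert <- function(Number) {
    if (Number == 0) {
        warning("input is 0; returning Inf")
    }
    return(1 / Number)
}

print(invert(4))

# suppressWarnings() is used here only to keep the output quiet
print(suppressWarnings(invert(0)))
```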
creating your own functions
Computing a square root, based on Newton's method.
MySqrt <- function(y) {
  x <- y/2
  while (abs(x*x - y) > 1e-10) {
    x <- (x + y/x)/2
  }
  x
}
Why would we do this? Because now we have the internals of a function
exposed and can manipulate them.
Exercise:
Compare execution speed with sqrt()
Store and return all intermediate values of x to see how the computation converges.
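For the second part of the exercise, a sketch that records every estimate is shown below. The function name traceSqrt and the vector Steps are introduced here; like MySqrt, it is only meant for positive inputs:

```r
traceSqrt <- function(y) {
    x <- y / 2
    Steps <- x                           # record the starting guess
    while (abs(x * x - y) > 1e-10) {
        x <- (x + y / x) / 2             # same update rule as in MySqrt
        Steps <- c(Steps, x)             # append each new estimate
    }
    Steps
}

Estimates <- traceSqrt(2)
print(Estimates)

# Absolute error of each estimate relative to the built-in sqrt()
print(abs(Estimates - sqrt(2)))
```

Printing the errors shows how quickly the estimates approach the built-in result from one step to the next.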