Lab 1:

You need to download R. Go to http://cran.r-project.org/, choose your computer and operating system, and download and install R. The binary versions are the fastest to install. Also install as many of the packages as you can. R will ask for a CRAN site or mirror that is close to you. I always use USA (WA) as my CRAN site.

To get started on the things we will need in lab, also install the Bioconductor packages. YOU NEED TO LOAD BIOCONDUCTOR TO GET SOME OF THE PACKAGES WE WILL NEED. Go to http://www.bioconductor.org/ and download biocLite from the website under Install (current Bioconductor releases install through the BiocManager package instead of biocLite). We will need several packages from this site.

Before we get into actually analyzing genomic/proteomic/RNA-seq data, let us first learn to navigate R. Some simple things to do in R to get acclimatized before we run packaged programs:

1. Data input:
   a. Reading from the keyboard (you physically typing in data)
   b. Reading from an existing file (I will use csv or txt files mostly in this class)
2. Doing descriptive statistics:
   a. Mean
   b. Median
   c. SD
   d. Variance
   e. Quartiles
   f. Confidence intervals
   g. Simple t-test (paired t, pooled t)
   h. Basic regression
   i. Basic ANOVA
3. Simple graphics:
   a. Histogram
   b. Pie charts
   c. Scatter plot

Data: Data in R can be of various types.

1. Vectors: one-dimensional data set. Example:
      x1=c(1,2,3,5,7)
      x2=c(2,7,8,4,1)
2. Matrix: two-dimensional data, has rows and columns. Example:
      m1=cbind(x1,x2)
   OR
      m2=matrix(c(1,2,3,5,7,2,7,8,4,1),nrow=5,ncol=2)
3. Data frames (the one we will use the most): a data set whose columns can hold different kinds of elements.
      mydata=data.frame(x1,x2,m1,m2)
4. Arrays: collection of matrices, multidimensional matrices.
5. List: a collection of data frames, matrices, vectors etc. The elements can be different sizes and not all numerical.

Inputting data from a file: I will show you how to read from a csv file (which will be common for us). Reading data from a csv file: put the data in the correct directory and make sure that is your working directory.
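Before reading a file it can help to confirm where R is looking. The following is only a minimal sketch: the file name demo_class.csv and its three rows are made up for illustration, and it uses a temporary directory rather than the class folder.

```r
# Hypothetical file and values, used only to illustrate the working directory.
demo_dir <- tempdir()
old_wd <- setwd(demo_dir)   # setwd() returns the previous working directory

# Write a tiny csv file into the new working directory.
demo <- data.frame(id = 1:3, height = c(62, 69, 71), age = c(18, NA, 20))
write.csv(demo, "demo_class.csv", row.names = FALSE)

getwd()                          # the directory R reads from by default
file.exists("demo_class.csv")    # TRUE if the file is where R expects it

# Read it back, using the id column as row names.
demo_read <- read.csv("demo_class.csv", header = TRUE, row.names = "id")
str(demo_read)

# Clean up and return to the original working directory.
unlink("demo_class.csv")
setwd(old_wd)
```

If file.exists() returns FALSE for your own file, the file is not in the working directory, and read.table() or read.csv() will fail with a "cannot open file" error.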
I created a directory called myRfolder on my C drive and will try to use it as my default directory. I put in a data set called classdata.csv (class data on the students' height, age, shoe size, eye color and hair color). Use the commands:

    > setwd("/myRfolder")
    > mydata<-read.table("classdata.csv",header=TRUE,sep=",",row.names="id",na.strings=" ")
    > mydata
       height age shoe eyecolor   hair
    1      62  18  6.5    brown blonde
    2      69  19 11.0    hazel auburn
    3      71  20 13.0    brown  brown
    4      69  20  6.5    hazel blonde
    5      66  20  8.0
    6      63  18  7.5    brown  brown
    7      66  20  9.5     blue blonde
    8      65  20  8.5     blue blonde
    9      67  19  9.5    brown  brown
    10     67  23  8.5    brown  brown
    11     67  19  9.5    green  brown
    12     69  NA  8.5    hazel blonde
    13     66  20 10.0
    14     70  20  9.5    hazel  brown
    15     69  20  9.5    hazel  brown
    16     72  20 11.0     blue blonde
    17     70  19 10.5     blue  brown
    18     71  24 12.5     blue  brown
    19     63  19  8.0    green blonde
    20     65  44  7.0
    21     68  22  9.5    hazel  brown
    22     69  20 10.0    hazel  brown
    23     70  35 10.0    brown  brown
    24     70  22 10.0     blue  brown
    25     64  19  7.5     blue  brown
    blue blonde blue blonde blue brown

The data is saved in R as a data frame, mydata. It is a temporary data set and will be lost when we quit R unless the workspace is saved. A data frame can be indexed like a matrix, so to get a summary measure for height we can refer to it as mydata[ ,1] (the first column of the data) or use the $ sign: mydata$height.

Simple summaries:

    > summary(mydata[,1])
       Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
      62.00   66.00   68.00   67.52   70.00   72.00
    > summary(mydata$height)
       Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
      62.00   66.00   68.00   67.52   70.00   72.00
    > mean(mydata$height)
    [1] 67.52
    > sd(mydata$height)
    [1] 2.77068

SIMPLE GRAPHICS

    > barplot(table(mydata$hair))
    > pie(table(mydata$hair))
    > hist(mydata$shoe)
    > plot(mydata$shoe,mydata$height)
    > table(mydata$hair,mydata$eyecolor)

For a normal Q-Q plot, qqnorm() from the stats package (loaded by default) is enough; the car package needs to be installed only if we want its qqPlot() function.
    > library(stats)
    > qqnorm(mydata$shoe)
    > t.test(mydata$height,mydata$shoe)

            Welch Two Sample t-test

    data:  mydata$height and mydata$shoe
    t = 90.2019, df = 39.25, p-value < 2.2e-16
    alternative hypothesis: true difference in means is not equal to 0
    95 percent confidence interval:
     56.95384 59.56616
    sample estimates:
    mean of x mean of y
        67.52      9.26

A one-sample t test of the mean shoe size against 8:

    > t.test(mydata$shoe, mu=8)

The call for t tests:

    t.test(x, y = NULL, alternative = c("two.sided", "less", "greater"),
           mu = 0, paired = FALSE, var.equal = FALSE, conf.level = 0.95, ...)

The nonparametric counterpart (note the lowercase w; R is case-sensitive):

    > wilcox.test(mydata$shoe,mu=8)

Regression:

    > mymodel=lm(height~age,data=mydata)
    > class(mymodel)
    > summary(mymodel)

ANOVA:

    ## Annette Dobson (1990) "An Introduction to Generalized Linear Models".
    ## Page 9: Plant Weight Data.
    ctl <- c(4.17,5.58,5.18,6.11,4.50,4.61,5.17,4.53,5.33,5.14)
    trt <- c(4.81,4.17,4.41,3.59,5.87,3.83,6.03,4.89,4.32,4.69)
    group <- gl(2, 10, 20, labels = c("Ctl","Trt"))
    weight <- c(ctl, trt)
    lm.D9 <- lm(weight ~ group)
    anova(lm.D9)
    summary(lm.D9)
    opar <- par(mfrow = c(2,2), oma = c(0, 0, 1.1, 0))
    plot(lm.D9, las = 1)      # Residuals, Fitted, ...
    par(opar)

The function gl() used above generates factors by specifying the pattern of their levels. Usage:

    gl(n, k, length = n*k, labels = 1:n, ordered = FALSE)

R code:

    setwd("/myRfolder")
    mydata<-read.table("classdata.csv",header=TRUE,sep=",",row.names="id",na.strings=" ")
    mydata
    summary(mydata[,1])
    summary(mydata$height)
    mean(mydata$height)
    sd(mydata$height)
    barplot(table(mydata$hair))
    pie(table(mydata$hair))
    hist(mydata$shoe)
    plot(mydata$shoe,mydata$height)
    table(mydata$hair,mydata$eyecolor)
    library(stats)
    qqnorm(mydata$shoe)
    t.test(mydata$height,mydata$shoe)
    #Regression:
    mymodel=lm(height~age,data=mydata)
    class(mymodel)
    summary(mymodel)
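The outline for this lab also lists variance, quartiles, confidence intervals, and the paired and pooled t tests, which the script does not call directly. The following is a minimal sketch of those calls; the vectors before and after are made-up numbers, not the class data.

```r
# Hypothetical measurements on the same 8 subjects (illustration only).
before <- c(62, 69, 71, 69, 66, 63, 66, 65)
after  <- c(63, 70, 70, 71, 68, 63, 67, 67)

var(before)        # sample variance (equals sd(before)^2)
quantile(before)   # minimum, quartiles, and maximum

# 95% confidence interval for the mean, taken from a one-sample t test.
ci <- t.test(before, conf.level = 0.95)$conf.int
ci

# Paired t test: works on the within-subject differences.
paired_fit <- t.test(before, after, paired = TRUE)

# Pooled t test: treats the samples as independent with equal variances.
pooled_fit <- t.test(before, after, var.equal = TRUE)

paired_fit$parameter   # degrees of freedom: n - 1
pooled_fit$parameter   # degrees of freedom: n1 + n2 - 2
```

Leaving both paired and var.equal at their default of FALSE gives the Welch test, which is what t.test(mydata$height,mydata$shoe) reports in the output shown earlier.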